Is Bypassing Kafka in Datahub Deployment Feasible or Complicated?

Original Slack Thread

Hi folks! We’ve been playing with the Datahub quickstart and it’s great :muscle:

We have a question about infrastructure for a full deployment: is it theoretically possible to bypass Kafka? We’re wondering if it’s mainly to enable push-based metadata changes - and so if we were willing to lose those features could we replace it with a cheaper/simpler system? Or is it quite deeply embedded and would require a lot of changes?

DataHub requires Kafka to operate. Kafka is used as a durable log that stores inbound requests to update the Metadata Graph (Metadata Change Proposal), and as a change log detailing the updates that have been made to the Metadata Graph (Metadata Change Log).

Yes, I read that in the docs - thank you though.

To clarify, we realise the system uses Kafka. But how Kafka-specific is the way it’s used - would trying to replace it be a huge piece of work, or something relatively doable once we got started?

I realise this kind of question is difficult and vague and opinion-based, but any thoughts would be great

<@UV5UEC3LN> <@U01GCJKA8P9> for their take

So the MCP side can be skipped by disabling the default async ingestion option and ingesting only synchronously through the API, albeit with a significant loss in ability to scale. The MCL side, however, does not really have a way to be skipped. It’s possible you could deploy with something like Redpanda, which still works with Kafka APIs but is simpler from an infrastructure perspective - though we have not tested any of these options. Generally we don’t recommend trying to bypass Kafka in your deployment: once you get into some of the simpler options, you lose the strong guarantees Kafka provides.
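For context on what “ingesting directly through the API” looks like in practice, an ingestion recipe can use the `datahub-rest` sink rather than `datahub-kafka`, so metadata writes go synchronously to GMS over HTTP instead of being produced as MCPs to Kafka. A rough sketch - the source block and option names here are illustrative, so check the sink documentation for your DataHub version:

```yaml
# Illustrative ingestion recipe: push metadata synchronously over REST
# rather than producing MCPs to Kafka. Host names and source config
# below are placeholders, not values from this thread.
source:
  type: postgres            # any supported source works here
  config:
    host_port: localhost:5432
    database: analytics
sink:
  type: datahub-rest        # synchronous REST writes to GMS
  config:
    server: http://datahub-gms:8080
```

Note this only affects the ingestion (MCP) path - GMS still emits Metadata Change Log events to Kafka internally, which is the part described above as not skippable.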