Hi folks! We’ve been playing with the DataHub quickstart and it’s great.
We have a question about infrastructure for a full deployment: is it theoretically possible to bypass Kafka? We’re wondering if it’s mainly there to enable push-based metadata changes, and so, if we were willing to lose those features, whether we could replace it with a cheaper/simpler system. Or is it quite deeply embedded, so that swapping it out would require a lot of changes?
DataHub requires Kafka to operate. Kafka serves as a durable log in two roles: it stores inbound requests to update the Metadata Graph (Metadata Change Proposal, or MCP), and it records the changes that have been made to the Metadata Graph (Metadata Change Log, or MCL). https://datahubproject.io/docs/how/kafka-config/
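If it helps make the MCL side concrete, here’s a minimal sketch of a downstream consumer tailing the change log; DataHub’s own indexing consumers work roughly along these lines. The broker address and group id are placeholders, and real payloads are Avro-encoded against the schema registry, so this only shows raw records arriving:

```python
# Minimal sketch (untested assumption): tail DataHub's Metadata Change Log
# topic to watch graph updates flow through Kafka. The topic name
# MetadataChangeLog_Versioned_v1 is the default in recent releases; broker
# address and group id are placeholders. Payloads are Avro-encoded via the
# schema registry, so this prints raw bytes rather than decoded records.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "mcl-tail-demo",            # placeholder group id
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["MetadataChangeLog_Versioned_v1"])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # Each record represents one change applied to the Metadata Graph.
        print(f"offset={msg.offset()} key={msg.key()} {len(msg.value())} bytes")
finally:
    consumer.close()
```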
To clarify, we realise the system uses Kafka. But how Kafka-specific is the way it’s used? Would trying to replace it be a huge piece of work, or something relatively doable once we got started?
I realise this kind of question is vague and somewhat opinion-based, but any thoughts would be great.
So the MCP side can be skipped by disabling the default async ingestion option and only ingesting synchronously through the REST API, albeit with a significant loss in the ability to scale; there’s a rough sketch of that path below. The MCL side does not really have a way to be skipped. You could possibly deploy with something like RedPanda, which still speaks the Kafka APIs but is simpler from an infrastructure perspective; however, we have not tested any of these options. Generally we don’t recommend trying to bypass Kafka in your deployment, because once you move to some of the simpler options you lose the strong guarantees Kafka provides.
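For the synchronous path, here’s a minimal sketch using DataHub’s Python REST emitter, which posts an MCP straight to GMS instead of going through the Kafka topic. The GMS address and the example urn/description are placeholders, and exact signatures can vary between SDK versions:

```python
# Minimal sketch: emit a Metadata Change Proposal over REST, bypassing the
# MCP Kafka topic. GMS address and the example urn/description are
# placeholders; check your SDK version for exact signatures.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:hive,demo.example_table,PROD)",
    aspect=DatasetPropertiesClass(description="Synchronously ingested example"),
)

# emit() sends the proposal over HTTP and raises on error, so failures
# surface at ingest time; the trade-off is you give up Kafka's buffering
# and async throughput.
emitter.emit(mcp)
```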