Understanding DataHub's Architecture and Key Components

Original Slack Thread

<@U06TM7M40RK> can you please give me an overview of DataHub's architecture?

DataHub’s architecture is designed to support data discovery, collaboration, governance, and observability within the modern data stack. Here are the key components and highlights of its architecture:

  1. Schema-first Metadata Modeling: DataHub's metadata model is defined schema-first in a serialization-agnostic language. Metadata is exposed through both REST and GraphQL APIs, plus an Avro-based API over Kafka for communicating and subscribing to metadata changes. This lets different tools and systems interoperate (<https://datahubproject.io/docs/0.13.1/architecture/architecture|DataHub Architecture Overview>).

  2. Stream-based Real-time Metadata Platform: The platform is stream-oriented, so metadata changes are communicated and reflected across the platform within seconds. This makes it possible to build systems that react to metadata changes as they happen, such as access control systems (<https://datahubproject.io/docs/0.13.1/architecture/architecture|DataHub Architecture Overview>).

  3. Federated Metadata Serving: DataHub supports federated metadata services, allowing different teams to own and operate their metadata services. These services communicate with a central search index and graph using Kafka, supporting global search and discovery while enabling decoupled ownership of metadata (<https://datahubproject.io/docs/0.13.1/architecture/architecture|DataHub Architecture Overview>).

  4. Metadata Storage: The Metadata Service persists metadata in a primary document store, typically a relational database such as MySQL or Postgres, or a NoSQL store such as Cassandra. The service exposes REST and GraphQL APIs for CRUD operations on metadata, as well as search and graph query APIs (<https://datahubproject.io/docs/0.14.0/architecture/metadata-serving|DataHub Serving Architecture>).

  5. Metadata Change Log Stream (MCL): Once a metadata change has been committed to the primary store, the Metadata Service emits a Metadata Change Log event over Kafka. External systems can subscribe to this stream and react to metadata changes in real time (<https://datahubproject.io/docs/0.14.0/architecture/metadata-serving|DataHub Serving Architecture>).

  6. Metadata Index Applier: A separate job (mae-consumer-job) consumes the Metadata Change Log events and applies the changes to the graph and search index. The events are keyed by entity URN, so all changes to a given entity are processed sequentially and in the correct chronological order (<https://datahubproject.io/docs/0.14.0/architecture/metadata-serving|DataHub Serving Architecture>).
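
To make items 4 to 6 above more concrete, here is a minimal in-memory sketch of the write path: commit to the primary store, emit a change log event, then have a separate applier update the search index. This is only an illustration under stated assumptions, not DataHub's actual implementation or API. The names `ToyMetadataService`, `ToyIndexApplier`, and `ChangeLogEvent` are hypothetical, the deques stand in for Kafka partitions, and the example URN and aspect name are just sample values.

```python
import zlib
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Any, Deque, Dict, List, Tuple


@dataclass(frozen=True)
class ChangeLogEvent:
    """Hypothetical stand-in for a Metadata Change Log event."""
    urn: str
    aspect: str
    value: Any
    version: int


class ToyMetadataService:
    """Toy model of the serving tier: primary store plus an event stream."""

    def __init__(self, num_partitions: int = 2) -> None:
        # Primary store, keyed by (entity URN, aspect name).
        self.primary_store: Dict[Tuple[str, str], Any] = {}
        self.versions: Dict[Tuple[str, str], int] = defaultdict(int)
        # Deques stand in for Kafka partitions (assumption for illustration).
        self.partitions: List[Deque[ChangeLogEvent]] = [
            deque() for _ in range(num_partitions)
        ]

    def _partition_for(self, urn: str) -> int:
        # Stable hash: the same URN always lands in the same partition,
        # so events for one entity keep their relative order.
        return zlib.crc32(urn.encode("utf-8")) % len(self.partitions)

    def write_aspect(self, urn: str, aspect: str, value: Any) -> ChangeLogEvent:
        key = (urn, aspect)
        # Step 1: commit the change to the primary store.
        self.primary_store[key] = value
        self.versions[key] += 1
        # Step 2: only after the commit, emit the change log event.
        event = ChangeLogEvent(urn, aspect, value, self.versions[key])
        self.partitions[self._partition_for(urn)].append(event)
        return event


class ToyIndexApplier:
    """Toy model of the job that applies change log events to an index."""

    def __init__(self) -> None:
        self.search_index: Dict[str, Dict[str, Any]] = defaultdict(dict)

    def drain(self, service: ToyMetadataService) -> int:
        """Apply all pending events, oldest first within each partition."""
        applied = 0
        for partition in service.partitions:
            while partition:
                event = partition.popleft()
                self.search_index[event.urn][event.aspect] = event.value
                applied += 1
        return applied


if __name__ == "__main__":
    service = ToyMetadataService()
    applier = ToyIndexApplier()
    urn = "urn:li:dataset:(urn:li:dataPlatform:hive,example.table,PROD)"
    service.write_aspect(urn, "datasetProperties", {"description": "first"})
    service.write_aspect(urn, "datasetProperties", {"description": "second"})
    applier.drain(service)
    print(applier.search_index[urn])
```

Because both writes use the same URN, they go to the same partition and are applied in order, so the index ends up with the later value. Note that the index only catches up once the applier runs, which mirrors the idea that the search index and graph are updated asynchronously from the primary store.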

Would you like more detailed information on any specific aspect of DataHub’s architecture?

