DataHub GMS stands for Generalized Metadata Service. It is a core component of the DataHub architecture responsible for managing and serving metadata. Here’s a detailed explanation of its role and functionality:
Generalized Metadata Service (GMS)
Metadata Management: GMS is responsible for the ingestion, storage, and retrieval of metadata. It provides APIs for creating, reading, updating, and deleting metadata entities and their aspects.
APIs:
GraphQL API: GMS exposes a GraphQL API that serves as the primary public API for interacting with the metadata. This API allows for flexible and efficient querying and mutation of metadata.
Rest.li API: GMS also provides a Rest.li API, which is used for system-specific ingestion of metadata. This API is considered internal and is primarily consumed by other DataHub components.
Storage Backend: GMS interacts with various storage backends, including relational databases (e.g., MySQL, PostgreSQL), search indices (e.g., Elasticsearch), and optionally graph databases (e.g., Neo4j).
Metadata Change Log (MCL): GMS emits commit events to Kafka when metadata changes are committed. These events can be consumed by other systems for real-time processing and reactions.
Metadata Change Proposal (MCP): GMS can ingest metadata via Metadata Change Proposals, which are processed and stored in the relational database.
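As an illustration of the GraphQL API described above, the sketch below builds the JSON payload for a simple dataset search query. The endpoint URL and the exact query shape are assumptions for the example (field names can differ across DataHub versions), so treat this as a sketch rather than a definitive client:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

public class GmsGraphQlExample {

    // Builds the JSON body for a GraphQL search query.
    // The query shape loosely follows DataHub's `search` query;
    // field names may differ across versions, so treat this as illustrative.
    public static String buildSearchPayload(String text) {
        String graphql =
            "query { search(input: {type: DATASET, query: \"" + text
          + "\", start: 0, count: 10}) "
          + "{ total searchResults { entity { urn } } } }";
        // Minimal JSON escaping for the embedded query string
        return "{\"query\": \"" + graphql.replace("\"", "\\\"") + "\"}";
    }

    public static void main(String[] args) {
        String payload = buildSearchPayload("sales");
        System.out.println(payload);

        // Sending the request (assumes GMS at localhost:8080; uncomment to use):
        // HttpRequest request = HttpRequest.newBuilder()
        //     .uri(URI.create("http://localhost:8080/api/graphql"))
        //     .header("Content-Type", "application/json")
        //     .POST(HttpRequest.BodyPublishers.ofString(payload))
        //     .build();
        // HttpClient.newHttpClient().send(request,
        //     java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```

In a real deployment you would also attach an authentication token header; consult your instance's configuration for the correct endpoint and credentials.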
Architecturally, GMS sits at the center of DataHub's Serving Tier; a high-level system diagram of this tier is available in the DataHub architecture documentation.
Metadata Audit Event (MAE)
A Metadata Audit Event (MAE) is a type of event in DataHub that captures changes made to the metadata aspects associated with a particular entity. It provides a way to track and audit changes to metadata over time. Here’s a detailed explanation:
Purpose: MAEs are used to capture and audit changes to metadata. They provide a before-and-after snapshot of the metadata, allowing for a complete history of changes.
Structure: An MAE contains the following key components:
Old Snapshot: The state of the metadata before the change.
New Snapshot: The state of the metadata after the change.
Change Metadata: Information about the change, such as the actor who made the change and the timestamp.
Emission: MAEs are emitted whenever a metadata change is committed to DataHub’s storage layer. This ensures that any listener of MAEs can construct a complete view of the latest state for all aspects.
Deprecation: As of mid-2022, DataHub no longer actively emits MAEs; they have been replaced by Metadata Change Logs (MCLs), which provide a more streamlined and efficient way to track metadata changes.
Metadata Audit Event Consumer (MAE Consumer)
The MAE Consumer is the DataHub component responsible for consuming and processing these change events. Here’s a detailed explanation:
Function: The primary function of the MAE Consumer is to listen to change log events emitted as a result of changes made to the metadata graph. It converts these changes into updates against secondary search and graph indexes.
Kafka Topics: The MAE Consumer consumes from two important Kafka topics:
MetadataChangeLog_Versioned_v1
MetadataChangeLog_Timeseries_v1
Processing: The MAE Consumer processes the MAEs and applies the changes to the search index (e.g., Elasticsearch) and the graph index (if applicable). This ensures that the metadata changes are reflected in the search and graph indices.
Deployment: The MAE Consumer can be deployed as a standalone Spring job or as part of the Metadata Service. It requires backend services like Kafka and Elasticsearch to be up and running.
Endpoints: The MAE Consumer exposes endpoints for health checks, metrics, and operational information.
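The fan-out to secondary indexes can be pictured as one consumer driving several index writers. The interface and class names below are illustrative (DataHub's actual implementation uses its own internal service classes), but the shape of the processing is the same:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexFanOutSketch {

    // Each secondary index gets its own writer.
    interface IndexUpdater {
        void apply(String changeLogJson);
    }

    static class SearchIndexUpdater implements IndexUpdater {
        final List<String> applied = new ArrayList<>();
        public void apply(String changeLogJson) {
            // In DataHub this would translate the change into search index
            // (e.g., Elasticsearch) writes.
            applied.add(changeLogJson);
        }
    }

    static class GraphIndexUpdater implements IndexUpdater {
        final List<String> applied = new ArrayList<>();
        public void apply(String changeLogJson) {
            // In DataHub this would update relationship/lineage edges
            // in the graph index (e.g., Neo4j), if one is configured.
            applied.add(changeLogJson);
        }
    }

    public static void main(String[] args) {
        List<IndexUpdater> updaters =
            List.of(new SearchIndexUpdater(), new GraphIndexUpdater());
        String event = "{\"entityUrn\": \"urn:li:dataset:example\"}";
        // The consumer applies every change log event to every index,
        // keeping the secondary indexes in sync with the primary store.
        for (IndexUpdater u : updaters) {
            u.apply(event);
        }
    }
}
```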
The Metadata Audit Event (MAE) modules in DataHub are primarily implemented in Java, using the Spring framework. Here’s a detailed explanation of their implementation:
Implementation Details
Language and Framework: The MAE modules are implemented in Java, leveraging the Spring framework for dependency injection, configuration, and other enterprise features.
Kafka Integration: The MAE Consumer is designed to consume events from Kafka topics. It uses Kafka clients to subscribe to the relevant topics (MetadataChangeLog_Versioned_v1 and MetadataChangeLog_Timeseries_v1) and process the events.
Spring Boot Application: The MAE Consumer is a Spring Boot application, which provides a convenient way to package and run the application. It includes built-in support for health checks, metrics, and other operational features.
Processing Logic: The core processing logic of the MAE Consumer involves:
Consuming events from Kafka.
Parsing the events to extract metadata changes.
Applying the changes to the search index (e.g., Elasticsearch) and the graph index (if applicable).
Configuration: The MAE Consumer can be configured using Spring’s configuration mechanisms, such as application properties or YAML files. This allows for flexible configuration of Kafka topics, Elasticsearch endpoints, and other settings.
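As an illustration, a standalone Spring Boot consumer could be pointed at its backends with properties like the following. The `spring.kafka.*` keys are standard Spring Boot properties; the `elasticsearch.*` keys are placeholders for whatever names a given deployment uses, not DataHub's actual configuration keys:

```yaml
spring:
  kafka:
    bootstrap-servers: localhost:9092   # Kafka broker(s) to consume from
    consumer:
      group-id: mae-consumer-group      # consumer group for the MAE Consumer

# Illustrative placeholder keys for the search backend; DataHub's real
# configuration names differ -- consult the deployment documentation.
elasticsearch:
  host: localhost
  port: 9200
```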
Example Code Snippets
Here are some example code snippets to illustrate the implementation of the MAE Consumer:
Kafka Consumer Configuration
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

@Service
public class MetadataChangeLogConsumer {

    @KafkaListener(topics = "MetadataChangeLog_Versioned_v1", groupId = "mae-consumer-group")
    public void consumeVersionedLog(String message) {
        // Process the versioned metadata change log
    }

    @KafkaListener(topics = "MetadataChangeLog_Timeseries_v1", groupId = "mae-consumer-group")
    public void consumeTimeseriesLog(String message) {
        // Process the timeseries metadata change log
    }
}
Processing Logic
import com.linkedin.metadata.models.MetadataChangeLog;
import org.springframework.stereotype.Service;

@Service
public class MetadataChangeLogProcessor {

    public void processMetadataChangeLog(String message) {
        // Parse the message to extract a MetadataChangeLog
        MetadataChangeLog changeLog = parseMessage(message);
        // Apply changes to the search index
        updateSearchIndex(changeLog);
        // Apply changes to the graph index (if applicable)
        updateGraphIndex(changeLog);
    }

    private MetadataChangeLog parseMessage(String message) {
        // Implement parsing logic (e.g., deserialize the Kafka message payload)
        return new MetadataChangeLog();
    }

    private void updateSearchIndex(MetadataChangeLog changeLog) {
        // Implement search index update logic (e.g., write to Elasticsearch)
    }

    private void updateGraphIndex(MetadataChangeLog changeLog) {
        // Implement graph index update logic (e.g., write to Neo4j)
    }
}
The Metadata Audit Event (MAE) Consumer and related components are part of the DataHub project on GitHub. They are primarily located in the metadata-jobs directory. Here’s a detailed breakdown of where you can find the relevant code:
Location in GitHub Repository
MAE Consumer Job: The MAE Consumer Job is located in the metadata-jobs directory. This directory contains the implementation of the MAE Consumer, including its configuration, processing logic, and integration with Kafka and Elasticsearch.
Relevant Directories and Files:
metadata-jobs/mae-consumer-job: This directory contains the main implementation of the MAE Consumer Job.
metadata-jobs/mae-consumer-job/src/main/java/com/linkedin/metadata/kafka/MetadataChangeLogProcessor.java: This file contains the Java class responsible for consuming and processing Metadata Change Log events.
Example Directory Structure
Here is a simplified directory structure to help you locate the relevant files:

metadata-jobs/
└── mae-consumer-job/
    └── src/
        └── main/
            └── java/
                └── com/
                    └── linkedin/
                        └── metadata/
                            └── kafka/        # Kafka consumer and processing classes
DataHub GitHub Repository (https://github.com/datahub-project/datahub): The source code for the MAE Consumer and other DataHub components can be found in the DataHub GitHub repository.