DataHub GMS stands for Generalized Metadata Service. It is a core component of the DataHub architecture responsible for managing and serving metadata. Here’s a detailed explanation of its role and functionality:
Generalized Metadata Service (GMS)
Metadata Management: GMS is responsible for the ingestion, storage, and retrieval of metadata. It provides APIs for creating, reading, updating, and deleting metadata entities and their aspects.
APIs:
GraphQL API: GMS exposes a GraphQL API that serves as the primary public API for interacting with the metadata. This API allows for flexible and efficient querying and mutation of metadata.
Rest.li API: GMS also provides a Rest.li API, which is used for system-specific ingestion of metadata. This API is considered internal and is primarily consumed by other DataHub components.
Storage Backend: GMS interacts with various storage backends, including relational databases (e.g., MySQL, PostgreSQL), search indices (e.g., Elasticsearch), and optionally graph databases (e.g., Neo4j).
Metadata Change Log (MCL): GMS emits commit events to Kafka when metadata changes are committed. These events can be consumed by other systems for real-time processing and reactions.
Metadata Change Proposal (MCP): GMS can ingest metadata via Metadata Change Proposals, which are processed and stored in the relational database.
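As an illustration of the GraphQL API described above, the sketch below builds the JSON payload for a simple dataset search query. The endpoint URL and the exact query shape are assumptions for the example (field names can differ across DataHub versions), so treat this as a sketch rather than a definitive client:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;

public class GmsGraphQlExample {

    // Builds the JSON body for a GraphQL search query.
    // The query shape loosely follows DataHub's `search` query;
    // field names may differ across versions, so treat this as illustrative.
    public static String buildSearchPayload(String text) {
        String graphql =
            "query { search(input: {type: DATASET, query: \"" + text
          + "\", start: 0, count: 10}) "
          + "{ total searchResults { entity { urn } } } }";
        // Minimal JSON escaping for the embedded query string
        return "{\"query\": \"" + graphql.replace("\"", "\\\"") + "\"}";
    }

    public static void main(String[] args) {
        String payload = buildSearchPayload("sales");
        System.out.println(payload);

        // Sending the request (assumes GMS at localhost:8080; uncomment to use):
        // HttpRequest request = HttpRequest.newBuilder()
        //     .uri(URI.create("http://localhost:8080/api/graphql"))
        //     .header("Content-Type", "application/json")
        //     .POST(HttpRequest.BodyPublishers.ofString(payload))
        //     .build();
        // HttpClient.newHttpClient().send(request,
        //     java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```

In a real deployment you would also attach an authentication token header; consult your instance's configuration for the correct endpoint and credentials.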
Architecturally, GMS sits at the center of DataHub's Serving Tier; a high-level system diagram of this tier is available in the DataHub architecture documentation.
Metadata Audit Event (MAE)
A Metadata Audit Event (MAE) is a type of event in DataHub that captures changes made to the metadata aspects associated with a particular entity. It provides a way to track and audit changes to metadata over time. Here’s a detailed explanation:
Purpose: MAEs are used to capture and audit changes to metadata. They provide a before-and-after snapshot of the metadata, allowing for a complete history of changes.
Structure: An MAE contains the following key components:
Old Snapshot: The state of the metadata before the change.
New Snapshot: The state of the metadata after the change.
Change Metadata: Information about the change, such as the actor who made the change and the timestamp.
Emission: MAEs are emitted whenever a metadata change is committed to DataHub’s storage layer. This ensures that any listener of MAEs can construct a complete view of the latest state for all aspects.
Deprecation: As of mid-2022, DataHub no longer actively emits MAEs; they have been replaced by Metadata Change Logs (MCLs), which provide a more streamlined and efficient way to track metadata changes.
Metadata Audit Event Consumer (MAE Consumer)
The MAE Consumer is the DataHub component responsible for consuming and processing these change events. Here’s a detailed explanation:
Function: The primary function of the MAE Consumer is to listen to change log events emitted as a result of changes made to the metadata graph. It converts these changes into updates against secondary search and graph indexes.
Kafka Topics: The MAE Consumer consumes from two important Kafka topics:
MetadataChangeLog_Versioned_v1
MetadataChangeLog_Timeseries_v1
Processing: The MAE Consumer processes the MAEs and applies the changes to the search index (e.g., Elasticsearch) and the graph index (if applicable). This ensures that the metadata changes are reflected in the search and graph indices.
Deployment: The MAE Consumer can be deployed as a standalone Spring job or as part of the Metadata Service. It requires backend services like Kafka and Elasticsearch to be up and running.
Endpoints: The MAE Consumer exposes endpoints for health checks, metrics, and operational information.
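The fan-out to secondary indexes can be pictured as one consumer driving several index writers. The interface and class names below are illustrative (DataHub's actual implementation uses its own internal service classes), but the shape of the processing is the same:

```java
import java.util.ArrayList;
import java.util.List;

public class IndexFanOutSketch {

    // Each secondary index gets its own writer.
    interface IndexUpdater {
        void apply(String changeLogJson);
    }

    static class SearchIndexUpdater implements IndexUpdater {
        final List<String> applied = new ArrayList<>();
        public void apply(String changeLogJson) {
            // In DataHub this would translate the change into search index
            // (e.g., Elasticsearch) writes.
            applied.add(changeLogJson);
        }
    }

    static class GraphIndexUpdater implements IndexUpdater {
        final List<String> applied = new ArrayList<>();
        public void apply(String changeLogJson) {
            // In DataHub this would update relationship/lineage edges
            // in the graph index (e.g., Neo4j), if one is configured.
            applied.add(changeLogJson);
        }
    }

    public static void main(String[] args) {
        List<IndexUpdater> updaters =
            List.of(new SearchIndexUpdater(), new GraphIndexUpdater());
        String event = "{\"entityUrn\": \"urn:li:dataset:example\"}";
        // The consumer applies every change log event to every index,
        // keeping the secondary indexes in sync with the primary store.
        for (IndexUpdater u : updaters) {
            u.apply(event);
        }
    }
}
```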
The Metadata Audit Event (MAE) modules in DataHub are primarily implemented in Java, using the Spring framework. Here’s a detailed explanation of their implementation:
Implementation Details
Language and Framework: The MAE modules are implemented in Java, leveraging the Spring framework for dependency injection, configuration, and other enterprise features.
Kafka Integration: The MAE Consumer is designed to consume events from Kafka topics. It uses Kafka clients to subscribe to the relevant topics (MetadataChangeLog_Versioned_v1 and MetadataChangeLog_Timeseries_v1) and process the events.
Spring Boot Application: The MAE Consumer is a Spring Boot application, which provides a convenient way to package and run the application. It includes built-in support for health checks, metrics, and other operational features.
Processing Logic: The core processing logic of the MAE Consumer involves:
Consuming events from Kafka.
Parsing the events to extract metadata changes.
Applying the changes to the search index (e.g., Elasticsearch) and the graph index (if applicable).
Configuration: The MAE Consumer can be configured using Spring’s configuration mechanisms, such as application properties or YAML files. This allows for flexible configuration of Kafka topics, Elasticsearch endpoints, and other settings.
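As an illustration, a standalone Spring Boot consumer could be pointed at its backends with properties like the following. The `spring.kafka.*` keys are standard Spring Boot properties; the `elasticsearch.*` keys are placeholders for whatever names a given deployment uses, not DataHub's actual configuration keys:

```yaml
spring:
  kafka:
    bootstrap-servers: localhost:9092   # Kafka broker(s) to consume from
    consumer:
      group-id: mae-consumer-group      # consumer group for the MAE Consumer

# Illustrative placeholder keys for the search backend; DataHub's real
# configuration names differ -- consult the deployment documentation.
elasticsearch:
  host: localhost
  port: 9200
```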
Example Code Snippets
Here are some example code snippets to illustrate the implementation of the MAE Consumer:
Kafka Consumer Configuration
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Service;

@Service
public class MetadataChangeLogConsumer {

    @KafkaListener(topics = "MetadataChangeLog_Versioned_v1", groupId = "mae-consumer-group")
    public void consumeVersionedLog(String message) {
        // Process the versioned metadata change log
    }

    @KafkaListener(topics = "MetadataChangeLog_Timeseries_v1", groupId = "mae-consumer-group")
    public void consumeTimeseriesLog(String message) {
        // Process the timeseries metadata change log
    }
}
Processing Logic
import com.linkedin.metadata.models.MetadataChangeLog;
import org.springframework.stereotype.Service;

@Service
public class MetadataChangeLogProcessor {

    public void processMetadataChangeLog(String message) {
        // Parse the message to extract a MetadataChangeLog
        MetadataChangeLog changeLog = parseMessage(message);
        // Apply changes to the search index
        updateSearchIndex(changeLog);
        // Apply changes to the graph index (if applicable)
        updateGraphIndex(changeLog);
    }

    private MetadataChangeLog parseMessage(String message) {
        // Implement parsing logic (e.g., deserialize the Kafka message payload)
        return new MetadataChangeLog();
    }

    private void updateSearchIndex(MetadataChangeLog changeLog) {
        // Implement search index update logic (e.g., write to Elasticsearch)
    }

    private void updateGraphIndex(MetadataChangeLog changeLog) {
        // Implement graph index update logic (e.g., write to Neo4j)
    }
}
The Metadata Audit Event (MAE) Consumer and related components are part of the DataHub project on GitHub. They are primarily located in the metadata-jobs directory. Here’s a detailed breakdown of where you can find the relevant code:
Location in GitHub Repository
MAE Consumer Job: The MAE Consumer Job is located in the metadata-jobs directory. This directory contains the implementation of the MAE Consumer, including its configuration, processing logic, and integration with Kafka and Elasticsearch.
Relevant Directories and Files:
metadata-jobs/mae-consumer-job: This directory contains the main implementation of the MAE Consumer Job.
metadata-jobs/mae-consumer-job/src/main/java/com/linkedin/metadata/kafka/MetadataChangeLogProcessor.java: This file contains the Java class responsible for consuming and processing Metadata Change Log events.
Example Directory Structure
Here is a simplified directory structure to help you locate the relevant files:

metadata-jobs/
└── mae-consumer-job/
    └── src/
        └── main/
            └── java/
                └── com/
                    └── linkedin/
                        └── metadata/
                            └── kafka/        # Kafka consumer and processing classes
DataHub GitHub Repository (https://github.com/datahub-project/datahub): The source code for the MAE Consumer and other DataHub components can be found in the DataHub GitHub repository.