Understanding Metadata Change Proposals in DataHub

user-3 · October 7, 2024, 12:03am

<@U06TM7M40RK> please give me a summary of MetadataChangeProposal: its definition, purpose, and how it works

user-1 · October 7, 2024, 12:03am

A Metadata Change Proposal (MCP) in DataHub is a request to change a specific aspect of an entity within an organization’s Metadata Graph. It is a core component of DataHub’s metadata ingestion framework, allowing for flexible and scalable metadata management.

Definition
An MCP is defined as a structured request to update a particular aspect of an entity. The structure of an MCP includes fields such as entityType, entityUrn, changeType, aspectName, and aspect. The aspect field contains the new value for the aspect being changed, serialized in JSON format.

Purpose
The primary purpose of an MCP is to propose changes to the metadata of an entity. This can include updates to ownership, documentation, domains, deprecation status, and more. MCPs are designed to be flexible and can be emitted by various clients during the metadata ingestion process.

How It Works

Emission: MCPs can be emitted by clients using DataHub’s low-level ingestion APIs. They can be sent over Kafka for asynchronous processing or directly to an HTTP endpoint for synchronous responses.
Consumption: DataHub’s storage layer listens for new MCPs and attempts to apply the proposed changes to the Metadata Graph.
Schema: The structure of an MCP is defined using PDL (Pegasus Data Language) and includes fields such as entityUrn, entityType, changeType, aspectName, and aspect.
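To make the synchronous HTTP path above concrete, here is a minimal sketch that only builds the request and does not send it. It assumes a GMS instance at http://localhost:8080 and the rest.li `ingestProposal` action on `/aspects`, which is what the Python REST emitter posts to as far as I know; verify both against your DataHub version. The helper name `build_ingest_request` is made up for illustration, and authentication is left out.

```python
import json
import urllib.request


def build_ingest_request(mcp, gms_url="http://localhost:8080"):
    """Build (but do not send) a POST request carrying one MCP.

    Assumption: GMS accepts {"proposal": <mcp>} at /aspects?action=ingestProposal.
    """
    body = json.dumps({"proposal": mcp}).encode("utf-8")
    return urllib.request.Request(
        url=f"{gms_url}/aspects?action=ingestProposal",
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-RestLi-Protocol-Version": "2.0.0",
        },
        method="POST",
    )


example_mcp = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {
        # The aspect payload travels as a JSON string inside the envelope.
        "value": json.dumps(
            {"owners": [{"type": "DATAOWNER", "owner": "urn:li:corpuser:datahub"}]}
        ),
        "contentType": "application/json",
    },
}

request = build_ingest_request(example_mcp)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```

For the asynchronous path, the same MCP would be produced to Kafka instead; that side is not sketched here.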

Example
An example of an MCP to update the ‘ownership’ aspect of a dataset:

{
  "entityType": "dataset",
  "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
  "changeType": "UPSERT",
  "aspectName": "ownership",
  "aspect": {
    "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": 1651516640493,
    "runId": "no-run-id-provided",
    "registryName": "unknownRegistry",
    "registryVersion": "0.0.0.0-dev",
    "properties": null
  }
}
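One detail in the example above is easy to miss: `aspect.value` is a JSON string embedded in the outer JSON, not a nested object. A rough Python sketch of building that envelope with only the standard library follows; `build_mcp` is a hypothetical helper and not part of the DataHub SDK, and `systemMetadata` is omitted.

```python
import json


def build_mcp(entity_type, entity_urn, aspect_name, aspect_obj, change_type="UPSERT"):
    """Wrap an aspect payload in the MCP envelope shown in the thread."""
    return {
        "entityType": entity_type,
        "entityUrn": entity_urn,
        "changeType": change_type,
        "aspectName": aspect_name,
        "aspect": {
            # Serialize the aspect to a JSON *string*; it is not nested as an object.
            "value": json.dumps(aspect_obj),
            "contentType": "application/json",
        },
    }


ownership = {
    "owners": [{"type": "DATAOWNER", "owner": "urn:li:corpuser:datahub"}],
    "lastModified": {"actor": "urn:li:corpuser:datahub", "time": 1651516640488},
}

mcp = build_mcp(
    "dataset",
    "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "ownership",
    ownership,
)
print(json.dumps(mcp, indent=2))
```

Decoding works in two steps as well: parse the envelope first, then call `json.loads` on `aspect.value` to get the aspect back.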

Would you like more detailed guidance on any specific aspect of Metadata Change Proposals?

Sources:

<https://datahubproject.io/docs/0.14.0/what/mxe|Metadata Events | DataHub 0.14.0>
<https://datahubproject.io/docs/0.14.0/advanced/mcp-mcl|MetadataChangeProposal & MetadataChangeLog Events | DataHub 0.14.0>
<https://datahubproject.io/docs/0.14.0/architecture/metadata-ingestion|Ingestion Framework | DataHub 0.14.0>

user-3 · October 7, 2024, 12:03am

<@U06TM7M40RK> give basic MCP structure in JSON - fields not filled

user-1 · October 7, 2024, 12:03am

Here is the basic structure of a Metadata Change Proposal (MCP) in JSON format with fields not filled:

{
  "entityType": "",
  "entityUrn": "",
  "changeType": "",
  "aspectName": "",
  "aspect": {
    "value": "",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": null,
    "runId": "",
    "registryName": "",
    "registryVersion": "",
    "properties": null
  },
  "headers": {}
}

Field Descriptions:

entityType: The type of the entity (e.g., dataset, chart).
entityUrn: The unique identifier for the entity being changed.
changeType: The type of change being proposed (e.g., UPSERT, CREATE, DELETE, PATCH).
aspectName: The name of the aspect being changed.
aspect: Contains the new value for the aspect.
- value: The serialized aspect value.
- contentType: The serialization type of the aspect (currently supports “application/json”).
systemMetadata: Additional metadata about the proposal.
- lastObserved: Timestamp of the last observation.
- runId: Identifier for the ingestion run.
- registryName: Name of the registry.
- registryVersion: Version of the registry.
- properties: Additional properties.
headers: Optional headers for implementing conditional write logic.
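As a small illustration of working with the blank template above, the sketch below reports which fields still need a value. Which fields count as mandatory here follows this thread's description only; the PDL schema remains the authority, and `unfilled_fields` is a hypothetical helper.

```python
# Top-level fields this thread treats as mandatory (an assumption, see above).
REQUIRED_FIELDS = ("entityType", "entityUrn", "changeType", "aspectName")


def unfilled_fields(mcp):
    """Return the names of mandatory fields that are missing or empty."""
    missing = [name for name in REQUIRED_FIELDS if not mcp.get(name)]
    aspect = mcp.get("aspect") or {}
    for key in ("value", "contentType"):
        if not aspect.get(key):
            missing.append(f"aspect.{key}")
    return missing


blank_template = {
    "entityType": "",
    "entityUrn": "",
    "changeType": "",
    "aspectName": "",
    "aspect": {"value": "", "contentType": "application/json"},
    "systemMetadata": {
        "lastObserved": None,
        "runId": "",
        "registryName": "",
        "registryVersion": "",
        "properties": None,
    },
    "headers": {},
}

print(unfilled_fields(blank_template))
```

Because the template pre-fills `contentType`, only the four top-level fields and `aspect.value` are reported.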

Would you like more detailed information on any specific field?

Sources:

<https://datahubproject.io/docs/0.14.0/what/mxe|Metadata Events | DataHub 0.14.0>

user-3 · October 7, 2024, 12:03am

<@U06TM7M40RK> add type info to this schema and mark optional fields

user-1 · October 7, 2024, 12:03am

Here is the basic structure of a Metadata Change Proposal (MCP) in JSON format with type information and optional fields marked:

{
  "entityType": "string",  // Type of the entity (e.g., dataset, chart)
  "entityUrn": "string",   // Unique identifier for the entity being changed
  "changeType": "string",  // Type of change (e.g., UPSERT, CREATE, DELETE, PATCH)
  "aspectName": "string",  // Name of the aspect being changed
  "aspect": {
    "value": "string",     // Serialized aspect value
    "contentType": "string" // Serialization type of the aspect (e.g., application/json)
  },
  "systemMetadata": {      // Optional
    "lastObserved": "number", // Timestamp of the last observation (optional)
    "runId": "string",        // Identifier for the ingestion run (optional)
    "registryName": "string", // Name of the registry (optional)
    "registryVersion": "string", // Version of the registry (optional)
    "properties": "object"    // Additional properties (optional)
  },
  "headers": "object"       // Headers for implementing conditional write logic (optional)
}

Field Descriptions:

entityType: string - The type of the entity (e.g., dataset, chart).
entityUrn: string - The unique identifier for the entity being changed.
changeType: string - The type of change being proposed (e.g., UPSERT, CREATE, DELETE, PATCH).
aspectName: string - The name of the aspect being changed.
aspect: object - Contains the new value for the aspect.
- value: string - The serialized aspect value.
- contentType: string - The serialization type of the aspect (currently supports “application/json”).
systemMetadata: object (optional) - Additional metadata about the proposal.
- lastObserved: number (optional) - Timestamp of the last observation.
- runId: string (optional) - Identifier for the ingestion run.
- registryName: string (optional) - Name of the registry.
- registryVersion: string (optional) - Version of the registry.
- properties: object (optional) - Additional properties.
headers: object (optional) - Headers for implementing conditional write logic.
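The typed schema above can also be written so that a type checker enforces it. The sketch below mirrors the required and optional split exactly as given in this thread, using `typing.TypedDict` (Python 3.9+). The class names are illustrative, and the `Dict[str, str]` value types for `properties` and `headers` are an assumption, since the thread only says "object".

```python
from typing import Dict, Optional, TypedDict


class GenericAspect(TypedDict):
    value: str        # serialized aspect
    contentType: str  # e.g. "application/json"


class SystemMetadata(TypedDict, total=False):
    # total=False marks every key in this class as optional.
    lastObserved: int
    runId: str
    registryName: str
    registryVersion: str
    properties: Optional[Dict[str, str]]


class _RequiredMCPFields(TypedDict):
    entityType: str
    entityUrn: str
    changeType: str
    aspectName: str
    aspect: GenericAspect


class MetadataChangeProposalDict(_RequiredMCPFields, total=False):
    # Keys declared here are optional; inherited keys stay required.
    systemMetadata: SystemMetadata
    headers: Dict[str, str]


mcp: MetadataChangeProposalDict = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {"value": "{}", "contentType": "application/json"},
}

print(sorted(MetadataChangeProposalDict.__optional_keys__))
```

TypedDict adds no runtime validation; it only lets tools such as mypy flag a missing required key or a misspelled field name.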

Would you like more detailed information on any specific field?

Sources:

<https://datahubproject.io/docs/0.14.0/what/mxe|Metadata Events | DataHub 0.14.0>

user-3 · October 7, 2024, 12:03am

<@U06TM7M40RK> why do I need DatasetAssertionInfo and AssertionInfo

user-1 · October 7, 2024, 12:03am

DatasetAssertionInfo and AssertionInfo are both components of DataHub’s metadata model, specifically designed to handle assertions, which are programmatic validations, checks, or tests performed periodically against datasets or other entities.

AssertionInfo
AssertionInfo is a comprehensive structure that encapsulates all the details about an assertion. It includes various fields that describe the assertion, such as its type, description, and external URL. It can also contain specific information about different types of assertions, such as dataset assertions, freshness assertions, volume assertions, SQL assertions, and more.

DatasetAssertionInfo
DatasetAssertionInfo is a specialized structure within AssertionInfo that provides detailed information specifically about assertions related to datasets. It includes fields such as the dataset URN, the scope of the assertion, the fields involved, the aggregation and operator used, and any parameters required for the assertion.

Why You Need Both

AssertionInfo: This is the overarching structure that provides a unified way to handle various types of assertions. It allows you to define and manage assertions in a standardized manner, regardless of the specific type of assertion.
DatasetAssertionInfo: This is a specialized component within AssertionInfo that provides detailed information specific to dataset assertions. It allows you to capture and manage the nuances and specifics of dataset-related assertions, such as which fields are involved and what kind of checks are being performed.

Example
When you create an assertion for a dataset, you would use AssertionInfo to define the general properties of the assertion and then use DatasetAssertionInfo to provide the specific details related to the dataset.

{
  "type": "DATASET",
  "description": "Check if the dataset has no null values in the 'age' column",
  "datasetAssertion": {
    "datasetUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "scope": "DATASET_COLUMN",
    "fields": [
      {
        "urn": "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD),age)"
      }
    ],
    "aggregation": "IDENTITY",
    "operator": "NOT_NULL",
    "parameters": {
      "value": "true"
    }
  }
}

In this example:

AssertionInfo provides the general information about the assertion, such as its type and description.
DatasetAssertionInfo provides the specific details about the dataset assertion, such as the dataset URN, the scope, the fields involved, and the parameters for the assertion.
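To tie this back to MCPs: the AssertionInfo payload, with its nested dataset-specific part, would be what goes into `aspect.value` of a proposal against an assertion entity. The sketch below shows that nesting. It assumes the aspect is named `assertionInfo` on entity type `assertion`; the assertion URN is a made-up placeholder, and the field names inside the payload are copied from the example in this thread rather than checked against the PDL schema.

```python
import json


def assertion_info_mcp(assertion_urn, assertion_info):
    """Wrap an AssertionInfo-style payload in an MCP envelope (illustrative only)."""
    return {
        "entityType": "assertion",
        "entityUrn": assertion_urn,
        "changeType": "UPSERT",
        "aspectName": "assertionInfo",  # assumed aspect name
        "aspect": {
            "value": json.dumps(assertion_info),
            "contentType": "application/json",
        },
    }


# Dataset-specific details (the DatasetAssertionInfo part), as given in the thread.
dataset_assertion = {
    "datasetUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "scope": "DATASET_COLUMN",
    "operator": "NOT_NULL",
}

# General properties (the AssertionInfo part) wrapping the dataset-specific block.
assertion_info = {
    "type": "DATASET",
    "description": "Check if the dataset has no null values in the 'age' column",
    "datasetAssertion": dataset_assertion,
}

# Hypothetical URN, for illustration only.
mcp = assertion_info_mcp("urn:li:assertion:example-age-not-null", assertion_info)
print(json.dumps(mcp, indent=2))
```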

Would you like more detailed information or examples on how to use these components?

Sources:

<Assertion | DataHub | DataHub 0.14.0>

datahub_team · October 7, 2024, 12:03am

Hi! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look.

Hope this helps!
