Understanding Metadata Change Proposals in DataHub

Original Slack Thread

<@U06TM7M40RK> please give me summary on MetadataChangeProposal, its definition, purpose and the way it works

A Metadata Change Proposal (MCP) in DataHub is a request to change a specific aspect of an entity within an organization’s Metadata Graph. It is a core component of DataHub’s metadata ingestion framework, allowing for flexible and scalable metadata management.

Definition
An MCP is defined as a structured request to update a particular aspect of an entity. The structure of an MCP includes fields such as entityType, entityUrn, changeType, aspectName, and aspect. The aspect field contains the new value for the aspect being changed, serialized in JSON format.
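
To make the definition concrete, here is a minimal sketch in plain Python (standard library only, not the DataHub SDK) that assembles a dict with the fields listed above. The helper name build_mcp and its default change type are illustrative assumptions; the point is only that the aspect travels as a JSON string inside the aspect field.

```python
import json
from typing import Any


def build_mcp(
    entity_type: str,
    entity_urn: str,
    aspect_name: str,
    aspect: dict[str, Any],
    change_type: str = "UPSERT",
) -> dict[str, Any]:
    """Assemble an MCP-shaped dict (hypothetical helper, not a DataHub API)."""
    return {
        "entityType": entity_type,
        "entityUrn": entity_urn,
        "changeType": change_type,
        "aspectName": aspect_name,
        "aspect": {
            # The aspect itself is serialized to a JSON string.
            "value": json.dumps(aspect),
            "contentType": "application/json",
        },
    }


ownership = {"owners": [{"type": "DATAOWNER", "owner": "urn:li:corpuser:datahub"}]}
mcp = build_mcp(
    "dataset",
    "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "ownership",
    ownership,
)
print(json.dumps(mcp, indent=2))
```

Note that aspect.value ends up as a string, not a nested object, which is why the quotes inside it appear escaped when the whole proposal is printed as JSON.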

Purpose
The primary purpose of an MCP is to propose changes to the metadata of an entity. This can include updates to ownership, documentation, domains, deprecation status, and more. MCPs are designed to be flexible and can be emitted by various clients during the metadata ingestion process.

How It Works

  1. Emission: MCPs can be emitted by clients using DataHub’s low-level ingestion APIs. They can be sent over Kafka for asynchronous processing or directly to an HTTP endpoint for synchronous responses.
  2. Consumption: DataHub’s storage layer listens for new MCPs and attempts to apply the proposed changes to the Metadata Graph.
  3. Schema: The structure of an MCP is defined using PDL (Pegasus Data Language) and includes fields such as entityUrn, entityType, changeType, aspectName, and aspect.
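
The emission and consumption steps can be pictured with a toy in-memory model. This is only a sketch of the flow described above, not DataHub's implementation: the deque stands in for the Kafka topic, the dict stands in for the Metadata Graph, and every function name is hypothetical.

```python
import json
from collections import deque

graph = {}       # (entityUrn, aspectName) -> decoded aspect; stands in for the Metadata Graph
queue = deque()  # stands in for the Kafka topic used on the asynchronous path


def apply_mcp(mcp):
    """Consumption step: apply an UPSERT proposal to the toy graph."""
    if mcp["changeType"] != "UPSERT":
        raise NotImplementedError("this toy model only handles UPSERT")
    key = (mcp["entityUrn"], mcp["aspectName"])
    graph[key] = json.loads(mcp["aspect"]["value"])
    return key


def emit_sync(mcp):
    """Synchronous path: the caller gets the outcome immediately."""
    return apply_mcp(mcp)


def emit_async(mcp):
    """Asynchronous path: the proposal is only queued; nothing is applied yet."""
    queue.append(mcp)


def drain():
    """Background consumer: apply every queued proposal in arrival order."""
    while queue:
        apply_mcp(queue.popleft())


proposal = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {
        "value": json.dumps({"owners": []}),
        "contentType": "application/json",
    },
}

emit_async(proposal)
print(len(graph))  # 0: queued but not yet applied
drain()
print(len(graph))  # 1: applied by the consumer
```

The trade-off the model shows is the one named in step 1: the synchronous path tells the caller right away whether the change was applied, while the asynchronous path returns before the graph has changed.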

Example
An example of an MCP to update the ‘ownership’ aspect of a dataset:

{
  "entityType": "dataset",
  "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
  "changeType": "UPSERT",
  "aspectName": "ownership",
  "aspect": {
    "value": "{\"owners\":[{\"type\":\"DATAOWNER\",\"owner\":\"urn:li:corpuser:datahub\"}],\"lastModified\":{\"actor\":\"urn:li:corpuser:datahub\",\"time\":1651516640488}}",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": 1651516640493,
    "runId": "no-run-id-provided",
    "registryName": "unknownRegistry",
    "registryVersion": "0.0.0.0-dev",
    "properties": null
  }
}
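
Because aspect.value is itself a JSON string, reading the proposal takes two decoding steps: one for the envelope and one for the aspect. The sketch below uses only the standard library and starts from an already-decoded envelope; decode_aspect is a hypothetical helper name.

```python
import json

# The envelope from the example above, already parsed into a dict.
# Note that aspect.value is still a string at this point.
mcp = {
    "entityType": "dataset",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {
        "value": '{"owners":[{"type":"DATAOWNER","owner":"urn:li:corpuser:datahub"}],'
                 '"lastModified":{"actor":"urn:li:corpuser:datahub","time":1651516640488}}',
        "contentType": "application/json",
    },
}


def decode_aspect(proposal):
    """Return the aspect as a dict, checking the declared content type first."""
    aspect = proposal["aspect"]
    if aspect["contentType"] != "application/json":
        raise ValueError(f"unsupported contentType: {aspect['contentType']}")
    return json.loads(aspect["value"])


ownership = decode_aspect(mcp)
print(ownership["owners"][0]["owner"])  # urn:li:corpuser:datahub
```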

Would you like more detailed guidance on any specific aspect of Metadata Change Proposals?

<@U06TM7M40RK> give the basic MCP structure in JSON, with the fields not filled

Here is the basic structure of a Metadata Change Proposal (MCP) in JSON format, with the fields left empty:

{
  "entityType": "",
  "entityUrn": "",
  "changeType": "",
  "aspectName": "",
  "aspect": {
    "value": "",
    "contentType": "application/json"
  },
  "systemMetadata": {
    "lastObserved": null,
    "runId": "",
    "registryName": "",
    "registryVersion": "",
    "properties": null
  },
  "headers": {}
}

Field Descriptions:

  • entityType: The type of the entity (e.g., dataset, chart).
  • entityUrn: The unique identifier for the entity being changed.
  • changeType: The type of change being proposed (e.g., UPSERT, CREATE, DELETE, PATCH).
  • aspectName: The name of the aspect being changed.
  • aspect: Contains the new value for the aspect.
    • value: The serialized aspect value.
    • contentType: The serialization type of the aspect (currently supports “application/json”).
  • systemMetadata: Additional metadata about the proposal.
    • lastObserved: Timestamp of the last observation.
    • runId: Identifier for the ingestion run.
    • registryName: Name of the registry.
    • registryVersion: Version of the registry.
    • properties: Additional properties.
  • headers: Optional headers for implementing conditional write logic.

Would you like more detailed information on any specific field?

<@U06TM7M40RK> add type info to this schema and mark optional fields

Here is the basic structure of a Metadata Change Proposal (MCP) in JSON format with type information and optional fields marked:

{
  "entityType": "string",        // Type of the entity (e.g., dataset, chart)
  "entityUrn": "string",         // Unique identifier for the entity being changed
  "changeType": "string",        // Type of change (e.g., UPSERT, CREATE, DELETE, PATCH)
  "aspectName": "string",        // Name of the aspect being changed
  "aspect": {
    "value": "string",           // Serialized aspect value
    "contentType": "string"      // Serialization type of the aspect (e.g., application/json)
  },
  "systemMetadata": {            // Optional
    "lastObserved": "number",    // Optional: timestamp of the last observation
    "runId": "string",           // Optional: identifier for the ingestion run
    "registryName": "string",    // Optional: name of the registry
    "registryVersion": "string", // Optional: version of the registry
    "properties": "object"       // Optional: additional properties
  },
  "headers": "object"            // Optional: headers for implementing conditional write logic
}

Field Descriptions:

  • entityType: string - The type of the entity (e.g., dataset, chart).
  • entityUrn: string - The unique identifier for the entity being changed.
  • changeType: string - The type of change being proposed (e.g., UPSERT, CREATE, DELETE, PATCH).
  • aspectName: string - The name of the aspect being changed.
  • aspect: object - Contains the new value for the aspect.
    • value: string - The serialized aspect value.
    • contentType: string - The serialization type of the aspect (currently supports “application/json”).
  • systemMetadata: object (optional) - Additional metadata about the proposal.
    • lastObserved: number (optional) - Timestamp of the last observation, in epoch milliseconds.
    • runId: string (optional) - Identifier for the ingestion run.
    • registryName: string (optional) - Name of the registry.
    • registryVersion: string (optional) - Version of the registry.
    • properties: object (optional) - Additional properties.
  • headers: object (optional) - Headers for implementing conditional write logic.
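
The required/optional split above can be turned into a simple shape check. The following is a rough sketch in plain Python that follows the marking given in this thread (it does not consult the PDL schema, which may be more lenient); find_problems and the two lookup tables are hypothetical names.

```python
from typing import Any

# Required/optional split as marked in the field list above.
REQUIRED_FIELDS = {
    "entityType": str,
    "entityUrn": str,
    "changeType": str,
    "aspectName": str,
    "aspect": dict,
}
OPTIONAL_FIELDS = {"systemMetadata": dict, "headers": dict}


def find_problems(mcp: dict[str, Any]) -> list[str]:
    """Return readable problems; an empty list means the shape looks fine."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in mcp:
            problems.append(f"missing required field: {name}")
        elif not isinstance(mcp[name], expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        # Optional fields may be absent or null, but must be well-typed if set.
        if mcp.get(name) is not None and not isinstance(mcp[name], expected):
            problems.append(f"{name} should be of type {expected.__name__}")
    aspect = mcp.get("aspect")
    if isinstance(aspect, dict):
        for name in ("value", "contentType"):
            if not isinstance(aspect.get(name), str):
                problems.append(f"aspect.{name} should be of type str")
    return problems


candidate = {
    "entityType": "dataset",
    "changeType": "UPSERT",
    "aspectName": "ownership",
    "aspect": {"value": "{}", "contentType": "application/json"},
    "headers": [],  # wrong type on purpose
}
for problem in find_problems(candidate):
    print(problem)
```

Running it on the deliberately broken candidate reports the missing entityUrn and the mistyped headers, while leaving out systemMetadata raises no complaint.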

Would you like more detailed information on any specific field?

<@U06TM7M40RK> why do I need DatasetAssertionInfo and AssertionInfo

DatasetAssertionInfo and AssertionInfo are both components of DataHub’s metadata model, specifically designed to handle assertions, which are programmatic validations, checks, or tests performed periodically against datasets or other entities.

AssertionInfo
AssertionInfo is a comprehensive structure that encapsulates all the details about an assertion. It includes various fields that describe the assertion, such as its type, description, and external URL. It can also contain specific information about different types of assertions, such as dataset assertions, freshness assertions, volume assertions, SQL assertions, and more.

DatasetAssertionInfo
DatasetAssertionInfo is a specialized structure within AssertionInfo that provides detailed information specifically about assertions related to datasets. It includes fields such as the dataset URN, the scope of the assertion, the fields involved, the aggregation and operator used, and any parameters required for the assertion.

Why You Need Both

  • AssertionInfo: This is the overarching structure that provides a unified way to handle various types of assertions. It allows you to define and manage assertions in a standardized manner, regardless of the specific type of assertion.
  • DatasetAssertionInfo: This is a specialized component within AssertionInfo that provides detailed information specific to dataset assertions. It allows you to capture and manage the nuances and specifics of dataset-related assertions, such as which fields are involved and what kind of checks are being performed.

Example
When you create an assertion for a dataset, you would use AssertionInfo to define the general properties of the assertion and then use DatasetAssertionInfo to provide the specific details related to the dataset.

{
  "type": "DATASET",
  "description": "Check that the 'age' column of the dataset contains no null values",
  "datasetAssertion": {
    "dataset": "urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD)",
    "scope": "DATASET_COLUMN",
    "fields": [
      "urn:li:schemaField:(urn:li:dataset:(urn:li:dataPlatform:hdfs,SampleHdfsDataset,PROD),age)"
    ],
    "aggregation": "IDENTITY",
    "operator": "NOT_NULL"
  }
}

In this example:

  • AssertionInfo provides the general information about the assertion, such as its type and description.
  • DatasetAssertionInfo provides the specific details about the dataset assertion, such as the dataset URN, the scope, the fields involved, and the aggregation and operator applied. The NOT_NULL operator takes no parameters, so none are set here.
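
One way to picture the relationship is that the type field on AssertionInfo tells a reader which nested block carries the details. The sketch below is plain Python over dicts, not the DataHub SDK; type_specific_details is a hypothetical helper, and only the datasetAssertion key comes from the example above. The other three key names are assumptions inferred from the assertion types mentioned earlier.

```python
from typing import Any

# Maps the general "type" to the key of the type-specific block.
DETAIL_KEY_BY_TYPE = {
    "DATASET": "datasetAssertion",      # taken from the example above
    "FRESHNESS": "freshnessAssertion",  # assumed key name
    "VOLUME": "volumeAssertion",        # assumed key name
    "SQL": "sqlAssertion",              # assumed key name
}


def type_specific_details(info: dict[str, Any]) -> dict[str, Any]:
    """Return the nested block that matches the assertion's declared type."""
    assertion_type = info["type"]
    key = DETAIL_KEY_BY_TYPE.get(assertion_type)
    if key is None:
        raise ValueError(f"unsupported assertion type: {assertion_type}")
    if key not in info:
        raise ValueError(f"type {assertion_type} requires the '{key}' block")
    return info[key]


info = {
    "type": "DATASET",
    "description": "Check that the 'age' column contains no null values",
    "datasetAssertion": {"scope": "DATASET_COLUMN", "operator": "NOT_NULL"},
}
details = type_specific_details(info)
print(details["operator"])  # NOT_NULL
```

The general fields (type, description) stay at the top level for every kind of assertion, and only the nested block changes with the type, which is why both structures are needed.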

Would you like more detailed information or examples on how to use these components?

Sources:

  • Assertion | DataHub | DataHub 0.14.0
