Adding Metadata about Data Source URL to Datasets during Ingestion

Original Slack Thread

Hello, I am ingesting via OpenAPI. I would like each dataset that is created to include some metadata about where the data came from. Is this possible? For instance, if a dataset is created from data fetched from <http://localhost/books>, I’d like something in the UI, or at least in the JSON representation of the dataset, showing that it was created from data fetched from that URL.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To help you more effectively, could you please provide the following details?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)

Based on your initial query, it sounds like you want to add metadata about the data source URL to each dataset during ingestion. This can be achieved by adding custom properties to the dataset. Custom properties are key-value pairs that can capture additional information about assets, such as the URL from which the data was fetched.

According to the DataHub Properties documentation, custom properties can be added to assets during ingestion or programmatically via the API. Here’s an example of how you can add custom properties using the Python SDK:

from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Define the dataset URN (replace the placeholder platform and dataset name)
dataset_urn = make_dataset_urn(platform="your_platform", name="your_dataset", env="PROD")

# Initialize the DataHubGraph client
gms_endpoint = "http://localhost:8080"
graph = DataHubGraph(DatahubClientConfig(server=gms_endpoint))

# Fetch the current properties of the dataset (None if the aspect does not exist yet)
current_prop = graph.get_aspect(entity_urn=dataset_urn, aspect_type=DatasetPropertiesClass)

# Define the new custom properties
new_prop = {"source_url": "http://localhost/books"}

# Update the current properties with the new custom properties
if current_prop:
    current_prop.customProperties.update(new_prop)
else:
    current_prop = DatasetPropertiesClass(customProperties=new_prop)

# Create a Metadata Change Proposal (MCP) to update the dataset properties
event = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=current_prop,
)

# Emit the MCP to update the dataset properties in DataHub
graph.emit(event)

This code snippet demonstrates how to add a custom property (source_url) to a dataset. You can adapt this approach to include any metadata you need.
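
Custom properties are stored as string key-value pairs, so anything else (a timestamp, for example) has to be converted to a string first. The following is a minimal sketch of building such a map from a source URL using only the standard library. The helper name `build_source_properties` and the keys `source_host`, `source_path`, and `fetched_at` are hypothetical choices for illustration, not DataHub conventions.

```python
from datetime import datetime, timezone
from typing import Dict, Optional
from urllib.parse import urlsplit


def build_source_properties(
    source_url: str, fetched_at: Optional[datetime] = None
) -> Dict[str, str]:
    """Build a string-to-string map describing where a dataset's data came from."""
    parts = urlsplit(source_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {source_url!r}")
    props = {
        "source_url": source_url,
        "source_host": parts.netloc,
        "source_path": parts.path or "/",
    }
    if fetched_at is not None:
        # Values must be strings, so serialize the timestamp as ISO 8601 in UTC.
        props["fetched_at"] = fetched_at.astimezone(timezone.utc).isoformat()
    return props


new_prop = build_source_properties("http://localhost/books")
print(new_prop)
```

The resulting dictionary could be used in place of the hard-coded `new_prop` in the snippet above.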

Would you like more detailed instructions or have any specific questions about this process?


CLI ingestion

```
Models: bundled
Python version: 3.8.10 (default, Nov 22 2023, 10:22:35)
[GCC 9.4.0]
```
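
For CLI ingestion, one option is to attach the property in the recipe itself rather than in a separate script. The sketch below assembles a recipe as a Python dictionary and prints it as JSON so the structure is easy to inspect. It assumes the `openapi` source and the `simple_add_dataset_properties` transformer. The `build_recipe` helper is hypothetical, and the `name`, `swagger_file`, and server values are placeholders. Note that this transformer adds the same properties to every dataset the recipe produces, so it records the base URL, not a per-endpoint one.

```python
import json
from typing import Any, Dict


def build_recipe(base_url: str, swagger_file: str, gms_server: str) -> Dict[str, Any]:
    """Assemble an ingestion recipe that tags every dataset with its source URL."""
    return {
        "source": {
            "type": "openapi",
            "config": {
                "name": "books_api",  # placeholder name for this endpoint
                "url": base_url,
                "swagger_file": swagger_file,  # placeholder path to the OpenAPI definition
            },
        },
        "transformers": [
            {
                # Adds the same custom properties to every dataset in the run.
                "type": "simple_add_dataset_properties",
                "config": {"properties": {"source_url": base_url}},
            }
        ],
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


recipe = build_recipe("http://localhost/", "openapi.json", "http://localhost:8080")
print(json.dumps(recipe, indent=2))
```

The same structure can be written by hand as a YAML recipe and passed to `datahub ingest -c <recipe file>`.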