Hi <@U06TM7M40RK> I need a sample YAML file with the correct format to ingest a list of domains from GitHub
I’m not able to find any data sources to answer your question. Could you specify the exact structure or fields you need in the YAML file? If this doesn’t help, please reach out to the DataHub support team for further assistance. Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK> I would like to know how to import domains into DataHub
To import domains into DataHub, you can use the DataHub CLI or the DataHub Python SDK. Below are the steps for both methods:
Method 1: Using DataHub CLI
- Create a YAML file for the ingestion recipe:
This file defines the source of the metadata and the sink (where the metadata will be ingested). The file source takes the input location in its path option.
# example-recipe.yml
source:
  type: file
  config:
    path: "./domains.json"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub instance URL
    token: "your-auth-token"  # Replace with your DataHub auth token if required
- Prepare the JSON file with domain metadata:
Create a JSON file (domains.json) that contains the domain metadata you want to ingest. Domains have no snapshot class, so each record is written as a Metadata Change Proposal (MCP) carrying the domainProperties aspect.
[
  {
    "entityType": "domain",
    "entityUrn": "urn:li:domain:engineering",
    "changeType": "UPSERT",
    "aspectName": "domainProperties",
    "aspect": {
      "json": {
        "name": "Engineering",
        "description": "Domain for engineering data assets"
      }
    }
  },
  {
    "entityType": "domain",
    "entityUrn": "urn:li:domain:hr",
    "changeType": "UPSERT",
    "aspectName": "domainProperties",
    "aspect": {
      "json": {
        "name": "Human Resources",
        "description": "Domain for HR data assets"
      }
    }
  }
]
- Run the ingestion command:
Use the DataHub CLI to run the ingestion based on your recipe file.
datahub ingest -c example-recipe.yml
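If the domain list is maintained elsewhere (for example in a file checked into a GitHub repository), domains.json can be generated rather than written by hand. The following is a minimal sketch, assuming the file source accepts MCP-style records with the aspect payload under aspect.json; the helper name build_domain_records and the sample tuples are illustrative and not part of DataHub.

```python
import json


def build_domain_records(domains):
    """Turn (id, name, description) tuples into MCP-style dicts.

    The record layout is an assumption about what the DataHub file
    source reads; adjust it if your DataHub version expects otherwise.
    """
    records = []
    for domain_id, name, description in domains:
        records.append(
            {
                "entityType": "domain",
                "entityUrn": f"urn:li:domain:{domain_id}",
                "changeType": "UPSERT",
                "aspectName": "domainProperties",
                "aspect": {"json": {"name": name, "description": description}},
            }
        )
    return records


if __name__ == "__main__":
    sample = [
        ("engineering", "Engineering", "Domain for engineering data assets"),
        ("hr", "Human Resources", "Domain for HR data assets"),
    ]
    # Print the JSON array; redirect stdout to domains.json to use it in the recipe.
    print(json.dumps(build_domain_records(sample), indent=2))
```

Saved under a name of your choosing, the script's output can be redirected to domains.json and then ingested with the recipe from step 1.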
Method 2: Using DataHub Python SDK
- Install the DataHub Python SDK:
Ensure you have Python 3.8+ installed. Then, install the DataHub Python SDK.
python3 -m pip install --upgrade pip wheel setuptools
python3 -m pip install --upgrade acryl-datahub
- Create a Python script to emit domain metadata:
Use the DataHub Python SDK to create and send Metadata Change Proposals (MCPs) for the domains.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DomainPropertiesClass

# Initialize the DataHub REST emitter (pass token="..." if your instance requires auth)
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# List of domains to ingest
domains = [
    {
        "urn": "urn:li:domain:engineering",
        "name": "Engineering",
        "description": "Domain for engineering data assets",
    },
    {
        "urn": "urn:li:domain:hr",
        "name": "Human Resources",
        "description": "Domain for HR data assets",
    },
]

# Emit one Metadata Change Proposal per domain.
# Domains have no snapshot class, so the domainProperties aspect is sent as an MCP.
for domain in domains:
    domain_properties = DomainPropertiesClass(
        name=domain["name"],
        description=domain["description"],
    )
    mcp = MetadataChangeProposalWrapper(
        entityUrn=domain["urn"],  # already a fully qualified domain URN
        aspect=domain_properties,
    )
    emitter.emit_mcp(mcp)
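Since the original question was about a list of domains kept in GitHub, here is a minimal sketch of turning such a list into the domains structure used in the script. It assumes a CSV layout with id, name, and description columns; that layout and the helper name load_domains are assumptions, and fetching the file from GitHub is left out.

```python
import csv
import io


def load_domains(csv_text):
    """Parse CSV text (columns: id, name, description) into domain dicts."""
    domains = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        domain_id = (row.get("id") or "").strip()
        if not domain_id:
            continue  # skip rows without a domain id
        domains.append(
            {
                "urn": f"urn:li:domain:{domain_id}",
                "name": (row.get("name") or "").strip(),
                "description": (row.get("description") or "").strip(),
            }
        )
    return domains


# Hypothetical file contents, mirroring the two example domains.
SAMPLE_CSV = """id,name,description
engineering,Engineering,Domain for engineering data assets
hr,Human Resources,Domain for HR data assets
"""

if __name__ == "__main__":
    for domain in load_domains(SAMPLE_CSV):
        print(domain)
```

The returned list has the same urn, name, and description keys as the hard-coded domains list, so it can be passed to the emit loop unchanged.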
Would you like more detailed steps or additional examples on any of these methods?
Sources:
- https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/transformer/dataset_transformer.md
- https://datahubspace.slack.com/archives/CUMUWQU66/p1722427040.739189