Creating a YAML File for Ingesting a List of Domains from GitHub

Original Slack Thread

Hi <@U06TM7M40RK>, I need a sample YAML file with the correct format to ingest a list of domains from GitHub.

I’m not able to find any data sources to answer your question. Could you specify the exact structure or fields you need in the YAML file? If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> I would like to know how to import domains into DataHub.

To import domains into DataHub, you can use the DataHub CLI or the DataHub Python SDK. Below are the steps for both methods:

Method 1: Using DataHub CLI

  1. Create a YAML file for the ingestion recipe:
    This file will define the source of the metadata and the sink (where the metadata will be ingested).

    # example-recipe.yml
    
    source:
      type: file
      config:
        filename: "./domains.json"
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"  # Replace with your DataHub instance URL
        token: "your-auth-token"  # Replace with your DataHub auth token if required
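
    To validate the recipe without writing to your DataHub instance, the `datahub-rest` sink can be swapped for a `file` sink, which writes the processed events to disk instead (a sketch; the output filename is an assumption):

    ```yml
    sink:
      type: "file"
      config:
        filename: "./ingestion-output.json"
    ```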
    
  2. Prepare the JSON file with domain metadata:
    Create a JSON file (domains.json) containing the serialized MetadataChangeEvents you want to ingest; note that each proposedSnapshot is keyed by its fully qualified snapshot type.

    [
      {
        "proposedSnapshot": {
          "com.linkedin.metadata.snapshot.DomainSnapshot": {
            "urn": "urn:li:domain:engineering",
            "aspects": [
              {
                "com.linkedin.domain.DomainProperties": {
                  "name": "Engineering",
                  "description": "Domain for engineering data assets"
                }
              }
            ]
          }
        }
      },
      {
        "proposedSnapshot": {
          "com.linkedin.metadata.snapshot.DomainSnapshot": {
            "urn": "urn:li:domain:hr",
            "aspects": [
              {
                "com.linkedin.domain.DomainProperties": {
                  "name": "Human Resources",
                  "description": "Domain for HR data assets"
                }
              }
            ]
          }
        }
      }
    ]
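
    If the domain list lives elsewhere (for example, as a simple list in a script or a CSV pulled from GitHub), the JSON file above can be generated programmatically. A minimal sketch, assuming a plain list of (id, name, description) tuples; `build_domain_mces` is a hypothetical helper name, not part of the DataHub SDK:

    ```python
    import json

    def build_domain_mces(domains):
        """Build MCE-file entries for a list of (id, name, description) tuples."""
        entries = []
        for domain_id, name, description in domains:
            entries.append({
                "proposedSnapshot": {
                    "com.linkedin.metadata.snapshot.DomainSnapshot": {
                        "urn": f"urn:li:domain:{domain_id}",
                        "aspects": [
                            {
                                "com.linkedin.domain.DomainProperties": {
                                    "name": name,
                                    "description": description,
                                }
                            }
                        ],
                    }
                }
            })
        return entries

    if __name__ == "__main__":
        mces = build_domain_mces([
            ("engineering", "Engineering", "Domain for engineering data assets"),
            ("hr", "Human Resources", "Domain for HR data assets"),
        ])
        # Write the file the ingestion recipe points at
        with open("domains.json", "w") as f:
            json.dump(mces, f, indent=2)
    ```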
    
  3. Run the ingestion command:
    Use the DataHub CLI to run the ingestion based on your recipe file.

    datahub ingest -c example-recipe.yml
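
    Since the original question was about ingesting a domain list kept in GitHub, one option is to download the file from the repository's raw-content URL before running the recipe. A sketch, assuming the file sits in a public repo; `your-org/your-repo` is a placeholder you would replace:

    ```shell
    # Assumption: domains.json lives in a public GitHub repo; adjust REPO/BRANCH/FILE.
    REPO="your-org/your-repo"
    BRANCH="main"
    FILE="domains.json"
    RAW_URL="https://raw.githubusercontent.com/${REPO}/${BRANCH}/${FILE}"

    # Download the file, then run the ingestion recipe against it
    curl -fsSL "$RAW_URL" -o domains.json
    datahub ingest -c example-recipe.yml
    ```

    For private repositories, `curl` would additionally need an `Authorization` header with a GitHub token.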
    

Method 2: Using DataHub Python SDK

  1. Install the DataHub Python SDK:
    Ensure you have a supported Python 3 version installed (3.8 or newer for current releases). Then, install the DataHub Python SDK.

    python3 -m pip install --upgrade pip wheel setuptools
    python3 -m pip install --upgrade acryl-datahub
    
  2. Create a Python script to emit domain metadata:
    Use the DataHub Python SDK to create and send Metadata Change Proposals (MCPs) for the domains.

    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DomainPropertiesClass

    # Initialize the DataHub REST emitter
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

    # List of domains to ingest
    domains = [
        {
            "urn": "urn:li:domain:engineering",
            "name": "Engineering",
            "description": "Domain for engineering data assets"
        },
        {
            "urn": "urn:li:domain:hr",
            "name": "Human Resources",
            "description": "Domain for HR data assets"
        }
    ]

    # Emit a Metadata Change Proposal (MCP) for each domain
    for domain in domains:
        mcp = MetadataChangeProposalWrapper(
            entityUrn=domain["urn"],
            aspect=DomainPropertiesClass(
                name=domain["name"],
                description=domain["description"]
            ),
        )
        emitter.emit_mcp(mcp)

Would you like more detailed steps or additional examples on any of these methods?
