Adding Domains to Multiple Glossary Terms in DataHub

Original Slack Thread

<@U06TM7M40RK> Is it possible to do a Bulk Change to add domains to a number of glossary terms?

DataHub does not provide a transformer that adds domains to glossary terms. The simple_add_dataset_domain transformer does assign domains in bulk, and it can be configured to either replace existing domains or keep them, but it applies only to datasets.

For glossary terms, you can instead use the DataHub API to assign domains programmatically after ingesting the terms.

Here is a step-by-step approach:

  1. Ingest Glossary Terms: Use a YAML file to ingest your glossary terms.
  2. Link Domains to Glossary Terms: Use the DataHub API to link the ingested glossary terms to the relevant domains.

Step 1: Ingest Glossary Terms

Create a YAML file for your glossary terms and ingest it using the DataHub CLI (the datahub-business-glossary source reads this format). Here is an example structure for the YAML file:

version: 1
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: Classification
    description: A set of terms related to Data Classification
    terms:
      - name: Sensitive
        description: Sensitive Data
      - name: Confidential
        description: Confidential Data
      - name: HighlyConfidential
        description: Highly Confidential Data
  - name: PersonalInformation
    description: All terms related to personal information
    terms:
      - name: Email
        description: An individual's email address
      - name: Address
        description: A physical address
      - name: Gender
        description: The gender identity of the individual
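
Step 2 needs the URN of every ingested term. The sketch below derives them from the glossary structure, assuming the business glossary source builds each term ID by joining the node path and term name with dots (its behavior when enable_auto_id is off and no explicit id is set). That is an assumption to verify against the URNs shown in your DataHub UI; iter_term_urns is a hypothetical helper, not part of DataHub.

```python
from typing import Iterator, Sequence

# Same structure as the YAML above (descriptions omitted), written as a
# Python dict so the sketch needs no YAML parser.
glossary = {
    "nodes": [
        {
            "name": "Classification",
            "terms": [
                {"name": "Sensitive"},
                {"name": "Confidential"},
                {"name": "HighlyConfidential"},
            ],
        },
        {
            "name": "PersonalInformation",
            "terms": [{"name": "Email"}, {"name": "Address"}, {"name": "Gender"}],
        },
    ]
}


def iter_term_urns(nodes: Sequence[dict], parents: tuple = ()) -> Iterator[str]:
    """Yield a glossary term URN for every term under the given nodes."""
    for node in nodes:
        path = (*parents, node["name"])
        for term in node.get("terms", []):
            # Assumption: an explicit "id" wins; otherwise the ID is the dotted path.
            term_id = term.get("id") or ".".join((*path, term["name"]))
            yield f"urn:li:glossaryTerm:{term_id}"
        # Recurse into nested nodes, if any.
        yield from iter_term_urns(node.get("nodes", []), path)


term_urns = list(iter_term_urns(glossary["nodes"]))
print(term_urns[0])
```

If your recipe sets enable_auto_id, or term names contain characters that are not URN-safe, the IDs are generated differently and this derivation does not apply.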

Step 2: Link Domains to Glossary Terms

You can use the DataHub Python SDK to assign a domain to a glossary term. A domain assignment is stored as the domains aspect on the entity, so the script emits that aspect for the term. This assumes your DataHub version accepts the domains aspect on glossary terms; older versions may reject the write. Here is an example of how to do this:

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import DomainsClass

# Initialize the DataHubGraph client (pass token="..." if authentication is enabled)
datahub_graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Define the URNs for the glossary term and the domain
glossary_term_urn = "urn:li:glossaryTerm:your-glossary-term-id"
domain_urn = "urn:li:domain:your-domain-id"

# Build a proposal that sets the domains aspect on the glossary term.
# Note: this overwrites any domains already assigned to the term.
mcp = MetadataChangeProposalWrapper(
    entityUrn=glossary_term_urn,
    aspect=DomainsClass(domains=[domain_urn]),
)

# Send the proposal to DataHub
datahub_graph.emit_mcp(mcp)

To apply the change in bulk, repeat the last two steps in a loop over a list of glossary term URNs.

Explanation

  • DataHubGraph Client: Initializes the DataHubGraph client to interact with the DataHub API.
  • URNs: Specifies the URNs for the glossary term and the domain.
  • Domains Aspect: DomainsClass holds the list of domain URNs assigned to an entity.
  • Emit: MetadataChangeProposalWrapper wraps the aspect for the glossary term, and emit_mcp sends it to DataHub.
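
For the bulk change asked about in the thread, another option is the GraphQL API's batchSetDomain mutation, which takes one domain and a list of resources. The sketch below uses only the Python standard library. The endpoint URL, the placeholder URNs, and the helper names build_payload and post_graphql are assumptions for illustration, and whether the server accepts glossary term URNs in this mutation may depend on your DataHub version.

```python
import json
import urllib.request
from typing import List, Optional

BATCH_SET_DOMAIN = """
mutation batchSetDomain($input: BatchSetDomainInput!) {
  batchSetDomain(input: $input)
}
"""


def build_payload(domain_urn: str, term_urns: List[str]) -> dict:
    """Build the GraphQL request body that assigns one domain to many entities."""
    return {
        "query": BATCH_SET_DOMAIN,
        "variables": {
            "input": {
                "domainUrn": domain_urn,
                "resources": [{"resourceUrn": urn} for urn in term_urns],
            }
        },
    }


def post_graphql(
    payload: dict,
    url: str = "http://localhost:8080/api/graphql",  # assumed GMS endpoint
    token: Optional[str] = None,
) -> dict:
    """POST the payload to the GraphQL endpoint and return the decoded response."""
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"), headers=headers, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_payload(
    "urn:li:domain:your-domain-id",
    [
        "urn:li:glossaryTerm:your-first-term-id",
        "urn:li:glossaryTerm:your-second-term-id",
    ],
)
# Uncomment to send the request to a running DataHub instance:
# print(post_graphql(payload))
```

Building the payload separately from sending it lets you inspect or log exactly which terms will be changed before anything is written.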

Would you like more detailed steps or additional examples?
