<@U06TM7M40RK> Is it possible to do a Bulk Change to add domains to a number of glossary terms?
DataHub's simple_add_dataset_domain transformer can add domains in bulk, but it applies only to datasets, not to glossary terms. For datasets, it can be configured to either replace existing domains or keep them.
However, if you specifically want to add domains to glossary terms, DataHub does not provide a direct transformer for this purpose. Instead, you can achieve this by using the DataHub API to programmatically link domains to glossary terms after ingesting them.
Here is a step-by-step approach:
- Ingest Glossary Terms: Use a YAML file to ingest your glossary terms.
- Link Domains to Glossary Terms: Use the DataHub API to link the ingested glossary terms to the relevant domains.
Step 1: Ingest Glossary Terms
Create a YAML file for your glossary terms and ingest it using the DataHub CLI. Here is an example structure for the YAML file:
version: 1
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: Classification
    description: A set of terms related to Data Classification
    terms:
      - name: Sensitive
        description: Sensitive Data
      - name: Confidential
        description: Confidential Data
      - name: HighlyConfidential
        description: Highly Confidential Data
  - name: PersonalInformation
    description: All terms related to personal information
    terms:
      - name: Email
        description: An individual's email address
      - name: Address
        description: A physical address
      - name: Gender
        description: The gender identity of the individual
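To link domains in bulk you first need the URN of every ingested term. The sketch below walks a Python mirror of the glossary file and derives one URN per term. It assumes the glossary source ran with auto-generated IDs disabled, so that a term's ID is its dotted path (for example Classification.Sensitive). That is an assumption about your ingestion settings, so confirm the resulting URNs against a term page in the DataHub UI. GLOSSARY and iter_term_urns are hypothetical names used only for illustration.

```python
from typing import Dict, Iterator, List, Tuple

# Python mirror of the glossary YAML (descriptions omitted for brevity).
GLOSSARY: Dict = {
    "nodes": [
        {
            "name": "Classification",
            "terms": [
                {"name": "Sensitive"},
                {"name": "Confidential"},
                {"name": "HighlyConfidential"},
            ],
        },
        {
            "name": "PersonalInformation",
            "terms": [{"name": "Email"}, {"name": "Address"}, {"name": "Gender"}],
        },
    ]
}


def iter_term_urns(nodes: List[Dict], parents: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield one glossary term URN per term, descending into nested nodes."""
    for node in nodes:
        path = (*parents, node["name"])
        for term in node.get("terms", []):
            # ASSUMPTION: without an explicit "id", the term ID is the dotted path.
            term_id = term.get("id") or ".".join((*path, term["name"]))
            yield f"urn:li:glossaryTerm:{term_id}"
        yield from iter_term_urns(node.get("nodes", []), path)


TERM_URNS = list(iter_term_urns(GLOSSARY["nodes"]))
```

TERM_URNS can then be fed to whichever linking call you use in Step 2.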
Step 2: Link Domains to Glossary Terms
You can use the DataHub Python SDK to link a domain to a glossary term by writing the term's domains aspect. Here is an example, assuming your DataHub version registers the domains aspect on glossary terms:
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import DomainsClass

# Initialize the DataHubGraph client
datahub_graph = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))

# Define the URNs for the glossary term and the domain
glossary_term_urn = "urn:li:glossaryTerm:your-glossary-term-id"
domain_urn = "urn:li:domain:your-domain-id"

# Build a proposal that sets the domains aspect on the glossary term.
# Note: this overwrites any domains already set on the term.
mcp = MetadataChangeProposalWrapper(
    entityUrn=glossary_term_urn,
    aspect=DomainsClass(domains=[domain_urn]),
)

# Emit the proposal to DataHub
datahub_graph.emit(mcp)
Explanation
- DataHubGraph Client: Initializes the DataHubGraph client to interact with the DataHub API.
- URNs: Specifies the URNs for the glossary term and the domain.
- Domains Aspect: DomainsClass holds the list of domain URNs to associate with the glossary term.
- Emit: Wraps the aspect in a MetadataChangeProposalWrapper and sends it with the emit method, which stores the link as an aspect on the glossary term.
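For the bulk case in the original question, one option is a single GraphQL call instead of one call per term. The sketch below only builds the request body for a batchSetDomain mutation. The mutation name and input shape (domainUrn plus a list of resourceUrn entries) reflect my understanding of DataHub's GraphQL schema, and whether your deployment accepts glossary terms as resources is an assumption to verify in its GraphiQL explorer. The function names, endpoint URL, and token are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List

# ASSUMPTION: the deployment exposes this mutation; check your GraphQL schema.
BATCH_SET_DOMAIN = """
mutation batchSetDomain($input: BatchSetDomainInput!) {
  batchSetDomain(input: $input)
}
"""


def build_batch_set_domain_request(domain_urn: str, term_urns: List[str]) -> Dict:
    """Build a GraphQL request body that assigns one domain to many entities."""
    return {
        "query": BATCH_SET_DOMAIN,
        "variables": {
            "input": {
                "domainUrn": domain_urn,
                "resources": [{"resourceUrn": urn} for urn in term_urns],
            }
        },
    }


def send(graphql_url: str, token: str, body: Dict) -> Dict:
    """POST the request body to the GraphQL endpoint and return the parsed JSON."""
    request = urllib.request.Request(
        graphql_url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_batch_set_domain_request(
    "urn:li:domain:your-domain-id",
    [
        "urn:li:glossaryTerm:your-first-term-id",
        "urn:li:glossaryTerm:your-second-term-id",
    ],
)
```

To apply the change you would call something like send("http://localhost:8080/api/graphql", "<your-access-token>", body). Both values are placeholders for your own endpoint and token.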
Would you like more detailed steps or additional examples?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/transformer/dataset_transformer.md|DataHub Transformer Documentation>
- <Dataset | DataHub API Documentation>