hey <@U06TM7M40RK>
I’m trying to keep a record of my glossary terms in a YAML file and run a business glossary ingestion on top of it.
My question: Is there any way to register related entities in the glossaryTerm field inside the yaml?
Also, do you know if there is a schema of all the accepted values for glossary yaml files?
To register related entities in the glossaryTerm field inside the YAML file for a business glossary ingestion, you can use the inherits and contains fields. These fields allow you to define hierarchical relationships between glossary terms.
Example YAML Structure for Glossary Terms with Relationships
version: 1
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: Classification
    description: A set of terms related to Data Classification
    knowledge_links:
      - label: Wiki link for classification
        url: "https://en.wikipedia.org/wiki/Classification"
    terms:
      - name: Sensitive
        description: Sensitive Data
        custom_properties:
          is_confidential: false
      - name: Confidential
        description: Confidential Data
        custom_properties:
          is_confidential: true
      - name: HighlyConfidential
        description: Highly Confidential Data
        custom_properties:
          is_confidential: true
        domain: Marketing
  - name: PersonalInformation
    description: All terms related to personal information
    owners:
      users:
        - mjames
    terms:
      - name: Email
        description: An individual's email address
        inherits:
          - Classification.Confidential
        owners:
          groups:
            - Trust and Safety
      - name: Address
        description: A physical address
      - name: Gender
        description: The gender identity of the individual
        inherits:
          - Classification.Sensitive
  - name: Shipping
    description: Provides terms related to the shipping domain
    owners:
      users:
        - njones
      groups:
        - logistics
    terms:
      # contains is a term-level field, so it is nested under a term here.
      # The term name and description are placeholders, not part of the original snippet.
      - name: FullAddress
        description: A full shipping address
        contains:
          - Shipping.ZipCode
          - Shipping.CountryCode
          - Shipping.StreetAddress
Schema of Accepted Values for Glossary YAML Files
The schema for the business glossary YAML file includes the following top-level keys:
- version: The version of the business glossary file config.
- source: The source format of the terms.
- owners: Contains two nested fields, users and groups.
- url: External URL pointing to where the glossary is defined externally, if applicable.
- nodes: List of child GlossaryNode objects.
- terms: List of GlossaryTerm objects.
Each GlossaryTerm can have the following fields:
- name: Name of the term.
- description: Description of the term.
- owners: Contains two nested fields, users and groups.
- term_source: One of EXTERNAL or INTERNAL.
- source_ref: If external, the name of the source the glossary term is coming from.
- source_url: If external, the URL of the source definition.
- inherits: List of GlossaryTerm that this term inherits from.
- contains: List of GlossaryTerm that this term contains.
- custom_properties: Map of key/value pairs of arbitrary custom properties.
- knowledge_links: List of KnowledgeCard related to this term.
- domain: Domain name or domain URN.
For more detailed information, you can refer to the DataHub Business Glossary file format documentation.
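As a rough illustration of how the field list above could be used, here is a minimal sketch that walks an already-parsed glossary (for example, the result of yaml.safe_load on the file) and reports term keys that are not in that list. The helper name find_unknown_term_fields and the sample data are hypothetical, and the allowed set is copied from this thread rather than from the authoritative schema, so treat the official documentation as the final word.

```python
from typing import Any, Dict, List

# Term-level fields as listed in this thread (assumption: not the authoritative schema).
TERM_FIELDS = {
    "name", "description", "owners", "term_source", "source_ref", "source_url",
    "inherits", "contains", "custom_properties", "knowledge_links", "domain",
}


def find_unknown_term_fields(glossary: Dict[str, Any]) -> List[str]:
    """Return 'path: field' strings for term keys that are not in TERM_FIELDS."""
    problems: List[str] = []

    def check_terms(terms: List[Dict[str, Any]], prefix: str) -> None:
        for term in terms:
            path = f"{prefix}{term.get('name', '<unnamed>')}"
            for key in sorted(set(term) - TERM_FIELDS):
                problems.append(f"{path}: {key}")

    def walk(node: Dict[str, Any], prefix: str) -> None:
        # Check terms at this level, then recurse into child nodes.
        check_terms(node.get("terms") or [], prefix)
        for child in node.get("nodes") or []:
            walk(child, f"{prefix}{child.get('name', '<unnamed>')}.")

    walk(glossary, "")
    return problems


# Hypothetical parsed glossary; "related_entities" is deliberately not a listed field.
glossary = {
    "version": 1,
    "nodes": [
        {
            "name": "PersonalInformation",
            "terms": [
                {"name": "Email", "description": "An individual's email address",
                 "inherits": ["Classification.Confidential"]},
                {"name": "Address", "description": "A physical address",
                 "related_entities": ["some-dataset"]},
            ],
        }
    ],
}
print(find_unknown_term_fields(glossary))
```

A check like this only catches unexpected keys on terms. It does not validate value types or node-level fields.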
Would you like more detailed steps or additional examples?
Sources:
- Business Glossary | DataHub Glossary File Format
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK>
uhm but I’m looking to add data assets to the related entities, not terms to the Related Terms,
and to do that from the glossary YAML file (if possible)
DataHub's YAML schema for business glossary ingestion does not currently support adding data assets as related entities to glossary terms. It only supports defining relationships between glossary terms themselves, using inherits and contains.
However, you can achieve this by first ingesting your glossary terms and then programmatically linking the glossary terms to data assets using the DataHub API. This involves two steps:
- Ingest Glossary Terms: Use the YAML file to ingest your glossary terms.
- Link Data Assets to Glossary Terms: Use the DataHub API to link the ingested glossary terms to the relevant data assets.
Step 1: Ingest Glossary Terms
Create a YAML file for your glossary terms as shown in the previous example and ingest it using the DataHub CLI.
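If you prefer to stay in Python rather than use the CLI, the same ingestion can be sketched with the ingestion Pipeline class. This is a minimal sketch, assuming the datahub-business-glossary source and a datahub-rest sink; the helper names, file path, and server URL are placeholders.

```python
from typing import Any, Dict


def build_glossary_recipe(glossary_file: str, gms_server: str) -> Dict[str, Any]:
    """Build an ingestion recipe for the business glossary source."""
    return {
        "source": {
            "type": "datahub-business-glossary",
            "config": {"file": glossary_file},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": gms_server},
        },
    }


def run_glossary_ingestion(glossary_file: str, gms_server: str) -> None:
    """Run the recipe; requires the DataHub package and a reachable server."""
    # Imported here so the recipe can be built without the DataHub package installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(build_glossary_recipe(glossary_file, gms_server))
    pipeline.run()
    pipeline.raise_from_status()


# Example call (placeholder path and URL):
# run_glossary_ingestion("business_glossary.yml", "http://localhost:8080")
```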
Step 2: Link Data Assets to Glossary Terms
You can use the DataHub Python SDK to link data assets to glossary terms. Here is an example of how to do this:
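from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

# Initialize the DataHubGraph client (adjust the server URL for your deployment)
datahub_graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Define the URNs for the glossary term and the data asset
glossary_term_urn = "urn:li:glossaryTerm:your-glossary-term-id"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your-dataset-id,PROD)"

# Read the terms already attached to the dataset so they are not overwritten
terms = datahub_graph.get_aspect(entity_urn=dataset_urn, aspect_type=GlossaryTermsClass)
if terms is None:
    terms = GlossaryTermsClass(
        terms=[],
        auditStamp=AuditStampClass(time=0, actor="urn:li:corpuser:ingestion"),
    )

# Attach the term by writing the dataset's glossaryTerms aspect
if glossary_term_urn not in [t.urn for t in terms.terms]:
    terms.terms.append(GlossaryTermAssociationClass(urn=glossary_term_urn))
    datahub_graph.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=terms))
Explanation
- DataHubGraph Client: Initializes the DataHubGraph client to interact with the DataHub API.
- URNs: Specifies the URNs for the glossary term and the data asset.
- Existing Terms: Reads the dataset's current glossaryTerms aspect so that terms already attached are preserved.
- Write the Association: Appends the term to that aspect and emits it. This creates the TermedWith relationship between the dataset and the glossary term, so the dataset is listed among the term's related entities.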
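Since the goal is to keep the term-to-asset links on record next to the glossary file, one option is to store them as a simple mapping and expand it into URN pairs before calling the SDK. The sketch below is only an illustration: the mapping layout and the helper names are hypothetical, and the URN formats follow the ones shown in the example above. The SDK also ships make_term_urn and make_dataset_urn in datahub.emitter.mce_builder, which could replace the two string helpers.

```python
from typing import Dict, List, Tuple


def term_urn(term_id: str) -> str:
    """Format a glossary term URN from its id."""
    return f"urn:li:glossaryTerm:{term_id}"


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Format a dataset URN from platform, dataset name, and environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def plan_links(mapping: Dict[str, List[Tuple[str, str]]]) -> List[Tuple[str, str]]:
    """Expand {term_id: [(platform, dataset_name), ...]} into (dataset_urn, term_urn) pairs."""
    return [
        (dataset_urn(platform, name), term_urn(term_id))
        for term_id, datasets in mapping.items()
        for platform, name in datasets
    ]


# Hypothetical mapping kept under version control alongside the glossary YAML.
links = plan_links({"your-glossary-term-id": [("hive", "your-dataset-id")]})
for ds, term in links:
    print(ds, "->", term)
```

Each pair can then be passed through the same linking step shown above, one dataset at a time.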
Sources:
- https://datahubspace.slack.com/archives/CV2KB471C/p1722458910.153669
- https://datahubspace.slack.com/archives/CV2KB471C/p1722483892.389509