Adding Related Entities to Glossary Terms in a YAML File for Business Glossary Ingestion

Original Slack Thread

hey <@U06TM7M40RK>
I’m trying to keep a record of my glossary terms in a YAML file and run a business glossary ingestion on top of it.
My question: Is there any way to register related entities in the glossaryTerm field inside the yaml?
Also, do you know if there is a schema of all the accepted values for glossary yaml files?

To register related entities in the glossaryTerm field inside the YAML file for a business glossary ingestion, you can use the inherits and contains fields. These fields define relationships between glossary terms: inherits records an "Is A" relationship and contains records a "Has A" relationship.

Example YAML Structure for Glossary Terms with Relationships

version: 1
source: DataHub
owners:
  users:
    - mjames
url: "https://github.com/datahub-project/datahub/"
nodes:
  - name: Classification
    description: A set of terms related to Data Classification
    knowledge_links:
      - label: Wiki link for classification
        url: "https://en.wikipedia.org/wiki/Classification"
    terms:
      - name: Sensitive
        description: Sensitive Data
        custom_properties:
          is_confidential: false
      - name: Confidential
        description: Confidential Data
        custom_properties:
          is_confidential: true
      - name: HighlyConfidential
        description: Highly Confidential Data
        custom_properties:
          is_confidential: true
        domain: Marketing
  - name: PersonalInformation
    description: All terms related to personal information
    owners:
      users:
        - mjames
    terms:
      - name: Email
        description: An individual's email address
        inherits:
          - Classification.Confidential
        owners:
          groups:
            - Trust and Safety
      - name: Address
        description: A physical address
      - name: Gender
        description: The gender identity of the individual
        inherits:
          - Classification.Sensitive
      - name: Shipping
        description: Provides terms related to the shipping domain
        owners:
          users:
            - njones
          groups:
            - logistics
        contains:
          - Shipping.ZipCode
          - Shipping.CountryCode
          - Shipping.StreetAddress
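
Because inherits and contains point at other terms by dotted path (Node.Term), a typo in a path is easy to miss. The following is a minimal sketch, not part of DataHub, that checks whether every such reference resolves to a term defined in the same file. The function names are hypothetical, and the input is assumed to be the nodes list as it would look after the YAML has been loaded into Python dicts.

```python
from typing import Any, Dict, Iterator, List, Tuple

Term = Dict[str, Any]


def iter_terms(nodes: List[Dict[str, Any]], prefix: str = "") -> Iterator[Tuple[str, Term]]:
    """Yield (dotted_path, term) for every term under the given nodes."""
    for node in nodes:
        node_path = prefix + node["name"]
        for term in node.get("terms", []):
            yield f"{node_path}.{term['name']}", term
        # Nodes can nest child nodes; recurse with the extended path.
        yield from iter_terms(node.get("nodes", []), node_path + ".")


def unresolved_references(nodes: List[Dict[str, Any]]) -> List[Tuple[str, str, str]]:
    """Return (term_path, field, reference) for references not defined in the file."""
    defined = {path for path, _ in iter_terms(nodes)}
    missing = []
    for path, term in iter_terms(nodes):
        for field in ("inherits", "contains"):
            for ref in term.get(field, []):
                # Assumption: full URNs point at terms defined elsewhere, so skip them.
                if not ref.startswith("urn:li:") and ref not in defined:
                    missing.append((path, field, ref))
    return missing


# Trimmed copy of the example glossary, written as the dicts a YAML loader would produce.
glossary_nodes = [
    {
        "name": "Classification",
        "terms": [{"name": "Sensitive"}, {"name": "Confidential"}],
    },
    {
        "name": "PersonalInformation",
        "terms": [
            {"name": "Email", "inherits": ["Classification.Confidential"]},
            {"name": "Gender", "inherits": ["Classification.Sensitive"]},
            {"name": "Shipping", "contains": ["Shipping.ZipCode"]},
        ],
    },
]

missing = unresolved_references(glossary_nodes)
for term_path, field, ref in missing:
    print(f"{term_path}: {field} -> {ref} is not defined in this file")
```

In the example above, the Shipping.* references would be reported unless those terms are defined elsewhere in the file.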

Schema of Accepted Values for Glossary YAML Files

The schema for the business glossary YAML file includes the following top-level keys:

  • version: The version of the business glossary file config.
  • source: The source format of the terms.
  • owners: Contains two nested fields, users and groups.
  • url: External URL pointing to where the glossary is defined externally, if applicable.
  • nodes: List of child GlossaryNode objects.
  • terms: List of GlossaryTerm objects.

Each GlossaryTerm can have the following fields:

  • name: Name of the term.
  • description: Description of the term.
  • owners: Contains two nested fields, users and groups.
  • term_source: One of EXTERNAL or INTERNAL.
  • source_ref: If external, the name of the source the glossary term is coming from.
  • source_url: If external, the URL of the source definition.
  • inherits: List of GlossaryTerm that this term inherits from.
  • contains: List of GlossaryTerm that this term contains.
  • custom_properties: Map of key/value pairs of arbitrary custom properties.
  • knowledge_links: List of KnowledgeCard related to this term.
  • domain: Domain name or domain URN.
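
As a quick way to catch keys the ingestion will not understand, the field list above can be turned into a small check. This is only a sketch under the assumption that the list above is complete for your DataHub version; check_term and the related_entities key are hypothetical names used for illustration.

```python
from typing import Any, Dict, List

# Field names copied from the list above; treat this as a snapshot, not a full schema.
TERM_FIELDS = {
    "name", "description", "owners", "term_source", "source_ref", "source_url",
    "inherits", "contains", "custom_properties", "knowledge_links", "domain",
}
TERM_SOURCES = {"EXTERNAL", "INTERNAL"}


def check_term(term: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in one term definition."""
    problems = [f"unknown field: {key}" for key in sorted(term) if key not in TERM_FIELDS]
    source = term.get("term_source")
    if source is not None and source not in TERM_SOURCES:
        problems.append(f"term_source must be EXTERNAL or INTERNAL, got {source!r}")
    return problems


# "related_entities" is a made-up key, standing in for any field the schema does not list.
problems = check_term(
    {"name": "Email", "term_source": "internal", "related_entities": ["some-dataset"]}
)
for problem in problems:
    print(problem)
```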

For more detail, you can refer to the Business Glossary file format documentation on the DataHub docs site.

Would you like more detailed steps or additional examples?

<@U06TM7M40RK>
Hmm, but I’m looking to add data assets as related entities, not terms as related terms,

and to do that from the glossary YAML file (if possible).

The business glossary YAML schema does not currently support adding data assets as related entities of a glossary term. It only supports relationships between glossary terms themselves, through inherits and contains.

However, you can achieve this by first ingesting your glossary terms and then programmatically linking the glossary terms to data assets using the DataHub API. This involves two steps:

  1. Ingest Glossary Terms: Use the YAML file to ingest your glossary terms.
  2. Link Data Assets to Glossary Terms: Use the DataHub API to link the ingested glossary terms to the relevant data assets.

Step 1: Ingest Glossary Terms

Create a YAML file for your glossary terms as shown in the previous example and ingest it using the DataHub CLI.
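
For reference, here is a minimal sketch of that ingestion driven from Python rather than a recipe file. It assumes the acryl-datahub package is installed, a DataHub instance is reachable at http://localhost:8080, and the glossary is saved as business_glossary.yml; the file name and both function names are placeholders.

```python
from typing import Any, Dict


def build_glossary_recipe(glossary_file: str, server: str) -> Dict[str, Any]:
    """Build an ingestion recipe that reads a glossary YAML file and writes to DataHub."""
    return {
        "source": {
            "type": "datahub-business-glossary",
            "config": {"file": glossary_file},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": server},
        },
    }


def run_glossary_ingestion(recipe: Dict[str, Any]) -> None:
    """Run the recipe; needs acryl-datahub installed and a reachable server."""
    # Imported lazily so the recipe can be built and inspected without the SDK.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


recipe = build_glossary_recipe("business_glossary.yml", "http://localhost:8080")
```

Calling run_glossary_ingestion(recipe) performs the ingestion. The same structure saved as a YAML recipe file can be run with datahub ingest -c followed by the recipe path.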

Step 2: Link Data Assets to Glossary Terms

You can use the DataHub Python SDK to link data assets to glossary terms. Here is an example of how to do this:

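import time

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DataHubGraphConfig
from datahub.metadata.schema_classes import (
    AuditStampClass,
    GlossaryTermAssociationClass,
    GlossaryTermsClass,
)

# Initialize the DataHubGraph client
datahub_graph = DataHubGraph(DataHubGraphConfig(server="http://localhost:8080"))

# Define the URNs for the glossary term and the data asset
glossary_term_urn = "urn:li:glossaryTerm:your-glossary-term-id"
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,your-dataset-id,PROD)"

# Fetch the dataset's current glossaryTerms aspect so existing terms are preserved
current_terms = datahub_graph.get_aspect(entity_urn=dataset_urn, aspect_type=GlossaryTermsClass)
if current_terms is None:
    current_terms = GlossaryTermsClass(
        terms=[],
        # Placeholder actor; replace with the user or service account making the change
        auditStamp=AuditStampClass(time=int(time.time() * 1000), actor="urn:li:corpuser:datahub"),
    )

# Attach the term to the dataset unless it is already present, then write the aspect back
if glossary_term_urn not in [association.urn for association in current_terms.terms]:
    current_terms.terms.append(GlossaryTermAssociationClass(urn=glossary_term_urn))
    datahub_graph.emit(MetadataChangeProposalWrapper(entityUrn=dataset_urn, aspect=current_terms))

Explanation

  • DataHubGraph Client: Initializes the DataHubGraph client to interact with the DataHub API.
  • URNs: Specifies the URNs for the glossary term and the data asset.
  • Glossary Terms Aspect: The link is stored on the data asset, in its glossaryTerms aspect. DataHub derives the TermedWith relationship from that aspect, so the dataset should then be listed among the term's related entities.
  • Emit: Reads the current aspect, appends the term if it is missing, and writes the aspect back with the emit method.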
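
The placeholder URNs in the example need real values. The sketch below shows one way to build them. It assumes the glossary was ingested with the default path-based IDs (enable_auto_id left off) and that node and term names contain no spaces or special characters; if either assumption fails, copy the URN from the term's page in DataHub instead. Both helper names are hypothetical.

```python
def term_urn_from_path(*path: str) -> str:
    """Build a term URN from its node/term path, e.g. ("Classification", "Confidential")."""
    # Assumption: IDs are the dotted path; names with spaces may be rewritten on ingestion.
    if not path or any(not part or " " in part for part in path):
        raise ValueError("expected non-empty path parts without spaces")
    return "urn:li:glossaryTerm:" + ".".join(path)


def dataset_urn_from_parts(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in the same shape as the example above."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


term_urn = term_urn_from_path("Classification", "Confidential")
asset_urn = dataset_urn_from_parts("hive", "your-dataset-id")
print(term_urn)
print(asset_urn)
```

The SDK also ships make_term_urn and make_dataset_urn in datahub.emitter.mce_builder, which are preferable when the package is already installed.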
