Troubleshooting Business Glossary Ingestion and Dataset Association

Original Slack Thread

Hi Team,
I want to add terms and term groups to Snowflake objects (tables/views/columns) using a business glossary. The business glossary (terms and term groups) is created in DataHub as expected, but the terms are not added to the datasets.
Below are my ingestion steps:

  1. Created a business glossary YAML file.
  2. Created a script with a source, sink, and transformer that creates the glossary from step 1 and adds terms to datasets using the transformer.
  3. I am using Airflow to execute step 2.

I will add more details in the comments.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

  1. I am using the CLI for ingestion.
  2. 0.12.0
  3. Snowflake

Hi <@U01GZEETMEZ>, can you review this and let me know what the issue is with this script? Thanks!

Scripts used:

  1. Business glossary.yaml
```
source: DataHub
owners:
  users:
    - fency
url: "https://github.pie.apple.com/storeAnalytics/datahub"
nodes:
  - name: Classification
    description: A set of terms related to Data Classification
    knowledge_links:
      - label: Wiki link for classification
        url: "https://en.wikipedia.org/wiki/Classification"
    terms:
      - name: Sensitive
        description: Sensitive Data
        custom_properties:
          is_confidential: false
      - name: Confidential
        description: Confidential Data
        custom_properties:
          is_confidential: true
      - name: HighlyConfidential
        description: Highly Confidential Data
        custom_properties:
          is_confidential: true
        domain: "urn:li:domain:rsa"
  - name: PII
    description: All terms related to personal information
    terms:
      - name: email
        description: An individual's email address
        inherits:
          - Classification.Confidential
      - name: address
        description: A physical address
      - name: gender
        description: The gender identity of the individual
        inherits:
          - Classification.Sensitive
      - name: ssn
        description: social security number
        inherits:
          - Classification.Sensitive
        domain: "urn:li:domain:rsa"
```
  2. Script used:
```
def snowflake_business_glossary():
    """Business glossary ingestion for Snowflake entities."""
    from datahub.configuration.config_loader import load_config_file
    from datahub.ingestion.run.pipeline import Pipeline
    # `logger` and `cf` come from elsewhere in the poster's module (not shown in the thread).
    logger.info("Creating business glossary")
    
    pipeline = Pipeline.create(
        # This configuration is analogous to a recipe configuration.
        {
            "source": {
                "type": "datahub-business-glossary",
                "config": {
                    "file": "/business_glossary_recipe.yml",
                    "enable_auto_id" : False
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {
                    "server": f"{cf.server}",
                    "token": f"{cf.token}"
                },
            },
            "transformers": [
                {
                    "type": "simple_add_dataset_terms",
                    "config": {
                        "semantics": "PATCH",
                        # Both URNs go in one list: a repeated "term_urns" key
                        # in a dict literal keeps only the last value.
                        "term_urns": [
                            "urn:li:glossaryTerm:PII.email",
                            "urn:li:glossaryTerm:PII.ssn",
                        ],
                    },
                }
            ],
        }
    )
    pipeline.run()
    pipeline.pretty_print_summary()
    pipeline.raise_from_status()
```
This module ran without any errors, but no terms were added to the datasets. Airflow log:
```
[2024-04-12, 10:29:51 PDT] {logging_mixin.py:154} INFO - {'cli_version': '0.12.0.2',
 'cli_entry_location': '/usr/local/airflow/.local/lib/python3.8/site-packages/datahub/__init__.py',
 'py_version': '3.8.17 (default, Aug 10 2023, 12:50:17) \n[GCC 8.5.0 20210514 (Red Hat 8.5.0-20)]',
 'py_exec_path': '/usr/bin/python3.8',
 'os_details': 'Linux-5.10.210-201.852.amzn2.x86_64-x86_64-with-glibc2.2.5',
 'peak_memory_usage': '252.67 MB',
 'mem_info': '252.67 MB',
 'peak_disk_usage': '30.66 GB',
 'disk_info': {'total': '53.67 GB', 'used': '30.66 GB', 'free': '23.01 GB'}}
[2024-04-12, 10:29:51 PDT] {logging_mixin.py:154} INFO - Source (datahub-business-glossary) report:
[2024-04-12, 10:29:51 PDT] {logging_mixin.py:154} INFO - {'events_produced': 21,
 'events_produced_per_sec': 46,
 'entities': {'glossaryNode': ['urn:li:glossaryNode:Classification', 'urn:li:glossaryNode:PII'],
              'glossaryTerm': ['urn:li:glossaryTerm:Classification.Sensitive',
                               'urn:li:glossaryTerm:Classification.Confidential',
                               'urn:li:glossaryTerm:Classification.HighlyConfidential',
                               'urn:li:glossaryTerm:PII.email',
                               'urn:li:glossaryTerm:PII.address',
                               'urn:li:glossaryTerm:PII.gender',
                               'urn:li:glossaryTerm:PII.ssn']},
 'aspects': {'glossaryNode': {'glossaryNodeInfo': 2, 'ownership': 2, 'institutionalMemory': 1, 'status': 2},
             'glossaryTerm': {'glossaryTermInfo': 7, 'ownership': 7, 'domains': 2, 'glossaryRelatedTerms': 3, 'status': 7}},
 'warnings': {},
 'failures': {},
 'start_time': '2024-04-12 17:29:50.610827 (now)',
 'running_time': '0.45 seconds'}
[2024-04-12, 10:29:51 PDT] {logging_mixin.py:154} INFO - Sink (datahub-rest) report:
[2024-04-12, 10:29:51 PDT] {logging_mixin.py:154} INFO - {'total_records_written': 21,
 'records_written_per_second': 33,
 'warnings': [],
 'failures': [],
 'start_time': '2024-04-12 17:29:50.442028 (now)',
 'current_time': '2024-04-12 17:29:51.064747 (now)',
 'total_duration_in_seconds': 0.62,
 'gms_version': 'null',
 'pending_requests': 0}
```
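
A possible reading of the source report above: the `datahub-business-glossary` source emitted only `glossaryNode` and `glossaryTerm` entities, and no datasets. Transformers act on the metadata emitted by the pipeline's own source, so `simple_add_dataset_terms` likely had no dataset to attach terms to. A minimal sketch of one way to split the work, assuming the terms are attached in a second pipeline whose source is `snowflake`. The helper names `glossary_recipe` and `dataset_terms_recipe` are hypothetical, and the Snowflake connection settings are left to the caller rather than guessed here:

```python
from typing import Any, Dict, List


def glossary_recipe(glossary_file: str, sink: Dict[str, Any]) -> Dict[str, Any]:
    """Recipe that only creates glossary nodes and terms (no transformer)."""
    return {
        "source": {
            "type": "datahub-business-glossary",
            "config": {"file": glossary_file, "enable_auto_id": False},
        },
        "sink": sink,
    }


def dataset_terms_recipe(
    snowflake_config: Dict[str, Any],
    term_urns: List[str],
    sink: Dict[str, Any],
) -> Dict[str, Any]:
    """Recipe whose source emits datasets, so the transformer has something to tag.

    `snowflake_config` is a placeholder for the poster's own Snowflake
    source settings, which are not shown in the thread.
    """
    return {
        "source": {"type": "snowflake", "config": snowflake_config},
        "transformers": [
            {
                "type": "simple_add_dataset_terms",
                "config": {"semantics": "PATCH", "term_urns": list(term_urns)},
            }
        ],
        "sink": sink,
    }
```

Each returned dict would be passed to `Pipeline.create(...)` in the same way as in the script above, running the glossary pipeline first so that the term URNs exist before the dataset pipeline references them.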