Managing Sibling Relationships in DataHub between Hive, Trino, and dbt Datasets

Original Slack Thread

DataHub version: v0.13.3
The Hive, Trino, and dbt sibling behavior is very confusing. Are there any docs around it?

Problem:
• ingested a few datasets using Hive as a source
• ingested the same datasets using Trino as a source, against the same Hive catalog
• now the datasets appear as hive&trino. Running the Hive ingestion no longer refreshes the dataset; the CLI says the ingestion completed, but nothing is refreshed
• after deleting the dataset from Trino, the Hive refresh works fine
Why would this happen? If they are siblings, ingestion should work regardless of source.

<@U06TM7M40RK> Can you help with the questions in this thread?

To address the confusion around sibling relationships between Hive, Trino, and dbt datasets in DataHub, let's break down how they work and provide some documentation references.

Understanding Sibling Relationships in DataHub

  1. What is a Sibling Aspect?

    • A sibling aspect in DataHub indicates a relationship between datasets that are considered equivalent or related across different platforms. This relationship is bidirectional, meaning each dataset in the sibling set points to the others as siblings. This is useful for representing the same dataset that might exist in different environments or platforms.
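    • To see what the aspect contains for a given dataset, you can read it back with the DataHub Python SDK. A minimal sketch, assuming a local GMS at http://localhost:8080 and a placeholder Hive URN:
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import SiblingsClass
    
    # Connect to the DataHub GMS server
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
    
    # Fetch the siblings aspect; returns None if the dataset has no siblings
    siblings = graph.get_aspect(
        entity_urn="urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)",
        aspect_type=SiblingsClass,
    )
    if siblings:
        print(siblings.primary, siblings.siblings)
    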
  2. Creating Sibling Relationships Programmatically

    • You can create sibling relationships using the DataHub API. Here is an example using the DataHub Python client:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import SiblingsClass
    
    # Create an emitter to DataHub over REST
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
    
    # Test the connection
    emitter.test_connection()
    
    # The siblings aspect lists the related dataset URNs; primary=False marks
    # this side of the relationship as the non-primary sibling.
    new_siblings = SiblingsClass(
        siblings=[
            "urn:li:dataset:(urn:li:dataPlatform:dbt,MY_DATABASE.my_schema.my_table,PROD)",
        ],
        primary=False,
    )
    
    # Construct a MetadataChangeProposalWrapper that attaches the aspect
    # to the Snowflake dataset.
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType="UPSERT",
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:snowflake,my_database.my_schema.my_table,PROD)",
        aspect=new_siblings,
    )
    
    # Emit metadata! This is a blocking call.
    emitter.emit(metadata_event)
    
    # Because the relationship is bidirectional, also emit a corresponding
    # SiblingsClass aspect on the dbt dataset (typically with primary=True).
    
  3. Creating Sibling Relationships During Ingestion

    • Siblings are typically created during ingestion when the ingestion framework identifies datasets that should be considered equivalent across different platforms. This can be configured in the ingestion recipes. For example, when ingesting dbt and Snowflake, you can configure the target_platform in the dbt ingestion recipe to ensure that the datasets are recognized as siblings.
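    • Below is a minimal sketch of running such a dbt recipe programmatically with the DataHub Python SDK; the manifest/catalog paths, server URL, and target_platform value are placeholders for your environment. The same configuration could live in a YAML recipe file run via the datahub CLI:
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "/path/to/manifest.json",
                    "catalog_path": "/path/to/catalog.json",
                    # The warehouse platform the dbt models materialize in
                    # (e.g. trino or hive); this drives the sibling linkage.
                    "target_platform": "trino",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
    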
  4. Making Multiple Datasets Siblings During Ingestion

    • To make multiple datasets from different platforms siblings of each other during ingestion, you can use transformers or custom scripts to emit the sibling relationships. Here is a conceptual example of how you might set this up:
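    A conceptual sketch only, reusing the emitter pattern from step 2; the URNs are placeholders, and which dataset is marked primary (conventionally the dbt model) may differ in your setup:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import SiblingsClass
    
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    
    # Placeholder URNs for the same logical table on three platforms
    urns = {
        "dbt": "urn:li:dataset:(urn:li:dataPlatform:dbt,my_db.my_table,PROD)",
        "hive": "urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)",
        "trino": "urn:li:dataset:(urn:li:dataPlatform:trino,my_db.my_table,PROD)",
    }
    
    # Emit a siblings aspect on each dataset pointing at the other two,
    # marking only the dbt dataset as primary
    for platform, urn in urns.items():
        aspect = SiblingsClass(
            siblings=[u for p, u in urns.items() if p != platform],
            primary=(platform == "dbt"),
        )
        emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=aspect))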