Managing Sibling Relationships in DataHub between Hive, Trino, and dbt Datasets

Original Slack Thread

DataHub version: v0.13.3
The Hive, Trino, and dbt sibling behavior is very confusing. Are there any docs around it?

Problem:
• ingested a few datasets using Hive as a source
• ingested the same datasets using Trino as a source, against the same Hive catalog
• now the datasets appear as hive&trino. Running the Hive ingestion no longer refreshes the dataset; the CLI says the ingestion completed, but nothing is refreshed
• after deleting the dataset from Trino, the Hive refresh works fine
Why would this happen? If they are siblings, ingestion should work regardless of source.

<@U06TM7M40RK> Can you help with the questions in this thread?

To address the confusion around sibling relationships between Hive, Trino, and dbt datasets in DataHub, let's break down how they work and provide some documentation references.

Understanding Sibling Relationships in DataHub

  1. What is a Sibling Aspect?

    • A sibling aspect in DataHub indicates a relationship between datasets that are considered equivalent or related across different platforms. This relationship is bidirectional, meaning each dataset in the sibling set points to the others as siblings. This is useful for representing the same dataset that might exist in different environments or platforms.
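    • To see what the aspect contains for a given dataset, you can read it back with the DataHub Python SDK. A minimal sketch, assuming a local GMS at http://localhost:8080 and a placeholder Hive URN:
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
    from datahub.metadata.schema_classes import SiblingsClass
    
    # Connect to the DataHub GMS server
    graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))
    
    # Fetch the siblings aspect; returns None if the dataset has no siblings
    siblings = graph.get_aspect(
        entity_urn="urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)",
        aspect_type=SiblingsClass,
    )
    if siblings:
        print(siblings.primary, siblings.siblings)
    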
  2. Creating Sibling Relationships Programmatically

    • You can create sibling relationships using the DataHub API. Here is an example using the DataHub Python client:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import SiblingsClass
    
    # Create an emitter to DataHub over REST
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080", extra_headers={})
    
    # Test the connection
    emitter.test_connection()
    
    # The siblings aspect lists the related dataset URNs; primary=False marks
    # this side of the relationship as the non-primary sibling.
    new_siblings = SiblingsClass(
        siblings=[
            "urn:li:dataset:(urn:li:dataPlatform:dbt,MY_DATABASE.my_schema.my_table,PROD)",
        ],
        primary=False,
    )
    
    # Construct a MetadataChangeProposalWrapper that attaches the aspect
    # to the Snowflake dataset.
    metadata_event = MetadataChangeProposalWrapper(
        entityType="dataset",
        changeType="UPSERT",
        entityUrn="urn:li:dataset:(urn:li:dataPlatform:snowflake,my_database.my_schema.my_table,PROD)",
        aspect=new_siblings,
    )
    
    # Emit metadata! This is a blocking call.
    emitter.emit(metadata_event)
    
    # Because the relationship is bidirectional, also emit a corresponding
    # SiblingsClass aspect on the dbt dataset (typically with primary=True).
    
  3. Creating Sibling Relationships During Ingestion

    • Siblings are typically created during ingestion when the ingestion framework identifies datasets that should be considered equivalent across different platforms. This can be configured in the ingestion recipes. For example, when ingesting dbt and Snowflake, you can configure the target_platform in the dbt ingestion recipe to ensure that the datasets are recognized as siblings.
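    • Below is a minimal sketch of running such a dbt recipe programmatically with the DataHub Python SDK; the manifest/catalog paths, server URL, and target_platform value are placeholders for your environment. The same configuration could live in a YAML recipe file run via the datahub CLI:
    from datahub.ingestion.run.pipeline import Pipeline
    
    pipeline = Pipeline.create(
        {
            "source": {
                "type": "dbt",
                "config": {
                    "manifest_path": "/path/to/manifest.json",
                    "catalog_path": "/path/to/catalog.json",
                    # The warehouse platform the dbt models materialize in
                    # (e.g. trino or hive); this drives the sibling linkage.
                    "target_platform": "trino",
                },
            },
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "http://localhost:8080"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
    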
  4. Making Multiple Datasets Siblings During Ingestion

    • To make multiple datasets from different platforms siblings of each other during ingestion, you can use transformers or custom scripts to emit the sibling relationships. Here is a conceptual example of how you might set this up:
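    A conceptual sketch only, reusing the emitter pattern from step 2; the URNs are placeholders, and which dataset is marked primary (conventionally the dbt model) may differ in your setup:
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import SiblingsClass
    
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    
    # Placeholder URNs for the same logical table on three platforms
    urns = {
        "dbt": "urn:li:dataset:(urn:li:dataPlatform:dbt,my_db.my_table,PROD)",
        "hive": "urn:li:dataset:(urn:li:dataPlatform:hive,my_db.my_table,PROD)",
        "trino": "urn:li:dataset:(urn:li:dataPlatform:trino,my_db.my_table,PROD)",
    }
    
    # Emit a siblings aspect on each dataset pointing at the other two,
    # marking only the dbt dataset as primary
    for platform, urn in urns.items():
        aspect = SiblingsClass(
            siblings=[u for p, u in urns.items() if p != platform],
            primary=(platform == "dbt"),
        )
        emitter.emit(MetadataChangeProposalWrapper(entityUrn=urn, aspect=aspect))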