Resolving Duplicate Model Issues in DataHub When Using dbt and BigQuery Ingestion Recipes

Original Slack Thread

Hi all.
My ingestion creates duplicates of the same model coming from the dbt recipe and the bigquery recipe.
I tried different ways of changing the recipes but never managed to get a single model pointing to both sources.
Examples of model duplicates below (where I tried different configs): one of them has dbt & bigquery combined (like I want it), but another one is still produced for dbt.

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses!

To address the issue of duplicate models appearing in your DataHub lineage when using both dbt and BigQuery ingestion recipes, you can follow these steps:

  1. Ensure Unique Platform Instances: Make sure each dbt project and BigQuery ingestion has a unique platform_instance specified in its ingestion recipe. This helps distinguish models coming from different projects and sources.

    # dbt ingestion recipe
    source:
      type: dbt
      config:
        platform_instance: dbt_instance
        target_platform: bigquery
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        # other configs
    
    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        platform_instance: bigquery_instance
        project_id: your_project_id
        # other configs
    
  2. Use SQL Parser Lineage: If the dbt metadata is causing issues, you can try using the SQL parser for lineage generation. This can be enabled with the prefer_sql_parser_lineage flag in your dbt ingestion recipe.

    source:
      type: dbt
      config:
        platform_instance: dbt_instance
        target_platform: bigquery
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        prefer_sql_parser_lineage: true
    
  3. Check for Case Sensitivity Issues: Ensure that URN casing is consistent across the two sources, either lowercased everywhere or original casing everywhere, so the same table does not end up with two differently cased URNs. You can control this with the convert_urns_to_lowercase and convert_column_urns_to_lowercase flags in your ingestion recipes.

    # dbt ingestion recipe
    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        target_platform: bigquery
        convert_urns_to_lowercase: false
        convert_column_urns_to_lowercase: false
    
    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        project_id: your_project_id
        convert_urns_to_lowercase: false
    
  4. Re-ingestion and Cleanup: After updating the ingestion recipes, re-ingest the metadata so the changes take effect. If there are existing inconsistencies, you may need to delete the old metadata entries before re-ingesting; you can use the DataHub CLI for this, starting with a dry run (see the note after the commands below).

    datahub delete --platform dbt --soft --dry-run
    datahub delete --platform bigquery --soft --dry-run
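
After reviewing the dry-run output, drop --dry-run to apply the soft deletes. If only a handful of entities are affected, you can also soft-delete them individually by URN; the dataset URN below is a hypothetical example, so substitute your own:

    datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:dbt,my_project.my_model,PROD)" --soft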
    

Would you like more detailed steps or further assistance on any specific part of this process?

Sources:

  • <Slack Multiple dbt Projects>

Tried what <@U06TM7M40RK> suggested with no success.
If anyone else has any tips, please lmk.

To further troubleshoot and resolve the issue of duplicate models appearing in DataHub when using both dbt and BigQuery ingestion recipes, let’s consider additional steps and configurations:

  1. Align URNs Across Ingestions:

    • Ensure that the URNs generated by the dbt and BigQuery ingestions line up: use the same naming conventions, and make sure the dbt source's target_platform (and, if you use platform instances, its target_platform_instance) matches the platform and platform_instance of the BigQuery source, as in the sketch below.
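    • A minimal sketch of aligned recipes, assuming a single BigQuery project and an instance named bigquery_instance (the instance names and project_id are placeholders; adjust to your setup):
    # dbt ingestion recipe
    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        platform_instance: dbt_instance
        target_platform: bigquery
        # Should match the platform_instance in the BigQuery recipe so that
        # both ingestions produce identical BigQuery URNs.
        target_platform_instance: bigquery_instance

    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        project_id: your_project_id
        platform_instance: bigquery_instance
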
  2. Use stateful_ingestion:

    • Enable stateful ingestion to manage the lifecycle of entities and avoid duplicates. This feature helps in handling deletions and updates more effectively.
    # dbt ingestion recipe
    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        target_platform: bigquery
        platform_instance: dbt_instance
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
    
    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        project_id: your_project_id
        platform_instance: bigquery_instance
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
    
  3. Check for Entity Deletion:

    • If the entities were previously soft-deleted, they may still be marked as removed in DataHub. You can undelete them using the DataHub Python SDK.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import StatusClass

    def undelete_entity(entity_urn: str):
        emitter = DatahubRestEmitter(gms_server='http://localhost:8080')
        # Emitting a Status aspect with removed=False clears the soft-delete flag.
        mcp = MetadataChangeProposalWrapper(
            entityUrn=entity_urn,
            aspect=StatusClass(removed=False),
        )
        emitter.emit(mcp)
        print(f"Entity {entity_urn} has been unmarked for deletion.")

    # Example usage
    undelete_entity("urn:li:dataset:(urn:li:dataPlatform:bigquery,example_dataset,PROD)")
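    • To confirm that the flip took effect, you can read the status aspect back with the CLI (same hypothetical URN as above):
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,example_dataset,PROD)" --aspect status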
    
  4. Run DataHub CLI with Debug Logs:

    • Running the DataHub CLI with debug logs enabled can provide more insight into what is going wrong. Pass the global --debug flag (before the ingest subcommand) or set the DATAHUB_DEBUG environment variable to enable detailed logging.
    export DATAHUB_DEBUG=true
    datahub --debug ingest -c your_ingestion_recipe.yml
    
  5. Manual Metadata Emission:

    • If the issue persists, you can manually emit metadata to DataHub using the DataHub Python library.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server='http://localhost:8080')
    emitter.test_connection()

    dataset_urn = make_dataset_urn(platform='bigquery', name='example_dataset', env='PROD')
    dataset_properties = DatasetPropertiesClass(description='Sample dataset')

    metadata_event = MetadataChangeProposalWrapper(
        entityType='dataset',
        changeType='UPSERT',
        entityUrn=dataset_urn,
        aspect=dataset_properties,
    )

    emitter.emit(metadata_event)

Would you like more detailed steps or further assistance on any specific part of this process?

Sources:
  • <https://datahubspace.slack.com/archives/CUMUWQU66/p1724513103.429069|Handling Multiple dbt Projects>

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><