Resolving Duplicate Model Issues in DataHub When Using dbt and BigQuery Ingestion Recipes

Original Slack Thread

Hi all.
My ingestion creates duplicates of the same model coming from the dbt recipe and the bigquery recipe.
I tried different ways of changing the recipes but never managed to get a single model pointing to both sources.
Examples of model duplicates below (where I tried different configs): one of them has dbt & bigquery combined (like I want it), but another one is still produced for dbt.

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses!

To address the issue of duplicate models appearing in your DataHub lineage when using both dbt and BigQuery ingestion recipes, you can follow these steps:

  1. Ensure Unique Platform Instances: Make sure each dbt project and BigQuery ingestion has a unique platform_instance specified in its ingestion recipe. This helps distinguish models coming from different projects and sources.

    # dbt ingestion recipe
    source:
      type: dbt
      config:
        platform_instance: dbt_instance
        target_platform: bigquery
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        # other configs
    
    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        platform_instance: bigquery_instance
        project_id: your_project_id
        # other configs
    
  2. Use SQL Parser Lineage: If the dbt metadata is causing issues, you can try using the SQL parser for lineage generation. This can be enabled with the prefer_sql_parser_lineage flag in your dbt ingestion recipe.

    source:
      type: dbt
      config:
        platform_instance: dbt_instance
        target_platform: bigquery
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        prefer_sql_parser_lineage: true
    
  3. Check for Case Sensitivity Issues: Ensure that URN casing is consistent across the two sources, either lowercased everywhere or original casing everywhere, so the same table does not end up with two differently cased URNs. You can control this with the convert_urns_to_lowercase and convert_column_urns_to_lowercase flags in your ingestion recipes.

    # dbt ingestion recipe
    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        target_platform: bigquery
        convert_urns_to_lowercase: false
        convert_column_urns_to_lowercase: false
    
    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        project_id: your_project_id
        convert_urns_to_lowercase: false
    
  4. Re-ingestion and Cleanup: After updating the ingestion recipes, re-ingest the metadata so the changes take effect. If there are existing inconsistencies, you may need to delete the old metadata entries before re-ingesting; you can use the DataHub CLI for this, starting with a dry run (see the note after the commands below).

    datahub delete --platform dbt --soft --dry-run
    datahub delete --platform bigquery --soft --dry-run
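
After reviewing the dry-run output, drop --dry-run to apply the soft deletes. If only a handful of entities are affected, you can also soft-delete them individually by URN; the dataset URN below is a hypothetical example, so substitute your own:

    datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:dbt,my_project.my_model,PROD)" --soft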
    

Would you like more detailed steps or further assistance on any specific part of this process?

Sources:

  • <Slack Multiple dbt Projects>

Tried what <@U06TM7M40RK> suggested with no success.
If anyone else has any tips, please lmk.

To further troubleshoot and resolve the issue of duplicate models appearing in DataHub when using both dbt and BigQuery ingestion recipes, let’s consider additional steps and configurations:

  1. Align URNs Across Ingestions:

    • Ensure that the URNs generated by the dbt and BigQuery ingestions line up: use the same naming conventions, and make sure the dbt source's target_platform (and, if you use platform instances, its target_platform_instance) matches the platform and platform_instance of the BigQuery source, as in the sketch below.
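    • A minimal sketch of aligned recipes, assuming a single BigQuery project and an instance named bigquery_instance (the instance names and project_id are placeholders; adjust to your setup):
    # dbt ingestion recipe
    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        platform_instance: dbt_instance
        target_platform: bigquery
        # Should match the platform_instance in the BigQuery recipe so that
        # both ingestions produce identical BigQuery URNs.
        target_platform_instance: bigquery_instance

    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        project_id: your_project_id
        platform_instance: bigquery_instance
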
  2. Use stateful_ingestion:

    • Enable stateful ingestion to manage the lifecycle of entities and avoid duplicates. This feature helps in handling deletions and updates more effectively.
    # dbt ingestion recipe
    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        target_platform: bigquery
        platform_instance: dbt_instance
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
    
    # BigQuery ingestion recipe
    source:
      type: bigquery
      config:
        project_id: your_project_id
        platform_instance: bigquery_instance
        stateful_ingestion:
          enabled: true
          remove_stale_metadata: true
    
  3. Check for Entity Deletion:

    • If the entities were previously soft-deleted, they may still be marked as removed in DataHub. You can undelete them using the DataHub Python SDK.
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import StatusClass

    def undelete_entity(entity_urn: str):
        emitter = DatahubRestEmitter(gms_server='http://localhost:8080')
        # Emitting a Status aspect with removed=False clears the soft-delete flag.
        mcp = MetadataChangeProposalWrapper(
            entityUrn=entity_urn,
            aspect=StatusClass(removed=False),
        )
        emitter.emit(mcp)
        print(f"Entity {entity_urn} has been unmarked for deletion.")

    # Example usage
    undelete_entity("urn:li:dataset:(urn:li:dataPlatform:bigquery,example_dataset,PROD)")
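    • To confirm that the flip took effect, you can read the status aspect back with the CLI (same hypothetical URN as above):
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,example_dataset,PROD)" --aspect status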
    
  4. Run DataHub CLI with Debug Logs:

    • Running the DataHub CLI with debug logs enabled can provide more insight into what is going wrong. Pass the global --debug flag (before the ingest subcommand) or set the DATAHUB_DEBUG environment variable to enable detailed logging.
    export DATAHUB_DEBUG=true
    datahub --debug ingest -c your_ingestion_recipe.yml
    
  5. Manual Metadata Emission:

    • If the issue persists, you can manually emit metadata to DataHub using the DataHub Python library.
    from datahub.emitter.mce_builder import make_dataset_urn
    from datahub.emitter.mcp import MetadataChangeProposalWrapper
    from datahub.emitter.rest_emitter import DatahubRestEmitter
    from datahub.metadata.schema_classes import DatasetPropertiesClass

    emitter = DatahubRestEmitter(gms_server='http://localhost:8080')
    emitter.test_connection()

    dataset_urn = make_dataset_urn(platform='bigquery', name='example_dataset', env='PROD')
    dataset_properties = DatasetPropertiesClass(description='Sample dataset')

    metadata_event = MetadataChangeProposalWrapper(
        entityType='dataset',
        changeType='UPSERT',
        entityUrn=dataset_urn,
        aspect=dataset_properties,
    )

    emitter.emit(metadata_event)

Would you like more detailed steps or further assistance on any specific part of this process?

Sources:
  • <https://datahubspace.slack.com/archives/CUMUWQU66/p1724513103.429069|Handling Multiple dbt Projects>

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><