Hi all.
My ingestion creates duplicates of the same model coming from the dbt recipe and the BigQuery recipe.
I've tried different ways of changing the recipes, but never managed to get a single model pointing to both sources.
Examples of model duplicates below (from trying different configs): one of them has dbt & BigQuery combined (like I want it), but a separate dbt-only model is still produced.
Hey there!
<@U06TM7M40RK> Can you help with the questions in this thread?
To address the issue of duplicate models appearing in your DataHub lineage when using both dbt and BigQuery ingestion recipes, you can follow these steps:
- **Ensure Unique Platform Instances:** Make sure that each dbt project and BigQuery ingestion has a unique `platform_instance` specified in its ingestion recipe. This helps distinguish models from different projects and sources.

```
# dbt ingestion recipe
source:
  type: dbt
  config:
    platform_instance: dbt_instance
    target_platform: bigquery
    manifest_path: /path/to/manifest.json
    catalog_path: /path/to/catalog.json
    # other configs

# BigQuery ingestion recipe
source:
  type: bigquery
  config:
    platform_instance: bigquery_instance
    project_id: your_project_id
    # other configs
```
- **Use SQL Parser Lineage:** If the dbt metadata is causing issues, you can try using the SQL parser for lineage generation. This can be enabled with the `prefer_sql_parser_lineage` flag in your dbt ingestion recipe.

```
source:
  type: dbt
  config:
    platform_instance: dbt_instance
    target_platform: bigquery
    manifest_path: /path/to/manifest.json
    catalog_path: /path/to/catalog.json
    prefer_sql_parser_lineage: true
```
-
Check for Case Sensitivity Issues: Ensure that the URNs are consistently lowercased or retain their original casing across different ingestion sources. You can use the
convert_urns_to_lowercase
andconvert_column_urns_to_lowercase
flags in your ingestion recipes.# dbt ingestion recipe source: type: dbt config: manifest_path: /path/to/manifest.json catalog_path: /path/to/catalog.json target_platform: bigquery convert_urns_to_lowercase: false convert_column_urns_to_lowercase: false # BigQuery ingestion recipe source: type: bigquery config: project_id: your_project_id convert_urns_to_lowercase: false
- **Re-ingestion and Cleanup:** After updating the ingestion recipes, re-ingest the metadata so the changes take effect. If there are existing inconsistencies, you may need to delete the old metadata entries before re-ingesting. You can use the DataHub CLI for this (a targeted variant follows below):

```
datahub delete --platform dbt --soft --dry-run
datahub delete --platform bigquery --soft --dry-run
```
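If the dry run lists the entities you expect, you can also target a single duplicate instead of a whole platform. A minimal sketch; the URN below is a placeholder for one of your duplicated dbt models:

```
# Inspect the suspected duplicate first (placeholder URN)
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:dbt,my_project.my_model,PROD)"

# Soft-delete just that entity once confirmed
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:dbt,my_project.my_model,PROD)" --soft
```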
Would you like more detailed steps or further assistance on any specific part of this process?
Sources:
- <Slack Multiple dbt Projects>
Tried what <@U06TM7M40RK> suggested, with no success.
If anyone else has any tips, please lmk.
To further troubleshoot and resolve the issue of duplicate models appearing in DataHub when using both dbt and BigQuery ingestion recipes, let’s consider additional steps and configurations:
- **Align URNs Across Ingestions:** Ensure that the URNs generated by the dbt and BigQuery ingestions line up. This can be achieved by using the same naming conventions and keeping the `platform_instance` and `target_platform` settings consistent between the two recipes (see the sketch below).
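A minimal sketch of what "aligned" can look like when the dbt models should merge with their BigQuery tables: both recipes omit `platform_instance` (or the dbt recipe sets a matching `target_platform_instance`), so the dbt source emits target URNs identical to those the BigQuery source produces. Paths and the project ID are placeholders:

```
# dbt ingestion recipe (sketch -- paths are placeholders)
source:
  type: dbt
  config:
    manifest_path: /path/to/manifest.json
    catalog_path: /path/to/catalog.json
    target_platform: bigquery   # must match the warehouse source's platform

# BigQuery ingestion recipe (sketch -- project ID is a placeholder)
source:
  type: bigquery
  config:
    project_id: your_project_id
```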
- **Use `stateful_ingestion`:** Enable stateful ingestion to manage the lifecycle of entities and avoid duplicates. This feature helps handle deletions and updates more effectively.

```
# dbt ingestion recipe
source:
  type: dbt
  config:
    manifest_path: /path/to/manifest.json
    catalog_path: /path/to/catalog.json
    target_platform: bigquery
    platform_instance: dbt_instance
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true

# BigQuery ingestion recipe
source:
  type: bigquery
  config:
    project_id: your_project_id
    platform_instance: bigquery_instance
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true
```
- **Check for Entity Deletion:** If the entities were previously deleted, they might still be marked as deleted in DataHub. You can undelete them using the DataHub Python library; a quick existence check follows the snippet below.

```
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

def undelete_entity(entity_urn: str):
    emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
    # Setting removed=False on the status aspect reverses a soft delete
    mcp = MetadataChangeProposalWrapper(
        entityUrn=entity_urn,
        aspect=StatusClass(removed=False),
        changeType="UPSERT",
    )
    emitter.emit(mcp)
    print(f"Entity {entity_urn} has been unmarked for deletion.")

# Example usage
undelete_entity("urn:li:dataset:(urn:li:dataPlatform:bigquery,example_dataset,PROD)")
```
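Before undeleting anything, it can help to confirm that DataHub really holds two separate entities for the same model. A small sketch using the graph client; the model name is a placeholder:

```
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

# Check both platforms for the same (placeholder) model name
for platform in ("dbt", "bigquery"):
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},my_project.my_model,PROD)"
    print(platform, graph.exists(urn))
```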
- **Run DataHub CLI with Debug Logs:** Running the DataHub CLI with debug logs enabled can provide more insight into what might be going wrong. Use the `--debug` flag to enable detailed logging.

```
export DATAHUB_DEBUG=true
datahub ingest -c your_ingestion_recipe.yml --debug
```
- **Manual Metadata Emission:** If the issue persists, you can manually emit metadata to DataHub using the DataHub Python library.

```
from datahub.emitter.mce_builder import make_dataset_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")
emitter.test_connection()

dataset_urn = make_dataset_urn(platform="bigquery", name="example_dataset", env="PROD")
dataset_properties = DatasetPropertiesClass(description="Sample dataset")

metadata_event = MetadataChangeProposalWrapper(
    entityType="dataset",
    changeType="UPSERT",
    entityUrn=dataset_urn,
    aspect=dataset_properties,
)
emitter.emit(metadata_event)
```
Would you like more detailed steps or further assistance on any specific part of this process?
Sources:
- <https://datahubspace.slack.com/archives/CUMUWQU66/p1724513103.429069|Handling Multiple dbt Projects>
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><