Preventing Duplicate Metadata Ingestion from dbt Cloud "CI Job"

Original Slack Thread

<@U06TM7M40RK> When ingesting from a dbt Cloud job of the type “CI Job” (dbt term), each new PR that triggers the job creates duplicate metadata in DataHub, since it overlaps with previous CI job runs. Can that be prevented? What’s the best practice there?

It seems likely that the duplicate metadata from a dbt Cloud “CI Job” comes from each PR-triggered run emitting a fresh snapshot of the project’s models, which overlaps with what previous runs already ingested into DataHub. Here are some approaches and best practices that might help address this:

  1. Stateful Ingestion: According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/dbt|DataHub documentation>, enabling stateful ingestion can help manage changes over time and prevent duplication. This feature lets DataHub compare each run against the previous one and soft-delete entities that are no longer present, which is particularly useful for CI jobs that run frequently. See the first recipe sketch after this list.

  2. Use of entities_enabled Config: You can configure your ingestion recipe to selectively enable or disable certain entities. For example, a CI-triggered run could emit only test results, so repeated runs do not re-emit overlapping model metadata. The <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dbt/dbt.md|DataHub documentation> provides configuration examples; see the second sketch after this list.

  3. Metadata Endpoint Configuration: Ensure that the metadata endpoint is correctly configured so metadata is pulled from the intended dbt Cloud environment. The <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_cloud.py|DBTCloudConfig> allows you to override the metadata endpoint, which matters mainly for single-tenant dbt Cloud deployments; see the third sketch after this list.

  4. Handling of Tags and Ownership: There have been reports of tags not being updated correctly on re-ingestion, as noted in <https://github.com/datahub-project/datahub/issues/11482|this GitHub issue>. Verifying that tags and ownership behave as expected across repeated runs can help rule out this class of duplication.

  5. Custom Ingestion Logic: If the standard ingestion process does not meet your needs, you might consider implementing custom logic for CI jobs, such as cleaning up the metadata a PR’s run emitted once that PR is closed. This could involve using the DataHub API to manage metadata more granularly; see the Python sketch after this list.
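
For the stateful-ingestion option (item 1), here is a minimal recipe sketch for a dbt Cloud source. The account, project, and job IDs, the token variable, the target platform, and the sink server are all placeholders to adapt to your setup:

```yaml
# Hypothetical values throughout; substitute your own IDs, token, and server.
pipeline_name: dbt-cloud-ci-ingestion   # stateful ingestion keys its state to this name
source:
  type: dbt-cloud
  config:
    token: ${DBT_CLOUD_TOKEN}
    account_id: 10001
    project_id: 20002
    job_id: 30003                        # the dbt Cloud job whose runs you ingest
    target_platform: snowflake           # warehouse the dbt models run against
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true        # soft-delete entities absent from the latest run
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

With this in place, running `datahub ingest -c recipe.yml` after each CI run should soft-delete entities that a previous run emitted but the current run did not, rather than leaving them to accumulate as duplicates.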
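For the entities_enabled option (item 2), a sketch of restricting a CI-triggered recipe to test results only, following the Yes/No/Only directives shown in the docs; connection settings are elided:

```yaml
source:
  type: dbt-cloud
  config:
    # ...connection settings as in the previous sketch...
    entities_enabled:
      test_results: Only   # emit only test results; models, sources, etc. are skipped
```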
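For the metadata endpoint (item 3), a sketch of setting it explicitly. The URL shown is, to my understanding, the default multi-tenant dbt Cloud endpoint; single-tenant deployments would substitute their own host:

```yaml
source:
  type: dbt-cloud
  config:
    # ...connection settings as in the first sketch...
    metadata_endpoint: https://metadata.cloud.getdbt.com/graphql  # override for single-tenant hosts
```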
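For custom logic (item 5), one pattern is a small cleanup script, run when a PR closes, that soft-deletes the entities that PR’s CI run left behind. A minimal sketch using the DataHub Python SDK is below; the server address and URN are hypothetical placeholders:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

# Placeholder server; point this at your DataHub GMS instance.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical URN for a model emitted by a now-closed PR's CI run.
stale_urn = (
    "urn:li:dataset:(urn:li:dataPlatform:dbt,"
    "my_project.ci_schema.my_model,PROD)"
)

# Soft-delete the entity: it disappears from the UI but can be restored
# later by emitting the same aspect with removed=False.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=stale_urn,
        aspect=StatusClass(removed=True),
    )
)
```

A script like this could be wired into the PR-close step of your CI pipeline so that each PR cleans up after itself.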

Would you like more detailed steps or further assistance on any specific part of this process?
