Preventing Duplicate Metadata Ingestion from dbt Cloud "CI Job"

Original Slack Thread

<@U06TM7M40RK> When ingesting from a dbt Cloud job of the type “CI Job” (dbt term), each new PR that triggers the job creates duplicate metadata in DataHub, since it overlaps with previous CI job runs. Can that be prevented? What’s the best practice there?

It seems likely that the duplicate metadata from a dbt Cloud “CI Job” comes from each PR-triggered run emitting a fresh snapshot of the project’s models, which overlaps with what previous runs already ingested into DataHub. Here are some approaches and best practices that might help address this:

  1. Stateful Ingestion: According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/dbt|DataHub documentation>, enabling stateful ingestion can help manage changes over time and prevent duplication. This feature lets DataHub compare each run against the previous one and soft-delete entities that are no longer present, which is particularly useful for CI jobs that run frequently. See the first recipe sketch after this list.

  2. Use of entities_enabled Config: You can configure your ingestion recipe to selectively enable or disable certain entities. For example, a CI-triggered run could emit only test results, so repeated runs do not re-emit overlapping model metadata. The <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dbt/dbt.md|DataHub documentation> provides configuration examples; see the second sketch after this list.

  3. Metadata Endpoint Configuration: Ensure that the metadata endpoint is correctly configured so metadata is pulled from the intended dbt Cloud environment. The <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dbt/dbt_cloud.py|DBTCloudConfig> allows you to override the metadata endpoint, which matters mainly for single-tenant dbt Cloud deployments; see the third sketch after this list.

  4. Handling of Tags and Ownership: There have been reports of tags not being updated correctly on re-ingestion, as noted in <https://github.com/datahub-project/datahub/issues/11482|this GitHub issue>. Verifying that tags and ownership behave as expected across repeated runs can help rule out this class of duplication.

  5. Custom Ingestion Logic: If the standard ingestion process does not meet your needs, you might consider implementing custom logic for CI jobs, such as cleaning up the metadata a PR’s run emitted once that PR is closed. This could involve using the DataHub API to manage metadata more granularly; see the Python sketch after this list.
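
For the stateful-ingestion option (item 1), here is a minimal recipe sketch for a dbt Cloud source. The account, project, and job IDs, the token variable, the target platform, and the sink server are all placeholders to adapt to your setup:

```yaml
# Hypothetical values throughout; substitute your own IDs, token, and server.
pipeline_name: dbt-cloud-ci-ingestion   # stateful ingestion keys its state to this name
source:
  type: dbt-cloud
  config:
    token: ${DBT_CLOUD_TOKEN}
    account_id: 10001
    project_id: 20002
    job_id: 30003                        # the dbt Cloud job whose runs you ingest
    target_platform: snowflake           # warehouse the dbt models run against
    stateful_ingestion:
      enabled: true
      remove_stale_metadata: true        # soft-delete entities absent from the latest run
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
```

With this in place, running `datahub ingest -c recipe.yml` after each CI run should soft-delete entities that a previous run emitted but the current run did not, rather than leaving them to accumulate as duplicates.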
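For the entities_enabled option (item 2), a sketch of restricting a CI-triggered recipe to test results only, following the Yes/No/Only directives shown in the docs; connection settings are elided:

```yaml
source:
  type: dbt-cloud
  config:
    # ...connection settings as in the previous sketch...
    entities_enabled:
      test_results: Only   # emit only test results; models, sources, etc. are skipped
```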
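For the metadata endpoint (item 3), a sketch of setting it explicitly. The URL shown is, to my understanding, the default multi-tenant dbt Cloud endpoint; single-tenant deployments would substitute their own host:

```yaml
source:
  type: dbt-cloud
  config:
    # ...connection settings as in the first sketch...
    metadata_endpoint: https://metadata.cloud.getdbt.com/graphql  # override for single-tenant hosts
```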
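For custom logic (item 5), one pattern is a small cleanup script, run when a PR closes, that soft-deletes the entities that PR’s CI run left behind. A minimal sketch using the DataHub Python SDK is below; the server address and URN are hypothetical placeholders:

```python
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StatusClass

# Placeholder server; point this at your DataHub GMS instance.
emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical URN for a model emitted by a now-closed PR's CI run.
stale_urn = (
    "urn:li:dataset:(urn:li:dataPlatform:dbt,"
    "my_project.ci_schema.my_model,PROD)"
)

# Soft-delete the entity: it disappears from the UI but can be restored
# later by emitting the same aspect with removed=False.
emitter.emit(
    MetadataChangeProposalWrapper(
        entityUrn=stale_urn,
        aspect=StatusClass(removed=True),
    )
)
```

A script like this could be wired into the PR-close step of your CI pipeline so that each PR cleans up after itself.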

Would you like more detailed steps or further assistance on any specific part of this process?
