Handling Multiple dbt Projects and Lineage Graph Override

Original Slack Thread

<@U06TM7M40RK> We have multiple dbt projects. They depend on each other as packages: project a has a dependency on project b, and another project c depends on b too. Models from other projects are used as refs rather than sources. When I do a separate ingestion for each project, model “foo” from project b is ingested twice: for platform instances “dbt a” and “dbt c”. When I run “c” second, the “foo” model loaded previously by the ingestion for “a” gets overridden in the lineage graph. Is this intended behaviour? If it is, I’d probably go back to ingesting my “global” project (a project that depends on all of my projects).

The behavior you’re experiencing, where the model “foo” from project B is ingested twice and gets overridden in the lineage graph, is related to how DataHub handles multiple dbt projects and their dependencies. According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/dbt#multiple-dbt-projects|DataHub documentation on handling multiple dbt projects>, each dbt project should have its own ingestion recipe, and the platform_instance field in the recipe should be set to the dbt project name. This helps in distinguishing datasets from different dbt projects.

However, when models from one project are used as references in another project, and you perform separate ingestions for each project, it can lead to the same model being ingested multiple times under different platform_instance values. This can cause the lineage graph to show duplicates or override previously ingested models, as you’ve observed.
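To make the duplication concrete, here is a minimal sketch of how the same model ends up with two identities. It assumes DataHub's usual dataset URN layout, where the platform instance is prefixed to the dataset name; the helper `dataset_urn`, the model name `analytics.public.foo`, the instance names `dbt_a` and `dbt_c`, and the `snowflake` target platform are all hypothetical and do not come from the thread.

```python
from typing import Optional


def dataset_urn(
    platform: str,
    name: str,
    platform_instance: Optional[str] = None,
    env: str = "PROD",
) -> str:
    """Hypothetical helper that mirrors the dataset URN layout
    urn:li:dataset:(urn:li:dataPlatform:<platform>,<instance>.<name>,<env>)."""
    qualified = f"{platform_instance}.{name}" if platform_instance else name
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{qualified},{env})"


# The same dbt model "foo" from project b, seen through two ingestion runs.
foo_via_a = dataset_urn("dbt", "analytics.public.foo", platform_instance="dbt_a")
foo_via_c = dataset_urn("dbt", "analytics.public.foo", platform_instance="dbt_c")

# The warehouse table that "foo" materializes into (assumed platform).
foo_table = dataset_urn("snowflake", "analytics.public.foo")

print(foo_via_a)
print(foo_via_c)
print(foo_table)
```

The two dbt URNs differ only in the instance prefix, so DataHub treats them as separate entities, while the warehouse table URN is the same in both runs if the two recipes share target platform settings. Whether that shared table is where the lineage gets overwritten is not confirmed in this thread.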

To address this issue, you can consider the following approaches:

  1. Use a Global Project for Ingestion: As you mentioned, ingesting a “global” project that depends on all your projects can help consolidate the lineage and avoid duplicates. This approach ensures that all models and their dependencies are ingested in a single run, maintaining a consistent lineage graph.

  2. Configure Unique Platform Instances: Ensure that each dbt project has a unique platform_instance specified in its ingestion recipe. This helps in distinguishing between models from different projects and can prevent overwriting in the lineage graph.

  3. Use SQL Parser Lineage: If the dbt metadata is causing issues, you can try using the SQL parser for lineage generation. This can be enabled with the prefer_sql_parser_lineage flag, which may help in resolving lineage conflicts.

  4. Review Ingestion Order: Run the ingestions in a consistent order and check whether a later run overwrites models written by an earlier one. Ingesting the same model several times with different configurations can produce duplicates.
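The recipe-level options above can be sketched as plain Python dictionaries that mirror the structure of a DataHub ingestion recipe. This is an illustration, not a tested configuration: the project names, file paths, the `snowflake` target platform, the server URL, and the helper `build_dbt_recipe` are assumptions, and `prefer_sql_parser_lineage` may not be available in every DataHub version.

```python
from typing import Any, Dict


def build_dbt_recipe(
    project: str,
    target_platform: str,
    prefer_sql_parser_lineage: bool = False,
) -> Dict[str, Any]:
    """Build a recipe dict for one dbt project (hypothetical helper).

    The artifact paths follow dbt's default target/ layout and are assumed.
    """
    config: Dict[str, Any] = {
        "manifest_path": f"./{project}/target/manifest.json",
        "catalog_path": f"./{project}/target/catalog.json",
        "target_platform": target_platform,
        # One platform instance per dbt project, as the docs recommend.
        "platform_instance": project,
    }
    if prefer_sql_parser_lineage:
        # Optional flag mentioned in the thread; check your version supports it.
        config["prefer_sql_parser_lineage"] = True
    return {
        "source": {"type": "dbt", "config": config},
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},  # assumed URL
        },
    }


# Approach 2: one recipe per project, each with its own platform instance.
per_project = {name: build_dbt_recipe(name, "snowflake") for name in ("a", "c")}

# Approach 1: a single recipe for the "global" project that depends on all others.
global_recipe = build_dbt_recipe("global", "snowflake")

# Approach 3: the same per-project recipe with SQL parser lineage preferred.
a_with_sql_lineage = build_dbt_recipe("a", "snowflake", prefer_sql_parser_lineage=True)
```

With the per-project recipes, model “foo” from project b is still present in both manifests, so it is emitted once per run. The single global recipe emits it once under one platform instance.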

Would you like more detailed steps or further assistance on any specific part of the configuration?


<@U06TM7M40RK> I’m using separate recipes with separate platform instances.

If you are using separate recipes with distinct platform_instance values for each dbt project, and still encountering issues with models being overridden in the lineage graph, it might be due to how the ingestion process handles overlapping models across different projects.

Here are some steps and considerations to help address this issue:

  1. Ensure Unique Platform Instances: Double-check that each dbt project has a unique platform_instance value in its ingestion recipe. This is crucial for distinguishing datasets from different projects.

  2. Review Lineage Configuration: Make sure that your ingestion recipes are configured to handle lineage correctly. You can use the prefer_sql_parser_lineage flag to enable SQL parser lineage, which might help in resolving conflicts when models are referenced across projects.

  3. Check for Overlapping Models: If models are being referenced across projects, ensure that the ingestion process is not duplicating these models under different platform instances. This can happen if the same model is ingested multiple times with different configurations.

  4. Use Stateful Ingestion: Consider enabling stateful ingestion to manage metadata changes more effectively. This can help in tracking changes across different ingestion runs and prevent overwriting of models.

  5. Consolidate Ingestion: If the above steps do not resolve the issue, you might consider consolidating your ingestion process by using a “global” project that includes all dependencies. This approach can help maintain a consistent lineage graph by ingesting all models and their dependencies in a single run.
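As a rough sketch of the stateful ingestion suggestion, the snippet below adds the relevant settings to an existing recipe dictionary. The helper `with_stateful_ingestion`, the pipeline name, and the file paths are hypothetical. Stateful ingestion needs a stable `pipeline_name` to track state between runs; the thread does not confirm that enabling it prevents the override described here.

```python
import copy
from typing import Any, Dict


def with_stateful_ingestion(recipe: Dict[str, Any], pipeline_name: str) -> Dict[str, Any]:
    """Return a copy of the recipe with stateful ingestion enabled (hypothetical helper)."""
    updated = copy.deepcopy(recipe)
    # State is stored per pipeline, so each recipe needs its own stable name.
    updated["pipeline_name"] = pipeline_name
    updated["source"]["config"]["stateful_ingestion"] = {"enabled": True}
    return updated


# Minimal recipe for project "a"; paths and target platform are assumptions.
recipe_a: Dict[str, Any] = {
    "source": {
        "type": "dbt",
        "config": {
            "manifest_path": "./a/target/manifest.json",
            "catalog_path": "./a/target/catalog.json",
            "target_platform": "snowflake",
            "platform_instance": "a",
        },
    },
}

stateful_recipe_a = with_stateful_ingestion(recipe_a, pipeline_name="dbt_project_a")
```

The original recipe is left untouched because the helper works on a deep copy, which makes it easy to compare runs with and without the setting.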

If you would like more detailed steps on any of these approaches or further assistance, please let me know.
