Issue with Lineage Ingestion in DataHub when Multiple Projects are Involved

Original Slack Thread

Hello all,

I’m using DataHub 0.13.3.4 and ingesting via the CLI; the data source is dbt. I have two dbt projects, silver and gold, where the gold project uses silver’s tables as sources.
If I ingest silver first, I can see the table lineage in DataHub just fine. However, if I then ingest the gold project, the upstream lineage of the silver tables that gold uses as sources is gone. If I ingest silver again, it works as intended. Is this a bug or a feature?

Below is what happens when I ingest them in silver → gold order (upstream is gone) compared to gold → silver order (upstream is visible): (two attachments)

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To better assist you, could you please provide the following additional information?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (You mentioned 0.13.3.4, please confirm)
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery, Snowflake, etc.)

Based on the information provided, it seems like there might be an issue with how the lineage is being ingested and updated. This could be related to how DataHub handles lineage updates when multiple projects are involved.
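To make the suspected failure mode concrete, here is a toy model of it in plain Python. It is only a sketch, not DataHub's implementation: the store, the table names (`silver.orders`, `raw.orders`, `gold.order_summary`), and the two write functions are all made up for illustration. It assumes that each run writes one lineage value per table, and that the gold project sees the silver table only as a source with no parents.

```python
from typing import Callable, Dict, List, Tuple

ASPECT = "upstreamLineage"
Store = Dict[Tuple[str, str], List[str]]   # (table, aspect name) -> upstream tables
Run = List[Tuple[str, List[str]]]          # what one ingestion run emits


def overwrite(store: Store, urn: str, upstreams: List[str]) -> None:
    """Replace the stored lineage entirely; the last writer wins."""
    store[(urn, ASPECT)] = list(upstreams)


def merge(store: Store, urn: str, upstreams: List[str]) -> None:
    """Add new upstreams to whatever is already stored."""
    current = store.setdefault((urn, ASPECT), [])
    for upstream in upstreams:
        if upstream not in current:
            current.append(upstream)


def ingest(runs: List[Run], write: Callable[[Store, str, List[str]], None]) -> Store:
    """Apply each run in order to an empty store using the given write mode."""
    store: Store = {}
    for run in runs:
        for urn, upstreams in run:
            write(store, urn, upstreams)
    return store


# Hypothetical projects: silver knows the table's real upstream,
# gold only declares the same table as a source (no upstreams).
SILVER_TABLE = "silver.orders"
silver_run: Run = [(SILVER_TABLE, ["raw.orders"])]
gold_run: Run = [(SILVER_TABLE, []), ("gold.order_summary", [SILVER_TABLE])]

silver_then_gold = ingest([silver_run, gold_run], overwrite)
gold_then_silver = ingest([gold_run, silver_run], overwrite)
merged = ingest([silver_run, gold_run], merge)

print(silver_then_gold[(SILVER_TABLE, ASPECT)])  # []
print(gold_then_silver[(SILVER_TABLE, ASPECT)])  # ['raw.orders']
print(merged[(SILVER_TABLE, ASPECT)])            # ['raw.orders']
```

With overwriting writes, the result depends on ingestion order, which matches the behavior described in the question; with merging writes, it does not.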

According to a Slack discussion, there have been issues with lineage ingestion, especially when temporary tables or multiple projects are involved. Additionally, a GitHub issue in datahub-project/datahub mentions that lineage can sometimes be incorrectly displayed due to how dependencies are parsed.

Please confirm the details above, and let me know if you would like more detailed steps or further investigation into this issue.

<@U06TM7M40RK> I would like more detailed steps or further investigation into this issue.

Sure, let’s dive deeper into the issue. Based on your description, it seems like the lineage information is being overwritten or not properly merged when ingesting the gold project after the silver project. Here are some detailed steps and considerations to help troubleshoot and potentially resolve this issue:

Steps to Investigate and Resolve Lineage Issues

  1. Check Ingestion Configuration:
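    Ensure that the ingestion configurations for both the silver and gold projects are set up to include lineage information. Here is an example configuration for a dbt project (check each option name against the dbt source documentation for your DataHub version; the dbt source also expects a target_platform setting, which is not shown here):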

    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        sources_path: /path/to/sources.json
        include_lineage: true
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    
  2. Enable Stateful Ingestion:
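    Stateful ingestion lets a pipeline remember what it emitted on previous runs (DataHub uses this mainly to remove stale metadata). It is worth enabling while investigating, although it is not guaranteed to stop lineage from being overwritten. It requires a top-level pipeline_name in the recipe, which should be unique per project. Enable it in your configuration: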

    source:
      type: dbt
      config:
        manifest_path: /path/to/manifest.json
        catalog_path: /path/to/catalog.json
        sources_path: /path/to/sources.json
        include_lineage: true
        stateful_ingestion:
          enabled: true
    
  3. Ingest Both Projects Together:
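    Instead of ingesting the silver and gold projects as unrelated jobs, try ingesting them together in a single pipeline run. A DataHub recipe holds one source, so the two recipes below (separated by ---) belong in separate files that are run back to back. This can help ensure that the lineage information is correctly merged.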

    source:
      type: dbt
      config:
        manifest_path: /path/to/silver/manifest.json
        catalog_path: /path/to/silver/catalog.json
        sources_path: /path/to/silver/sources.json
        include_lineage: true
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    ---
    source:
      type: dbt
      config:
        manifest_path: /path/to/gold/manifest.json
        catalog_path: /path/to/gold/catalog.json
        sources_path: /path/to/gold/sources.json
        include_lineage: true
    sink:
      type: datahub-rest
      config:
        server: http://localhost:8080
    
  4. Check Logs for Errors:
    Review the logs from the ingestion process to identify any errors or warnings that might indicate why lineage information is being lost. Logs can provide valuable insights into what might be going wrong.

  5. Use the DataHub CLI to Inspect Lineage:
    Use the DataHub CLI to inspect the lineage information for specific datasets. This can help you understand if the lineage is being ingested correctly but not displayed properly in the UI.

    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:dbt,silver.table_name,PROD)" --aspect upstreamLineage
    datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:dbt,gold.table_name,PROD)" --aspect upstreamLineage
    
  6. Consider Version Compatibility:
    Ensure that the version of DataHub you are using supports the features you need. Some features might not be available or fully functional in older versions.
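Steps 4 and 5 above can be scripted. The following is a minimal sketch in plain Python that builds the same `datahub get` commands shown in step 5 and filters a log for warning or error lines. The helper names are made up for illustration, and the filter assumes that log lines contain the level name as a word (`WARNING`, `WARN`, or `ERROR`), which may need adjusting for your logging setup.

```python
import re
from typing import List


def dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in the same shape as the ones used in step 5."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def get_lineage_command(urn: str) -> List[str]:
    """Argument list for fetching the upstreamLineage aspect with the CLI."""
    return ["datahub", "get", "--urn", urn, "--aspect", "upstreamLineage"]


# Assumption: the log level appears as a standalone word in each line.
LOG_PROBLEM = re.compile(r"\b(ERROR|WARNING|WARN)\b")


def problem_lines(log_text: str) -> List[str]:
    """Return only the log lines that mention a warning or an error."""
    return [line for line in log_text.splitlines() if LOG_PROBLEM.search(line)]


silver_urn = dataset_urn("dbt", "silver.table_name")
print(" ".join(get_lineage_command(silver_urn)))

# Made-up log text for the demo; this is not real DataHub output.
sample_log = "INFO starting\nWARNING something skipped\nERROR failed to write\nINFO done"
print(problem_lines(sample_log))
```

Once the CLI is configured, the command list can be passed to `subprocess.run(..., capture_output=True, text=True)` before and after the gold ingestion to see whether the stored aspect itself changes, or only how the UI displays it.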

Example Configuration for Ingesting Both Projects Together

Here is an example of how you might configure the ingestion for both silver and gold projects together:

# Ingest Silver Project
source:
  type: dbt
  config:
    manifest_path: /path/to/silver/manifest.json
    catalog_path: /path/to/silver/catalog.json
    sources_path: /path/to/silver/sources.json
    include_lineage: true
    stateful_ingestion:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080

# Ingest Gold Project (a separate recipe file, since each recipe holds one source)
source:
  type: dbt
  config:
    manifest_path: /path/to/gold/manifest.json
    catalog_path: /path/to/gold/catalog.json
    sources_path: /path/to/gold/sources.json
    include_lineage: true
    stateful_ingestion:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: http://localhost:8080
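The two recipes can also be run back to back from one Python script. The sketch below builds one recipe per project and runs them in silver, gold, silver order, since re-ingesting silver last is the workaround reported in the original question (it is not a confirmed fix). The helper names, the paths, and the `target_platform` value `snowflake` are assumptions; the thread does not say which warehouse is in use. The recipe uses only `manifest_path`, `catalog_path`, and `target_platform`, and `run_recipe` relies on the `Pipeline` class from the `acryl-datahub` package.

```python
from typing import Any, Dict, List


def dbt_recipe(project_dir: str, target_platform: str, server: str) -> Dict[str, Any]:
    """Build a recipe dict for one dbt project (one source per recipe)."""
    return {
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": f"{project_dir}/manifest.json",
                "catalog_path": f"{project_dir}/catalog.json",
                "target_platform": target_platform,
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def ingestion_plan(base_dir: str, target_platform: str, server: str) -> List[Dict[str, Any]]:
    """Recipes in run order: silver, gold, then silver again (reported workaround)."""
    order = ["silver", "gold", "silver"]
    return [dbt_recipe(f"{base_dir}/{name}", target_platform, server) for name in order]


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run one recipe; imported lazily so the builders work without DataHub installed."""
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


# Placeholder values; replace them with your own paths, platform, and server.
plan = ingestion_plan("/path/to", "snowflake", "http://localhost:8080")
for recipe in plan:
    print(recipe["source"]["config"]["manifest_path"])
```

Calling `run_recipe(recipe)` for each entry in `plan` performs the actual ingestion; it is left out of the loop above so that the sketch can be inspected without a running DataHub instance.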

Would you like more detailed instructions or examples for any specific part of this process?
