Troubleshooting DataHub Deployment with Multiple Data Entity Copies and Lineage Issues

Original Slack Thread

Hi - I am working to troubleshoot why my DataHub deployment (version 0.12.0) has:

  1. Multiple copies of the same data entities
  2. Working and non-working lineage

Background

  1. My DataHub deployment ingests from both BQ and DBT (a recent addition). Since adding DBT, I have had to add convert_urns_to_lowercase to my BQ recipe to ensure table URNs of the same data entity align.
  2. Some ARDs (views) are complex and are generated from multiple SQL scripts, which are executed via a dataflow job. The data from these views appears to be correct in BigQuery and DataHub; only the lineage fails (i.e. it is unavailable or ends unexpectedly).

Problem(s)

  1. Multiple entities in DataHub: old BQ uppercase URN, new BQ & DBT lowercase URN, and/or new BQ lowercase URNs.
  2. Lineage works for simple ARDs, but fails as the work to create an ARD gets more complex.

Question

  1. How can I align DBT and BQ tables/views to use uppercase URNs only? I would rather use the uppercase option as this is more reflective of the underlying data.
  2. How can I ensure lineage works for all my tables/views? Does the BQ plugin struggle to read lineage at a certain level of complexity?

I am new to posting in the DataHub Slack channel, so I hope this makes sense. I found both issues as part of the same investigation, so I am unsure if they are separate or related. I would appreciate any guidance that could help me troubleshoot these issues!
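
To make the duplicate-entity problem concrete, here is a minimal sketch of why casing matters. It assumes the standard DataHub dataset URN layout and that the DBT ingestion refers to warehouse tables in lowercase (as the need for convert_urns_to_lowercase suggests). The helper `make_dataset_urn` and the table name `my-project.ANALYTICS.ARD_ORDERS` are hypothetical, written only for illustration; this is not the DataHub SDK.

```python
def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string in the usual DataHub layout (illustrative helper)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


# Hypothetical table name as it appears in BigQuery.
table = "my-project.ANALYTICS.ARD_ORDERS"

# URN from a BQ ingestion that keeps the original casing.
bq_upper = make_dataset_urn("bigquery", table)

# URN from a BQ ingestion that lowercases names first.
bq_lower = make_dataset_urn("bigquery", table.lower())

# Assumed lowercase warehouse URN that the DBT ingestion points at.
dbt_target = make_dataset_urn("bigquery", "my-project.analytics.ard_orders")

# URNs are compared as exact strings, so a casing difference yields two entities.
print(bq_upper == dbt_target)  # False: uppercase BQ entity stays separate
print(bq_lower == dbt_target)  # True: lowercase BQ entity lines up with DBT
```

Any entity ingested before the casing change keeps its old URN, which would explain seeing uppercase and lowercase copies side by side.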

Hi Rebecca, good questions here.

  1. My suspicion is that when you first ingested DBT and BQ, your naming conventions didn’t match, and this created a duplicate set of URNs. If it were me, I would probably just delete those platforms with the CLI and re-ingest. Here’s a guide for that: https://datahubproject.io/docs/how/delete-metadata#selecting-entities-to-delete

     When you use the CLI to delete, make sure to use the --dry-run flag first to see what will happen.

  2. Let me know if you’re seeing issues with view lineage from there. If the URNs are aligned, you should see sibling lineage between DBT and BQ.
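
The delete-then-re-ingest suggestion above could be prepared along these lines. This is a minimal sketch that only builds and prints the CLI commands rather than running them; it assumes the `datahub delete` command accepts `--platform` and `--dry-run` as described in the linked guide, and that the platform names are `bigquery` and `dbt`. The helper `build_delete_command` is hypothetical.

```python
import shlex
from typing import List


def build_delete_command(platform: str, dry_run: bool = True) -> List[str]:
    """Return the argument list for deleting one platform's metadata."""
    cmd = ["datahub", "delete", "--platform", platform]
    if dry_run:
        # Preview what would be deleted without changing anything.
        cmd.append("--dry-run")
    return cmd


# Print the dry-run commands for both platforms so they can be reviewed first.
for platform in ("bigquery", "dbt"):
    print(shlex.join(build_delete_command(platform)))
```

Once the dry-run output lists only the entities you expect, the same command without --dry-run performs the delete, after which both recipes can be re-ingested with matching casing settings.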

Thank you very much, Jeffrey :slightly_smiling_face: Would it be reasonable to say that these are related enough that I need to fix step 1 (aligning DBT and BQ) before trying to fix the lineage in step 2?