Understanding Lineage Model in DataHub

Original Slack Thread

Hi all,
I have a couple of questions about the lineage model. I'd appreciate your help with them:
• If we set lineage in the inlets or outlets of Airflow tasks and those tables have not yet been ingested into DataHub, what happens? Does DataHub create some kind of placeholder for them?
• If we add lineage to a dataset and its parent data platform is re-ingested on a regular schedule, does the added lineage get overwritten?



To address your questions about the lineage model in DataHub:

1. What happens if we set lineage in the inlet or outlet of Airflow tasks and those tables are not yet ingested into DataHub?
If you set lineage in the inlets or outlets of Airflow tasks and those tables have not yet been ingested into DataHub, DataHub will create placeholder entities for them. This lets you visualize and manage lineage even though the actual dataset entities are not yet present in DataHub. According to <Slack discussion>, inlets and outlets in Airflow let you annotate lineage manually, and DataHub records these annotations against the datasets they name.
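The placeholder behaviour can be pictured with a small toy model. This is only a sketch under stated assumptions: `ToyCatalog` and its methods are hypothetical and are not DataHub's storage model or API, and the local `make_dataset_urn` helper is a stand-in that only reproduces the dataset URN string format. The point it illustrates is that a lineage edge needs nothing more than two URNs, so a table can show up in the graph before its metadata has been ingested.

```python
from dataclasses import dataclass, field


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN string (local stand-in helper for this sketch)."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


@dataclass
class ToyCatalog:
    """Hypothetical in-memory catalog; NOT how DataHub stores metadata."""

    ingested: set = field(default_factory=set)  # URNs that have real metadata
    edges: list = field(default_factory=list)   # (upstream, downstream) pairs

    def add_lineage(self, upstream: str, downstream: str) -> None:
        # An edge only references URNs; neither side has to be ingested first.
        self.edges.append((upstream, downstream))

    def lineage_nodes(self) -> dict:
        # Every URN that appears on an edge is a node in the lineage graph.
        urns = {urn for edge in self.edges for urn in edge}
        return {
            urn: ("ingested" if urn in self.ingested else "placeholder")
            for urn in sorted(urns)
        }


catalog = ToyCatalog()
src = make_dataset_urn("snowflake", "mydb.schema.table_a")  # example table names
dst = make_dataset_urn("snowflake", "mydb.schema.table_b")

catalog.ingested.add(dst)      # only the output table has been ingested so far
catalog.add_lineage(src, dst)  # roughly what an inlet/outlet pair records
nodes = catalog.lineage_nodes()
```

In this toy model `nodes[src]` is marked as a placeholder while `nodes[dst]` is marked as ingested. Once the source table is ingested, the same URN would simply gain its metadata without the edge changing.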

2. If we add a lineage to a dataset and the parent data platform of that dataset gets updated on a regular schedule, does the added lineage get overwritten?
Yes, the added lineage can be overwritten when the parent data platform is re-ingested. As per <Slack conversation>, if the ingestion source emits lineage for a dataset, each run replaces the lineage already stored for it, which includes both manually added lineage and lineage emitted by earlier runs of the source. This is because the ingestion source’s picture of lineage is presumed to be more accurate.
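The overwrite can be sketched with plain dictionaries. This is a toy illustration, not DataHub code: `full_replace` and `additive_merge` are hypothetical helpers, and the dataset names are made up. It assumes the scheduled ingestion writes the dataset's whole upstream list on every run, which is the behaviour described above.

```python
def full_replace(stored: dict, dataset: str, upstreams: list) -> None:
    """Overwrite the dataset's upstream list, as a whole-list write would."""
    stored[dataset] = list(upstreams)


def additive_merge(stored: dict, dataset: str, upstreams: list) -> None:
    """Add upstreams while keeping existing ones (hypothetical alternative)."""
    current = stored.setdefault(dataset, [])
    for urn in upstreams:
        if urn not in current:
            current.append(urn)


# Lineage emitted by an earlier ingestion run.
lineage = {"orders": ["raw_orders"]}

# A user adds an extra upstream by hand.
additive_merge(lineage, "orders", ["manual_fix"])
before = list(lineage["orders"])

# The next scheduled run only knows about the edges the source reports.
full_replace(lineage, "orders", ["raw_orders"])
after = list(lineage["orders"])
```

Here `before` still contains the hand-added upstream and `after` does not, which is the loss the answer warns about. Whether a particular source writes lineage this way can depend on the source and the DataHub version, so it is worth checking before relying on manual edits.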

Would you like more details on either of these points?

Sources:

  • <Slack conversation on lineage overwrites>
  • <Slack discussion>