Managing Lineage and Metadata Overwrites in DataHub

Original Slack Thread

<@U06TM7M40RK> if I am ingesting a source using a DataHub connector, and also using a different method (e.g. an LLM service) to write lineage to DataHub, is there any risk that the ingestion will overwrite the lineage and additional metadata written by the LLM service? (For context: I will run the connector ingestion daily but the LLM service once a week.)

When you use multiple methods to write lineage to DataHub, such as a DataHub connector for ingestion and an LLM service, there is a risk of overwriting lineage and additional metadata. According to the <https://datahubproject.io/docs/0.14.0/generated/lineage/lineage-feature-guide|DataHub Lineage Feature Guide>, lineage added manually or programmatically may conflict with lineage extracted automatically: the ingestion source’s picture of lineage is presumed to be more accurate, so a connector run can overwrite lineage that was added by other means.

To mitigate this risk, carefully manage the timing and scope of each method’s updates. For example, you could schedule the LLM service’s writes so they do not overlap with the daily connector runs, or enable incremental lineage so the connector merges its lineage edges with what is already in DataHub instead of restating them.
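As an illustration of the second option, incremental lineage is a per-source flag. Here is a minimal sketch assuming a BigQuery source with credentials supplied via the environment; the project id is a placeholder, and a fuller Snowflake example appears later in this thread:

source:
  type: bigquery
  config:
    project_ids:
      - my-gcp-project  # placeholder project id
    include_table_lineage: true
    incremental_lineage: true  # merge new lineage edges instead of restating them on each run
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}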

Would you like more details on how to manage these updates or any specific configurations that might help in your scenario?

<@U06TM7M40RK> is there a configuration in the ingestion recipe to avoid overwriting certain values during ingestion if they already exist in DataHub?

Yes, you can configure your ingestion recipe to avoid overwriting certain metadata values that already exist in DataHub by using PATCH semantics in transformers. PATCH semantics add or update metadata without replacing existing values.

For example, if you want to add tags to datasets without overwriting existing tags, you can use the simple_add_dataset_tags transformer with the semantics: PATCH configuration. Here’s how you can configure your ingestion recipe:

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
      semantics: PATCH  # This ensures existing tags are preserved

This configuration ensures that the specified tags are added to the datasets, but any existing tags are not overwritten.
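The same semantics: PATCH option is supported by several other simple transformers. As a further sketch, assuming the simple_add_dataset_terms transformer and a placeholder glossary term URN, you could attach glossary terms while keeping whatever terms are already on the dataset:

transformers:
  - type: "simple_add_dataset_terms"
    config:
      term_urns:
        - "urn:li:glossaryTerm:Classification.Sensitive"  # placeholder term
      semantics: PATCH  # merge with existing glossary terms instead of replacing them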

Would you like more detailed instructions or have any specific questions about this process?

<@U06TM7M40RK> is it possible not to overwrite the lineage?

Yes, it is possible to configure your ingestion process to avoid overwriting existing lineage in DataHub. This can be achieved by using the incremental_lineage configuration option, which allows lineage data to be emitted incrementally rather than being restated on each run.

When incremental_lineage is set to true, the connector emits lineage as patches that are merged with the lineage already in DataHub rather than replacing it. This is particularly useful for preserving lineage that was added manually or by other tools, such as your LLM service.

Here is an example configuration for enabling incremental lineage in a Snowflake ingestion:

pipeline_name: snowflake-lineage-ingestion
source:
  type: snowflake
  config:
    account_id: ${SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    username: ${SNOWFLAKE_USERNAME}
    password: ${SNOWFLAKE_PASSWORD}
    role: ${SNOWFLAKE_ROLE}
    include_view_lineage: true
    include_table_lineage: true
    include_usage_stats: true
    include_column_lineage: true
    incremental_lineage: true  # Enable incremental lineage
    stateful_ingestion:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}

This configuration ensures that new lineage information is added without overwriting existing lineage data.
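For your daily-connector / weekly-LLM setup, the two techniques in this thread can be combined in one recipe: incremental lineage on the source so lineage edges are merged, and PATCH semantics on any transformers so tags or other metadata written outside the connector are preserved. A combined sketch, reusing the placeholders from the examples above:

pipeline_name: snowflake-lineage-ingestion
source:
  type: snowflake
  config:
    account_id: ${SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    username: ${SNOWFLAKE_USERNAME}
    password: ${SNOWFLAKE_PASSWORD}
    role: ${SNOWFLAKE_ROLE}
    include_table_lineage: true
    include_column_lineage: true
    incremental_lineage: true  # merge new lineage edges instead of restating them
    stateful_ingestion:
      enabled: true
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
      semantics: PATCH  # preserve tags added outside this connector
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}

You would then run this recipe on the daily schedule (for example with datahub ingest -c recipe.yaml), while the LLM service continues its weekly writes.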

Would you like more details on this configuration or have any specific questions about implementing it?
