Managing Lineage and Metadata Overwrites in DataHub

Original Slack Thread

<@U06TM7M40RK> if I am ingesting a source using a DataHub connector, and also using a different method (e.g. an LLM service) to write lineage to DataHub, is there any risk that the ingestion will overwrite the lineage and additional metadata written by the LLM service? (For context: I will run the connector ingestion daily but the LLM service once a week.)

When you use multiple methods to write lineage to DataHub, such as a DataHub connector for ingestion and an LLM service, there is a risk of overwriting lineage and additional metadata. According to the <https://datahubproject.io/docs/0.14.0/generated/lineage/lineage-feature-guide|DataHub Lineage Feature Guide>, lineage added manually or programmatically may conflict with lineage extracted automatically: the ingestion source’s picture of lineage is presumed to be more accurate, so a connector run can overwrite lineage that was added by other means.

To mitigate this risk, carefully manage the timing and scope of each method’s updates. For example, you could schedule the LLM service’s writes so they do not overlap with the daily connector runs, or enable incremental lineage so the connector merges its lineage edges with what is already in DataHub instead of restating them.
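As an illustration of the second option, incremental lineage is a per-source flag. Here is a minimal sketch assuming a BigQuery source with credentials supplied via the environment; the project id is a placeholder, and a fuller Snowflake example appears later in this thread:

source:
  type: bigquery
  config:
    project_ids:
      - my-gcp-project  # placeholder project id
    include_table_lineage: true
    incremental_lineage: true  # merge new lineage edges instead of restating them on each run
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}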

Would you like more details on how to manage these updates or any specific configurations that might help in your scenario?

<@U06TM7M40RK> is there a configuration in the ingestion recipe to avoid overwriting certain values during ingestion if they already exist in DataHub?

Yes, you can configure your ingestion recipe to avoid overwriting certain metadata values that already exist in DataHub by using PATCH semantics in transformers. PATCH semantics add or update metadata without replacing existing values.

For example, if you want to add tags to datasets without overwriting existing tags, you can use the simple_add_dataset_tags transformer with the semantics: PATCH configuration. Here’s how you can configure your ingestion recipe:

transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
        - "urn:li:tag:Legacy"
      semantics: PATCH  # This ensures existing tags are preserved

This configuration ensures that the specified tags are added to the datasets, but any existing tags are not overwritten.
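The same semantics: PATCH option is supported by several other simple transformers. As a further sketch, assuming the simple_add_dataset_terms transformer and a placeholder glossary term URN, you could attach glossary terms while keeping whatever terms are already on the dataset:

transformers:
  - type: "simple_add_dataset_terms"
    config:
      term_urns:
        - "urn:li:glossaryTerm:Classification.Sensitive"  # placeholder term
      semantics: PATCH  # merge with existing glossary terms instead of replacing them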

Would you like more detailed instructions or have any specific questions about this process?

<@U06TM7M40RK> is it possible not to overwrite the lineage?

Yes, it is possible to configure your ingestion process to avoid overwriting existing lineage in DataHub. This can be achieved by using the incremental_lineage configuration option, which allows lineage data to be emitted incrementally rather than being restated on each run.

When incremental_lineage is set to true, the connector emits lineage as patches that are merged with the lineage already in DataHub rather than replacing it. This is particularly useful for preserving lineage that was added manually or by other tools, such as your LLM service.

Here is an example configuration for enabling incremental lineage in a Snowflake ingestion:

pipeline_name: snowflake-lineage-ingestion
source:
  type: snowflake
  config:
    account_id: ${SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    username: ${SNOWFLAKE_USERNAME}
    password: ${SNOWFLAKE_PASSWORD}
    role: ${SNOWFLAKE_ROLE}
    include_view_lineage: true
    include_table_lineage: true
    include_usage_stats: true
    include_column_lineage: true
    incremental_lineage: true  # Enable incremental lineage
    stateful_ingestion:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}

This configuration ensures that new lineage information is added without overwriting existing lineage data.
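For your daily-connector / weekly-LLM setup, the two techniques in this thread can be combined in one recipe: incremental lineage on the source so lineage edges are merged, and PATCH semantics on any transformers so tags or other metadata written outside the connector are preserved. A combined sketch, reusing the placeholders from the examples above:

pipeline_name: snowflake-lineage-ingestion
source:
  type: snowflake
  config:
    account_id: ${SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    username: ${SNOWFLAKE_USERNAME}
    password: ${SNOWFLAKE_PASSWORD}
    role: ${SNOWFLAKE_ROLE}
    include_table_lineage: true
    include_column_lineage: true
    incremental_lineage: true  # merge new lineage edges instead of restating them
    stateful_ingestion:
      enabled: true
transformers:
  - type: "simple_add_dataset_tags"
    config:
      tag_urns:
        - "urn:li:tag:NeedsDocumentation"
      semantics: PATCH  # preserve tags added outside this connector
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}

You would then run this recipe on the daily schedule (for example with datahub ingest -c recipe.yaml), while the LLM service continues its weekly writes.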

Would you like more details on this configuration or have any specific questions about implementing it?
