Benefits and Downsides of Setting `incremental_lineage: true` in Snowflake Data Ingestion

Original Slack Thread

Hi, I saw that Snowflake ingestion has an incremental_lineage option, which defaults to false. Could anyone explain the scenarios in which we’d want to flip it to true?

<@U06TM7M40RK> Can you help with the questions in this thread?


In short, the incremental_lineage option in Snowflake ingestion determines whether lineage is emitted incrementally (added to what DataHub already holds) or restated (replaced) on each run. Setting it to true helps when you want to avoid overwriting existing lineage with each ingestion run, thereby preserving previously recorded lineage.

Detailed Explanation

  1. Default Behavior (incremental_lineage: false):

    • When incremental_lineage is set to false, each ingestion run will restate the lineage information. This means that the lineage data is completely replaced with the new data from the current ingestion run.
    • This can be useful if you want to ensure that the lineage information is always up-to-date and reflects the current state of your data ecosystem.
  2. Incremental Lineage (incremental_lineage: true):

    • When incremental_lineage is set to true, the ingestion process will emit lineage data incrementally. This means that new lineage information is added to the existing data without replacing it.
    • This is particularly useful when you want to preserve the history of lineage changes over time. For example, you may have manual edits or additional lineage information that you do not want subsequent ingestion runs to overwrite.
    • It can also help in cases where you have complex data pipelines with temporary tables or intermediate steps that should not be lost during re-ingestion.
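
The difference between the two modes described above can be sketched with plain Python sets. This is only a toy model and not DataHub's implementation: `restate_lineage` and `patch_lineage` are hypothetical helper names, and upstream datasets are represented as simple strings.

```python
from __future__ import annotations

# Toy model of the two modes. The function names are hypothetical and
# upstream datasets are plain strings, not real DataHub URNs.


def restate_lineage(stored: set[str], discovered: set[str]) -> set[str]:
    """Model of incremental_lineage: false. The new run replaces what was stored."""
    return set(discovered)


def patch_lineage(stored: set[str], discovered: set[str]) -> set[str]:
    """Model of incremental_lineage: true. New upstreams are added to what was stored."""
    return stored | discovered


# Lineage already held for one dataset, including an edge a user added by hand.
stored = {"db.raw.orders", "db.manual.added_by_user"}
# Lineage reported by the current ingestion run.
discovered = {"db.raw.orders", "db.raw.customers"}

restated = restate_lineage(stored, discovered)
patched = patch_lineage(stored, discovered)
```

In this model `restated` loses the hand-added edge, while `patched` keeps it alongside the newly discovered upstream.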

Example Configuration

Here is an example of how you might configure the Snowflake ingestion with incremental_lineage set to true:

pipeline_name: snowflake-lineage-ingestion
source:
  type: snowflake
  config:
    account_id: ${SNOWFLAKE_ACCOUNT_ID}
    warehouse: ${SNOWFLAKE_WAREHOUSE}
    username: ${SNOWFLAKE_USERNAME}
    password: ${SNOWFLAKE_PASSWORD}
    role: ${SNOWFLAKE_ROLE}
    include_view_lineage: true
    include_table_lineage: true
    include_usage_stats: true
    include_column_lineage: true
    incremental_lineage: true
    stateful_ingestion:
      enabled: true
sink:
  type: "datahub-rest"
  config:
    server: ${DATAHUB_GMS_ENDPOINT}

When to Use Incremental Lineage

  • Preserving Manual Edits: If you manually enrich lineage information and want to ensure these edits are not overwritten by subsequent ingestion runs.
  • Complex Pipelines: If your data pipelines involve temporary tables or intermediate steps that should be preserved in the lineage history.
  • Historical Analysis: If you need to maintain a history of lineage changes for auditing or analysis purposes.

OK, so it seems like there are several benefits to setting incremental_lineage: true … are there any downsides?

<@U06TM7M40RK> are there any downsides to enabling incremental lineage for Snowflake ingestion?

Yes, while enabling incremental_lineage has several benefits, there are also some potential downsides to consider:

Potential Downsides of Enabling incremental_lineage

  1. Increased Complexity:

    • Managing incremental lineage can add complexity to your data governance processes. You need to ensure that the incremental updates are accurate and do not introduce inconsistencies.
  2. Storage Overhead:

    • Incremental lineage can lead to an increase in the amount of metadata stored, as it retains historical lineage information. This can result in higher storage costs and potentially slower performance when querying lineage data.
  3. Potential for Stale Data:

    • If not managed properly, incremental lineage can lead to stale or outdated lineage information being retained. This can happen if the lineage data is not regularly cleaned or if there are errors in the ingestion process.
  4. Manual Intervention:

    • In some cases, you may need to manually intervene to correct or update lineage information. This can be time-consuming and may require additional resources.
  5. Complexity in Debugging:

    • Debugging issues related to lineage can become more complex when dealing with incremental updates. It may be harder to trace the source of an issue when multiple incremental updates have been applied over time.
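
To make the stale-data concern concrete, here is a small sketch that assumes incremental mode only ever adds edges and never removes them (an assumption for illustration, not a confirmed description of DataHub's behavior). The run data and the helper names `accumulate` and `stale_upstreams` are hypothetical.

```python
from __future__ import annotations

# Hypothetical sequence of ingestion runs: each entry is the set of upstream
# tables reported for one dataset on that run.
runs = [
    {"db.raw.orders", "db.tmp.orders_stage"},
    {"db.raw.orders"},  # the staging table is no longer an upstream
    {"db.raw.orders", "db.raw.customers"},
]


def accumulate(runs: list[set[str]]) -> set[str]:
    """Apply each run additively, assuming nothing is ever removed."""
    stored: set[str] = set()
    for observed in runs:
        stored |= observed
    return stored


def stale_upstreams(stored: set[str], observed: set[str]) -> set[str]:
    """Edges that are stored but were not seen in the latest run."""
    return stored - observed


stored = accumulate(runs)
stale = stale_upstreams(stored, runs[-1])
```

Under this assumption the dropped staging table stays in `stored` indefinitely. Note that a simple set difference like `stale_upstreams` cannot tell a truly stale edge from one a user added manually, which is part of why cleanup may need manual intervention.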

When to Avoid Incremental Lineage

  • Simple Data Pipelines: If your data pipelines are relatively simple and do not require the preservation of historical lineage information, you may not need incremental lineage.
  • Resource Constraints: If you have limited storage or computational resources, the additional overhead of managing incremental lineage may not be justified.
  • Frequent Full Refreshes: If you frequently perform full refreshes of your data and lineage information, the benefits of incremental lineage may be diminished.

<@U04583XMT5Z> I’d be curious to hear whether this was a helpful follow-up to your latest question! ^^

Thanks in advance for going on this AI journey with us :bestfriendsclub:

<@U0121TRV0FL> the answers seem reasonable, but I don’t know enough about incremental_lineage to verify they are accurate. I can’t find any documentation about the feature… briefly looking through the code, it looks like it patches existing lineage for a Dataset with newly discovered lineage from Snowflake. I can see the connection then to the chatbot’s answers, where enabling this feature means we’ll start storing historical lineage… I guess there is no mechanism to delete lineage that no longer exists. If that’s the case then we probably don’t want to enable it… what we would really want is to be able to restate just the automatically collected SF lineage while maintaining any user-added lineage
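
The behavior wished for in the last message (restating only the automatically collected lineage while keeping user-added lineage) could be sketched as follows. This is purely hypothetical: it assumes each lineage edge records where it came from, which nothing in this thread confirms DataHub does. `Edge`, `origin`, and `restate_auto_only` are illustrative names only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    upstream: str
    origin: str  # hypothetical marker: "snowflake" (auto-collected) or "manual"


def restate_auto_only(stored: set[Edge], discovered: set[str]) -> set[Edge]:
    """Replace auto-collected edges with the latest run; keep manual edges untouched."""
    manual = {edge for edge in stored if edge.origin == "manual"}
    auto = {Edge(upstream, "snowflake") for upstream in discovered}
    return manual | auto


stored = {
    Edge("db.raw.orders", "snowflake"),
    Edge("db.tmp.orders_stage", "snowflake"),  # no longer reported by Snowflake
    Edge("db.sheets.budget", "manual"),  # added by a user
}
discovered = {"db.raw.orders", "db.raw.customers"}

result = restate_auto_only(stored, discovered)
```

Here the stale auto-collected edge is dropped, the new upstream is added, and the user-added edge survives.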