Troubleshooting Snowflake Lineage Extraction Error: "Max LOB size exceeded"

Original Slack Thread

Snowflake ingestion lineage extraction fails due to the following error:
[2024-05-29 06:12:51,707] WARNING {datahub.ingestion.source.snowflake.snowflake_utils:250} - external_lineage => Populating table external lineage from Snowflake failed due to error 100082 (22000): Max LOB size (16777216) exceeded, actual size of parsed column is 188464892.
Seems like the query projection exceeds the allowed field size.
This was my initial run for a new snowflake an I therefore enabled ignore_start_time_lineage
Will try again without that argument.
Running on Datahub v0.13.2

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To address the issue of Snowflake ingestion lineage extraction failing due to the “Max LOB size exceeded” error, here is a summary and step-by-step reasoning:

Summary
The error occurs because the query projection exceeds the allowed field size in Snowflake. Disabling the ignore_start_time_lineage argument should help mitigate this issue by limiting the amount of data processed during lineage extraction.

Step-by-Step Reasoning

  1. Error Analysis:

    • The error message indicates that the size of the parsed column exceeds the maximum allowed LOB size in Snowflake.
    • This typically happens when a large amount of data is being processed in a single query.
  2. Configuration Adjustment:

    • The ignore_start_time_lineage argument, when enabled, processes all historical data, which can lead to large result sets.
    • Disabling this argument will limit the lineage extraction to a smaller, more manageable dataset.
  3. Recommended Configuration:

    • Set ignore_start_time_lineage to False to limit the lineage extraction to recent data.
    • Optionally, you can also set use_legacy_lineage_method to True to use a more efficient method for lineage extraction.

Example Configuration
Here is an example of how you can adjust your configuration:

source:
  type: snowflake
  config:
    account_id: <your_account_id>
    username: <your_username>
    password: <your_password>
    warehouse: <your_warehouse>
    role: <your_role>
    include_table_lineage: true
    include_view_lineage: true
    ignore_start_time_lineage: false  # Disable to limit data processed
    use_legacy_lineage_method: true   # Optional: Use legacy method for efficiency
    profiling:
      enabled: true
    stateful_ingestion:
      enabled: true
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080'
    token: <your_token>

Additional Steps

  • Run the Ingestion: Execute the ingestion process with the updated configuration.
  • Monitor Logs: Check the logs to ensure that the ingestion process completes without errors.

Sources:

  • <Slack Slack Message 1>
  • <Slack Slack Message 2>
  • <Slack Slack Message 3> :+1: 0 button :-1: 0 button :bulb: Hint: Mention @runllm in the thread for followups.

Thanks bot, that’s what I suggested…and it is just a workaround, not really a solution.
I mean, that’s okay for now, but if someone from the team could pick this up and maybe find a more efficient method to extract the entire lineage in an upcoming version, that’d be even better. Thanks.