Snowflake ingestion lineage extraction fails due to the following error:
[2024-05-29 06:12:51,707] WARNING {datahub.ingestion.source.snowflake.snowflake_utils:250} - external_lineage => Populating table external lineage from Snowflake failed due to error 100082 (22000): Max LOB size (16777216) exceeded, actual size of parsed column is 188464892.
Seems like the query projection exceeds the allowed field size.
This was my initial run for a new snowflake an I therefore enabled ignore_start_time_lineage
Will try again without that argument.
Running on Datahub v0.13.2
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (e.g. 0.12.0)
- What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a
or
so other folks in the Community know if you’re still looking for help on the topic!
To address the issue of Snowflake ingestion lineage extraction failing due to the “Max LOB size exceeded” error, here is a summary and step-by-step reasoning:
Summary
The error occurs because the query projection exceeds the allowed field size in Snowflake. Disabling the ignore_start_time_lineage
argument should help mitigate this issue by limiting the amount of data processed during lineage extraction.
Step-by-Step Reasoning
-
Error Analysis:
- The error message indicates that the size of the parsed column exceeds the maximum allowed LOB size in Snowflake.
- This typically happens when a large amount of data is being processed in a single query.
-
Configuration Adjustment:
- The
ignore_start_time_lineage
argument, when enabled, processes all historical data, which can lead to large result sets. - Disabling this argument will limit the lineage extraction to a smaller, more manageable dataset.
- The
-
Recommended Configuration:
- Set
ignore_start_time_lineage
toFalse
to limit the lineage extraction to recent data. - Optionally, you can also set
use_legacy_lineage_method
toTrue
to use a more efficient method for lineage extraction.
- Set
Example Configuration
Here is an example of how you can adjust your configuration:
source:
type: snowflake
config:
account_id: <your_account_id>
username: <your_username>
password: <your_password>
warehouse: <your_warehouse>
role: <your_role>
include_table_lineage: true
include_view_lineage: true
ignore_start_time_lineage: false # Disable to limit data processed
use_legacy_lineage_method: true # Optional: Use legacy method for efficiency
profiling:
enabled: true
stateful_ingestion:
enabled: true
sink:
type: datahub-rest
config:
server: 'http://datahub-datahub-gms:8080'
token: <your_token>
Additional Steps
- Run the Ingestion: Execute the ingestion process with the updated configuration.
- Monitor Logs: Check the logs to ensure that the ingestion process completes without errors.
Sources:
Thanks bot, that’s what I suggested…and it is just a workaround, not really a solution.
I mean, that’s okay for now, but if someone from the team could pick this up and maybe find a more efficient method to extract the entire lineage in an upcoming version, that’d be even better. Thanks.