Resolving Snowflake Max LOB Size Error and Lineage Extraction Optimization

Original Slack Thread

Hi, I am facing this error:

    <class 'snowflake.connector.errors.ProgrammingError'>: 100082 (22000): Max LOB size (16777216) exceeded, actual size of parsed column is 18431556

I have tried the suggested configuration flag to fix this, i.e., as part of the config:

    use_legacy_lineage_method: true   # Optional: use legacy method for lineage

but the error doesn't seem to resolve. Need help with this.

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g., <@U06TM7M40RK>) to have it continue generating responses!

The error you’re encountering, “Max LOB size exceeded”, typically occurs when the size of the data being processed exceeds the maximum allowed size for a Large Object (LOB) in Snowflake, which is 16 MB (the 16777216 in your error message). This is a common issue when dealing with large datasets or complex queries during metadata ingestion.

Based on the information from the linked Slack conversation, here are some steps and considerations to address this issue:

  1. Configuration Adjustments:

    • You have already set ignore_start_time_lineage: false and use_legacy_lineage_method: true. The first limits lineage extraction to the configured time window instead of the full query history; the second switches to the older lineage extraction path.
  2. Data Volume:

    • Consider reducing the amount of data processed in a single run by limiting ingestion to specific databases, schemas, or tables, or by breaking the ingestion into smaller chunks (a recipe sketch follows this list).
  3. Check for Temporary Tables:

    • If you are using any ETL tools that create temporary tables with varying names, this could lead to large result sets. Ensure that such tables are not included in the lineage extraction process.
  4. Review Query Complexity:

    • Simplify the queries used for lineage extraction if possible. Complex queries with large result sets can contribute to exceeding the LOB size limit.
  5. Monitor Logs:

    • Keep an eye on the logs for any additional errors or warnings that might provide more context on what is causing the large data size.
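
For step 2, the scoping is done in the recipe itself via allow/deny regex patterns. Below is a minimal sketch, assuming the standard Snowflake source options; the credentials and the database/schema names are placeholders, so substitute your own:

    source:
      type: snowflake
      config:
        account_id: <your_account_id>
        username: <your_username>
        password: <your_password>
        warehouse: <your_warehouse>
        role: <your_role>
        include_table_lineage: true
        ignore_start_time_lineage: false
        # Restrict each run to a subset of the account so that less
        # query history has to be parsed per run.
        database_pattern:
          allow:
            - '^ANALYTICS_DB$'   # hypothetical database name
        schema_pattern:
          allow:
            - '^REPORTING$'      # hypothetical schema name
          deny:
            - '^STAGING_.*'      # hypothetical staging schemas to skip

Running one recipe per database, rather than a single recipe for the whole account, is another way to split the work into smaller chunks.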

If these steps do not resolve the issue, it might be helpful to provide more specific details about your setup or any additional error messages you are seeing. Would you like more detailed steps or further assistance on any specific part of the process?

Sources:

  • Slack conversation

<@U06TM7M40RK> Step 1: ignore_start_time_lineage: false and use_legacy_lineage_method: true are set. Step 2: data volume is limited by reducing the schemas and databases. How do I filter out temporary tables and simplify the queries used for lineage extraction?

To filter out temporary tables and simplify queries used for lineage extraction in DataHub, you can follow these steps:

  1. Filter Temporary Tables:

    Use the temporary_tables_pattern option of the Snowflake source to mark tables matching a regex (for example, transient tables created by ETL tools) as temporary, so they are excluded from the emitted lineage.

    Example configuration:

    source:
      type: snowflake
      config:
        account_id: <your_account_id>
        username: <your_username>
        password: <your_password>
        warehouse: <your_warehouse>
        role: <your_role>
        include_table_lineage: true
        include_view_lineage: true
        ignore_start_time_lineage: false
        use_legacy_lineage_method: true
        temporary_tables_pattern:  # regex patterns for tables to treat as temporary
          - '.*_temp.*'  # Example pattern to exclude tables with '_temp' in their name
    
  2. Simplify Queries:

    • Review the queries used for lineage extraction and identify any that are unnecessarily complex or that return large result sets.
    • Consider breaking down complex queries into smaller, more manageable parts, or using more targeted queries that focus on specific tables or schemas.
    • If possible, avoid using wildcard selections (e.g., SELECT *) and instead specify only the necessary columns.
  3. Use Efficient Lineage Methods:

    • Ensure that you are using the most efficient lineage extraction method available. The use_legacy_lineage_method flag selects the older extraction path; narrowing the lineage time window can shrink result sizes further (see the sketch after this list).
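
Because ignore_start_time_lineage is false, the connector only scans query history from the configured start time onward, so narrowing that window directly reduces how much lineage data is parsed. A minimal sketch, assuming your connector version exposes the standard start_time/end_time window; the timestamps are placeholders, and include_column_lineage may not exist on older versions:

    source:
      type: snowflake
      config:
        # ...connection settings as in the example above...
        ignore_start_time_lineage: false
        use_legacy_lineage_method: true
        # Narrow the window of query history scanned for lineage;
        # a smaller window generally means smaller parsed results.
        start_time: '2024-01-01T00:00:00Z'  # placeholder timestamp
        end_time: '2024-01-08T00:00:00Z'    # placeholder timestamp
        # If your connector version supports it, disabling column-level
        # lineage also shrinks the parsed payload considerably.
        include_column_lineage: false

If the error persists even with a narrow window, running the recipe per schema over several smaller windows is a further way to chunk the work.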

By implementing these strategies, you should be able to reduce the data volume and complexity of queries, which can help mitigate the “Max LOB size exceeded” error.

Would you like more detailed guidance on any specific part of this process?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><

The issue is resolved.