DataHub Snowflake Ingestion Issue with CLI Version 0.13.0 and Later

Original Slack Thread

Hi <@U06TM7M40RK>. We’re running ingestion to datahub from Snowflake. If we’re running ingestion with CLI version: 0.12.1.5 everything runs fine. But since version 0.13.0 it is failing.

[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO - 
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO - Cli report:
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO - {'cli_version': '0.14.0.4',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'cli_entry_location': '/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/__init__.py',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'models_version': 'bundled',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'py_version': '3.12.5 (main, Aug 20 2024, 10:24:55) [GCC 8.5.0 20210514 (Red Hat 8.5.0-22)]',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'py_exec_path': '/data01/airflow/.pyenv/versions/py3_datahub/bin/python',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'os_details': 'Linux-4.18.0-553.8.1.el8_10.x86_64-x86_64-with-glibc2.28',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'mem_info': '1.27 GB',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'peak_memory_usage': '1.27 GB',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'disk_info': {'total': '12.87 GB', 'used': '4.22 GB', 'used_initally': '4.22 GB', 'free': '8.65 GB'},
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'peak_disk_usage': '4.22 GB',
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'thread_count': 3,
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO -  'peak_thread_count': 18}
[2024-09-05, 12:32:45 CEST] {subprocess.py:93} INFO - Source (snowflake) report:
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO - [2024-09-05 12:32:46,206] ERROR    {datahub.entrypoints:218} - Command failed: too many SQL variables
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO - Traceback (most recent call last):
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/entrypoints.py", line 205, in main
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     sys.exit(datahub(standalone_mode=False, **kwargs))
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return self.main(*args, **kwargs)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/click/core.py", line 1078, in main
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     rv = self.invoke(ctx)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -          ^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return _process_result(sub_ctx.command.invoke(sub_ctx))
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return ctx.invoke(self.callback, **ctx.params)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/click/core.py", line 783, in invoke
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return __callback(*args, **kwargs)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/telemetry/telemetry.py", line 462, in wrapper
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     raise e
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/telemetry/telemetry.py", line 411, in wrapper
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     res = func(*args, **kwargs)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -           ^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/cli/ingest_cli.py", line 203, in run
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/3.12.5/lib/python3.12/asyncio/base_events.py", line 687, in run_until_complete
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return future.result()
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -            ^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/cli/ingest_cli.py", line 187, in run_ingestion_and_check_upgrade
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     ret = await ingestion_future
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -           ^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/cli/ingest_cli.py", line 145, in run_pipeline_to_completion
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     ret = pipeline.pretty_print_summary(warnings_as_failure=strict_warnings)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/ingestion/run/pipeline.py", line 678, in pretty_print_summary
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     click.echo(self.source.get_report().as_string())
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/ingestion/api/report.py", line 83, in as_string
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return pprint.pformat(self.as_obj(), width=150, sort_dicts=False)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -                           ^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/ingestion/api/source.py", line 321, in as_obj
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     **super().as_obj(),
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -       ^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/ingestion/api/report.py", line 76, in as_obj
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     str(key): Report.to_pure_python_obj(value)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/ingestion/api/report.py", line 42, in to_pure_python_obj
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return some_val.as_obj()
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -            ^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/ingestion/api/report.py", line 74, in as_obj
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     self.compute_stats()
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/sql_parsing/sql_parsing_aggregator.py", line 247, in compute_stats
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     self.num_unique_query_fingerprints = len(self._aggregator._query_map)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/utilities/file_backed_collections.py", line 388, in __len__
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     cursor = self._conn.execute(
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -              ^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -   File "/data01/airflow/.pyenv/versions/py3_datahub/lib/python3.12/site-packages/datahub/utilities/file_backed_collections.py", line 117, in execute
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -     return self.conn.execute(sql, parameters)
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO -            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2024-09-05, 12:32:46 CEST] {subprocess.py:93} INFO - sqlite3.OperationalError: too many SQL variables
[2024-09-05, 12:32:52 CEST] {subprocess.py:97} INFO - Command exited with return code 1```
What could be causing this?

The error message sqlite3.OperationalError: too many SQL variables indicates that the ingestion process is hitting a limit on the number of SQL variables that can be used in a single SQLite query. This issue seems to have started occurring after upgrading to DataHub CLI version 0.13.0 or later.

Possible Causes and Solutions

  1. Increased Data Volume: The newer versions of DataHub might be processing more data or additional metadata, leading to more SQL variables being used in the ingestion process.

  2. Changes in DataHub Code: There might be changes in how DataHub handles SQL queries internally, leading to more variables being used in a single query.

Steps to Resolve

  1. Upgrade to the Latest Version: Ensure you are using the latest version of DataHub CLI. Sometimes, issues like these are fixed in subsequent releases. According to <Slack Slack conversation>, upgrading to the latest version can resolve some issues.

  2. Reduce Data Volume: Try to reduce the amount of data being ingested in a single run. You can do this by:

    • Limiting the number of tables or schemas being ingested.
    • Breaking down the ingestion process into smaller chunks.
  3. Adjust Configuration: Modify the ingestion configuration to use fewer SQL variables. This might involve changing how data is batched or processed.

  4. Check for Known Issues: Look for any known issues or discussions related to this error in the DataHub community forums or GitHub issues. For example, <Issues · datahub-project/datahub · GitHub GitHub issue> discusses similar problems with Snowflake ingestion.

Example Configuration Adjustment

If you are using a configuration similar to the one mentioned in the provided data sources, you might want to adjust the max_threads or other parameters to see if it helps:

source:
  type: snowflake
  config:
    account_id: <account-id>
    include_table_lineage: true
    include_view_lineage: true
    include_tables: true
    include_views: true
    profiling:
      enabled: true
      profile_table_level_only: true
    stateful_ingestion:
      enabled: true
    warehouse: DATAHUB_WH
    username: datahub_user
    role: DATAHUB_READER
    database_pattern:
      allow:
        - PATTERN
    password: '${SNOWFLAKE_DATAHUB_USER_PASSWORD}'
sink:
  type: datahub-rest
  config:
    server: 'http://datahub-datahub-gms:8080/'
    max_threads: 1

Additional Resources

Would you like more detailed steps or further assistance on any specific part of the process?

Sources: