Troubleshooting dbt-cloud Ingestion Errors

Original Slack Thread

Hey y’all,
Wondering if anybody has started to run into issues with their dbt-cloud ingestions? Over the past few days I have started running into the following error:

    comment = " " + comment if comment[0].strip() else comment
IndexError: string index out of range
[2024-05-30 01:54:17,785] DEBUG    {datahub.entrypoints:207} - DataHub CLI version: 0.13.2.3 at /tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/lib/python3.10/site-packages/datahub/__init__.py
[2024-05-30 01:54:17,785] DEBUG    {datahub.entrypoints:210} - Python version: 3.10.13 (main, Jan 17 2024, 06:53:56) [GCC 12.2.0] at /tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/bin/python3 on Linux-6.5.0-1020-aws-x86_64-with-glibc2.36
[2024-05-30 01:54:17,785] DEBUG    {datahub.entrypoints:215} - GMS config {'models': {}, 'patchCapable': True, 'versions': {'acryldata/datahub': {'version': 'v0.13.3rc1', 'commit': '648fd459eb6b981a8283084fde4942fff6c9d6aa'}}, 'managedIngestion': {'defaultCliVersion': '0.13.3rc1', 'enabled': True}, 'statefulIngestionCapable': True, 'supportsImpactAnalysis': True, 'timeZone': 'GMT', 'telemetry': {'enabledCli': True, 'enabledIngestion': False}, 'datasetUrnNameCasing': False, 'retention': 'true', 'datahub': {'serverType': 'quickstart'}, 'noCode': 'true'}
I'm unable to pinpoint the cause of the error.
I am currently running the application via Docker on version 0.13.1, and the UI ingestion for the dbt connector is using CLI version 0.13.2.3.
I ingested the data via the UI and am integrating this with Snowflake as well.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?



To help you more effectively, could you please provide the following information?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.13.1)?
  3. What data source(s) are you integrating with DataHub? (e.g., Snowflake, BigQuery)?

Based on the error message you provided, it seems like there might be an issue with the sqlglot library used during the ingestion process. This could be related to a specific version of the library or a particular configuration in your ingestion setup.
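
For context, the failing line indexes the first character of a comment string without first checking that the string is non-empty, so an empty comment raises the IndexError. Here is a minimal sketch of the failure and a guarded variant; the function names are invented for illustration, and the real logic lives inside the library:

def pad_comment(comment: str) -> str:
    # Mirrors the failing line: comment[0] is evaluated before any
    # length check, so an empty string raises
    # IndexError: string index out of range
    return " " + comment if comment[0].strip() else comment

def pad_comment_safe(comment: str) -> str:
    # Guarded variant: short-circuit on the empty string first.
    return " " + comment if comment and comment[0].strip() else comment

print(pad_comment_safe(""))      # '' (no crash)
print(pad_comment_safe("note"))  # ' note'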

According to a related <Slack thread>, similar issues have been resolved by upgrading the CLI version or adjusting the configuration.

Would you like more detailed steps on how to troubleshoot this issue?

Sources:

  • <Slack thread>

<@U06TM7M40RK>

  1. I am using UI for dbt ingestion.
  2. DataHub CLI version: 0.13.1.3
  3. Integrating with Snowflake + dbt

Thank you for providing the details. Based on your description and the error message, it seems like there might be an issue with the sqlglot library during the dbt ingestion process. Here are some steps to help troubleshoot and resolve the issue:

Debugging Steps

  1. Check Logs for Detailed Errors:

    • Review the logs from the DataHub ingestion executor to identify any specific error messages. This can provide insights into why the ingestion is failing.
    • You can access the logs by clicking on the Details button of the failed ingestion run in the UI.
  2. Upgrade DataHub CLI:

    • Ensure you are using the latest version of the DataHub CLI. Sometimes, issues are resolved in newer versions. You can upgrade the CLI by running:
      pip install --upgrade acryl-datahub
      
  3. Check sqlglot Library Version:

    • The error message points into the sqlglot library. Ensure that you are using a compatible version of it; you might need to pin the version in your environment. A version-check sketch follows this list.
  4. Configuration Review:

    • Double-check your ingestion recipe for typos or unsupported options; a sample recipe is shown in the next section.
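
To confirm which sqlglot build the ingestion virtual environment actually resolved, you can probe it directly. Note that DataHub may ship its own fork of sqlglot, so both distribution names below are guesses to check for rather than guarantees:

from importlib import metadata

# Probe the common distribution names; adjust for your environment.
for dist in ("sqlglot", "acryl-sqlglot"):
    try:
        print(dist, metadata.version(dist))
    except metadata.PackageNotFoundError:
        print(dist, "is not installed in this environment")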

Example Configuration

Here is an example of how you might configure your ingestion recipe for dbt and Snowflake:

source:
  type: dbt-cloud
  config:
    max_threads: 1
    metadata_endpoint: 'https://my-metadata-cloud.com/graphql'
    project_id: '3'
    job_id: '82'
    target_platform: snowflake
    stateful_ingestion:
      enabled: true
    account_id: '9999'
    token: MYDBTToken
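
If you run this recipe from the CLI instead of the UI, save it to a file (for example dbt_cloud_recipe.yaml, a name chosen here for illustration) and execute:

  datahub ingest -c dbt_cloud_recipe.yaml

Note that not every key shown above is guaranteed to exist in every connector release (max_threads in particular may not be supported everywhere); check the dbt-cloud source documentation for your CLI version.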


Would you like more detailed steps on any of these points?


<@U06TM7M40RK> Now running into these errors

exception calling callback for <Future at 0x77d84c36a290 state=finished returned NoneType>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 342, in _invoke_callbacks
    callback(self)
  File "/tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/lib/python3.10/site-packages/datahub/utilities/advanced_thread_executor.py", line 119, in _system_done_callback
    raise e
  File "/tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/lib/python3.10/site-packages/datahub/utilities/advanced_thread_executor.py", line 109, in _system_done_callback
    self._submit_nowait(key, fn, args, kwargs, user_done_callback)
  File "/tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/lib/python3.10/site-packages/datahub/utilities/advanced_thread_executor.py", line 96, in _submit_nowait
    future = self._executor.submit(fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 169, in submit
    raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown
exception calling callback for <Future at 0x77d84c36b550 state=finished returned NoneType>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/_base.py", line 342, in _invoke_callbacks
    callback(self)
  File "/tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/lib/python3.10/site-packages/datahub/utilities/advanced_thread_executor.py", line 119, in _system_done_callback
    raise e
  File "/tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/lib/python3.10/site-packages/datahub/utilities/advanced_thread_executor.py", line 109, in _system_done_callback
    self._submit_nowait(key, fn, args, kwargs, user_done_callback)
  File "/tmp/datahub/ingest/venv-dbt-cloud-c7954d619fbe9b0e/lib/python3.10/site-packages/datahub/utilities/advanced_thread_executor.py", line 96, in _submit_nowait
    future = self._executor.submit(fn, *args, **kwargs)
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 169, in submit
    raise RuntimeError('cannot schedule new futures after '
RuntimeError: cannot schedule new futures after interpreter shutdown

The error RuntimeError: cannot schedule new futures after interpreter shutdown means that new tasks were still being submitted to a thread pool while the Python interpreter was already shutting down. This can happen if the ingestion run takes too long to wind down, or if worker callbacks keep scheduling follow-up work after the main thread has begun to exit.
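
For intuition, here is a minimal, self-contained repro of that exception, unrelated to DataHub's actual code paths. On recent Python versions (3.9+), submitting to a ThreadPoolExecutor once the interpreter has begun shutting down raises exactly this RuntimeError:

import atexit
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)

def submit_too_late():
    # atexit handlers run after the futures machinery has been marked
    # as shut down, so this raises:
    # RuntimeError: cannot schedule new futures after interpreter shutdown
    executor.submit(print, "this never runs")

atexit.register(submit_too_late)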

Here are some steps to troubleshoot and resolve this issue:

Steps to Resolve

  1. Increase Timeout and Retry Settings:

    • Ensure that your ingestion process has sufficient time to complete. You can increase the timeout and retry settings in your ingestion configuration.
  2. Reduce Concurrency:

    • Reduce the number of concurrent threads or processes to avoid overwhelming the system. You can do this by setting the max_threads parameter in your ingestion configuration.
  3. Check Resource Limits:

    • Ensure that your system has enough resources (CPU, memory) to handle the ingestion process. You might need to allocate more resources to the container running the ingestion.
  4. Upgrade DataHub CLI:

    • Ensure you are using the latest version of the DataHub CLI. Sometimes, issues are resolved in newer versions. You can upgrade the CLI by running:
      pip install --upgrade acryl-datahub
      
  5. Review Advanced Thread Executor Configuration:

    • The traceback originates in advanced_thread_executor.py, which is internal to the DataHub CLI rather than something you configure directly, so the most practical fixes are upgrading the CLI and lowering concurrency. A defensive pattern illustrating the underlying race is sketched after this list.
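
The sketch below is an illustrative pattern, not DataHub's actual implementation; the SafeExecutor name is invented here. The idea is that a done-callback which re-submits work can fire while the interpreter is tearing down, so a wrapper can drop that work instead of letting the RuntimeError propagate:

from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional

class SafeExecutor:
    """Wraps ThreadPoolExecutor and drops work submitted during shutdown."""

    def __init__(self, max_workers: int = 1) -> None:
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._closed = False

    def submit(self, fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Optional[Future]:
        if self._closed:
            return None  # already shutting down; silently drop the task
        try:
            return self._executor.submit(fn, *args, **kwargs)
        except RuntimeError:
            # Raised when the executor or interpreter is shutting down.
            self._closed = True
            return None

    def shutdown(self) -> None:
        self._closed = True
        self._executor.shutdown(wait=True)

# Usage: work is dropped instead of crashing once shutdown has started.
ex = SafeExecutor(max_workers=2)
ex.submit(print, "hello")
ex.shutdown()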

Example Configuration

Here is an example of how you might configure your ingestion recipe with reduced concurrency and increased timeout:

source:
  type: dbt-cloud
  config:
    max_threads: 1
    metadata_endpoint: 'https://my-metadata-cloud.com/graphql'
    project_id: '3'
    job_id: '82'
    target_platform: snowflake
    stateful_ingestion:
      enabled: true
    account_id: '9999'
    token: MYDBTToken
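    # NOTE: the two keys below are illustrative; not every dbt-cloud
    # connector release supports them, so check the source docs first.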
    timeout: 600  # Increase timeout to 600 seconds
    retries: 3    # Set the number of retries to 3


Would you like more detailed steps on any of these points?
