Troubleshooting Tips for Hanging DBT Cloud Ingestion Jobs

Original Slack Thread

<@U06TM7M40RK> When running several dbt Cloud ingestion jobs simultaneously, we often experience the ingestion jobs hanging, i.e. never timing out, being cancelled, or completing

If multiple dbt Cloud ingestion jobs run simultaneously and some of them hang, there are several steps you can take to troubleshoot and potentially resolve the issue:

  1. Resource Allocation: Ensure that the datahub-actions pod has sufficient CPU and memory. Inadequate resources are a common cause of hung ingestion jobs. You can verify and adjust the allocation with Kubernetes commands or your Helm values (see the first sketch after this list).

  2. Timeout Configuration: You can set the timeout_sec parameter on the datahub-rest sink in your ingestion recipe, but note that this is a per-HTTP-request timeout, not a timeout for the job as a whole. To bound the entire ingestion run, set activeDeadlineSeconds on the Kubernetes CronJob or Job that launches it (see the second sketch after this list).

  3. Cancel Stuck Jobs: If an ingestion job hangs, you can stop it manually by identifying the process ID (PID) inside the datahub-actions container and killing it, using Docker or kubectl depending on how DataHub is deployed (see the commands after this list).

  4. Check Logs for Issues: After stopping a stuck ingestion, inspect the logs of the datahub-actions container to understand why it hung (the log commands are included in the same sketch).

  5. Stagger Ingestion Runs: To avoid overloading the system, stagger the schedules of your ingestion runs so they do not overlap (see the schedule sketch after this list).

  6. Use Kafka Sink for Scalability: If you are using the datahub-rest sink and hitting scalability limits, consider switching to the Kafka sink (see the recipe sketch after this list).
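
For point 1, a minimal sketch of raising the datahub-actions resources, assuming a Helm-based Kubernetes deployment. The values keys follow the standard DataHub Helm chart but may differ in your chart version, and the numbers are only illustrative:

      # values.yaml excerpt (keys assume the standard DataHub Helm chart)
      acryl-datahub-actions:
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
          limits:
            cpu: "2"
            memory: 4Gi

You can check whether the pod is actually hitting its limits with kubectl top pod and kubectl describe pod on the actions pod (the pod name depends on your release).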
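
For point 2, a sketch showing both timeout levels. The recipe excerpt assumes the dbt Cloud source and the datahub-rest sink; timeout_sec only bounds individual HTTP requests, while activeDeadlineSeconds on the Kubernetes Job/CronJob caps the whole run. Names, image tag, schedule, and durations are illustrative:

      # Ingestion recipe excerpt -- per-request timeout on the REST sink
      source:
        type: dbt-cloud
        config: {}                        # dbt Cloud connection details elided
      sink:
        type: datahub-rest
        config:
          server: "http://datahub-gms:8080"
          timeout_sec: 120                # per-HTTP-request timeout, not a whole-job timeout

      # CronJob excerpt -- hard cap on the whole ingestion run
      apiVersion: batch/v1
      kind: CronJob
      metadata:
        name: dbt-cloud-ingestion         # illustrative name
      spec:
        schedule: "0 2 * * *"
        jobTemplate:
          spec:
            activeDeadlineSeconds: 3600   # Kubernetes terminates the job after 1 hour
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: ingestion
                    image: acryldata/datahub-ingestion:head   # pin to your DataHub version
                    args: ["ingest", "-c", "/recipes/dbt_cloud.yml"]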
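
For points 3 and 4, hedged command sketches. The container, deployment, and process names are illustrative and depend on whether you run the Docker quickstart or a Kubernetes deployment:

      # Docker: find the stuck ingestion process inside the datahub-actions container and kill it
      docker exec datahub-actions ps -ef | grep 'datahub ingest'   # note the PID
      docker exec datahub-actions kill <PID>

      # Then review the logs to see where it got stuck
      docker logs --tail 500 datahub-actions

      # Kubernetes equivalents (adjust the deployment name to your release)
      kubectl exec deploy/datahub-acryl-datahub-actions -- ps -ef | grep 'datahub ingest'
      kubectl exec deploy/datahub-acryl-datahub-actions -- kill <PID>
      kubectl logs deploy/datahub-acryl-datahub-actions --tail=500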
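
For point 5, one simple way to stagger recipe-based runs is to offset the CronJob schedules; scheduled UI ingestion can be offset the same way from each source's schedule settings. Names and times below are illustrative, and the jobTemplate is omitted for brevity:

      # Two ingestion CronJobs with offset start times so the runs don't overlap
      ---
      apiVersion: batch/v1
      kind: CronJob
      metadata:
        name: dbt-cloud-ingestion-project-a
      spec:
        schedule: "0 2 * * *"      # 02:00
        # jobTemplate omitted for brevity
      ---
      apiVersion: batch/v1
      kind: CronJob
      metadata:
        name: dbt-cloud-ingestion-project-b
      spec:
        schedule: "30 2 * * *"     # 02:30
        # jobTemplate omitted for brevity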
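
For point 6, a recipe excerpt that switches the sink from datahub-rest to datahub-kafka; the broker and schema registry hostnames are illustrative and must match your deployment:

      # Recipe excerpt -- emit metadata to Kafka instead of the REST endpoint
      sink:
        type: datahub-kafka
        config:
          connection:
            bootstrap: "prerequisites-kafka:9092"
            schema_registry_url: "http://prerequisites-cp-schema-registry:8081"

Because the Kafka sink writes asynchronously and GMS consumes the events in its own time, slow REST responses from GMS are less likely to stall the ingestion run.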

For more detailed steps on any of these points, please let me know.

<@U06TM7M40RK> Are there any other global configurations of the actions pod that might allow the pod to better utilize a form of ingestion queue?

I'm not able to find any data sources to answer your question. You might want to specify the current configuration settings you are using for the actions pod. If this doesn't help, please reach out to the DataHub support team for further assistance.
