<@U06TM7M40RK> I’m testing out ingestion using the datahub-actions container running in a Kubernetes cluster. The ingestion_executor is configured to be triggered via the Kafka Source, but I noticed that if the container where an existing ingestion execution is running gets killed or pod gets evicted, the new datahub-actions container never picks up the job to either retry or resume. I would expect it to be restarted by a new consumer. Is there some configuration that I need to do to enable this behaviour?
It seems likely that the behavior you’re experiencing is related to how the datahub-actions container handles ingestion tasks and the lack of automatic retry or resumption of tasks when a container is killed or evicted. Based on the available data sources, there are a few considerations and potential steps you might take to address this:
- Ingestion Executor Configuration: According to the <https://datahubproject.io/docs/0.14.0/actions/actions/executor|Ingestion Executor documentation>, the executor runs ingestion recipes in a sub-process when an execution request is received. However, if the container is terminated, any in-flight ingestion processing will continue to execute as a subprocess on your system, which means there may be “orphaned” processes that are never marked as “Succeeded” or “Failed” in the UI.
- Handling Orphaned Processes: The documentation suggests that you “Cancel” the ingestion source in the UI once you’ve restarted the Ingestion Executor action. This implies that there is no built-in mechanism to automatically retry or resume tasks if the container is killed.
- Kubernetes Configuration: You might consider configuring your Kubernetes setup to handle pod evictions more gracefully. This could involve setting up pod disruption budgets or using persistent volumes to ensure that state is maintained across pod restarts.
- Custom Retry Logic: If automatic retry or resumption is critical, you may need to implement custom logic to detect when a task has failed and re-submit it. This could be done by monitoring the status of ingestion tasks and using the DataHub API to re-trigger them if necessary.
- Check for Updates: Ensure that you are using the latest versions of DataHub and datahub-actions, as there may have been updates or improvements related to task management and recovery.
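For the Kubernetes point above, a PodDisruptionBudget can reduce voluntary evictions of the actions pod (it does not prevent node-pressure evictions or OOM kills). A minimal sketch; the label selector here is an assumption and must match the labels your Helm chart actually applies:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: datahub-actions-pdb
spec:
  minAvailable: 1   # keep at least one actions pod up during voluntary disruptions
  selector:
    matchLabels:
      app.kubernetes.io/name: acryl-datahub-actions   # assumed label; verify with `kubectl get pods --show-labels`
```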
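The custom-retry idea could be sketched roughly like this. The endpoint, token handling, and the GraphQL field and mutation names (ingestionSource, executionRequests, createIngestionExecutionRequest, startTimeMs) are assumptions based on DataHub’s GraphQL API and should be verified against your deployment’s schema (e.g. via /api/graphiql); only the staleness heuristic itself is plain logic:

```python
import json
import time
import urllib.request

DATAHUB_GMS = "http://localhost:8080"   # assumed GMS endpoint
TOKEN = "<personal-access-token>"       # assumed bearer-token auth
STALE_AFTER_S = 2 * 60 * 60             # treat >2h of RUNNING as orphaned


def is_orphaned(execution, now_ms, stale_after_s=STALE_AFTER_S):
    """Heuristic: a run stuck in RUNNING longer than the threshold is
    assumed orphaned (e.g. its executor pod was evicted)."""
    started = execution.get("startTimeMs") or 0
    return (execution.get("status") == "RUNNING"
            and now_ms - started > stale_after_s * 1000)


def graphql(query, variables):
    """POST a GraphQL request to GMS (endpoint path is an assumption)."""
    body = json.dumps({"query": query, "variables": variables}).encode()
    req = urllib.request.Request(
        f"{DATAHUB_GMS}/api/graphql",
        data=body,
        headers={"Authorization": f"Bearer {TOKEN}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


def retrigger_if_orphaned(source_urn):
    """Re-trigger an ingestion source whose latest run looks orphaned.
    Field and mutation names below are assumptions; check your schema."""
    data = graphql(
        """query($urn: String!) {
             ingestionSource(urn: $urn) {
               executions(start: 0, count: 1) {
                 executionRequests { result { status startTimeMs } }
               }
             }
           }""",
        {"urn": source_urn},
    )
    runs = data["data"]["ingestionSource"]["executions"]["executionRequests"]
    now_ms = int(time.time() * 1000)
    if runs and is_orphaned(runs[0]["result"], now_ms):
        graphql(
            """mutation($urn: String!) {
                 createIngestionExecutionRequest(
                   input: {ingestionSourceUrn: $urn})
               }""",
            {"urn": source_urn},
        )

# Usage (hypothetical URN), e.g. from a cron job or sidecar:
# retrigger_if_orphaned("urn:li:dataHubIngestionSource:my-source")
```

Running something like this on a schedule (a Kubernetes CronJob, for instance) gives you the retry-on-eviction behavior that the executor does not provide out of the box.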
Would you like more detailed steps on any of these points or further assistance on a specific part of the process?
Sources:
- <https://datahubproject.io/docs/0.14.0/actions/actions/executor|Ingestion Executor documentation>
Hi <@U08A5U4NFR7>, how did my answer work out for you? Need any more assistance? You can also tag <@U07FV3UHE4D> to let your fellow community members know you’re still looking for guidance.
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><
This has not resolved my issue. Yes, I can try the Kubernetes Configuration option to limit pod evictions, but they are still inevitable.
<@U06TM7M40RK> How do I implement the custom retry logic, or identify when a task is no longer being processed by the DataHub actions container?
To implement custom retry logic or identify if a task is no longer being processed by the DataHub actions, you can follow these steps:
- Monitor Task Status: You can monitor the status of ingestion tasks using the DataHub UI or API. This will help you identify tasks that are stuck or have failed. According to the <https://github.com/datahub-project/datahub/blob/master/docs/ui-ingestion.md|DataHub UI Ingestion documentation>, you can view the output of each run in the UI for easier debugging. If a task is not progressing, it may be necessary to cancel and restart it.
- Use DataHub API: You can use the DataHub API to programmatically check the status of ingestion tasks. This can be useful for automating the detection of failed or stuck tasks. You can then trigger a retry or resumption of the task using the API.
- Implement a Watchdog Process: Consider implementing a watchdog process that periodically checks the status of ingestion tasks. If a task is detected to be stuck or failed, the watchdog can trigger a retry. This process can be implemented as a separate service or script that interacts with the DataHub API.
- Configure Retry Logic: If you are using a Kafka source, you can configure the retry logic within your Kafka consumer settings. According to the <https://github.com/datahub-project/datahub/blob/master/docs/actions/sources/kafka-event-source.md|Kafka Event Source documentation>, you can configure consumer settings to handle retries and failures.
- Use Failure Mode Configuration: The DataHub Actions framework allows you to configure the failure mode for actions. According to the <https://github.com/datahub-project/datahub/blob/master/docs/actions/concepts.md|DataHub Actions Concepts documentation>, you can set the failure_mode to CONTINUE or THROW. If set to CONTINUE, failed events will be logged to a failed_events.log file, and the pipeline will continue processing other events. You can use this log to identify failed events and implement custom retry logic.
- Persistent Storage for State: Ensure that any state required for resuming tasks is stored in a persistent manner. This could involve using a database or persistent volume to store offsets or task states.
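To make the Kafka-consumer and failure-mode points concrete, here is a rough sketch of the relevant parts of an executor action pipeline YAML. The consumer_config pass-through and the options block are based on the Kafka Event Source and Actions Concepts docs, but the exact keys accepted by your datahub-actions version should be verified:

```yaml
name: "ingestion_executor"
source:
  type: "kafka"
  config:
    connection:
      bootstrap: ${KAFKA_BOOTSTRAP_SERVER:-localhost:9092}
      schema_registry_url: ${SCHEMA_REGISTRY_URL:-http://localhost:8081}
      # Assumed to be passed through to the underlying Kafka consumer:
      consumer_config:
        max.poll.interval.ms: "600000"   # tolerate long processing before a rebalance
action:
  type: "executor"
options:
  retry_count: 3                              # re-attempt failed events
  failure_mode: "CONTINUE"                    # or "THROW"
  failed_events_dir: "/tmp/datahub/actions"   # failed_events.log is written here
```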
Would you like more detailed steps on any of these points or further assistance on a specific part of the process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/ui-ingestion.md|DataHub UI Ingestion documentation>
- <https://github.com/datahub-project/datahub/blob/master/docs/actions/sources/kafka-event-source.md|Kafka Event Source documentation>
- <https://github.com/datahub-project/datahub/blob/master/docs/actions/concepts.md|DataHub Actions Concepts documentation>
<@U07FV3UHE4D> Can anyone help me here to understand the behaviour of datahub-actions in practice? From my experience, if after a while the container running the ingestion process in a datahub-actions pod gets evicted, the IngestionExecutionRequest just never gets picked up by any new datahub-actions container.
This is a very specific scenario where the whole container dies, not just the ingestion subprocess. I would imagine that when using the Kafka Source, it should be able to keep track of the ingestion offset and retry from the last unprocessed/committed offset, but in practice this is not happening, and I’m thinking perhaps there is some configuration I’m missing in my setup. Has anyone experienced this and found a solution?