I am deploying DataHub on an on-premises OpenShift cluster that uses a descheduler to recreate pods after every 24 hours of uptime. This breaks DataHub for me: the System Update Job’s pod is removed and the GMS pod gets stuck at:
```
2024-02-01 09:26:07,401 [main] INFO c.l.metadata.boot.BootstrapManager:33 - Executing bootstrap step 1/15 with name WaitForSystemUpdateStep...
```
and then times out with:
```
2024/01/30 13:04:27 Command exited with error: exit status 143
```
This DataHub instance is using v0.12.0 images.
Has anyone experienced this issue or a similar issue before?
I think the job somehow needs to be started again when the descheduling process kicks in, or I need to change the job to a CronJob, but that would introduce potential downtime between system update runs. None of the other helm hooks worked for this issue either.
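For reference, one way to re-trigger the job manually is to re-apply the chart, which re-fires its hook job. This is only a sketch: the release and namespace names are assumptions, and it relies on the chart running the system update as a helm hook.
```
# Hypothetical release/namespace names; adjust to your environment.
# Re-applying the chart re-runs the system-update hook job, which publishes the
# "upgrade complete" message that GMS waits for in WaitForSystemUpdateStep.
helm upgrade --install datahub datahub/datahub -n datahub -f values.yaml
```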
Interesting. On a restart, the DataHub Upgrade KafkaListener should automatically seek back to the last message via the onPartitionsAssigned logic. Is your Kafka broker also getting “descheduled” and the topic data wiped? That would be problematic for many reasons.
Restarting pods doesn’t require re-running the system update job. The only situation where system-update needs to run again (assuming no data loss in the Kafka topics) is when the helm revision changes. The revision number is shown when you list the installed release with helm, and it is the number after the dash in the version string, for example 0.12.0-99, which means revision 99, i.e. the 99th install of the helm chart. Check whether this revision number is being incremented by a helm upgrade/install process without the corresponding system-update running. Again, just restarting all pods for all DataHub deployments will not increment this number; it would have to be a re-install via helm.
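A quick way to check this from the helm side, assuming the release is named datahub in the datahub namespace:
```
# Current revision of the release (REVISION column).
helm list -n datahub

# Full revision history: when each revision was created and whether it came
# from an install or an upgrade.
helm history datahub -n datahub
```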
> on a restart the DataHub Upgrade KafkaListener should automatically seek backwards to the last message via the onPartitionsAssigned logic
This is the failed attempt to seek backwards on the upgrade topic:
```
2024-02-01 09:08:20,830 [ThreadPoolTaskExecutor-1] INFO o.a.k.clients.consumer.KafkaConsumer:1603 - [Consumer clientId=consumer-datahub-duhe-consumer-job-client-gms-1, groupId=datahub-duhe-consumer-job-client-gms] Seeking to offset 168 for partition oat_DataHub_UpgradeHistory_JSON_v1-0
2024-02-01 09:08:20,853 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.consumer.internals.Fetcher:1274 - [Consumer clientId=consumer-datahub-duhe-consumer-job-client-gms-1, groupId=datahub-duhe-consumer-job-client-gms] Fetch offset 168 is out of range for partition oat_DataHub_UpgradeHistory_JSON_v1-0, resetting offset
```
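The “out of range” reset suggests the broker no longer has offset 168 for that partition. A quick way to inspect the consumer group and the topic’s retention settings, assuming the standard Kafka CLI tools and a broker reachable at kafka:9092 (a placeholder address; group and topic names are the ones from the log):
```
# Compare the group's committed offset with the partition's current range; if
# the committed offset (168) no longer exists on the broker, the older upgrade
# messages have been removed.
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group datahub-duhe-consumer-job-client-gms

# Check whether a retention override is set on the upgrade-history topic.
kafka-configs.sh --bootstrap-server kafka:9092 --entity-type topics \
  --entity-name oat_DataHub_UpgradeHistory_JSON_v1 --describe
```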
Realistically, months should also be fine assuming regular upgrades occur, but this is an extremely low-volume topic, so space shouldn’t be a real issue. You could also set retention by size in bytes if that is more reasonable for your company; as long as the last few messages are retained, it works.
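As a sketch of what that could look like with the standard Kafka CLI (the broker address is a placeholder, and the retention values are examples, not recommendations):
```
# Time-based retention: keep roughly 90 days of upgrade history.
kafka-configs.sh --bootstrap-server kafka:9092 --entity-type topics \
  --entity-name oat_DataHub_UpgradeHistory_JSON_v1 --alter \
  --add-config retention.ms=7776000000

# Or cap the topic by size instead (about 100 MiB here), if that is easier to
# reason about; either way the last few messages must stay available.
kafka-configs.sh --bootstrap-server kafka:9092 --entity-type topics \
  --entity-name oat_DataHub_UpgradeHistory_JSON_v1 --alter \
  --add-config retention.bytes=104857600
```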