I am deploying DataHub on an on-premises OpenShift cluster that uses a descheduler to recreate pods after every 24 hours of uptime. This breaks DataHub for me: the System Update Job’s pod is removed and the GMS pod gets stuck at:
```
2024-02-01 09:26:07,401 [main] INFO c.l.metadata.boot.BootstrapManager:33 - Executing bootstrap step 1/15 with name WaitForSystemUpdateStep...
```
and then times out with:
```
2024/01/30 13:04:27 Command exited with error: exit status 143
```
This DataHub instance is using v0.12.0 images.
Has anyone experienced this issue or a similar issue before?
I think the job somehow needs to be started again when the descheduling process kicks in, or I need to change the job to a CronJob, but that would introduce potential downtime between system update runs. None of the other helm hooks worked for this issue either.
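For reference, one way to re-trigger the job manually is to re-apply the chart, which re-fires its hook job. This is only a sketch: the release and namespace names are assumptions, and it relies on the chart running the system update as a helm hook.
```
# Hypothetical release/namespace names; adjust to your environment.
# Re-applying the chart re-runs the system-update hook job, which publishes the
# "upgrade complete" message that GMS waits for in WaitForSystemUpdateStep.
helm upgrade --install datahub datahub/datahub -n datahub -f values.yaml
```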
Interesting. On a restart, the DataHub Upgrade KafkaListener should automatically seek back to the last message via the onPartitionsAssigned logic. Is your Kafka broker also getting “descheduled” and the topic data wiped? That would be problematic for many reasons.
Restarting pods doesn’t require re-running the system update job. The only situation where system-update needs to run again (assuming no data loss in the Kafka topics) is when the helm revision changes. The revision number is shown when you list the installed release with helm, and it is the number after the dash in the version string, for example 0.12.0-99, which means revision 99, i.e. the 99th install of the helm chart. Check whether this revision number is being incremented by a helm upgrade/install process without the corresponding system-update running. Again, just restarting all pods for all DataHub deployments will not increment this number; it would have to be a re-install via helm.
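A quick way to check this from the helm side, assuming the release is named datahub in the datahub namespace:
```
# Current revision of the release (REVISION column).
helm list -n datahub

# Full revision history: when each revision was created and whether it came
# from an install or an upgrade.
helm history datahub -n datahub
```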
> on a restart the DataHub Upgrade KafkaListener should automatically seek backwards to the last message via the onPartitionsAssigned logic
This is the failed attempt to seek backwards on the upgrade topic:
```
2024-02-01 09:08:20,830 [ThreadPoolTaskExecutor-1] INFO o.a.k.clients.consumer.KafkaConsumer:1603 - [Consumer clientId=consumer-datahub-duhe-consumer-job-client-gms-1, groupId=datahub-duhe-consumer-job-client-gms] Seeking to offset 168 for partition oat_DataHub_UpgradeHistory_JSON_v1-0
2024-02-01 09:08:20,853 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.consumer.internals.Fetcher:1274 - [Consumer clientId=consumer-datahub-duhe-consumer-job-client-gms-1, groupId=datahub-duhe-consumer-job-client-gms] Fetch offset 168 is out of range for partition oat_DataHub_UpgradeHistory_JSON_v1-0, resetting offset
```
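The “out of range” reset suggests the broker no longer has offset 168 for that partition. A quick way to inspect the consumer group and the topic’s retention settings, assuming the standard Kafka CLI tools and a broker reachable at kafka:9092 (a placeholder address; group and topic names are the ones from the log):
```
# Compare the group's committed offset with the partition's current range; if
# the committed offset (168) no longer exists on the broker, the older upgrade
# messages have been removed.
kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
  --describe --group datahub-duhe-consumer-job-client-gms

# Check whether a retention override is set on the upgrade-history topic.
kafka-configs.sh --bootstrap-server kafka:9092 --entity-type topics \
  --entity-name oat_DataHub_UpgradeHistory_JSON_v1 --describe
```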
Realistically, months should also be fine assuming regular upgrades occur, but this is an extremely low-volume topic, so space shouldn’t be a real issue. You could also set retention by size in bytes if that is more reasonable for your company; as long as the last few messages are retained, it works.
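As a sketch of what that could look like with the standard Kafka CLI (the broker address is a placeholder, and the retention values are examples, not recommendations):
```
# Time-based retention: keep roughly 90 days of upgrade history.
kafka-configs.sh --bootstrap-server kafka:9092 --entity-type topics \
  --entity-name oat_DataHub_UpgradeHistory_JSON_v1 --alter \
  --add-config retention.ms=7776000000

# Or cap the topic by size instead (about 100 MiB here), if that is easier to
# reason about; either way the last few messages must stay available.
kafka-configs.sh --bootstrap-server kafka:9092 --entity-type topics \
  --entity-name oat_DataHub_UpgradeHistory_JSON_v1 --alter \
  --add-config retention.bytes=104857600
```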