Resolving Kubernetes Pod Restart Issues and Readiness Check Failures in DataHub 0.11.0

Original Slack Thread

Hey all, we’re running 0.11.0 on Kubernetes and have an issue where, if our GMS pod restarts after some amount of time (we don’t know exactly how long; the shortest so far was a few days), the new instance will fail to pass the readiness check after the restart and we’ll get a ton of WARN c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080 messages in the logs. It’ll never self-correct; it just restarts forever.

If I redeploy the whole thing, the datahub-system-update-job reruns, and upon its success the GMS pod comes up successfully and everything works again. If it restarts again after a few days, though, it will fail to come up and I’ll have to do the same thing. Wondering if anybody has any ideas as to why this would happen?

We were previously on 0.9.6.1 and this never happened; I believe it started when we upgraded to 0.11.0.


what’s potentially interesting is that I’ve tried upgrading to 0.12.0 (hoping this might be due to a bug that’s since been fixed), but on both versions I consistently get 2024-01-09 23:29:27,166 [R2 Nio Event Loop-3-2] WARN c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080, even after the datahub-system-update-job completes successfully. So it seems something about my setup is probably broken, but I’ve come up with no leads as to why.

ahhhh, just saw this in the <https://datahubproject.io/docs/how/kafka-config|0.12.0 docs for Kafka topic config> for DataHubUpgradeHistory_v1: “Notifies the end of DataHub Upgrade job so dependants can act accordingly (_eg_, startup). Note this topic requires special configuration: Infinite retention. Also, 1 partition is enough for the occasional traffic.” I had mine set to 1 week of retention, so it seems very likely that’s what’s causing my problem :sweat_smile:
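
For anyone who wants to confirm the same misconfiguration, one way is to check the topic’s retention.ms, which should be -1 (infinite). Below is a minimal sketch using the confluent-kafka Python client; the broker address is a placeholder for whatever Kafka endpoint your DataHub deployment uses:

```python
from confluent_kafka.admin import AdminClient, ConfigResource

# Placeholder bootstrap address; point this at the broker DataHub talks to.
admin = AdminClient({"bootstrap.servers": "prerequisites-kafka:9092"})

topic = ConfigResource(ConfigResource.Type.TOPIC, "DataHubUpgradeHistory_v1")

# describe_configs returns a dict of futures keyed by the requested resources.
for resource, future in admin.describe_configs([topic]).items():
    configs = future.result()
    # Infinite retention is retention.ms = -1. A finite value (e.g. 604800000,
    # i.e. 1 week) means the upgrade-history message will eventually be deleted
    # and a restarted GMS will block on startup waiting for it.
    print(resource, "retention.ms =", configs["retention.ms"].value)
```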

in case anyone else runs into this same issue, I think the reason my 0.12.0 gms pod wouldn’t launch is that the 0.12.0 datahub-system-upgrade job wrotev0.12.1-1 into my DataHubUpgradeHistory_v1 topic, (instead of what I assume it should be writing, v0.12.0-1), deploying the pre-release 0.12.1 works, i’m gonna just wait until thats out to upgrade further
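
If you want to see exactly what the upgrade job wrote, you can dump the topic with a throwaway consumer. A rough sketch with the confluent-kafka Python client; the broker address and group id are placeholders, and the payload is Avro-encoded, so this only prints raw bytes, but the version string (e.g. b"v0.12.1-1") is normally legible in them:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "prerequisites-kafka:9092",  # placeholder broker address
    "group.id": "upgrade-history-inspect",            # throwaway group id
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})

consumer.subscribe(["DataHubUpgradeHistory_v1"])

# Poll until the topic is drained, printing each record's offset and raw value.
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:
        break
    if msg.error():
        raise RuntimeError(msg.error())
    print(msg.offset(), msg.value())

consumer.close()
```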

Yep, the component versions of the system-update job and GMS need to be in sync. Deploying with all components on either v0.12.0 or v0.12.1 should work equally well.