Resolving Kubernetes Pod Restart Issues and Readiness Check Failures in DataHub 0.11.0

Original Slack Thread

Hey all, we’re running 0.11.0 on Kubernetes and have an issue where, if our GMS pod restarts after some amount of time (we don’t know exactly how long; the shortest so far was a few days), the new instance will fail to pass the readiness check after the restart and we’ll get a ton of WARN c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080 messages in the logs. It’ll never self-correct; it just restarts forever.

If I redeploy the whole thing, the datahub-system-update-job reruns, and upon its success the GMS pod comes up successfully and everything works again. If it restarts again after a few days, though, it will fail to come up and I’ll have to do the same thing. Wondering if anybody has any ideas as to why this would happen?

We were previously on 0.9.6.1 and this never happened; I believe it started when we upgraded to 0.11.0.


what’s potentially interesting is that I’ve tried upgrading to 0.12.0 (hoping this might be due to a bug that’s since been fixed), but on both versions I consistently get 2024-01-09 23:29:27,166 [R2 Nio Event Loop-3-2] WARN c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080, even after the datahub-system-update-job completes successfully. So it seems something about my setup is probably broken, but I’ve come up with no leads as to why.

ahhhh, just saw this in the <https://datahubproject.io/docs/how/kafka-config|0.12.0 docs for Kafka topic config> for DataHubUpgradeHistory_v1: “Notifies the end of DataHub Upgrade job so dependants can act accordingly (_eg_, startup). Note this topic requires special configuration: Infinite retention. Also, 1 partition is enough for the occasional traffic.” I had mine set to 1 week of retention, so it seems very likely that’s what’s causing my problem :sweat_smile:
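
For anyone who wants to confirm the same misconfiguration, one way is to check the topic’s retention.ms, which should be -1 (infinite). Below is a minimal sketch using the confluent-kafka Python client; the broker address is a placeholder for whatever Kafka endpoint your DataHub deployment uses:

```python
from confluent_kafka.admin import AdminClient, ConfigResource

# Placeholder bootstrap address; point this at the broker DataHub talks to.
admin = AdminClient({"bootstrap.servers": "prerequisites-kafka:9092"})

topic = ConfigResource(ConfigResource.Type.TOPIC, "DataHubUpgradeHistory_v1")

# describe_configs returns a dict of futures keyed by the requested resources.
for resource, future in admin.describe_configs([topic]).items():
    configs = future.result()
    # Infinite retention is retention.ms = -1. A finite value (e.g. 604800000,
    # i.e. 1 week) means the upgrade-history message will eventually be deleted
    # and a restarted GMS will block on startup waiting for it.
    print(resource, "retention.ms =", configs["retention.ms"].value)
```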

in case anyone else runs into this same issue, I think the reason my 0.12.0 gms pod wouldn’t launch is that the 0.12.0 datahub-system-upgrade job wrotev0.12.1-1 into my DataHubUpgradeHistory_v1 topic, (instead of what I assume it should be writing, v0.12.0-1), deploying the pre-release 0.12.1 works, i’m gonna just wait until thats out to upgrade further
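
If you want to see exactly what the upgrade job wrote, you can dump the topic with a throwaway consumer. A rough sketch with the confluent-kafka Python client; the broker address and group id are placeholders, and the payload is Avro-encoded, so this only prints raw bytes, but the version string (e.g. b"v0.12.1-1") is normally legible in them:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "prerequisites-kafka:9092",  # placeholder broker address
    "group.id": "upgrade-history-inspect",            # throwaway group id
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
})

consumer.subscribe(["DataHubUpgradeHistory_v1"])

# Poll until the topic is drained, printing each record's offset and raw value.
while True:
    msg = consumer.poll(timeout=5.0)
    if msg is None:
        break
    if msg.error():
        raise RuntimeError(msg.error())
    print(msg.offset(), msg.value())

consumer.close()
```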

Yep, the component versions of the system-update job and GMS need to be in sync. Deploying with all components on either v0.12.0 or v0.12.1 should work equally well.