Handling Upgrades in a Kubernetes Environment with a DataHub Instance

Original Slack Thread

Hi all,
I have a running DataHub instance deployed on Kubernetes (raw manifests; Helm is not supported in our environment) and am now wondering what the recommended way to do upgrades is.

So far I’ve found that a newer gms version will wait for the corresponding upgrade job to finish (signalled via a Kafka message). This means the old gms version won’t be torn down while the upgrade job is running (great!).
What I couldn’t find out, however, is how to keep my frontend pods in sync with the gms version. The health check doesn’t seem to verify the gms version, so the pod goes live and may throw errors (a GraphQL schema mismatch, for example).
I could kubectl wait for the gms pod with the new version to be ready, and then && kubectl apply the frontend, but that feels hacky.
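For reference, that hacky variant might look something like this (the label names, version value, and manifest filename are placeholders, not DataHub conventions):

```shell
# Block until pods carrying the new version label are ready,
# then roll out the matching frontend manifest.
kubectl wait --for=condition=ready pod \
  -l app=datahub-gms,version=v0.13.1 --timeout=10m \
  && kubectl apply -f datahub-frontend.yaml
```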
I could potentially also call the gms health endpoint through the frontend proxy and compare versions (I would then add a new frontend readiness endpoint that responds with 500 until the versions match).
Or I could probably create another kind: Service with some sort of “next” selector, so that the frontend has a stable hostname for checking the readiness of the new gms.
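A minimal sketch of that third option, assuming the gms pods are labeled with the release they run (the label keys, port, and version value here are hypothetical):

```yaml
# Sketch: a "next" Service that only routes once pods labeled with the
# upcoming version exist and pass readiness. The version label is bumped
# as part of each rollout.
apiVersion: v1
kind: Service
metadata:
  name: datahub-gms-next
spec:
  selector:
    app: datahub-gms
    version: "v0.13.1"   # next-to-be-deployed version
  ports:
    - port: 8080
      targetPort: 8080
```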

Do you guys have recommendations? :slightly_smiling_face:

<@U03MF8MU5P0> might be able to speak to this!

Our current approach is as follows:
We created another Service called datahub-gms-next whose selector targets the next-to-be-deployed DataHub version (it only routes correctly once the new gms version is running, since we label the pods with the version). We then use an initContainer on the frontend that curls the “next” gms until it is up.

That seems to work OK. The only caveat is that the frontend container only starts booting once gms has switched over to the new version, so while the new frontend boots, the old one is still being used.

If you have a better idea on how to handle this, let me know :slightly_smiling_face: In theory we could run two completely distinct deployments of FE and BE, do some canary checks, and only swap the ingress once everything is stable… but that seems like overkill.