Troubleshooting Datahub Upgrade from 0.12.0 to 0.13.0

We have a problem when upgrading DataHub from version 0.12.0 to 0.13.0.
The datahub-upgrade job runs flawlessly (all 8 steps):
```Completed Step 2/8: BuildIndicesStep successfully.
Completed Step 3/8: BuildIndicesPostStep successfully.
Completed Step 4/8: DataHubStartupStep successfully.
BackfillBrowsePathsV2Step was already run. Skipping.
Skipping Step 5/8: BackfillBrowsePathsV2Step...
BackfillPolicyFieldsStep was already run. Skipping.
Skipping Step 6/8: BackfillPolicyFieldsStep...
Completed Step 7/8: CleanUpIndicesStep successfully.
OwnershipTypes was already run. Skipping.
Skipping Step 8/8: OwnershipTypes...
Success! Completed upgrade with id SystemUpdate successfully.
Upgrade SystemUpdate completed with result SUCCEEDED. Exiting...```
After that, the datahub-gms Deployment starts, and that is where things go wrong.

I checked the listening ports inside of the Pods:
```~ $ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 :::4318                 :::*                    LISTEN      30/java
tcp        0      0 :::5701                 :::*                    LISTEN      30/java
tcp        0      0 :::8080                 :::*                    LISTEN      30/java```
As you can see, port 8080 is in the "LISTEN" state.
But the liveness/readiness probes fail because port 8080 is not accessible.
As a result, several things happen:
• the GMS Pods do not switch to the *Ready* state:
```$ oc get pods
NAME                                                    READY   STATUS      RESTARTS   AGE
datahub-gms-58d95dc9d-7lx9w                             0/1     Running     6          92m
datahub-gms-58d95dc9d-rx4gv                             0/1     Running     6          92m```
• the pods cannot resolve the external service *datahub-gms-v0-13-0-hazelcast-svc* (which is used by gms pods to connect to each other):
```WARNING: []:5701 [dev] [5.3.6] DNS lookup for serviceDns 'datahub-gms-v0-13-0-hazelcast-svc' failed: unknown host```
• Hazelcast cannot join all GMS pods into one cluster
• the Pods restart in a loop
If I use curl to query port 8080, this is all I get:
```~ $ curl -v --head -X GET http://localhost:8080/health
* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4:
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /health HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.5.0
> Accept: */*```
Nothing changes for hours: curl hangs indefinitely and shows only the output above.

At the same time, if I use curl against port 4318, everything is OK:
```~ $ curl -v --head -X GET http://localhost:4318/health
* Host localhost:4318 was resolved.
* IPv6: ::1
* IPv4:
*   Trying [::1]:4318...
* Connected to localhost (::1) port 4318
> GET /health HTTP/1.1
> Host: localhost:4318
> User-Agent: curl/8.5.0
> Accept: */*
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Thu, 11 Apr 2024 11:39:07 GMT
Date: Thu, 11 Apr 2024 11:39:07 GMT
< Content-type: text/plain; version=0.0.4; charset=utf-8
Content-type: text/plain; version=0.0.4; charset=utf-8
< Content-length: 274697
Content-length: 274697```
So why is port 8080 not accessible even though it is marked as *LISTEN* in netstat?
What can I check to find the source of this problem?

> So why is port 8080 not accessible even though it is marked as LISTEN in netstat?
The LISTEN state only indicates that a process is listening on that port, not that it is able to process new connections. The port also has to be open because, for example, Kubernetes uses it to health check the process.
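
To illustrate the point about LISTEN, here is a minimal Python sketch (not DataHub code; the function name and the loopback setup are assumptions made for the demo). It opens a socket that listens but never calls `accept()`. The kernel still completes the TCP handshake, so a client connects and can send its request, but it never gets an answer, which is similar to what the hanging curl shows.

```python
import socket


def demo_listen_without_accept(timeout=1.0):
    """Return (connected, got_response) for a server that listens but never accepts."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(5)               # the socket now shows up as LISTEN in netstat
    port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.settimeout(timeout)
    connected = False
    got_response = False
    try:
        client.connect(("127.0.0.1", port))  # handshake is completed by the kernel
        connected = True
        client.sendall(b"GET /health HTTP/1.1\r\nHost: localhost\r\n\r\n")
        # Nobody calls accept() or writes a reply, so this read times out.
        got_response = bool(client.recv(1024))
    except socket.timeout:
        pass
    finally:
        client.close()
        server.close()
    return connected, got_response


print(demo_listen_without_accept())
```

A connection that succeeds while the read times out suggests the process owns the port but is not serving requests yet, so the next place to look is the application log rather than the network.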

> What can I check to find the source of this problem?
Please attach the GMS logs. Also, you have already found one problem:

> WARNING: []:5701 [dev] [5.3.6] DNS lookup for serviceDns 'datahub-gms-v0-13-0-hazelcast-svc' failed: unknown host
Make sure the Hazelcast service exists. It's set up by the Helm chart.

Yes, the Hazelcast service exists:
```datahub-gms-v0-13-0-hazelcast-svc      ClusterIP   None             <none>        5701/TCP                     28h```
But the GMS pods cannot connect to this service because they cannot pass the readiness/liveness probes (port 8080 does not respond to the probes). As a result, the GMS pods remain in the 0/1 (not Ready) state and cannot resolve the name of the Hazelcast service, because its DNS records appear only after the GMS pods switch to the 1/1 (Ready) state.
So, as I see it, the main problem is that port 8080 is not working as expected.
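
One way to confirm what a pod actually sees in DNS is a small resolver check like the sketch below. It is only an illustration: the helper name `resolve_service` is hypothetical, and running it inside a GMS pod assumes a Python interpreter is available there. A `socket.gaierror` is treated as the same condition Hazelcast reports as "unknown host".

```python
import socket


def resolve_service(name, port=5701):
    """Return the sorted IP addresses that `name` resolves to, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(name, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # name not resolvable from this host
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Sanity check against a name that should always resolve.
print(resolve_service("localhost"))
```

Calling `resolve_service("datahub-gms-v0-13-0-hazelcast-svc")` from inside a pod would return an empty list while no records are published for the service, and the pod IPs once they are.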

The DataHub GMS log is attached.
GMS is stuck on step 1 of the bootstrap process:

```2024-04-12 15:41:41,061 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.SubscriptionState:397 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Resetting offset for partition DataHubUpgradeHistory_v1-0 to offset 100.
2024-04-12 15:41:41,056 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.consumer.internals.Fetcher:1274 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Fetch offset 99 is out of range for partition DataHubUpgradeHistory_v1-0, resetting offset```
The way that the upgrade check works is that GMS checks the last message in the `DataHubUpgradeHistory_v1` topic to make sure it's the same version as GMS. So it gets the current offset (100) and subtracts 1 to get the last message (99). But offset 99 appears to be invalid.
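
The arithmetic above can be sketched as follows. This is a simplified model, not the actual GMS code: the function and parameter names are hypothetical, and it assumes the usual Kafka rule that a fetch offset below the earliest retained offset is out of range.

```python
def last_message_offset(log_start, log_end):
    """Offset of the newest retained record, or None if nothing is retained.

    log_start: earliest offset still retained by the broker
    log_end:   offset the next produced record will receive
    """
    candidate = log_end - 1  # "current offset minus 1"
    if candidate < log_start:
        return None  # out of range: the record no longer exists
    return candidate


# All 100 records still retained: the last message is at offset 99.
print(last_message_offset(0, 100))

# Records 0..99 were deleted, so the log starts at 100 and offset 99 is invalid.
print(last_message_offset(100, 100))
```

The second case matches the log lines above if the topic's records were removed while the end offset stayed at 100, which is why the topic settings are worth checking.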

Try deploying again so that the datahub-upgrade job reruns and posts the event to Kafka.

Also, I hope you don't have any custom settings on the `DataHubUpgradeHistory_v1` topic that would cause events to be deleted?

I redeployed DataHub.
Now I don't see any mention of the offset.
Looks like a similar issue. Now it doesn't even get to the point of `Resetting offset for partition`; there's some issue with reading events from Kafka.

No partitions are being assigned:
2024-04-12 17:58:16,197 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:273 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions: