We have a problem when upgrading from DataHub version 0.12.0 to 0.13.0.
The datahub-upgrade job runs flawlessly (all 8 steps):
```Completed Step 2/8: BuildIndicesStep successfully.
Completed Step 3/8: BuildIndicesPostStep successfully.
Completed Step 4/8: DataHubStartupStep successfully.
BackfillBrowsePathsV2Step was already run. Skipping.
Skipping Step 5/8: BackfillBrowsePathsV2Step...
BackfillPolicyFieldsStep was already run. Skipping.
Skipping Step 6/8: BackfillPolicyFieldsStep...
Completed Step 7/8: CleanUpIndicesStep successfully.
OwnershipTypes was already run. Skipping.
Skipping Step 8/8: OwnershipTypes...
Success! Completed upgrade with id SystemUpdate successfully.
Upgrade SystemUpdate completed with result SUCCEEDED. Exiting...```
After that, the datahub-gms Deployment starts, but there is some glitch.
I checked the listening ports inside the Pods:
```~ $ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 :::4318 :::* LISTEN 30/java
tcp 0 0 :::5701 :::* LISTEN 30/java
tcp 0 0 :::8080 :::* LISTEN 30/java```
As you can see, port 8080 is in the LISTEN state.
But the liveness/readiness probes are failing because port 8080 is not accessible.
And because of that, the following things happen:
• the GMS Pods do not switch to the *Ready* state:
```$ oc get pods
NAME READY STATUS RESTARTS AGE
datahub-gms-58d95dc9d-7lx9w 0/1 Running 6 92m
datahub-gms-58d95dc9d-rx4gv 0/1 Running 6 92m```
• the pods cannot resolve the headless service *datahub-gms-v0-13-0-hazelcast-svc* (which the GMS pods use to connect to each other):
```WARNING: [10.239.28.206]:5701 [dev] [5.3.6] DNS lookup for serviceDns 'datahub-gms-v0-13-0-hazelcast-svc' failed: unknown host```
• Hazelcast cannot join all the GMS pods into one cluster
• the Pods start to restart in a loop (see the checks below)
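Roughly these commands can be used to confirm each link in that chain (pod and service names are taken from our deployment; the nslookup call assumes DNS tools are available in the image):
```~ $ oc describe pod datahub-gms-58d95dc9d-7lx9w | grep -A10 Events        # probe failure events
~ $ oc get endpoints datahub-gms-v0-13-0-hazelcast-svc                     # no addresses while the pods are not Ready
~ $ oc exec datahub-gms-58d95dc9d-7lx9w -- nslookup datahub-gms-v0-13-0-hazelcast-svc```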
If I use curl to get some info from port 8080, I get only this:
```~ $ curl -v --head -X GET http://localhost:8080/health
* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
* Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /health HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.5.0
> Accept: */*
>```
Nothing changes even after an hour: curl hangs forever, showing only what is above.
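To avoid waiting forever, the same check can be run with a timeout; presumably it just exits with a timeout error, since the TCP connection succeeds but an HTTP response never arrives:
```~ $ curl -v --max-time 5 http://localhost:8080/health```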
At the same time, if I use curl to get the info from port 4318, everything is OK:
```~ $ curl -v --head -X GET http://localhost:4318/health
* Host localhost:4318 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
* Trying [::1]:4318...
* Connected to localhost (::1) port 4318
> GET /health HTTP/1.1
> Host: localhost:4318
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Thu, 11 Apr 2024 11:39:07 GMT
Date: Thu, 11 Apr 2024 11:39:07 GMT
< Content-type: text/plain; version=0.0.4; charset=utf-8
Content-type: text/plain; version=0.0.4; charset=utf-8
< Content-length: 274697
Content-length: 274697```
So, why is port 8080 not accessible even though it is marked as *LISTEN* in netstat?
What can I check to find the source of this glitch?
> So, why is port 8080 not accessible even though it is marked as LISTEN in netstat?
The LISTEN state only indicates that a process has bound and is listening on that port, not that it is able to process new connections. Besides, the port needs to be open anyway because, for example, Kubernetes uses it to health-check the process.
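If you want to see the difference, you can inspect the listening socket and the process itself; a rough sketch, assuming ss and a JDK are present in the image (PID 30 comes from your netstat output):
```~ $ ss -ltn '( sport = :8080 )'       # on a LISTEN socket, Recv-Q = connections waiting to be accepted, Send-Q = backlog size
~ $ jstack 30 > /tmp/gms-threads.txt  # thread dump of the GMS process to see where startup is stuck```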
> What can I check to find the source of this glitch?
Please attach the GMS logs. Also, you have found one problem already:
```datahub-gms-v0-13-0-hazelcast-svc ClusterIP None <none> 5701/TCP 28h```
But the GMS pods cannot connect to this service, because the pods cannot pass the readiness/liveness probes (port 8080 does not respond to the probes at all). Therefore, the GMS pods remain in the 0/1 (not Ready) state and cannot resolve the name of the Hazelcast service, because the DNS records for it appear only after the pods switch to the 1/1 (Ready) state.
So, as I see it, the main problem is that port 8080 is not working as expected.
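If that chicken-and-egg is really what keeps the pods down, maybe publishing not-ready addresses on the headless Hazelcast service would break the loop; this is just a guess on my side, not something I have confirmed with the chart:
```$ oc patch svc datahub-gms-v0-13-0-hazelcast-svc -p '{"spec":{"publishNotReadyAddresses":true}}'
$ oc get endpoints datahub-gms-v0-13-0-hazelcast-svc   # the pod IPs should now be listed even while the pods are 0/1```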
The DataHub GMS log is in the attachment.
[attachment: datahub-gms-58d95dc9d-rx4gv.log]
```...
2024-04-12 15:41:41,056 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.consumer.internals.Fetcher:1274 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Fetch offset 99 is out of range for partition DataHubUpgradeHistory_v1-0, resetting offset
2024-04-12 15:41:41,061 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.SubscriptionState:397 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Resetting offset for partition DataHubUpgradeHistory_v1-0 to offset 100.```
The way that the upgrade check works is that GMS checks the last message in the `DataHubUpgradeHistory_v1` topic to make sure it's the same version as GMS. So it gets the current offset (100) and subtracts 1 to get the last message (99). But offset 99 appears to be invalid.
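If you can reach the broker, it may be worth looking at the topic directly to see what is actually at the end of it; a sketch, assuming a Kafka 3.x distribution (script paths and the bootstrap address are placeholders for your environment):
```$ kafka-get-offsets.sh --bootstrap-server <broker:9092> --topic DataHubUpgradeHistory_v1
$ kafka-console-consumer.sh --bootstrap-server <broker:9092> --topic DataHubUpgradeHistory_v1 --partition 0 --offset 99 --max-messages 1```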
Also, it looks like no partitions are being assigned:
```2024-04-12 17:58:16,197 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:273 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions:```
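Checking the consumer group from the broker side could also help to see whether it ever gets the partition assigned (the group id is taken from the log line above; the bootstrap address is a placeholder):
```$ kafka-consumer-groups.sh --bootstrap-server <broker:9092> --describe --group generic-duhe-consumer-job-client```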