Troubleshooting a DataHub Upgrade from 0.12.0 to 0.13.0

Original Slack Thread

Greetings!

We have a problem when upgrading from DataHub 0.12.0 to 0.13.0.
The datahub-upgrade job runs flawlessly (all 8 steps):
```
Completed Step 2/8: BuildIndicesStep successfully.
Completed Step 3/8: BuildIndicesPostStep successfully.
Completed Step 4/8: DataHubStartupStep successfully.
BackfillBrowsePathsV2Step was already run. Skipping.
Skipping Step 5/8: BackfillBrowsePathsV2Step...
BackfillPolicyFieldsStep was already run. Skipping.
Skipping Step 6/8: BackfillPolicyFieldsStep...
Completed Step 7/8: CleanUpIndicesStep successfully.
OwnershipTypes was already run. Skipping.
Skipping Step 8/8: OwnershipTypes...
Success! Completed upgrade with id SystemUpdate successfully.
Upgrade SystemUpdate completed with result SUCCEEDED. Exiting...
```
After that, the datahub-gms Deployment starts, but there is some glitch.

I checked the listening ports inside the Pods:
```
~ $ netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 :::4318                 :::*                    LISTEN      30/java
tcp        0      0 :::5701                 :::*                    LISTEN      30/java
tcp        0      0 :::8080                 :::*                    LISTEN      30/java
```
As you can see, port 8080 is in the "LISTEN" state.
But the liveness/readiness probes are failing because port 8080 is not accessible.
And because of that, the following things happen:
• the GMS Pods are not switching to the *Ready* state:
```
$ oc get pods
NAME                                                    READY   STATUS      RESTARTS   AGE
datahub-gms-58d95dc9d-7lx9w                             0/1     Running     6          92m
datahub-gms-58d95dc9d-rx4gv                             0/1     Running     6          92m
```
• the pods cannot resolve the external service *datahub-gms-v0-13-0-hazelcast-svc* (which is used by the GMS pods to connect to each other); see the quick DNS check after this list:
```
WARNING: [10.239.28.206]:5701 [dev] [5.3.6] DNS lookup for serviceDns 'datahub-gms-v0-13-0-hazelcast-svc' failed: unknown host
```
• Hazelcast cannot unite all GMS pods into one cluster
• the Pods start to restart in a cycle
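For reference, one quick way to confirm the DNS symptom from inside a GMS pod (assuming the image ships nslookup or getent, which is not guaranteed) is:
```
# shell into one of the GMS pods first, e.g. `oc rsh datahub-gms-58d95dc9d-7lx9w`
~ $ nslookup datahub-gms-v0-13-0-hazelcast-svc
~ $ getent hosts datahub-gms-v0-13-0-hazelcast-svc
```
For a headless service like this one, these lookups should return the IPs of the Ready GMS pods; while no pod is Ready there are no records, which matches the Hazelcast warning.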
If I use curl to get some info from port 8080, I get only this:
```
~ $ curl -v --head -X GET http://localhost:8080/health
* Host localhost:8080 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:8080...
* Connected to localhost (::1) port 8080
> GET /health HTTP/1.1
> Host: localhost:8080
> User-Agent: curl/8.5.0
> Accept: */*
>
```
Nothing changed even after hours; curl hangs forever, showing only what is above.

At the same time, if I use curl to get info from port 4318, everything is OK:
```
~ $ curl -v --head -X GET http://localhost:4318/health
* Host localhost:4318 was resolved.
* IPv6: ::1
* IPv4: 127.0.0.1
*   Trying [::1]:4318...
* Connected to localhost (::1) port 4318
> GET /health HTTP/1.1
> Host: localhost:4318
> User-Agent: curl/8.5.0
> Accept: */*
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Thu, 11 Apr 2024 11:39:07 GMT
Date: Thu, 11 Apr 2024 11:39:07 GMT
< Content-type: text/plain; version=0.0.4; charset=utf-8
Content-type: text/plain; version=0.0.4; charset=utf-8
< Content-length: 274697
Content-length: 274697
```
So why is port 8080 not accessible even though it is marked as *LISTEN* in netstat?
What can I check to find the source of this glitch?

> So why is port 8080 not accessible even though it is marked as LISTEN in netstat?
The LISTEN state only indicates that a process is listening on that port, not that it is able to process new connections. Actually serving requests is what matters here, because Kubernetes uses this port to health-check the process.
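As a rough illustration (the pod name and endpoint are taken from this thread, and --max-time is added only so curl gives up instead of hanging), the difference between "socket is open" and "requests are served" can be seen like this:
```
# inside the GMS pod: the TCP connection is accepted, but no HTTP response
# arrives before the timeout, so the probes fail the same way
~ $ curl -v --max-time 5 http://localhost:8080/health

# from a machine with cluster access: confirm what the probes actually call
$ oc describe pod datahub-gms-58d95dc9d-7lx9w | grep -i -A 2 -E 'liveness|readiness'
```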

> What can I check to find the source of this glitch?
Please attach the GMS logs. Also, you have found one problem already:

> WARNING: [10.239.28.206]:5701 [dev] [5.3.6] DNS lookup for serviceDns 'datahub-gms-v0-13-0-hazelcast-svc' failed: unknown host
Make sure the Hazelcast service exists. It's set up by the Helm chart: https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-gms/templates/hazelcastService.yaml

Yes, the Hazelcast service does exist:

```
datahub-gms-v0-13-0-hazelcast-svc      ClusterIP   None             <none>        5701/TCP                     28h
```
But the GMS pods cannot connect to this service, because they cannot pass the readiness/liveness probes (port 8080 does not respond to the probes at all). Therefore, the GMS pods remain in the 0/1 (not Ready) state and cannot resolve the name of the Hazelcast service, because its DNS records appear only after the pods switch to the 1/1 (Ready) state.
So, as I see it, the main problem is that port 8080 is not working as expected.
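One way to see this chicken-and-egg situation directly (names as in this thread) is to look at the endpoints of the headless Hazelcast service; while the GMS pods are 0/1, the service has no addresses, so its DNS name does not resolve. The output should look roughly like:
```
$ oc get endpoints datahub-gms-v0-13-0-hazelcast-svc
# Expected while no GMS pod is Ready (illustrative output):
# NAME                                 ENDPOINTS   AGE
# datahub-gms-v0-13-0-hazelcast-svc    <none>      28h
```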

The DataHub GMS log is attached: datahub-gms-58d95dc9d-rx4gv.log

GMS is stuck on step 1 of the bootstrap process:

```
...
2024-04-12 15:41:41,061 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.c.i.SubscriptionState:397 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Resetting offset for partition DataHubUpgradeHistory_v1-0 to offset 100.
2024-04-12 15:41:41,056 [ThreadPoolTaskExecutor-1] INFO  o.a.k.c.consumer.internals.Fetcher:1274 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Fetch offset 99 is out of range for partition DataHubUpgradeHistory_v1-0, resetting offset
```
The way that the upgrade check works is that GMS checks the last message in the `DataHubUpgradeHistory_v1` topic to make sure it's the same version as GMS. So it gets the current offset (100) and subtracts 1 to get the last message (99). But offset 99 appears to be invalid.
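If you want to verify this from the Kafka side (the broker address below is a placeholder; the topic, partition, offsets and consumer group are taken from the log above), the standard Kafka CLI tools can show whether offset 99 still exists:
```
# consumer group position vs. log end offset for the DUHE consumer
$ kafka-consumer-groups.sh --bootstrap-server <broker:9092> \
    --describe --group generic-duhe-consumer-job-client

# try to read the message at offset 99 directly; if it was removed
# (e.g. by retention), nothing is returned and the offset is out of range
$ kafka-console-consumer.sh --bootstrap-server <broker:9092> \
    --topic DataHubUpgradeHistory_v1 --partition 0 --offset 99 --max-messages 1
```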

Try deploying again so the datahub-upgrade job runs again and posts the event to Kafka.

Also, hopefully you don't have any custom settings on the DataHubUpgradeHistory_v1 topic that would cause events to be deleted?
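To check for such settings (broker address again a placeholder), you can inspect the topic and its per-topic config overrides:
```
# partition count and basic metadata
$ kafka-topics.sh --bootstrap-server <broker:9092> \
    --describe --topic DataHubUpgradeHistory_v1

# look for overrides such as retention.ms, retention.bytes or cleanup.policy
$ kafka-configs.sh --bootstrap-server <broker:9092> --describe \
    --entity-type topics --entity-name DataHubUpgradeHistory_v1
```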

<@U064583E403>
I redeployed DataHub.
Now I don't see anything about offsets.
The pod exits with error code 143.
[attachment: datahub-gms-58d95dc9d-nnfjv.log]

Looks like a similar issue; now it doesn't even get to the point of `Resetting offset for partition`, so there's some issue with reading events from Kafka.

No partitions are being assigned:
```
2024-04-12 17:58:16,197 [ThreadPoolTaskExecutor-1] INFO o.a.k.c.c.i.ConsumerCoordinator:273 - [Consumer clientId=consumer-generic-duhe-consumer-job-client-1, groupId=generic-duhe-consumer-job-client] Adding newly assigned partitions:
```
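A possible follow-up check for the empty assignment (group name from the log, broker again a placeholder): describing the group members shows whether this consumer joined the group at all and which partitions, if any, it was assigned:
```
$ kafka-consumer-groups.sh --bootstrap-server <broker:9092> \
    --describe --group generic-duhe-consumer-job-client --members --verbose
```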