Troubleshooting DataHub Upgrade to 0.12.0: Connecting GMS and Solving Job Failure

Original Slack Thread

Hi all, still stuck upgrading to 0.12.0. It seems to be a chicken-and-egg problem: the datahub upgrade job fails because it cannot connect to GMS, and GMS won’t work until it has been upgraded. Thoughts on what else I can try?

see also https://datahubspace.slack.com/archives/C029A3M079U/p1700085780494409

more logs with context

```
2023-11-20 22:56:31,121 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Cleanup has not been requested.
2023-11-20 22:56:31,121 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Skipping Step 1/6: RemoveAspectV2TableStep...
2023-11-20 22:56:31,121 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 2/6: GMSQualificationStep...
ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.8
ANTLR Tool version 4.5 used for code generation does not match the current runtime version 4.8
ANTLR Runtime version 4.5 used for parser compilation does not match the current runtime version 4.8
java.net.ConnectException: Connection refused (Connection refused)
	at java.base/java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.base/java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
	at java.base/java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:255)
	at java.base/java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:237)
	at java.base/java.net.Socket.connect(Socket.java:609)
	at java.base/java.net.Socket.connect(Socket.java:558)
	at java.base/sun.net.NetworkClient.doConnect(NetworkClient.java:182)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:509)
	at java.base/sun.net.www.http.HttpClient.openServer(HttpClient.java:604)
	at java.base/sun.net.www.http.HttpClient.<init>(HttpClient.java:277)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:376)
	at java.base/sun.net.www.http.HttpClient.New(HttpClient.java:397)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1253)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1187)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1081)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:1015)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1592)
	at java.base/sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1520)
	at com.linkedin.datahub.upgrade.common.steps.GMSQualificationStep.lambda$executable$0(GMSQualificationStep.java:80)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeStepInternal(DefaultUpgradeManager.java:110)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:68)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:42)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.execute(DefaultUpgradeManager.java:33)
	at com.linkedin.datahub.upgrade.UpgradeCli.run(UpgradeCli.java:80)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:768)
	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:314)
	at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:164)
	at com.linkedin.datahub.upgrade.UpgradeCliApplication.main(UpgradeCliApplication.java:23)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
	at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:65)
[ ..... SNIP ...... ] 2 retries truncated
2023-11-20 22:56:34,308 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - ERROR: Cannot connect to GMSat host datahub-datahub-gms port 8080. Make sure GMS is on the latest version and is running at that host before starting the migration.
2023-11-20 22:56:34,308 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Failed Step 2/6: GMSQualificationStep. Failed after 2 retries.
2023-11-20 22:56:34,308 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Exiting upgrade NoCodeDataMigration with failure.
2023-11-20 22:56:34,309 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Upgrade NoCodeDataMigration completed with result FAILED. Exiting...```
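
The `Connection refused` in the trace above means nothing was accepting TCP connections on the GMS host and port when the job ran. A minimal sketch for reproducing that check from inside the cluster network is below. The helper name `can_connect` is hypothetical, and the host and port in the usage comment are taken from the error message in the log. It only tests that the port is open, not that GMS is healthy or on the right version.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and DNS resolution failures.
        return False


# Example call, using the host and port from the error message above:
# can_connect("datahub-datahub-gms", 8080)
```

A `False` result here matches what the upgrade job saw. It does not say why GMS is not listening.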

<@U03MF8MU5P0> could you look into this? Thanks!

The log you’ve shared is not from the required system-update job. It is from a post-GMS-start upgrade step called `NoCodeDataMigration`.

There should be a different argument for the job (SystemUpdate); see the helm chart here: https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/templates/datahub-upgrade/datahub-system-update-job.yml#L62

Similarly, the various docker compose files use this argument as well, for example: https://github.com/datahub-project/datahub/blob/master/docker/quickstart/docker-compose-without-neo4j-m1.quickstart.yml#L120

Thanks <@U03MF8MU5P0> for clearing up my misunderstanding, but I still don’t understand why the systemUpdate job is apparently not running, or, if it is, where I can find its logs in ArgoCD. I can see that datahub.systemUpdate.enabled is true by default and our config has no override for it. I also note that the comment in the code suggests this merely configures the behaviour of the datahub-upgrade job.

Based on the screenshot there should be a job called datahub-datahub-system-update

If the system-update job has run, then this upgrade job’s failure is due to an error in the other pod, datahub-datahub-gms. Can you share the logs from a datahub-datahub-gms pod?

<@U03MF8MU5P0> I’m not sure how to get the full log to you, but it seems to start going off the rails at this point in the log: [attachment]

And then just a bit further on: [attachment]

Those are from the upgrade? They look like gms logs. Please share the full log from the start to the first error. Thanks!

Yes, those were the GMS logs. OK, I’ve sent the full log download from both the GMS and the upgrade job in a private message.

Please send the system-update logs, not the upgrade logs. GMS indicates it is waiting for the system-update job. The logs you’ve shared do not include this job. GMS is at this state: Executing bootstrap step 1/1 with name WaitForSystemUpdateStep.. - Thanks!
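
One way to look for the system-update job and fetch its logs outside ArgoCD is with `kubectl`. The sketch below only builds the command lines. The job name comes from the message above and may differ in a given release. The namespace `datahub` and the helper `job_commands` are assumptions for illustration.

```python
import shlex


def job_commands(job_name: str, namespace: str) -> list[list[str]]:
    """Build kubectl commands to list jobs and fetch one job's logs."""
    return [
        # List all jobs so the exact system-update job name can be confirmed.
        ["kubectl", "get", "jobs", "--namespace", namespace],
        # Fetch logs from a pod belonging to the named job.
        ["kubectl", "logs", f"job/{job_name}", "--namespace", namespace],
    ]


# Job name as mentioned in the thread; the namespace is a placeholder.
for cmd in job_commands("datahub-datahub-system-update", "datahub"):
    print(shlex.join(cmd))
```

If the first command lists no system-update job at all, the job was never created, which is a different problem from a job that ran and failed.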

Hi <@U03MF8MU5P0>, as I have said above, there is no system-update job. Can you think of any reason why? https://datahubspace.slack.com/archives/C029A3M079U/p1701222814802479?thread_ts=1700708811.149289&cid=C029A3M079U

Is it possible you’re missing the global part? The key is global.datahub.systemUpdate.enabled, per the helm chart template here: https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/templates/datahub-upgrade/datahub-system-update-job.yml#L1C6-L1C6

Missing it in what sense, <@U03MF8MU5P0>?

As far as I can tell, our values file is not overwriting anything related to the update job from the chart template.
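
The two key paths mentioned in this thread, `datahub.systemUpdate.enabled` and `global.datahub.systemUpdate.enabled`, are different locations in a values tree. The sketch below shows that difference with a plain dictionary lookup. Both dictionaries are hypothetical examples, not anyone's real values file, and `lookup` is an illustrative helper. Helm itself merges user values over the chart defaults before rendering, which this sketch does not model.

```python
from typing import Any


def lookup(values: dict, dotted_key: str, default: Any = None) -> Any:
    """Walk a nested dict along a dotted path; return default if any part is missing."""
    node: Any = values
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Hypothetical values with the flag under "datahub" instead of "global.datahub".
misplaced = {"datahub": {"systemUpdate": {"enabled": True}}}
# Hypothetical values with the flag under the path named in the message above.
expected = {"global": {"datahub": {"systemUpdate": {"enabled": True}}}}

KEY = "global.datahub.systemUpdate.enabled"
print(lookup(misplaced, KEY))
print(lookup(expected, KEY))
```

Running the same lookup against the fully merged values for a release shows which value the template would actually read.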