<@U0667UL20SD> yeah, that flag should be set. From your original datahub-upgrade.log, what happened is that there was an attempt to query an index before the indices were built; I suspect it's mostly harmless. What is the actual problem? Did the upgrade job fail? The log ends at the exception. Also, the relevant parts from the log:
> 2024-01-24 15:18:23,875 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgrade with id SystemUpdate…
> 2024-01-24 15:18:23,876 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 1/6: BuildIndicesPreStep…
> 2024-01-24 15:18:25,997 [pool-13-thread-1] ERROR c.l.m.s.e.query.ESSearchDAO:105 - Search query failed
> org.opensearch.OpenSearchStatusException: OpenSearch exception [type=index_not_found_exception, reason=no such index [datahubpolicyindex_v2]]
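(If you want to double-check that case, `_cat/indices` shows whether the index exists yet. This is just a sketch: the endpoint is a placeholder, substitute your own OpenSearch URL.)

```shell
#!/bin/sh
# Placeholder endpoint -- substitute your OpenSearch URL.
OS="${OPENSEARCH_URL:-http://my-opensearch-endpoint:9200}"

# _cat/indices lists matching indices; a 404 here just means SystemUpdate
# has not created datahubpolicyindex_v2 yet.
CMD="curl -s --max-time 10 $OS/_cat/indices/datahubpolicyindex_v2?v"
echo "$CMD"   # echoed so the sketch is safe to run; drop the echo to execute
```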
```
2024-01-24 19:10:58,802 [main] INFO c.l.m.s.e.i.ESIndexBuilder:491 - Index graph_service_v1 does not exist. Creating
2024-01-24 19:10:58,969 [main] INFO c.l.m.s.e.i.ESIndexBuilder:496 - Created index graph_service_v1
2024-01-24 19:10:59,519 [main] INFO c.l.m.s.e.i.ESIndexBuilder:491 - Index containerindex_v2 does not exist. Creating
[...]
2024-01-24 19:11:29,607 [main] ERROR c.l.d.u.s.e.steps.BuildIndicesStep:39 - BuildIndicesStep failed.
java.lang.RuntimeException: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
	at com.linkedin.metadata.search.elasticsearch.indexbuilder.EntityIndexBuilders.reindexAll(EntityIndexBuilders.java:34)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.configure(ElasticSearchService.java:45)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.reindexAll(ElasticSearchService.java:55)
	at com.linkedin.datahub.upgrade.system.elasticsearch.steps.BuildIndicesStep.lambda$executable$0(BuildIndicesStep.java:36)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeStepInternal(DefaultUpgradeManager.java:110)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:68)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:42)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.execute(DefaultUpgradeManager.java:33)
	at com.linkedin.datahub.upgrade.UpgradeCli.run(UpgradeCli.java:80)
2024-01-24 19:11:29,608 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Failed Step 2/6: BuildIndicesStep. Failed after 0 retries.
2024-01-24 19:11:29,608 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Exiting upgrade SystemUpdate with failure.
2024-01-24 19:11:29,609 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Upgrade SystemUpdate completed with result FAILED. Exiting...
```
Yeah, the job fails on every attempt. I was able to curl the OpenSearch cluster from the GMS pod and create an index essentially instantly. I know misconfigured security groups can cause timeouts, but that seems unlikely since the curl succeeds and some indices do get created by the DataHub jobs. I'll keep digging.
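For anyone following along, the connectivity check was roughly this (run from inside the GMS pod; the endpoint below is a placeholder, not our actual cluster):

```shell
#!/bin/sh
# Placeholder endpoint -- substitute your OpenSearch URL.
OS="${OPENSEARCH_URL:-http://my-opensearch-endpoint:9200}"

# Create and then delete a throwaway index; --max-time mirrors the 30s
# socket timeout from the upgrade log, so a hang shows up the same way here.
PUT="curl -s --max-time 30 -XPUT $OS/connectivity-test-index"
DEL="curl -s --max-time 30 -XDELETE $OS/connectivity-test-index"
echo "$PUT"   # echoed for safety; drop the echoes to actually execute
echo "$DEL"
```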
To wrap up here: the culprit was network routing in AWS. Our k8s cluster was in a different VPC than the AWS services (RDS, MSK, OpenSearch), which worked for everything except those socket timeouts. Putting them in the same VPC fixed it.