Troubleshooting Elasticsearch Index Creation Issues in a DataHub Deployment on AWS

<@U0667UL20SD> yeah that flag should be set. From your original datahub-upgrade.log file, what happened is that there was an attempt to query an index before the indices were built. I suspect it's mostly harmless. What is the actual problem? Did the upgrade job fail? The log ends at the exception. Also, the relevant parts from the log:

> 2024-01-24 15:18:23,875 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgrade with id SystemUpdate…
> 2024-01-24 15:18:23,876 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 1/6: BuildIndicesPreStep…
> 2024-01-24 15:18:25,997 [pool-13-thread-1] ERROR c.l.m.s.e.query.ESSearchDAO:105 - Search query failed
> org.opensearch.OpenSearchStatusException: OpenSearch exception [type=index_not_found_exception, reason=no such index [datahubpolicyindex_v2]]

The job seemed to fail as a result of that. I attached a similar run with the full log above: https://datahubspace.slack.com/archives/CV2UVAPPG/p1706123825176009?thread_ts=1706110001.974929&cid=CV2UVAPPG

```
2024-01-24 19:10:58,802 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:491 - Index graph_service_v1 does not exist. Creating
2024-01-24 19:10:58,969 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:496 - Created index graph_service_v1
2024-01-24 19:10:59,519 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:491 - Index containerindex_v2 does not exist. Creating
[...]
2024-01-24 19:11:29,607 [main] ERROR c.l.d.u.s.e.steps.BuildIndicesStep:39 - BuildIndicesStep failed.
java.lang.RuntimeException: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
	at com.linkedin.metadata.search.elasticsearch.indexbuilder.EntityIndexBuilders.reindexAll(EntityIndexBuilders.java:34)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.configure(ElasticSearchService.java:45)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.reindexAll(ElasticSearchService.java:55)
	at com.linkedin.datahub.upgrade.system.elasticsearch.steps.BuildIndicesStep.lambda$executable$0(BuildIndicesStep.java:36)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeStepInternal(DefaultUpgradeManager.java:110)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:68)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:42)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.execute(DefaultUpgradeManager.java:33)
	at com.linkedin.datahub.upgrade.UpgradeCli.run(UpgradeCli.java:80)
2024-01-24 19:11:29,608 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Failed Step 2/6: BuildIndicesStep. Failed after 0 retries.
2024-01-24 19:11:29,608 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Exiting upgrade SystemUpdate with failure.
2024-01-24 19:11:29,609 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Upgrade SystemUpdate completed with result FAILED. Exiting...
```

did you try running the job again? it’s basically timing out from ES on creating the indices

Also, maybe try creating an index manually to see whether it really takes 30+ seconds: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-api-example
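Something like this, for example (a rough sketch, untested; the endpoint URL and index name are placeholders, and an AWS domain may need auth added):

```python
# Rough sketch: time how long a bare index creation takes against the cluster.
# OPENSEARCH_URL and the index name are placeholders; add auth (basic or SigV4)
# if the AWS domain requires it.
import time
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"  # placeholder endpoint
index = "timing-test"

start = time.monotonic()
resp = requests.put(f"{OPENSEARCH_URL}/{index}", timeout=60)
elapsed = time.monotonic() - start
print(f"HTTP {resp.status_code} in {elapsed:.2f}s")

# Clean up the test index afterwards.
requests.delete(f"{OPENSEARCH_URL}/{index}", timeout=60)
```

If that finishes well under 30 seconds, the slowness is probably not index creation itself.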

Yeah the job fails every attempt. I was able to curl the Opensearch cluster from the GMS pod and create an index essentially instantly. I know misconfigured security groups will result in timeouts, but that seems unlikely since I can curl the cluster successfully and some indices are created by datahub jobs. I’ll keep digging

Yeah that’s odd.

and it was able to create the previous index. wondering if it’s an issue with the opensearch client that we use

Can you check if there’s an idle timeout setting in Elasticsearch that is 30 secs?
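One way to scan the server-side settings for anything around 30s (rough sketch, endpoint is a placeholder). That said, the 30,000 ms in the stack trace looks like a client-side socket timeout, so it may not show up in the cluster settings at all, but this at least rules out the server side:

```python
# Rough sketch: dump cluster settings (including defaults) and print anything
# that looks like a ~30 second timeout. Endpoint is a placeholder.
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"

resp = requests.get(
    f"{OPENSEARCH_URL}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
    timeout=60,
)
settings = resp.json()

for section in ("persistent", "transient", "defaults"):
    for key, value in settings.get(section, {}).items():
        if "timeout" in key and str(value) in ("30s", "30000ms"):
            print(section, key, value)
```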

or maybe there’s a load balancer in front of ES?
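If there is one, a short idle timeout should be easy to catch: keep one connection open, let it sit idle past 30 seconds, then reuse it (rough sketch, endpoint is a placeholder):

```python
# Rough sketch: reuse a single HTTP connection, idle past ~30s, then send a
# second request. An intermediary (load balancer, NAT, firewall) with a short
# idle timeout will typically reset the idle connection. Endpoint is a placeholder.
import time
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"

with requests.Session() as session:  # Session keeps the TCP connection alive
    print("first request:", session.get(f"{OPENSEARCH_URL}/", timeout=60).status_code)
    time.sleep(35)  # idle longer than the suspected 30s timeout
    try:
        print("second request:", session.get(f"{OPENSEARCH_URL}/", timeout=60).status_code)
    except requests.ConnectionError as exc:
        print("connection dropped while idle:", exc)
```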

Just to confirm: on AWS, should I be running the Elasticsearch_7.10 engine or OpenSearch_1.3?
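FWIW, the cluster's root endpoint reports which engine a domain is actually serving (rough sketch, endpoint is a placeholder):

```python
# Rough sketch: the root endpoint reports the engine and version the domain is
# actually serving. OpenSearch includes "distribution": "opensearch" in the
# version block; Elasticsearch 7.10 does not. Endpoint is a placeholder.
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"

version = requests.get(f"{OPENSEARCH_URL}/", timeout=60).json().get("version", {})
print(version.get("distribution", "elasticsearch"), version.get("number"))
```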

To wrap up here: the culprit seemed to be the network routing in AWS. Our k8s cluster was in a different VPC than the managed AWS services (RDS, MSK, OpenSearch), which worked for everything except those socket timeouts. Putting them in the same VPC fixed it.

Good to hear you were able to find the issue <@U0667UL20SD>. It was a good learning, thanks for reporting back!