Troubleshooting Elasticsearch Index Creation Issues in a DataHub Deployment on AWS

<@U0667UL20SD> yeah that flag should be set. From your original datahub-upgrade.log file, what happened is that there was an attempt to query an index before the indices were built. I suspect it's mostly harmless. What is the actual problem? Did the upgrade job fail? The log ends at the exception. Also, the relevant parts from the log:

> 2024-01-24 15:18:23,875 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Starting upgrade with id SystemUpdate…
> 2024-01-24 15:18:23,876 [main] INFO c.l.d.u.impl.DefaultUpgradeReport:16 - Executing Step 1/6: BuildIndicesPreStep…
> 2024-01-24 15:18:25,997 [pool-13-thread-1] ERROR c.l.m.s.e.query.ESSearchDAO:105 - Search query failed
> org.opensearch.OpenSearchStatusException: OpenSearch exception [type=index_not_found_exception, reason=no such index [datahubpolicyindex_v2]]

The job seemed to fail as a result of that. I attached a similar run with the full log above: https://datahubspace.slack.com/archives/CV2UVAPPG/p1706123825176009?thread_ts=1706110001.974929&cid=CV2UVAPPG

```
2024-01-24 19:10:58,802 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:491 - Index graph_service_v1 does not exist. Creating
2024-01-24 19:10:58,969 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:496 - Created index graph_service_v1
2024-01-24 19:10:59,519 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:491 - Index containerindex_v2 does not exist. Creating
[...]
2024-01-24 19:11:29,607 [main] ERROR c.l.d.u.s.e.steps.BuildIndicesStep:39 - BuildIndicesStep failed.
java.lang.RuntimeException: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-0 [ACTIVE]
	at com.linkedin.metadata.search.elasticsearch.indexbuilder.EntityIndexBuilders.reindexAll(EntityIndexBuilders.java:34)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.configure(ElasticSearchService.java:45)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.reindexAll(ElasticSearchService.java:55)
	at com.linkedin.datahub.upgrade.system.elasticsearch.steps.BuildIndicesStep.lambda$executable$0(BuildIndicesStep.java:36)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeStepInternal(DefaultUpgradeManager.java:110)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:68)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:42)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.execute(DefaultUpgradeManager.java:33)
	at com.linkedin.datahub.upgrade.UpgradeCli.run(UpgradeCli.java:80)
2024-01-24 19:11:29,608 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Failed Step 2/6: BuildIndicesStep. Failed after 0 retries.
2024-01-24 19:11:29,608 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Exiting upgrade SystemUpdate with failure.
2024-01-24 19:11:29,609 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Upgrade SystemUpdate completed with result FAILED. Exiting...
```

did you try running the job again? it’s basically timing out from ES on creating the indices

Also, maybe try creating an index manually to see whether it really takes 30+ seconds: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-create-index.html#indices-create-api-example
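Something like this, for example (a rough sketch, untested; the endpoint URL and index name are placeholders, and an AWS domain may need auth added):

```python
# Rough sketch: time how long a bare index creation takes against the cluster.
# OPENSEARCH_URL and the index name are placeholders; add auth (basic or SigV4)
# if the AWS domain requires it.
import time
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"  # placeholder endpoint
index = "timing-test"

start = time.monotonic()
resp = requests.put(f"{OPENSEARCH_URL}/{index}", timeout=60)
elapsed = time.monotonic() - start
print(f"HTTP {resp.status_code} in {elapsed:.2f}s")

# Clean up the test index afterwards.
requests.delete(f"{OPENSEARCH_URL}/{index}", timeout=60)
```

If that finishes well under 30 seconds, the slowness is probably not index creation itself.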

Yeah the job fails every attempt. I was able to curl the Opensearch cluster from the GMS pod and create an index essentially instantly. I know misconfigured security groups will result in timeouts, but that seems unlikely since I can curl the cluster successfully and some indices are created by datahub jobs. I’ll keep digging

Yeah that’s odd.

and it was able to create the previous index. wondering if it’s an issue with the opensearch client that we use

Can you check if there’s an idle timeout setting in Elasticsearch that is 30 secs?
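One way to scan the server-side settings for anything around 30s (rough sketch, endpoint is a placeholder). That said, the 30,000 ms in the stack trace looks like a client-side socket timeout, so it may not show up in the cluster settings at all, but this at least rules out the server side:

```python
# Rough sketch: dump cluster settings (including defaults) and print anything
# that looks like a ~30 second timeout. Endpoint is a placeholder.
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"

resp = requests.get(
    f"{OPENSEARCH_URL}/_cluster/settings",
    params={"include_defaults": "true", "flat_settings": "true"},
    timeout=60,
)
settings = resp.json()

for section in ("persistent", "transient", "defaults"):
    for key, value in settings.get(section, {}).items():
        if "timeout" in key and str(value) in ("30s", "30000ms"):
            print(section, key, value)
```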

or maybe there’s a load balancer in front of ES?
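If there is one, a short idle timeout should be easy to catch: keep one connection open, let it sit idle past 30 seconds, then reuse it (rough sketch, endpoint is a placeholder):

```python
# Rough sketch: reuse a single HTTP connection, idle past ~30s, then send a
# second request. An intermediary (load balancer, NAT, firewall) with a short
# idle timeout will typically reset the idle connection. Endpoint is a placeholder.
import time
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"

with requests.Session() as session:  # Session keeps the TCP connection alive
    print("first request:", session.get(f"{OPENSEARCH_URL}/", timeout=60).status_code)
    time.sleep(35)  # idle longer than the suspected 30s timeout
    try:
        print("second request:", session.get(f"{OPENSEARCH_URL}/", timeout=60).status_code)
    except requests.ConnectionError as exc:
        print("connection dropped while idle:", exc)
```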

Just to confirm: on AWS, should I be running the Elasticsearch_7.10 engine or OpenSearch_1.3?
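FWIW, the cluster's root endpoint reports which engine a domain is actually serving (rough sketch, endpoint is a placeholder):

```python
# Rough sketch: the root endpoint reports the engine and version the domain is
# actually serving. OpenSearch includes "distribution": "opensearch" in the
# version block; Elasticsearch 7.10 does not. Endpoint is a placeholder.
import requests

OPENSEARCH_URL = "https://my-opensearch-domain:9200"

version = requests.get(f"{OPENSEARCH_URL}/", timeout=60).json().get("version", {})
print(version.get("distribution", "elasticsearch"), version.get("number"))
```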

To wrap up here: the culprit seemed to be the network routing in AWS. Our k8s cluster was in a different VPC than the managed AWS services (RDS, MSK, OpenSearch), which worked for everything except those socket timeouts. Putting them in the same VPC fixed it.

Good to hear you were able to find the issue <@U0667UL20SD>. It was a good learning, thanks for reporting back!