Troubleshooting ElasticSearch Re-indexing Failure Process

Original Slack Thread

Good morning everyone,
we are migrating from 0.10.5 to 0.12 and we are facing some issues with the re-indexing of ElasticSearch. We migrated successfully in all environments, except one. The dataprocessinstanceindex_v2 seems to have almost 93 million entities. We tried several runs of the upgrade job and it fails at different points of reindexing. We saw it fail at 80%, 60%, 5%, 27%, 38%, etc… The job will restart from the beginning after each failure.

From the logs I cannot find anything that explains the failure.

"2024-01-28 02:04:53,526 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Exiting upgrade SystemUpdate with failure."
"2024-01-28 02:04:53,526 [main] INFO  c.l.d.u.impl.DefaultUpgradeReport:16 - Failed Step 2/6: BuildIndicesStep. Failed after 0 retries."
"java.lang.RuntimeException: java.lang.RuntimeException: Reindex from dataprocessinstanceindex_v2 to dataprocessinstanceindex_v2_1706378590787 failed. Document count 92367364 != 28309669\n\tat com.linkedin.metadata.search.elasticsearch.indexbuilder.ESIndexBuilder.buildIndex(ESIndexBuilder.java:227)\n\tat com.linkedin.metadata.search.elasticsearch.indexbuilder.EntityIndexBuilders.reindexAll(EntityIndexBuilders.java:32)\n\tat com.linkedin.metadata.search.elasticsearch.ElasticSearchService.configure(ElasticSearchService.java:45)\n\tat com.linkedin.metadata.search.elasticsearch.ElasticSearchService.reindexAll(ElasticSearchService.java:55)\n\tat com.linkedin.datahub.upgrade.system.elasticsearch.steps.BuildIndicesStep.lambda$executable$0(BuildIndicesStep.java:36)\n\tat com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeStepInternal(DefaultUpgradeManager.java:110)\n\tat com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:68)\n\tat com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:42)\n\tat com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.execute(DefaultUpgradeManager.java:33)\n\tat com.linkedin.datahub.upgrade.UpgradeCli.run(UpgradeCli.java:80)\n\tat org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:768)\n\tat org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)\n\tat org.springframework.boot.SpringApplication.run(SpringApplication.java:314)\n\tat org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:164)\n\tat com.linkedin.datahub.upgrade.UpgradeCliApplication.main(UpgradeCliApplication.java:23)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\n\tat java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\n\tat java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\n\tat java.base/java.lang.reflect.Method.invoke(Method.java:566)\n\tat org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)\n\tat org.springframework.boot.loader.Launcher.launch(Launcher.java:108)\n\tat org.springframework.boot.loader.Launcher.launch(Launcher.java:58)\n\tat org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:65)\nCaused by: java.lang.RuntimeException: Reindex from dataprocessinstanceindex_v2 to dataprocessinstanceindex_v2_1706378590787 failed. Document count 92367364 != 28309669\n\tat com.linkedin.metadata.search.elasticsearch.indexbuilder.ESIndexBuilder.reindex(ESIndexBuilder.java:352)\n\tat com.linkedin.metadata.search.elasticsearch.indexbuilder.ESIndexBuilder.buildIndex(ESIndexBuilder.java:225)\n\t... 22 common frames omitted"
"2024-01-28 02:04:53,521 [main] ERROR c.l.d.u.s.e.steps.BuildIndicesStep:39 - BuildIndicesStep failed."
"WARN [RestClient] request [DELETE <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2_1706378590787?master_timeout=30s&amp;ignore_unavailable=false&amp;expand_wildcards=open%2Cclosed&amp;allow_no_indices=true&amp;ignore_throttled=false&amp;timeout=30s>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"[2024-01-28T02:04:51,428][INFO ][o.e.c.m.MetadataDeleteIndexService] [gott-datahub-deps-68bb56544-7p86j] [dataprocessinstanceindex_v2_1706378590787/_NxHG_PpQRSRGPAHFjJ3cw] deleting index"
"2024-01-28 02:04:51,372 [main] ERROR c.l.m.s.e.i.ESIndexBuilder:357 - Failed to reindex dataprocessinstanceindex_v2 to dataprocessinstanceindex_v2_1706378590787: Exception java.lang.RuntimeException: Reindex from dataprocessinstanceindex_v2 to dataprocessinstanceindex_v2_1706378590787 failed. Document count 92367364 != 28309669"
"2024-01-28 02:04:51,371 [main] ERROR c.l.m.s.e.i.ESIndexBuilder:349 - Index: dataprocessinstanceindex_v2 - Post-reindex document count is different, source_doc_count: 92367364 reindex_doc_count: 28309669"
"2024-01-28 02:03:51,369 [main] WARN  c.l.m.s.e.i.ESIndexBuilder:321 - Task: RQTILb-VQ_OIK4SK5SE89g:1686 - Document counts do not match 92367364 != 28309669. Complete: 30.648994%"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2_1706378590787/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/datahubpolicyindex_v2/_search?typed_keys=true&amp;max_concurrent_shard_requests=5&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true&amp;ignore_throttled=true&amp;search_type=query_then_fetch&amp;batched_reduce_size=512&amp;ccs_minimize_roundtrips=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2_1706378590787/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2_1706378590787/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"2024-01-28 02:02:51,341 [main] INFO  c.l.m.s.e.i.ESIndexBuilder:307 - Task: RQTILb-VQ_OIK4SK5SE89g:1686 - Reindexing from dataprocessinstanceindex_v2 to dataprocessinstanceindex_v2_1706378590787 in progress..."
"2024-01-28 02:01:51,340 [main] WARN  c.l.m.s.e.i.ESIndexBuilder:321 - Task: RQTILb-VQ_OIK4SK5SE89g:1686 - Document counts do not match 92367364 != 28196785. Complete: 30.526785%"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2_1706378590787/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/dataprocessinstanceindex_v2/_count?ignore_throttled=true&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"
"WARN [RestClient] request [POST <http://gott-datahub-deps:9200/datahubpolicyindex_v2/_search?typed_keys=true&amp;max_concurrent_shard_requests=5&amp;ignore_unavailable=false&amp;expand_wildcards=open&amp;allow_no_indices=true&amp;ignore_throttled=true&amp;search_type=query_then_fetch&amp;batched_reduce_size=512&amp;ccs_minimize_roundtrips=true>] returned 1 warnings: [299 Elasticsearch-7.17.6-f65e9d338dc1d07b642e14a27f338990148ee5b6 \"[ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.\"]"```
Can some more familiar with re-index process give any idea of what might be happening?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Which DataHub version are you using? (e.g. 0.12.0)
  2. Please post any relevant error logs on the thread!

DataHub is deployed in K8s using custom manifests, if that helps

the other environments where we successfully upgraded, have a much less data. The index had a max of around 12 million entities in the other environments

The error message is fairly clear:

The document count is off when reindexing. It’s unclear though why it’s so far off. We do include an option to skip this validation, but it may result in data loss if it’s truly not copying over all data so you might want to take a back-up before going forward with that (option is: https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L433). The cloning occurs on the ES side and it should be copying all documents, but perhaps there’s something going wrong there.

Hi, thank you for your reply.
Looking at the logs, it checks the document count, and if they are not equal, it will send more data, check the document count, send more data, check document count, … It will do this until the document count is the same.

What is not clear to me is why the process simply fails midway. And from our multiple tries, it failed at different progress count.
What am I missing?

I was looking again at <datahub/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java at ff0818bad722df4b618e106b523f78b7ef9242e0 · datahub-project/datahub · GitHub code> for the upgrade job and only now I noticed there is timeout for the re-index process of 8 hours.
If I am reading the code correctly, when this timeout is reached, it doesn’t mention this as the cause anywhere in the logs.
Could this be the reason why it fails midway?

I think it might be helpful to reiterate what <@UV5UEC3LN>:

> The cloning occurs on the ES side and it should be copying all documents, but perhaps there’s something going wrong there
Datahub uses Elasticsearch’s <https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html|Reindex API> to copy documents from a source to destination index. If the document count at the end is different, this is not something that Datahub has visibility into why.

You might want to look at Elasticsearch/other logs to determine:

  1. If Reindex is taking over 8 hours
  2. Why the document count is different