Troubleshooting DataHub Health After Upgrading to 0.12.0 using ArgoCD and Facing Elasticsearch Issues

Original Slack Thread

Hi all,
I’m new to DataHub and have spent the last week stuck on a problem that started after upgrading to 0.12.0 using ArgoCD. I’m using all the defaults for the config, except that we had to set replicas to 2 to stop the Elasticsearch cluster from failing to reach quorum. At the moment datahub-gms is unhealthy and I can’t figure out why. I’ll post some logs in a thread under this. Many thanks in advance!
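For context, a replica override like the one mentioned above would normally live in the values for the prerequisites chart rather than the DataHub chart itself. A minimal sketch is below; the `elasticsearch.replicas` key is an assumption about the datahub-prerequisites layout, not something shown in this thread, so check it against your chart version.

```
# values.yaml for the datahub-prerequisites chart (key names assumed, not taken from the thread)
elasticsearch:
  replicas: 2
```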

```
Warnings: [Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security., [ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.]
{"error":{"root_cause":[{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datahubpolicyindex_v2","node":"-8stBPQaRGWJ5ZkBIwW4yA","reason":{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}}]},"status":400}
		at org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:345)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:320)
		at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1918)
		... 17 common frames omitted
2023-11-15 21:58:20,443 [pool-8-thread-1] ERROR c.d.authorization.DataHubAuthorizer:252 - Failed to retrieve policy urns! Skipping updating policy cache until next refresh. start: 0, count: 30
com.datahub.util.exception.ESQueryException: Search query failed:
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:106)
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.search(ESSearchDAO.java:203)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:121)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:112)
	at com.linkedin.metadata.client.JavaEntityClient.search(JavaEntityClient.java:336)
	at com.datahub.authorization.PolicyFetcher.fetchPolicies(PolicyFetcher.java:51)
	at com.datahub.authorization.PolicyFetcher.fetchPolicies(PolicyFetcher.java:43)
	at com.datahub.authorization.DataHubAuthorizer$PolicyRefreshRunnable.run(DataHubAuthorizer.java:245)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.opensearch.OpenSearchStatusException: OpenSearch exception [type=search_phase_execution_exception, reason=all shards failed]
	at org.opensearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:209)
	at org.opensearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:2235)
	at org.opensearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:2212)
	at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1931)
	at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1884)
	at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1852)
	at org.opensearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1095)
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:99)
	... 13 common frames omitted
	Suppressed: org.opensearch.client.ResponseException: method [POST], host [http://elasticsearch-master:9200], URI [/datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request]
Warnings: [Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security., [ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.]
{"error":{"root_cause":[{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datahubpolicyindex_v2","node":"-8stBPQaRGWJ5ZkBIwW4yA","reason":{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}}]},"status":400}
		at org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:345)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:320)
		at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1918)
		... 17 common frames omitted
2023-11-15 21:58:56,101 [R2 Nio Event Loop-1-1] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

Did the datahub system-update job succeed for you when you synced in Argo? `Failed to create channel...` is the log line that stands out to me, and if you search for it in this Slack it’s mostly issues relating to the system update not succeeding.

Thanks for your assistance <@U05JJ9WESHL>. I did not know about the datahub-upgrade job https://datahubproject.io/docs/docker/datahub-upgrade

I can’t, however, see any such job at all in ArgoCD. I can see it in the Helm charts, though. What am I missing?

Is it enabled in your values file? Here’s what my entry looks like in values.yaml:

```
  enabled: true
  image:
    repository: acryldata/datahub-upgrade
  podSecurityContext: {}
  securityContext: {}
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "-4"
    helm.sh/hook-delete-policy: before-hook-creation
  podAnnotations: {}
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 256Mi
  extraSidecars: []
  extraInitContainers: []
```

And the datahub-system-update-job.yml template in acryldata/datahub-helm (charts/datahub/templates/datahub-upgrade/datahub-system-update-job.yml at commit 6311fce06c11ce21c5c3edabf68b71fff88c027c) also requires it to be enabled in globals:

```
...
  datahub:
    systemUpdate:
      enabled: true
...
```
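Putting the two pieces together, the relevant parts of a values.yaml would look roughly like the sketch below. The nested keys come from the snippets above; treating `datahubUpgrade` and `global` as the enclosing top-level sections is an assumption about the chart layout, so verify it against the chart version you are running.

```
# Sketch only: top-level section names are assumed, nested keys come from the snippets above
datahubUpgrade:
  enabled: true
  image:
    repository: acryldata/datahub-upgrade

global:
  datahub:
    systemUpdate:
      enabled: true
```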

Thanks <@U05JJ9WESHL>. I see those settings in the values file of version 0.3.10 of the chart, and here’s what we have in our local values file:

```
          enabled: true
          image:
            repository: acryldata/datahub-upgrade
            tag: "v0.12.0"
          noCodeDataMigration:
            sqlDbType: "MYSQL"
```
And there are no overrides in our local `global` section.

I just tried syncing again, and yeah, there’s no system-update or upgrade job to be seen.

Do the other jobs run? I have five: system-update, elasticsearch setup, kafka setup, nocode migration, postgres setup.
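For reference, each of those setup jobs is switched on by its own section in the chart’s values, so it is worth checking that none of them has been disabled locally. The sketch below shows the general shape; the section names are a guess at the datahub-helm layout rather than something taken from this thread.

```
# Sketch only: section names assumed, not taken from the thread
elasticsearchSetupJob:
  enabled: true
kafkaSetupJob:
  enabled: true
postgresqlSetupJob:
  enabled: true
```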

Oh hang on, I’m blind. It’s called datahub-datahub-upgrade-job.

It says it’s healthy

Mind you, there don’t seem to be any logs for it since April.