Troubleshooting DataHub Health After Upgrading to 0.12.0 using ArgoCD and Facing Elasticsearch Issues

Original Slack Thread

Hi all,
I’m new to DataHub and have spent the last week stuck on a problem that started after upgrading to 0.12.0 using ArgoCD. I’m using all the defaults for the config, except that we had to set replicas to 2 to stop the Elasticsearch cluster from failing to reach quorum. At the moment datahub-gms is unhealthy and I can’t figure out why. I’ll post some logs in a thread under this. Many thanks in advance!
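For context, a replica override like the one mentioned above would normally live in the values for the prerequisites chart rather than the DataHub chart itself. A minimal sketch is below; the `elasticsearch.replicas` key is an assumption about the datahub-prerequisites layout, not something shown in this thread, so check it against your chart version.

```
# values.yaml for the datahub-prerequisites chart (key names assumed, not taken from the thread)
elasticsearch:
  replicas: 2
```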

```
Warnings: [Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security., [ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.]
{"error":{"root_cause":[{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datahubpolicyindex_v2","node":"-8stBPQaRGWJ5ZkBIwW4yA","reason":{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}}]},"status":400}
		at org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:345)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:320)
		at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1918)
		... 17 common frames omitted
2023-11-15 21:58:20,443 [pool-8-thread-1] ERROR c.d.authorization.DataHubAuthorizer:252 - Failed to retrieve policy urns! Skipping updating policy cache until next refresh. start: 0, count: 30
com.datahub.util.exception.ESQueryException: Search query failed:
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:106)
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.search(ESSearchDAO.java:203)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:121)
	at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:112)
	at com.linkedin.metadata.client.JavaEntityClient.search(JavaEntityClient.java:336)
	at com.datahub.authorization.PolicyFetcher.fetchPolicies(PolicyFetcher.java:51)
	at com.datahub.authorization.PolicyFetcher.fetchPolicies(PolicyFetcher.java:43)
	at com.datahub.authorization.DataHubAuthorizer$PolicyRefreshRunnable.run(DataHubAuthorizer.java:245)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.opensearch.OpenSearchStatusException: OpenSearch exception [type=search_phase_execution_exception, reason=all shards failed]
	at org.opensearch.rest.BytesRestResponse.errorFromXContent(BytesRestResponse.java:209)
	at org.opensearch.client.RestHighLevelClient.parseEntity(RestHighLevelClient.java:2235)
	at org.opensearch.client.RestHighLevelClient.parseResponseException(RestHighLevelClient.java:2212)
	at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1931)
	at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1884)
	at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1852)
	at org.opensearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1095)
	at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:99)
	... 13 common frames omitted
	Suppressed: org.opensearch.client.ResponseException: method [POST], host [http://elasticsearch-master:9200], URI [/datahubpolicyindex_v2/_search?typed_keys=true&max_concurrent_shard_requests=5&ignore_unavailable=false&expand_wildcards=open&allow_no_indices=true&ignore_throttled=true&search_type=query_then_fetch&batched_reduce_size=512&ccs_minimize_roundtrips=true], status line [HTTP/1.1 400 Bad Request]
Warnings: [Elasticsearch built-in security features are not enabled. Without authentication, your cluster could be accessible to anyone. See https://www.elastic.co/guide/en/elasticsearch/reference/7.17/security-minimal-setup.html to enable security., [ignore_throttled] parameter is deprecated because frozen indices have been deprecated. Consider cold or frozen tiers in place of frozen indices.]
{"error":{"root_cause":[{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[{"shard":0,"index":"datahubpolicyindex_v2","node":"-8stBPQaRGWJ5ZkBIwW4yA","reason":{"type":"query_shard_exception","reason":"[simple_query_string] analyzer [query_word_delimited] not found","index_uuid":"O6R7NsmaQjCFNZSUGrmkTg","index":"datahubpolicyindex_v2"}}]},"status":400}
		at org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:345)
		at org.opensearch.client.RestClient.performRequest(RestClient.java:320)
		at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1918)
		... 17 common frames omitted
2023-11-15 21:58:56,101 [R2 Nio Event Loop-1-1] WARN  c.l.r.t.h.c.c.ChannelPoolLifecycle:139 - Failed to create channel, remote=localhost/127.0.0.1:8080
io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:8080
Caused by: java.net.ConnectException: Connection refused
	at java.base/sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:777)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:337)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:776)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

Did the datahub system-update job succeed for you when you synced in Argo? `Failed to create channel...` is the log line that stands out to me, and if you search for it in this Slack it’s mostly issues relating to the system update not succeeding.

Thanks for your assistance <@U05JJ9WESHL>. I did not know about the datahub-upgrade job https://datahubproject.io/docs/docker/datahub-upgrade

I can’t, however, see any such job at all in ArgoCD. I can see it in the Helm charts, though. What am I missing?

Is it enabled in your values file? Here’s what my entry looks like in values.yaml:

```
  enabled: true
  image:
    repository: acryldata/datahub-upgrade
  podSecurityContext: {}
  securityContext: {}
  annotations:
    helm.sh/hook: pre-install,pre-upgrade
    helm.sh/hook-weight: "-4"
    helm.sh/hook-delete-policy: before-hook-creation
  podAnnotations: {}
  resources:
    limits:
      cpu: 500m
      memory: 512Mi
    requests:
      cpu: 300m
      memory: 256Mi
  extraSidecars: []
  extraInitContainers: []
```

And the datahub-system-update-job.yml template in acryldata/datahub-helm (charts/datahub/templates/datahub-upgrade/datahub-system-update-job.yml at commit 6311fce06c11ce21c5c3edabf68b71fff88c027c) also requires it to be enabled in globals:

```
...
  datahub:
    systemUpdate:
      enabled: true
...
```
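Putting the two pieces together, the relevant parts of a values.yaml would look roughly like the sketch below. The nested keys come from the snippets above; treating `datahubUpgrade` and `global` as the enclosing top-level sections is an assumption about the chart layout, so verify it against the chart version you are running.

```
# Sketch only: top-level section names are assumed, nested keys come from the snippets above
datahubUpgrade:
  enabled: true
  image:
    repository: acryldata/datahub-upgrade

global:
  datahub:
    systemUpdate:
      enabled: true
```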

Thanks <@U05JJ9WESHL>. I see those settings in the values file of version 0.3.10 of the chart, and here’s what we have in our local values file:

```
          enabled: true
          image:
            repository: acryldata/datahub-upgrade
            tag: "v0.12.0"
          noCodeDataMigration:
            sqlDbType: "MYSQL"
```
And there are no overrides in our local `global` section.

I just tried syncing again, and yeah, there’s no system-update or upgrade job to be seen.

Do the other jobs run? I have five: system-update, elasticsearch setup, kafka setup, nocode migration, postgres setup.
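For reference, each of those setup jobs is switched on by its own section in the chart’s values, so it is worth checking that none of them has been disabled locally. The sketch below shows the general shape; the section names are a guess at the datahub-helm layout rather than something taken from this thread.

```
# Sketch only: section names assumed, not taken from the thread
elasticsearchSetupJob:
  enabled: true
kafkaSetupJob:
  enabled: true
postgresqlSetupJob:
  enabled: true
```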

Oh hang on, I’m blind. It’s called datahub-datahub-upgrade-job.

It says it’s healthy

Mind you, there don’t seem to be any logs for it since April.