Troubleshooting ElasticsearchSetupJob on AWS OpenSearch and Terraforming Policy Creation

Original Slack Thread

Hi team, when I try to run elasticsearchSetupJob in our k8s environment, I get the following error:

```
going to use default elastic headers
not using any prefix

 datahub_analytics_enabled: true

>>> GET _opendistro/_ism/policies/datahub_usage_event_policy response code is 400
>>> failed to GET _opendistro/_ism/policies/datahub_usage_event_policy ! -> exiting
2023/09/14 18:13:15 Command exited with error: exit status 1
```
Our ES installation is on AWS OpenSearch, and here is my config:
```
  elasticsearchSetupJob:
    enabled: true
    image:
      repository: image
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
      requests:
        cpu: 300m
        memory: 256Mi
    extraInitContainers: []
    podSecurityContext:
      fsGroup: 1000
    securityContext:
      runAsUser: 1000
    extraEnvs:
      - name: USE_AWS_ELASTICSEARCH
        value: "true"
      - name: OPENSEARCH_USE_AWS_IAM_AUTH
        value: "true"
```
The host configs are as follows:
```
        elasticsearch:
          host: "hostname"
          port: "443"
          skipcheck: "false"
          insecure: "false"
          useSSL: "true"
          region: "us-west-2"
```
Do I have to add anything else to my configs to get past this error?

<@U02TYQ4SPPD> can you help with this?

This is most likely a permissions issue: the failing request is a GET, so there isn't much about the request itself that can be invalid. The elasticsearch-setup job is expected to be able to create the policy for usage events, which power the product's analytics page. It requires higher permissions than the rest of the containers and doesn't support IAM authentication, so the OPENSEARCH_USE_AWS_IAM_AUTH flag is not applicable to the setup job.

The most typical scenario here is that users control their ES instance outside of the setup job, following their own ES/OS management process. They typically disable the elasticsearch-setup job and perform the setup actions manually or via their automation tools. Assuming you've already created the elasticsearch user, it's just a matter of setting up the 3 resources in this section of create-indices.sh (https://github.com/datahub-project/datahub/blob/master/docker/elasticsearch-setup/create-indices.sh#L113). This only needs to be executed once; after that, the normal user is typically granted enough permissions to manage the other application indices with the prefix policy.

Links to the policies used by the script:
https://github.com/datahub-project/datahub/blob/99d7eb756c09a3313a4c1bda6f96a0953004b58c/metadata-service/restli-servlet-impl/src/main/resources/index/usage-event/aws_es_ism_policy.json#L4
https://github.com/datahub-project/datahub/blob/99d7eb756c09a3313a4c1bda6f96a0953004b58c/metadata-service/restli-servlet-impl/src/main/resources/index/usage-event/aws_es_index_template.json#L4

Those files are templates, so be sure to replace PREFIX.
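For anyone doing this setup by hand, the PREFIX substitution and the two resources named in this thread can be sketched as follows. This is a minimal Python sketch, not the setup job's actual code: the `_opendistro/_ism/policies/...` path is taken from the log earlier in the thread, while the `_template/...` path, the `render_template` and `setup_requests` helpers, and the trimmed `EXAMPLE_TEMPLATE` are assumptions made for illustration. Use the real JSON files from the repository, and check create-indices.sh for the exact requests it sends.

```python
import json
from typing import List, Tuple

# Hypothetical, trimmed stand-in for a template file. The real JSON lives in
# the DataHub repository and should be used instead.
EXAMPLE_TEMPLATE = '{"index_patterns": ["PREFIXdatahub_usage_event-*"]}'


def render_template(template_text: str, prefix: str) -> dict:
    """Replace the literal PREFIX placeholder, then parse the result as JSON."""
    return json.loads(template_text.replace("PREFIX", prefix))


def setup_requests(prefix: str = "") -> List[Tuple[str, str]]:
    """Return (method, path) pairs for the resources to create once, by hand.

    Assumption: `prefix` is passed in its final form (for example
    "instance_a_"), or left empty when no prefix is used.
    """
    return [
        # Path as seen in the setup job's log output.
        ("PUT", f"_opendistro/_ism/policies/{prefix}datahub_usage_event_policy"),
        # Assumed path: the legacy index template endpoint.
        ("PUT", f"_template/{prefix}datahub_usage_event_index_template"),
    ]


if __name__ == "__main__":
    print(render_template(EXAMPLE_TEMPLATE, "instance_a_"))
    for method, path in setup_requests("instance_a_"):
        print(method, path)
```

The rendered JSON would be the body of each PUT, sent with credentials that have enough permissions to create these resources.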

Alternatively, you can disable analytics tracking entirely with DATAHUB_ANALYTICS_ENABLED=false; the analytics metrics will then not be populated on the in-product dashboard.

Got it. Thanks <@U02TYQ4SPPD>. This is super helpful

I was referring to this doc (https://datahubproject.io/docs/deploy/aws/#elasticsearch-service) and added the OPENSEARCH_USE_AWS_IAM_AUTH property to my yaml.

Let me try setting up the ISM policy and datahub_usage_event_index_template and see if it works.

May I know what this index template and ISM policy are used for?

Also, what is PREFIX used for?

The prefix is an empty string “” by default. However, if you are managing several datahub instances on the same OpenSearch cluster, you can separate them by adding a prefix string. For example, with the prefixes instance_a and instance_b, each datahub instance gets its own indices, such as instance_a_index1 and instance_b_index1.
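The naming scheme described here can be illustrated with a short Python sketch. The `index_name` helper is hypothetical, and the underscore separator is inferred from the instance_a_index1 example in this message rather than taken from DataHub's code.

```python
def index_name(base: str, prefix: str = "") -> str:
    """Join an optional instance prefix and a base index name with an underscore."""
    return f"{prefix}_{base}" if prefix else base


if __name__ == "__main__":
    # Two DataHub instances sharing one cluster end up with separate indices.
    for instance in ("instance_a", "instance_b"):
        print(index_name("index1", instance))
    # With the default empty prefix, the index name is unchanged.
    print(index_name("index1"))
```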

got it.

@David Leifker, while setting DATAHUB_ANALYTICS_ENABLED=false helped the elasticsearchSetupJob run, the next step, datahub-datahub-system-update-job, failed with the following error in the BuildIndicesPreStep stage:

org.elasticsearch.ElasticsearchStatusException: method [HEAD], host [http://hostname:80], URI [/graph_service_v1?ignore_throttled=false&ignore_unavailable=false&expand_wildcards=open%2Cclosed&allow_no_indices=false], status line [HTTP/1.1 400 Bad Request]
Any pointers?

I invoked the URL manually from our env and see the query params are not getting recognized
contains unrecognized parameter: [ignore_unavailable]"},"status":400}
What version of ES is recommended for datahub v0.10.5?

we use AWS open search service 6.5

Looking deeper into the logs, we see the following error in the systemUpdate job, prior to the BuildIndicesPreStep stage. Do you think this might be what causes the BuildIndicesPreStep stage to fail?
2023-09-19 13:46:39,377 [kafka-producer-network-thread | producer-2] ERROR o.a.kafka.common.utils.KafkaThread:51 - Uncaught exception in thread 'kafka-producer-network-thread | producer-2':
java.lang.OutOfMemoryError: Java heap space
    at java.base/java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:61)
    at java.base/java.nio.ByteBuffer.allocate(ByteBuffer.java:348)
    at org.apache.kafka.common.memory.MemoryPool$1.tryAllocate(MemoryPool.java:30)
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:113)
    at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:447)
    at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:397)
    at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:678)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:580)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:485)
    at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:550)
    at org.apache.kafka.clients.producer.internals.Sender.runOnce(Sender.java:324)
    at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:239)
    at java.base/java.lang.Thread.run(Thread.java:829)

<@UV5UEC3LN> This is what I was talking about in our office hours.

> we use AWS open search service 6.5
Does this mean you are running in legacy mode for Elasticsearch 6.5? Or was it a typo, meaning you're using OpenSearch 2.5? Both are unsupported versions, though: we support OpenSearch 1.x and Elasticsearch from 7.10 up to, but not including, 8 (i.e. 7.17 or other versions in that range should also be fine).
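The support range stated in this message can be written down as a small check. This is a Python sketch under the assumption that the distribution name and version string have already been read from the cluster; `is_supported` is a hypothetical helper, not part of DataHub.

```python
def is_supported(distribution: str, version: str) -> bool:
    """Return True if the version falls in the range given in this thread:
    OpenSearch 1.x, or Elasticsearch from 7.10 up to (not including) 8."""
    parts = version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    dist = distribution.lower()
    if dist == "opensearch":
        return major == 1
    if dist == "elasticsearch":
        return major == 7 and minor >= 10
    return False


if __name__ == "__main__":
    # The two readings of "6.5" discussed in the thread, plus two supported cases.
    for dist, ver in [("elasticsearch", "6.5"), ("opensearch", "2.5"),
                      ("opensearch", "1.3"), ("elasticsearch", "7.17")]:
        print(dist, ver, is_supported(dist, ver))
```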

This is what we see in our AWS OpenSearch UI: [attachment]