Setting Up Dedicated Elasticsearch Nodes with the DataHub Helm Chart

Original Slack Thread

Hello all, is there a way to deploy dedicated node types for Elasticsearch with the DataHub Helm chart, e.g. 3 master nodes and 3 data nodes? DataHub creates a lot of shards, which makes ES queries unresponsive because our 3 nodes all share the master, data, and ingest roles. Or do you recommend that we set up our own cluster? Thanks <@U01GCJKA8P9> <@UV14447EU>

<@U03MF8MU5P0> Would love your help here!

The Elasticsearch Helm chart is not maintained by DataHub; we simply point to the http://elastic.co|elastic.co chart https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/Chart.yaml#L9|here. Any configuration supported by the upstream chart will work. The right ES configuration depends on your scale and reliability requirements. Sharding can be controlled through the DataHub Helm chart configuration https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L336|here. I’d also look at the CPU and memory resources; you may simply need additional resources rather than a more complex ES configuration.

So you mean it would be OK if we don’t use the Elasticsearch values provided by datahub-helm, and instead spin up our own ES cluster using the Bitnami chart with, say, 3 master and 6 data nodes? The DataHub setup would still work?

The ES chart in the datahub-helm repo looks like a very basic setup compared with the one Bitnami provides. We want to use the Bitnami one, which exposes more configuration, such as the number of master and data nodes.

You can definitely use your own ES cluster; the Helm datastores are examples to get someone started.
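
For readers who want the 3 master / 6 data node layout discussed above, a minimal sketch of values for the Bitnami elasticsearch chart might look like the following. The key names are given as I understand that chart and are not confirmed anywhere in this thread, so check them against the values.yaml of the chart version you install.

```yaml
# Sketch of values for the Bitnami elasticsearch chart (not the datahub-helm prerequisites chart).
# Assumption: the chart exposes per-role replica counts under these keys.
master:
  replicaCount: 3   # dedicated master-eligible nodes
data:
  replicaCount: 6   # dedicated data nodes
```

With dedicated roles, cluster-state management no longer competes with indexing and search for the same CPU and heap, which is the problem described in the original question.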

<@U03MF8MU5P0> It seems that some components, such as DataHub GMS, have hard-coded values for IP resolution, like:

hostAliases:
  - ip: "192.168.0.104"
    hostnames:
      - "broker"
      - "mysql"
      - "elasticsearch"
      - "neo4j"

Is there anywhere in the docs where I can read more about how to connect to our own cluster?

How are you running DataHub? The helm chart has values https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L317|here to configure the ES connection. If you are using docker-compose then there are environment variables to set https://github.com/datahub-project/datahub/blob/master/docker/datahub-gms/env/docker-without-neo4j.env#L10|here.

We are running in Kubernetes with Helm, but now with a separate ES cluster, since DataHub ingestion needs a lot of shards. The sample setup cannot handle that because every node is a master, data, ingest, and coordinating node at once. But thanks, will check!

Hi <@U03MF8MU5P0>, we have removed the hard-coded settings from https://github.com/acryldata/datahub-helm/blob/935171e26592497818d2b329c886c3a5827ee597/charts/datahub/values.yaml#L627|values.yml and from https://github.com/acryldata/datahub-helm/blob/935171e26592497818d2b329c886c3a5827ee597/charts/datahub/subcharts/datahub-gms/values.yaml#L245|values.yml, but when we run it we see:

2023/09/18 07:17:49 Waiting for: <tcp://cp200mysql01.ddc.nework.net:3306>
2023/09/18 07:17:49 Waiting for: <tcp://prerequisites-kafka:9092>
2023/09/18 07:17:49 Connected to <tcp://prerequisites-kafka:9092>
2023/09/18 07:17:49 Connected to <tcp://cp200mysql01.ddc.network.net:3306>
2023/09/18 07:18:19 Problem with request: Get "<http://elasticsearch:9200>": dial tcp 192.168.0.104:9200: i/o timeout. Sleeping 1s

It still tries to resolve to "192.168.0.104". For now we have hard-coded the correct IP like this:

global:
  graph_service_impl: elasticsearch

  elasticsearch:
    host: "elasticsearch"

hostAliases:
  - ip: "172.16.4.143"
    hostnames:
      - "elasticsearch"

Is there a better way to do it? Thanks

There shouldn’t be any need to modify the hostAliases. You can put the Elasticsearch hostname https://github.com/acryldata/datahub-helm/blob/935171e26592497818d2b329c886c3a5827ee597/charts/datahub/values.yaml#L318|here, along with the port and other settings.

Your host value of "elasticsearch" is likely not what you want there. I would point it to the Kubernetes service’s hostname, which depends on what the Bitnami service name is. Is the Bitnami service named elasticsearch?

Use a service name that is not covered by the hostAliases, or set the hostAliases without elasticsearch if the Bitnami service is also called elasticsearch.
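
As a rough illustration of the advice above, the DataHub chart values could point straight at the external cluster’s Kubernetes service, with no hostAliases entry for it. The service name and namespace below are hypothetical placeholders; substitute whatever `kubectl get svc` reports for your Bitnami release.

```yaml
# Sketch of values for the datahub chart.
# Assumption: the external ES service is reachable at this hypothetical DNS name.
global:
  graph_service_impl: elasticsearch
  elasticsearch:
    # Kubernetes service DNS name, chosen so that it does not collide with any hostAliases hostname
    host: "my-es-elasticsearch.search.svc.cluster.local"
    port: "9200"
```

Because the name resolves through cluster DNS, the pods follow the service if its IP changes, which a hard-coded hostAliases IP would not.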

<@U03MF8MU5P0> thanks, will try!

It works now!

We are seeing 999 of 1000 shards in use on our OpenSearch cluster. We’ve never modified those settings either. Are other people setting explicit values based on the AWS recommendations for sizing OpenSearch?

DataHub definitely doesn’t require that many shards but it would depend on how you’re configuring the default sharding. Out of the box DataHub uses 1 shard per index with ~60 indices.

Cool. I also see this in the values file at /prerequisites/charts/elasticsearch/values.yaml:

I’m guessing this is where we would make our changes to specify our OpenSearch Endpoint, ReplicaCount, and max number of shards, etc…?

lifecycle: {}
  # preStop:
  #   exec:
  #     command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
  # postStart:
  #   exec:
  #     command:
  #       - bash
  #       - -c
  #       - |
  #         #!/bin/bash
  #         # Add a template to adjust number of shards/replicas
  #         TEMPLATE_NAME=my_template
  #         INDEX_PATTERN="logstash-*"
  #         SHARD_COUNT=8
  #         REPLICA_COUNT=1
  #         ES_URL=http://localhost:9200
  #         while [[ "$(curl -s -o /dev/null -w '%{http_code}\n' $ES_URL)" != "200" ]]; do sleep 1; done
  #         curl -XPUT "$ES_URL/_template/$TEMPLATE_NAME" -H 'Content-Type: application/json' -d'{"index_patterns":['\""$INDEX_PATTERN"\"'],"settings":{"number_of_shards":'$SHARD_COUNT',"number_of_replicas":'$REPLICA_COUNT'}}'

Not at the time we are setting up OpenSearch.

In case anyone finds this later: that is an example of configuring an index template. By default there is 1 shard per index. Configuration can be added to increase the shard count as needed per https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L349|index. As of today, and depending on the number of custom entities, there are just under 60 indices. By default each index has 1 shard and 1 replica for a minimal instance.
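
A per-index shard override in the DataHub chart values might look roughly like this. The key names follow my recollection of the commented examples near the linked section of values.yaml, and the shard counts are purely illustrative, so treat the whole block as an assumption to verify against your chart version.

```yaml
# Sketch of values for the datahub chart; key names and counts are assumptions, not confirmed in this thread.
global:
  elasticsearch:
    index:
      # Changing the shard count of an existing index requires a reindex
      enableSettingsReindex: true
      # Non-entity indices, keyed by index name (JSON string)
      settingsOverrides: '{"graph_service_v1":{"number_of_shards":"5"}}'
      # Entity indices, keyed by entity name (JSON string)
      entitySettingsOverrides: '{"dataset":{"number_of_shards":"10"}}'
```

Raising shards only for the few large indices keeps the total well under the cluster’s shard limit, since the remaining indices stay at 1 shard each.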