Setting up Dedicated Elasticsearch Nodes with Datahub Helm Chart

user-5 · March 4, 2024, 3:14pm

Hello all, is there a way to deploy dedicated node types for Elasticsearch with the DataHub Helm chart, e.g. 3 master nodes and 3 data nodes? DataHub creates a lot of shards, which makes ES queries unresponsive because our 3 nodes all share the master, data, and ingest roles. Or do you recommend that we set up our own cluster? Thanks <@U01GCJKA8P9> <@UV14447EU>

datahub_team · March 4, 2024, 3:14pm

<@U03MF8MU5P0> Would love your help here!

user-6 · March 4, 2024, 3:14pm

The Elasticsearch Helm chart is not maintained by DataHub; we simply point to the elastic.co chart (https://github.com/acryldata/datahub-helm/blob/master/charts/prerequisites/Chart.yaml#L9). Any configuration supported by the upstream chart will work. The right ES configuration depends on your scale and reliability requirements. Sharding can be controlled through the DataHub Helm chart configuration (https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L336). I'd also look at the CPU and memory resources: you may simply need additional resources rather than a more complex ES configuration.

user-3 · March 4, 2024, 3:14pm

So you mean it would be OK if we don't use the Elasticsearch values provided by datahub-helm and instead spin up our own ES cluster using the Bitnami charts, with, say, 3 master and 6 data nodes? The DataHub setup would still work then?

user-3 · March 4, 2024, 3:14pm

The ES chart in the datahub-helm repo looks like a very basic setup compared with the one Bitnami provides. We want to use the Bitnami one, which has more configuration options for the number of master and data nodes.

user-6 · March 4, 2024, 3:14pm

You can definitely use your own ES cluster; the Helm datastores are examples to get someone started.
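As a hedged illustration of that setup, here is a minimal values sketch for the prerequisites chart that turns off the bundled Elasticsearch so that an externally managed cluster (for example, one installed from the Bitnami chart) can be used instead. It assumes the prerequisites values file exposes an `elasticsearch.enabled` toggle; verify the key against the chart version you deploy.

```yaml
# Values for the datahub-helm "prerequisites" chart (sketch, not a full file).
elasticsearch:
  # Assumption: this toggle exists in your chart version.
  # false = do not deploy the example Elasticsearch; bring your own cluster.
  enabled: false
```

The other prerequisites (Kafka, MySQL, and so on) stay at whatever your values file already specifies.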

user-5 · March 4, 2024, 3:14pm

<@U03MF8MU5P0> It seems like some components, such as DataHub GMS, have hard-coded values for IP resolution, like:

    hostAliases:
      - ip: "192.168.0.104"
        hostnames:
          - "broker"
          - "mysql"
          - "elasticsearch"
          - "neo4j"

user-5 · March 4, 2024, 3:14pm

Is there anywhere in the docs where I can read more about how to connect to our own cluster?

datahub_team · March 4, 2024, 3:14pm

How are you running DataHub? The Helm chart has values to configure the ES connection (https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L317). If you are using docker-compose, there are environment variables to set instead (https://github.com/datahub-project/datahub/blob/master/docker/datahub-gms/env/docker-without-neo4j.env#L10).

user-5 · March 4, 2024, 3:14pm

We are running in Kubernetes with Helm, but now with a separate ES cluster, since a lot of shards are needed for DataHub ingestion. The sample setup cannot handle that, since every node is a master, data, ingest, and coordinating node at once. But thanks, will check!

user-5 · March 4, 2024, 3:14pm

Hi <@U03MF8MU5P0>, we have removed the hard-coded settings from values.yml (https://github.com/acryldata/datahub-helm/blob/935171e26592497818d2b329c886c3a5827ee597/charts/datahub/values.yaml#L627) and from the datahub-gms subchart's values.yml (https://github.com/acryldata/datahub-helm/blob/935171e26592497818d2b329c886c3a5827ee597/charts/datahub/subcharts/datahub-gms/values.yaml#L245), but when we run it we get:

    2023/09/18 07:17:49 Waiting for: tcp://cp200mysql01.ddc.nework.net:3306
    2023/09/18 07:17:49 Waiting for: tcp://prerequisites-kafka:9092
    2023/09/18 07:17:49 Connected to tcp://prerequisites-kafka:9092
    2023/09/18 07:17:49 Connected to tcp://cp200mysql01.ddc.network.net:3306
    2023/09/18 07:18:19 Problem with request: Get "http://elasticsearch:9200": dial tcp 192.168.0.104:9200: i/o timeout. Sleeping 1s

It still tries to resolve to "192.168.0.104". For now we have hard-coded the correct IP like this:

    global:
      graph_service_impl: elasticsearch

      elasticsearch:
        host: "elasticsearch"

      hostAliases:
        - ip: "172.16.4.143"
          hostnames:
            - "elasticsearch"

Is there a better way to do it? Thanks

datahub_team · March 4, 2024, 3:14pm

There shouldn't be any need to modify the hostAliases. You can put the Elasticsearch host name, along with the port and other settings, in the Elasticsearch section of values.yaml (https://github.com/acryldata/datahub-helm/blob/935171e26592497818d2b329c886c3a5827ee597/charts/datahub/values.yaml#L318).

datahub_team · March 4, 2024, 3:14pm

A host value of "elasticsearch" is likely not what you want there. I would point it to the Kubernetes service's hostname, which depends on what the Bitnami service is called. Is the Bitnami service name "elasticsearch"?

user-6 · March 4, 2024, 3:14pm

Use a service name that is not covered by the hostAliases, or set the hostAliases without "elasticsearch" if the Bitnami service is also called elasticsearch.
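In the datahub chart values, that advice could look roughly like the sketch below. The host is a hypothetical Kubernetes service DNS name of the form `<service>.<namespace>.svc.cluster.local`, not the actual Bitnami service name. The `host` and `port` keys are the ones from the `global.elasticsearch` section discussed in this thread.

```yaml
# Values for the datahub chart (sketch, not a full file).
# Note: no hostAliases entry lists "elasticsearch" here. hostAliases entries are
# written into each pod's /etc/hosts, so a stale entry would pin that name to an
# IP regardless of what cluster DNS says.
global:
  graph_service_impl: elasticsearch
  elasticsearch:
    # Hypothetical service name and namespace; list yours with `kubectl get svc`.
    host: "my-es-elasticsearch.search.svc.cluster.local"
    port: "9200"
```

Choosing a fully qualified service name also avoids any clash with the short names ("broker", "mysql", "elasticsearch", "neo4j") used in the chart's example hostAliases.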

user-5 · March 4, 2024, 3:14pm

<@U03MF8MU5P0> thanks, will try!

user-5 · March 4, 2024, 3:14pm

It works now!

user-2 · March 4, 2024, 3:14pm

We are seeing 999 of 1000 shards in use on our OpenSearch cluster. We’ve never modified those settings either. Are other people setting explicit values based on the AWS recommendations for sizing OpenSearch?

user-6 · March 4, 2024, 3:14pm

DataHub definitely doesn’t require that many shards but it would depend on how you’re configuring the default sharding. Out of the box DataHub uses 1 shard per index with ~60 indices.

user-2 · March 4, 2024, 3:14pm

Cool, also I see this in the values file for the /prerequisites/charts/elasticsearch/values.yaml

I’m guessing this is where we would make our changes to specify our OpenSearch Endpoint, ReplicaCount, and max number of shards, etc…?

    lifecycle:
      {}
      # preStop:
      #   exec:
      #     command: ["/bin/sh", "-c", "echo Hello from the postStart handler > /usr/share/message"]
      # postStart:
      #   exec:
      #     command:
      #       - bash
      #       - -c
      #       - |
      #         #!/bin/bash
      #         # Add a template to adjust number of shards/replicas
      #         TEMPLATE_NAME=my_template
      #         INDEX_PATTERN="logstash-*"
      #         SHARD_COUNT=8
      #         REPLICA_COUNT=1
      #         ES_URL=http://localhost:9200
      #         while [[ "$(curl -s -o /dev/null -w '%{http_code}\n' $ES_URL)" != "200" ]]; do sleep 1; done
      #         curl -XPUT "$ES_URL/_template/$TEMPLATE_NAME" -H 'Content-Type: application/json' -d'{"index_patterns":['\""$INDEX_PATTERN"\"'],"settings":{"number_of_shards":'$SHARD_COUNT',"number_of_replicas":'$REPLICA_COUNT'}}'

Not at the time we are setting up OpenSearch.

user-6 · March 4, 2024, 3:14pm

In case anyone finds this later: that is an example of configuring an index template. By default the number of shards is 1 per index, and configuration can be added to increase the shard count per index as needed (https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/values.yaml#L349). As of today, and depending on the number of custom entities, there are just under 60 indices. There is by default 1 shard and 1 replica for a minimal instance.
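A sketch of what such a per-index override might look like in the datahub chart values. The key names `settingsOverrides` and `entitySettingsOverrides` under `global.elasticsearch.index` are recalled from the chart's values file and are an assumption here, as are the index name, the entity name, and the shard counts. Check the values.yaml linked above for the exact keys in your chart version.

```yaml
# Values for the datahub chart (sketch; key names are assumptions to verify).
global:
  elasticsearch:
    index:
      # Assumed: JSON string mapping an index name to index settings.
      settingsOverrides: '{"datahub_usage_event":{"number_of_shards":"3"}}'
      # Assumed: JSON string mapping an entity name to index settings.
      entitySettingsOverrides: '{"dataset":{"number_of_shards":"3"}}'
```

Indices that are not listed keep the default of 1 shard, so the cluster-wide shard count only grows for the indices that actually need it.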
