Adjusting the Number of Primary Shards for an Elasticsearch Index in datahub-helm

Original Slack Thread

Hello DataHub community,
How do we set the number of primary shards for the Elasticsearch index datahub_usage_event-XXXXXX to 1, down from the default value of 5, using datahub-helm?

By default, there is 1 shard per index and 1 replica.
To change this for an already deployed DataHub, resharding must be enabled by setting
ELASTICSEARCH_INDEX_BUILDER_SETTINGS_REINDEX=true
and the shard count must be overridden for the specific index, for example
ELASTICSEARCH_INDEX_BUILDER_ENTITY_SETTINGS_OVERRIDES='{"datasetindex_v2":{"number_of_shards":"10"}}'
These settings go in values.yaml:
https://github.com/acryldata/datahub-helm/blob/ecb168d9255435476311c6d6808bf214f587e72c/charts/datahub/values.yaml#L412
Enable the datahubSystemUpdate job when making these config changes. When the chart values for the ES shards change, datahubSystemUpdate triggers a reindex based on the diff between the expected and actual shard settings.
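
The override value is a JSON string, so quoting mistakes are easy to make when typing it by hand. Here is a minimal sketch, assuming you want to generate the value in Python before pasting it into your Helm values. The helper name `build_shard_overrides` is hypothetical; the index name and shard count are the ones from the message above.

```python
import json


def build_shard_overrides(shards_by_index):
    """Build a JSON overrides string in the format shown in the thread.

    shards_by_index maps an index name to the desired number of primary
    shards. The thread passes shard counts as strings, so this does too.
    """
    overrides = {
        index: {"number_of_shards": str(count)}
        for index, count in shards_by_index.items()
    }
    # Compact separators reproduce the spacing used in the thread.
    return json.dumps(overrides, separators=(",", ":"))


if __name__ == "__main__":
    # Value for ELASTICSEARCH_INDEX_BUILDER_ENTITY_SETTINGS_OVERRIDES
    print(build_shard_overrides({"datasetindex_v2": 10}))
```

Generating the string this way keeps the quoting consistent when several indices are overridden at once.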

Thanks <@U0445MUD81W>. I tried the config '{"datahub_usage_event":{"number_of_shards":"1"},"system_metadata_service_v1":{"number_of_shards":"5"}}' for datahub_usage_event_XXXXX, but it still has 5 shards. This index is created frequently, with a number as a suffix.

Did that solve your issue?
Did you try "number_of_shards":"5" or "number_of_shards":"1"?

It did not solve my issue. The datahub_usage_event_xxxx index still gets created with 5 shards, even after adding the "number_of_shards":"1" config.

Maybe what you are seeing are rollovers? The default number of shards for datahub_usage_event is 1.

https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-rollover-index.html#increment-index-names-for-alias
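
To illustrate the rollover naming described at that link: when the current index name ends in a hyphen and a number, a rollover increments the number and zero-pads it to six digits. The sketch below only mimics that naming rule locally; `next_rollover_name` is a hypothetical helper, not part of Elasticsearch or DataHub.

```python
import re


def next_rollover_name(index_name):
    """Return the name a rollover would give the next index in the series."""
    match = re.fullmatch(r"(.*-)(\d+)", index_name)
    if match is None:
        raise ValueError(f"{index_name!r} does not end in '-<number>'")
    prefix, number = match.groups()
    # The counter is zero-padded to six digits regardless of the old width.
    return f"{prefix}{int(number) + 1:06d}"


if __name__ == "__main__":
    print(next_rollover_name("datahub_usage_event-000004"))
```

This is why a series such as datahub_usage_event-000001, -000002, and so on can appear without anyone creating those indices by hand.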

It might be helpful to explain here what is the problem that you are trying to solve.

The config is:
settingsOverrides: '{"datahub_usage_event": {"number_of_shards":"1", "number_of_replicas":"1"}}'

```
dices | grep datahub_usage_event-
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7788  100  7788    0     0   124k      0 --:--:-- --:--:-- --:--:--  126k
green open datahub_usage_event-000004                               H-xxxxxx 5 1   0  0     2kb     1kb
green open datahub_usage_event-000003                               f0JLy-xx 5 1   0  0     2kb     1kb
green open datahub_usage_event-000002                               4cPlC1-x 5 1   0  0     2kb     1kb
green open datahub_usage_event-000001                               xxxxxxxx 5 1   0  0     2kb     1kb
```

I think the default for datahub_usage_event is 5 for some reason.

The default is 1

https://github.com/datahub-project/datahub/blob/4b87156fde8e428bddd6701501351a53578df2d7/docker/elasticsearch-setup/create-indices.sh#L8

Can you run /_cat/indices?h=index,pri and share the output?

```
dices?h=index,pri
chart_chartusagestatisticsaspect_v1                      1
datajobindex_v2                                          1
dataflowindex_v2                                         1
mlmodelgroupindex_v2                                     1
assertionindex_v2                                        1
roleindex_v2                                             1
dataprocessindex_v2                                      1
.opendistro-reports-definitions                          1
globalsettingsindex_v2                                   1
.opendistro_security                                     1
.opendistro-reports-instances                            1
chartindex_v2                                            1
tagindex_v2                                              1
.opensearch-observability                                1
dataplatforminstanceindex_v2                             1
telemetryindex_v2                                        1
datajob_datahubingestionrunsummaryaspect_v1              1
dataplatformindex_v2                                     1
dataproductindex_v2                                      1
dataprocessinstanceindex_v2                              1
invitetokenindex_v2                                      1
graph_service_v1                                         1
system_metadata_service_v1                               1
dataset_operationaspect_v1                               1
containerindex_v2                                        1
.tasks                                                   1
schemafieldindex_v2                                      1
domainindex_v2                                           1
notebookindex_v2                                         1
datahubupgradeindex_v2                                   1
datahubroleindex_v2                                      1
glossarytermindex_v2                                     1
postindex_v2                                             1
dataset_datasetusagestatisticsaspect_v1                  1
datahubexecutionrequestindex_v2                          1
dataset_datasetprofileaspect_v1                          1
datahubsecretindex_v2                                    1
mlmodelindex_v2                                          1
datahubpolicyindex_v2                                    1
corpuserindex_v2                                         1
datahubstepstateindex_v2                                 1
datahubviewindex_v2                                      1
queryindex_v2                                            1
mlmodeldeploymentindex_v2                                1
datajob_datahubingestioncheckpointaspect_v1              1
dashboardindex_v2                                        1
assertion_assertionruneventaspect_v1                     1
datasetindex_v2                                          1
mlfeatureindex_v2                                        1
dashboard_dashboardusagestatisticsaspect_v1              1
datahub_usage_event-000004                               5
datahub_usage_event-000003                               5
datahub_usage_event-000005                               5
glossarynodeindex_v2                                     1
datahubingestionsourceindex_v2                           1
datahubretentionindex_v2                                 1
ownershiptypeindex_v2                                    1
dataprocessinstance_dataprocessinstanceruneventaspect_v1 1
.kibana_1                                                1
datahub_usage_event-000002                               5
datahubaccesstokenindex_v2                               1
datahub_usage_event-000001                               5
testindex_v2                                             1
mlfeaturetableindex_v2                                   1
.opendistro-job-scheduler-lock                           5
mlprimarykeyindex_v2                                     1
corpgroupindex_v2                                        1
```
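
With a listing this long, the outliers are easy to miss. A small sketch, assuming the two-column `index pri` output above has been saved as text, that picks out the indices whose primary shard count differs from the expected value. The function name `indices_with_unexpected_shards` is hypothetical, and the sample contains only a few rows copied from the listing.

```python
def indices_with_unexpected_shards(cat_output, expected=1):
    """Map index name -> primary shard count for rows that differ from expected."""
    unexpected = {}
    for line in cat_output.splitlines():
        parts = line.split()
        # Skip blank lines, headers, and anything that is not "<index> <number>".
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        index, primaries = parts[0], int(parts[1])
        if primaries != expected:
            unexpected[index] = primaries
    return unexpected


if __name__ == "__main__":
    sample = """\
datasetindex_v2                                          1
datahub_usage_event-000004                               5
.opendistro-job-scheduler-lock                           5
corpgroupindex_v2                                        1
"""
    for name, primaries in indices_with_unexpected_shards(sample).items():
        print(name, primaries)
```

On the rows shown in the thread, only the datahub_usage_event-* indices and .opendistro-job-scheduler-lock report 5 primaries.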

So it looks like the number of shards wasn’t specified when this index was created, so it defaulted to 5. I’m guessing you were on Elasticsearch 6.x or lower.

> Previous versions of Elasticsearch defaulted to creating five shards per index. Starting with 7.0.0, the default is now one shard per index.
https://www.elastic.co/guide/en/elasticsearch/reference/7.0/breaking-changes-7.0.html#_index_creation_no_longer_defaults_to_five_shards

On the DataHub side, it doesn’t look like there’s an ES Index Rebuilder for datahub_usage_event.

Your options here are to either live with 5 shards or manually re-create the index with 1 shard. There’s currently no automation in DataHub to recreate or rebuild it.
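
For the manual route, one possible approach is to create a new index with 1 primary shard and copy the documents across with the `_reindex` API. The sketch below only builds the two request bodies and prints them; the target index name, the base URL, and the `send` helper are assumptions for illustration. It does not touch aliases, ILM policies, or index templates, so later rollover indices may still pick up the cluster default unless the template sets number_of_shards.

```python
import json
import urllib.request


def create_index_request(index, shards=1, replicas=1):
    """Request that creates an index with explicit shard and replica counts."""
    body = {
        "settings": {
            "index": {"number_of_shards": shards, "number_of_replicas": replicas}
        }
    }
    return "PUT", f"/{index}", body


def reindex_request(source, dest):
    """Request that copies all documents from source into dest."""
    body = {"source": {"index": source}, "dest": {"index": dest}}
    return "POST", "/_reindex", body


def send(base_url, method, path, body):
    """Send one JSON request to the cluster (no authentication handled here)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    source = "datahub_usage_event-000001"
    target = "datahub_usage_event-000001-1shard"  # hypothetical name
    # Print the planned requests; call send("http://localhost:9200", ...)
    # with your own URL once you have reviewed them.
    for method, path, body in (
        create_index_request(target),
        reindex_request(source, target),
    ):
        print(method, path, json.dumps(body))
```

Printing the requests first makes it possible to review them, or replay them with curl, before changing anything on a live cluster.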

I will manually recreate it. Thanks Davi