Troubleshooting Elasticsearch Data Stream Ingestion for DataHub

Original Slack Thread

Hi, I’m trying to add Elasticsearch as an ingestion source, but it seems it can only fetch regular indices and can’t see any index that belongs to a data stream. Is there any chance Elasticsearch data streams could be supported? Thanks!

<@U01GZEETMEZ> Any idea on this?

Thanks <@U04QRNY4ZHA> ~
I’ve allowed all indices with an index pattern like ["^.*"], but it can still only see the indices that don’t belong to any Elasticsearch data stream.

I’m not 100% sure. Our code definitely has handling for data streams (https://github.com/datahub-project/datahub/blob/c38bb91519e0005f6e8a0a2d648bd1c77fac97bb/metadata-ingestion/src/datahub/ingestion/source/elastic_search.py#L416). In the ingestion report, does it show anything as “dropped”?

<@U01GZEETMEZ> thanks for your response!
I’ve set the pattern
index_pattern:
  allow: ["^\\.*"]
but the filtered list is empty too.
By the way, I’m using DataHub version 0.11.0.5.

My data stream is named logstash.
I’ve also set
index_pattern:
  allow: ["logstash"]

It still doesn’t show any index when I run the ingest command
datahub ingest -c elasticsearch.dhub.yaml

My elasticsearch version is 7.17.0.
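
To make the two allow patterns above concrete, here is a minimal sketch of how a list of regexes behaves against index names. It assumes the patterns are applied with `re.match`, so they are anchored at the start of the name. The helper `is_allowed` and the sample names are hypothetical and are not taken from DataHub's code.

```python
import re
from typing import Iterable


def is_allowed(name: str, allow: Iterable[str]) -> bool:
    """Return True if any allow pattern matches from the start of the name."""
    return any(re.match(pattern, name) is not None for pattern in allow)


# Hypothetical sample names: a plain index, the data stream name,
# and a name in the style of a data stream backing index.
names = ["orders", "logstash", ".ds-logstash-2023.10.27-000014"]

for name in names:
    print(
        f"{name!r}: "
        f"logstash={is_allowed(name, ['logstash'])}, "
        f"catch-all={is_allowed(name, ['^.*'])}"
    )
```

Under that assumption, `logstash` accepts the name `logstash` but not a name that starts with `.ds-`, while `^.*` accepts every name. A filter like this can only act on names that the listing call returns in the first place.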

A few hypotheses for what’s happening here

  1. a permissions issue is preventing you from seeing those indexes
  2. for some reason (possibly pagination?), the call to indices.get_alias() is simply not returning the target index (https://github.com/datahub-project/datahub/blob/c38bb91519e0005f6e8a0a2d648bd1c77fac97bb/metadata-ingestion/src/datahub/ingestion/source/elastic_search.py#L368)
  3. there’s a bug in our logic that silently drops that index

Could you run with datahub --debug ingest ... to see if we ever see that index name in the first place?

Hi <@U01GZEETMEZ>
We haven’t set any permissions on this ELK stack, since it’s only used internally.
After I ran the ingest command, the debug log still only showed the other indices, not the ones in the data stream.

My index config looks like the following. Is there anything else I forgot to configure?
> # Options
> url_prefix: "" # optional url_prefix
> env: "PROD"
> index_pattern:
>   allow: ["^.*"]

After testing in a Python shell,
it seems the elasticsearch Python library can’t get the indices in a data stream…
> root@datahub-test-by-walter:~/.datahub# python
> Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from elasticsearch import Elasticsearch
> >>> client = Elasticsearch("XXXXXXXX", use_ssl=False)
> >>> client.indices.get_alias("*")
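
One way to cross-check a listing like this is to ask Elasticsearch for its data streams directly and read the backing index names from that response. The sketch below assumes a payload in the shape of a `GET /_data_stream` response (a `data_streams` list whose entries carry `name` and `indices[].index_name`). The helper `backing_indices` is hypothetical, and the sample payload is trimmed and illustrative rather than real output from this cluster.

```python
from typing import Any, Dict, List


def backing_indices(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each data stream name to the names of its backing indices."""
    return {
        stream["name"]: [entry["index_name"] for entry in stream.get("indices", [])]
        for stream in response.get("data_streams", [])
    }


# Trimmed, illustrative payload in the shape of a GET /_data_stream response.
sample_response = {
    "data_streams": [
        {
            "name": "logstash",
            "indices": [
                {"index_name": ".ds-logstash-2023.10.27-000014"},
                {"index_name": ".ds-logstash-2023.10.27-000015"},
            ],
        }
    ]
}

print(backing_indices(sample_response))

# Against a live cluster, the input might come from the client, for example:
#   backing_indices(client.indices.get_data_stream())
```

Comparing these names with the keys returned by `get_alias` would show whether the backing indices are missing from the listing or are being filtered out later.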

OK, it seems the indices in my data stream don’t have aliases, so it’s not DataHub’s problem.
After testing, I found that data stream backing indices don’t support aliases, so I tried giving my data stream logstash the alias logs and updated the index_pattern to allow: ["logs"].
I got an error during ingestion:

  File "/usr/local/lib/python3.10/dist-packages/datahub/ingestion/source/elastic_search.py", line 373, in get_workunits_internal
    for mcp in self._extract_mcps(index, is_index=True):
  File "/usr/local/lib/python3.10/dist-packages/datahub/ingestion/source/elastic_search.py", line 414, in _extract_mcps
    raw_index_metadata = raw_index[index]
KeyError: 'logstash'
It seems this is because the indices in my data stream logstash are named like .ds-logstash-2023.10.27-000014 or .ds-logstash-2023.10.27-000015,
so the index metadata would be:
{
  ".ds-logstash-2023.10.27-000014": {…},
  ".ds-logstash-2023.10.27-000015": {…}
}
So raw_index_metadata = raw_index[index] can’t find a matching key if I set the index_pattern to allow: ["logs"] (it ends up running something like raw_index_metadata = raw_index["logstash"]).
https://github.com/datahub-project/datahub/blob/c38bb91519e0005f6e8a0a2d648bd1c77fac97bb/metadata-ingestion/src/datahub/ingestion/source/elastic_search.py#L414
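
The mismatch described above can be reproduced with plain dictionaries. The sketch below shows one possible lookup that tolerates a requested name that is not a key in the response. The helper `resolve_index_metadata` is hypothetical, the metadata values are placeholders, and this is not DataHub's implementation or a confirmed fix.

```python
from typing import Any, Dict, List, Tuple


def resolve_index_metadata(
    raw_index: Dict[str, Any], requested: str
) -> List[Tuple[str, Any]]:
    """Return (concrete_index_name, metadata) pairs for a requested name.

    If the requested name is itself a key, return only that entry. Otherwise
    assume the name was a data stream or alias that resolved to other
    concrete indices, and return every entry in the response.
    """
    if requested in raw_index:
        return [(requested, raw_index[requested])]
    return list(raw_index.items())


# Placeholder response keyed by backing index names (metadata elided).
raw_index = {
    ".ds-logstash-2023.10.27-000014": {"aliases": {}},
    ".ds-logstash-2023.10.27-000015": {"aliases": {}},
}

# A direct lookup by the data stream name fails, as in the traceback.
try:
    raw_index["logstash"]
except KeyError as err:
    print("KeyError:", err)

# The tolerant lookup yields the backing indices instead.
for name, _metadata in resolve_index_metadata(raw_index, "logstash"):
    print(name)
```

Whether returning every entry is the right behaviour depends on how the caller deduplicates data streams, so treat this as a starting point for discussion only.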

Any ideas on how I can get the DataHub CLI to fetch the indices in a data stream?

Many thanks for your help ><

Honestly, I think I’m a bit out of my depth here given my understanding of Elasticsearch. If you can figure out a tweak to DataHub’s usage of the Elasticsearch SDKs to make it work, we’d definitely want to collaborate on that.

It’s my pleasure! I’ll take some time to figure out how to improve the SDK usage.