Hi, I’m trying to add Elasticsearch as an ingestion source, but it seems it can only fetch regular indexes and can’t see any index inside a data stream. Is there any chance you would support Elasticsearch data streams? thx~
<@U01GZEETMEZ> Any idea on this?
Thanks <@U04QRNY4ZHA> ~
I’ve allowed all index patterns with ["^.*"], but it still only shows indexes that don’t belong to any Elasticsearch data stream.
I’m not 100% sure. Our code definitely has handling for data streams (https://github.com/datahub-project/datahub/blob/c38bb91519e0005f6e8a0a2d648bd1c77fac97bb/metadata-ingestion/src/datahub/ingestion/source/elastic_search.py#L416). In the ingestion report, does it show anything as “dropped”?
<@U01GZEETMEZ> thanks for your response!
I’ve set the pattern
index_pattern:
allow: ["^\\.*"]
but the filtered list still comes up empty.
btw, I’m using datahub version 0.11.0.5
My data stream is named logstash
I’ve also set
index_pattern:
allow: ["logstash"]
It still doesn’t show any index when I run the ingest command
datahub ingest -c elasticsearch.dhub.yaml
My elasticsearch version is 7.17.0.
A few hypotheses for what’s happening here:
- a permissions issue is preventing you from seeing those indexes
- for some reason (possibly pagination?), the call to indices.get_alias() simply isn’t returning the target index: https://github.com/datahub-project/datahub/blob/c38bb91519e0005f6e8a0a2d648bd1c77fac97bb/metadata-ingestion/src/datahub/ingestion/source/elastic_search.py#L368
- there’s a bug in our logic that silently drops that index
Could you run with datahub --debug ingest ...
to see if we ever see that index name in the first place?
Hi <@U01GZEETMEZ>
We haven’t set any permissions on this ELK stack since it’s only used internally.
After I ran the ingest command, the debug log still only showed indexes outside of the data stream.
My index config is as follows — is there anything else I forgot to configure?
> # Options
> url_prefix: "" # optional url_prefix
> env: "PROD"
> index_pattern:
>   allow: ["^.*"]
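For context on how these patterns behave: as far as I can tell, DataHub’s allow/deny entries are regular expressions matched against the start of each index name (via something like Python’s re.match — this is my understanding, not verified against the exact source). A quick sketch with hypothetical index names shows why "^.*" lets everything through, while a bare "logs" pattern would not match a data stream’s backing indices:

```python
import re

# Hypothetical index names mirroring the setup in this thread.
backing_index = ".ds-logstash-2023.10.27-000014"
plain_index = "myindex"

# Patterns are matched from the start of the name, so "^.*" allows all:
assert re.match("^.*", backing_index)
assert re.match("^.*", plain_index)

# A bare "logs" pattern only matches names that *start* with "logs";
# a backing index named ".ds-logstash-..." would be filtered out:
assert re.match("logs", "logstash")
assert re.match("logs", backing_index) is None

print("pattern checks passed")
```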
After testing in a Python shell,
it seems the elasticsearch Python lib can’t get the indexes in a data stream…
> root@datahub-test-by-walter:~/.datahub# python
> Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from elasticsearch import Elasticsearch
> >>> client = Elasticsearch("XXXXXXXX", use_ssl=False)
> >>> client.indices.get_alias("*")
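As an aside, the data stream API can enumerate backing indices directly instead of going through aliases. A sketch (not DataHub’s actual code) of parsing a GET _data_stream–style response; the payload shape below follows my reading of the Elasticsearch 7.x docs, and the sample data stream name is taken from this thread:

```python
# Resolve each data stream to its backing index names, given a response
# shaped like the Elasticsearch 7.x GET _data_stream API output.
def backing_indices(data_stream_response: dict) -> dict:
    """Map each data stream name to the list of its backing index names."""
    return {
        ds["name"]: [idx["index_name"] for idx in ds["indices"]]
        for ds in data_stream_response.get("data_streams", [])
    }

# Example payload mimicking this thread's "logstash" data stream:
sample = {
    "data_streams": [
        {
            "name": "logstash",
            "timestamp_field": {"name": "@timestamp"},
            "indices": [
                {"index_name": ".ds-logstash-2023.10.27-000014"},
                {"index_name": ".ds-logstash-2023.10.27-000015"},
            ],
        }
    ]
}

print(backing_indices(sample))
# Against a live cluster this response would come from something like
# client.indices.get_data_stream(name="*") in elasticsearch-py 7.x.
```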
OK, it seems the indices in my data stream don’t have aliases, so it’s not DataHub’s problem.
After testing, I found that data stream backing indices don’t support aliases, so I set an alias named logs on my data stream logstash and updated the index_pattern to allow: ["logs"].
Then I got an error during ingestion:
File "/usr/local/lib/python3.10/dist-packages/datahub/ingestion/source/elastic_search.py", line 373, in get_workunits_internal
    for mcp in self._extract_mcps(index, is_index=True):
File "/usr/local/lib/python3.10/dist-packages/datahub/ingestion/source/elastic_search.py", line 414, in _extract_mcps
    raw_index_metadata = raw_index[index]
KeyError: 'logstash'
It seems that because the indices in my data stream logstash are named like .ds-logstash-2023.10.27-000014 or .ds-logstash-2023.10.27-000015, the index metadata looks like:
{
  ".ds-logstash-2023.10.27-000014": {…},
  ".ds-logstash-2023.10.27-000015": {…},
  …
}
so raw_index_metadata = raw_index[index] can’t find any matching key when I set the index_pattern to allow: ["logs"] (it would effectively run raw_index_metadata = raw_index["logstash"]).
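To make the mismatch concrete, here is a minimal reproduction with hypothetical metadata (this is a sketch, not DataHub’s actual code or a proposed patch):

```python
# The name "logstash" passes the allow pattern, but the metadata dict
# returned by Elasticsearch is keyed by the *backing* index names, so a
# direct raw_index[index] lookup raises KeyError.
raw_index = {
    ".ds-logstash-2023.10.27-000014": {},
    ".ds-logstash-2023.10.27-000015": {},
}

index = "logstash"
try:
    raw_index_metadata = raw_index[index]  # mirrors the failing lookup
except KeyError as e:
    print(f"KeyError: {e}")

# One possible direction for a fix: iterate over the keys Elasticsearch
# actually returned instead of indexing by the requested name.
for backing_name, metadata in raw_index.items():
    print(backing_name)
```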
https://github.com/datahub-project/datahub/blob/c38bb91519e0005f6e8a0a2d648bd1c77fac97bb/metadata-ingestion/src/datahub/ingestion/source/elastic_search.py#L414
Any idea how I can get the datahub CLI to fetch those indices in a data stream~?
Many thanks for your help ><
Honestly, I think I’m a bit out of my depth given my understanding of ElasticSearch. If you can figure out a tweak to DataHub’s usage of the ElasticSearch SDKs to make it work, we’d definitely want to collaborate on that.
It’s my pleasure! I’ll take some time to figure out how to improve the SDK.