Original Slack Thread
<@U06TM7M40RK> for this elasticsearch ingest,
source:
  type: "elasticsearch"
  config:
    # Coordinates
    host: 'localhost:9200'

    # Credentials
    username: user  # optional
    password: pass  # optional

    # SSL support
    use_ssl: False
    verify_certs: False
    ca_certs: "./path/ca.cert"
    client_cert: "./path/client.cert"
    client_key: "./path/client.key"
    ssl_assert_hostname: False
    ssl_assert_fingerprint: "./path/cert.fingerprint"

    # Options
    url_prefix: ""  # optional url_prefix
    env: "PROD"
    index_pattern:
      allow: [".*some_index_name_pattern*"]
      deny: [".*skip_index_name_pattern*"]
    ingest_index_templates: False
    index_template_pattern:
      allow: [".*some_index_template_name_pattern*"]

sink:
  # sink configs
what is the url_prefix? is that the same as platform_instance? Does platform_instance exist in datahub version 1.0.0?
The url_prefix and platform_instance are distinct concepts in DataHub's configuration for Elasticsearch ingestion.
- url_prefix: This is used to specify a prefix for the URL when connecting to Elasticsearch. It is particularly useful when you have multiple Elasticsearch clusters managed under a single endpoint and need to route requests to different clusters. The url_prefix is not related to the concept of platform_instance. According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/elasticsearch#config-details|Elasticsearch source configuration>, url_prefix is an optional parameter that helps in routing requests.
- platform_instance: This is a separate configuration parameter used to distinguish between different instances of the same data platform within DataHub. It allows you to specify which instance of a platform the data belongs to, which is useful when you have multiple instances of a platform like Elasticsearch. The platform_instance is not mentioned in the provided configuration snippet, but it is a supported feature in DataHub. According to the <https://datahubproject.io/docs/0.13.1/platform-instances|Platform Instances documentation>, this feature is available in DataHub starting from v0.8.24.
Regarding DataHub version 1.0.0, the concept of platform_instance should still be supported, as it is a fundamental part of DataHub's metadata model for handling multiple instances of a platform. However, specific details about version 1.0.0 are not covered in the provided data sources, so I recommend checking the latest DataHub documentation for any updates or changes related to platform_instance.
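For illustration, here is a minimal sketch (not from the thread or the docs) of how the two options sit side by side in a programmatic source config; the host, prefix, and instance names are made-up placeholders:

# Sketch only: all values below are placeholders.
source_config = {
    "type": "elasticsearch",
    "config": {
        "host": "localhost:9200",
        # url_prefix: path prepended to every request, useful when several
        # clusters are exposed behind a single endpoint or proxy
        "url_prefix": "/es-cluster-a",
        # platform_instance: logical name that keeps this cluster's metadata
        # separate from other Elasticsearch instances inside DataHub
        "platform_instance": "es_cluster_a",
        "env": "PROD",
    },
}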
Would you like more detailed guidance on how to configure platform_instance or any other specific aspect?
<@U06TM7M40RK> is this a valid approach to ingest data into DataHub from an elasticsearch source
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": "c1a_customer_data*",  # Correct type
            "platform_instance": "c1_onguard_data_es",  # Optional folder-like grouping
            "env": "PROD"
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
The approach you've outlined for ingesting data from an Elasticsearch source into DataHub using the Pipeline class is generally valid, but there are a few details to ensure correctness based on the provided data sources:
- Index Pattern: The index_pattern should be specified as a dictionary with allow and deny keys, rather than a single string. According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/elasticsearch#config-details|Elasticsearch source configuration>, it should look like this:
  "index_pattern": {
      "allow": ["c1a_customer_data*"]
  }
- Platform Instance: The platform_instance is correctly used to specify a folder-like grouping for the Elasticsearch instance. This is optional but useful for distinguishing between different instances of the same platform.
- Sink Configuration: The sink configuration for datahub-rest is correct, assuming that the DataHub server is running at http://localhost:8080.
Here is the corrected version of your script:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": {
                "allow": ["c1a_customer_data*"]
            },
            "platform_instance": "c1_onguard_data_es",
            "env": "PROD"
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
This script sets up a pipeline to ingest metadata from an Elasticsearch source and send it to a DataHub instance using the REST API. Ensure that the Elasticsearch and DataHub servers are running and accessible at the specified URLs.
Would you like more detailed information on any specific part of this process?
<@U06TM7M40RK> the code is missing the authentication
To include authentication in your Elasticsearch ingestion pipeline, you need to add the necessary credentials to the source configuration. Based on the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/elasticsearch#config-details|Elasticsearch source configuration>, you can specify the username and password for basic authentication. Here's how you can modify your code to include these credentials:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": {
                "allow": ["c1a_customer_data*"]
            },
            "platform_instance": "c1_onguard_data_es",
            "env": "PROD",
            "username": "your_username",  # Add your Elasticsearch username here
            "password": "your_password"   # Add your Elasticsearch password here
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
Make sure to replace "your_username" and "your_password" with your actual Elasticsearch credentials. This will enable basic authentication for your Elasticsearch source.
If you are using a more secure method of authentication, such as API keys, you would need to adjust the configuration accordingly. However, the provided data sources do not mention API key authentication for Elasticsearch, so basic authentication is the recommended approach here.
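As a side note, if you would rather not hard-code credentials in the script, one option is to read them from environment variables before building the pipeline config. A minimal sketch, assuming hypothetical ES_USERNAME and ES_PASSWORD variables are exported in your shell:

import os

# Hypothetical variable names; export ES_USERNAME / ES_PASSWORD in your shell first.
es_username = os.environ["ES_USERNAME"]
es_password = os.environ["ES_PASSWORD"]

source_config = {
    "type": "elasticsearch",
    "config": {
        "host": "http://localhost:9200",
        "index_pattern": {"allow": ["c1a_customer_data*"]},
        "platform_instance": "c1_onguard_data_es",
        "env": "PROD",
        "username": es_username,  # read from the environment instead of hard-coded
        "password": es_password,
    },
}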
Would you like more detailed guidance on any specific authentication method?
<@U06TM7M40RK> i want to also deny
- 'partial-*'
- '^\..*'
To include additional patterns in the deny list for your Elasticsearch ingestion configuration, you can modify the index_pattern section to include these patterns. The deny list is used to specify regex patterns for indexes that should be excluded from ingestion. Here's how you can update your configuration:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": {
                "allow": ["c1a_customer_data*"],
                "deny": [
                    "partial-*",  # Deny indexes starting with 'partial-'
                    r"^\..*"      # Deny indexes starting with a dot (.)
                ]
            },
            "platform_instance": "c1_onguard_data_es",
            "env": "PROD",
            "username": "your_username",  # Add your Elasticsearch username here
            "password": "your_password"   # Add your Elasticsearch password here
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
In this configuration, the deny list now includes two patterns:
- "partial-*": This pattern will exclude any index whose name starts with "partial-".
- "^\..*": This pattern will exclude any index whose name starts with a dot (.), which is common for system or hidden indexes in Elasticsearch.
This setup will ensure that any indexes matching these patterns are not ingested into DataHub.
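If you want to sanity-check the patterns before running the pipeline, you can try them against a few index names with Python's re module. This is only a local sketch, assuming the patterns are applied as regular expressions matched from the start of the index name:

import re

# Quick local check of the deny patterns, assuming they are applied as Python
# regexes matched from the start of the index name.
deny_patterns = ["partial-*", r"^\..*"]

for index_name in ["partial-2024.01.01", "partial", ".kibana_1", "c1a_customer_data_v1"]:
    denied = any(re.match(p, index_name) for p in deny_patterns)
    print(f"{index_name!r}: denied={denied}")

Note that as a regex, "partial-*" means "partial" followed by zero or more dashes, so it also matches an index named just "partial"; if you want to exclude only names that literally start with "partial-", a pattern like "^partial-.*" is more precise.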
Would you like more detailed information on regex patterns or any other aspect of this configuration?