Original Slack Thread
<@U06TM7M40RK> for this elasticsearch ingest,
source:
  type: "elasticsearch"
  config:
    # Coordinates
    host: 'localhost:9200'

    # Credentials
    username: user  # optional
    password: pass  # optional

    # SSL support
    use_ssl: False
    verify_certs: False
    ca_certs: "./path/ca.cert"
    client_cert: "./path/client.cert"
    client_key: "./path/client.key"
    ssl_assert_hostname: False
    ssl_assert_fingerprint: "./path/cert.fingerprint"

    # Options
    url_prefix: ""  # optional url_prefix
    env: "PROD"
    index_pattern:
      allow: [".*some_index_name_pattern*"]
      deny: [".*skip_index_name_pattern*"]
    ingest_index_templates: False
    index_template_pattern:
      allow: [".*some_index_template_name_pattern*"]

sink:
  # sink configs
what is the url_prefix? is that the same as platform_instance? Does platform_instance exist in datahub version 1.0.0?
The url_prefix and platform_instance are distinct concepts in DataHub's configuration for Elasticsearch ingestion.
- url_prefix: This is used to specify a prefix for the URL when connecting to Elasticsearch. It is particularly useful when you have multiple Elasticsearch clusters managed under a single endpoint and need to route requests to different clusters. The url_prefix is not related to the concept of platform_instance. According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/elasticsearch#config-details|Elasticsearch source configuration>, url_prefix is an optional parameter that helps in routing requests.
- platform_instance: This is a separate configuration parameter used to distinguish between different instances of the same data platform within DataHub. It allows you to specify which instance of a platform the data belongs to, which is useful when you have multiple instances of a platform like Elasticsearch. The platform_instance is not mentioned in the provided configuration snippet, but it is a supported feature in DataHub. According to the <https://datahubproject.io/docs/0.13.1/platform-instances|Platform Instances documentation>, this feature is available in DataHub starting from v0.8.24.
Regarding DataHub version 1.0.0, the concept of platform_instance should still be supported, as it is a fundamental part of DataHub's metadata model for handling multiple instances of a platform. However, specific details about version 1.0.0 are not covered in the provided data sources, so I recommend checking the latest DataHub documentation for any updates or changes related to platform_instance.
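For illustration, here is a minimal sketch (not from the thread or the docs) of how the two options sit side by side in a programmatic source config; the host, prefix, and instance names are made-up placeholders:

# Sketch only: all values below are placeholders.
source_config = {
    "type": "elasticsearch",
    "config": {
        "host": "localhost:9200",
        # url_prefix: path prepended to every request, useful when several
        # clusters are exposed behind a single endpoint or proxy
        "url_prefix": "/es-cluster-a",
        # platform_instance: logical name that keeps this cluster's metadata
        # separate from other Elasticsearch instances inside DataHub
        "platform_instance": "es_cluster_a",
        "env": "PROD",
    },
}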
Would you like more detailed guidance on how to configure platform_instance or any other specific aspect?
<@U06TM7M40RK> is this a valid approach to ingest data into DataHub from an elasticsearch source
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": "c1a_customer_data*",  # Correct type
            "platform_instance": "c1_onguard_data_es",  # Optional folder-like grouping
            "env": "PROD"
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
The approach you've outlined for ingesting data from an Elasticsearch source into DataHub using the Pipeline class is generally valid, but there are a few details to ensure correctness based on the provided data sources:
- Index Pattern: The index_pattern should be specified as a dictionary with allow and deny keys, rather than a single string. According to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/elasticsearch#config-details|Elasticsearch source configuration>, it should look like this:
  "index_pattern": {
      "allow": ["c1a_customer_data*"]
  }
- Platform Instance: The platform_instance is correctly used to specify a folder-like grouping for the Elasticsearch instance. This is optional but useful for distinguishing between different instances of the same platform.
- Sink Configuration: The sink configuration for datahub-rest is correct, assuming that the DataHub server is running at http://localhost:8080.
Here is the corrected version of your script:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": {
                "allow": ["c1a_customer_data*"]
            },
            "platform_instance": "c1_onguard_data_es",
            "env": "PROD"
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
This script sets up a pipeline to ingest metadata from an Elasticsearch source and send it to a DataHub instance using the REST API. Ensure that the Elasticsearch and DataHub servers are running and accessible at the specified URLs.
Would you like more detailed information on any specific part of this process?
<@U06TM7M40RK> the code is missing the authentication
To include authentication in your Elasticsearch ingestion pipeline, you need to add the necessary credentials to the source configuration. Based on the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/elasticsearch#config-details|Elasticsearch source configuration>, you can specify the username and password for basic authentication. Here's how you can modify your code to include these credentials:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": {
                "allow": ["c1a_customer_data*"]
            },
            "platform_instance": "c1_onguard_data_es",
            "env": "PROD",
            "username": "your_username",  # Add your Elasticsearch username here
            "password": "your_password"   # Add your Elasticsearch password here
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
Make sure to replace "your_username" and "your_password" with your actual Elasticsearch credentials. This will enable basic authentication for your Elasticsearch source.
If you are using a more secure method of authentication, such as API keys, you would need to adjust the configuration accordingly. However, the provided data sources do not mention API key authentication for Elasticsearch, so basic authentication is the recommended approach here.
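As a side note, if you would rather not hard-code credentials in the script, one option is to read them from environment variables before building the pipeline config. A minimal sketch, assuming hypothetical ES_USERNAME and ES_PASSWORD variables are exported in your shell:

import os

# Hypothetical variable names; export ES_USERNAME / ES_PASSWORD in your shell first.
es_username = os.environ["ES_USERNAME"]
es_password = os.environ["ES_PASSWORD"]

source_config = {
    "type": "elasticsearch",
    "config": {
        "host": "http://localhost:9200",
        "index_pattern": {"allow": ["c1a_customer_data*"]},
        "platform_instance": "c1_onguard_data_es",
        "env": "PROD",
        "username": es_username,  # read from the environment instead of hard-coded
        "password": es_password,
    },
}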
Would you like more detailed guidance on any specific authentication method?
<@U06TM7M40RK> i want to also deny
- 'partial-*'
- '^\..*'
To include additional patterns in the deny list for your Elasticsearch ingestion configuration, you can modify the index_pattern section to include these patterns. The deny list is used to specify regex patterns for indexes that should be excluded from ingestion. Here's how you can update your configuration:
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create({
    "run_id": "elasticsearch-ingest",
    "source": {
        "type": "elasticsearch",
        "config": {
            "host": "http://localhost:9200",
            "index_pattern": {
                "allow": ["c1a_customer_data*"],
                "deny": [
                    "partial-*",  # Deny indexes starting with 'partial-'
                    r"^\..*"      # Deny indexes starting with a dot (.)
                ]
            },
            "platform_instance": "c1_onguard_data_es",
            "env": "PROD",
            "username": "your_username",  # Add your Elasticsearch username here
            "password": "your_password"   # Add your Elasticsearch password here
        }
    },
    "sink": {
        "type": "datahub-rest",
        "config": {
            "server": "http://localhost:8080"
        }
    }
})
pipeline.run()
pipeline.raise_from_status()
In this configuration, the deny list now includes two patterns:
- "partial-*": This pattern will exclude any index whose name starts with "partial-".
- "^\..*": This pattern will exclude any index whose name starts with a dot (.), which is common for system or hidden indexes in Elasticsearch.
This setup will ensure that any indexes matching these patterns are not ingested into DataHub.
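If you want to sanity-check the patterns before running the pipeline, you can try them against a few index names with Python's re module. This is only a local sketch, assuming the patterns are applied as regular expressions matched from the start of the index name:

import re

# Quick local check of the deny patterns, assuming they are applied as Python
# regexes matched from the start of the index name.
deny_patterns = ["partial-*", r"^\..*"]

for index_name in ["partial-2024.01.01", "partial", ".kibana_1", "c1a_customer_data_v1"]:
    denied = any(re.match(p, index_name) for p in deny_patterns)
    print(f"{index_name!r}: denied={denied}")

Note that as a regex, "partial-*" means "partial" followed by zero or more dashes, so it also matches an index named just "partial"; if you want to exclude only names that literally start with "partial-", a pattern like "^partial-.*" is more precise.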
Would you like more detailed information on regex patterns or any other aspect of this configuration?