Subject: Making the ignore_above Property Configurable in the /api/graphql Search API
Hi Team,
We are currently facing an issue with the /api/graphql endpoint in DataHub where string properties exceeding 256 characters are not searchable. Upon investigation, we noticed that this behavior stems from the default Elasticsearch mapping, which sets "ignore_above": 256 for keyword fields.
We tested updating this value to 1024, and as expected, it allowed search queries to match strings beyond the 256-character limit.
What is the recommended way to make the ignore_above value configurable, so that we can avoid modifying DataHub’s core schema definitions directly?
For context, we are actively using the latest DataHub releases in our project, and this limitation is impacting our ability to index and search long string fields effectively.
Looking forward to your guidance on the best approach to make this parameter configurable in a sustainable and upgrade-safe manner.
<@U03MF8MU5P0> <@U02TYQ4SPPD> <@U0121TRV0FL> <@U07FV3UHE4D> <@U06CMSR2S1E> <@U06TM7M40RK>
It seems likely that the issue you’re encountering with the /api/graphql endpoint in DataHub, where string properties exceeding 256 characters are not searchable, is due to the default Elasticsearch mapping setting "ignore_above": 256 for keyword fields. This setting limits the length of strings that can be indexed as keywords, which affects searchability for longer strings.
To address this, you can consider the following approach:
- Custom Elasticsearch Mapping: It seems possible to customize the Elasticsearch mapping to adjust the ignore_above value. According to <Search | DataHub documentation on search customization>, you can define custom analyzers and mappings in Elasticsearch. This would involve creating a custom mapping for the fields you want to adjust and specifying a higher ignore_above value (a diagnostic sketch follows after this list).
- Search Configuration: You might also explore the possibility of using a search configuration YAML file to customize search behavior. As per the <Search | DataHub documentation>, you can enable search query and ranking customization using environment variables and configuration files. This approach allows you to extend or replace the default Elasticsearch-based search functionality.
- Environment Variables: Check if there are any environment variables related to Elasticsearch configuration that can be adjusted to change the ignore_above setting. The <https://github.com/datahub-project/datahub/blob/master/docs/deploy/environment-vars.md|DataHub environment variables documentation> might provide insights into configurable parameters.
- Consult DataHub Community: If the above approaches do not fully address your needs, consider reaching out to the DataHub community or support channels for further guidance. They might have additional insights or recommendations for making this parameter configurable in a sustainable manner.
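If it helps to see where the limit currently applies, here is a minimal diagnostic sketch, assuming a local Elasticsearch on port 9200 and a placeholder index name (both are assumptions, not values from this thread), that walks an index’s mapping and prints the ignore_above of every keyword field:
```
import requests

ES_URL = "http://localhost:9200"  # assumed local Elasticsearch
INDEX = "your_index"              # hypothetical placeholder index name

# Fetch the current mapping for the index.
mapping = requests.get(f"{ES_URL}/{INDEX}/_mapping").json()

def report_ignore_above(properties, path=""):
    """Recursively print the ignore_above value of every keyword field."""
    for name, spec in properties.items():
        field_path = f"{path}.{name}" if path else name
        if spec.get("type") == "keyword":
            print(field_path, "->", spec.get("ignore_above", "ES default"))
        # Recurse into object sub-properties and multi-fields.
        for child_key in ("properties", "fields"):
            if child_key in spec:
                report_ignore_above(spec[child_key], field_path)

for index_name, body in mapping.items():
    report_ignore_above(body["mappings"].get("properties", {}))
```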
Would you like more detailed steps on any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/deploy/environment-vars.md|DataHub environment variables documentation>
Hi <@U07353YLQP8>, how did my answer work out for you? Need any more assistance? You can also tag <@U07FV3UHE4D> to let your fellow community members know you’re still looking for guidance.
<@U06TM7M40RK> can you please tell me how to declare a normal text property in datahub?
To declare a normal text property in DataHub, you can use the concept of Structured Properties. These properties allow you to add custom, validated metadata to any entity type in DataHub. Here’s a step-by-step guide on how to create a text property:
- Define the Structured Property: You need to define a new structured property with the desired characteristics. This can be done via the DataHub UI or programmatically using the API.
- Property Definition: When defining a structured property, you specify the property type as Text. You can also set other attributes like allowed values, description, and the asset types it applies to.
- Use the DataHub UI: Navigate to Govern > Structured Properties in the DataHub UI. Click + Create to start defining your property. Provide a name, description, and select Text as the property type.
- Programmatic Approach: If you prefer to define the property programmatically, you can use the DataHub API. Here’s an example using Python:
```
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import StructuredPropertyDefinitionClass
from datahub.metadata.urns import StructuredPropertyUrn

# Create a REST emitter pointed at your DataHub GMS instance
rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Define a free-text structured property
text_property_urn = StructuredPropertyUrn("io.acryl.openTextProperty")
text_property_definition = StructuredPropertyDefinitionClass(
    qualifiedName="io.acryl.openTextProperty",
    displayName="Open Text Property",
    valueType="urn:li:dataType:datahub.string",
    cardinality="SINGLE",
    entityTypes=[
        "urn:li:entityType:datahub.dataset",
        "urn:li:entityType:datahub.container",
    ],
    description="This structured property allows a single open-ended response as a value",
    immutable=False,
)

# Emit the property definition to DataHub
event_prop = MetadataChangeProposalWrapper(
    entityUrn=str(text_property_urn),
    aspect=text_property_definition,
)
rest_emitter.emit(event_prop)
```
- Add to Assets: Once defined, you can add the structured property to relevant assets, such as datasets or dashboards, to capture the desired metadata (a sketch follows below).
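For illustration, here is a minimal sketch of that last step, assuming the property defined above; the dataset URN is a hypothetical placeholder, not something from this thread:
```
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    StructuredPropertiesClass,
    StructuredPropertyValueAssignmentClass,
)

rest_emitter = DatahubRestEmitter(gms_server="http://localhost:8080")

# Hypothetical dataset URN; replace with a real asset in your instance.
dataset_urn = "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)"

# Assign a value for the property defined above to the dataset.
mcp = MetadataChangeProposalWrapper(
    entityUrn=dataset_urn,
    aspect=StructuredPropertiesClass(
        properties=[
            StructuredPropertyValueAssignmentClass(
                propertyUrn="urn:li:structuredProperty:io.acryl.openTextProperty",
                values=["An open-ended text value"],
            )
        ]
    ),
)
rest_emitter.emit(mcp)
```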
For more detailed instructions, you can refer to the <https://github.com/datahub-project/datahub/blob/master/docs/features/feature-guides/properties/create-a-property.md|DataHub documentation on creating structured properties>.
Would you like more details on any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/docs/features/feature-guides/properties/create-a-property.md|Create and Add a DataHub Structured Property>
<@U06TM7M40RK> why are object properties in datahub assigned “ignore_above”: 256 by default?
It seems likely that the default setting of "ignore_above": 256 for object properties in DataHub is related to the way Elasticsearch handles keyword fields. This setting is used to limit the length of strings that can be indexed as keywords. When a string exceeds this length, it is not indexed as a keyword, which can help prevent performance issues and excessive resource usage in Elasticsearch.
Here are some reasons why this might be the default setting:
- Performance Optimization: Indexing very long strings as keywords can lead to performance degradation in Elasticsearch. By setting a limit, DataHub ensures that only reasonably sized strings are indexed as keywords, which helps maintain search performance.
- Resource Management: Elasticsearch has limitations on the size of terms that can be indexed. The ignore_above setting helps manage resource usage by preventing excessively long strings from being indexed, which could otherwise consume a significant amount of memory and storage.
- Common Practice: Using ignore_above is a common practice in Elasticsearch configurations to handle keyword fields. It provides a balance between indexing useful data and avoiding unnecessary resource consumption.
If you need to index longer strings as keywords, you can customize the Elasticsearch mapping to increase the ignore_above value. This can be done by defining a custom mapping for the fields you want to adjust.
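To make the behavior concrete, here is a small self-contained sketch, assuming a local Elasticsearch on port 9200 and a throwaway index name, showing that a value longer than ignore_above is still stored in _source but cannot be matched by a term query:
```
import requests

ES = "http://localhost:9200"
INDEX = "ignore-above-demo"  # throwaway index, hypothetical name

# Recreate an index whose keyword field drops values longer than 256 chars.
requests.delete(f"{ES}/{INDEX}")  # a 404 here is fine on first run
requests.put(
    f"{ES}/{INDEX}",
    json={
        "mappings": {
            "properties": {"value": {"type": "keyword", "ignore_above": 256}}
        }
    },
)

short_value, long_value = "x" * 100, "y" * 300
requests.post(f"{ES}/{INDEX}/_doc?refresh=true", json={"value": short_value})
requests.post(f"{ES}/{INDEX}/_doc?refresh=true", json={"value": long_value})

def hits(term):
    resp = requests.post(
        f"{ES}/{INDEX}/_search", json={"query": {"term": {"value": term}}}
    )
    return resp.json()["hits"]["total"]["value"]

print(hits(short_value))  # 1 -- within the limit, indexed as a keyword
print(hits(long_value))   # 0 -- kept in _source but never indexed, so unsearchable
```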
Would you like more detailed steps on how to customize this setting?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/tests/unit/test_elasticsearch_source.py|DataHub Elasticsearch Mapping>
<@U06TM7M40RK> If you need to index longer strings as keywords, you can customize the Elasticsearch mapping to increase the ignore_above value. This can be done by defining a custom mapping for the fields you want to adjust.
Would you like more detailed steps on how to customize this setting?
how can i make this configurable for object properties in datahub?
It seems likely that making the ignore_above value configurable for object properties in DataHub involves customizing the Elasticsearch mapping. Here’s a general approach to achieve this:
- Understand the Current Mapping: First, you need to understand how the current Elasticsearch mapping is set up for object properties. You can retrieve the current mapping using the Elasticsearch API.
- Modify the Mapping: You can modify the Elasticsearch mapping to adjust the ignore_above value for specific fields. This involves creating a custom mapping that overrides the default settings.
- Use a Custom Analyzer: If you need more control over how fields are indexed, consider using a custom analyzer. This can help manage how text is tokenized and indexed.
- Update DataHub Configuration: Ensure that your DataHub configuration is set up to use the custom mapping. This might involve updating configuration files or environment variables.
- Reindex Data: After updating the mapping, you may need to reindex your data to apply the changes.
Here’s a step-by-step guide to customize the ignore_above setting:
Step-by-Step Guide
- Retrieve Current Mapping: Use the following command to get the current mapping:
```
curl -X GET "localhost:9200/your_index/_mapping"
```
- Create a Custom Mapping: Define a custom mapping with the desired ignore_above value. Here’s an example JSON configuration:
```
{
  "mappings": {
    "properties": {
      "your_field": {
        "type": "keyword",
        "ignore_above": 1024
      }
    }
  }
}
```
- Apply the Custom Mapping: Use the Elasticsearch API to apply the custom mapping:
```
curl -X PUT "localhost:9200/your_index/_mapping" -H 'Content-Type: application/json' -d'
{
  "properties": {
    "your_field": {
      "type": "keyword",
      "ignore_above": 1024
    }
  }
}
'
```
- Update DataHub Configuration: Ensure that DataHub is configured to use the updated mapping. This might involve setting environment variables or updating configuration files.
- Reindex Data: Reindex your data to apply the new mapping settings. This can be done using the Elasticsearch reindex API (a minimal sketch follows below).
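As a sketch of that last step, assuming a local Elasticsearch and hypothetical index names (the _reindex API is standard Elasticsearch, not DataHub-specific; DataHub also ships its own restore/reindex tooling, which may be preferable):
```
import requests

ES = "http://localhost:9200"

# Copy documents from the old index into one created with the new mapping.
# "your_index" and "your_index_reindexed" are hypothetical placeholders.
resp = requests.post(
    f"{ES}/_reindex?wait_for_completion=true",
    json={
        "source": {"index": "your_index"},
        "dest": {"index": "your_index_reindexed"},
    },
)
resp.raise_for_status()
print(resp.json().get("total"), "documents reindexed")
```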
Would you like more detailed steps on any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/issues/11070>
<@U06TM7M40RK> can this be done without directly hitting Elasticsearch with curl requests?
The DataHub documentation I have ingested does not mention how to make the ignore_above property configurable for object properties without directly using curl requests to Elasticsearch. You might want to explore whether there are any configuration options within DataHub’s settings or through its API that allow for such customizations indirectly. If this doesn’t help, please reach out to the DataHub support team for further assistance.
<@U06TM7M40RK> where are the index templates defined for datahub?
It seems likely that the index templates for DataHub are defined within the Elasticsearch setup scripts and configurations used by DataHub. Here are some insights based on the available data sources:
- Elasticsearch Setup Scripts: The index templates are typically set up during the Elasticsearch initialization process. According to the <https://github.com/datahub-project/datahub/blob/master/docker/elasticsearch-setup/create-indices.sh|DataHub GitHub repository>, there is a script named create-indices.sh that is responsible for creating indices and possibly setting up templates.
- Elasticsearch Configuration Files: The configuration for Elasticsearch, including index templates, might be specified in configuration files or scripts that are part of the DataHub deployment. These configurations can be customized to define how indices and templates are created.
- DataHub Helm Charts: If you are deploying DataHub using Helm charts, the Elasticsearch configurations, including index templates, might be specified in the Helm values files. You can check the <https://github.com/acryldata/datahub-helm|DataHub Helm repository> for more details.
- Custom Templates: If you need to define custom index templates, you can do so by creating a JSON configuration for the template and applying it using the Elasticsearch API (see the sketch after this list). This allows you to specify mappings, settings, and other configurations for your indices.
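For illustration, here is a minimal sketch of that last option; the template name, index pattern, and field are hypothetical placeholders, and _index_template is the standard Elasticsearch composable-template API:
```
import requests

ES = "http://localhost:9200"

# Register a composable index template; names and fields are placeholders.
resp = requests.put(
    f"{ES}/_index_template/my_custom_template",
    json={
        "index_patterns": ["my_custom_index*"],
        "priority": 500,
        "template": {
            "settings": {"number_of_shards": 1, "number_of_replicas": 1},
            "mappings": {
                "properties": {
                    "my_field": {"type": "keyword", "ignore_above": 1024}
                }
            },
        },
    },
)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```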
Would you like more detailed steps on how to locate or customize these templates?
Sources:
- <https://datahubspace.slack.com/archives/CV2KB471C/p1693825495.444719>
- <https://github.com/datahub-project/datahub/issues/6924>
<@U06TM7M40RK> why is the below file used?
"index_patterns": ["*PREFIXdatahub_usage_event*"],
"data_stream": { },
"priority": 500,
"template": {
"mappings": {
"properties": {
"@timestamp": {
"type": "date"
},
"type": {
"type": "keyword"
},
"timestamp": {
"type": "date"
},
"userAgent": {
"type": "keyword"
},
"browserId": {
"type": "keyword"
}
}
},
"settings": {
"index.lifecycle.name": "PREFIXdatahub_usage_event_policy",
"index.number_of_shards": DUE_SHARDS,
"index.number_of_replicas": DUE_REPLICAS
}
}
}```
It seems likely that the JSON configuration you provided is used to define an index template for Elasticsearch in the context of DataHub. Here’s a breakdown of its purpose and components:
- Index Patterns: The "index_patterns": ["*PREFIXdatahub_usage_event*"] specifies that this template applies to any index whose name matches the pattern. The PREFIX is likely a placeholder for a specific prefix used in your deployment, allowing you to customize the index names.
- Data Stream: The "data_stream": { } indicates that this template is intended for use with data streams. Data streams are a way to manage time-series data in Elasticsearch, and they automatically handle index creation and rollover.
- Priority: The "priority": 500 sets the priority of this template. When multiple templates match an index, the one with the highest priority is applied.
- Mappings: The "mappings" section defines the structure of the documents in the index. It specifies the data types for various fields, such as @timestamp, type, timestamp, userAgent, and browserId. This ensures that the data is indexed correctly and can be queried efficiently.
- Settings: The "settings" section includes configurations for the index, such as the index lifecycle management policy ("index.lifecycle.name": "PREFIXdatahub_usage_event_policy") and the number of shards and replicas ("index.number_of_shards": DUE_SHARDS, "index.number_of_replicas": DUE_REPLICAS). These settings control how the index is managed and distributed across the Elasticsearch cluster.
Overall, this template is used to ensure that indices related to datahub_usage_event are created with the correct structure and settings, facilitating efficient storage and retrieval of usage event data in DataHub.
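To show how a template file with placeholders like these might be registered, here is a minimal sketch; the file path, the substituted values, and the substitution step itself are assumptions for illustration (the actual create-indices.sh performs a similar substitution in shell):
```
import requests

ES = "http://localhost:9200"

# Read the template file and substitute deployment placeholders.
# The path and values below are assumptions for illustration.
with open("index_template.json") as f:
    template = f.read()
template = (
    template.replace("PREFIX", "")   # optional index-name prefix
    .replace("DUE_SHARDS", "1")      # number of primary shards
    .replace("DUE_REPLICAS", "1")    # number of replicas
)

# Register the template; future indices/data streams matching
# "index_patterns" will pick up these mappings and settings.
resp = requests.put(
    f"{ES}/_index_template/datahub_usage_event_index_template",
    headers={"Content-Type": "application/json"},
    data=template,
)
resp.raise_for_status()
```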
Would you like more detailed information on any specific part of this configuration?
Sources:
- <https://github.com/datahub-project/datahub/issues/6924>
- <https://github.com/datahub-project/datahub/pull/5502>
<@U06TM7M40RK> can we create an index on a property for field:object with a keyword property on a specific field that is already indexed?
The DataHub documentation I have ingested does not mention how to create an index on a property for a field of type object with a keyword property on a specific field that is already indexed. You might want to provide additional context on the specific use case or explore whether there are existing configurations or customizations in your Elasticsearch setup that can be leveraged. If this doesn’t help, please reach out to the DataHub support team for further assistance.