Hi <@U07FV3UHE4D>
I noticed that stateful ingestion only avoids redundant fetches for usage data (Snowflake), while non-deleted metadata is reingested every time. Is this expected behavior, and are there plans to improve change detection for non-usage metadata?
Hey there! 
<@U06TM7M40RK> Can you help with the questions in this thread?
Please continue to tag RunLLM (e.g. <@U06TM7M40RK>) to have it continue generating responses!
Stateful ingestion in DataHub is designed to manage metadata by comparing the current state with the previous state to identify and soft-delete entities that are no longer present. This is particularly useful for handling stale metadata. However, as you observed, stateful ingestion currently focuses on avoiding redundant fetches primarily for usage data, such as Snowflake usage statistics, by leveraging features like redundant run elimination (<https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful#redundant-run-elimination|DataHub documentation>).
For non-usage metadata, the current behavior is to reingest metadata in each run, which means that even unchanged metadata is processed again. This is expected behavior as of now. The stateful ingestion feature does not yet include a mechanism for detecting changes in non-usage metadata to avoid reingestion of unchanged data.
As for future improvements, there is no specific mention in the provided data sources about plans to enhance change detection for non-usage metadata. However, the DataHub community is active, and features are continuously being developed and improved. It might be beneficial to keep an eye on the <https://github.com/datahub-project/datahub|DataHub GitHub repository> for any updates or feature requests related to this.
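For reference, the redundant-run elimination mentioned above is enabled through the same `stateful_ingestion` block on a usage source. Here is a minimal sketch; the source type and connection field names are assumptions based on the documented Snowflake usage connector, so check your connector's docs for the exact names:

```yaml
# Hedged sketch: enabling redundant-run elimination for a usage source.
source:
  type: "snowflake-usage"
  config:
    username: <user_name>
    password: <password>
    host_port: <host_port>
    stateful_ingestion:
      enabled: True  # lets the source skip runs whose time window was
                     # already covered by a previous successful run
pipeline_name: "my_snowflake_usage_pipeline_1"  # ties saved state to this pipeline
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```

Note that `pipeline_name` is required for any stateful feature, since the saved checkpoint is keyed by it.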
Would you like more details on how stateful ingestion works or any other specific aspect of DataHub’s ingestion capabilities?
Sources:
- <https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful#redundant-run-elimination|DataHub documentation>
Hint: Mention @runllm in the thread for followups.
<@U06TM7M40RK>
I would like details on how stateful ingestion works
Stateful ingestion in DataHub is a feature that allows the platform to manage metadata more efficiently by keeping track of the state of metadata between ingestion runs. This feature is particularly useful for handling scenarios where metadata might be deleted or changed, and it helps in maintaining an accurate representation of the data ecosystem within DataHub. Here’s a detailed breakdown of how stateful ingestion works:
- Checkpointing Mechanism: According to the <https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>, stateful ingestion uses a checkpointing mechanism to save the state of metadata from each ingestion run. This state is then used in subsequent runs to compare against the current state of metadata.
- Stale Entity Removal: One of the primary use cases of stateful ingestion is to automatically soft-delete entities (such as tables and views) that were present in a previous run but are missing in the current run. This is known as stale entity removal. The feature is supported for all SQL-based sources (<https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>).
- Configuration: To enable stateful ingestion, you need to configure it in your ingestion recipe. This involves setting the `stateful_ingestion.enabled` parameter to `True` and optionally setting `remove_stale_metadata` to `True` to enable automatic soft-deletion of stale entities. The `pipeline_name` is also crucial, as it ties the state to a specific ingestion pipeline (<https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>).
- Redundant Run Elimination: For usage data, such as Snowflake usage statistics, stateful ingestion can avoid redundant fetches by skipping runs that would fetch the same data as previous runs. This is achieved by storing timestamps of the last successful ingestion and using them to determine whether a new run is necessary (<https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful#redundant-run-elimination|Stateful Ingestion documentation>).
- State Provider: The ingestion state provider is responsible for saving and retrieving the checkpoint state. The default state provider is implemented on top of the `datahub_api` client and uses the timeseries aspect capabilities of the DataHub backend (<https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>).
- Limitations: Currently, stateful ingestion does not include change detection for non-usage metadata, meaning that unchanged metadata is reingested in each run. This is an area where future improvements could enhance efficiency.
Here’s a sample configuration for enabling stateful ingestion:
```yaml
source:
  type: "snowflake"
  config:
    username: <user_name>
    password: <password>
    host_port: <host_port>
    warehouse: <ware_house>
    role: <role>
    include_tables: True
    include_views: True
    stateful_ingestion:
      enabled: True
      remove_stale_metadata: True
pipeline_name: "my_snowflake_pipeline_1"
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
```
Would you like more information on any specific aspect of stateful ingestion or its configuration?
Sources:
- <https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>
- <https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful#redundant-run-elimination|Stateful Ingestion documentation>
<@U06TM7M40RK>
irrespective of stateful ingestion, all metadata must be getting fetched from the source again, and query parsing must be getting done again.
Yes, you are correct. Regardless of whether stateful ingestion is enabled, the process of fetching metadata from the source and parsing queries is typically repeated in each ingestion run. Here’s a detailed explanation:
- Metadata Fetching: During each ingestion run, DataHub connectors fetch metadata from the source systems. This includes information about databases, tables, columns, and other relevant metadata. Fetching is necessary to ensure that DataHub has the most up-to-date view of the data ecosystem.
- Query Parsing: If the ingestion involves parsing queries (e.g., for lineage extraction), this parsing is performed anew in each run, because the queries themselves might have changed and parsing is required to extract the latest lineage information.
- Stateful Ingestion’s Role: While stateful ingestion helps manage metadata by keeping track of changes between runs, it does not eliminate the need to fetch and parse metadata. Instead, it focuses on identifying and handling stale metadata: entities that were present in a previous run but are missing in the current run (<https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>).
- Redundant Fetch Avoidance: For specific use cases like usage data (e.g., Snowflake usage statistics), stateful ingestion can avoid redundant fetches by skipping runs that would fetch the same data as previous runs. However, this does not apply to the general metadata fetching and parsing process (<https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful#redundant-run-elimination|Stateful Ingestion documentation>).
In summary, while stateful ingestion optimizes certain aspects of metadata management, the fundamental process of fetching and parsing metadata is repeated in each run to ensure accuracy and completeness. This is necessary to capture any changes or updates in the source systems.
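One practical way to reduce the cost of that repeated fetching and parsing is to narrow what each run ingests using the allow/deny pattern filters that SQL-based sources support. A hedged sketch (the field names are assumptions based on the Snowflake connector's documented `schema_pattern`/`table_pattern` options, and the regexes are illustrative):

```yaml
source:
  type: "snowflake"
  config:
    # ...connection settings as in the earlier recipe...
    # Restrict the run to the schemas and tables you actually need,
    # so less metadata is fetched and fewer queries are parsed.
    schema_pattern:
      allow:
        - "analytics.*"
    table_pattern:
      deny:
        - ".*\\.tmp_.*"
```

Splitting large sources into several narrowly scoped recipes (each with its own `pipeline_name`) is a common way to keep individual runs fast.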
Would you like more information on how to optimize ingestion processes or any other specific aspect of DataHub’s capabilities?
Sources:
- <https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>
- <https://datahubproject.io/docs/0.14.0/metadata-ingestion/docs/dev_guides/stateful#redundant-run-elimination|Stateful Ingestion documentation>
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><