Configuring Profiling and Stateful Ingestion Features in DataHub

Original Slack Thread

<@U06TM7M40RK> Can you please confirm whether the Stats tab is enabled for every CLI ingestion, or is it conditional on a flag?

The Stats tab in DataHub is conditional; it is not automatically enabled for every CLI ingestion. According to the <https://datahubproject.io/docs/0.13.1/features/dataset-usage-and-query-history|DataHub documentation>, the Stats tab is enabled when profiling is configured and runs successfully during ingestion. Specifically, profiling (referred to as data_profiling or profiling) must be enabled in the ingestion recipe for the Stats tab to appear.

For example, in a <Slack conversation>, it was mentioned that the Stats tab appears only if data_profiling is enabled. Similarly, another <Slack thread> highlighted that profiling needs to be enabled in the recipe to see stats.

Would you like more detailed steps on how to configure profiling in your ingestion recipe?

<@U06TM7M40RK> waiting for your response.

To enable the Stats tab in DataHub, it is necessary to configure profiling in your ingestion recipe. The Stats tab is not automatically enabled for every CLI ingestion; it depends on whether profiling is set up and executed successfully.

  1. Profiling Configuration: Ensure that profiling is enabled in your ingestion recipe by setting enabled to true inside the profiling block. For example, in a <Slack discussion>, it was mentioned that profiling needs to be enabled to see stats.

  2. Conditional Flags: There are specific flags and configurations that need to be set for profiling to work. For instance, the profile_table_level_only flag can be used to control whether profiling is done at the table level or includes column-level profiling as well.

  3. Data Source Support: Not all data sources support profiling in the same way. For example, <https://datahubproject.io/docs/0.13.1/features/dataset-usage-and-query-history|DataHub documentation> mentions that some sources may require additional configuration or may not support certain profiling features.
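As a minimal sketch of the second point, assuming a source that accepts the standard profiling block, the fragment below limits profiling to table-level statistics and skips column-level statistics. Placing the block under source.config is an assumption about a typical recipe layout, not something stated in this thread.

```yaml
# Fragment of a recipe's source.config section (assumed layout).
profiling:
    enabled: true                     # profiling must be on for any statistics to be collected
    profile_table_level_only: true    # table-level statistics only; no per-column statistics
```

Setting profile_table_level_only back to false brings column-level profiling back, which is usually the more expensive part of a profiling run.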

Here is an example configuration snippet for enabling profiling:

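# Typically nested under source.config in the recipe.
profiling:
    enabled: true                        # turn profiling on
    profile_table_level_only: false      # false: also profile individual columns
    include_field_distinct_count: true   # distinct value count per column
    include_field_min_value: true        # minimum value per column
    include_field_max_value: true        # maximum value per column
    include_field_mean_value: true       # mean of numeric columns
    include_field_median_value: true     # median of numeric columns
    include_field_stddev_value: true     # standard deviation of numeric columns
    include_field_histogram: true        # value histogram per column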
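
To show where a profiling block sits in a complete recipe, here is a hedged sketch. It assumes a Postgres source and a DataHub REST sink running locally; the host, database, user, password variable, and server URL are placeholders, not values from this thread.

```yaml
# Sketch of a full recipe; every connection value below is a placeholder.
source:
    type: postgres
    config:
        host_port: "localhost:5432"
        database: example_db
        username: example_user
        password: "${POSTGRES_PASSWORD}"    # expanded from an environment variable
        profiling:
            enabled: true                   # without this, no profiling statistics are collected
            profile_table_level_only: false
sink:
    type: datahub-rest
    config:
        server: "http://localhost:8080"
```

A recipe like this is run with datahub ingest -c recipe.yml, where the file name is arbitrary.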

Would you like more detailed steps on setting up profiling for a specific data source or any other aspect?

<@U06TM7M40RK> Is stateful_ingestion also part of profiling?

Stateful ingestion and profiling are related but distinct features in DataHub. They serve different purposes and are configured separately in the ingestion process.

  1. Stateful Ingestion: This feature tracks ingestion state across runs. It lets DataHub compare the current run with the previous one and automatically soft-delete entities that were present before but are missing now. This is particularly useful for keeping the catalog in step with the source as tables or other entities are removed. According to the <https://datahubproject.io/docs/0.13.1/metadata-ingestion/docs/dev_guides/stateful|Stateful Ingestion documentation>, stateful ingestion is an explicit opt-in feature and is not enabled by default.

  2. Profiling: Profiling, on the other hand, is focused on collecting statistics and metadata about the data itself, such as row counts, column statistics, and data distributions. Profiling is used to populate the Stats tab in DataHub and provides insights into the data’s characteristics. Profiling is configured separately from stateful ingestion and requires enabling specific profiling parameters in the ingestion recipe.

While both features can be used together in an ingestion pipeline, they address different aspects of data management. Stateful ingestion deals with metadata state management, while profiling deals with data statistics and insights.
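
The separation can be seen in a recipe that uses both features. The sketch below assumes a Postgres source and a REST sink; the pipeline name and all connection values are hypothetical. The two blocks sit side by side under source.config, and either one can be removed without affecting the other.

```yaml
# Sketch only: pipeline name and connection values are placeholders.
pipeline_name: example_postgres_pipeline    # stateful ingestion needs a stable pipeline name
source:
    type: postgres
    config:
        host_port: "localhost:5432"
        database: example_db
        username: example_user
        password: "${POSTGRES_PASSWORD}"
        stateful_ingestion:
            enabled: true                   # opt in; tracks entities seen across runs
            remove_stale_metadata: true     # soft-delete entities missing from the current run
        profiling:
            enabled: true                   # independent setting; controls profiling statistics
sink:
    type: datahub-rest
    config:
        server: "http://localhost:8080"
```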

Would you like more detailed information on how to configure either of these features?

<@U06TM7M40RK> I want to keep the Stats tab disabled, but it keeps getting enabled even without the profiling parameter. Please guide me on how I can disable it.

It seems likely that the Stats tab in DataHub is being enabled due to profiling being inadvertently activated in your configuration. To keep the Stats tab disabled, you need to ensure that profiling is not enabled in your ingestion recipe.

Here are some steps you can take to disable the Stats tab:

  1. Check Profiling Configuration: Make sure that enabled is set to false inside the profiling block, or that the profiling block is omitted from your ingestion recipe altogether. According to a <Slack discussion>, the Stats tab appears only if data_profiling is enabled.

  2. Review Ingestion Recipe: Double-check your ingestion recipe to ensure that there are no profiling-related parameters inadvertently set to true. This includes parameters like profile_table_level_only and any other profiling-specific settings.

  3. Disable Profiling in Data Source Configuration: If you are using a specific data source, ensure that its configuration does not have profiling enabled by default. Some data sources might have profiling settings that need to be explicitly disabled.

  4. Remove Profiling-Related Code: If you have any custom code or scripts that might be enabling profiling, ensure that they are not being executed during the ingestion process.

If you have already checked these settings and the Stats tab is still being enabled, it might be helpful to review the ingestion logs for any indications of profiling being executed.
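
As a sketch of the first two steps, the recipe below turns profiling off explicitly instead of relying on the default. It assumes a Postgres source and a REST sink, and all connection values are placeholders.

```yaml
# Sketch only: connection values are placeholders.
source:
    type: postgres
    config:
        host_port: "localhost:5432"
        database: example_db
        username: example_user
        password: "${POSTGRES_PASSWORD}"
        profiling:
            enabled: false    # explicit opt-out; no profiling run is requested
sink:
    type: datahub-rest
    config:
        server: "http://localhost:8080"
```

This only affects future runs. Whether statistics ingested earlier keep the tab visible is not covered in this thread. The documentation linked earlier covers dataset usage, so it may also be worth checking whether usage statistics are being ingested; that is an unconfirmed possibility, not a verified cause.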

Would you like more detailed steps on how to review your configuration or any other aspect?

Sources:

  • <Slack discussion>
