<@U06TM7M40RK> For trino+hive datasource profiling how do we get all the distinct sample values ?
To get distinct sample values for a Trino+Hive data source in DataHub, you need to enable profiling and configure it to include sample values. Here’s a general approach based on the available data sources:
- Enable Profiling: Ensure that profiling is enabled in your configuration. This is necessary to collect any profiling data, including sample values.
- Include Sample Values: Set the `include_field_sample_values` option to `true` in your profiling configuration. This will ensure that sample values are collected for each column.
- Set Sample Values Limit: You can specify the number of sample values to collect using the `field_sample_values_limit` parameter. The default limit is 20, but you can adjust this as needed.
- Configuration Example:

    source:
      type: trino
      config:
        host_port: localhost:5300
        database: dbname
        profiling:
          enabled: true
          include_field_sample_values: true
          field_sample_values_limit: 50  # Adjust the limit as needed

- Profiling Pattern: If you want to profile specific tables or schemas, use the `profile_pattern` option to include or exclude specific assets.
- Check for Errors: Ensure there are no errors in the logs related to profiling, as these might prevent sample values from being collected.
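For anyone who builds recipes programmatically, the configuration above can be sketched as a plain Python dict. This is a minimal sketch and not an official helper: the function name `build_trino_profiling_recipe`, its parameters, and the `dbname\.sales\..*` allow pattern are hypothetical, and placing `profile_pattern` next to `profiling` (with an `allow` list of regexes) is an assumption to check against the Trino source docs for your DataHub version.

```python
from typing import Any, Dict, List, Optional


def build_trino_profiling_recipe(
    host_port: str,
    database: str,
    sample_limit: int = 20,
    allow_patterns: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a recipe dict with the same keys as the YAML example (hypothetical helper)."""
    if sample_limit < 1:
        raise ValueError("sample_limit must be a positive integer")

    config: Dict[str, Any] = {
        "host_port": host_port,
        "database": database,
        "profiling": {
            "enabled": True,
            "include_field_sample_values": True,
            "field_sample_values_limit": sample_limit,
        },
    }
    if allow_patterns:
        # Assumption: profile_pattern is a sibling of `profiling`, not nested in it.
        config["profile_pattern"] = {"allow": list(allow_patterns)}
    return {"source": {"type": "trino", "config": config}}


# Same values as the YAML example, plus a hypothetical allow pattern.
recipe = build_trino_profiling_recipe(
    "localhost:5300",
    "dbname",
    sample_limit=50,
    allow_patterns=[r"dbname\.sales\..*"],
)
```

The dict only covers the `source` section. To actually run it you would still need to add a `sink` and pass the result to the ingestion framework (for example `Pipeline.create` in `datahub.ingestion.run.pipeline`), or dump it to YAML and use the `datahub ingest` CLI.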
According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/trino|DataHub documentation>, profiling can be optionally enabled via configuration, and it supports collecting table, row, and column statistics, including sample values.
Would you like more detailed guidance on any specific part of this process?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/ge_profiling_config.py|DataHub Profiling Configuration>
- <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/trino|DataHub documentation>
Hint: Mention @runllm in the thread for followups.
<@U07FV3UHE4D> Any option to get distinct sample values for a profiled dataset ?
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look
Hope this helps ! ><