Getting Distinct Sample Values for Trino+Hive DataSource Profiling

Original Slack Thread

<@U06TM7M40RK> For Trino+Hive data source profiling, how do we get all the distinct sample values?

To get distinct sample values for a Trino+Hive data source in DataHub, you need to enable profiling and configure it to include sample values. The general approach is:

  1. Enable Profiling: Ensure that profiling is enabled in your configuration. This is necessary to collect any profiling data, including sample values.

  2. Include Sample Values: Set the include_field_sample_values option to true in your profiling configuration. This will ensure that sample values are collected for each column.

  3. Set Sample Values Limit: You can specify the number of sample values to collect using the field_sample_values_limit parameter. The default limit is 20, but you can adjust this as needed.

  4. Configuration Example:

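    source:
      type: trino
      config:
        host_port: localhost:5300
        database: dbname  # Trino catalog to ingest, e.g. the Hive-backed catalog
        profiling:
          enabled: true  # Step 1: turn profiling on
          include_field_sample_values: true  # Step 2: collect sample values for each column
          field_sample_values_limit: 50  # Step 3: adjust the limit as needed (default is 20)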
    
  5. Profiling Pattern: If you want to profile specific tables or schemas, use the profile_pattern to include or exclude specific assets.

  6. Check for Errors: Ensure there are no errors in the logs related to profiling, as these might prevent sample values from being collected.
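Step 5 can be sketched as a recipe fragment like the one below. This is a minimal illustration, not a verified configuration: the schema name sales and the tmp_ table prefix are hypothetical, and the sketch assumes profile_pattern sits next to profiling under config and that its regexes are matched against names of the form catalog.schema.table. Check the Trino source documentation for your DataHub version before relying on it.

    source:
      type: trino
      config:
        host_port: localhost:5300
        database: dbname
        # Assumption: patterns are regexes matched against catalog.schema.table
        profile_pattern:
          allow:
            - 'dbname\.sales\..*'  # hypothetical: profile only tables in the "sales" schema
          deny:
            - '.*\.tmp_.*'  # hypothetical: skip scratch tables prefixed with "tmp_"
        profiling:
          enabled: true
          include_field_sample_values: true

Tables that are excluded by profile_pattern can still be ingested as metadata; the pattern only narrows which tables are profiled, which keeps the sample-value queries from running against every table in the catalog.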

According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/trino|DataHub documentation>, profiling can be optionally enabled via configuration, and it supports collecting table, row, and column statistics, including sample values.

Would you like more detailed guidance on any specific part of this process?

<@U07FV3UHE4D> Is there any option to get distinct sample values for a profiled dataset?
