<@U06TM7M40RK> Is there a way to get the cardinality of sample values in a profiled dataset?
DataHub version: v0.14.0.2
Yes, DataHub provides a way to get the cardinality of sample values in a profiled dataset through its profiling capabilities. According to the <https://datahubproject.io/docs/0.14.0/generated/metamodel/entities/dataset#datasetprofile-timeseries|DataHub documentation>, profiling can capture various statistics for each column in a dataset, including unique count and unique proportion, which relate to the cardinality of the data.
To access these profiling statistics, you need to ensure that profiling is enabled in your ingestion configuration. The profiling configuration allows you to specify whether to include field distinct counts and other statistics. You can enable these options in the profiling configuration as shown in the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/bigquery#config-details|BigQuery profiling configuration>.
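As a rough illustration of the profiling configuration described above, here is a minimal sketch that builds an ingestion recipe as a plain Python dict. The option names `enabled`, `include_field_distinct_count`, and `include_field_sample_values` are my understanding of the SQL-style profiling options and should be checked against the config details for your source and version; the project ID and server URL are placeholders, not values from this thread.

```python
import json


def build_profiling_recipe(project_id: str, gms_server: str) -> dict:
    """Return an ingestion recipe (as a plain dict) with column profiling turned on."""
    return {
        "source": {
            "type": "bigquery",
            "config": {
                # Placeholder project; replace with your own.
                "project_ids": [project_id],
                "profiling": {
                    # Assumed option names; verify them in the source's config details.
                    "enabled": True,
                    "include_field_distinct_count": True,
                    "include_field_sample_values": True,
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


# Placeholder values for illustration only.
recipe = build_profiling_recipe("my-project", "http://localhost:8080")
print(json.dumps(recipe, indent=2))
```

The same structure can be written as a YAML recipe file; the dict form is used here only so the example is easy to inspect.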
Once profiling is enabled and configured, you can query the profiling statistics using DataHub’s GraphQL API. The <https://github.com/datahub-project/datahub/blob/master/datahub-web-react/src/graphql/dataset.graphql|GraphQL query> for dataset profiles includes fields such as `uniqueCount` and `uniqueProportion`, which provide information about the cardinality of the sample values.
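To make the GraphQL step concrete, here is a minimal sketch using only the Python standard library. It assumes the latest profile is exposed as `datasetProfiles(limit: 1)` with a `fieldProfiles` list, which is how I read the linked query file; the endpoint URL, token, and dataset URN are placeholders you would supply, and the helper names are made up for this example.

```python
import json
import urllib.request

# Query shape assumed from the linked dataset.graphql file; verify it against your instance.
PROFILE_QUERY = """
query datasetProfile($urn: String!) {
  dataset(urn: $urn) {
    datasetProfiles(limit: 1) {
      timestampMillis
      rowCount
      fieldProfiles {
        fieldPath
        uniqueCount
        uniqueProportion
      }
    }
  }
}
"""


def build_request(graphql_url: str, token: str, dataset_urn: str) -> urllib.request.Request:
    """Build the POST request carrying the profile query and the dataset URN."""
    body = json.dumps(
        {"query": PROFILE_QUERY, "variables": {"urn": dataset_urn}}
    ).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {token}",
    }
    return urllib.request.Request(graphql_url, data=body, headers=headers, method="POST")


def extract_cardinality(response_json: dict) -> dict:
    """Map each column path to (uniqueCount, uniqueProportion) from the latest profile."""
    dataset = (response_json.get("data") or {}).get("dataset") or {}
    profiles = dataset.get("datasetProfiles") or []
    if not profiles:
        return {}  # The dataset has not been profiled, or the URN did not match.
    field_profiles = profiles[0].get("fieldProfiles") or []
    return {
        fp["fieldPath"]: (fp.get("uniqueCount"), fp.get("uniqueProportion"))
        for fp in field_profiles
    }


def fetch_cardinality(graphql_url: str, token: str, dataset_urn: str) -> dict:
    """Send the query and return the per-column cardinality statistics."""
    request = build_request(graphql_url, token, dataset_urn)
    with urllib.request.urlopen(request) as response:
        return extract_cardinality(json.load(response))
```

Calling `fetch_cardinality(url, token, urn)` with your instance's GraphQL endpoint, a personal access token, and a dataset URN should return a dict keyed by column path. A column with no profiled distinct count shows up with `None` values.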
Would you like more detailed guidance on how to configure profiling or query these statistics using GraphQL?
<@U06TM7M40RK> You mentioned "unique count and unique proportion, which relate to the cardinality of the data." Which profiling parameter should I use to get these values?
To capture the unique count and unique proportion during profiling, set the corresponding fields in the profiling section of your ingestion source configuration. In the Glue source, for example, these are:
`unique_count`: the name of the parameter that holds the count of unique values in a column.
`unique_proportion`: the name of the parameter that holds the proportion of unique values in a column, relative to the total number of values.
Both are defined on the `GlueProfilingConfig` object, as shown in the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/glue#config-details|Glue profiling configuration documentation>. Other sources may expose different profiling options, so check the config details for your specific source.
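For the Glue case, a minimal sketch of how those two fields might sit in a recipe is shown below, again as a plain Python dict. The values `stats_unique_count` and `stats_unique_proportion` are hypothetical names for parameters stored in Glue, and the region is a placeholder; only `unique_count` and `unique_proportion` come from the documentation referenced above.

```python
import json


def build_glue_profiling_config(unique_count_param: str, unique_proportion_param: str) -> dict:
    """Point the two profiling fields at the parameter names that hold each statistic."""
    return {
        "unique_count": unique_count_param,
        "unique_proportion": unique_proportion_param,
    }


glue_recipe = {
    "source": {
        "type": "glue",
        "config": {
            "aws_region": "us-east-1",  # Placeholder region.
            # Hypothetical parameter names; use the ones your Glue tables actually carry.
            "profiling": build_glue_profiling_config(
                "stats_unique_count", "stats_unique_proportion"
            ),
        },
    },
}
print(json.dumps(glue_recipe, indent=2))
```

The point of the sketch is that the values are lookup keys, not switches: each one names where the statistic is stored, so the statistic has to exist under that name for it to be picked up.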
If you need more detailed guidance on how to set these parameters in your specific ingestion source configuration, please let me know.