I was having issues with usage stats in BigQuery earlier but found out the someone was excluding the data access logs in the GCP log router. I’ve since resolved that and can see usage being ingested.
Now I do see “Top Users” under the Stats tab of my assets. However, I do not see any stats around “Monthly Queries”. Is there something I specifically need to add in my config or do I just need to wait until DataHub ingests a months worth of usage logs?
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)
<@U06TM7M40RK> Can you help with the questions in this thread?
We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a or so other folks in the Community know if you’re still looking for help on the topic!
To address your issue with “Monthly Queries” stats not appearing, there are a few points to consider:
Configuration: Ensure that your ingestion recipe is correctly configured to capture all necessary usage statistics. Based on the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/bigquery|BigQuery Ingestion Configuration>, you should have include_usage_statistics set to true and other relevant usage parameters configured.
Time Frame: As you mentioned, DataHub might need to ingest a month’s worth of usage logs to display “Monthly Queries” stats. This is because the statistics are aggregated over time, and without sufficient data, the monthly aggregation might not be visible.
Logs and Permissions: Ensure that the GCP log router is correctly configured to include data access logs, as these are crucial for capturing usage statistics. Missing logs can result in incomplete data.
Profiling and Usage Settings: Double-check your profiling and usage settings in the ingestion recipe. Here is an example configuration that includes all necessary settings:
Debugging: Enable debug logging to capture more detailed information about the ingestion process. This can help identify any issues with the ingestion of usage statistics.
The include_top_n_queries setting is specifically for capturing the top N queries executed on your datasets. While it is useful for detailed query analysis, it is not strictly necessary for capturing overall usage statistics, including “Monthly Queries.”
To ensure that “Monthly Queries” stats are captured, you should focus on the following settings:
include_usage_statistics: This should be set to true to enable the ingestion of usage statistics.
usage: Ensure that the usage configuration is correctly set to capture the necessary usage data.
Here is a simplified version of the configuration focusing on the essential settings for usage statistics:
I’m not able to find any data sources to answer your question.
You might want to specify the exact DataHub version you are using and any specific configurations you have set up for usage statistics.
If this doesn’t help, please reach out to the DataHub support team for further assistance. 0 button 0 button Hint: Mention @runllm in the thread for followups.
To query for the number of monthly queries using GraphQL in DataHub, you can use the getLastMonthUsageAggregations query. This query retrieves usage statistics, including the total number of SQL queries executed in the last month.
Here is an example of how you can structure your GraphQL query to get the number of monthly queries for a specific dataset:
You need to replace $urn with the URN of the dataset you are interested in. This query will return the total number of SQL queries executed in the last month for the specified dataset.
Here is an example of how you might execute this query using a GraphQL client in Python:
To get the monthly usage statistics for an entire dataset, you can use the getLastMonthUsageAggregations query in GraphQL. This query retrieves usage statistics, including the total number of SQL queries executed in the last month, unique user count, and other relevant metrics.
Here is an example of how you can structure your GraphQL query to get the monthly usage for an entire dataset:
The discrepancy where you have a uniqueUserCount but totalSqlQueries is 0 can occur due to the way DataHub captures and processes usage statistics. This issue has been discussed in the community and can be attributed to several factors:
Read Events Without Query Events: It is possible to have read events (e.g., table scans, materialized view refreshes) that are not associated with explicit SQL query events. This can result in user counts being incremented without corresponding query counts. According to a <https://github.com/datahub-project/datahub/pull/7811|GitHub pull request>, DataHub can capture read events even if the corresponding query event is missing.
Log Configuration: Ensure that your GCP log router is configured to capture all relevant logs, including both data access logs and query logs. Missing logs can lead to incomplete data.
Ingestion Configuration: Double-check your ingestion configuration to ensure that it is set up to capture all necessary usage statistics. Here is an example configuration:
DataHub Version: Ensure you are using the latest version of DataHub, as there may have been bug fixes or improvements related to usage statistics. The issue you are experiencing might have been addressed in a newer release.
Debugging: Enable debug logging to capture more detailed information about the ingestion process. This can help identify any issues with the ingestion of usage statistics.
If you have verified all the above and still encounter the issue, it might be worth reaching out to the DataHub community or support team for further assistance.
Would you like more detailed steps or additional troubleshooting tips?
The include_read_operational_stats configuration parameter in DataHub is used to report read operational statistics. This is an experimental feature that, when enabled, captures and reports statistics related to read operations on your datasets.
According to the <BigQuery | DataHub Ingestion Configuration>, the include_read_operational_stats parameter is set to false by default. When set to true, it enables the ingestion of read operational statistics, which can provide insights into how datasets are being accessed and used.
Here is a summary of what include_read_operational_stats does:
Captures Read Operations: It captures statistics related to read operations, such as table scans and materialized view refreshes, which are not necessarily tied to explicit SQL queries.
Provides Additional Insights: By enabling this parameter, you can gain additional insights into the usage patterns of your datasets, including how often they are being read and by whom.
Experimental Feature: This feature is marked as experimental, which means it might be subject to changes and improvements in future releases of DataHub.
Here is an example of how to include this parameter in your ingestion configuration:
I have copied the config file and confirmed the logs exist in the project and the service account has the correct permissions. I still get “totalSqlQueries”: 0. What else can I try?
Given that your ingestion configuration appears correct and you have confirmed that the logs exist and the service account has the correct permissions, there are a few additional steps you can take to troubleshoot the issue:
Check Permissions: Ensure that the service account used for ingestion has the necessary permissions to access BigQuery usage logs. Missing permissions can prevent usage statistics from being captured.
Examine Logs: Review the ingestion logs for any errors or warnings related to usage statistics. The logs you provided show that no usage log entries were parsed ('num_usage_parsed_log_entries': {}), which suggests that the logs might not be accessible or are empty.
Enable Detailed Debug Logs: Increase the logging level to debug to capture more detailed information about the ingestion process. This can help identify where the issue might be occurring.
Check BigQuery Audit Logs: Verify that BigQuery audit logs are being generated and contain the expected usage information. If the audit logs are not being generated, DataHub cannot extract usage statistics.
Review Stateful Ingestion: Ensure that stateful ingestion is not causing issues. Sometimes, stateful ingestion settings can interfere with the proper capture of usage statistics.
Here is an updated version of your ingestion recipe with debug logging enabled: