Discrepancy in dataset count between database and Analytics tab for hive datasets

Original Slack Thread

Hi, I'm trying to get the percentage of assets with lineage by querying the database, but the total number of datasets I get from the database is very different from the number shown in the Analytics tab. The gap is especially large for Hive datasets, where the Analytics tab shows a much higher count than the database. If I understand correctly, DataHub calculates the number of datasets in https://github.com/datahub-project/datahub/blob/2f0616ea5b2c1927107a4726773c907a59a0483f/datahub-graphql-core/src/main/java/com/linkedin/datahub/graphql/analytics/resolver/GetHighlightsResolver.java#L154 using the datahub_usage_event index? In the database, I'm counting the rows with aspect datasetKey, excluding the URNs whose status is removed and taking only the latest version. Should my database query give exactly the number of datasets? Am I missing something? Is there a way to find out which ES query the GraphQL query used by the frontend resolves to? Thanks in advance.
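
For reference, the database-side count described above can be sketched roughly as follows. This is a minimal sketch, not DataHub's own logic. It assumes the aspect table is named `metadata_aspect_v2` with `urn`, `aspect`, `version`, and `metadata` columns, that version 0 holds the latest version of each aspect, and that the `status` aspect stores a JSON `removed` flag. It runs against an in-memory SQLite table with made-up URNs so that it is self-contained; the function name `count_active_datasets` is hypothetical.

```python
import json
import sqlite3

# Toy stand-in for the aspect table (table and column names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metadata_aspect_v2 "
    "(urn TEXT, aspect TEXT, version INTEGER, metadata TEXT)"
)

# Made-up rows: dataset b is soft-deleted, dataset c was removed once
# (older version 1) but its latest status (version 0) is not removed.
rows = [
    ("urn:li:dataset:a", "datasetKey", 0, "{}"),
    ("urn:li:dataset:b", "datasetKey", 0, "{}"),
    ("urn:li:dataset:b", "status", 0, json.dumps({"removed": True})),
    ("urn:li:dataset:c", "datasetKey", 0, "{}"),
    ("urn:li:dataset:c", "status", 1, json.dumps({"removed": True})),
    ("urn:li:dataset:c", "status", 0, json.dumps({"removed": False})),
]
conn.executemany("INSERT INTO metadata_aspect_v2 VALUES (?, ?, ?, ?)", rows)


def count_active_datasets(conn):
    """Count URNs with a latest datasetKey aspect that are not soft-deleted."""
    # URNs whose latest status aspect (assumed to be version 0) says removed.
    removed = set()
    status_rows = conn.execute(
        "SELECT urn, metadata FROM metadata_aspect_v2 "
        "WHERE aspect = 'status' AND version = 0"
    )
    for urn, metadata in status_rows:
        if json.loads(metadata).get("removed"):
            removed.add(urn)

    # URNs that have a latest datasetKey aspect.
    key_rows = conn.execute(
        "SELECT DISTINCT urn FROM metadata_aspect_v2 "
        "WHERE aspect = 'datasetKey' AND version = 0"
    )
    return sum(1 for (urn,) in key_rows if urn not in removed)


active_dataset_count = count_active_datasets(conn)
print(active_dataset_count)
```

Parsing the `status` JSON in Python rather than in SQL keeps the sketch independent of any database-specific JSON functions. Note that a count like this only covers URNs that actually have a `datasetKey` row in the database.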

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Which DataHub version are you using? (e.g. 0.12.0)
  2. Please post any relevant error logs on the thread!

I tried with both version 0.11.0 and version 0.12.1