Accessing DataHub Analytics Data from MySQL and Elasticsearch on a Kubernetes Deployment

Original Slack Thread

How do I connect to the internal datahub analytics database (I think it’s MySQL) on a kubernetes deployment?

There is no database specific to analytics. If you are using the helm chart prerequisites, the MySQL instance is called prerequisites-mysql on the standard port 3306, and the credentials are in a Kubernetes secret called mysql-secrets (unless overridden). You can connect to the database with kubectl port forwarding: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
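
For example, a minimal sketch of the port-forward approach, assuming the default prerequisites release, the default mysql-secrets secret key, and a local mysql client (adjust the names to your deployment):

```
# Pull the MySQL root password from the Kubernetes secret created by the
# prerequisites chart (secret name and key assume the chart defaults)
MYSQL_PASSWORD=$(kubectl get secret mysql-secrets \
  -o jsonpath='{.data.mysql-root-password}' | base64 -d)

# Forward local port 3306 to the prerequisites-mysql service in the cluster
kubectl port-forward svc/prerequisites-mysql 3306:3306 &

# Connect with any MySQL client over the forwarded port
mysql -h 127.0.0.1 -P 3306 -u root -p"$MYSQL_PASSWORD"
```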

<@U03MF8MU5P0> thanks! We will try this out as we use helm. I presume there is only one database, and it looks to be called datahub? Is there some documentation on the database schema and table structure so we know what each schema contains?

There is one database and a single table. The schema DDL for MySQL, for example, is located in this file: https://github.com/datahub-project/datahub/blob/master/docker/mysql-setup/init.sql#L6
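
As a hedged example against that single table (metadata_aspect_v2, per the linked init.sql), using the forwarded connection from above, you could count stored aspect rows per entity type:

```
# Count aspect rows per entity type; urns look like urn:li:<entityType>:...,
# so the third colon-delimited token is the entity type
mysql -h 127.0.0.1 -P 3306 -u root -p"$MYSQL_PASSWORD" datahub \
  -e "SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(urn, ':', 3), ':', -1) AS entity_type,
             COUNT(*) AS aspect_rows
      FROM metadata_aspect_v2
      GROUP BY entity_type
      ORDER BY aspect_rows DESC;"
```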

so the analytics lie somewhere within here?

I suppose I am not sure what you’re looking for exactly. If you’re referencing the data behind the analytics charts (which makes sense), then that data is only in Elasticsearch, in the datahub_usage_events indices. If you’re looking to generate analytics on your metadata for creating charts/dashboards or other visualizations, then the data may be in MySQL and/or Elasticsearch.

Yep, was looking for the analytics charts data. How would I pull that out of Elasticsearch? The MySQL stuff was useful for us too, so thanks for that.

Use the Elasticsearch API [https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-match-all-query.html] to query the datahub_usage_events index, or the openapi endpoint [https://demo.datahubproject.io/openapi/swagger-ui/index.html#/DataHub%20Usage/raw], which passes the Elasticsearch query through to the right index.

Example curl

```
curl -X POST \
  'https://demo.datahubproject.io/openapi/v1/analytics/datahub_usage_events/_search' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
        "match_all": {}
    }
}'
```

Thanks, so I ran that curl and it worked. I’m interested in the results here. In the JSON it looks like a couple of events happened and were recorded. However, when I check the analytics board it says 23 active users, etc. We redeployed this morning, so I guess that explains why our logs don’t have much in them, but how does DataHub have a history of what happened? And if it does, how can I also access that? At the moment it seems to only give me a portion of the data needed.

That’s just 1 page of results. You can customize the page size and use the Elasticsearch search_after mechanism to page through all the results. Using the Elasticsearch API you can also filter and do other things like aggregations. The search_after docs are here: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/paginate-search-results.html#search-after
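
A rough sketch of that paging loop, assuming jq is available, the same demo endpoint as the earlier curl, and that the openapi raw endpoint forwards size/sort/search_after to Elasticsearch unchanged (the linked docs recommend adding a unique tiebreaker sort field; timestamp alone is used here for brevity):

```
URL='https://demo.datahubproject.io/openapi/v1/analytics/datahub_usage_events/_search'
BODY='{"size": 100, "query": {"match_all": {}}, "sort": [{"timestamp": "asc"}]}'

while true; do
  RESP=$(curl -s -X POST "$URL" \
    -H 'accept: application/json' -H 'Content-Type: application/json' \
    -d "$BODY")

  # Stop when a page comes back empty
  HITS=$(echo "$RESP" | jq '.hits.hits | length')
  [ "$HITS" -eq 0 ] && break

  # Process the current page, e.g. print the raw events
  echo "$RESP" | jq -c '.hits.hits[]._source'

  # The next page starts after the sort values of the last hit on this page
  LAST_SORT=$(echo "$RESP" | jq -c '.hits.hits[-1].sort')
  BODY=$(echo "$BODY" | jq -c --argjson sa "$LAST_SORT" '. + {search_after: $sa}')
done
```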

This is very much a raw and experimental interface, so depending on what you would like to do, you can always directly query Elasticsearch and use clients for your preferred language, etc.
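
For example, going straight to the Elasticsearch deployed by the helm chart prerequisites (the elasticsearch-master service name and an unsecured cluster are assumptions here, and the exact usage index name can vary, so list the indices first):

```
# Forward local port 9200 to the in-cluster Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 &

# Find the exact name of the usage events index (prefixes can vary by deployment)
curl -s 'http://localhost:9200/_cat/indices/*usage*?v'

# Query it directly with the plain Elasticsearch search API
USAGE_INDEX='datahub_usage_event'   # example; use the name found above
curl -s "http://localhost:9200/${USAGE_INDEX}/_search" \
  -H 'Content-Type: application/json' \
  -d '{"size": 5, "query": {"match_all": {}}, "sort": [{"timestamp": "desc"}]}'
```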

Thanks! This is what we are looking for. We can now pull that out for analysis and make our own dashboards.

<@U02TYQ4SPPD> Hi, if I query Elasticsearch for all the results and filter for LogInEvent, it seems that users who sign in with SSO don’t come through. Do you know what the issue could be?

The events in this index are generated by the frontend JavaScript code; you can see them in the /track calls made by your browser. It is possible that this tracking is only on the DataHub login page itself, a page that is bypassed by an SSO login. A different way to detect user access might be to count all events by user using a histogram. This is a random example that counts events by user per month for approximately the last 3 months, excluding the admin user. It could be used, for example, to power a leaderboard of the top users per month based on the number of events for that user.

    "size": 0,
    "aggregations": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "timestamp": {
                                    "from": "1698401323000",
                                    "to": "1706285323000",
                                    "include_lower": true,
                                    "include_upper": false,
                                    "boost": 1
                                }
                            }
                        }
                    ],
                    "must_not": [
                        {
                            "term": {
                                "actorUrn.keyword": "urn:li:corpuser:admin"
                            }
                        }
                    ],
                    "adjust_pure_negative": true,
                    "boost": 1
                }
            },
            "aggregations": {
                "date_histogram": {
                    "date_histogram": {
                        "field": "timestamp",
                        "calendar_interval": "month",
                        "offset": 0,
                        "order": {
                            "_key": "asc"
                        },
                        "keyed": false,
                        "min_doc_count": 0
                    },
                    "aggregations": {
                        "usage": {
                            "terms": {
                                "field": "actorUrn.keyword",
                                "size": 10000
                            }
                        }
                    }
                }
            }
        }
    }
}
```
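
To run it, you can save the body above to a file (the file name here is just an example) and POST it to the same openapi endpoint used earlier:

```
# Save the aggregation body above as usage_agg.json, then:
curl -s -X POST \
  'https://demo.datahubproject.io/openapi/v1/analytics/datahub_usage_events/_search' \
  -H 'accept: application/json' -H 'Content-Type: application/json' \
  -d @usage_agg.json
```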

The result, if run on a real system, would have more users, but this at least gives you an idea of the response:

  "took": 318,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8218,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "filter#filtered": {
      "meta": {},
      "doc_count": 8218,
      "date_histogram#date_histogram": {
        "buckets": [
          {
            "key_as_string": "2024-01-01T00:00:00.000Z",
            "key": 1704067200000,
            "doc_count": 8218,
            "sterms#usage": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "urn:li:corpuser:datahub",
                  "doc_count": 8218
                }
              ]
            }
          }
        ]
      }
    }
  }
}
```

You would get a bucket per month if you have history (this demo system doesn’t).