Accessing DataHub Analytics Data from MySQL and Elasticsearch on a Kubernetes Deployment

Original Slack Thread

How do I connect to the internal datahub analytics database (I think it’s MySQL) on a kubernetes deployment?

There is no database specific to analytics. If you are using the helm chart prerequisites, the MySQL instance is called prerequisites-mysql on the standard port 3306, and the credentials are in a Kubernetes secret called mysql-secrets (unless overridden). You can connect to the database with kubectl port forwarding: https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/
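
For example, a minimal sketch of the port-forward approach, assuming the default prerequisites release, the default mysql-secrets secret key, and a local mysql client (adjust the names to your deployment):

```
# Pull the MySQL root password from the Kubernetes secret created by the
# prerequisites chart (secret name and key assume the chart defaults)
MYSQL_PASSWORD=$(kubectl get secret mysql-secrets \
  -o jsonpath='{.data.mysql-root-password}' | base64 -d)

# Forward local port 3306 to the prerequisites-mysql service in the cluster
kubectl port-forward svc/prerequisites-mysql 3306:3306 &

# Connect with any MySQL client over the forwarded port
mysql -h 127.0.0.1 -P 3306 -u root -p"$MYSQL_PASSWORD"
```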

<@U03MF8MU5P0> thanks! We will try this out as we use helm. I presume there is only one database, and it looks to be called datahub? Is there some documentation on the database schema and table structure so we know what each schema contains?

There is one database and a single table. The schema DDL for MySQL, for example, is located in this file: https://github.com/datahub-project/datahub/blob/master/docker/mysql-setup/init.sql#L6
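
As a hedged example against that single table (metadata_aspect_v2, per the linked init.sql), using the forwarded connection from above, you could count stored aspect rows per entity type:

```
# Count aspect rows per entity type; urns look like urn:li:<entityType>:...,
# so the third colon-delimited token is the entity type
mysql -h 127.0.0.1 -P 3306 -u root -p"$MYSQL_PASSWORD" datahub \
  -e "SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(urn, ':', 3), ':', -1) AS entity_type,
             COUNT(*) AS aspect_rows
      FROM metadata_aspect_v2
      GROUP BY entity_type
      ORDER BY aspect_rows DESC;"
```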

so the analytics lie somewhere within here?

I suppose I am not sure what you’re looking for exactly. If you’re referencing the data behind the analytics charts (which makes sense), then that data is only in Elasticsearch, in the datahub_usage_events indices. If you’re looking to generate analytics on your metadata for creating charts/dashboards or other visualizations, then the data may be in MySQL and/or Elasticsearch.

Yep, was looking for the analytics charts data. How would I pull that out of Elasticsearch? The MySQL stuff was useful for us too, so thanks for that.

Use the Elasticsearch API [https://www.elastic.co/guide/en/elasticsearch/reference/7.17/query-dsl-match-all-query.html] to query the datahub_usage_events index, or the openapi endpoint [https://demo.datahubproject.io/openapi/swagger-ui/index.html#/DataHub%20Usage/raw], which passes the Elasticsearch query through to the right index.

Example curl

```
curl -X POST \
  'https://demo.datahubproject.io/openapi/v1/analytics/datahub_usage_events/_search' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": {
        "match_all": {}
    }
}'
```

Thanks, so I ran that curl and it worked. I’m interested in the results here. In the JSON it looks like a couple of events happened and were recorded. However, when I check the analytics board it says 23 active users, etc. We redeployed this morning, so I guess that explains why our logs don’t have much in them, but how does DataHub have a history of what happened? And if it does, how can I also access that? At the moment it seems to only give me a portion of the data needed.

That’s just 1 page of results. You can customize the page size and use the Elasticsearch search_after mechanism to page through all the results. Using the Elasticsearch API you can also filter and do other things like aggregations. The search_after docs are here: https://www.elastic.co/guide/en/elasticsearch/reference/7.17/paginate-search-results.html#search-after
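
A rough sketch of that paging loop, assuming jq is available, the same demo endpoint as the earlier curl, and that the openapi raw endpoint forwards size/sort/search_after to Elasticsearch unchanged (the linked docs recommend adding a unique tiebreaker sort field; timestamp alone is used here for brevity):

```
URL='https://demo.datahubproject.io/openapi/v1/analytics/datahub_usage_events/_search'
BODY='{"size": 100, "query": {"match_all": {}}, "sort": [{"timestamp": "asc"}]}'

while true; do
  RESP=$(curl -s -X POST "$URL" \
    -H 'accept: application/json' -H 'Content-Type: application/json' \
    -d "$BODY")

  # Stop when a page comes back empty
  HITS=$(echo "$RESP" | jq '.hits.hits | length')
  [ "$HITS" -eq 0 ] && break

  # Process the current page, e.g. print the raw events
  echo "$RESP" | jq -c '.hits.hits[]._source'

  # The next page starts after the sort values of the last hit on this page
  LAST_SORT=$(echo "$RESP" | jq -c '.hits.hits[-1].sort')
  BODY=$(echo "$BODY" | jq -c --argjson sa "$LAST_SORT" '. + {search_after: $sa}')
done
```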

This is very much a raw and experimental interface, so depending on what you would like to do, you can always directly query Elasticsearch and use clients for your preferred language, etc.
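
For example, going straight to the Elasticsearch deployed by the helm chart prerequisites (the elasticsearch-master service name and an unsecured cluster are assumptions here, and the exact usage index name can vary, so list the indices first):

```
# Forward local port 9200 to the in-cluster Elasticsearch
kubectl port-forward svc/elasticsearch-master 9200:9200 &

# Find the exact name of the usage events index (prefixes can vary by deployment)
curl -s 'http://localhost:9200/_cat/indices/*usage*?v'

# Query it directly with the plain Elasticsearch search API
USAGE_INDEX='datahub_usage_event'   # example; use the name found above
curl -s "http://localhost:9200/${USAGE_INDEX}/_search" \
  -H 'Content-Type: application/json' \
  -d '{"size": 5, "query": {"match_all": {}}, "sort": [{"timestamp": "desc"}]}'
```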

Thanks! This is what we are looking for. We can now pull that out for analysis and make our own dashboards.

<@U02TYQ4SPPD> Hi, if I query Elasticsearch for all the results and filter for LogInEvent, it seems that users who sign in with SSO don’t come through. Do you know what the issue could be?

The events in this index are generated by the frontend JavaScript code; you can see them in the /track calls made by your browser. It is possible that this tracking is only on the DataHub login page itself, a page that is bypassed by an SSO login. A different way to detect user access might be to count all events by user using a histogram. This is a random example that counts events by user per month for approximately the last 3 months, excluding the admin user. It could be used, for example, to power a leaderboard of the top users per month based on the number of events for that user.

    "size": 0,
    "aggregations": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {
                            "range": {
                                "timestamp": {
                                    "from": "1698401323000",
                                    "to": "1706285323000",
                                    "include_lower": true,
                                    "include_upper": false,
                                    "boost": 1
                                }
                            }
                        }
                    ],
                    "must_not": [
                        {
                            "term": {
                                "actorUrn.keyword": "urn:li:corpuser:admin"
                            }
                        }
                    ],
                    "adjust_pure_negative": true,
                    "boost": 1
                }
            },
            "aggregations": {
                "date_histogram": {
                    "date_histogram": {
                        "field": "timestamp",
                        "calendar_interval": "month",
                        "offset": 0,
                        "order": {
                            "_key": "asc"
                        },
                        "keyed": false,
                        "min_doc_count": 0
                    },
                    "aggregations": {
                        "usage": {
                            "terms": {
                                "field": "actorUrn.keyword",
                                "size": 10000
                            }
                        }
                    }
                }
            }
        }
    }
}
```
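
To run it, you can save the body above to a file (the file name here is just an example) and POST it to the same openapi endpoint used earlier:

```
# Save the aggregation body above as usage_agg.json, then:
curl -s -X POST \
  'https://demo.datahubproject.io/openapi/v1/analytics/datahub_usage_events/_search' \
  -H 'accept: application/json' -H 'Content-Type: application/json' \
  -d @usage_agg.json
```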

The result, if run on a real system, would have more users, but this at least gives you an idea of the response:

  "took": 318,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 8218,
      "relation": "eq"
    },
    "max_score": null,
    "hits": []
  },
  "aggregations": {
    "filter#filtered": {
      "meta": {},
      "doc_count": 8218,
      "date_histogram#date_histogram": {
        "buckets": [
          {
            "key_as_string": "2024-01-01T00:00:00.000Z",
            "key": 1704067200000,
            "doc_count": 8218,
            "sterms#usage": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "urn:li:corpuser:datahub",
                  "doc_count": 8218
                }
              ]
            }
          }
        ]
      }
    }
  }
}
```

You would get a bucket per month if you have history (this demo system doesn’t).