Populating the "Stats" Tab with Data from Internal Tools using DataHub GraphQL API

Original Slack Thread

is there a way, via API, to populate the “Stats” tab with data from our own internal tool? i’m not looking to add new info that isn’t already there by default, i just want to populate it — for example, “Row Count Over Time”, “Column Count Over Time”, etc. any examples of how i can populate these from external/internal tools and data?

To populate the “Stats” tab with data from your own internal tool, you can use the DataHub GraphQL API to ingest profiling statistics. Here’s a high-level overview of how to achieve this:

  1. Enable Profiling Stats in Ingestion: Ensure that profiling is enabled in your ingestion process so that profiling stats are captured and can later be queried via the GraphQL API. According to a related Slack conversation, profiling stats can be queried if they are enabled during ingestion.

  2. Use the GraphQL API to Ingest Stats: You can use the GraphQL API to push profiling data. The relevant GraphQL objects for dataset profiling are detailed in the Objects section of the DataHub GraphQL documentation.

  3. Example Code: Here is an example of how you might structure a GraphQL mutation to push profiling data. This example assumes you are using Python with the `requests` library; verify the mutation name and input fields against the GraphQL schema of your DataHub version, as they can change between releases.

import requests

url = "https://your-datahub-instance.com/api/graphql"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_ACCESS_TOKEN",
}

# Confirm that updateDatasetProfile exists in your DataHub version's
# GraphQL schema before relying on this mutation.
query = """
mutation {
  updateDatasetProfile(input: {
    urn: "urn:li:dataset:(urn:li:dataPlatform:bigquery,your-dataset,PROD)",
    profile: {
      timestampMillis: 1622547800000,
      rowCount: 100,
      columnCount: 5,
      columnProfiles: [
        {
          columnName: "column1",
          nullCount: 5,
          nullProportion: 0.05,
          distinctCount: 10,
          distinctProportion: 0.1,
          min: "1",
          max: "100",
          mean: "50.5",
          median: "50",
          stdev: "10.5",
          quantiles: ["10", "25", "50", "75", "90"]
        }
      ]
    }
  }) {
    urn
  }
}
"""

response = requests.post(url, headers=headers, json={"query": query})
response.raise_for_status()
print(response.json())
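String-interpolating values into the mutation text is brittle; GraphQL variables keep the query static and let your internal tool supply the numbers. Here is a sketch using only the Python standard library — the `updateDatasetProfile` mutation and the `DatasetProfileInput` type name are carried over from the snippet above as assumptions, so check them against your instance's GraphQL schema first:

```python
import json
from urllib import request

GRAPHQL_URL = "https://your-datahub-instance.com/api/graphql"
TOKEN = "YOUR_ACCESS_TOKEN"

# Static mutation; all values arrive through the $input variable.
MUTATION = """
mutation updateProfile($input: DatasetProfileInput!) {
  updateDatasetProfile(input: $input) { urn }
}
"""

def build_payload(urn, timestamp_millis, row_count, column_count):
    """Build the JSON body for the GraphQL request."""
    return {
        "query": MUTATION,
        "variables": {
            "input": {
                "urn": urn,
                "profile": {
                    "timestampMillis": timestamp_millis,
                    "rowCount": row_count,
                    "columnCount": column_count,
                },
            }
        },
    }

def post_payload(payload):
    """POST the payload to the GraphQL endpoint and return the parsed response."""
    req = request.Request(
        GRAPHQL_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {TOKEN}",
        },
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

payload = build_payload(
    "urn:li:dataset:(urn:li:dataPlatform:bigquery,your-dataset,PROD)",
    1622547800000, 100, 5,
)
# post_payload(payload)  # uncomment to actually send the request
```

Building the payload separately from sending it also makes the request body easy to unit-test without a live DataHub instance.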
  4. Custom Ingestion Source: If you need to ingest stats from a custom internal tool, you can write a custom ingestion source using the DataHub Python SDK — for example, by emitting a `datasetProfile` aspect for the target dataset. The <https://datahubproject.io/docs/metadata-ingestion/as-a-library|metadata ingestion documentation> provides guidance on how to use the SDK for this purpose.
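For completeness on step 1: for built-in sources, profiling is switched on in the ingestion recipe rather than via the API. A minimal sketch for a BigQuery recipe follows (connection details are elided, and the exact profiling options vary by source and DataHub version, so treat the field names as assumptions to verify):

```yaml
source:
  type: bigquery
  config:
    # ... connection / project settings ...
    profiling:
      enabled: true   # capture row/column counts and field-level stats
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"
    token: "YOUR_ACCESS_TOKEN"
```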

Would you like more detailed steps or specific examples on any of these points?