Populating the "Stats" Tab with Data from Internal Tools using DataHub GraphQL API

user-2 · June 24, 2024, 12:03am

Is there a way, via the API, to populate the “Stats” tab with data from our own internal tool? I’m not looking to add new info that isn’t already there by default, I just want to populate what’s there, for example “Row Count Over Time”, “Column Count Over Time”, etc. Any pointers on how I can populate it from our internal tools and data?

datahub_team · June 24, 2024, 12:03am

Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!

Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?

We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to rate its responses so other folks in the Community know if you’re still looking for help on the topic!

user-3 · June 24, 2024, 12:03am

To populate the “Stats” tab with data from your own internal tool, you can use the DataHub GraphQL API to ingest profiling statistics. Here’s a high-level overview of how to achieve this:

Enable Profiling Stats in Ingestion: Ensure that profiling stats are enabled in your ingestion process. This will allow you to capture and query profiling stats via the GraphQL API. According to a Slack conversation (listed under Sources), profiling stats can be queried if they are enabled during ingestion.
Use GraphQL API to Ingest Stats: You can use the GraphQL API to push profiling data. The relevant GraphQL objects for dataset profiling are detailed on the Objects page of the DataHub documentation.
Example Code: Here is an example of how you might structure a GraphQL mutation to push profiling data. This example assumes you are using Python to interact with the GraphQL API.

import requests

url = "https://your-datahub-instance.com/api/graphql"
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_ACCESS_TOKEN"
}

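# NOTE: the `updateDatasetProfile` mutation and its field names are shown as
# given in this reply; confirm them against your instance's GraphQL schema
# before relying on this request.
query = """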
mutation {
  updateDatasetProfile(input: {
    urn: "urn:li:dataset:(urn:li:dataPlatform:bigquery,your-dataset,PROD)",
    profile: {
      timestampMillis: 1622547800000,
      columnProfiles: [
        {
          columnName: "column1",
          nullCount: 5,
          nullProportion: 0.05,
          distinctCount: 10,
          distinctProportion: 0.1,
          min: "1",
          max: "100",
          mean: "50.5",
          median: "50",
          stdev: "10.5",
          quantiles: ["10", "25", "50", "75", "90"]
        }
      ],
      rowCount: 100,
      columnCount: 5
    }
  }) {
    urn
  }
}
"""

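# Fail fast on network stalls and HTTP-level errors before reading the body.
response = requests.post(url, headers=headers, json={"query": query}, timeout=30)
response.raise_for_status()
print(response.json())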
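The request above prints the raw JSON, which can hide a failure: a GraphQL server normally reports problems such as an unknown mutation or field in an "errors" array in the response body, often with an HTTP 200 status. The following is a minimal sketch of a check for that; `GraphQLError` and `unwrap_graphql` are hypothetical helper names, not part of DataHub or `requests`.

```python
from __future__ import annotations

from typing import Any


class GraphQLError(RuntimeError):
    """Raised when a GraphQL response body carries an "errors" array."""


def unwrap_graphql(body: dict[str, Any]) -> dict[str, Any]:
    """Return the "data" part of a parsed GraphQL response, or raise.

    `body` is the already-parsed JSON, e.g. the result of response.json().
    """
    errors = body.get("errors")
    if errors:
        # Each error is usually an object with a "message" key; fall back
        # to the raw value if the server sent something else.
        messages = "; ".join(
            str(e.get("message", e)) if isinstance(e, dict) else str(e)
            for e in errors
        )
        raise GraphQLError(messages)
    return body.get("data") or {}
```

Calling `unwrap_graphql(response.json())` in place of the final `print` would surface a schema mismatch as an exception instead of a printed payload that looks like success.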

Custom Ingestion Source: If you need to ingest data from a custom internal tool, you can write a custom ingestion source using the DataHub Python SDK. The metadata ingestion documentation (https://datahubproject.io/docs/metadata-ingestion/as-a-library) provides guidance on how to use the SDK for this purpose.
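Whichever route is used to send the data, charts such as “Row Count Over Time” presumably need one profile per point in time, so the internal tool's snapshots first have to be turned into timestamped records. The sketch below does only that preparation step with the standard library. The input keys `measured_at`, `rows`, and `columns` are hypothetical names for whatever the internal tool exports; the output keys `timestampMillis`, `rowCount`, and `columnCount` follow the example in this reply. The sending step is intentionally left out.

```python
from datetime import datetime, timezone


def to_epoch_millis(moment: datetime) -> int:
    """Convert a datetime to epoch milliseconds (naive values are treated as UTC)."""
    if moment.tzinfo is None:
        moment = moment.replace(tzinfo=timezone.utc)
    return int(round(moment.timestamp() * 1000))


def build_profile_points(snapshots):
    """Turn internal-tool snapshots into profile records, oldest first.

    Each snapshot is a dict with the hypothetical keys "measured_at"
    (datetime), "rows" (int), and "columns" (int).
    """
    points = [
        {
            "timestampMillis": to_epoch_millis(s["measured_at"]),
            "rowCount": s["rows"],
            "columnCount": s["columns"],
        }
        for s in snapshots
    ]
    return sorted(points, key=lambda p: p["timestampMillis"])
```

Each resulting record would then be sent as its own profile, so the time series is built up from one entry per snapshot.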

Would you like more detailed steps or specific examples on any of these points?

Sources:

Slack conversation on profiling stats
GraphQL API overview: https://datahubproject.io/docs/api/graphql/overview
Objects: DataHub objects for dataset profiling
Metadata ingestion as a library: https://datahubproject.io/docs/metadata-ingestion/as-a-library
