Querying Actual Data in DataHub: Understanding Metadata Management and Limitations

Original Slack Thread

Hi folks, I am running DataHub v0.13.1. If I ingest the actual data into OpenSearch/Elasticsearch, can we query the actual data in DataHub, and if so, how? I saw a query button under each dataset; can we run queries against the dataset from there?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To address your question about querying actual data in DataHub:

  1. DataHub Version: You mentioned you are using DataHub v0.13.1.
  2. Ingestion Method: Are you using the UI or CLI for ingestion?
  3. Data Source: What data source(s) are you integrating with DataHub (e.g., BigQuery, MongoDB, etc.)?

Querying Actual Data in DataHub

DataHub focuses on metadata management rather than storing and querying actual data. According to the About DataHub Search page in the DataHub documentation, DataHub does not provide a means to query the actual data directly from the database. It is designed to catalog metadata, such as schema information, descriptions, and lineage. The Elasticsearch/OpenSearch instance that DataHub runs on is a search index over that metadata, not a store for the rows of your tables.

Query Button Under Dataset

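The query button you see under a dataset in the DataHub UI does not execute anything against the underlying database. As far as I can tell, it lists SQL queries associated with the dataset (for example, queries that users have highlighted or that usage ingestion has captured), so it documents how the dataset is used rather than acting as a query console. Schema, lineage, and other metadata are available on the dataset's other tabs.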

Profiling and Sample Values

If you have profiling enabled, DataHub can show sample values in the UI. Profiling helps in understanding the data distribution and quality but does not allow for querying the actual data directly. For more details on enabling profiling, you can refer to the <https://datahubproject.io/docs/next/metadata-ingestion/docs/dev_guides/sql_profiles/|profiling documentation>.
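As a rough sketch of what enabling profiling can look like, the recipe below is built as a Python dict and run programmatically. The Postgres source, connection values, and the `build_profiling_recipe` / `run_recipe` helper names are assumptions for illustration only; the thread does not say which source is in use, so check the profiling options your connector actually supports.

```python
def build_profiling_recipe(host_port, database, username, password,
                           gms_server="http://localhost:8080"):
    """Return an ingestion recipe with SQL profiling turned on.

    Hypothetical helper: the Postgres source and the default GMS address
    are assumptions, not details taken from the thread.
    """
    return {
        "source": {
            "type": "postgres",
            "config": {
                "host_port": host_port,
                "database": database,
                "username": username,
                "password": password,
                "profiling": {
                    "enabled": True,
                    # Sample values are what the UI can display per column.
                    "include_field_sample_values": True,
                },
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": gms_server}},
    }


def run_recipe(recipe):
    """Run the recipe; requires the acryl-datahub package with the source plugin."""
    # Imported lazily so the recipe can be built without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Even with profiling on, DataHub only receives column statistics and a handful of sample values; the rows themselves stay in the source system.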

Example GraphQL Query for Metadata

Here is an example of a GraphQL query to fetch metadata about a dataset, including its schema and lineage:

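query {
  dataset(urn: "your_dataset_urn") {
    urn
    properties {
      name
      description
    }
    schemaMetadata {
      fields {
        fieldPath
        nativeDataType
        description
      }
    }
    upstream: lineage(input: { direction: UPSTREAM, start: 0, count: 10 }) {
      relationships {
        entity {
          urn
        }
      }
    }
    downstream: lineage(input: { direction: DOWNSTREAM, start: 0, count: 10 }) {
      relationships {
        entity {
          urn
        }
      }
    }
  }
}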

Replace "your_dataset_urn" with the actual URN of the dataset you are interested in.
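If you would rather call the GraphQL API from a script than from the UI, a minimal sketch could look like the following. The endpoint URL (a local quickstart GMS), the optional personal access token, and the helper names `build_request` and `fetch_dataset_metadata` are assumptions; the field names follow the DataHub GraphQL schema as I understand it, so verify them against your deployment before relying on this.

```python
import json
import urllib.request

# Assumed endpoint for a local quickstart deployment; adjust for your setup.
GRAPHQL_URL = "http://localhost:8080/api/graphql"

DATASET_QUERY = """
query getDataset($urn: String!) {
  dataset(urn: $urn) {
    urn
    properties { name description }
    schemaMetadata { fields { fieldPath nativeDataType description } }
  }
}
"""


def build_request(urn, token=None, url=GRAPHQL_URL):
    """Build the HTTP POST request carrying the GraphQL query and its variables."""
    body = json.dumps({"query": DATASET_QUERY, "variables": {"urn": urn}})
    headers = {"Content-Type": "application/json"}
    if token:
        # Only needed when metadata service authentication is enabled.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        url, data=body.encode("utf-8"), headers=headers, method="POST"
    )


def fetch_dataset_metadata(urn, token=None, url=GRAPHQL_URL):
    """Send the query and return the dataset's metadata (never its rows)."""
    with urllib.request.urlopen(build_request(urn, token, url), timeout=30) as resp:
        payload = json.load(resp)
    if payload.get("errors"):
        raise RuntimeError(payload["errors"])
    return payload["data"]["dataset"]
```

Calling `fetch_dataset_metadata("your_dataset_urn")` returns the same metadata the UI shows; to read actual rows you still need to query the source system directly.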

Would you like more details on any specific aspect of querying metadata or setting up profiling?
