Writing Queries on a MSSQL DB Dataset Ingested into DataHub

Original Slack Thread

<@U06TM7M40RK> how to write queries on a MSSQL DB dataset ingested into datahub?

To write queries on a MSSQL DB dataset ingested into DataHub, you can follow these steps:

  1. Create a Query File: Create a JSON file containing your SQL queries. Each line in this file should be a separate JSON object with the following fields:

    • query: The SQL query to parse.
    • timestamp (optional): The timestamp of the query, in seconds since the epoch.
    • user (optional): The user who ran the query.
    • operation_type (optional): Platform-specific operation type, used if the operation type can’t be parsed.
    • downstream_tables (optional): Fallback list of tables that the query writes to, used if the query can’t be parsed.
    • upstream_tables (optional): Fallback list of tables the query reads from, used if the query can’t be parsed.

    Example queries.json file:

    { "query": "SELECT * FROM my_table", "timestamp": 1689232738, "user": "user_a", "upstream_tables": ["my_database.my_schema.my_table"] }
    { "query": "INSERT INTO my_table VALUES (1, 'a')", "timestamp": 1689232737, "user": "user_b", "downstream_tables": ["my_database.my_schema.my_table"] }
    
  2. Install the SQL Queries Plugin: Install the necessary plugin for SQL queries ingestion.

    pip install 'acryl-datahub[sql-queries]'
    
  3. Create a Recipe File: Create a YAML file for the ingestion recipe. This file will specify the source type, configuration, and the query file path.

    datahub_api:
      server: http://localhost:8080
      timeout_sec: 60
    source:
      type: sql-queries
      config:
        platform: "mssql"
        default_db: "MY_DATABASE"
        query_file: "./queries.json"
    
  4. Run the CLI Ingest Command: Use the DataHub CLI to run the ingestion process with the created recipe file.

    datahub ingest -c ingest_queries.yml
    

For more detailed information, you can refer to the <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/sql-queries|SQL Queries Ingestion Documentation>.

Would you like more detail on any specific step?

Sources:

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To help you effectively, I need a bit more information:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub (e.g., BigQuery, MSSQL)?

Based on the information you provide, I can give you a more precise answer. However, I can provide some general guidance on querying metadata in DataHub.

General Guidance on Querying Metadata in DataHub

DataHub primarily focuses on metadata management rather than storing and querying actual data. According to the <About DataHub Search | DataHub documentation>, DataHub does not provide a means to query the actual data directly from the database. It is designed to catalog metadata, such as schema information, descriptions, and lineage.

Query Button Under Dataset

The query button you see under any dataset in the DataHub UI is typically used for metadata queries rather than querying the actual data. This button allows you to explore metadata aspects like schema, lineage, and other related information.

Profiling and Sample Values

If you have profiling enabled, DataHub can show sample values in the UI. Profiling helps in understanding the data distribution and quality but does not allow for querying the actual data directly. For more details on enabling profiling, you can refer to the <https://datahubproject.io/docs/next/metadata-ingestion/docs/dev_guides/sql_profiles/|profiling documentation>.

Example GraphQL Query for Metadata

Here is an example of a GraphQL query to fetch metadata about a dataset, including its schema and lineage:

query {
  dataset(urn: "your_dataset_urn") {
    name
    description
    schema {
      columns {
        name
        type
        description
      }
    }
    upstreamLineage {
      upstreams {
        dataset {
          name
          urn
        }
      }
    }
    downstreamLineage {
      downstreams {
        dataset {
          name
          urn
        }
      }
    }
  }
}

Replace "your_dataset_urn" with the actual URN of the dataset you are interested in.

Would you like more details on any specific aspect of querying metadata or setting up profiling?

Sources:

<@U06TM7M40RK> how can I use query section in datahub UI for a dataset already ingested?

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand: