Extracting Schema and Table/View Name from PostgreSQL Dataset URN and Passing Great Expectations Test Results to DataHub via DataHubValidationAction

Original Slack Thread

<@U06TM7M40RK>
Is it possible to write a query that extracts the schema and table/view name from the Dataset URN in the postgres database for me?

To extract the schema and table/view name from a Dataset URN for a PostgreSQL source in DataHub, you can use GraphQL queries. The URN for a PostgreSQL dataset typically follows the pattern: urn:li:dataset:(urn:li:dataPlatform:postgres,database.schema.table,PROD).

Here’s a GraphQL query that can help you retrieve the schema and table/view name from a Dataset URN:

query getDatasetSchema($urn: String!) {
  dataset(urn: $urn) {
    name
    properties {
      name
    }
    schemaMetadata {
      fields {
        fieldPath
        type
        nativeDataType
      }
    }
  }
}

Pass the dataset URN as the $urn variable in the request's variables payload rather than editing the query text. Note that schemaMetadata describes the dataset's columns; the database, schema, and table/view names themselves are encoded in the name component of the URN, so parsing the URN string is usually the most direct way to recover them.
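If you want to issue this query programmatically, here is a minimal sketch using only Python's standard library. The endpoint path /api/graphql and the Bearer-token header are assumptions based on a default DataHub deployment; adjust the base URL (frontend on 9002 or GMS on 8080) and auth for your setup.

```python
import json
import urllib.request

def build_graphql_request(base_url, urn, token=None):
    """Build an HTTP request for DataHub's GraphQL endpoint (assumed at /api/graphql)."""
    query = """
    query getDatasetSchema($urn: String!) {
      dataset(urn: $urn) {
        name
        schemaMetadata { fields { fieldPath nativeDataType } }
      }
    }
    """
    payload = json.dumps({"query": query, "variables": {"urn": urn}}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(f"{base_url}/api/graphql", data=payload, headers=headers)

req = build_graphql_request(
    "http://localhost:9002",  # assumed default frontend address
    "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.orders,PROD)",
)
# To actually send it (requires a running DataHub instance):
# response = json.loads(urllib.request.urlopen(req).read())
print(req.full_url)
```

Sending the request is left commented out so the sketch can be inspected without a live server.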

If you need to extract the schema and table/view name directly from the URN string, you can parse it in your preferred programming language: split the URN on commas to isolate the name component (database.schema.table), then split that on periods to get the database, schema, and table/view names.
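The splitting described above can be sketched in Python like this. It assumes the standard three-part name format (database.schema.table) used by the PostgreSQL source; names containing literal commas or extra dots would need a more careful parser.

```python
def parse_postgres_dataset_urn(urn):
    """Split a DataHub PostgreSQL dataset URN into platform, database, schema, table, env.

    Assumes the form:
    urn:li:dataset:(urn:li:dataPlatform:postgres,database.schema.table,PROD)
    """
    prefix = "urn:li:dataset:("
    if not (urn.startswith(prefix) and urn.endswith(")")):
        raise ValueError(f"Not a dataset URN: {urn}")
    inner = urn[len(prefix):-1]
    # The platform URN contains no commas, so two right-splits isolate name and env.
    platform, name, env = inner.rsplit(",", 2)
    database, schema, table = name.split(".", 2)
    return {"platform": platform, "database": database,
            "schema": schema, "table": table, "env": env}

urn = "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.orders,PROD)"
print(parse_postgres_dataset_urn(urn))
```

The same two-step split (commas first, then periods) translates directly to any language with basic string operations.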

Would you like more detailed guidance on how to implement this in a specific programming language or environment?


<@U06TM7M40RK> Yes, give me example of split URN in Kotlin

I’m not able to find any data sources to answer your question. You might want to specify the format of the URN you are working with. If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK>
I have one more question. I am doing tests (Expectations) using GX (Great Expectations). How can I pass the test results to DataHub?
DataHub 0.14.1
GX 0.17.23

To pass test results from Great Expectations (GX) to DataHub, you can use the DataHubValidationAction within your Great Expectations Checkpoint. This integration allows you to send assertions (expectations) and their results to DataHub using DataHub’s Python Rest emitter.

Here’s a step-by-step guide on how to set this up:

  1. Install the Required Dependency: Ensure that you have the acryl-datahub package installed with the Great Expectations plugin. You can install it using pip:

    pip install 'acryl-datahub[great-expectations]'
    
  2. Configure the DataHubValidationAction: Add the DataHubValidationAction to your Great Expectations Checkpoint configuration. This action will send the validation results to DataHub.

    Here’s an example configuration for a Checkpoint:

    name: my_checkpoint
    config_version: 1
    class_name: Checkpoint
    validations:
      - batch_request:
          datasource_name: my_datasource
          data_connector_name: my_data_connector
          data_asset_name: my_table
        expectation_suite_name: my_expectation_suite
    action_list:
      - name: store_validation_result
        action:
          class_name: StoreValidationResultAction
      - name: update_data_docs
        action:
          class_name: UpdateDataDocsAction
      - name: datahub_action
        action:
          module_name: datahub.integrations.great_expectations.action
          class_name: DataHubValidationAction
          server_url: http://localhost:8080  # Replace with your DataHub GMS endpoint
          token: YOUR_DATAHUB_TOKEN  # Optional: If authentication is required
    
  3. Run the Checkpoint: Execute the Checkpoint to run your expectations and send the results to DataHub.

    from great_expectations.data_context import DataContext

    # Load the project (the directory that contains great_expectations.yml)
    context = DataContext("/path/to/your/great_expectations")
    checkpoint = context.get_checkpoint("my_checkpoint")
    checkpoint_result = checkpoint.run()
    print(checkpoint_result.success)  # True if every expectation passed
    
  4. Verify in DataHub: After running the Checkpoint, you should be able to see the validation results in the DataHub UI under the “Validation” tab for the respective dataset.

Please note that the DataHubValidationAction currently supports Great Expectations V3 (Batch Request) API datasources with the SqlAlchemyExecutionEngine. If you are using a different execution engine, such as Spark or Pandas, this integration is not supported.

Would you like more detailed guidance on any specific part of this process?


<@U06TM7M40RK>
Can it be done in a different way? Without using DataHubValidationAction

It seems likely that integrating Great Expectations (GX) with DataHub without using DataHubValidationAction would require a custom approach. The DataHubValidationAction is specifically designed to facilitate the integration between GX and DataHub by sending validation results directly to DataHub’s metadata store. However, if you want to explore alternative methods, here are a few possibilities:

  1. Custom Emitter: You could create a custom script or tool that extracts validation results from GX and formats them into DataHub’s metadata model. This would involve using DataHub’s REST API to manually send the validation results as metadata change proposals.

  2. Use of DataHub’s Open API: DataHub provides an open API that you can use to ingest metadata. You could write a script that takes the output of your GX validations and sends it to DataHub using this API. This would require understanding the API endpoints and the expected data format.

  3. Intermediate Storage: Store the GX validation results in an intermediate storage system (e.g., a database or a file system) and then use a separate process to read from this storage and send the data to DataHub.

  4. Direct Database Integration: If your GX results are stored in a database, you could write a custom integration that reads from this database and sends the data to DataHub.

These approaches would require a good understanding of both GX’s output format and DataHub’s metadata model. Additionally, you would need to handle authentication and error handling when interacting with DataHub’s API.
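As a concrete illustration of the custom-emitter idea, the sketch below builds the JSON payload that DataHub's Rest.li ingest endpoint (POST {gms}/aspects?action=ingestProposal) expects for an assertionRunEvent aspect, using only the standard library. The aspect field names follow DataHub's metadata model around 0.14.x and should be verified against your server; the URNs and run id are made up for illustration.

```python
import json
import time

def build_assertion_run_event(dataset_urn, assertion_urn, success, run_id):
    """Build an ingestProposal payload reporting one assertion run to DataHub.

    Field names follow the assertionRunEvent aspect as of DataHub ~0.14.x;
    verify them against your server's metadata model before relying on this.
    """
    aspect = {
        "timestampMillis": int(time.time() * 1000),
        "runId": run_id,
        "asserteeUrn": dataset_urn,       # the dataset the assertion ran against
        "assertionUrn": assertion_urn,
        "status": "COMPLETE",
        "result": {"type": "SUCCESS" if success else "FAILURE"},
    }
    return {
        "proposal": {
            "entityType": "assertion",
            "entityUrn": assertion_urn,
            "changeType": "UPSERT",
            "aspectName": "assertionRunEvent",
            # Rest.li expects the aspect serialized as a JSON string.
            "aspect": {
                "value": json.dumps(aspect),
                "contentType": "application/json",
            },
        }
    }

payload = build_assertion_run_event(
    "urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.orders,PROD)",
    "urn:li:assertion:my-gx-expectation",  # hypothetical assertion id
    success=True,
    run_id="gx-run-2024-01-01",
)

# To send (requires a running GMS; add an Authorization header if auth is enabled):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8080/aspects?action=ingestProposal",
#     data=json.dumps(payload).encode("utf-8"),
#     headers={"Content-Type": "application/json",
#              "X-RestLi-Protocol-Version": "2.0.0"},
# )
# urllib.request.urlopen(req)
print(payload["proposal"]["aspectName"])
```

In practice the acryl-datahub Python package's REST emitter does this serialization for you, so hand-building the payload is mainly useful when you cannot install that dependency.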

Would you like more detailed guidance on any of these alternative methods?


Hi :slight_smile: ! Did this interaction resolve your issue or do you still require assistance ?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps ! ><