Steps to Push Great Expectations Assertions to DataHub Using Acryl-DataHub Plugin

Original Slack Thread

<@U06TM7M40RK> I have DataHub set up and I have configured Great Expectations assertions. The assertions are working fine locally.
I want to push them to DataHub using the great_expectations and acryl-datahub plugins. Can you tell me what steps are required?

To push Great Expectations assertions to DataHub using the acryl-datahub plugin, you can follow these steps:

  1. Install the Required Plugin: Ensure that you have the acryl-datahub-gx-plugin installed in your Great Expectations environment. You can do this by running the following command:

    pip install 'acryl-datahub-gx-plugin'
    
  2. Configure DataHubValidationAction: You need to add the DataHubValidationAction to your Great Expectations Checkpoint configuration. This action will send the assertions and their results to DataHub. Here is an example configuration:

    action_list:
      - name: datahub_action
        action:
          module_name: datahub_gx_plugin.action
          class_name: DataHubValidationAction
          server_url: http://localhost:8080 # Replace with your DataHub server URL
    

    Configuration Options:

    • server_url (required): The URL of the DataHub GMS endpoint.
    • env (optional): The environment to use in the namespace when constructing dataset URNs.
    • exclude_dbname (optional): Exclude the database name/catalog when constructing dataset URNs.
    • platform_alias (optional): Platform alias when constructing dataset URNs.
    • platform_instance_map (optional): Maps the GX ‘data source’ name to a platform instance on DataHub.
    • graceful_exceptions (optional): If set to true, most runtime errors in the lineage backend will be suppressed.
    • token (optional): Bearer token used for authentication.
    • timeout_sec (optional): Per-HTTP request timeout.
    • retry_status_codes (optional): Retry HTTP request on these status codes.
    • retry_max_times (optional): Maximum times to retry if HTTP request fails.
    • extra_headers (optional): Extra headers to be added to the DataHub request.
    • parse_table_names_from_sql (optional): Enable SQL parsing to try to parse the datasets being asserted.
    • convert_urns_to_lowercase (optional): Convert dataset URNs to lowercase.
  3. Run the Checkpoint: After configuring the DataHubValidationAction, run your Great Expectations Checkpoint. This will execute the assertions and send the results to DataHub.

  4. Debugging: If you encounter issues, you can enable debug logging by setting the environment variable DATAHUB_DEBUG to true.
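Putting steps 1 and 2 together, a complete Checkpoint file might look like the sketch below. The checkpoint, datasource, asset, and suite names are placeholders, and the exact Checkpoint schema depends on your Great Expectations version, so treat this as an illustration rather than a drop-in file:

```yaml
# checkpoints/my_checkpoint.yml -- illustrative names, adjust to your project
name: my_checkpoint
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-datahub-run"
validations:
  - batch_request:
      datasource_name: my_datasource      # placeholder
      data_asset_name: my_table           # placeholder
    expectation_suite_name: my_suite      # placeholder
action_list:
  - name: datahub_action
    action:
      module_name: datahub_gx_plugin.action
      class_name: DataHubValidationAction
      server_url: http://localhost:8080   # your DataHub GMS URL
      # token: <your-token>               # uncomment if authentication is enabled
```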

For more detailed information, you can refer to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/integration_docs/great-expectations.md|Great Expectations integration guide>.

Would you like more detailed guidance on any specific step?

<@U06TM7M40RK> I don't want to use the datahub-gx-plugin. Can I keep using great_expectations with acryl-datahub, as you suggested below?

It seems likely that you can use Great Expectations with DataHub without relying on the datahub-gx-plugin. Instead, you can use the acryl-datahub package to push assertions to DataHub. Here are some steps you might consider:

  1. Install Acryl-DataHub: Ensure that you have the acryl-datahub package installed. You can do this by running:

    pip install acryl-datahub
    
  2. Configure Great Expectations: You can configure your Great Expectations setup to use the DataHubValidationAction directly in your Checkpoint configuration. This action will send the validation results to DataHub. Here is an example configuration:

    action_list:
      - name: datahub_action
        action:
          module_name: datahub.integrations.great_expectations.action
          class_name: DataHubValidationAction
          server_url: http://localhost:8080 # Replace with your DataHub server URL
    
  3. Run the Checkpoint: Execute your Great Expectations Checkpoint. This will run the assertions and send the results to DataHub.

  4. Troubleshooting: If you encounter issues, ensure that your DataHub server URL and any required authentication tokens are correctly configured.

It seems that the integration between Great Expectations and DataHub is supported, but it requires the use of the DataHubValidationAction to send results to DataHub. This approach does not require the datahub-gx-plugin.

Would you like more detailed guidance on any specific step?

<@U06TM7M40RK> We tried the steps you gave above, but we get the following error when we run the checkpoint:
No module named datahub.integrations.great_expectations.action could be found in your plugins directory.
- Please verify your plugins directory is configured correctly.
- Please verify you have a module named datahub.integrations.great_expectations.action in your plugins directory.

The error message indicates that the module datahub.integrations.great_expectations.action is not found in your environment. This suggests that the necessary integration module is not installed or not correctly configured. Here are some steps to troubleshoot and resolve this issue:

  1. Verify Installation: Ensure that the acryl-datahub package is installed correctly. You can verify this by running:

    pip list | grep acryl-datahub
    

    If it’s not installed, you can install it using:

    pip install acryl-datahub
    
  2. Check the Module Path: The path datahub.integrations.great_expectations.action exists only in older releases of acryl-datahub. In newer releases the Great Expectations action was moved into the separate acryl-datahub-gx-plugin package, where the module path is datahub_gx_plugin.action. If you are on a recent version, either install that plugin and update module_name accordingly, or pin an older acryl-datahub release that still bundles the action.

  3. Reinstall the Package: Sometimes, reinstalling the package can resolve issues with missing modules. Try uninstalling and then reinstalling the acryl-datahub package:

    pip uninstall acryl-datahub
    pip install acryl-datahub
    
  4. Verify Python Environment: Ensure that you are using the correct Python environment where acryl-datahub is installed. If you are using a virtual environment, make sure it is activated.

  5. Check for Known Issues: There might be known issues or updates related to the integration. You can check the <https://datahubproject.io/docs/0.13.1/metadata-ingestion/integration_docs/great-expectations|DataHub Great Expectations Integration Guide> for any updates or additional configuration steps.

If these steps do not resolve the issue, please provide more details about your environment and configuration, and I can assist further.
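As a quick check for step 2, you can probe which of the two module paths mentioned in this thread is importable in your current environment. This is a small sketch; the helper name module_available is just for illustration:

```python
import importlib.util

def module_available(name: str) -> bool:
    """Return True if the module can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # A parent package (e.g. `datahub`) is not installed at all.
        return False

# The two module paths mentioned in this thread:
for mod in (
    "datahub_gx_plugin.action",                        # acryl-datahub-gx-plugin
    "datahub.integrations.great_expectations.action",  # older acryl-datahub releases
):
    print(f"{mod}: {'found' if module_available(mod) else 'NOT found'}")
```

Whichever path prints "found" is the module_name to use in your action_list; if neither is found, the relevant package is missing from the active Python environment.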
