Integrating Dremio with DataHub: Tag Ingestion and Troubleshooting

Original Slack Thread

Hello everyone, I’m considering integrating Dremio with DataHub by following the instructions on this page.
https://blog.datahubproject.io/introducing-datahub-integration-with-dremio-322198bac697

I’d like to ask: Can the tags set on Virtual Datasets and Physical Datasets in Dremio also be ingested into DataHub as part of the viewable metadata through this ingestion feature?

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g., <@U06TM7M40RK>) to have it continue generating responses!

It seems likely that the integration of Dremio with DataHub can include the ingestion of tags for both Virtual and Physical Datasets. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/dremio/dremio_source.py|Dremio Source Code>, the Dremio integration extracts metadata for both physical and virtual datasets, which includes schema and column information, lineage, ownership, and glossary terms. This suggests that tags, which are a form of metadata, could also be ingested.

Additionally, the DataHub documentation on adding tags or glossary terms to a dataset indicates that tags can be attached to datasets and their fields, which supports the idea that tags from Dremio datasets could be ingested and viewed in DataHub.

Would you like more detailed information on how to set this up or any specific part of the process?

I have another question. In this documentation https://blog.datahubproject.io/introducing-datahub-integration-with-dremio-322198bac697, it mentions that version v0.14.1.12 can seamlessly integrate metadata from Dremio into DataHub. However, after configuring the ingestion, I encountered the following error:
ERROR {datahub.entrypoints:218} - Command failed: Failed to find a registered source for type Dremio: 'Did not find a registered class for Dremio'.
Could you please provide some guidance on how to resolve it?
Thank you.

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @ RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><

Hi <@U07FV3UHE4D>, the answer from RunLLM almost resolved my first question.
But the second one still needs to be looked at. Many thanks :slight_smile:

<@U06TM7M40RK> Please provide a recipe for ingesting Dremio metadata into DataHub.

It seems likely that the error you’re encountering is due to a misconfiguration or an issue with the DataHub version you are using. The error message indicates that the system is unable to find a registered source for the type “Dremio,” which suggests that the Dremio source might not be properly set up or recognized in your DataHub instance.

To address this, you might want to check the following:

  1. DataHub Version: Ensure that you are using a version of DataHub that supports Dremio integration. The documentation you referred to mentions version v0.14.1.12, so make sure your DataHub instance is at least this version or newer.

  2. Configuration: Double-check your ingestion configuration to ensure that it is correctly set up for Dremio. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/dremio/dremio.md|Dremio Starter Recipe>, your configuration should look something like this:

    source:
      type: dremio
      config:
        authentication_method: PAT
        password: <your_api_token>
        is_dremio_cloud: True
        dremio_cloud_project_id: <project_id>
        include_query_lineage: True
        source_mappings:
          - platform: s3
            source_name: samples
        schema_pattern:
          allow:
            - "<source_name>.<table_name>"

    sink:
      # Define your sink configuration here
    
  3. Plugin Installation: Ensure that the Dremio plugin is installed and properly registered in your DataHub environment. This might involve updating your DataHub installation or ensuring that the necessary Python packages are installed.

  4. Logs and Debugging: Check the logs for any additional error messages or stack traces that might provide more context on why the Dremio source is not being recognized.

If you need more detailed guidance on any of these steps, please let me know!
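
One detail that may be worth ruling out first: the error message echoes the source type as `Dremio` with a capital D, while the starter recipe spells it `dremio`. Assuming the source lookup is case-sensitive (an assumption, not something confirmed in this thread), a quick check of the parsed recipe could look like the sketch below. `check_dremio_recipe` is a hypothetical helper, not part of DataHub, and the field pairings it checks are taken only from the recipe shown above.

```python
def check_dremio_recipe(recipe):
    """Return a list of warnings for a recipe that has already been parsed into a dict.

    Hypothetical helper for illustration; the checks mirror the starter
    recipe in this thread and are assumptions, not DataHub's own validation.
    """
    warnings = []
    source = recipe.get("source") or {}
    config = source.get("config") or {}
    source_type = source.get("type")

    if source_type is None:
        warnings.append("source.type is missing")
    elif source_type != "dremio":
        if str(source_type).lower() == "dremio":
            # Assumption: the type lookup is case-sensitive.
            warnings.append(
                f"source.type is {source_type!r}; the starter recipe uses lowercase 'dremio'"
            )
        else:
            warnings.append(f"source.type is {source_type!r}, expected 'dremio'")

    # The starter recipe pairs the cloud flag with a project ID.
    if config.get("is_dremio_cloud") and not config.get("dremio_cloud_project_id"):
        warnings.append("is_dremio_cloud is set but dremio_cloud_project_id is empty")

    # With PAT authentication, the recipe puts the token in the password field.
    if config.get("authentication_method") == "PAT" and not config.get("password"):
        warnings.append("authentication_method is PAT but password (the token) is empty")

    return warnings


if __name__ == "__main__":
    example = {
        "source": {
            "type": "Dremio",  # capitalized, as echoed in the error message
            "config": {
                "authentication_method": "PAT",
                "password": "<your_api_token>",
                "is_dremio_cloud": True,
                "dremio_cloud_project_id": "<project_id>",
            },
        }
    }
    for warning in check_dremio_recipe(example):
        print(warning)
```

An empty list means none of these particular checks fired; it does not guarantee that the recipe will run.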

<@U06TM7M40RK> About the plugin installation, what are the necessary Python packages for Dremio ingestion?

To enable Dremio ingestion in DataHub, you need to install the appropriate Python package that includes the Dremio source plugin. According to the <https://github.com/datahub-project/datahub/blob/master/docs/cli.md|DataHub CLI documentation>, you can install the Dremio plugin using the following command:

pip install 'acryl-datahub[dremio]'

This command will install the necessary dependencies for the Dremio source, allowing you to configure and run metadata ingestion from Dremio into DataHub.
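
To confirm that the environment running the `datahub` CLI actually knows about the source, a minimal Python sketch like the one below may help. It reads the installed `acryl-datahub` version and lists the source names registered as Python entry points. The entry-point group name `datahub.ingestion.source.plugins` is an assumption about how the package registers its sources, and both function names are hypothetical. A registered name only shows that the installed package declares the source; it does not prove that the source's dependencies import cleanly.

```python
from importlib.metadata import PackageNotFoundError, entry_points, version

# Assumed entry-point group for DataHub ingestion sources.
SOURCE_PLUGIN_GROUP = "datahub.ingestion.source.plugins"


def installed_version(package="acryl-datahub"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def registered_sources(group=SOURCE_PLUGIN_GROUP):
    """Return the sorted names of entry points registered under the given group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Newer Python versions expose a select() API.
        selected = eps.select(group=group)
    else:
        # Older Python versions return a plain dict keyed by group.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


if __name__ == "__main__":
    print("acryl-datahub version:", installed_version() or "not installed")
    print("dremio source registered:", "dremio" in registered_sources())
```

If `dremio` is missing from the list, the installed CLI package may predate the Dremio source, in which case upgrading `acryl-datahub` in that environment would be the first thing to try. The CLI's `datahub check plugins` command should report similar information.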

Would you like more detailed instructions on setting up the ingestion process or any other aspect of the integration?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Just checking in a final time. Please tag @ RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look at this thread :slight_smile:

Hope this helps! ><