Troubleshooting DataHub Ingestion for MLflow Data

Original Slack Thread

<@U06TM7M40RK> what is a process to ingest data from mlflow?

To ingest data from MLflow into DataHub, you can use the MLflow ingestion source provided by DataHub. Here is a summarized process based on the available documentation:

  1. Install the Plugin: First, you need to install the MLflow plugin for DataHub. You can do this using pip:

    pip install 'acryl-datahub[mlflow]'
    
  2. Create an Ingestion Recipe: Define a YAML file that specifies the MLflow source and the DataHub sink. Here is a basic example of what the configuration might look like:

    source:
      type: mlflow
      config:
        tracking_uri: "your_mlflow_tracking_uri"
        registry_uri: "your_mlflow_registry_uri"  # Optional, defaults to tracking_uri
        model_name_separator: "_"  # Optional, default is "_"
        env: "PROD"  # Optional, default is "PROD"
    
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"
    
  3. Run the Ingestion: Use the DataHub CLI to run the ingestion process with your configuration file:

    datahub ingest -c path/to/your/ingestion_recipe.yml
    

This process will extract metadata from MLflow, such as registered models and model versions, and ingest it into DataHub.
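The same recipe can also be run from Python instead of the CLI. The following is a minimal sketch, assuming `acryl-datahub[mlflow]` is installed in the active interpreter. `build_mlflow_recipe` and `run_recipe` are hypothetical helper names introduced here for illustration, and the URIs are placeholders.

```python
from typing import Any, Dict


def build_mlflow_recipe(tracking_uri: str, server: str) -> Dict[str, Any]:
    """Return the recipe from step 2 as a plain dict (hypothetical helper)."""
    return {
        "source": {
            "type": "mlflow",
            "config": {"tracking_uri": tracking_uri, "env": "PROD"},
        },
        "sink": {"type": "datahub-rest", "config": {"server": server}},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run a recipe in-process; requires acryl-datahub[mlflow] to be installed."""
    # Imported lazily so build_mlflow_recipe works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()  # raise if the run reported failures


# Usage (placeholder URIs):
# run_recipe(build_mlflow_recipe("your_mlflow_tracking_uri", "http://localhost:8080"))
```

Running the pipeline this way uses whichever interpreter executes the script, which makes it easier to see which environment is actually in play.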

Would you like more detailed steps or further assistance on any specific part of this process?

<@U06TM7M40RK> I am getting this error while executing datahub ingest for mlflow

(venv) usawant@WM-GQXPFPQ695 mlflow % datahub ingest -c mlflow-recipe.yaml
[2024-11-26 15:17:33,042] INFO {datahub.cli.ingest_cli:145} - DataHub CLI version: 0.10.5.2
[2024-11-26 15:17:34,136] INFO {datahub.ingestion.run.pipeline:213} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://sl73dpdlcapd032.visa.com:8080 with token: eyJh**********u3v4
[2024-11-26 15:17:34,937] ERROR {datahub.entrypoints:199} - Command failed: Failed to find a registered source for type mlflow: 'Did not find a registered class for mlflow'
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 120, in _add_init_error_context
    yield
  File "/opt/homebrew/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 223, in __init__
    source_class = source_registry.get(source_type)
  File "/opt/homebrew/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 181, in get
    raise KeyError(f"Did not find a registered class for {key}")
KeyError: 'Did not find a registered class for mlflow'

The above exception was the direct cause of the following exception:

The error message you’re encountering indicates that the DataHub ingestion pipeline is unable to recognize the MLflow source type. This typically happens when the necessary plugin for the MLflow source is not installed or not properly configured. Here are some steps to troubleshoot and resolve this issue:

  1. Ensure Proper Plugin Installation:
    Make sure you have installed the MLflow plugin for DataHub. You can do this by running the following command:

    pip install 'acryl-datahub[mlflow]'
    

    This command installs the necessary dependencies for the MLflow source.

  2. Verify Plugin Installation:
    After installing the plugin, verify that it is correctly installed and recognized by DataHub. You can check the installed plugins using:

    datahub check plugins --verbose
    

    Ensure that the MLflow plugin is listed among the available sources.

  3. Check Your Virtual Environment:
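    Make sure you are operating within the virtual environment where the plugin is installed. Your prompt shows (venv), but the traceback paths point to /opt/homebrew/lib/python3.10/site-packages, so the datahub command that ran appears to come from the Homebrew Python rather than from the virtual environment. Check which executable is being picked up:

    which datahub
    

    This should point to a location inside your virtual environment. If it does not, install the package with the virtual environment's interpreter (python -m pip install ...) and run the CLI as python -m datahub so that both use the same environment.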

  4. Update DataHub CLI:
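    Ensure that you are using a version of the DataHub CLI that includes the MLflow source. The log shows version 0.10.5.2, which may predate it. If the installed release does not define an mlflow extra, pip only prints a warning that the package does not provide that extra, and because acryl-datahub is already installed, pip will not upgrade it on its own. Upgrade explicitly and request the extra at the same time:

    pip install --upgrade 'acryl-datahub[mlflow]'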
    
  5. Review Your Ingestion Recipe:
    Double-check your mlflow-recipe.yaml file to ensure that the source type is correctly specified as mlflow and that all required configurations are provided.

If you have followed these steps and are still encountering issues, please let me know, and I can provide more detailed troubleshooting steps.
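The environment check in step 3 can also be done from Python. The sketch below uses only the standard library; `cli_matches_interpreter` and `report` are hypothetical helper names. It compares the `datahub` executable found on PATH with the prefix of the interpreter running the script.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def cli_matches_interpreter(cli_path: Optional[str], prefix: str) -> bool:
    """Return True if the CLI script lives under the given environment prefix."""
    if cli_path is None:
        return False
    cli = Path(cli_path).resolve()
    env = Path(prefix).resolve()
    return env in cli.parents


def report() -> None:
    """Print where the interpreter and the datahub executable come from."""
    cli_path = shutil.which("datahub")
    print("python prefix:  ", sys.prefix)
    print("datahub on PATH:", cli_path)
    if not cli_matches_interpreter(cli_path, sys.prefix):
        print("WARNING: datahub on PATH does not belong to this environment")


report()
```

If the warning appears while the virtual environment is active, the shell is resolving `datahub` from another installation, which would explain why a plugin installed in the venv is not found.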

<@U06TM7M40RK> it seems mlflow plugin is not installed even after trying

If the MLflow plugin is not recognized even after attempting to install it, there are a few steps you can take to troubleshoot and resolve the issue:

  1. Verify Installation:
    Ensure that the installation command is executed in the correct Python environment. You can check the installed packages in your environment by running:

    pip list | grep acryl-datahub
    

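    This shows the installed acryl-datahub version. Note that pip list shows only package names and versions, not the extras a package was installed with. To confirm that the extra took effect, also check that its dependencies are present, for example with pip list | grep -i mlflow.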

  2. Check for Errors During Installation:
    Sometimes, installation errors can be missed. Re-run the installation command and carefully check for any error messages:

    pip install 'acryl-datahub[mlflow]'
    
  3. Upgrade pip, setuptools, and wheel:
    Ensure that your pip, setuptools, and wheel are up to date, as outdated versions can cause installation issues:

    python3 -m pip install --upgrade pip setuptools wheel
    
  4. Reinstall the Plugin:
    Try uninstalling and then reinstalling the plugin to ensure a clean installation:

    pip uninstall acryl-datahub
    pip install 'acryl-datahub[mlflow]'
    
  5. Check Python Environment:
    Make sure you are in the correct virtual environment where you intend to use DataHub. Activate your virtual environment and then run the installation command.

  6. Verify Plugin Recognition:
    After installation, verify that the plugin is recognized by DataHub:

    datahub check plugins --verbose
    

    Ensure that the MLflow source is listed.

  7. Check DataHub CLI Version:
    Ensure that you are using a compatible version of the DataHub CLI. If necessary, upgrade to the latest version:

    pip install --upgrade acryl-datahub
    

If these steps do not resolve the issue, please let me know, and I can provide further assistance.
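Because `pip list` does not show extras, it can help to ask the package metadata directly. This is a minimal sketch using the standard-library `importlib.metadata`. `package_extras` is a hypothetical helper name. It reports the installed `acryl-datahub` version and whether that release declares an `mlflow` extra at all.

```python
from importlib import metadata
from typing import List, Optional, Tuple


def package_extras(dist_name: str) -> Optional[Tuple[str, List[str]]]:
    """Return (version, declared extras), or None if the package is not installed."""
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return None
    # "Provides-Extra" lists every extra the installed release defines.
    extras = dist.metadata.get_all("Provides-Extra") or []
    return dist.version, sorted(extras)


info = package_extras("acryl-datahub")
if info is None:
    print("acryl-datahub is not installed in this interpreter")
else:
    version, extras = info
    print("acryl-datahub version:", version)
    print("declares mlflow extra:", "mlflow" in extras)
```

If the script reports that the extra is not declared, reinstalling the same release will not help; a newer release is needed.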

<@U06TM7M40RK> nope even these steps do not seem to resolve my issue

I’m not able to find any data sources to answer your question. You might want to provide details about your operating system and Python environment setup. If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> how to manually register a plugin with datahub cli?

To manually register a plugin with the DataHub CLI, you need to ensure that the plugin is correctly installed and recognized by the DataHub ingestion framework. Here are the steps you can follow:

  1. Ensure Plugin Installation: First, make sure that the plugin is installed in your Python environment. You can do this by running the appropriate pip install command for your plugin. For example, if you are trying to register a custom source, ensure that the source package is installed.

  2. Verify Plugin Recognition: After installation, verify that the plugin is recognized by DataHub. You can check the available plugins using the following command:

    datahub check plugins --verbose
    

    This command will list all the registered plugins and their statuses.

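  3. Register the Plugin in Code: If the plugin is not automatically recognized, you can register it manually by adding it to the source registry. Note that this registration only lasts for the Python process that performs it, so the pipeline has to be run from that same process (for example with Pipeline.create(...).run()); a separate datahub CLI invocation will not see it. Here is an example of how you can register a custom source: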

    from datahub.ingestion.source.source_registry import source_registry
    from your_custom_source_module import YourCustomSource
    
    # Register the custom source
    source_registry.register("your-custom-source", YourCustomSource)
    

    Replace your_custom_source_module and YourCustomSource with the actual module and class names of your custom source.

  4. Update Ingestion Recipe: Ensure that your ingestion recipe YAML file correctly references the custom source type. For example:

    source:
      type: your-custom-source
      config:
        # Add your custom source configuration here
    
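  5. Run the Ingestion: If you registered the source in code, run the pipeline from that same Python process. The CLI command below only picks up sources that are installed as plugins, or that the recipe references by fully qualified class name (for example type: your_custom_source_module.YourCustomSource):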

    datahub ingest -c path/to/your/ingestion_recipe.yml
    

If you have followed these steps and are still encountering issues, please let me know, and I can provide more detailed troubleshooting steps.

<@U06TM7M40RK> I tried source_registry.register to register mlflow as a new source, but the DataHub CLI still doesn’t list it among the installed plugins

If you’ve attempted to manually register the MLflow source using source_registry.register and it still doesn’t appear in the list of installed plugins, there might be a few areas to check:

  1. Correct Module and Class: Ensure that you are importing the correct module and class for the MLflow source. The class should be defined in the DataHub ingestion framework. For MLflow, it should be something like:

    from datahub.ingestion.source.mlflow import MLflowSource
    

    Then register it:

    from datahub.ingestion.source.source_registry import source_registry
    
    source_registry.register("mlflow", MLflowSource)
    
  2. Python Environment: Make sure you are working within the correct Python environment where the DataHub CLI and the MLflow plugin are installed. You can verify this by checking the path of the datahub command:

    which datahub
    
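  3. Register in the Same Process: source_registry.register only affects the Python process in which it is called. Each datahub CLI invocation starts a new process and rebuilds the registry from the installed packages, so a registration made in a separate script will not appear in datahub check plugins, and restarting the CLI does not change this.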

  4. Check for Errors: Look for any error messages during the registration process or when running the CLI. These might provide clues as to why the source isn’t being recognized.

  5. Verify Plugin Installation: Double-check that the MLflow plugin is installed correctly. You can list installed packages to confirm:

    pip list | grep acryl-datahub
    
  6. Update DataHub CLI: Ensure that you are using a compatible version of the DataHub CLI. If necessary, upgrade to the latest version:

    pip install --upgrade acryl-datahub
    

If these steps do not resolve the issue, it might be helpful to reach out to the DataHub community or support for further assistance, as there could be a more specific issue with the plugin or environment setup.
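To see what the CLI will discover in a given interpreter, you can list the entry points that installed packages advertise. This is a sketch under the assumption that DataHub loads its sources from the `datahub.ingestion.source.plugins` entry-point group. `registered_names` is a hypothetical helper name.

```python
import sys
from importlib import metadata
from typing import List

# Assumed entry-point group that DataHub's source registry reads.
SOURCE_GROUP = "datahub.ingestion.source.plugins"


def registered_names(group: str) -> List[str]:
    """Return the names advertised under an entry-point group in this interpreter."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):  # Python 3.10 and later
        selected = eps.select(group=group)
    else:  # Python 3.8 / 3.9 return a dict of lists
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


names = registered_names(SOURCE_GROUP)
print("interpreter:", sys.executable)
print("sources advertised:", len(names))
print("mlflow advertised:", "mlflow" in names)
```

Run it with the same interpreter that backs the `datahub` command. If `mlflow` is not advertised there, the installed release does not ship the source, and calling `source_registry.register` from a separate script will not change what the CLI sees.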

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps!