<@U06TM7M40RK> what is a process to ingest data from mlflow?
To ingest data from MLflow into DataHub, you can use the MLflow ingestion source provided by DataHub. Here is a summarized process based on the available documentation:
- Install the Plugin: First, install the MLflow plugin for DataHub using pip:
    pip install 'acryl-datahub[mlflow]'
- Create an Ingestion Recipe: Define a YAML file that specifies the MLflow source and the DataHub sink. Here is a basic example of what the configuration might look like:
    source:
      type: mlflow
      config:
        tracking_uri: "your_mlflow_tracking_uri"
        registry_uri: "your_mlflow_registry_uri"  # Optional, defaults to tracking_uri
        model_name_separator: "_"  # Optional, default is "_"
        env: "PROD"  # Optional, default is "PROD"
    sink:
      type: datahub-rest
      config:
        server: "http://localhost:8080"
- Run the Ingestion: Use the DataHub CLI to run the ingestion process with your configuration file:
    datahub ingest -c path/to/your/ingestion_recipe.yml
This process will extract metadata from MLflow, such as registered models and model versions, and ingest it into DataHub.
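The same recipe can also be run from Python instead of the CLI. This is a minimal sketch, assuming DataHub's programmatic pipeline API (`Pipeline.create`, `run`, `raise_from_status`) and an installed DataHub version that ships the MLflow source. The helper names `build_mlflow_recipe` and `run_recipe` and both URLs are placeholders for illustration, not part of DataHub.

```python
from typing import Any, Dict, Optional


def build_mlflow_recipe(
    tracking_uri: str,
    datahub_server: str = "http://localhost:8080",
    registry_uri: Optional[str] = None,
    env: str = "PROD",
) -> Dict[str, Any]:
    """Build the recipe as a plain dict (hypothetical helper)."""
    source_config: Dict[str, Any] = {"tracking_uri": tracking_uri, "env": env}
    if registry_uri is not None:
        # Only set when it differs from the tracking URI.
        source_config["registry_uri"] = registry_uri
    return {
        "source": {"type": "mlflow", "config": source_config},
        "sink": {"type": "datahub-rest", "config": {"server": datahub_server}},
    }


def run_recipe(recipe: Dict[str, Any]) -> None:
    """Run a recipe in-process; requires 'acryl-datahub[mlflow]' to be installed."""
    # Imported lazily so the recipe builder also works without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()


# Placeholder tracking URI; replace with your MLflow server.
recipe = build_mlflow_recipe("http://localhost:5000")
print(recipe["source"]["type"])
```

Calling `run_recipe(recipe)` would then perform the ingestion in the current Python process, which can make environment problems easier to spot than going through the shell.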
Would you like more detailed steps or further assistance on any specific part of this process?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/mlflow|MLflow Ingestion Source Documentation (0.14.0)>
- <https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/mlflow|MLflow Ingestion Source Documentation (0.13.1)>
<@U06TM7M40RK> I am getting this error while executing datahub ingest for mlflow
(venv) usawant@WM-GQXPFPQ695 mlflow % datahub ingest -c mlflow-recipe.yaml
[2024-11-26 15:17:33,042] INFO {datahub.cli.ingest_cli:145} - DataHub CLI version: 0.10.5.2
[2024-11-26 15:17:34,136] INFO {datahub.ingestion.run.pipeline:213} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://sl73dpdlcapd032.visa.com:8080 with token: eyJh**********u3v4
[2024-11-26 15:17:34,937] ERROR {datahub.entrypoints:199} - Command failed: Failed to find a registered source for type mlflow: 'Did not find a registered class for mlflow'
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 120, in _add_init_error_context
    yield
  File "/opt/homebrew/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 223, in __init__
    source_class = source_registry.get(source_type)
  File "/opt/homebrew/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 181, in get
    raise KeyError(f"Did not find a registered class for {key}")
KeyError: 'Did not find a registered class for mlflow'
The above exception was the direct cause of the following exception:
The error message you’re encountering indicates that the DataHub ingestion pipeline is unable to recognize the MLflow source type. This typically happens when the necessary plugin for the MLflow source is not installed or not properly configured. Here are some steps to troubleshoot and resolve this issue:
- Ensure Proper Plugin Installation: Make sure you have installed the MLflow plugin for DataHub by running:
    pip install 'acryl-datahub[mlflow]'
  This installs the dependencies the MLflow source needs.
- Verify Plugin Installation: After installing the plugin, verify that DataHub recognizes it. You can check the installed plugins using:
    datahub check plugins --verbose
  Ensure that mlflow is listed among the available sources.
- Check Your Virtual Environment: Make sure you are operating within the virtual environment where the plugin is installed. You can verify this by checking the path of the datahub command:
    which datahub
  This should point to a location within your virtual environment. In your traceback, the DataHub files are under /opt/homebrew/lib/python3.10/site-packages, which suggests the CLI being run is a Homebrew-level install rather than the one in your active venv.
- Update DataHub CLI: Ensure that you are using a compatible version of the DataHub CLI. Your log shows version 0.10.5.2, which might not support the MLflow source (the documentation cited in this thread is for 0.13.1 and 0.14.0). Consider upgrading to a newer version:
    pip install --upgrade acryl-datahub
- Review Your Ingestion Recipe: Double-check your mlflow-recipe.yaml file to ensure that the source type is specified as mlflow and that all required configuration is provided.
If you have followed these steps and are still encountering issues, please let me know, and I can provide more detailed troubleshooting steps.
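The virtual-environment check above can also be done from Python. This is a minimal sketch that compares the `datahub` executable found on PATH with the interpreter's active environment prefix; the helper name `cli_in_environment` is hypothetical. Run it with the same Python interpreter you used for `pip install`.

```python
import shutil
import sys
from pathlib import Path
from typing import Optional


def cli_in_environment(cli_path: Optional[str], env_prefix: str) -> bool:
    """Return True if the CLI executable lives under the given environment prefix."""
    if cli_path is None:
        return False
    # Compare the paths as given, without resolving symlinks.
    return Path(env_prefix) in Path(cli_path).parents


cli_path = shutil.which("datahub")  # the executable the shell would run
print(f"datahub on PATH: {cli_path}")
print(f"active Python prefix: {sys.prefix}")
print(f"CLI belongs to this environment: {cli_in_environment(cli_path, sys.prefix)}")
```

If the last line prints False, the shell is picking up a `datahub` installed somewhere else, and packages installed into the active environment will not affect it.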
Sources:
- Slack discussion
<@U06TM7M40RK> it seems mlflow plugin is not installed even after trying
If the MLflow plugin is not recognized even after attempting to install it, there are a few steps you can take to troubleshoot and resolve the issue:
- Verify Installation: Ensure that the installation command is executed in the correct Python environment. You can check the installed packages in your environment by running:
    pip list | grep acryl-datahub
  This should list acryl-datahub with its version. Note that pip does not show extras such as [mlflow] in this output; to confirm that the extra's dependencies were installed, also check for the MLflow client package, for example:
    pip list | grep -i mlflow
- Check for Errors During Installation: Installation errors are easy to miss. Re-run the installation command and check carefully for any error messages:
    pip install 'acryl-datahub[mlflow]'
- Upgrade pip, setuptools, and wheel: Ensure that your pip, setuptools, and wheel are up to date, as outdated versions can cause installation issues:
    python3 -m pip install --upgrade pip setuptools wheel
- Reinstall the Plugin: Try uninstalling and then reinstalling the plugin to ensure a clean installation:
    pip uninstall acryl-datahub
    pip install 'acryl-datahub[mlflow]'
- Check Python Environment: Make sure you are in the virtual environment where you intend to use DataHub. Activate your virtual environment and then run the installation command.
- Verify Plugin Recognition: After installation, verify that the plugin is recognized by DataHub:
    datahub check plugins --verbose
  Ensure that the MLflow source is listed.
- Check DataHub CLI Version: Ensure that you are using a compatible version of the DataHub CLI. If necessary, upgrade to the latest version:
    pip install --upgrade acryl-datahub
If these steps do not resolve the issue, please let me know, and I can provide further assistance.
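Because `pip list` does not report extras, one way to check both halves of the install is to query package metadata directly. This is a minimal sketch using only the standard library; the helper names are hypothetical, and it assumes the `[mlflow]` extra makes the `mlflow` module importable. Run it with the interpreter from the environment you installed into.

```python
import importlib.util
from importlib import metadata
from typing import Dict, Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


def module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def describe_install() -> Dict[str, object]:
    """Summarize what this interpreter can see (hypothetical helper)."""
    return {
        "acryl-datahub version": installed_version("acryl-datahub"),
        "mlflow importable": module_available("mlflow"),
    }


print(describe_install())
```

A version of None means this interpreter does not see acryl-datahub at all, and a version that differs from what the CLI log reports suggests the CLI runs from a different environment.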
Sources:
- https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/mlflow
- https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/mlflow
<@U06TM7M40RK> nope even these steps do not seem to resolve my issue
I’m not able to find any data sources to answer your question. You might want to provide details about your operating system and Python environment setup. If this doesn’t help, please reach out to the DataHub support team for further assistance.
<@U06TM7M40RK> how to manually register a plugin with datahub cli?
To manually register a plugin with the DataHub CLI, you need to ensure that the plugin is correctly installed and recognized by the DataHub ingestion framework. Here are the steps you can follow:
- Ensure Plugin Installation: First, make sure that the plugin is installed in your Python environment by running the appropriate pip install command for your plugin. For example, if you are trying to register a custom source, ensure that the source package is installed.
- Verify Plugin Recognition: After installation, verify that the plugin is recognized by DataHub. You can check the available plugins using the following command:
    datahub check plugins --verbose
  This command lists all the registered plugins and their statuses.
- Register the Plugin in Code: If the plugin is not automatically recognized, you can register it manually in your code by adding it to the source registry. Note that this registration exists only in the Python process that performs it, so it helps when you run the ingestion pipeline from that same script, not when you later invoke the datahub CLI as a separate command. Here is an example of how you can register a custom source:
    from datahub.ingestion.source.source_registry import source_registry
    from your_custom_source_module import YourCustomSource

    # Register the custom source
    source_registry.register("your-custom-source", YourCustomSource)
  Replace your_custom_source_module and YourCustomSource with the actual module and class names of your custom source.
- Update Ingestion Recipe: Ensure that your ingestion recipe YAML file correctly references the custom source type. For example:
    source:
      type: your-custom-source
      config:
        # Add your custom source configuration here
  When running through the CLI, the custom-source documentation linked below describes referencing the source by its fully qualified class name in the type field instead.
- Run the Ingestion: Execute the ingestion command with your updated recipe:
    datahub ingest -c path/to/your/ingestion_recipe.yml
If you have followed these steps and are still encountering issues, please let me know, and I can provide more detailed troubleshooting steps.
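To see which sources an installed package actually advertises, you can list its entry points directly. This is a minimal sketch using the standard library. It assumes acryl-datahub declares its sources under the entry-point group `datahub.ingestion.source.plugins`, which you should confirm against your installed version; the helper name `registered_entry_points` is hypothetical.

```python
from importlib import metadata
from typing import List

# Assumed entry-point group for DataHub ingestion sources.
SOURCE_PLUGIN_GROUP = "datahub.ingestion.source.plugins"


def registered_entry_points(group: str) -> List[str]:
    """Return the sorted names of all entry points declared in a group."""
    eps = metadata.entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later expose a select() method.
        selected = eps.select(group=group)
    else:
        # Python 3.8 and 3.9 return a dict keyed by group name.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


names = registered_entry_points(SOURCE_PLUGIN_GROUP)
print(f"{len(names)} source plugins advertised; mlflow present: {'mlflow' in names}")
```

If mlflow is absent from this list, the installed release most likely does not ship the source, and upgrading the package is a more promising route than manual registration.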
Sources:
- <https://datahubproject.io/docs/0.14.0/how/add-custom-ingestion-source|Using a Custom Ingestion Source (0.14.0)>
- <https://datahubproject.io/docs/0.13.1/how/add-custom-ingestion-source|Using a Custom Ingestion Source (0.13.1)>
<@U06TM7M40RK> I tried source_registry.register to register mlflow as a new source, but the datahub CLI still doesn’t show it among the installed plugins
If you’ve attempted to manually register the MLflow source using source_registry.register and it still doesn’t appear in the list of installed plugins, there are a few areas to check:
- Correct Module and Class: Ensure that you are importing the correct module and class for the MLflow source. The class should be defined in the DataHub ingestion framework. For MLflow, it should be something like:
    from datahub.ingestion.source.mlflow import MLflowSource
  If this import itself fails with ModuleNotFoundError, the installed acryl-datahub release likely does not include the MLflow source, and no registration call can add it. Otherwise, register it:
    from datahub.ingestion.source.source_registry import source_registry
    source_registry.register("mlflow", MLflowSource)
- Python Environment: Make sure you are working within the Python environment where the DataHub CLI and the MLflow plugin are installed. You can verify this by checking the path of the datahub command:
    which datahub
- Registration Scope: Changes to the source registry are held in memory and apply only to the Python process that made them. Each datahub CLI invocation starts a fresh process, so it will not see a registration made in a separate script; the CLI's plugin list reflects what the installed package provides.
- Check for Errors: Look for any error messages during the registration process or when running the CLI. These might provide clues as to why the source isn’t being recognized.
- Verify Plugin Installation: Double-check that the MLflow plugin is installed correctly. You can list installed packages to confirm:
    pip list | grep acryl-datahub
- Update DataHub CLI: Ensure that you are using a compatible version of the DataHub CLI. If necessary, upgrade to the latest version:
    pip install --upgrade acryl-datahub
If these steps do not resolve the issue, it might be helpful to reach out to the DataHub community or support for further assistance, as there could be a more specific issue with the plugin or environment setup.
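Before attempting any registration, it can help to confirm that the source class is importable at all. This is a minimal sketch; the module path `datahub.ingestion.source.mlflow` and class name `MLflowSource` are taken from the reply above and should be treated as assumptions for your installed version, and the helper name `probe_class` is hypothetical.

```python
import importlib
from typing import Tuple


def probe_class(module_name: str, class_name: str) -> Tuple[bool, str]:
    """Try to import a module and look up a class, reporting why it failed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:  # also covers ModuleNotFoundError
        return False, f"cannot import {module_name}: {exc}"
    if not hasattr(module, class_name):
        return False, f"{module_name} has no attribute {class_name}"
    return True, f"{module_name}.{class_name} is importable"


ok, detail = probe_class("datahub.ingestion.source.mlflow", "MLflowSource")
print(ok, detail)
```

An import failure that names the DataHub module itself points at the installed release, while one that names `mlflow` points at the missing extra dependency.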
Sources:
- https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/mlflow
- https://datahubproject.io/docs/0.13.1/generated/ingestion/sources/mlflow
Note: I wasn’t able to find highly relevant data sources, but above are a few potentially relevant links.
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><