Installing 'pyiceberg[glue]' Python Package in DataHub-Actions Pod on Kubernetes

Original Slack Thread

<@U06TM7M40RK> I need help. I'm trying to set up an Iceberg table in my DataHub recipe and I get this message. How do I install the Python package in my datahub-actions pod?
    'failures': [{'message': 'get-catalog', 'context': ["Failed to get catalog: AWS glue support not installed: pip install 'pyiceberg[glue]'"]}],
    'warnings': [{'title': 'No metadata was produced by the source',
                  'message': 'Please check the source configuration, filters, and permissions.',
                  'context': []}],
    'infos': []}

To resolve the issue where the DataHub actions pod is unable to access the AWS Glue catalog due to missing dependencies, you need to ensure that the pyiceberg[glue] package is installed in the environment where the DataHub actions are running. Here are the steps you can follow:

  1. Verify Dependency Installation in DataHub Actions Container:
    Ensure that the pyiceberg[glue] dependency is installed in the DataHub actions container. You can do this by accessing the container and checking the installed packages.

    docker exec -it <datahub-actions-container-id> /bin/bash
    pip list | grep pyiceberg
    
  2. Install Missing Dependencies:
    If the dependency is not installed, you can manually install it within the container:

    docker exec -it <datahub-actions-container-id> /bin/bash
    pip install 'pyiceberg[glue]'
    
  3. Update Dockerfile:
    To ensure that the dependency is always installed, you can update the Dockerfile used to build the DataHub actions container to include the pyiceberg[glue] installation.

    FROM acryldata/datahub-actions:head
    RUN pip install 'pyiceberg[glue]'
    
  4. Rebuild and Redeploy the Container:
    After updating the Dockerfile, rebuild and redeploy the DataHub actions container.

    docker build -t datahub-actions:custom .
    docker-compose up -d --build
    
  5. Check Virtual Environment:
    Ensure that the virtual environment used by the DataHub actions container is correctly set up and activated. Sometimes, the issue might be due to the virtual environment not being activated properly.

  6. Scheduled Ingestion Configuration:
    Ensure that the scheduled ingestion configuration is correctly set up to use the same environment as the CLI. This might involve setting the correct paths and environment variables.
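
For reference, here is a minimal sketch of what an Iceberg recipe backed by a Glue catalog can look like once the package is installed. The catalog name, region, and GMS address are placeholders, and the properties under the catalog entry are passed through to pyiceberg, so check the pyiceberg Glue documentation for the exact keys your setup needs:

    source:
      type: iceberg
      config:
        env: PROD
        catalog:
          my_glue_catalog:          # placeholder catalog name
            type: glue              # requires pyiceberg[glue] in the pod
            glue.region: us-east-1  # example pyiceberg property; AWS credentials come from the pod's environment
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder GMS address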

For more details, you can refer to the linked Slack discussion where similar issues were addressed.

Would you like more detailed steps on any of these points?

Sources:

  • Slack discussion

<@U06TM7M40RK> But I'm working on Kubernetes; does that mean I need to personalize the DataHub image? Is there no way to install those libraries using the Helm chart?

When deploying DataHub on Kubernetes using Helm, you typically need to customize the Docker image to include additional Python packages like pyiceberg[glue]. That said, there are a few approaches you can consider:

  1. Custom Docker Image:

    • Create a custom Docker image that includes the required Python packages. This is a common approach when you need to ensure specific dependencies are available in your environment.

    • Here’s an example Dockerfile to create a custom image:

      FROM acryldata/datahub-actions:head
      RUN pip install 'pyiceberg[glue]'
      
    • Build and push this image to your container registry:

      docker build -t your-registry/datahub-actions:custom .
      docker push your-registry/datahub-actions:custom
      
    • Update your Helm chart values to use this custom image:

      acryl-datahub-actions:
        image:
          repository: your-registry/datahub-actions
          tag: custom
      
  2. Helm Chart Customization:

    • If you prefer not to create a custom image, you can try the “Extra Pip Libraries” setting in the DataHub UI (see the next point), although this might not work for all scenarios, especially in Kubernetes environments.
    • The Helm chart itself does not directly support installing additional Python packages at runtime, which is why creating a custom image is often the recommended approach.
  3. Advanced Settings in UI:

    • If you are using the DataHub UI for ingestion, you can specify additional pip libraries in the “Advanced Settings” section. However, this is more applicable to UI-based ingestion and might not apply to Kubernetes deployments.
  4. Using Init Containers:

    • Another approach is to use an init container to install the necessary packages before the main container starts; a rough sketch of this pattern follows below. This can be more complex to set up and manage.
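
For illustration, here is a rough sketch of the init-container pattern, assuming the package is installed into a shared emptyDir volume and exposed to the main container via PYTHONPATH. All names are placeholders, and the Python version of the helper image must match the one in the actions image for compiled wheels to work:

    apiVersion: v1
    kind: Pod
    metadata:
      name: datahub-actions
    spec:
      volumes:
        - name: extra-packages
          emptyDir: {}
      initContainers:
        - name: install-pyiceberg
          image: python:3.10-slim   # placeholder; match the actions image's Python version
          command: ["pip", "install", "--target=/extra-packages", "pyiceberg[glue]"]
          volumeMounts:
            - name: extra-packages
              mountPath: /extra-packages
      containers:
        - name: datahub-actions
          image: acryldata/datahub-actions:head
          env:
            - name: PYTHONPATH      # make the extra packages importable
              value: /extra-packages
          volumeMounts:
            - name: extra-packages
              mountPath: /extra-packages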

If you need further assistance with any of these approaches or have specific questions about customizing your deployment, feel free to ask!
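
On Kubernetes, the docker exec checks from earlier translate to kubectl exec. Something like the following verifies what is installed in the running actions pod (the deployment name depends on your Helm release and is a placeholder here); note that anything installed by hand this way is lost when the pod restarts:

    kubectl exec deploy/datahub-acryl-datahub-actions -- pip list | grep pyiceberg
    kubectl exec deploy/datahub-acryl-datahub-actions -- pip install 'pyiceberg[glue]'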

Sources:

  • Slack discussion

<@U086ZA7BKEH> The way we have done this is to create a custom ingestion image that contains all the ingestion-related dependencies. Using this image, we run our ingestion tasks as Jobs in our DataHub Kubernetes namespace.
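
As a rough sketch of what one of those ingestion Jobs can look like (all names are placeholders, and the recipe is mounted from a ConfigMap):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: iceberg-ingestion
      namespace: datahub
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ingest
              image: your-registry/datahub-ingestion:custom   # custom image with pyiceberg[glue] baked in
              command: ["datahub", "ingest", "-c", "/recipes/iceberg.yaml"]
              volumeMounts:
                - name: recipes
                  mountPath: /recipes
          volumes:
            - name: recipes
              configMap:
                name: ingestion-recipes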

Maybe this will help you :slightly_smiling_face:

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @DataHub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><

Thanks <@U049WUH4155>, I'll go this way. I guessed that there was a way to pass a requirements ConfigMap and get it installed on pod startup, like the Bitnami Airflow chart does.
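
To illustrate the idea: the chart side of it might look roughly like this, assuming the actions subchart exposes extraVolumes/extraVolumeMounts (the keys are hypothetical); the missing piece is startup logic in the image that actually runs pip install -r against the mounted file:

    acryl-datahub-actions:
      extraVolumes:
        - name: pip-requirements
          configMap:
            name: datahub-actions-requirements   # ConfigMap containing a requirements.txt
      extraVolumeMounts:
        - name: pip-requirements
          mountPath: /requirements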

You are welcome :slightly_smiling_face:

That would be neat!
I'm not aware of such functionality being available for DataHub's images.
It would be a great contribution, then, if you are interested :slightly_smiling_face:

Is there an issue for this?
I can work on it in my free time.