Installing 'pyiceberg[glue]' Python Package in DataHub-Actions Pod on Kubernetes

Original Slack Thread

<@U06TM7M40RK> I need help. I'm trying to set up an Iceberg table in my DataHub recipe and I get this message. How do I install the Python package in my datahub-actions pod?
    'failures': [{'message': 'get-catalog', 'context': ["Failed to get catalog: AWS glue support not installed: pip install 'pyiceberg[glue]'"]}],
    'warnings': [{'title': 'No metadata was produced by the source',
                  'message': 'Please check the source configuration, filters, and permissions.',
                  'context': []}],
    'infos': []}

To resolve the issue where the DataHub actions pod is unable to access the AWS Glue catalog due to missing dependencies, you need to ensure that the pyiceberg[glue] package is installed in the environment where the DataHub actions are running. Here are the steps you can follow:

  1. Verify Dependency Installation in DataHub Actions Container:
    Ensure that the pyiceberg[glue] dependency is installed in the DataHub actions container. You can do this by accessing the container and checking the installed packages.

    docker exec -it <datahub-actions-container-id> /bin/bash
    pip list | grep pyiceberg
    
  2. Install Missing Dependencies:
    If the dependency is not installed, you can manually install it within the container:

    docker exec -it <datahub-actions-container-id> /bin/bash
    pip install 'pyiceberg[glue]'
    
  3. Update Dockerfile:
    To ensure that the dependency is always installed, you can update the Dockerfile used to build the DataHub actions container to include the pyiceberg[glue] installation.

    FROM acryldata/datahub-actions:head
    RUN pip install 'pyiceberg[glue]'
    
  4. Rebuild and Redeploy the Container:
    After updating the Dockerfile, rebuild and redeploy the DataHub actions container.

    docker build -t datahub-actions:custom .
    docker-compose up -d --build
    
  5. Check Virtual Environment:
    Ensure that the virtual environment used by the DataHub actions container is correctly set up and activated. Sometimes, the issue might be due to the virtual environment not being activated properly.

  6. Scheduled Ingestion Configuration:
    Ensure that the scheduled ingestion configuration is correctly set up to use the same environment as the CLI. This might involve setting the correct paths and environment variables.
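
For reference, here is a minimal sketch of what an Iceberg recipe backed by a Glue catalog can look like once the package is installed. The catalog name, region, and GMS address are placeholders, and the properties under the catalog entry are passed through to pyiceberg, so check the pyiceberg Glue documentation for the exact keys your setup needs:

    source:
      type: iceberg
      config:
        env: PROD
        catalog:
          my_glue_catalog:          # placeholder catalog name
            type: glue              # requires pyiceberg[glue] in the pod
            glue.region: us-east-1  # example pyiceberg property; AWS credentials come from the pod's environment
    sink:
      type: datahub-rest
      config:
        server: http://datahub-gms:8080   # placeholder GMS address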

For more details, you can refer to the linked Slack discussion where similar issues were addressed.

Would you like more detailed steps on any of these points?

Sources:

  • Slack discussion

<@U06TM7M40RK> But I'm working on Kubernetes; does that mean I need to personalize the DataHub image? Is there no way to install those libraries using the Helm chart?

When deploying DataHub on Kubernetes using Helm, you typically need to customize the Docker image to include additional Python packages like pyiceberg[glue]. That said, there are a few approaches you can consider:

  1. Custom Docker Image:

    • Create a custom Docker image that includes the required Python packages. This is a common approach when you need to ensure specific dependencies are available in your environment.

    • Here’s an example Dockerfile to create a custom image:

      FROM acryldata/datahub-actions:head
      RUN pip install 'pyiceberg[glue]'
      
    • Build and push this image to your container registry:

      docker build -t your-registry/datahub-actions:custom .
      docker push your-registry/datahub-actions:custom
      
    • Update your Helm chart values to use this custom image:

      acryl-datahub-actions:
        image:
          repository: your-registry/datahub-actions
          tag: custom
      
  2. Helm Chart Customization:

    • If you prefer not to create a custom image, you can try the “Extra Pip Libraries” setting in the DataHub UI (see the next point), although this might not work for all scenarios, especially in Kubernetes environments.
    • The Helm chart itself does not directly support installing additional Python packages at runtime, which is why creating a custom image is often the recommended approach.
  3. Advanced Settings in UI:

    • If you are using the DataHub UI for ingestion, you can specify additional pip libraries in the “Advanced Settings” section. However, this is more applicable to UI-based ingestion and might not apply to Kubernetes deployments.
  4. Using Init Containers:

    • Another approach is to use an init container to install the necessary packages before the main container starts; a rough sketch of this pattern follows below. This can be more complex to set up and manage.
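
For illustration, here is a rough sketch of the init-container pattern, assuming the package is installed into a shared emptyDir volume and exposed to the main container via PYTHONPATH. All names are placeholders, and the Python version of the helper image must match the one in the actions image for compiled wheels to work:

    apiVersion: v1
    kind: Pod
    metadata:
      name: datahub-actions
    spec:
      volumes:
        - name: extra-packages
          emptyDir: {}
      initContainers:
        - name: install-pyiceberg
          image: python:3.10-slim   # placeholder; match the actions image's Python version
          command: ["pip", "install", "--target=/extra-packages", "pyiceberg[glue]"]
          volumeMounts:
            - name: extra-packages
              mountPath: /extra-packages
      containers:
        - name: datahub-actions
          image: acryldata/datahub-actions:head
          env:
            - name: PYTHONPATH      # make the extra packages importable
              value: /extra-packages
          volumeMounts:
            - name: extra-packages
              mountPath: /extra-packages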

If you need further assistance with any of these approaches or have specific questions about customizing your deployment, feel free to ask!
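
On Kubernetes, the docker exec checks from earlier translate to kubectl exec. Something like the following verifies what is installed in the running actions pod (the deployment name depends on your Helm release and is a placeholder here); note that anything installed by hand this way is lost when the pod restarts:

    kubectl exec deploy/datahub-acryl-datahub-actions -- pip list | grep pyiceberg
    kubectl exec deploy/datahub-acryl-datahub-actions -- pip install 'pyiceberg[glue]'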

Sources:

  • Slack discussion

<@U086ZA7BKEH> The way we have done this is to create a custom ingestion image that contains all the ingestion-related dependencies. Using this image, we run our ingestion tasks as Jobs in our DataHub Kubernetes namespace.
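
As a rough sketch of what one of those ingestion Jobs can look like (all names are placeholders, and the recipe is mounted from a ConfigMap):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: iceberg-ingestion
      namespace: datahub
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: ingest
              image: your-registry/datahub-ingestion:custom   # custom image with pyiceberg[glue] baked in
              command: ["datahub", "ingest", "-c", "/recipes/iceberg.yaml"]
              volumeMounts:
                - name: recipes
                  mountPath: /recipes
          volumes:
            - name: recipes
              configMap:
                name: ingestion-recipes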

Maybe this will help you :slightly_smiling_face:

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @DataHub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><

Thanks <@U049WUH4155>, I'll go this way. I guessed that there was a way to pass a requirements ConfigMap and get it installed on pod startup, like the Bitnami Airflow chart does.
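
To illustrate the idea: the chart side of it might look roughly like this, assuming the actions subchart exposes extraVolumes/extraVolumeMounts (the keys are hypothetical); the missing piece is startup logic in the image that actually runs pip install -r against the mounted file:

    acryl-datahub-actions:
      extraVolumes:
        - name: pip-requirements
          configMap:
            name: datahub-actions-requirements   # ConfigMap containing a requirements.txt
      extraVolumeMounts:
        - name: pip-requirements
          mountPath: /requirements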

You are welcome :slightly_smiling_face:

That would be neat!
I'm not aware of such functionality being available for DataHub's images.
It would be a great contribution, then, if you are interested :slightly_smiling_face:

Is there an issue for this?
I can work on it in my free time.