Understanding the Creation of Virtual Environments (venvs) in DataHub's `datahub-actions` Container

Original Slack Thread

Hello! Where can I find the code that creates venvs after triggering datahub-actions ingest?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To help you effectively, I need a bit more information:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub (e.g., BigQuery)?

Once you provide these details, I can give you a more precise answer. However, based on the information available, here is a general direction:

The code that governs the creation of virtual environments (venvs) for ingestion in DataHub is not directly accessible in the open-source repositories. According to a linked Slack thread, the venv creation logic resides in a file called ingestion_common.sh within the datahub-actions Docker image. Unfortunately, this part of the code is not open-source.

For more details, you can refer to the following sources:

  • <Slack thread on venv creation>
  • <Slack thread on executor action>
  • <Slack thread on custom ingestion library>

  1. UI
  2. 0.13.2
  3. mysql
    <@U06TM7M40RK>

Thank you for providing the details. Based on your setup (UI ingestion, DataHub version 0.13.2, and MySQL as the data source), here is a more specific answer:

The code that creates virtual environments (venvs) for ingestion when using DataHub Actions is part of the datahub-actions Docker image. This process is managed by a script called ingestion_common.sh, which is not open-source.

Key Points:

  1. Virtual Environment Creation: The virtual environments are created dynamically at runtime by the ingestion_common.sh script within the datahub-actions Docker image.
  2. Customization: As of now, there is no direct way to modify this script or the virtual environment’s requirements from the UI or the open-source codebase.
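
To make "created dynamically at runtime" concrete, here is a minimal Python sketch of what per-run venv creation generally looks like. It is not the code in ingestion_common.sh: the helper name create_run_venv, the venv-<plugin>-<version> directory layout, and the exact package spec are all assumptions made for illustration. The venv is created without pip and the install command is only built, not executed, so nothing is downloaded.

```python
import sys
import tempfile
import venv
from pathlib import Path


def create_run_venv(tmp_dir: str, plugin: str, version: str) -> tuple[Path, list[str]]:
    """Create a per-run venv and return it with the pip command for it.

    Hypothetical helper: the directory layout and package spec are
    assumptions, not taken from ingestion_common.sh.
    """
    venv_dir = Path(tmp_dir) / f"venv-{plugin}-{version}"
    # with_pip=False keeps this sketch offline; a real script needs pip.
    venv.create(venv_dir, with_pip=False)
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    pip_cmd = [
        str(venv_dir / bin_dir / "python"),
        "-m", "pip", "install",
        f"acryl-datahub[{plugin}]=={version}",
    ]
    return venv_dir, pip_cmd


with tempfile.TemporaryDirectory() as tmp:
    venv_dir, pip_cmd = create_run_venv(tmp, "mysql", "0.13.2")
    has_cfg = (venv_dir / "pyvenv.cfg").is_file()

print(has_cfg, pip_cmd[-1])
```

The point of the sketch is that anything installed into the image's base Python is invisible to a venv built this way, which is why changing the Docker image alone may not change what an ingestion run actually uses.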

Relevant Information:

  • According to a linked Slack thread, the ingestion_common.sh script is responsible for creating the virtual environments, but it is not open-source.
  • A GitHub issue in datahub-project/datahub mentions that the virtual environments are created at runtime and that the dependencies are installed in these environments.

Example Scenario:
If you need to add custom dependencies (e.g., a specific SQLAlchemy dialect for MySQL), you would typically need to modify the Docker image to include these dependencies. However, since the virtual environments are created at runtime, this approach might not work directly.

Steps to Troubleshoot or Customize:

  1. Check Docker Image: Ensure that the datahub-actions Docker image has all the necessary dependencies for MySQL ingestion.
  2. Custom Docker Image: If needed, create a custom Docker image based on datahub-actions and include any additional dependencies.
  3. Sync with DataHub Version: Ensure that the custom image is in sync with the DataHub version you are using (0.13.2).

Example Dockerfile for Custom Image:

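# Note: the :head tag moves; prefer a release tag that matches your DataHub version.
FROM acryldata/datahub-actions:head

# Install system libraries as root
USER 0
RUN apt-get update && \
    apt-get install -y libmysqlclient-dev && \
    rm -rf /var/lib/apt/lists/*
# These land in the image's base Python, not in venvs created at runtime
RUN pip install mysqlclient 'acryl-datahub[mysql]'
USER datahub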

If you have any further questions or need more specific guidance, feel free to ask!

I’m looking for the code that creates the venv at run time in the docker file <@U06TM7M40RK>

The code that creates the virtual environments (venvs) at runtime in the datahub-actions Docker container is managed by the ingestion_common.sh script. This script is part of the datahub-actions Docker image and is not directly accessible in the open-source repositories.

Steps to Access and Modify the Script:

  1. Extract the Script: You can extract the ingestion_common.sh script from the Docker image.
  2. Modify the Script: Make necessary modifications to the script.
  3. Build a Custom Docker Image: Create a custom Docker image that includes your modified script.

Example Steps to Extract and Modify the Script:

  1. Extract the Script:

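    # Locate the script first; the /usr/local/bin path below is an unverified guess
    docker run --rm --entrypoint /bin/sh acryldata/datahub-actions:head \
      -c 'find / -name ingestion_common.sh 2>/dev/null'
    # cp inside a container cannot reach the host, so copy the file out with docker cp
    docker create --name actions-tmp acryldata/datahub-actions:head
    docker cp actions-tmp:/usr/local/bin/ingestion_common.sh .
    docker rm actions-tmp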
    
  2. Modify the Script: Edit the ingestion_common.sh script on your local machine to include any customizations you need.

  3. Build a Custom Docker Image:

    FROM acryldata/datahub-actions:head
    
    USER root
    COPY ingestion_common.sh /usr/local/bin/ingestion_common.sh
    RUN chmod +x /usr/local/bin/ingestion_common.sh
    USER datahub
    
  4. Build and Push the Custom Image:

    docker build -t your-custom-datahub-actions:latest .
    docker push your-custom-datahub-actions:latest
    
  5. Update Your Deployment: Update your Docker Compose or Kubernetes deployment to use the custom image.

Additional Information:

  • According to a linked Slack thread, the ingestion_common.sh script is responsible for creating the virtual environments.
  • A GitHub issue in datahub-project/datahub mentions that the virtual environments are created at runtime and that the dependencies are installed in these environments.

If you have any further questions or need more specific guidance, feel free to ask!

Wow this is the first time I’ve had one of these LLM chatbots be helpful!

What code calls ingestion_common.sh? <@U06TM7M40RK>

I’m glad to hear that you found the information helpful! The ingestion_common.sh script is typically invoked by the DataHub ingestion process within the datahub-actions Docker container. This script is responsible for setting up the virtual environment and installing the necessary dependencies for the ingestion tasks.

Where ingestion_common.sh is Called:
The script is called as part of the ingestion workflow managed by the datahub-actions container. While the exact entry point in the open-source codebase is not explicitly documented, it is generally triggered by the ingestion orchestration logic within the container.

Example of How ingestion_common.sh Might be Called:
In a typical setup, the ingestion process might involve a series of steps managed by a higher-level script or orchestration tool that eventually calls ingestion_common.sh. Here is a simplified example of how this might look:

#!/bin/bash
set -euo pipefail

# Example entry point script for datahub-actions container
# (illustrative only; the argument order below is not confirmed)

# Set up environment variables
export DATAHUB_VERSION="0.13.2"
export PLUGIN="mysql"
export TASK_ID="example-task-id"
export TMP_DIR="/tmp/datahub"

# Call the ingestion_common.sh script, quoting each argument
/usr/local/bin/ingestion_common.sh "$TASK_ID" "$DATAHUB_VERSION" "$PLUGIN" "$TMP_DIR"

Relevant Information:

  • According to a GitHub issue in datahub-project/datahub, the ingestion_common.sh script is responsible for creating the virtual environments at runtime.
  • A linked Slack thread mentions that the script is part of the datahub-actions Docker image and is not open-source.

Steps to Investigate Further:

  1. Inspect the Docker Image: You can inspect the datahub-actions Docker image to see how the scripts are organized and invoked.
  2. Check Container Logs: Look at the logs of the datahub-actions container to see the sequence of commands executed during an ingestion run.
  3. Custom Entry Point: If you need more control, you can create a custom entry point script that explicitly calls ingestion_common.sh.

Example Dockerfile with Custom Entry Point:

FROM acryldata/datahub-actions:head

USER root
COPY ingestion_common.sh /usr/local/bin/ingestion_common.sh
COPY custom_entrypoint.sh /usr/local/bin/custom_entrypoint.sh
RUN chmod +x /usr/local/bin/ingestion_common.sh
RUN chmod +x /usr/local/bin/custom_entrypoint.sh
USER datahub

ENTRYPOINT ["/usr/local/bin/custom_entrypoint.sh"]

Example Custom Entry Point Script:

#!/bin/bash
set -euo pipefail

# Custom entry point script for datahub-actions container
# (illustrative only; the argument order below is not confirmed)

# Set up environment variables
export DATAHUB_VERSION="0.13.2"
export PLUGIN="mysql"
export TASK_ID="example-task-id"
export TMP_DIR="/tmp/datahub"

# Call the ingestion_common.sh script, quoting each argument
/usr/local/bin/ingestion_common.sh "$TASK_ID" "$DATAHUB_VERSION" "$PLUGIN" "$TMP_DIR"

If you have any further questions or need more specific guidance, feel free to ask!

You are looking for the acryl-executor package, which includes the ingestion_common.sh script. Unfortunately there isn’t a public repository for this package, but you can download the source code from PyPI (“Source Distribution”): https://pypi.org/project/acryl-executor/#files

Within the package, the SubProcessIngestionTask class in /src/acryl/executor/execution/sub_process_ingestion_task.py calls the run_ingest.sh script in a subprocess, which in turn uses the ingestion_common.sh script to create the venvs.
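
If you have downloaded the source distribution by hand, a small Python helper can locate the two scripts inside the archive without unpacking it. This is only a sketch: find_in_sdist and build_stand_in_sdist are hypothetical names, and the stand-in archive uses a made-up layout so the example runs offline. Point find_in_sdist at the real .tar.gz to see the actual paths.

```python
import io
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def find_in_sdist(archive: Path, filenames: set[str]) -> list[str]:
    """Return sorted archive member paths whose basename is in filenames."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(
            name for name in tar.getnames()
            if PurePosixPath(name).name in filenames
        )


def build_stand_in_sdist(archive: Path) -> None:
    """Write a tiny fake sdist; the internal layout here is made up."""
    members = (
        "pkg-0.0.0/scripts/run_ingest.sh",
        "pkg-0.0.0/scripts/ingestion_common.sh",
        "pkg-0.0.0/setup.py",
    )
    with tarfile.open(archive, "w:gz") as tar:
        for name in members:
            data = b"# placeholder\n"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))


with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "stand-in.tar.gz"
    build_stand_in_sdist(archive)
    matches = find_in_sdist(archive, {"run_ingest.sh", "ingestion_common.sh"})

print(matches)
```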

I’m currently doing a security audit for the actions container and this makes things pretty dicey. Installing venvs at runtime means that the mitigations I’ve been applying to the installed Python dependencies are moot if acryl-executor is going to install a bunch of dependencies every time a pipeline is run.
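
For an audit like this, one option is to inventory what the runtime venvs actually contain after a pipeline has run. Below is a rough Python sketch using only the standard library. The function name list_venv_packages is hypothetical, and where the executor puts its venvs is not stated in this thread, so the root directory is something you would have to find yourself. The demo builds a fake venv so it runs anywhere.

```python
import tempfile
from importlib import metadata
from pathlib import Path


def list_venv_packages(root: Path) -> dict[str, list[str]]:
    """Map each venv found under root to its 'name==version' strings."""
    report = {}
    # A directory containing pyvenv.cfg is treated as a venv.
    for cfg in root.rglob("pyvenv.cfg"):
        venv_dir = cfg.parent
        site_dirs = [str(p) for p in venv_dir.rglob("site-packages")]
        dists = metadata.distributions(path=site_dirs)
        report[str(venv_dir)] = sorted(
            f"{d.metadata['Name']}=={d.version}" for d in dists
        )
    return report


with tempfile.TemporaryDirectory() as tmp:
    # Fake venv with one installed distribution, for demonstration only.
    fake_venv = Path(tmp) / "venv-mysql-0.13.2"
    dist_info = (
        fake_venv / "lib" / "python3" / "site-packages"
        / "example_pkg-1.0.dist-info"
    )
    dist_info.mkdir(parents=True)
    (fake_venv / "pyvenv.cfg").write_text("home = /usr/bin\n")
    (dist_info / "METADATA").write_text(
        "Metadata-Version: 2.1\nName: example-pkg\nVersion: 1.0\n"
    )
    packages = list_venv_packages(Path(tmp))[str(fake_venv)]

print(packages)
```

The resulting name==version list can be fed into whatever vulnerability scanning you already apply to the base image, so the runtime installs are at least visible.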

I do see that some people are pointing out that you can use --system-site-packages in ingestion_common.sh so that the venv can use packages from the base Python install, but I’m unsure if this will break anything.
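
For reference, the standard venv option for exposing the base install is --system-site-packages (system_site_packages=True in the venv module). The Python sketch below only shows what that option changes in the venv's own configuration. It says nothing about how ingestion_common.sh behaves, and read_venv_config is a hypothetical helper.

```python
import tempfile
import venv
from pathlib import Path


def read_venv_config(venv_dir: Path) -> dict[str, str]:
    """Parse pyvenv.cfg into a dict of lowercase keys to values."""
    config = {}
    for line in (venv_dir / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            config[key.strip().lower()] = value.strip()
    return config


with tempfile.TemporaryDirectory() as tmp:
    isolated = Path(tmp) / "isolated"
    shared = Path(tmp) / "shared"
    # with_pip=False keeps the example offline and fast.
    venv.create(isolated, with_pip=False)
    venv.create(shared, with_pip=False, system_site_packages=True)
    isolated_flag = read_venv_config(isolated)["include-system-site-packages"]
    shared_flag = read_venv_config(shared)["include-system-site-packages"]

print(isolated_flag, shared_flag)
```

With the flag set, pip inside the venv treats requirements already satisfied by the base install as installed, but it will still download anything missing or pinned to a different version, so this reduces runtime installs rather than guaranteeing there are none.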