Understanding the Creation of Virtual Environments (venvs) in DataHub's `datahub-actions` Container

user-4 · May 27, 2024, 12:03am

Hello! Where can I find the code that creates venvs after triggering a datahub-actions ingest?

datahub_team · May 27, 2024, 12:03am

Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!

Are you using UI or CLI for ingestion?
Which DataHub version are you using? (e.g. 0.12.0)
What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?

We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions, and don’t forget to rate its responses so other folks in the Community know if you’re still looking for help on the topic!

user-1 · May 27, 2024, 12:03am

To help you effectively, I need a bit more information:

Are you using the UI or CLI for ingestion?
Which DataHub version are you using (e.g., 0.12.0)?
What data source(s) are you integrating with DataHub (e.g., BigQuery)?

Once you provide these details, I can give you a more precise answer. However, based on the information available, here is a general direction:

The code that governs the creation of virtual environments (venvs) for ingestion in DataHub is not directly accessible in the open-source repositories. According to a Slack thread, the venv creation logic resides in a file called ingestion_common.sh within the datahub-actions Docker image. Unfortunately, this part of the code is not open-source.

For more details, you can refer to the following sources:

<Slack thread on venv creation>
<Slack thread on executor action>
<Slack thread on custom ingestion library>


user-4 · May 27, 2024, 12:03am

UI
0.13.2
mysql
<@U06TM7M40RK>

user-1 · May 27, 2024, 12:03am

Thank you for providing the details. Based on your setup (UI ingestion, DataHub version 0.13.2, and MySQL as the data source), here is a more specific answer:

The code that creates virtual environments (venvs) for ingestion when using DataHub Actions is part of the datahub-actions Docker image. This process is managed by a script called ingestion_common.sh, which is not open-source.

Key Points:

Virtual Environment Creation: The virtual environments are created dynamically at runtime by the ingestion_common.sh script within the datahub-actions Docker image.
Customization: As of now, there is no direct way to modify this script or the virtual environment’s requirements from the UI or the open-source codebase.
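
As a rough illustration of what "created dynamically at runtime" can boil down to, the Python sketch below builds the two commands such a step usually needs: create a venv, then pip-install the plugin extra into it. This is not the contents of ingestion_common.sh. The function name, the venv path, and the version pin are assumptions made for illustration, and the sketch assumes a POSIX venv layout (bin/pip).

```python
import sys
from pathlib import Path


def build_venv_commands(venv_dir: str, plugin: str, version: str) -> list[list[str]]:
    """Return the argv lists for creating a venv and installing one plugin.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    venv_path = Path(venv_dir)
    pip_path = venv_path / "bin" / "pip"  # POSIX layout (assumption)
    requirement = f"acryl-datahub[{plugin}]=={version}"  # version pin is illustrative
    return [
        [sys.executable, "-m", "venv", str(venv_path)],
        [str(pip_path), "install", requirement],
    ]


commands = build_venv_commands("/tmp/datahub/venv-mysql", "mysql", "0.13.2")
for argv in commands:
    print(" ".join(argv))
```

Each argv list could be handed to subprocess.run(argv, check=True) to actually execute it. Because the install happens inside a fresh venv, packages baked into the base image are not visible to it by default.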

Relevant Information:

According to a Slack thread, the ingestion_common.sh script is responsible for creating the virtual environments, but it is not open-source.
In a GitHub issue (datahub-project/datahub), it is mentioned that the virtual environments are created at runtime, and the dependencies are installed in these environments.

Example Scenario:
If you need to add custom dependencies (e.g., a specific SQLAlchemy dialect for MySQL), you would typically need to modify the Docker image to include these dependencies. However, since the virtual environments are created at runtime, this approach might not work directly.

Steps to Troubleshoot or Customize:

Check Docker Image: Ensure that the datahub-actions Docker image has all the necessary dependencies for MySQL ingestion.
Custom Docker Image: If needed, create a custom Docker image based on datahub-actions and include any additional dependencies.
Sync with DataHub Version: Ensure that the custom image is in sync with the DataHub version you are using (0.13.2).
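
One way to sanity-check the "in sync" step is to ask the Python environment inside the image which version of a distribution it has. The sketch below uses only the standard library; the helper name is hypothetical, and the expected version string is an assumption, since the Python package version may not be written exactly like the DataHub release label.

```python
from importlib import metadata


def check_installed(dist_name: str, expected_version: str) -> str:
    """Compare an installed distribution's version with the expected one."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return f"{dist_name}: not installed"
    if installed == expected_version:
        return f"{dist_name}: {installed} (matches)"
    return f"{dist_name}: {installed} (expected {expected_version})"


# Run with the interpreter whose packages you want to inspect.
print(check_installed("acryl-datahub", "0.13.2"))
```

Keep in mind that this reports on the interpreter that runs it, so a venv created at runtime has to be checked with that venv's own Python.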

Example Dockerfile for Custom Image:

FROM acryldata/datahub-actions:head

USER 0
RUN apt-get update && \
    apt-get install -y libmysqlclient-dev
RUN pip install mysqlclient
RUN pip install 'acryl-datahub[mysql]'
USER datahub

If you have any further questions or need more specific guidance, feel free to ask!

user-4 · May 27, 2024, 12:03am

I’m looking for the code that creates the venv at run time in the docker file <@U06TM7M40RK>

user-1 · May 27, 2024, 12:03am

The code that creates the virtual environments (venvs) at runtime in the datahub-actions Docker container is managed by the ingestion_common.sh script. This script is part of the datahub-actions Docker image and is not directly accessible in the open-source repositories.

Steps to Access and Modify the Script:

Extract the Script: You can extract the ingestion_common.sh script from the Docker image.
Modify the Script: Make necessary modifications to the script.
Build a Custom Docker Image: Create a custom Docker image that includes your modified script.

Example Steps to Extract and Modify the Script:

Extract the Script:

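# Create a stopped container from the image (nothing is executed)
docker create --name actions-tmp acryldata/datahub-actions:head
# Copy the script to the host; a cp run inside the container cannot reach the host.
# The in-image path is the one assumed in this thread; adjust it if the script lives elsewhere.
docker cp actions-tmp:/usr/local/bin/ingestion_common.sh ./ingestion_common.sh
# Remove the temporary container
docker rm actions-tmp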

Modify the Script: Edit the ingestion_common.sh script on your local machine to include any customizations you need.

Build a Custom Docker Image:

FROM acryldata/datahub-actions:head

USER root
COPY ingestion_common.sh /usr/local/bin/ingestion_common.sh
RUN chmod +x /usr/local/bin/ingestion_common.sh
USER datahub

Build and Push the Custom Image:

docker build -t your-custom-datahub-actions:latest .
docker push your-custom-datahub-actions:latest

Update Your Deployment: Update your Docker Compose or Kubernetes deployment to use the custom image.

Additional Information:

According to a Slack thread, the ingestion_common.sh script is responsible for creating the virtual environments.
In a GitHub issue (datahub-project/datahub), it is mentioned that the virtual environments are created at runtime, and the dependencies are installed in these environments.

If you have any further questions or need more specific guidance, feel free to ask!

user-4 · May 27, 2024, 12:03am

Wow this is the first time I’ve had one of these LLM chatbots be helpful!

user-4 · May 27, 2024, 12:03am

What code calls ingestion_common.sh? <@U06TM7M40RK>

user-1 · May 27, 2024, 12:03am

I’m glad to hear that you found the information helpful! The ingestion_common.sh script is typically invoked by the DataHub ingestion process within the datahub-actions Docker container. This script is responsible for setting up the virtual environment and installing the necessary dependencies for the ingestion tasks.

Where ingestion_common.sh is Called:
The script is called as part of the ingestion workflow managed by the datahub-actions container. While the exact entry point in the open-source codebase is not explicitly documented, it is generally triggered by the ingestion orchestration logic within the container.

Example of How ingestion_common.sh Might be Called:
In a typical setup, the ingestion process might involve a series of steps managed by a higher-level script or orchestration tool that eventually calls ingestion_common.sh. Here is a simplified example of how this might look:

#!/bin/bash

# Example entry point script for datahub-actions container

# Set up environment variables
export DATAHUB_VERSION="0.13.2"
export PLUGIN="mysql"
export TASK_ID="example-task-id"
export TMP_DIR="/tmp/datahub"

# Call the ingestion_common.sh script
/usr/local/bin/ingestion_common.sh "$TASK_ID" "$DATAHUB_VERSION" "$PLUGIN" "$TMP_DIR"

Relevant Information:

According to a GitHub issue (datahub-project/datahub), the ingestion_common.sh script is responsible for creating the virtual environments at runtime.
In a Slack thread, it is mentioned that the script is part of the datahub-actions Docker image and is not open-source.

Steps to Investigate Further:

Inspect the Docker Image: You can inspect the datahub-actions Docker image to see how the scripts are organized and invoked.
Check Container Logs: Look at the logs of the datahub-actions container to see the sequence of commands executed during an ingestion run.
Custom Entry Point: If you need more control, you can create a custom entry point script that explicitly calls ingestion_common.sh.

Example Dockerfile with Custom Entry Point:

FROM acryldata/datahub-actions:head

USER root
COPY ingestion_common.sh /usr/local/bin/ingestion_common.sh
COPY custom_entrypoint.sh /usr/local/bin/custom_entrypoint.sh
RUN chmod +x /usr/local/bin/ingestion_common.sh
RUN chmod +x /usr/local/bin/custom_entrypoint.sh
USER datahub

ENTRYPOINT ["/usr/local/bin/custom_entrypoint.sh"]

Example Custom Entry Point Script:

#!/bin/bash

# Custom entry point script for datahub-actions container

# Set up environment variables
export DATAHUB_VERSION="0.13.2"
export PLUGIN="mysql"
export TASK_ID="example-task-id"
export TMP_DIR="/tmp/datahub"

# Call the ingestion_common.sh script
/usr/local/bin/ingestion_common.sh "$TASK_ID" "$DATAHUB_VERSION" "$PLUGIN" "$TMP_DIR"

hyejin.yoon · May 27, 2024, 12:03am

If you have any further questions or need more specific guidance, feel free to ask!

user-2 · May 27, 2024, 12:03am

You are looking for the acryl-executor package (the ingestion_common.sh script is included there). Unfortunately, there isn’t a public repository available for this package, but you can download the source code from PyPI (“Source Distribution”): https://pypi.org/project/acryl-executor/#files
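
Once the source distribution is downloaded, a few lines of Python can list the shell scripts it contains without unpacking it. This is only a sketch: the helper name is hypothetical, and it assumes the download is a tar archive (the usual sdist format), passed as the first command-line argument.

```python
import sys
import tarfile


def find_members(sdist_path: str, suffixes: tuple[str, ...]) -> list[str]:
    """List regular files in the archive whose names end with one of the suffixes."""
    # "r:*" lets tarfile detect the compression (gz, bz2, xz) on its own.
    with tarfile.open(sdist_path, "r:*") as archive:
        return sorted(
            member.name
            for member in archive.getmembers()
            if member.isfile() and member.name.endswith(suffixes)
        )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_scripts.py <path to the downloaded sdist>
    for name in find_members(sys.argv[1], (".sh",)):
        print(name)
```

Listing the members first makes it easy to confirm where ingestion_common.sh sits before extracting anything.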

Within the package, the SubProcessIngestionTask class in /src/acryl/executor/execution/sub_process_ingestion_task.py calls the run_ingest.sh script in a subprocess, which uses the ingestion_common.sh script to create the venvs…
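
For readers unfamiliar with the pattern, "calling a script in a subprocess" looks roughly like the sketch below. It is not the acryl-executor code: the function name and timeout are assumptions, and a harmless Python one-liner stands in for run_ingest.sh, whose real arguments are not shown in this thread.

```python
import subprocess
import sys


def run_script(argv: list[str], timeout_seconds: float = 60.0) -> tuple[int, str]:
    """Run a command in a child process and return (exit code, combined output)."""
    completed = subprocess.run(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout
        text=True,
        timeout=timeout_seconds,
        check=False,  # report the exit code instead of raising
    )
    return completed.returncode, completed.stdout


# Stand-in command; a real task would pass the script path and its arguments.
exit_code, output = run_script([sys.executable, "-c", "print('creating venv')"])
print(exit_code, output.strip())
```

The relevant point for this thread is that whatever the child script does, including creating venvs and installing packages, happens outside the parent Python process and its installed dependencies.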

user-4 · May 27, 2024, 12:03am

I’m currently doing a security audit for the actions container and this makes things pretty dicey. Installing venvs at runtime means that the mitigations I’ve been applying to the installed Python dependencies are moot if acryl-executor is going to install a bunch of dependencies every time a pipeline is run.

I do see that some people are pointing out that you can use --system-site-packages in ingestion_common.sh so that the venv can see the packages from the base Python install, but I’m unsure if this will break anything.
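
To see what that option changes, the sketch below creates two throwaway venvs with Python's standard venv module, one isolated and one with access to the base install's packages, and reads the setting back from each pyvenv.cfg. The helper name is hypothetical, and pip is skipped to keep the sketch fast; it says nothing about how ingestion_common.sh itself invokes venv.

```python
import tempfile
import venv
from pathlib import Path


def create_venv(target: Path, use_system_packages: bool) -> dict[str, str]:
    """Create a venv without pip and return the settings from its pyvenv.cfg."""
    builder = venv.EnvBuilder(system_site_packages=use_system_packages, with_pip=False)
    builder.create(target)
    settings = {}
    for line in (target / "pyvenv.cfg").read_text().splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


with tempfile.TemporaryDirectory() as tmp:
    isolated = create_venv(Path(tmp) / "isolated", use_system_packages=False)
    shared = create_venv(Path(tmp) / "shared", use_system_packages=True)

print(isolated["include-system-site-packages"])  # false
print(shared["include-system-site-packages"])    # true
```

Note that this option only makes the base install's packages visible inside the venv. pip can still install additional or different versions into the venv, so on its own it does not stop runtime installs.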
