Original Slack Thread
Hi,
Having issues understanding how to create ingestion pipelines using the Python SDK inside a Lambda.
I have created a layer with the latest acryl-datahub from PyPI.
This is my code sample:
```
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.configuration.config_loader import load_config_file
from datahub.ingestion.run.pipeline import Pipeline

server = 'server'
token = 'token'

# Create an emitter to DataHub over REST
graph = DataHubGraph(DatahubClientConfig(server=server, token=token))

pipeline_config = {
    "pipeline_name": "prog-s3-stg-test",
    "source": {
        "type": "s3",
        "config": {
            "env": "STG",
            "path_specs": [{
                "include": "s3://bbkt/path/*.*"
            }],
            "aws_config": {
                "aws_region": "region"
            }
        }
    }
}

pipeline = Pipeline.create(pipeline_config)
pipeline.run()
pipeline.raise_from_status()
```
It seems that it is trying to look for the local .datahubenv file.
Here is the error:
```
{
    "errorMessage": "[Errno 2] No such file or directory: '/home/sbx_user1051/.datahubenv'",
    "errorType": "FileNotFoundError",
    "stackTrace": [
        " File \"/var/t
```
Any ideas on how to run this in a Lambda, if possible? :slightly_smiling_face:
```
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "/Users/sst/install000.csv"}],
                "profiling": {"enabled": True},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://localhost:8080",
            },
        },
    }
)
pipeline.run()
pipeline.pretty_print_summary()
pipeline.log_ingestion_stats()
```
Make sure the DataHub ingestion packages are installed in your Lambda environment.
If you look closely, my code is the same, and yes, I have installed the layer successfully.
It seems that the init is trying to place .datahubenv in the home directory of the running OS, but Lambda does not allow writing there.
```
START RequestId: cdbeda15-51cf-4dad-9fbf-d0c536ac30e2 Version: $LATEST
No ~/.datahubenv file found, generating one for you...
LAMBDA_WARNING: Unhandled exception. The most likely cause is an issue in the function code. However, in rare cases, a Lambda runtime update can cause unexpected function behavior. For functions using managed runtimes, runtime updates can be triggered by a function change, or can be applied automatically. To determine if the runtime has been updated, check the runtime version in the INIT_START log entry. If this error correlates with a change in the runtime version, you may be able to mitigate this error by temporarily rolling back to the previous runtime version. For more information, see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-update.html
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: '/home/sbx_user1051/.datahubenv'
Traceback (most recent call last):
```
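For context, the failing path comes from the SDK expanding `~/.datahubenv` with `os.path.expanduser`; in the Lambda sandbox the home directory resolves to a read-only location, while /tmp is the only writable one. A tiny illustrative sketch of that resolution (the printed paths are examples, not guaranteed values):
```
import os

# Default resolution in the Lambda sandbox: a read-only home directory.
print(os.path.expanduser("~/.datahubenv"))  # e.g. /home/sbx_user1051/.datahubenv

# Redirecting HOME at /tmp makes the same expansion land somewhere writable.
os.environ["HOME"] = "/tmp"
print(os.path.expanduser("~/.datahubenv"))  # /tmp/.datahubenv
```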
Managed to fix it
by adding a bit of code to config_util.py:
```
if os.environ.get("AWS_LAMBDA_FUNCTION_NAME"):
    DEFAULT_GMS_HOST = "http://localhost:8080"
    CONDENSED_DATAHUB_CONFIG_PATH = "/tmp/.datahubenv"
    DATAHUB_CONFIG_PATH = os.path.expanduser(CONDENSED_DATAHUB_CONFIG_PATH)
    DATAHUB_ROOT_FOLDER = "/tmp/.datahub"
    ENV_SKIP_CONFIG = "DATAHUB_SKIP_CONFIG"
else:
    DEFAULT_GMS_HOST = "http://localhost:8080"
    CONDENSED_DATAHUB_CONFIG_PATH = "~/.datahubenv"
    DATAHUB_CONFIG_PATH = os.path.expanduser(CONDENSED_DATAHUB_CONFIG_PATH)
    DATAHUB_ROOT_FOLDER = os.path.expanduser("~/.datahub")
    ENV_SKIP_CONFIG = "DATAHUB_SKIP_CONFIG"
```
If the run is in a Lambda environment, then use /tmp as the config path and root folder; in Lambda we are only allowed to write to /tmp.
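If patching the installed package is not an option (for example when the layer is built straight from the published wheel), a similar effect might be achievable by pointing HOME at /tmp before anything from datahub is imported, since the paths above are resolved with os.path.expanduser at module import time. A rough, untested sketch; the server, token, bucket, and region values are the placeholders from the thread:
```
import os

# Redirect the home directory to the only writable location in Lambda so the
# SDK writes .datahubenv under /tmp. This must run before any datahub import.
os.environ["HOME"] = "/tmp"

from datahub.ingestion.run.pipeline import Pipeline  # noqa: E402


def handler(event, context):
    pipeline = Pipeline.create(
        {
            "pipeline_name": "prog-s3-stg-test",
            "source": {
                "type": "s3",
                "config": {
                    "env": "STG",
                    "path_specs": [{"include": "s3://bbkt/path/*.*"}],
                    "aws_config": {"aws_region": "region"},
                },
            },
            # Explicit sink so the run does not fall back to the local
            # .datahubenv defaults.
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "server", "token": "token"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
```
The same redirection could presumably also be done through the function's environment variables instead of in code.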