Original Slack Thread
Hi,
Having issues understanding how to create ingestion pipelines using the Python SDK inside a Lambda.
I have created a layer with the latest acryl-datahub from PyPI.
This is my code sample:
```
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph
from datahub.configuration.config_loader import load_config_file
from datahub.ingestion.run.pipeline import Pipeline

server = 'server'
token = 'token'

# Create an emitter to DataHub over REST
graph = DataHubGraph(DatahubClientConfig(server=server, token=token))

pipeline_config = {
    "pipeline_name": "prog-s3-stg-test",
    "source": {
        "type": "s3",
        "config": {
            "env": "STG",
            "path_specs": [{
                "include": "s3://bbkt/path/*.*"
            }],
            "aws_config": {
                "aws_region": "region"
            }
        }
    }
}

pipeline = Pipeline.create(pipeline_config)
pipeline.run()
pipeline.raise_from_status()
```
It seems that it is trying to look for the local .datahubenv file.
Here is the error:
```
{
    "errorMessage": "[Errno 2] No such file or directory: '/home/sbx_user1051/.datahubenv'",
    "errorType": "FileNotFoundError",
    "stackTrace": [
        " File \"/var/t
```
Any ideas on how to run this in a Lambda, if possible? :slightly_smiling_face:
```
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "s3",
            "config": {
                "path_specs": [{"include": "/Users/sst/install000.csv"}],
                "profiling": {"enabled": True},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {
                "server": "http://localhost:8080",
            },
        },
    }
)
pipeline.run()
pipeline.pretty_print_summary()
pipeline.log_ingestion_stats()
```
Make sure the DataHub ingestion packages are installed in your Lambda environment.
If you look closely, my code is the same, and yes, I have installed the layer successfully.
It seems that the init is trying to place .datahubenv in the home directory of the running OS, but Lambda does not allow writing there.
```
START RequestId: cdbeda15-51cf-4dad-9fbf-d0c536ac30e2 Version: $LATEST
No ~/.datahubenv file found, generating one for you...
LAMBDA_WARNING: Unhandled exception. The most likely cause is an issue in the function code. However, in rare cases, a Lambda runtime update can cause unexpected function behavior. For functions using managed runtimes, runtime updates can be triggered by a function change, or can be applied automatically. To determine if the runtime has been updated, check the runtime version in the INIT_START log entry. If this error correlates with a change in the runtime version, you may be able to mitigate this error by temporarily rolling back to the previous runtime version. For more information, see https://docs.aws.amazon.com/lambda/latest/dg/runtimes-update.html
[ERROR] FileNotFoundError: [Errno 2] No such file or directory: '/home/sbx_user1051/.datahubenv'
Traceback (most recent call last):
```
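For context, the failing path comes from the SDK expanding `~/.datahubenv` with `os.path.expanduser`; in the Lambda sandbox the home directory resolves to a read-only location, while /tmp is the only writable one. A tiny illustrative sketch of that resolution (the printed paths are examples, not guaranteed values):
```
import os

# Default resolution in the Lambda sandbox: a read-only home directory.
print(os.path.expanduser("~/.datahubenv"))  # e.g. /home/sbx_user1051/.datahubenv

# Redirecting HOME at /tmp makes the same expansion land somewhere writable.
os.environ["HOME"] = "/tmp"
print(os.path.expanduser("~/.datahubenv"))  # /tmp/.datahubenv
```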
Managed to fix it
by adding a bit of code to config_util.py:
```
if os.environ.get("AWS_LAMBDA_FUNCTION_NAME"):
    DEFAULT_GMS_HOST = "http://localhost:8080"
    CONDENSED_DATAHUB_CONFIG_PATH = "/tmp/.datahubenv"
    DATAHUB_CONFIG_PATH = os.path.expanduser(CONDENSED_DATAHUB_CONFIG_PATH)
    DATAHUB_ROOT_FOLDER = "/tmp/.datahub"
    ENV_SKIP_CONFIG = "DATAHUB_SKIP_CONFIG"
else:
    DEFAULT_GMS_HOST = "http://localhost:8080"
    CONDENSED_DATAHUB_CONFIG_PATH = "~/.datahubenv"
    DATAHUB_CONFIG_PATH = os.path.expanduser(CONDENSED_DATAHUB_CONFIG_PATH)
    DATAHUB_ROOT_FOLDER = os.path.expanduser("~/.datahub")
    ENV_SKIP_CONFIG = "DATAHUB_SKIP_CONFIG"
```
If the run is in a Lambda environment, then use /tmp as the config path and root folder; in Lambda we are only allowed to write to /tmp.
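If patching the installed package is not an option (for example when the layer is built straight from the published wheel), a similar effect might be achievable by pointing HOME at /tmp before anything from datahub is imported, since the paths above are resolved with os.path.expanduser at module import time. A rough, untested sketch; the server, token, bucket, and region values are the placeholders from the thread:
```
import os

# Redirect the home directory to the only writable location in Lambda so the
# SDK writes .datahubenv under /tmp. This must run before any datahub import.
os.environ["HOME"] = "/tmp"

from datahub.ingestion.run.pipeline import Pipeline  # noqa: E402


def handler(event, context):
    pipeline = Pipeline.create(
        {
            "pipeline_name": "prog-s3-stg-test",
            "source": {
                "type": "s3",
                "config": {
                    "env": "STG",
                    "path_specs": [{"include": "s3://bbkt/path/*.*"}],
                    "aws_config": {"aws_region": "region"},
                },
            },
            # Explicit sink so the run does not fall back to the local
            # .datahubenv defaults.
            "sink": {
                "type": "datahub-rest",
                "config": {"server": "server", "token": "token"},
            },
        }
    )
    pipeline.run()
    pipeline.raise_from_status()
```
The same redirection could presumably also be done through the function's environment variables instead of in code.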