Troubleshooting a Trino Data Ingestion Issue in the DataHub UI

Original Slack Thread

hi all, <@U06TM7M40RK>
I have been facing this issue when I'm trying to ingest data from Trino in the DataHub UI. The job is stuck after
Installing collected packages: webencodings, wcwidth, vininfo, pytz, python-stdnum, pure-eval, ptyprocess, phonenumbers, makefun, ipython-genutils, ipaddress, ijson, fastjsonschema, cymem, confluent-kafka, click-spinner, cached-property, zipp, wrapt, widgetsnbextension, websocket-client, wasabi, urllib3, tzlocal, tzdata, typing-extensions, traitlets, tqdm, tornado, toolz, toml, tinycss2, termcolor, tabulate, sqlparse, spacy-loggers, spacy-legacy, soupsieve, sniffio, smart-open, six, send2trash, ruamel.yaml, rstr, rpds-py, pyzmq, PyYAML, pyparsing, pygments, pycparser, pycountry, psutil, prompt-toolkit, prometheus-client, pexpect, pathlib-abc, parso, pandocfilters, packaging, numpy, nest-asyncio, mypy-extensions, murmurhash, multidict, mistune, MarkupSafe, marisa-trie, jupyterlab-widgets, jupyterlab-pygments, jsonref, jsonpointer, importlib-resources, idna, humanfriendly, greenlet, frozenlist, fastavro, expandvars, executing, exceptiongroup, entrypoints, defusedxml, decorator, debugpy, colorama, click, charset-normalizer, certifi, catalogue, avro, attrs, async-timeout, acryl-sqlglot, yarl, typing-inspect, typer, terminado, srsly, sqlalchemy, sentry-sdk, scipy, schwifty, requests, referencing, python-utils, python-dateutil, pydantic, preshed, matplotlib-inline, marshmallow, language-data, jupyter-core, jsonpatch, jinja2, jedi, importlib-metadata, Deprecated, comm, click-default-group, cffi, blis, bleach, beautifulsoup4, avro-gen3, asttokens, anyio, aiosignal, trino, stack-data, requests-file, progressbar2, pathy, pandas, mixpanel, langcodes, jupyter-client, jsonschema-specifications, docker, cryptography, confection, argon2-cffi-bindings, aiohttp, thinc, jsonschema, Ipython, argon2-cffi, spacy, nbformat, ipywidgets, ipykernel, altair, acryl-datahub, nbclient, acryl-datahub-classify, nbconvert, jupyter-server, notebook-shim, nbclassic, notebook, great-expectations
The total log is attached here as well.

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand:

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Which DataHub version are you using? (e.g. 0.12.0)
  2. Please post any relevant error logs on the thread!

<@U06TM7M40RK>

  1. 0.4.19
  2. The job is stuck at
    Installing collected packages: webencodings, wcwidth, vininfo, … (the same package list as in the original message above)

To address the issue of your Trino ingestion job getting stuck during the installation of collected packages in DataHub UI, here are some steps and considerations based on the provided data sources:

  1. Check Network Connectivity:
    Ensure that the machine or container running the ingestion job has network access to download the required packages. Network issues can cause the installation process to hang.
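
    For example, a quick way to confirm that PyPI is reachable from inside the pod that runs the ingestion is the following sketch (<your-ingestion-pod-name> is a placeholder for the actual executor/actions pod in your namespace):

    kubectl exec -it <your-ingestion-pod-name> -- /bin/bash
    # Inside the pod: a small test install; if this hangs, the problem is network access to PyPI
    pip install --timeout 60 trino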

  2. Custom Docker Image:
    Create a custom Docker image that includes the necessary dependencies for Trino ingestion. This can help avoid the need to install packages at runtime.

    FROM linkedin/datahub-ingestion:latest

    # System and Python ODBC dependencies (only needed if you also ingest ODBC-based sources)
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    # Pre-install the Trino plugin so it does not have to be downloaded at runtime
    RUN pip install 'acryl-datahub[trino]'
    

    Build the Docker image with the following command:

    docker build -f Dockerfile -t datahub-ingestion-trino .
    
  3. Helm Chart Configuration:
    If you are using Helm for deployment, ensure that your Helm chart is configured to use the custom Docker image. Update the values.yaml file to point to your custom image.

    image:
      repository: your-docker-repo/datahub-ingestion-trino
      tag: latest
    
  4. Increase Resource Limits:
    Ensure that the Kubernetes pod running the ingestion job has sufficient resources (CPU and memory) allocated. You can increase the resource limits in your Helm chart or Kubernetes deployment configuration.
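
    As a quick check of what the pod is actually getting, the commands below can help (a sketch; the pod name is a placeholder, and kubectl top requires the metrics-server add-on):

    # Show configured requests/limits and any OOM or eviction events
    kubectl describe pod <your-ingestion-pod-name>
    # Show current CPU/memory usage
    kubectl top pod <your-ingestion-pod-name>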

  5. Enable Debug Logging:
    Enable debug logging for the ingestion job to get more detailed information about where it is getting stuck. You can do this by adding the --debug flag to the ingestion command or setting the DATAHUB_DEBUG environment variable to true.

    datahub --debug ingest -c your_recipe.yml
    
  6. Check Logs:
    Review the logs from the DataHub ingestion executor and the DataHub GMS (Generalized Metadata Service) to identify any specific error messages. This can provide insights into why the job is getting stuck.
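
    A minimal sketch of pulling both sets of logs (pod names are placeholders; use whatever kubectl get pods shows in your namespace):

    # Executor/actions pod: shows the pip install progress for UI-based ingestion
    kubectl logs -f <datahub-actions-pod-name>
    # GMS pod: shows the ingestion requests being recorded
    kubectl logs -f <datahub-gms-pod-name>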

Would you like more detailed steps on any of these points or further assistance on a specific part of the process?


<@U06TM7M40RK> the GMS logs shows the following

, startTimeMs=1720145566850, status=RUNNING}, systemMetadata={lastObserved=1720145627181, runId=no-run-id-provided}}]
2024-07-05 02:13:47,214 [qtp1908143486-328] INFO  c.l.m.entity.EntityServiceImpl:2219 - Ingesting aspect with name dataHubExecutionRequestResult, urn urn:li:dataHubExecutionRequest:e052d171-3eed-464e-8f30-860749450072
2024-07-05 02:13:47,230 [qtp1908143486-328] INFO  c.l.m.entity.EntityServiceImpl:639 - Ingestion of aspects batch to database took 36 ms
2024-07-05 02:13:47,230 [qtp1908143486-328] INFO  c.l.m.entity.EntityServiceImpl:1534 - Skipped producing MCL for ingested aspect dataHubExecutionRequestResult, urn urn:li:dataHubExecutionRequest:e052d171-3eed-464e-8f30-860749450072. Aspect has not changed.
2024-07-05 02:13:47,232 [pool-12-thread-9] INFO  c.l.m.filter.RestliLoggingFilter:52 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 56ms
2024-07-05 02:13:50,743 [qtp1908143486-327] INFO  c.l.m.r.entity.AspectResource:122 - GET ASPECT urn: urn:li:telemetry:clientId aspect: telemetryClientId version: 0
2024-07-05 02:13:50,776 [pool-12-thread-2] INFO  c.l.m.filter.RestliLoggingFilter:52 - GET /aspects/urn%3Ali%3Atelemetry%3AclientId?aspect=telemetryClientId&version=0 - get - 200 - 33ms

Based on the provided GMS logs, GMS itself looks healthy (the ingestion requests are returning 200), so the ingestion job still appears to be getting stuck during the installation of a large number of Python packages. This can be due to several reasons, such as network issues, resource constraints, or package conflicts. Here are some steps to help troubleshoot and resolve the issue:

  1. Network Connectivity:
    Ensure that the machine or container running the ingestion job has stable network access to download the required packages. You can test this by trying to manually install a few packages using pip within the same environment.

  2. Custom Docker Image:
    Creating a custom Docker image with pre-installed dependencies can help avoid the need to install packages at runtime. Here’s an example Dockerfile:

    FROM linkedin/datahub-ingestion:latest
    
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    RUN pip install 'acryl-datahub[trino]'
    

    Build the Docker image with the following command:

    docker build -f Dockerfile -t datahub-ingestion-trino .
    
  3. Helm Chart Configuration:
    If you are using Helm for deployment, ensure that your Helm chart is configured to use the custom Docker image. Update the values.yaml file to point to your custom image.

    image:
      repository: your-docker-repo/datahub-ingestion-trino
      tag: latest
    
  4. Increase Resource Limits:
    Ensure that the Kubernetes pod running the ingestion job has sufficient resources (CPU and memory) allocated. You can increase the resource limits in your Helm chart or Kubernetes deployment configuration.

  5. Enable Debug Logging:
    Enable debug logging for the ingestion job to get more detailed information about where it is getting stuck. You can do this by adding the --debug flag to the ingestion command or setting the DATAHUB_DEBUG environment variable to true.

    datahub --debug ingest -c your_recipe.yml
    
  6. Check Logs:
    Review the logs from the DataHub ingestion executor and the DataHub GMS (Generalized Metadata Service) to identify any specific error messages. This can provide insights into why the job is getting stuck.

  7. Retry Mechanism:
    If the issue is intermittent, consider implementing a retry mechanism in your ingestion job to handle temporary network issues.
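
    Since the hang happens inside pip itself, one option is to make pip more tolerant of a flaky network via its standard environment variables (a sketch; these would need to be set in the environment of the container that runs the install):

    # pip maps PIP_* environment variables onto its command-line options
    export PIP_RETRIES=10           # retry failed downloads up to 10 times
    export PIP_DEFAULT_TIMEOUT=120  # allow slow connections more time before failing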

Would you like more detailed steps on any of these points or further assistance on a specific part of the process?


<@U06TM7M40RK> how can I edit the Helm chart to include the datahub-ingestion-trino image?

To edit the Helm chart to include the datahub-ingestion-trino Docker image, you need to update the values.yaml file to specify the custom Docker image for the ingestion job. Here are the detailed steps:

  1. Create a Custom Docker Image:
    First, create a custom Docker image that includes the necessary dependencies for Trino ingestion.

    FROM linkedin/datahub-ingestion:latest
    
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    RUN pip install 'acryl-datahub[trino]'
    

    Build and push the Docker image to your container registry:

    docker build -t your-docker-repo/datahub-ingestion-trino:latest .
    docker push your-docker-repo/datahub-ingestion-trino:latest
    
  2. Update Helm values.yaml File:
    Modify the values.yaml file of your Helm chart to use the custom Docker image for the ingestion job.

    ingestion:
      image:
        repository: your-docker-repo/datahub-ingestion-trino
        tag: latest
    
  3. Deploy the Helm Chart:
    Apply the updated Helm chart to your Kubernetes cluster.

    helm upgrade --install datahub datahub/datahub -f values.yaml
    

Here is a more detailed example of the values.yaml file with the necessary changes:

# values.yaml

# Ingestion job configuration
ingestion:
  image:
    repository: your-docker-repo/datahub-ingestion-trino
    tag: latest
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
    requests:
      cpu: "1"
      memory: "2Gi"

# Other configurations
datahub-gms:
  extraEnv:
    - name: DATAHUB_DEBUG
      value: "true"

# Add any other necessary configurations here
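
Note that the exact values.yaml key depends on your chart version: UI-based ingestion is typically executed by the datahub-actions pod, so if that is the pod doing the package installation, the image override may belong under the acryl-datahub-actions subchart instead. A hedged sketch using --set flags (key names assumed from the acryl datahub Helm chart; verify them against your chart's values.yaml, and make sure the custom image still provides the entrypoint that pod expects):

# Point the actions/executor pod at the custom image (assumed key names)
helm upgrade --install datahub datahub/datahub \
  --set acryl-datahub-actions.image.repository=your-docker-repo/datahub-ingestion-trino \
  --set acryl-datahub-actions.image.tag=latest \
  -f values.yaml
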
  4. Verify the Deployment:
    After deploying the Helm chart, verify that the ingestion job is using the custom Docker image and that it has the necessary resources allocated.

    kubectl get pods
    kubectl describe pod <ingestion-pod-name>
    
  5. Monitor Logs:
    Monitor the logs of the ingestion job to ensure that it is running correctly.

    kubectl logs -f <ingestion-pod-name>
    

By following these steps, you can update the Helm chart to use the custom Docker image for the ingestion job, ensuring that all necessary dependencies are pre-installed.

Would you like more detailed steps on any of these points or further assistance with the Helm chart configuration?


<@U06TM7M40RK> I have tried to re-ingest the metadata from Trino and it's stuck again, with these logs in GMS:

2024/07/05 03:24:54 Waiting for: http://datahub-catalog-datahub-gms:8080/health
2024/07/05 03:24:54 Received 200 from http://datahub-catalog-datahub-gms:8080/health
No user action configurations found. Not starting user actions.
[2024-07-05 03:24:55,606] INFO     {datahub_actions.cli.actions:80} - DataHub Actions version: unavailable (installed editable via git)
[2024-07-05 03:24:56,265] WARNING  {datahub_actions.cli.actions:107} - Skipping pipeline datahub_slack_action as it is not enabled
[2024-07-05 03:24:56,267] WARNING  {datahub_actions.cli.actions:107} - Skipping pipeline datahub_teams_action as it is not enabled
[2024-07-05 03:24:56,267] INFO     {datahub_actions.cli.actions:123} - Action Pipeline with name 'ingestion_executor' is now running.

Based on the logs provided, the datahub-actions executor has started and the 'ingestion_executor' pipeline is running, so the ingestion job still appears to be getting stuck during the installation of a large number of Python packages. This can be due to several reasons, such as network issues, resource constraints, or package conflicts. Here are some steps to help troubleshoot and resolve the issue:

  1. Network Connectivity:
    Ensure that the machine or container running the ingestion job has stable network access to download the required packages. You can test this by trying to manually install a few packages using pip within the same environment.

  2. Custom Docker Image:
    Creating a custom Docker image with pre-installed dependencies can help avoid the need to install packages at runtime. Here’s an example Dockerfile:

    FROM linkedin/datahub-ingestion:latest
    
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    RUN pip install 'acryl-datahub[trino]'
    

    Build and push the Docker image to your container registry:

    docker build -t your-docker-repo/datahub-ingestion-trino:latest .
    docker push your-docker-repo/datahub-ingestion-trino:latest
    
  3. Update Helm values.yaml File:
    Modify the values.yaml file of your Helm chart to use the custom Docker image for the ingestion job.

    ingestion:
      image:
        repository: your-docker-repo/datahub-ingestion-trino
        tag: latest
    
  4. Deploy the Helm Chart:
    Apply the updated Helm chart to your Kubernetes cluster.

    helm upgrade --install datahub datahub/datahub -f values.yaml
    
  5. Increase Resource Limits:
    Ensure that the Kubernetes pod running the ingestion job has sufficient resources (CPU and memory) allocated. You can increase the resource limits in your Helm chart or Kubernetes deployment configuration.

  6. Enable Debug Logging:
    Enable debug logging for the ingestion job to get more detailed information about where it is getting stuck. You can do this by adding the --debug flag to the ingestion command or setting the DATAHUB_DEBUG environment variable to true.

    datahub --debug ingest -c your_recipe.yml
    
  7. Check Logs:
    Review the logs from the DataHub ingestion executor and the DataHub GMS (Generalized Metadata Service) to identify any specific error messages. This can provide insights into why the job is getting stuck.

  8. Retry Mechanism:
    If the issue is intermittent, consider implementing a retry mechanism in your ingestion job to handle temporary network issues.

Would you like more detailed steps on any of these points or further assistance with the Helm chart configuration?
