Troubleshooting a Trino Data Ingestion Issue in the DataHub UI

Original Slack Thread

hi all, <@U06TM7M40RK>
I have been facing this issue when I'm trying to ingest data from Trino in the DataHub UI. The job is stuck after
Installing collected packages: webencodings, wcwidth, vininfo, pytz, python-stdnum, pure-eval, ptyprocess, phonenumbers, makefun, ipython-genutils, ipaddress, ijson, fastjsonschema, cymem, confluent-kafka, click-spinner, cached-property, zipp, wrapt, widgetsnbextension, websocket-client, wasabi, urllib3, tzlocal, tzdata, typing-extensions, traitlets, tqdm, tornado, toolz, toml, tinycss2, termcolor, tabulate, sqlparse, spacy-loggers, spacy-legacy, soupsieve, sniffio, smart-open, six, send2trash, ruamel.yaml, rstr, rpds-py, pyzmq, PyYAML, pyparsing, pygments, pycparser, pycountry, psutil, prompt-toolkit, prometheus-client, pexpect, pathlib-abc, parso, pandocfilters, packaging, numpy, nest-asyncio, mypy-extensions, murmurhash, multidict, mistune, MarkupSafe, marisa-trie, jupyterlab-widgets, jupyterlab-pygments, jsonref, jsonpointer, importlib-resources, idna, humanfriendly, greenlet, frozenlist, fastavro, expandvars, executing, exceptiongroup, entrypoints, defusedxml, decorator, debugpy, colorama, click, charset-normalizer, certifi, catalogue, avro, attrs, async-timeout, acryl-sqlglot, yarl, typing-inspect, typer, terminado, srsly, sqlalchemy, sentry-sdk, scipy, schwifty, requests, referencing, python-utils, python-dateutil, pydantic, preshed, matplotlib-inline, marshmallow, language-data, jupyter-core, jsonpatch, jinja2, jedi, importlib-metadata, Deprecated, comm, click-default-group, cffi, blis, bleach, beautifulsoup4, avro-gen3, asttokens, anyio, aiosignal, trino, stack-data, requests-file, progressbar2, pathy, pandas, mixpanel, langcodes, jupyter-client, jsonschema-specifications, docker, cryptography, confection, argon2-cffi-bindings, aiohttp, thinc, jsonschema, Ipython, argon2-cffi, spacy, nbformat, ipywidgets, ipykernel, altair, acryl-datahub, nbclient, acryl-datahub-classify, nbconvert, jupyter-server, notebook-shim, nbclassic, notebook, great-expectations
The total log is attached here as well.

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand:

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Which DataHub version are you using? (e.g. 0.12.0)
  2. Please post any relevant error logs on the thread!

<@U06TM7M40RK>

  1. 0.4.19
  2. The job is stuck at
    Installing collected packages: webencodings, wcwidth, vininfo, … (the same package list as in the original message above)

To address the issue of your Trino ingestion job getting stuck during the installation of collected packages in DataHub UI, here are some steps and considerations based on the provided data sources:

  1. Check Network Connectivity:
    Ensure that the machine or container running the ingestion job has network access to download the required packages. Network issues can cause the installation process to hang.
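
    For example, a quick way to confirm that PyPI is reachable from inside the pod that runs the ingestion is the following sketch (<your-ingestion-pod-name> is a placeholder for the actual executor/actions pod in your namespace):

    kubectl exec -it <your-ingestion-pod-name> -- /bin/bash
    # Inside the pod: a small test install; if this hangs, the problem is network access to PyPI
    pip install --timeout 60 trino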

  2. Custom Docker Image:
    Create a custom Docker image that includes the necessary dependencies for Trino ingestion. This can help avoid the need to install packages at runtime.

    FROM linkedin/datahub-ingestion:latest

    # System and Python ODBC dependencies (only needed if you also ingest ODBC-based sources)
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    # Pre-install the Trino plugin so it does not have to be downloaded at runtime
    RUN pip install 'acryl-datahub[trino]'
    

    Build the Docker image with the following command:

    docker build -f Dockerfile -t datahub-ingestion-trino .
    
  3. Helm Chart Configuration:
    If you are using Helm for deployment, ensure that your Helm chart is configured to use the custom Docker image. Update the values.yaml file to point to your custom image.

    image:
      repository: your-docker-repo/datahub-ingestion-trino
      tag: latest
    
  4. Increase Resource Limits:
    Ensure that the Kubernetes pod running the ingestion job has sufficient resources (CPU and memory) allocated. You can increase the resource limits in your Helm chart or Kubernetes deployment configuration.
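
    As a quick check of what the pod is actually getting, the commands below can help (a sketch; the pod name is a placeholder, and kubectl top requires the metrics-server add-on):

    # Show configured requests/limits and any OOM or eviction events
    kubectl describe pod <your-ingestion-pod-name>
    # Show current CPU/memory usage
    kubectl top pod <your-ingestion-pod-name>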

  5. Enable Debug Logging:
    Enable debug logging for the ingestion job to get more detailed information about where it is getting stuck. You can do this by adding the --debug flag to the ingestion command or setting the DATAHUB_DEBUG environment variable to true.

    datahub --debug ingest -c your_recipe.yml
    
  6. Check Logs:
    Review the logs from the DataHub ingestion executor and the DataHub GMS (Generalized Metadata Service) to identify any specific error messages. This can provide insights into why the job is getting stuck.
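
    A minimal sketch of pulling both sets of logs (pod names are placeholders; use whatever kubectl get pods shows in your namespace):

    # Executor/actions pod: shows the pip install progress for UI-based ingestion
    kubectl logs -f <datahub-actions-pod-name>
    # GMS pod: shows the ingestion requests being recorded
    kubectl logs -f <datahub-gms-pod-name>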

Would you like more detailed steps on any of these points or further assistance on a specific part of the process?


<@U06TM7M40RK> the GMS logs shows the following

, startTimeMs=1720145566850, status=RUNNING}, systemMetadata={lastObserved=1720145627181, runId=no-run-id-provided}}]
2024-07-05 02:13:47,214 [qtp1908143486-328] INFO  c.l.m.entity.EntityServiceImpl:2219 - Ingesting aspect with name dataHubExecutionRequestResult, urn urn:li:dataHubExecutionRequest:e052d171-3eed-464e-8f30-860749450072
2024-07-05 02:13:47,230 [qtp1908143486-328] INFO  c.l.m.entity.EntityServiceImpl:639 - Ingestion of aspects batch to database took 36 ms
2024-07-05 02:13:47,230 [qtp1908143486-328] INFO  c.l.m.entity.EntityServiceImpl:1534 - Skipped producing MCL for ingested aspect dataHubExecutionRequestResult, urn urn:li:dataHubExecutionRequest:e052d171-3eed-464e-8f30-860749450072. Aspect has not changed.
2024-07-05 02:13:47,232 [pool-12-thread-9] INFO  c.l.m.filter.RestliLoggingFilter:52 - POST /aspects?action=ingestProposal - ingestProposal - 200 - 56ms
2024-07-05 02:13:50,743 [qtp1908143486-327] INFO  c.l.m.r.entity.AspectResource:122 - GET ASPECT urn: urn:li:telemetry:clientId aspect: telemetryClientId version: 0
2024-07-05 02:13:50,776 [pool-12-thread-2] INFO  c.l.m.filter.RestliLoggingFilter:52 - GET /aspects/urn%3Ali%3Atelemetry%3AclientId?aspect=telemetryClientId&version=0 - get - 200 - 33ms

Based on the provided GMS logs, GMS itself looks healthy (the ingestion requests are returning 200), so the ingestion job still appears to be getting stuck during the installation of a large number of Python packages. This can be due to several reasons, such as network issues, resource constraints, or package conflicts. Here are some steps to help troubleshoot and resolve the issue:

  1. Network Connectivity:
    Ensure that the machine or container running the ingestion job has stable network access to download the required packages. You can test this by trying to manually install a few packages using pip within the same environment.

  2. Custom Docker Image:
    Creating a custom Docker image with pre-installed dependencies can help avoid the need to install packages at runtime. Here’s an example Dockerfile:

    FROM linkedin/datahub-ingestion:latest
    
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    RUN pip install 'acryl-datahub[trino]'
    

    Build the Docker image with the following command:

    docker build -f Dockerfile -t datahub-ingestion-trino .
    
  3. Helm Chart Configuration:
    If you are using Helm for deployment, ensure that your Helm chart is configured to use the custom Docker image. Update the values.yaml file to point to your custom image.

    image:
      repository: your-docker-repo/datahub-ingestion-trino
      tag: latest
    
  4. Increase Resource Limits:
    Ensure that the Kubernetes pod running the ingestion job has sufficient resources (CPU and memory) allocated. You can increase the resource limits in your Helm chart or Kubernetes deployment configuration.

  5. Enable Debug Logging:
    Enable debug logging for the ingestion job to get more detailed information about where it is getting stuck. You can do this by adding the --debug flag to the ingestion command or setting the DATAHUB_DEBUG environment variable to true.

    datahub --debug ingest -c your_recipe.yml
    
  6. Check Logs:
    Review the logs from the DataHub ingestion executor and the DataHub GMS (Generalized Metadata Service) to identify any specific error messages. This can provide insights into why the job is getting stuck.

  7. Retry Mechanism:
    If the issue is intermittent, consider implementing a retry mechanism in your ingestion job to handle temporary network issues.
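
    Since the hang happens inside pip itself, one option is to make pip more tolerant of a flaky network via its standard environment variables (a sketch; these would need to be set in the environment of the container that runs the install):

    # pip maps PIP_* environment variables onto its command-line options
    export PIP_RETRIES=10           # retry failed downloads up to 10 times
    export PIP_DEFAULT_TIMEOUT=120  # allow slow connections more time before failing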

Would you like more detailed steps on any of these points or further assistance on a specific part of the process?


<@U06TM7M40RK> how can I edit the Helm chart to include the datahub-ingestion-trino image?

To edit the Helm chart to include the datahub-ingestion-trino Docker image, you need to update the values.yaml file to specify the custom Docker image for the ingestion job. Here are the detailed steps:

  1. Create a Custom Docker Image:
    First, create a custom Docker image that includes the necessary dependencies for Trino ingestion.

    FROM linkedin/datahub-ingestion:latest
    
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    RUN pip install 'acryl-datahub[trino]'
    

    Build and push the Docker image to your container registry:

    docker build -t your-docker-repo/datahub-ingestion-trino:latest .
    docker push your-docker-repo/datahub-ingestion-trino:latest
    
  2. Update Helm values.yaml File:
    Modify the values.yaml file of your Helm chart to use the custom Docker image for the ingestion job.

    ingestion:
      image:
        repository: your-docker-repo/datahub-ingestion-trino
        tag: latest
    
  3. Deploy the Helm Chart:
    Apply the updated Helm chart to your Kubernetes cluster.

    helm upgrade --install datahub datahub/datahub -f values.yaml
    

Here is a more detailed example of the values.yaml file with the necessary changes:

# values.yaml

# Ingestion job configuration
ingestion:
  image:
    repository: your-docker-repo/datahub-ingestion-trino
    tag: latest
  resources:
    limits:
      cpu: "2"
      memory: "4Gi"
    requests:
      cpu: "1"
      memory: "2Gi"

# Other configurations
datahub-gms:
  extraEnv:
    - name: DATAHUB_DEBUG
      value: "true"

# Add any other necessary configurations here
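
Note that the exact values.yaml key depends on your chart version: UI-based ingestion is typically executed by the datahub-actions pod, so if that is the pod doing the package installation, the image override may belong under the acryl-datahub-actions subchart instead. A hedged sketch using --set flags (key names assumed from the acryl datahub Helm chart; verify them against your chart's values.yaml, and make sure the custom image still provides the entrypoint that pod expects):

# Point the actions/executor pod at the custom image (assumed key names)
helm upgrade --install datahub datahub/datahub \
  --set acryl-datahub-actions.image.repository=your-docker-repo/datahub-ingestion-trino \
  --set acryl-datahub-actions.image.tag=latest \
  -f values.yaml
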
  4. Verify the Deployment:
    After deploying the Helm chart, verify that the ingestion job is using the custom Docker image and that it has the necessary resources allocated.

    kubectl get pods
    kubectl describe pod <ingestion-pod-name>
    
  5. Monitor Logs:
    Monitor the logs of the ingestion job to ensure that it is running correctly.

    kubectl logs -f <ingestion-pod-name>
    

By following these steps, you can update the Helm chart to use the custom Docker image for the ingestion job, ensuring that all necessary dependencies are pre-installed.

Would you like more detailed steps on any of these points or further assistance with the Helm chart configuration?


<@U06TM7M40RK> I have tried to re-ingest the metadata from Trino and it's stuck again, with these logs in GMS:

2024/07/05 03:24:54 Waiting for: http://datahub-catalog-datahub-gms:8080/health
2024/07/05 03:24:54 Received 200 from http://datahub-catalog-datahub-gms:8080/health
No user action configurations found. Not starting user actions.
[2024-07-05 03:24:55,606] INFO     {datahub_actions.cli.actions:80} - DataHub Actions version: unavailable (installed editable via git)
[2024-07-05 03:24:56,265] WARNING  {datahub_actions.cli.actions:107} - Skipping pipeline datahub_slack_action as it is not enabled
[2024-07-05 03:24:56,267] WARNING  {datahub_actions.cli.actions:107} - Skipping pipeline datahub_teams_action as it is not enabled
[2024-07-05 03:24:56,267] INFO     {datahub_actions.cli.actions:123} - Action Pipeline with name 'ingestion_executor' is now running.

Based on the logs provided, the datahub-actions executor has started and the 'ingestion_executor' pipeline is running, so the ingestion job still appears to be getting stuck during the installation of a large number of Python packages. This can be due to several reasons, such as network issues, resource constraints, or package conflicts. Here are some steps to help troubleshoot and resolve the issue:

  1. Network Connectivity:
    Ensure that the machine or container running the ingestion job has stable network access to download the required packages. You can test this by trying to manually install a few packages using pip within the same environment.

  2. Custom Docker Image:
    Creating a custom Docker image with pre-installed dependencies can help avoid the need to install packages at runtime. Here’s an example Dockerfile:

    FROM linkedin/datahub-ingestion:latest
    
    RUN apt-get update && apt-get install -y unixodbc-dev
    RUN pip install pyodbc
    RUN pip install 'acryl-datahub[trino]'
    

    Build and push the Docker image to your container registry:

    docker build -t your-docker-repo/datahub-ingestion-trino:latest .
    docker push your-docker-repo/datahub-ingestion-trino:latest
    
  3. Update Helm values.yaml File:
    Modify the values.yaml file of your Helm chart to use the custom Docker image for the ingestion job.

    ingestion:
      image:
        repository: your-docker-repo/datahub-ingestion-trino
        tag: latest
    
  4. Deploy the Helm Chart:
    Apply the updated Helm chart to your Kubernetes cluster.

    helm upgrade --install datahub datahub/datahub -f values.yaml
    
  5. Increase Resource Limits:
    Ensure that the Kubernetes pod running the ingestion job has sufficient resources (CPU and memory) allocated. You can increase the resource limits in your Helm chart or Kubernetes deployment configuration.

  6. Enable Debug Logging:
    Enable debug logging for the ingestion job to get more detailed information about where it is getting stuck. You can do this by adding the --debug flag to the ingestion command or setting the DATAHUB_DEBUG environment variable to true.

    datahub --debug ingest -c your_recipe.yml
    
  7. Check Logs:
    Review the logs from the DataHub ingestion executor and the DataHub GMS (Generalized Metadata Service) to identify any specific error messages. This can provide insights into why the job is getting stuck.

  8. Retry Mechanism:
    If the issue is intermittent, consider implementing a retry mechanism in your ingestion job to handle temporary network issues.

Would you like more detailed steps on any of these points or further assistance with the Helm chart configuration?
