Troubleshooting Trino DataHub Ingestion Issues

Original Slack Thread

<@U06TM7M40RK>, currently facing an ingestion issue with DataHub deployed via the built-in Helm chart.
My ingestion recipe:

source:
    type: trino
    config:
        host_port: ‘trino://localhost:9050
        database: null
        username: null
        include_views: true
        include_tables: true
        profiling:
            enabled: true
            profile_table_level_only: true
        stateful_ingestion:
            enabled: true
        options:
            connect_args:
                http_scheme: http

Based on the details provided, it seems like there might be a configuration or connectivity issue with your Trino ingestion setup. Here are a few steps to troubleshoot and resolve the issue:

  1. Check Database Host Configuration:

    • Ensure that host_port is correctly specified. The format should be hostname:port without the trino:// prefix, for example localhost:9050. Also replace the unmatched smart quote (‘) with a straight quote, or drop the quotes entirely.
  2. Verify Network Connectivity:

    • Ensure that the machine or container running the ingestion job has network access to the Trino server. You can test this by pinging the Trino host or using a tool such as telnet or nc to check connectivity on the specified port.
  3. Correct YAML Formatting:

    • Ensure that the YAML indentation and formatting are correct; incorrect indentation leads to parsing errors.
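If telnet or nc is not available in the ingestion container, the same connectivity check in point 2 can be sketched in a few lines of Python (check_port is a hypothetical helper, not part of DataHub; the host and port come from the recipe above):

```python
import socket

def check_port(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refused connection, and timeout
        return False

if __name__ == "__main__":
    # Values taken from the recipe above; adjust for your deployment.
    ok = check_port("localhost", 9050)
    print("Trino reachable" if ok else "Trino unreachable at localhost:9050")
```

If this prints "unreachable", fix the network path or the Trino service itself before debugging the recipe.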

Here’s an example of a corrected ingestion recipe:

source:
    type: trino
    config:
        host_port: 'localhost:9050'
        database: null
        username: null
        include_views: true
        include_tables: true
        profiling:
            enabled: true
            profile_table_level_only: true
        stateful_ingestion:
            enabled: true
        options:
            connect_args:
                http_scheme: http
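The two host_port mistakes in the original recipe (the stray smart quote and the trino:// scheme prefix) can also be caught programmatically before submitting the recipe. A minimal sketch, where normalize_host_port is a hypothetical helper and not a DataHub API:

```python
def normalize_host_port(raw: str) -> str:
    """Return a plain "hostname:port" string: strip stray straight or
    smart quotes, then drop any URL scheme prefix."""
    cleaned = raw.strip().strip("'\"\u2018\u2019")  # ASCII and smart quotes
    for prefix in ("trino://", "http://", "https://"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return cleaned

# The broken value from the original recipe becomes the corrected one:
print(normalize_host_port("\u2018trino://localhost:9050"))  # prints localhost:9050
```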


Would you like more detailed steps on any of these troubleshooting points or further assistance on a specific part of the process?


Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


:information_source: We’re experimenting with a new AI assistant for questions in the <#CUMUWQU66> channel to help you track down helpful resources more quickly. Keep in mind that responses may not always be spot-on; feel free to ask it follow-up questions and don’t forget to provide its responses a :thumbsup: or :thumbsdown: so other folks in the Community know if you’re still looking for help on the topic!

To help you more effectively, could you please provide the following information?

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub (e.g., BigQuery)?

Based on the error message you provided, it seems like there might be an issue with the network connection or the database host configuration. Here are a few steps to troubleshoot and resolve the issue:

  1. Check Database Host Configuration:
    Ensure that the database host name is correctly specified in your ingestion configuration. The error Name or service not known typically indicates that the hostname cannot be resolved.

  2. Network Connectivity:
    Verify that the machine or container running the ingestion job has network access to the database server. You can test this by trying to ping the database host or using a tool like telnet to check connectivity on the database port.

  3. DNS Resolution:
    Ensure that the DNS settings are correctly configured on the machine or container running the ingestion job. Sometimes, DNS resolution issues can cause the hostname to be unresolvable.

  4. Database Service Availability:
    Confirm that the database service is running and accessible. You can try connecting to the database using a database client tool to ensure it is reachable.

  5. Retry Mechanism:
    If the issue is intermittent, consider implementing a retry mechanism in your ingestion job to handle temporary network issues.
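For point 5, a retry wrapper around the connection attempt can be sketched as follows (a generic backoff pattern, not a built-in DataHub feature):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0):
    """Call fn(); on OSError (covers DNS and connection failures),
    retry with exponential backoff before giving up."""
    for attempt in range(attempts):
        try:
            return fn()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

For example, wrap the initial connectivity probe: with_retries(lambda: socket.create_connection(("localhost", 9050), timeout=3)).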

Would you like more detailed steps on any of these troubleshooting points or further assistance on a specific part of the process?

Sources:

  • <Slack Thread on Ingestion Logs>
  • <Postgres | DataHub Ingestion Documentation>

<@U06TM7M40RK>

  1. UI
  2. 0.4.19
  3. Trino

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand:

<@U06TM7M40RK> the ingestion is stuck at:

```
Installing collected packages: webencodings, wcwidth, vininfo, pytz, python-stdnum, pure-eval, ptyprocess, phonenumbers, makefun, ipython-genutils, ipaddress, ijson, fastjsonschema, cymem, confluent-kafka, click-spinner, cached-property, zipp, wrapt, widgetsnbextension, websocket-client, wasabi, urllib3, tzlocal, tzdata, typing-extensions, traitlets, tqdm, tornado, toolz, toml, tinycss2, termcolor, tabulate, sqlparse, spacy-loggers, spacy-legacy, soupsieve, sniffio, smart-open, six, send2trash, ruamel.yaml, rstr, rpds-py, pyzmq, PyYAML, pyparsing, pygments, pycparser, pycountry, psutil, prompt-toolkit, prometheus-client, pexpect, pathlib-abc, parso, pandocfilters, packaging, numpy, nest-asyncio, mypy-extensions, murmurhash, multidict, mistune, MarkupSafe, marisa-trie, jupyterlab-widgets, jupyterlab-pygments, jsonref, jsonpointer, importlib-resources, idna, humanfriendly, greenlet, frozenlist, fastavro, expandvars, executing, exceptiongroup, entrypoints, defusedxml, decorator, debugpy, colorama, click, charset-normalizer, certifi, catalogue, avro, attrs, async-timeout, acryl-sqlglot, yarl, typing-inspect, typer, terminado, srsly, sqlalchemy, sentry-sdk, scipy, schwifty, requests, referencing, python-utils, python-dateutil, pydantic, preshed, matplotlib-inline, marshmallow, language-data, jupyter-core, jsonpatch, jinja2, jedi, importlib-metadata, Deprecated, comm, click-default-group, cffi, blis, bleach, beautifulsoup4, avro-gen3, asttokens, anyio, aiosignal, trino, stack-data, requests-file, progressbar2, pathy, pandas, mixpanel, langcodes, jupyter-client, jsonschema-specifications, docker, cryptography, confection, argon2-cffi-bindings, aiohttp, thinc, jsonschema, Ipython, argon2-cffi, spacy, nbformat, ipywidgets, ipykernel, altair, acryl-datahub, nbclient, acryl-datahub-classify, nbconvert, jupyter-server, notebook-shim, nbclassic, notebook, great-expectations
```

<@U06TM7M40RK> the ingestion job is still stuck at running; the latest log shows the same `Installing collected packages: ...` output as above.

<@U06TM7M40RK> any advice?

<@U06TM7M40RK> how can I find the Kubernetes service name within the cluster?