Ingesting Hive Metadata from Cloudera Data Warehouse in DataHub: Seeking Help to Resolve Errors

Original Slack Thread

Hi! I'm trying to ingest Hive metadata from a Cloudera Data Warehouse (on CDP in the public cloud). Cloudera exposes Hive with the HTTP transport mode, and I can successfully connect with a JDBC client (for example, DBeaver). However, when I try to ingest from the DataHub quickstart, I receive an error. The attached logs and the parameters I'm using are in the comments. Can you help me resolve this?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)
```
type: hive
config:
    host_port: 'cdw_host:443'
    database: default
    username: user_x
    stateful_ingestion:
        enabled: true
    password: password_x
    scheme: hive+https
```

Attached log: exec-urn_li_dataHubExecutionRequest_7b60273b-417b-4935-97ac-d6e2af119a8e.log

```
~~~~ Execution Summary - RUN_INGEST ~~~~
Execution finished with errors.
{'exec_id': '7b60273b-417b-4935-97ac-d6e2af119a8e',
 'infos': ['2024-05-14 12:14:42.722784 INFO: Starting execution for task with name=RUN_INGEST',
           "2024-05-14 12:14:48.787677 INFO: Failed to execute 'datahub ingest', exit code 1",
```
  1. UI
  2. 0.13.1
  3. Hive from CDW of CDP

We haven’t explicitly been testing with Cloudera, but it might be that the new thrift 0.20.0 release broke it. Could you try pinning to thrift<0.20.0 and see if that makes ingestion run?

How can I change the thrift version in the Docker quickstart? (I've installed the DataHub CLI from pip.)

Are you using ingestion through the UI? If so, you can use the “Advanced” section on the last step of the setup to specify an extra pip requirement.
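For reference, a minimal sketch of what that pin could look like; the exact UI field label varies by DataHub version, and the CLI command assumes the Hive plugin is installed in the same pip environment:

```
# In the UI, on the last step of the recipe setup, open "Advanced" and add the
# extra pip requirement (field label may vary by version):
thrift<0.20.0

# For a locally pip-installed CLI, pin it in the same environment instead:
pip install 'acryl-datahub[hive]' 'thrift<0.20.0'
```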

ok thanks!

I've done that, but now I'm receiving another error:

```
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
    ret = await ingestion_future
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
    raise e
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
    pipeline.run()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 405, in run
    for wu in itertools.islice(
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_stale_entity_removal
    for wu in stream:
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/incremental_lineage_helper.py", line 113, in auto_incremental_lineage
    yield from stream
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 171, in auto_workunit_reporter
    for wu in stream:
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 253, in auto_browse_path_v2
    for urn, batch in _batch_workunits_by_urn(stream):
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
    for wu in stream:
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 540, in get_workunits_internal
    for inspector in self.get_inspectors():
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/source/sql/two_tier_sql_source.py", line 119, in get_inspectors
    with engine.connect() as conn:
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3325, in connect
    return self._connection_cls(self, close_with_result=close_with_result)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
    else engine.raw_connection()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3404, in raw_connection
    return self._wrap_pool_connect(self.pool.connect, _connection)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
    return fn()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 327, in connect
    return _ConnectionFairy._checkout(self)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
    fairy = _ConnectionRecord.checkout(pool)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 493, in checkout
    rec = pool._do_get()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 145, in _do_get
    with util.safe_reraise():
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
    return self._create_connection()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 273, in _create_connection
    return _ConnectionRecord(self)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 388, in __init__
    self.__connect()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 690, in __connect
    with util.safe_reraise():
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
    raise exception
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
    self.dbapi_connection = connection = pool._invoke_creator(self)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 574, in connect
    return dialect.connect(*cargs, **cparams)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 598, in connect
    return self.dbapi.connect(*cargs, **cparams)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/pyhive/hive.py", line 174, in connect
    return Connection(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/pyhive/hive.py", line 308, in __init__
    response = self._client.OpenSession(open_session_req)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 186, in OpenSession
    self.send_OpenSession(req)
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 195, in send_OpenSession
    self._oprot.trans.flush()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/pyhive/hive.py", line 142, in flush
    super().flush()
  File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/thrift/transport/THttpClient.py", line 191, in flush
    self.__http.putheader('Cookie', self.headers['Set-Cookie'])
  File "/usr/local/lib/python3.10/http/client.py", line 1245, in putheader
    raise CannotSendHeader()
http.client.CannotSendHeader
```

Got it - then I'm not sure. I don't have access to a Cloudera instance, so it's difficult to test, but I do know this code has been working for folks using Hive/Presto.

One thing that may matter is that Cloudera exposes the Hive/Impala connection over the HTTPS protocol: https://docs.cloudera.com/machine-learning/cloud/import-data/topics/ml-access-cdw-from-cml.html
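To help isolate whether the failure is in the DataHub recipe or in the PyHive/Thrift HTTP transport itself, a minimal standalone connectivity check can be run outside of DataHub. This is a sketch only: the /cliservice path, port 443, and HTTP Basic auth are assumptions based on a typical CDW HTTP-mode endpoint and the credentials from the recipe above; adjust them to match the working JDBC connection.

```python
# Minimal PyHive-over-HTTPS connectivity check (sketch; adjust host, path, and auth).
import base64

from pyhive import hive
from thrift.transport import THttpClient

# Assumption: CDW exposes HiveServer2 at /cliservice on port 443 (same host as the recipe).
transport = THttpClient.THttpClient('https://cdw_host:443/cliservice')

# Assumption: the endpoint accepts HTTP Basic auth with the same credentials as the JDBC client.
credentials = base64.b64encode(b'user_x:password_x').decode()
transport.setCustomHeaders({'Authorization': f'Basic {credentials}'})

connection = hive.connect(thrift_transport=transport)
cursor = connection.cursor()
cursor.execute('SHOW DATABASES')
print(cursor.fetchall())
```

If this standalone check hits the same http.client.CannotSendHeader error, the problem likely sits in the PyHive/Thrift HTTP transport stack rather than in the DataHub recipe itself.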