Hi! I'm trying to ingest Hive metadata from a Cloudera Data Warehouse (CDW on CDP Public Cloud). Cloudera exposes Hive in HTTP transport mode, and I can connect successfully with a JDBC client (e.g. DBeaver). However, when I try to ingest from the DataHub quickstart, I get an error. The logs and the parameters I'm using are attached in the comments. Can you help me solve this?
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (e.g. 0.12.0)
- What data source(s) are you integrating with DataHub? (e.g. BigQuery)
```
type: hive
config:
  host_port: 'cdw_host:443'
  database: default
  username: user_x
  stateful_ingestion:
    enabled: true
  password: password_x
  scheme: hive+https
```
[Attached log: exec-urn_li_dataHubExecutionRequest_7b60273b-417b-4935-97ac-d6e2af119a8e.log: "~~~~ Execution Summary - RUN_INGEST ~~~~ / Execution finished with errors. / ... Failed to execute 'datahub ingest', exit code 1"]
- UI
- 0.13.1
- Hive from CDW of CDP
We haven't explicitly been testing with Cloudera, but it might be that the new release of thrift 0.20.0 broke it. Could you try pinning to thrift<0.20.0 and see if that makes ingestion run?
How can I change the thrift version in the Docker quickstart? (I've installed the DataHub CLI from pip.)
Are you running ingestion through the UI? If so, you can use the "Advanced" section on the last step of the ingestion setup to specify an extra pip requirement.
ok thanks!
I've done that, but now I'm receiving another error:
```
Traceback (most recent call last):
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
sys.exit(datahub(standalone_mode=False, **kwargs))
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
raise e
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
res = func(*args, **kwargs)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
return future.result()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
ret = await ingestion_future
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
raise e
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
pipeline.run()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 405, in run
for wu in itertools.islice(
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_stale_entity_removal
for wu in stream:
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/incremental_lineage_helper.py", line 113, in auto_incremental_lineage
yield from stream
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 171, in auto_workunit_reporter
for wu in stream:
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 253, in auto_browse_path_v2
for urn, batch in _batch_workunits_by_urn(stream):
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
for wu in stream:
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
for wu in stream:
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
for wu in stream:
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/source/sql/sql_common.py", line 540, in get_workunits_internal
for inspector in self.get_inspectors():
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/datahub/ingestion/source/sql/two_tier_sql_source.py", line 119, in get_inspectors
with engine.connect() as conn:
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3325, in connect
return self._connection_cls(self, close_with_result=close_with_result)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 96, in __init__
else engine.raw_connection()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3404, in raw_connection
return self._wrap_pool_connect(self.pool.connect, _connection)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/base.py", line 3371, in _wrap_pool_connect
return fn()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 327, in connect
return _ConnectionFairy._checkout(self)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 894, in _checkout
fairy = _ConnectionRecord.checkout(pool)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 493, in checkout
rec = pool._do_get()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 145, in _do_get
with util.safe_reraise():
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/impl.py", line 143, in _do_get
return self._create_connection()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 273, in _create_connection
return _ConnectionRecord(self)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 388, in __init__
self.__connect()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 690, in __connect
with util.safe_reraise():
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", line 70, in __exit__
compat.raise_(
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 211, in raise_
raise exception
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/pool/base.py", line 686, in __connect
self.dbapi_connection = connection = pool._invoke_creator(self)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/create.py", line 574, in connect
return dialect.connect(*cargs, **cparams)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/sqlalchemy/engine/default.py", line 598, in connect
return self.dbapi.connect(*cargs, **cparams)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/pyhive/hive.py", line 174, in connect
return Connection(*args, **kwargs)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/pyhive/hive.py", line 308, in __init__
response = self._client.OpenSession(open_session_req)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 186, in OpenSession
self.send_OpenSession(req)
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/TCLIService/TCLIService.py", line 195, in send_OpenSession
self._oprot.trans.flush()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/pyhive/hive.py", line 142, in flush
super().flush()
File "/tmp/datahub/ingest/venv-hive-b7d7de67216d4ea0/lib/python3.10/site-packages/thrift/transport/THttpClient.py", line 191, in flush
self.__http.putheader('Cookie', self.headers['Set-Cookie'])
File "/usr/local/lib/python3.10/http/client.py", line 1245, in putheader
raise CannotSendHeader()
http.client.CannotSendHeader
```
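For context on the final exception: Python's `http.client` only accepts `putheader()` while a request is open (after `putrequest()` and before `endheaders()`). The traceback shows thrift's `THttpClient.flush()` replaying the server's `Set-Cookie` header at a point where the underlying connection is not in that state. A minimal stdlib illustration of the state rule (not Cloudera-specific):

```python
# http.client raises CannotSendHeader when putheader() is called outside an
# open request (i.e. before putrequest() or after endheaders()).
import http.client

conn = http.client.HTTPConnection("example.com")  # nothing is sent yet
try:
    conn.putheader("Cookie", "session=abc")  # no putrequest() -> wrong state
    print("header accepted")
except http.client.CannotSendHeader:
    print("CannotSendHeader: connection is not in request-started state")
```

So the error is about *when* the Cookie header is written, not about the endpoint rejecting it, which is consistent with a transport-layer bug in the thrift/pyhive cookie handling rather than a credentials problem.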
Got it - then I'm not sure. I don't have access to a Cloudera instance, so it's difficult to test, but I do know this code has been working for folks using Hive/Presto.
One thing that may matter is that Cloudera exposes the Hive/Impala connection over the HTTPS protocol: https://docs.cloudera.com/machine-learning/cloud/import-data/topics/ml-access-cdw-from-cml.html
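Since the endpoint is plain HTTPS, one way to isolate the problem from DataHub is to connect to HiveServer2 directly with pyhive over a thrift HTTP transport. This is a sketch under assumptions: the `/cliservice` path, host, and credentials are placeholders, and HTTP basic auth is assumed to match whatever worked in DBeaver.

```python
# Hedged sketch: Cloudera CDW exposes HiveServer2 over HTTPS ("http" transport
# mode), so the thrift endpoint is an HTTPS URL rather than a raw socket.
# The "/cliservice" path and all names below are assumptions/placeholders.
import base64


def hive_http_endpoint(host: str, port: int, path: str = "/cliservice") -> str:
    """Build the HTTPS URL a thrift THttpClient transport would target."""
    return f"https://{host}:{port}{path}"


def basic_auth_header(username: str, password: str) -> dict:
    """HTTP basic-auth header, as a JDBC client like DBeaver would send it."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"Authorization": f"Basic {token}"}


print(hive_http_endpoint("cdw_host", 443))  # https://cdw_host:443/cliservice

# With pyhive and thrift installed, a direct connection (outside DataHub)
# could then look like this -- requires a reachable endpoint:
# from thrift.transport import THttpClient
# from pyhive import hive
# transport = THttpClient.THttpClient(hive_http_endpoint("cdw_host", 443))
# transport.setCustomHeaders(basic_auth_header("user_x", "password_x"))
# conn = hive.connect(thrift_transport=transport)
```

If this direct pyhive connection reproduces the same `CannotSendHeader`, that would confirm the problem is in the pyhive/thrift HTTP transport rather than in the DataHub recipe.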