Problem with ingestion: a schema created by Airbyte crashes my pipelines.
Hey there! Make sure your message includes the following information if relevant, so we can help more effectively!
- Are you using UI or CLI for ingestion?
- Which DataHub version are you using? (e.g. 0.12.0)
- What data source(s) are you integrating with DataHub? (e.g. BigQuery)
Here’s the log:
```
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
    ret = await ingestion_future
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
    raise e
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
    pipeline.run()
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 405, in run
    for wu in itertools.islice(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_stale_entity_removal
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/incremental_lineage_helper.py", line 116, in auto_incremental_lineage
    yield from stream
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 171, in auto_workunit_reporter
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 253, in auto_browse_path_v2
    for urn, batch in _batch_workunits_by_urn(stream):
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 468, in get_workunits_internal
    yield from self.extract_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 987, in extract_lineage
    lineage_extractor.populate_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 659, in populate_lineage
    table_renames, all_tables_set = self._process_table_renames(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 872, in _process_table_renames
    all_tables[database][schema].add(prev_name)
KeyError: '_airbyte_planetfone'
```
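For readers following along, the last frame of the traceback indexes a nested mapping by database and then by schema, and the `KeyError` names the Airbyte schema. One plausible reading is that a table rename refers to a schema that is missing from the set of tables collected for this run. The sketch below only illustrates that failure mode and a defensive lookup; the dictionary contents and both helper functions are hypothetical, and this is not DataHub's actual code nor necessarily what the upstream fix does.

```python
from typing import Dict, Set

# Hypothetical stand-in for the tables collected during ingestion,
# shaped as all_tables[database][schema] -> set of table names.
# The schema "_airbyte_planetfone" is deliberately absent.
all_tables: Dict[str, Dict[str, Set[str]]] = {
    "prod": {"public": {"orders"}},
}


def record_rename_unsafe(database: str, schema: str, prev_name: str) -> None:
    # Same shape as the failing line in the traceback: raises KeyError
    # when the schema was never collected.
    all_tables[database][schema].add(prev_name)


def record_rename_safe(database: str, schema: str, prev_name: str) -> bool:
    # Skip renames that point at a database or schema we did not collect.
    tables = all_tables.get(database, {}).get(schema)
    if tables is None:
        return False
    tables.add(prev_name)
    return True


try:
    record_rename_unsafe("prod", "_airbyte_planetfone", "tmp_table")
    unsafe_error = None
except KeyError as exc:
    unsafe_error = exc

skipped = not record_rename_safe("prod", "_airbyte_planetfone", "tmp_table")
recorded = record_rename_safe("prod", "public", "orders_old")
```

If this reading is right, it would also be consistent with the crash persisting after the schema is denied or dropped, since rename history could still mention a schema that is no longer collected. That is an inference from the traceback, not something the logs confirm.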
DataHub UI
v0.13.2
Redshift
I even deleted the schema once, and that didn’t help. Adding it to the deny pattern didn’t help either.
Here’s the YAML of my connection:
```
type: redshift
config:
    table_lineage_mode: stl_scan_based
    include_table_lineage: true
    include_tables: true
    database: prod
    password: '${datahub_redshift_prod_password}'
    profiling:
        enabled: true
        profile_table_level_only: false
    host_port: '<host>:5439'
    include_views: true
    stateful_ingestion:
        enabled: true
    schema_pattern:
        deny:
            - _airbyte_planetfone
    username: datahub
pipeline_name: '<name>'
```
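As a side note on the `schema_pattern.deny` entry in the recipe: the sketch below shows how an allow/deny filter of this kind is typically evaluated, assuming the entries are regular expressions matched case-insensitively from the start of the schema name, which is my understanding of how DataHub treats these patterns. The `is_allowed` helper is hypothetical and is not DataHub's API.

```python
import re
from typing import List


def is_allowed(name: str, allow: List[str], deny: List[str]) -> bool:
    # A name is rejected as soon as any deny regex matches from its start.
    if any(re.match(pattern, name, re.IGNORECASE) for pattern in deny):
        return False
    # Otherwise it must match at least one allow regex.
    return any(re.match(pattern, name, re.IGNORECASE) for pattern in allow)


allow = [".*"]  # assumed default: allow everything
deny = ["_airbyte_planetfone"]  # the entry from the recipe

schemas = ["public", "_airbyte_planetfone", "_airbyte_planetfone_tmp"]
results = {schema: is_allowed(schema, allow, deny) for schema in schemas}

# Anchoring the pattern restricts the deny rule to the exact schema name.
exact = is_allowed("_airbyte_planetfone_tmp", allow, ["^_airbyte_planetfone$"])
```

Under these assumptions the unanchored entry also rejects any schema whose name merely starts with `_airbyte_planetfone`. Either way, the deny rule only filters which schemas are collected, which may explain why it has no effect on an error raised later during lineage extraction.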
I believe this was fixed by https://github.com/datahub-project/datahub/pull/9967
Can you try with the latest CLI version (0.13.2.2)?
I don’t understand why it’s still not working, then.
Can you post the full ingestion logs from the run with 0.13.2.2?
Attachment (plain-text snippet, 2024-05-17 16:16:38+00:00): exec-urn_li_dataHubExecutionRequest_0d9f76a2-e850-4890-81ad-174a21e01ee7.log
~~~~ Execution Summary - RUN_INGEST
Execution finished with errors.
{'exec_id': '0d9f76a2-e850-4890-81ad-174a21e01ee7',
'infos': ['2024-05-17 03:45:00.067485 INFO: Starting execution for task with name=RUN_INGEST',
"2024-05-17 03:45:59.403357 INFO: Failed to execute 'datahub ingest', exit code 1",
<@U01GZEETMEZ> here it is