Schema Issue Causing Airbyte Pipeline Crashes

Original Slack Thread

Problem with ingestion: a schema created by Airbyte crashes the pipelines.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

Here’s the log:

```
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/entrypoints.py", line 188, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 185, in run_ingestion_and_check_upgrade
    ret = await ingestion_future
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 139, in run_pipeline_to_completion
    raise e
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 131, in run_pipeline_to_completion
    pipeline.run()
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 405, in run
    for wu in itertools.islice(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 147, in auto_stale_entity_removal
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/incremental_lineage_helper.py", line 116, in auto_incremental_lineage
    yield from stream
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 171, in auto_workunit_reporter
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 253, in auto_browse_path_v2
    for urn, batch in _batch_workunits_by_urn(stream):
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 391, in _batch_workunits_by_urn
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 184, in auto_materialize_referenced_tags
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/api/source_helpers.py", line 91, in auto_status_aspect
    for wu in stream:
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 468, in get_workunits_internal
    yield from self.extract_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/redshift.py", line 987, in extract_lineage
    lineage_extractor.populate_lineage(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 659, in populate_lineage
    table_renames, all_tables_set = self._process_table_renames(
  File "/tmp/datahub/ingest/venv-redshift-2b9c1ab97dc6cd7f/lib/python3.10/site-packages/datahub/ingestion/source/redshift/lineage.py", line 872, in _process_table_renames
    all_tables[database][schema].add(prev_name)
KeyError: '_airbyte_planetfone'
```
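
The last frame of the traceback indexes a nested mapping with a schema name that is not among its keys. The snippet below is a minimal sketch of that failure mode and of one possible guard. It is not the DataHub source and not the change made in the PR mentioned later in the thread: the helper names, the `users_tmp` table, and the shape of `all_tables` (database -> schema -> set of table names) are assumptions inferred from the failing statement alone.

```python
# Hypothetical stand-in for the mapping in the traceback:
# database -> schema -> set of table names.
all_tables = {"prod": {"public": {"orders"}}}


def add_prev_name_unguarded(tables, database, schema, prev_name):
    # Same shape as the failing statement: raises KeyError when the
    # schema (or database) has no entry in the mapping.
    tables[database][schema].add(prev_name)


def add_prev_name_guarded(tables, database, schema, prev_name):
    # setdefault creates the missing levels instead of raising.
    tables.setdefault(database, {}).setdefault(schema, set()).add(prev_name)


error = None
try:
    add_prev_name_unguarded(all_tables, "prod", "_airbyte_planetfone", "users_tmp")
except KeyError as exc:
    error = exc  # the missing key is the schema name, as in the log

add_prev_name_guarded(all_tables, "prod", "_airbyte_planetfone", "users_tmp")
```

If the mapping is built only from schemas that pass the recipe's filters, a table rename recorded for a filtered-out schema could hit exactly this lookup, which may be why denying the schema does not make the error go away. That reading is a guess from the traceback, not something the thread confirms.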

Ingestion via the DataHub UI
DataHub version 0.13.2
Data source: Redshift

I even deleted the schema once, and that didn’t help. Adding it to the schema_pattern deny list didn’t help either.

Here’s the YAML recipe for my connection:

```
source:
  type: redshift
  config:
    table_lineage_mode: stl_scan_based
    include_table_lineage: true
    include_tables: true
    database: prod
    password: '${datahub_redshift_prod_password}'
    profiling:
      enabled: true
      profile_table_level_only: false
    host_port: '<host>:5439'
    include_views: true
    stateful_ingestion:
      enabled: true
    schema_pattern:
      deny:
        - _airbyte_planetfone
    username: datahub
pipeline_name: '<name>'
```
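
For reference, here is a rough sketch of how an allow/deny filter like `schema_pattern` behaves. It assumes the patterns are regular expressions matched case-insensitively from the start of the schema name. The `schema_allowed` helper and the sample schema names are hypothetical and use only the standard library, so this illustrates the idea rather than DataHub's own implementation.

```python
import re


def schema_allowed(schema, allow=(".*",), deny=()):
    """Return True if the schema matches an allow regex and no deny regex."""
    # Deny wins over allow; re.match anchors at the start of the string only.
    if any(re.match(pattern, schema, re.IGNORECASE) for pattern in deny):
        return False
    return any(re.match(pattern, schema, re.IGNORECASE) for pattern in allow)


# Hypothetical schema list; only '_airbyte_planetfone' comes from the thread.
schemas = ["public", "_airbyte_planetfone", "_airbyte_other"]

# The deny entry from the recipe above filters out one schema.
kept_exact = [s for s in schemas if schema_allowed(s, deny=["_airbyte_planetfone"])]

# A broader regex would filter out every schema with the '_airbyte_' prefix.
kept_prefix = [s for s in schemas if schema_allowed(s, deny=["^_airbyte_.*"])]
```

Under these assumptions the recipe's deny entry does drop `_airbyte_planetfone` from the schemas that get ingested, so a crash that persists has to come from a code path that does not consult the pattern.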

I believe this was fixed by https://github.com/datahub-project/datahub/pull/9967

Can you try with the latest CLI version (0.13.2.2)?

I don’t understand why it’s not working then

Can you post the full ingestion logs from the run with 0.13.2.2?

Attachment: exec-urn_li_dataHubExecutionRequest_0d9f76a2-e850-4890-81ad-174a21e01ee7.log (plain text, 403 lines, posted 2024-05-17 16:16:38 UTC; preview truncated)

    ~~~~ Execution Summary - RUN_INGEST ~~~~
    Execution finished with errors.
    {'exec_id': '0d9f76a2-e850-4890-81ad-174a21e01ee7',
     'infos': ['2024-05-17 03:45:00.067485 INFO: Starting execution for task with name=RUN_INGEST',
               "2024-05-17 03:45:59.403357 INFO: Failed to execute 'datahub ingest', exit code 1",

<@U01GZEETMEZ> here it is