Troubleshooting Redshift Source Initialization Issue in DataHub Ingestion

Original Slack Thread

[2024-06-03 04:40:58,370] INFO     {datahub.cli.ingest_cli:147} - DataHub CLI version: 0.13.1.2
[2024-06-03 04:40:58,486] INFO     {datahub.ingestion.run.pipeline:239} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://localhost:8080
[2024-06-03 04:41:01,383] ERROR    {datahub.entrypoints:201} - Command failed: Failed to find a registered source for type redshift: redshift is disabled due to an error in initialization
Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 126, in _ensure_not_lazy
    plugin_class = import_path(path)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 56, in import_path
    item = importlib.import_module(module_name)
  File "/usr/lib64/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 850, in exec_module
  File "<frozen importlib._bootstrap>", line 228, in _call_with_frames_removed
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/source/redshift/redshift.py", line 47, in <module>
    from datahub.ingestion.source.redshift.lineage import RedshiftLineageExtractor
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/source/redshift/lineage.py", line 34, in <module>
    from datahub.ingestion.source.redshift.report import RedshiftReport
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/source/redshift/report.py", line 6, in <module>
    from datahub.ingestion.source.sql.sql_generic_profiler import ProfilingSqlReport
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/source/sql/sql_generic_profiler.py", line 13, in <module>
    from datahub.ingestion.source.ge_data_profiler import (
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/source/ge_data_profiler.py", line 30, in <module>
    from great_expectations.core.profiler_types_mapping import ProfilerTypeMapping
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/__init__.py", line 6, in <module>
    from great_expectations.data_context.migrator.cloud_migrator import CloudMigrator
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/data_context/__init__.py", line 1, in <module>
    from great_expectations.data_context.data_context import (
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/data_context/data_context/__init__.py", line 1, in <module>
    from great_expectations.data_context.data_context.abstract_data_context import (
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/data_context/data_context/abstract_data_context.py", line 38, in <module>
    from great_expectations.core import ExpectationSuite
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/core/__init__.py", line 3, in <module>
    from .domain import Domain
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/core/domain.py", line 8, in <module>
    from great_expectations.core.id_dict import IDDict
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/core/id_dict.py", line 5, in <module>
    from great_expectations.core.util import convert_to_json_serializable
  File "/home/ec2-user/.local/lib/python3.9/site-packages/great_expectations/core/util.py", line 32, in <module>
    from IPython import get_ipython
  File "/home/ec2-user/.local/lib/python3.9/site-packages/IPython/__init__.py", line 53, in <module>
    from .terminal.embed import embed
  File "/home/ec2-user/.local/lib/python3.9/site-packages/IPython/terminal/embed.py", line 16, in <module>
    from IPython.terminal.interactiveshell import TerminalInteractiveShell
  File "/home/ec2-user/.local/lib/python3.9/site-packages/IPython/terminal/interactiveshell.py", line 48, in <module>
    from .debugger import TerminalPdb, Pdb
  File "/home/ec2-user/.local/lib/python3.9/site-packages/IPython/terminal/debugger.py", line 8, in <module>
    from .shortcuts import create_ipython_shortcuts
  File "/home/ec2-user/.local/lib/python3.9/site-packages/IPython/terminal/shortcuts/__init__.py", line 29, in <module>
    from IPython.terminal.shortcuts.filters import filter_from_string
  File "/home/ec2-user/.local/lib/python3.9/site-packages/IPython/terminal/shortcuts/filters.py", line 18, in <module>
    from prompt_toolkit.filters import (
ImportError: cannot import name 'has_suggestion' from 'prompt_toolkit.filters' (/home/ec2-user/.local/lib/python3.9/site-packages/prompt_toolkit/filters/__init__.py)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 121, in _add_init_error_context
    yield
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 249, in __init__
    source_class = source_registry.get(source_type)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/api/registry.py", line 180, in get
    raise ConfigurationError(
datahub.configuration.common.ConfigurationError: redshift is disabled due to an error in initialization

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/entrypoints.py", line 188, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/home/ec2-user/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ec2-user/.local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ec2-user/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 454, in wrapper
    raise e
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/telemetry/telemetry.py", line 403, in wrapper
    res = func(*args, **kwargs)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 201, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/cli/ingest_cli.py", line 170, in run_ingestion_and_check_upgrade
    pipeline = Pipeline.create(
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 363, in create
    return cls(
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 249, in __init__
    source_class = source_registry.get(source_type)
  File "/usr/lib64/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ec2-user/.local/lib/python3.9/site-packages/datahub/ingestion/run/pipeline.py", line 123, in _add_init_error_context
    raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type redshift: redshift is disabled due to an error in initialization

Hi All,
I suddenly started getting the error above while ingesting metadata into DataHub.
Has anyone faced this issue before? I would be grateful if someone can help me debug this or provide a fix. Thanks!

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?

To help you more effectively, could you please provide the following information:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using (e.g., 0.13.1.2)?
  3. What data source(s) are you integrating with DataHub (e.g., Redshift)?

Based on the error message you provided, it seems that the Redshift source is disabled due to an error in initialization. This could be related to a dependency issue or a misconfiguration.

Potential Solutions:

  1. Dependency Issue: The error message indicates an issue with the prompt_toolkit library. You might want to check if all dependencies are correctly installed and compatible. Specifically, ensure that the prompt_toolkit version is compatible with your DataHub version.

  2. Configuration Issue: Double-check your recipe.dhub.yaml configuration file to ensure that all required fields are correctly specified.

  3. Version Compatibility: Ensure that you are using compatible versions of DataHub CLI and the Redshift connector. Sometimes upgrading or downgrading to a specific version can resolve such issues.
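
One way to check the dependency and version points above is to compare what is installed with what IPython declares it needs. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the list of distributions is an assumption taken from the packages that appear in the traceback.

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a distribution name so prompt_toolkit and prompt-toolkit compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def requirement_name(requirement):
    """Extract the normalized package name from a requirement string."""
    match = re.match(r"[A-Za-z0-9._-]+", requirement)
    return normalize(match.group(0)) if match else ""


def installed_version(dist):
    """Return the installed version of dist, or None if it is not installed."""
    try:
        return metadata.version(dist)
    except metadata.PackageNotFoundError:
        return None


def declared_requirements(dist, dependency):
    """List the requirement strings that dist declares for dependency."""
    try:
        requires = metadata.requires(dist) or []
    except metadata.PackageNotFoundError:
        return []
    wanted = normalize(dependency)
    return [r for r in requires if requirement_name(r) == wanted]


def report():
    """Print installed versions next to what IPython says it needs (assumed package list)."""
    for dist in ("acryl-datahub", "great-expectations", "ipython", "prompt_toolkit"):
        print(f"{dist}: {installed_version(dist)}")
    print("ipython requires:", declared_requirements("ipython", "prompt_toolkit"))
```

Calling report() with the same interpreter that runs the datahub CLI should show whether the installed prompt_toolkit falls outside the range IPython asks for.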

Example Configuration:
Here is an example configuration for Redshift ingestion that you can compare with your recipe.dhub.yaml:

pipeline_name: 'Test Ingestion of Redshift'
source:
    type: redshift
    config:
        database: 'DB_NAME'
        host_port: 'DB_HOST:PORT'
        include_table_lineage: true
        include_tables: true
        include_views: false
        password: '********'
        profiling:
            enabled: true
            profile_table_level_only: true
        schema_pattern:
            allow:
                - SOME_SCHEMA_NAME
        stateful_ingestion:
            enabled: true
        table_lineage_mode: stl_scan_based
        username: DummyUserName
transformers:
    - type: set_dataset_browse_path
      config:
          path_templates:
              - '/PLATFORM/Test Ingestion of Redshift/DATASET_PARTS'
sink:
    type: datahub-rest
    config:
        server: '${DATAHUB_GMS_URL}'
        token: '${DATAHUB_GMS_TOKEN}'

Disabling Stateful Ingestion:
If the issue persists, you can try disabling stateful ingestion as a temporary workaround:

stateful_ingestion:
    enabled: false

Would you like more detailed steps on any of these potential solutions?

  1. I am using the CLI for ingestion.
  2. DataHub version: 0.13.1.2
  3. I am only ingesting Redshift with DataHub.
    This had been running for a long time, around 6-7 months, and then suddenly broke.
    <@U06TM7M40RK>

Thank you for providing the additional details. Given that your setup was working for a long time and suddenly broke down, it is likely due to a change in dependencies or an update that caused the issue.

Steps to Troubleshoot and Resolve the Issue:

  1. Check Dependency Versions:
    The ImportError shows that IPython could not import has_suggestion from the installed prompt_toolkit. A missing name like this usually means prompt_toolkit is older than the installed IPython expects, so downgrading it is unlikely to help. Try upgrading prompt_toolkit so the two packages match again:

    pip install --upgrade prompt_toolkit
    
  2. Verify Configuration:
    Double-check your recipe.dhub.yaml configuration file to ensure that all required fields are correctly specified. Here is an example configuration for Redshift ingestion:

    pipeline_name: 'Test Ingestion of Redshift'
    source:
        type: redshift
        config:
            database: 'DB_NAME'
            host_port: 'DB_HOST:PORT'
            include_table_lineage: true
            include_tables: true
            include_views: false
            password: '********'
            profiling:
                enabled: true
                profile_table_level_only: true
            schema_pattern:
                allow:
                    - SOME_SCHEMA_NAME
            stateful_ingestion:
                enabled: true
            table_lineage_mode: stl_scan_based
            username: DummyUserName
    transformers:
        - type: set_dataset_browse_path
          config:
              path_templates:
                  - '/PLATFORM/Test Ingestion of Redshift/DATASET_PARTS'
    sink:
        type: datahub-rest
        config:
            server: '${DATAHUB_GMS_URL}'
            token: '${DATAHUB_GMS_TOKEN}'
    
  3. Disable Stateful Ingestion:
    As a temporary workaround, you can try disabling stateful ingestion to see if it resolves the issue:

    stateful_ingestion:
        enabled: false
    
  4. Update DataHub and Dependencies:
    Ensure that both the DataHub server and CLI are updated to the latest versions. Sometimes, updating to the latest version can resolve compatibility issues.

    pip install --upgrade acryl-datahub
    
  5. Check Logs for Detailed Errors:
    Review the logs from the DataHub ingestion executor and the DataHub GMS to identify any specific error messages. This can provide insights into why the jobs are marked as failed. You can access the logs by clicking on the Details button of the failed ingestion run in the UI.
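
To narrow down which package in the chain is actually broken before changing anything, the modules from the traceback can be imported one at a time. This is a rough sketch, not part of DataHub; the function name is hypothetical and the module list is copied from the traceback above.

```python
import importlib

# Modules named in the traceback, ordered from the innermost dependency outward.
IMPORT_CHAIN = (
    "prompt_toolkit.filters",
    "IPython",
    "great_expectations",
    "datahub.ingestion.source.redshift.redshift",
)


def first_failing_import(modules=IMPORT_CHAIN):
    """Import modules in order; return (name, error) for the first failure, else None."""
    for name in modules:
        try:
            importlib.import_module(name)
        except Exception as exc:  # import-time errors are not always ImportError
            return name, f"{type(exc).__name__}: {exc}"
    return None
```

Running print(first_failing_import()) with the interpreter that the datahub CLI uses should point at IPython if the prompt_toolkit mismatch is the only problem; a result of None means every module in the chain imported cleanly.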

Example Command to Run Ingestion:
Here is an example command to run the ingestion with the updated configuration:

datahub ingest -c recipe.dhub.yaml

Would you like more detailed steps on any of these potential solutions?

It looks like you've installed some other libraries that are causing issues. In general, we recommend installing acryl-datahub in a virtual environment (venv) to avoid these sorts of conflicts.

  File "/home/ec2-user/.local/lib/python3.9/site-packages/IPython/terminal/shortcuts/filters.py", line 18, in <module>
    from prompt_toolkit.filters import (
ImportError: cannot import name 'has_suggestion' from 'prompt_toolkit.filters' (/home/ec2-user/.local/lib/python3.9/site-packages/prompt_toolkit/filters/__init__.py)
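
A venv-based setup along the lines suggested above could look roughly like this. It is a sketch under assumptions: the environment directory and the helper names are hypothetical, the recipe filename is the one used earlier in the thread, and the redshift extra is assumed to be the right plugin for this recipe.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir):
    """Path to the interpreter inside the virtual environment."""
    bin_dir = "Scripts" if sys.platform == "win32" else "bin"
    return str(Path(env_dir) / bin_dir / "python")


def install_and_run_commands(env_dir, recipe="recipe.dhub.yaml", version=None):
    """Build (but do not run) the commands for a clean install and an ingestion run."""
    python = venv_python(env_dir)
    package = "acryl-datahub[redshift]"
    if version:
        package += f"=={version}"  # pin to the CLI version already in use
    return [
        [python, "-m", "pip", "install", "--upgrade", "pip"],
        [python, "-m", "pip", "install", package],
        [python, "-m", "datahub", "ingest", "-c", recipe],
    ]


def bootstrap(env_dir, recipe="recipe.dhub.yaml", version=None):
    """Create the environment, then run each command, stopping at the first failure."""
    venv.create(env_dir, with_pip=True)
    for command in install_and_run_commands(env_dir, recipe, version):
        subprocess.run(command, check=True)
```

For example, bootstrap("datahub-venv", version="0.13.1.2") would install the same CLI version shown in the log into an isolated environment, so unrelated packages in ~/.local can no longer change what IPython or prompt_toolkit resolve to.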