Troubleshooting dbt Ingestion Error in DataHub Docker Quickstart

Original Slack Thread

Hello everyone :wave:
I am creating a POC using the quickstart with docker (version 0.12.1). I am trying to ingest from dbt core (not cloud version) and I am getting the following error:

```
[2023-12-14 16:17:23,930] ERROR    {datahub.entrypoints:186} - Command failed: Failed to find a registered source for type dbt: dbt is disabled; try running: pip install 'acryl-datahub[dbt]'
Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 126, in _ensure_not_lazy
    plugin_class = import_path(path)
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 56, in import_path
    item = importlib.import_module(module_name)
  File "/usr/local/lib/python3.10/importlib/__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/ingestion/source/dbt/dbt_core.py", line 12, in <module>
    from datahub.configuration.git import GitReference
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/configuration/git.py", line 9, in <module>
    from datahub.ingestion.source.git.git_import import GitClone
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/ingestion/source/git/git_import.py", line 8, in <module>
    import git
ModuleNotFoundError: No module named 'git'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 120, in _add_init_error_context
    yield
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 223, in __init__
    source_class = source_registry.get(source_type)
  File "/tmp/datahub/ingest/venv-dbt-f3c1b67e57e34548/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 176, in get
    raise ConfigurationError(
datahub.configuration.common.ConfigurationError: dbt is disabled; try running: pip install 'acryl-datahub[dbt]'
```
Do I need to install git in one of the images? Or run this `pip install 'acryl-datahub[dbt]'` as stated in the stacktrace? In which of the images should I execute that command or do I need a new Dockerfile for this? Thanks!

You should install the dbt plugin in the virtual environment of the DataHub CLI.

Are you running the CLI from within a container?

> Are you running the CLI from within a container?
I assume not, the dbt plugin is already present in the slim images that Acryl publishes (https://github.com/datahub-project/datahub/blob/master/docker/datahub-ingestion/Dockerfile#L28), afaik

Thanks <@U06A5NJ11S7> for the response. I run datahub docker quickstart from my local terminal and inside a virtualenv. But the ingestion, I am running from the UI.

I had the same issue. By running `datahub check plugins --verbose`, I saw that the dbt plugin is failing with ModuleNotFoundError("No module named 'git'")
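For anyone who wants to confirm the same thing without the CLI, here is a minimal sketch that checks whether the `git` module (the import name provided by GitPython) can be found. The helper name `missing_modules` is made up for illustration, and the check only tells you about the interpreter that runs it, so it would need to run in the same environment that executes the ingestion.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# "git" is the module that the traceback reports as missing.
print(missing_modules(["git"]))
```

An empty list means the module is importable from that interpreter; a list containing `git` matches the error in the traceback.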

It appears in 0.12.1 a dependency was introduced on gitpython for the dbt recipe, but this wasn’t updated in setup.py

It needs to be fixed in setup.py (https://github.com/datahub-project/datahub/blob/v0.12.1/metadata-ingestion/setup.py), but temporarily you should be able to get things working with `pip install GitPython`

Thanks <@U043M9RSB3L> for the response. Do I need to run that inside the container? I installed it in my conda env, from which I run the quickstart command, and I keep getting the module-not-found error, even after running `pip install GitPython`

Hey <@U06A3QYDW3X>, this is fixed now in the minor version 0.12.1.1. So, if you run `pip install 'acryl-datahub[dbt]==0.12.1.1'` inside your container, it should be fine :slightly_smiling_face:
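To check which versions actually ended up in a given environment, something like the sketch below may help. It uses only the standard library; `installed_version` is an illustrative helper name, and the result reflects only the interpreter it runs in (for UI ingestion that would be the environment inside the container, not a local conda env).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear in this thread.
for dist in ("acryl-datahub", "GitPython"):
    print(dist, installed_version(dist))
```

If `GitPython` prints `None` while `acryl-datahub` shows 0.12.1, that would be consistent with the missing dependency described above.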

<@U06A3QYDW3X> - We used gitpython = "^3.1.40" and it fixed the issue for us.
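The `^3.1.40` above looks like a Poetry-style caret constraint (an assumption, since the thread does not say which tool was used). For a non-zero major version, the caret means "at least this version, but below the next major", i.e. `>=3.1.40,<4.0.0`. A rough sketch of that rule, with a hypothetical helper that only handles plain `X.Y.Z` versions:

```python
def satisfies_caret(version, minimum):
    """Check a plain X.Y.Z version against a caret constraint with a non-zero major.

    Simplified illustration only: pre-releases and 0.x versions are not handled.
    """
    v = tuple(int(part) for part in version.split("."))
    m = tuple(int(part) for part in minimum.split("."))
    # Same major version, and not older than the stated minimum.
    return v[0] == m[0] and v >= m


for candidate in ("3.1.39", "3.1.40", "3.2.0", "4.0.0"):
    print(candidate, satisfies_caret(candidate, "3.1.40"))
```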

Hi team, has anyone gotten this working? I am facing the same issue with dbt ingestion.

Yes, adding the above dependency should resolve the issue.