Connecting Teradata to DataHub - Configuration Challenges and Solutions

Original Slack Thread

I’m starting to use DataHub and I’m trying to connect a Teradata database to it. For a MySQL database I have successfully used the following configuration:

    type: mysql
    config:
        host_port: xxxxxxxxxxxx:3306
        database: null
        username: xxxxxxxxxxxxxxx
        include_tables: true
        include_views: true
        profiling:
            enabled: true
            profile_table_level_only: true
        stateful_ingestion:
            enabled: true
        password: 'xxxxxxxxxxxxx'
DataHub does not provide a ready-made Teradata configuration example. Does anyone have an example configuration for Teradata?

NOTE: I have already installed the plugin with pip install "acryl-datahub[teradata]"

<@U062T651B5J>
Teradata is not supported in DataHub

<@U062T651B5J> Did you try the recipe from the documentation? ( https://datahubproject.io/docs/generated/ingestion/sources/teradata ) I had to remove the sink statement because it raised an error, but even then it failed with the “Did not find a registered class for teradata” error:

source:
  type: teradata
  config:
    host_port: "myteradatainstance.teradata.com:1025"
    username: myuser
    password: mypassword
    #database_pattern:
    #  allow:
    #    - "my_database"
    #  ignoreCase: true
    include_table_lineage: true
    include_usage_statistics: true
    stateful_ingestion:
      enabled: true
sink:
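The recipe above ends with a bare sink: key, which is what triggered the error mentioned earlier. A minimal complete sink pointing at a local quickstart GMS would look like the following sketch (the server URL is an assumption for a local Docker quickstart; adjust it for your deployment):

```yaml
sink:
  type: datahub-rest
  config:
    server: "http://localhost:8080"  # assumed local quickstart GMS endpoint
```

Alternatively, in recent CLI versions you can omit the sink section entirely and the CLI falls back to the connection configured via datahub init.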

Yes, I tried it

Did you find a solution to the registered-class error? I have the same problem.

Hey folks! We recently rolled out our Teradata ingestion source, so we’re navigating issues as they arise… thanks for bringing this to our attention! <@UV14447EU> Any idea what might be happening here?

Thank you Maggie. I have a few customers investigating data catalog options and DataHub has come up a few times. I’m a Teradata architect and I’m interested in seeing the Teradata ingestion capability working so that I can support them.

I am also considering a blog post demonstrating this connection as it is a good example of Teradata being visible in the modern data tools stack, which our customers are very interested in as they modernize their workloads.

Which version of datahub did you try to use?
We have multiple improvements coming soon for Teradata as it is under active development right now.

The latest 0.12 release, running locally as the Docker quickstart, with the plugin installed per the Teradata ingestion source documentation

Execution finished with errors.
{'exec_id': '602a54c8-bdc5-4a82-aa77-951ef69ebb2c',
 'infos': ['2023-11-09 03:38:42.542339 INFO: Starting execution for task with name=RUN_INGEST',
           "2023-11-09 03:38:48.728641 INFO: Failed to execute 'datahub ingest', exit code 1",
           '2023-11-09 03:38:48.731169 INFO: Caught exception EXECUTING task_id=602a54c8-bdc5-4a82-aa77-951ef69ebb2c, name=RUN_INGEST, '
           'stacktrace=Traceback (most recent call last):\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/default_executor.py", line 140, in execute_task\n'
           '    task_event_loop.run_until_complete(task_future)\n'
           '  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete\n'
           '    return future.result()\n'
           '  File "/usr/local/lib/python3.10/site-packages/acryl/executor/execution/sub_process_ingestion_task.py", line 282, in execute\n'
           '    raise TaskError("Failed to execute \'datahub ingest\'")\n'
           "acryl.executor.execution.task.TaskError: Failed to execute 'datahub ingest'\n"],
 'errors': []}

~~~~ Ingestion Logs ~~~~
Obtaining venv creation lock...
Acquired venv creation lock
venv is already set up
venv setup time = 0 sec
This version of datahub supports report-to functionality
+ exec datahub ingest run -c /tmp/datahub/ingest/602a54c8-bdc5-4a82-aa77-951ef69ebb2c/recipe.yml --report-to /tmp/datahub/ingest/602a54c8-bdc5-4a82-aa77-951ef69ebb2c/ingestion_report.json
[2023-11-09 03:38:45,997] INFO     {datahub.cli.ingest_cli:148} - DataHub CLI version: 0.11.0.1
[2023-11-09 03:38:46,079] INFO     {datahub.ingestion.run.pipeline:213} - Sink configured successfully. DataHubRestEmitter: configured to talk to http://datahub-gms:8080
[2023-11-09 03:38:46,480] ERROR    {datahub.entrypoints:199} - Command failed: Failed to find a registered source for type teradata: 'Did not find a registered class for teradata'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 120, in _add_init_error_context
    yield
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 223, in __init__
    source_class = source_registry.get(source_type)
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/api/registry.py", line 181, in get
    raise KeyError(f"Did not find a registered class for {key}")
KeyError: 'Did not find a registered class for teradata'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/datahub/entrypoints.py", line 186, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 448, in wrapper
    raise e
  File "/usr/local/lib/python3.10/site-packages/datahub/telemetry/telemetry.py", line 397, in wrapper
    res = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/utilities/memory_leak_detector.py", line 95, in wrapper
    return func(ctx, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 198, in run
    ret = loop.run_until_complete(run_ingestion_and_check_upgrade())
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.10/site-packages/datahub/cli/ingest_cli.py", line 168, in run_ingestion_and_check_upgrade
    pipeline = Pipeline.create(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 336, in create
    return cls(
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 220, in __init__
    with _add_init_error_context(
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/datahub/ingestion/run/pipeline.py", line 122, in _add_init_error_context
    raise PipelineInitError(f"Failed to {step}: {e}") from e
datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered source for type teradata: 'Did not find a registered class for teradata'

<@U062T651B5J> The DataHub CLI you are using doesn’t have the Teradata plugin. Maybe you are not in the right virtual environment?
Did you install the Teradata source with:
pip install "acryl-datahub[teradata]"
If so, you can check whether it was installed successfully by running:
datahub check plugins --verbose
If you don’t see the Teradata plugin there, double-check that you are in the right virtual environment:
which datahub
This should show you which datahub binary is being run.
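As a supplementary check (a sketch, not a DataHub command): you can ask Python directly whether a package resolves in the current interpreter, which helps spot virtual-environment mix-ups like the one in this thread. teradatasqlalchemy is assumed here to be the dialect package the teradata extra pulls in; check pip show acryl-datahub for the authoritative dependency list.

```python
import importlib.util
import sys

def can_import(module_name: str) -> bool:
    """Return True if module_name resolves in the current environment."""
    return importlib.util.find_spec(module_name) is not None

# Which interpreter (and therefore which site-packages) is active?
print(sys.executable)

# datahub, plus the dialect package the teradata extra is assumed to install.
for mod in ("datahub", "teradatasqlalchemy"):
    print(f"{mod}: {'found' if can_import(mod) else 'MISSING'}")
```

If sys.executable points outside the venv where you ran pip install, the CLI and the install are using different environments.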

<@U042V25A8SK> I just tried locally and could install it from scratch into a new virtual environment. If you ran it from the CLI, can you please do the checks I recommended in my post above?

Thanks <@UV14447EU>

That certainly helped me understand where I was going wrong. I was trying to do this ingestion via the UI, not the CLI. Creating my recipe as a YAML file in my virtual environment and executing datahub ingest -c td-test.yaml worked fine. However, trying to do this via the UI gives the error we highlighted about not finding the registered class for teradata.

<@U062T651B5J> I suspect you have the same problem in trying to do this via the UI.

<@U042V25A8SK> I haven’t tried running “datahub ingest -c td-test.yaml”. What does this td-test.yaml file look like?

<@UV14447EU> I installed it with pip3 install "acryl-datahub[teradata]", but teradata doesn’t appear in the list:

Sources:
athena         (disabled)          ModuleNotFoundError("No module named 'pyathena'")
azure-ad       AzureADSource
bigquery       (disabled)          ModuleNotFoundError("No module named 'google'")
clickhouse     (disabled)          ModuleNotFoundError("No module named 'clickhouse_driver'")
clickhouse-usage (disabled)          ModuleNotFoundError("No module named 'clickhouse_driver'")
csv-enricher   CSVEnricherSource
datahub        (disabled)          ModuleNotFoundError("No module named 'confluent_kafka'")
datahub-business-glossary BusinessGlossaryFileSource
datahub-lineage-file LineageFileSource
dbt            (disabled)          ModuleNotFoundError("No module named 'boto3'")
dbt-cloud      DBTCloudSource
delta-lake     (disabled)          ModuleNotFoundError("No module named 'deltalake'")
demo-data      DemoDataSource
druid          (disabled)          ModuleNotFoundError("No module named 'pydruid'")
dynamodb       (disabled)          ModuleNotFoundError("No module named 'boto3'")
elasticsearch  (disabled)          ModuleNotFoundError("No module named 'elasticsearch'")
feast          (disabled)          ModuleNotFoundError("No module named 'feast'")
file           GenericFileSource
gcs            (disabled)          ModuleNotFoundError("No module named 'boto3'")
glue           (disabled)          ModuleNotFoundError("No module named 'botocore'")
hana           HanaSource
hive           (disabled)          ModuleNotFoundError("No module named 'pyhive'")
iceberg        (disabled)          ModuleNotFoundError("No module named 'pyiceberg'")
json-schema    JsonSchemaSource
kafka          (disabled)          ModuleNotFoundError("No module named 'confluent_kafka'")
kafka-connect  (disabled)          ModuleNotFoundError("No module named 'jpype'")
ldap           (disabled)          ModuleNotFoundError("No module named 'ldap'")
looker         (disabled)          ModuleNotFoundError("No module named 'looker_sdk'")
lookml         (disabled)          ModuleNotFoundError("No module named 'lkml'")
mariadb        (disabled)          ModuleNotFoundError("No module named 'pymysql'")
metabase       (disabled)          ModuleNotFoundError("No module named 'sqllineage'")
mode           (disabled)          ModuleNotFoundError("No module named 'tenacity'")
mongodb        (disabled)          ModuleNotFoundError("No module named 'bson'")
mssql          (disabled)          ModuleNotFoundError("No module named 'sqlalchemy_pytds'")
mysql          (disabled)          ModuleNotFoundError("No module named 'pymysql'")
nifi           (disabled)          ModuleNotFoundError("No module named 'requests_gssapi'")
okta           (disabled)          ModuleNotFoundError("No module named 'okta'")
openapi        OpenApiSource
oracle         (disabled)          ModuleNotFoundError("No module named 'cx_Oracle'")
postgres       (disabled)          ModuleNotFoundError("No module named 'psycopg2'")
powerbi        (disabled)          ModuleNotFoundError("No module named 'lark'")
powerbi-report-server (disabled)          ModuleNotFoundError("No module named 'requests_ntlm'")
presto         (disabled)          ModuleNotFoundError("No module named 'pyhive'")
presto-on-hive (disabled)          ModuleNotFoundError("No module named 'pyhive'")
pulsar         PulsarSource
redash         (disabled)          ModuleNotFoundError("No module named 'redash_toolbelt'")
redshift       (disabled)          ModuleNotFoundError("No module named 'psycopg2'")
redshift-legacy (disabled)          ModuleNotFoundError("No module named 'psycopg2'")
redshift-usage-legacy (disabled)          ModuleNotFoundError("No module named 'psycopg2'")
s3             (disabled)          ModuleNotFoundError("No module named 'more_itertools'")
sagemaker      (disabled)          ModuleNotFoundError("No module named 'boto3'")
salesforce     (disabled)          ModuleNotFoundError("No module named 'simple_salesforce'")
snowflake      (disabled)          ModuleNotFoundError("No module named 'snowflake'")
sql-queries    SqlQueriesSource
sqlalchemy     SQLAlchemyGenericSource
starburst-trino-usage (disabled)          ModuleNotFoundError("No module named 'trino'")
superset       SupersetSource
tableau        (disabled)          ModuleNotFoundError("No module named 'tableauserverclient'")
trino          (disabled)          ModuleNotFoundError("No module named 'trino'")
unity-catalog  (disabled)          ModuleNotFoundError("No module named 'databricks'")
vertica        (disabled)          ModuleNotFoundError("No module named 'vertica_sqlalchemy_dialect'")

Sinks:
blackhole      BlackHoleSink
console        ConsoleSink
datahub-kafka  (disabled)          ModuleNotFoundError("No module named 'confluent_kafka'")
datahub-lite   DataHubLiteSink
datahub-rest   DatahubRestSink
file           FileSink

Transformers:
add_dataset_domain            AddDatasetDomain
add_dataset_ownership         AddDatasetOwnership
add_dataset_properties        AddDatasetProperties
add_dataset_tags              AddDatasetTags
add_dataset_terms             AddDatasetTerms
extract_dataset_tags          ExtractDatasetTags
mark_dataset_status           MarkDatasetStatus
pattern_add_dataset_domain    PatternAddDatasetDomain
pattern_add_dataset_ownership PatternAddDatasetOwnership
pattern_add_dataset_schema_tags PatternAddDatasetSchemaTags
pattern_add_dataset_schema_terms PatternAddDatasetSchemaTerms
pattern_add_dataset_tags      PatternAddDatasetTags
pattern_add_dataset_terms     PatternAddDatasetTerms
set_dataset_browse_path       AddDatasetBrowsePathTransformer
simple_add_dataset_domain     SimpleAddDatasetDomain
simple_add_dataset_ownership  SimpleAddDatasetOwnership
simple_add_dataset_properties SimpleAddDatasetProperties
simple_add_dataset_tags       SimpleAddDatasetTags
simple_add_dataset_terms      SimpleAddDatasetTerms
simple_remove_dataset_ownership SimpleRemoveDatasetOwnership

<@U042V25A8SK> In UI ingestion, there is always a default client version that is used. You can override it by pinning the client version.
Here is the doc about how to do it -> https://datahubproject.io/docs/ui-ingestion/#advanced-ingestion-configs

<@U062T651B5J> I think the end of the plugin list is missing. Also, please check the DataHub CLI version to make sure you are on at least 0.12: datahub --version

<@UV14447EU> 0.11.0.1, but I also had a problem with 0.12.0. I’ll install 0.12.0 again and post the result here.

<@UV14447EU> datahub 0.12.0 is installed and the teradata plugin shows up fine; now I’m going to try to ingest with:

source:
  type: teradata
  config:
    host_port: "myteradatainstance.teradata.com:1025"
    username: myuser
    password: mypassword
    #database_pattern:
    #  allow:
    #    - "my_database"
    #  ignoreCase: true
    include_table_lineage: true
    include_usage_statistics: true
    stateful_ingestion:
      enabled: true
sink:

It looks good.