Troubleshooting DataHub Ingestion Errors with a Business Glossary YAML File

Original Slack Thread

I'm running a PoC for DataHub.

We are unable to ingest our current business glossary. The YAML file is valid according to a YAML parser.
This is what the CLI returns:

[datahub@rs-datahub-01 venv]$ datahub ingest -c rvk_hugtok.yml
[2025-01-03 11:48:45,565] INFO {datahub.cli.ingest_cli:150} - DataHub CLI version: 0.15.0
5 validation errors for PipelineConfig
source
  Input should be a valid dictionary or instance of SourceConfig [type=model_type, input_value='DataHub', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/model_type
version
  Extra inputs are not permitted [type=extra_forbidden, input_value='1', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/extra_forbidden
owners
  Extra inputs are not permitted [type=extra_forbidden, input_value={'users': ['datahub']}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.10/v/extra_forbidden
url
  Extra inputs are not permitted [type=extra_forbidden, input_value='https://hugtakasafn.reykjavik.is/', input_type=str]
    For further information visit https://errors.pydantic.dev/2.10/v/extra_forbidden
terms
  Extra inputs are not permitted [type=extra_forbidden, input_value=[{'name': 'Fjárheimild',...'ut_hugtak': 'False'}]}], input_type=list]
    For further information visit https://errors.pydantic.dev/2.10/v/extra_forbidden

It seems you are hitting validation errors when trying to ingest your business glossary YAML file into DataHub. The errors are raised by PipelineConfig, the model that validates an ingestion recipe, which suggests the glossary file itself was passed to datahub ingest -c and read as a recipe. Its top-level glossary keys are then reported as unrecognized.

Based on the DataHub Business Glossary documentation (Glossary File Format), the YAML file should conform to a specific schema. Here are some key points to consider:

  1. Top-Level Keys: The YAML file should have specific top-level keys such as version, source, owners, url, nodes, and terms. Each of these keys has specific requirements and sub-keys.

  2. Source Key: In a recipe, source must be a dictionary (or an instance of SourceConfig) with a type and a config. In a glossary file, source: DataHub is a plain string, which is why the recipe validator rejects the value 'DataHub'.

  3. Extra Inputs: version, owners, url, and terms are reported as extra inputs because they are not recipe keys. They are valid in a glossary file but not at the top level of a recipe.

  4. Validation Errors: These are Pydantic model validation errors, which means the structure of the file passed to -c does not match the recipe model.

To resolve these issues, you should:

  • Keep the glossary file structured as described in the DataHub Business Glossary documentation (Glossary File Format).
  • Create a separate recipe file whose source has type datahub-business-glossary and whose config points at the glossary file, then pass that recipe to datahub ingest -c.
  • Check both files for typos or misplaced keys that might cause validation errors.
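
The validation errors above come from PipelineConfig, which suggests the glossary file was given to -c where a recipe is expected. A minimal sketch of writing a separate recipe that points at the glossary file might look like the following. The datahub-business-glossary source type and its file option are taken from the DataHub documentation; the recipe filename, the sink server URL (http://localhost:8080), and the environment variable names are assumptions made for illustration and will likely differ in your setup.

```shell
#!/bin/sh
# Sketch: write a minimal ingestion recipe that references the glossary file.
# Assumptions (not from the thread): the recipe filename and the sink URL.
GLOSSARY_FILE="${GLOSSARY_FILE:-/home/datahub/rvk_hugtok.yml}"
RECIPE_FILE="${RECIPE_FILE:-glossary_recipe.yml}"
DATAHUB_GMS_URL="${DATAHUB_GMS_URL:-http://localhost:8080}"  # assumed default

# The unquoted heredoc delimiter lets the shell expand the variables above.
cat > "$RECIPE_FILE" <<EOF
source:
  type: datahub-business-glossary
  config:
    file: "$GLOSSARY_FILE"
sink:
  type: datahub-rest
  config:
    server: "$DATAHUB_GMS_URL"
EOF

echo "Wrote $RECIPE_FILE"
```

The recipe, not the glossary file, is then what goes to the CLI: datahub ingest -c glossary_recipe.yml. The glossary file keeps its own keys (version, source, owners, url, nodes, terms) unchanged.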

Would you like more detailed steps or examples on how to structure your YAML file correctly?

Sources:

  • DataHub Business Glossary documentation (Glossary File Format)

I removed the extra fields from the YAML file. Now I get:

[2025-01-03 14:34:15,757] ERROR {datahub.entrypoints:215} - Command failed: Cannot open config file /home/datahub/rvk_hugtok.yml
Traceback (most recent call last):
  File "/home/datahub/.local/lib/python3.12/site-packages/datahub/entrypoints.py", line 202, in main
    sys.exit(datahub(standalone_mode=False, **kwargs))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/datahub/telemetry/telemetry.py", line 476, in wrapper
    raise e
  File "/home/datahub/.local/lib/python3.12/site-packages/datahub/telemetry/telemetry.py", line 424, in wrapper
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/datahub/cli/ingest_cli.py", line 152, in run
    pipeline_config = load_config_file(
                      ^^^^^^^^^^^^^^^^^
  File "/home/datahub/.local/lib/python3.12/site-packages/datahub/configuration/config_loader.py", line 152, in load_config_file
    raise ConfigurationError(
datahub.configuration.common.ConfigurationError: Cannot open config file /home/datahub/rvk_hugtok.yml
[datahub@rs-datahub-01 ~]$
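
"Cannot open config file" typically means the path given to -c does not resolve to a readable regular file. The first run was started from a directory named venv, while the prompt after this run shows ~, so the file may simply live in a different directory than /home/datahub. A small helper can narrow this down. Note that check_config is a hypothetical function written for this sketch, not part of the DataHub CLI.

```shell
#!/bin/sh
# check_config PATH
# Hypothetical helper: report why a config file path might fail to open.
# Returns 0 if PATH is a readable regular file, 1 otherwise.
check_config() {
    path="$1"
    if [ ! -e "$path" ]; then
        # Relative paths are resolved against the current directory.
        echo "missing: $path (current directory: $(pwd))"
        return 1
    elif [ ! -f "$path" ]; then
        echo "not a regular file: $path"
        return 1
    elif [ ! -r "$path" ]; then
        echo "not readable by $(id -un): $path"
        return 1
    fi
    echo "ok: $path"
}
```

Calling check_config /home/datahub/rvk_hugtok.yml from the same shell that runs datahub ingest shows whether the file is missing at that location or unreadable by the current user. If it is missing, running the command from the directory that holds the file, or passing its absolute path to -c, is the likely fix.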
