Discussion on `pydantic` Version Constraints for DataHub Software Development

Original Slack Thread

I’m surprised to see the current package constraint "pydantic>=1.10.0,!=1.10.3" in <https://github.com/datahub-project/datahub/blob/45a8cc9ecf4d2fb10bda0a3d69ccc2afd172d540/metadata-ingestion/setup.py#L22|metadata-ingestion/setup.py>.
Is it really correct? It seems we are not ready for pydantic v2 yet, and schema updates are still required in a lot of files.
Why don’t we have "pydantic<2" like everywhere else?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


Hi there! Let’s address your question regarding the pydantic version constraints in DataHub’s metadata-ingestion/setup.py.

Current Pydantic Version Constraints

Indeed, the metadata-ingestion/setup.py file specifies the constraint "pydantic>=1.10.0,!=1.10.3" (<https://github.com/datahub-project/datahub/blob/45a8cc9ecf4d2fb10bda0a3d69ccc2afd172d540/metadata-ingestion/setup.py#L22|source>). This is less restrictive than the constraint "pydantic<2" seen in other parts of the project.
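
To see why an installer can resolve to pydantic 2.x under the first constraint but not the second, the two can be compared directly. The sketch below is a simplified stand-in for PEP 440 matching: `parse` and `satisfies` are hypothetical helpers written for illustration, they handle only plain dotted numeric versions, and they are not part of pip or DataHub.

```python
import operator

# Hypothetical helpers: a simplified stand-in for PEP 440 matching.
# Only plain dotted numeric versions and the operators below are handled
# (no pre-releases, wildcards, or "~=").
_OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("!=", operator.ne),
    ("==", operator.eq),
    ("<", operator.lt),
    (">", operator.gt),
]


def parse(version):
    """Turn "2.7.4" into (2, 7, 4), padding short versions such as "2" with zeros."""
    parts = tuple(int(part) for part in version.strip().split("."))
    return parts + (0,) * (3 - len(parts))


def satisfies(version, constraint):
    """Return True if `version` meets every comma-separated clause in `constraint`."""
    for clause in constraint.split(","):
        clause = clause.strip()
        for symbol, compare in _OPERATORS:
            if clause.startswith(symbol):
                if not compare(parse(version), parse(clause[len(symbol):])):
                    return False
                break
        else:
            raise ValueError(f"unsupported clause: {clause!r}")
    return True


installed = "2.7.4"  # the pydantic version reported by pipx in this thread
print(satisfies(installed, ">=1.10.0,!=1.10.3"))  # True: v2 is allowed
print(satisfies(installed, "<2"))                 # False: v2 is excluded
```

For real dependency checks, a full PEP 440 implementation should be preferred over this sketch.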

Reason for the Discrepancy

The discrepancy can be attributed to the ongoing efforts to make DataHub compatible with pydantic v2. According to the <https://github.com/datahub-project/datahub/pull/9434|pull request> that introduced compatibility with pydantic v2, the transition is still experimental. The project aims to support both pydantic v1 and v2, but certain parts of the codebase and plugins still require pydantic v1 due to extensive changes in pydantic v2.

Why Not Use "pydantic<2" Everywhere?

While the metadata-ingestion/setup.py file has been updated to allow for pydantic v2 compatibility, other parts of the project, such as specific plugins and modules, may still have constraints like "pydantic<2" to avoid breaking changes. This staged approach ensures that the core functionality can start leveraging pydantic v2 features while maintaining stability across the entire ecosystem.

Next Steps

  1. Schema Updates: The project is actively working on schema updates and other necessary changes to fully support pydantic v2 across all modules.
  2. Testing: Extensive testing is being conducted to ensure compatibility and to identify any issues that arise from the transition.

Would you like more detailed information on the specific schema updates required or the current status of the transition to pydantic v2?

  1. CLI, but I suspect the problem is more general (e.g. executing the updateIngestionSource operation)
  2. DataHub 0.13.3
  3. Glossary (example to come)

Disclaimer: I’m using pipx here, but only to illustrate my concern.

```❯ pipx runpip acryl-datahub list | grep pydantic
pydantic                  2.7.4
pydantic_core             2.18.4```
If I try to ingest the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/bootstrap_data/business_glossary.yml|business glossary example file>, I get:
```❯ datahub ingest -c .\recipe.yaml
...
6 validation errors for BusinessGlossaryConfig
version
  Input should be a valid string [type=string_type, input_value=1, input_type=int]
    For further information visit <https://errors.pydantic.dev/2.7/v/string_type>
nodes.0.terms.0.custom_properties.is_confidential
  Input should be a valid string [type=string_type, input_value=False, input_type=bool]
    For further information visit <https://errors.pydantic.dev/2.7/v/string_type>
nodes.0.terms.1.custom_properties.is_confidential
  Input should be a valid string [type=string_type, input_value=True, input_type=bool]
    For further information visit <https://errors.pydantic.dev/2.7/v/string_type>
nodes.0.terms.2.custom_properties.is_confidential
  Input should be a valid string [type=string_type, input_value=True, input_type=bool]
    For further information visit <https://errors.pydantic.dev/2.7/v/string_type>
nodes.0.custom_properties.is_confidential
  Input should be a valid string [type=string_type, input_value=True, input_type=bool]
    For further information visit <https://errors.pydantic.dev/2.7/v/string_type>
nodes.2.terms.0.custom_properties.is_used_for_compliance_tracking
  Input should be a valid string [type=string_type, input_value=True, input_type=bool]
    For further information visit <https://errors.pydantic.dev/2.7/v/string_type>```
If I pin the pydantic version again:
`pipx runpip acryl-datahub install "pydantic==1.10"`
the previous ingestion command works again:
``` ❯ datahub ingest -c .\recipe.yaml
...
Pipeline finished successfully; produced 55 events in 1 minute and 32.74 seconds.```
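
All six validation errors above come from non-string values (`version: 1` and boolean `custom_properties`) in fields that pydantic v2 validates as strings. One possible workaround, short of pinning pydantic, is to make those values strings before validation, either by quoting them in the YAML file or by normalising the loaded data. The sketch below shows the second option on a plain dict; `stringify_glossary` and its helpers are hypothetical, the sample data is illustrative, and whether "True"/"False" is the string form you want in DataHub is an assumption to check.

```python
import copy


def _stringify_props(item):
    """Convert every custom_properties value on a node or term to a string."""
    props = item.get("custom_properties")
    if props:
        item["custom_properties"] = {key: str(value) for key, value in props.items()}


def _stringify_nodes(nodes):
    """Walk nodes, their terms, and any nested child nodes."""
    for node in nodes:
        _stringify_props(node)
        for term in node.get("terms") or []:
            _stringify_props(term)
        _stringify_nodes(node.get("nodes") or [])


def stringify_glossary(config):
    """Return a copy in which `version` and all custom_properties values are strings."""
    result = copy.deepcopy(config)
    if "version" in result:
        result["version"] = str(result["version"])
    _stringify_nodes(result.get("nodes") or [])
    return result


# Illustrative data shaped like the error paths above (not the real example file).
sample = {
    "version": 1,
    "nodes": [
        {
            "name": "ExampleNode",
            "custom_properties": {"is_confidential": True},
            "terms": [
                {"name": "ExampleTerm", "custom_properties": {"is_confidential": False}}
            ],
        }
    ],
}

fixed = stringify_glossary(sample)
print(fixed["version"])                                          # 1 (now a str)
print(fixed["nodes"][0]["custom_properties"]["is_confidential"])  # True (now a str)
```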

I am really happy to know that pydantic v2 is already on your radar, but I would suggest changing your current strategy by following the official migration guide (Migration Guide - Pydantic) and updating your v1 imports.
It improves code visibility, encourages incremental migration steps, and avoids potential Python packaging problems.
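
For context, the pydantic migration guide documents importing the v1 API from the `pydantic.v1` namespace (for example `from pydantic.v1 import BaseModel`), which lets v1-style models keep working with pydantic 2.x installed. The sketch below is one way to express that with a fallback for older 1.10.x releases that lack the `pydantic.v1` namespace; `load_pydantic_v1_api` is a hypothetical helper, not something DataHub ships.

```python
import importlib


def load_pydantic_v1_api():
    """Return a module exposing the pydantic v1 API, or None if pydantic is not installed.

    Prefers `pydantic.v1` (available with pydantic 2.x) and falls back to the
    top-level `pydantic` package, which is the v1 API on older 1.10.x installs.
    """
    for name in ("pydantic.v1", "pydantic"):
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


pydantic_v1 = load_pydantic_v1_api()
if pydantic_v1 is not None:
    # A v1-style model: v1 coerces the int 1 to the string "1" for a str field.
    class ExampleConfig(pydantic_v1.BaseModel):
        version: str

    print(ExampleConfig(version=1).version)
```

In application code the more common form is a plain `try: from pydantic.v1 import ...` / `except ImportError: from pydantic import ...` at the top of the module; the function form here only keeps the sketch runnable when pydantic is absent.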

Cc: <@U01GZEETMEZ>

We apply this tag to many ingestion sources https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/setup.py#L58

Broadly, some parts of the codebase are compatible with both pydantic v1 and v2, and some are not

In general, we recommend using `pip install acryl-datahub[plugin]` - this will ensure that we pull in the right set of dependencies

I’m not sure what the equivalent is for pipx