Dealing with Pydantic Schema Generation Error and Dependency Versions in DataHub Ingestion

Original Slack Thread

hi all - we started getting a bunch of errors last week (not sure which day, logs only go back till the 28th) during ingestion of glue data sources.
```
pydantic.errors.PydanticSchemaGenerationError: Unable to generate pydantic-core schema for datahub.utilities.lossy_collections.LossyList[str]. Set arbitrary_types_allowed=True in the model_config to ignore this error or implement __get_pydantic_core_schema__ on your type to fully support it.

If you got this error by calling handler(<some type>) within __get_pydantic_core_schema__ then you likely need to call handler.generate_schema(<some type>) since we do not call __get_pydantic_core_schema__ on <some type> otherwise to avoid infinite recursion.
```
i suspect this has to do with the fact that dependency versions aren't pinned in datahub-actions when it does a dependency install, and pydantic 2.1.0 and 2.1.1 were released july 25th.

we’re still on 0.10.1 currently. any recommendations for how to fix this?

Hi everyone. I have the same issue

<@U03PJBJMJKY> If you are running ingestion using the CLI, you can manually install that dependency; otherwise you need to upgrade DataHub to the latest version

woof okay, how can we prevent forced immediate upgrades in the future?

Hello! Is there a way to fix this without having to upgrade?

<@U01GZEETMEZ> might help you

I upgraded DataHub to the latest version and the issue got resolved. I still don't understand why this error occurred, though.

in previous versions the pydantic dependency was not pinned

Some context on what this error is about - about 3 months ago, I proactively added a pydantic<2 restriction to our dependencies in prep for their release of pydantic 2. (PR)

Any acryl-datahub versions released prior to that will, when triggered via UI ingestion, automatically pip install the latest version of dependencies, which includes pydantic 2, and then crash with this error :disappointed:
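To make that concrete, here's a rough sketch of why the missing upper bound matters. This is not pip's real resolver, just a toy stand-in: `pick_latest` and the `AVAILABLE` list are hypothetical names made up for illustration, and the version list is only a small sample.

```python
from typing import Optional, Sequence, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Turn a plain dotted version like '1.10.12' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def pick_latest(available: Sequence[str], below_major: Optional[int] = None) -> str:
    """Pick the newest version, optionally restricted to majors below `below_major`.

    Toy stand-in for a resolver: with no upper bound it simply takes the newest release.
    """
    candidates = [
        v for v in available
        if below_major is None or parse(v)[0] < below_major
    ]
    return max(candidates, key=parse)


# Illustrative sample of pydantic releases (assumption: not a complete list).
AVAILABLE = ["1.10.12", "2.0.3", "2.1.0", "2.1.1"]

print(pick_latest(AVAILABLE))                 # no bound: picks 2.1.1
print(pick_latest(AVAILABLE, below_major=2))  # with a <2 bound: picks 1.10.12
```

Without the bound, the newest release wins, which is how older acryl-datahub versions end up with pydantic 2.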

So what can we do? One option is to set the “cli version” of that ingestion to something slightly newer with the <2 requirement, but stay with the older server version. You’d probably have decent success using 0.10.4.x cli releases with 0.10.1. [Of course, we still recommend keeping up with the latest releases - 0.10.5 has some important platform stability fixes and some nifty new features]

The other option is to run ingestion from the CLI for the time being - that way, you can `pip install 'acryl-datahub==<some version>' 'pydantic<2'` and avoid the dependency issue
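If you go the CLI route, a quick sanity check like the one below can confirm the environment really ended up with pydantic 1.x before you kick off ingestion. It's a minimal sketch using only the standard library; the helper names `is_pre_v2` and `installed_pydantic_version` are made up for this example and are not part of DataHub.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def is_pre_v2(version_string: str) -> bool:
    """Return True if the major version is below 2, e.g. '1.10.12'."""
    return int(version_string.split(".")[0]) < 2


def installed_pydantic_version() -> Optional[str]:
    """Return the installed pydantic version, or None if it is not installed."""
    try:
        return version("pydantic")
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_pydantic_version()
    if found is None:
        print("pydantic is not installed in this environment")
    elif is_pre_v2(found):
        print(f"pydantic {found} satisfies pydantic<2")
    else:
        print(f"pydantic {found} is 2.x or newer; older acryl-datahub releases may crash")
```

Run it with the same Python interpreter you use for the datahub CLI, otherwise it may be checking a different environment.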

We’re working on some improvements that will let you add extra pip requirements when running ingestion via the UI - that will help us solve these issues more easily in the future. But those are still a work in progress for now

ahhhhh okay thanks, that’s great context. was this posted anywhere that i missed? just trying to understand if there’s somewhere else i should watch to get notifications about potential upcoming breaking changes or not

It wasn’t explicitly posted anywhere, but there were a few other threads about it when the issue first popped up in early July. In general breaking changes are logged in the updating DataHub doc, but this particular one was more a dependency issue than a breaking change in DataHub itself and so didn’t get mentioned there

okay. a bit of unrequested feedback: i think in the future when ticking time bombs are discovered a post in #announcements or similar would be nice, because it’s hard to read all of the threads in all of the channels. thanks for explaining it to me though, much appreciated!

Yep absolutely - if these things happen in the future, we’ll make an announcement about them. Thanks for the feedback