Troubleshooting intermittent ingest failures with a NullPointerException in DataHub 0.10.5

Original Slack Thread

Hi. I have a couple of ingests failing with this error:

```
'info': {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
         'message': 'java.lang.NullPointerException',
         'status': 500,
```
What's strange is that I have several separate ingest jobs for separate schemas within the same database, and some of them worked fine while others failed with this error.
Please advise on what to do.
DataHub version 0.10.5. Full log attached.

Hey Nadia! Sorry for the delayed response here… were you able to get this resolved?

<@U0121TRV0FL> sometimes it still shows up, but usually it goes away if I rerun the ingest job manually once or twice. So it kind of comes and goes :woman-shrugging:

Hmm… very strange! Please let us know if you find a failure pattern!

Hi <@U0121TRV0FL> this error still shows up sometimes, and I noticed that it does cause issues for us, since it produces some broken entities. Could you please point me to whoever can help with this?
I’m attaching full GMS logs as well as ingest job logs.
It happens occasionally for some of the daily ingest jobs. They fail with this error maybe once every week or two, sometimes twice in a row, sometimes only once, and then run successfully again until the next failure. I haven’t been able to find a pattern. The jobs are scheduled and always run at the same time.

We’re on DataHub 0.10.5, deployed via Kubernetes.

Hi <@U0121TRV0FL> can you please pass this issue on to someone from the tech side of the team? It is still happening, even with the CLI set to 0.12.1 :frowning:

hey Nadia! so i see in your logs that this NPE is being raised by a section of code that should be updated by now. you say that you still get this error after updating DataHub to 0.12.1?

if so could you send the error logs from there as well? <@U03LYB2ESJ0>

<@U03BEML16LB> here you go, the fresh ones from today, CLI 0.12.1.1.

<@UV5UEC3LN> can you make heads or tails of where or why this NPE could be happening? in 0.10.5 it looks like EbeanAspectDao.runInTransactionWithRetry(EbeanAspectDao.java:531) is the culprit, and the exact location is this line: `if (sqlState.equals("40001")) {` (https://github.com/datahub-project/datahub/blob/v0.10.5/metadata-io/src/main/java/com/linkedin/metadata/entity/ebean/EbeanAspectDao.java#L531C13-L531C44). however in 0.12.1 we still get an NPE, but i’m not seeing a specific line being called out. it might be in a similar place (after calling _entityService.ingestProposal)
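
For readers following along, here is a minimal sketch of how a line like the one quoted above can raise an NPE. It assumes the failure comes from `sqlState` being null, which the thread does not confirm; `SQLException.getSQLState()` is allowed to return null when the driver supplies no SQLSTATE. The class and method names (`SqlStateCheck`, `isSerializationFailure`) are hypothetical and are not DataHub code.

```java
import java.sql.SQLException;

public class SqlStateCheck {

    // Hypothetical helper (not part of DataHub): returns true only when the
    // exception carries SQLSTATE 40001 (serialization failure).
    // Calling equals on the constant means a null SQLSTATE yields false
    // instead of throwing.
    public static boolean isSerializationFailure(SQLException e) {
        return "40001".equals(e.getSQLState());
    }

    public static void main(String[] args) {
        // An SQLException built without a SQLSTATE reports null from getSQLState().
        SQLException noState = new SQLException("driver supplied no SQLSTATE");
        SQLException conflict = new SQLException("serialization failure", "40001");

        // Value-first comparison, same shape as the line quoted in the thread:
        // it dereferences the null SQLSTATE and throws.
        try {
            boolean matched = noState.getSQLState().equals("40001");
            System.out.println("no exception, matched=" + matched);
        } catch (NullPointerException npe) {
            System.out.println("value-first equals threw NullPointerException");
        }

        // Constant-first comparison handles both cases without throwing.
        System.out.println("null SQLSTATE matches 40001: " + isSerializationFailure(noState));
        System.out.println("40001 SQLSTATE matches 40001: " + isSerializationFailure(conflict));
    }
}
```

If this assumption holds, it would also fit the intermittent pattern described earlier: the comparison only runs when a transaction fails, and it only throws when that particular failure carries no SQLSTATE.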

Meanwhile I’ll try setting the CLI to 0.12.1.4 and see if it helps.

Do you have the GMS logs from the more recent run?

Also, is 0.10.5 your client version or your server version? If it is the client version, what server version were you running for this run?

Note: if it is the line that Chris called out, that line is not present in server version 0.12.1, so if you have upgraded both your server and your client, it must at minimum be a different error.