Troubleshooting Looker Ingestion Error: Max Retries Exceeded on DataHub GMS

Original Slack Thread

Hey folks! I have a Looker ingestion job that previously ran, but now I’m getting the following error:

```
    'info': {'message': "HTTPConnectionPool(host='datahub-datahub-gms', port=8080): Max retries exceeded with url: "
                        "/entities?action=ingest (Caused by ResponseError('too many 500 error responses'))",
             'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,shiphero.explore.fulfillment_members,PROD)'}}]
```
Anyone encounter something similar before?
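One quick way to narrow a failure like this down is to send a single trivial write to GMS and see whether it also returns a 500, which separates "GMS is rejecting everything" from "GMS is choking on the Looker payloads." Below is a minimal sketch, assuming the acryl-datahub client is installed and the GMS address from the error message above; the dataset URN is a made-up placeholder.

```python
# Minimal write test against GMS. Names here are illustrative: adjust
# gms_server to your deployment; the URN is a throwaway placeholder.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://datahub-datahub-gms:8080")
emitter.test_connection()  # raises if GMS is unreachable or misconfigured

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:looker,smoke_test.dummy,PROD)",
    aspect=DatasetPropertiesClass(description="ingestion smoke test"),
)
emitter.emit(mcp)  # a 500 here points at GMS itself rather than the Looker source
```

If this tiny emit succeeds while the Looker run keeps failing, the problem is more likely specific to those payloads; if it also fails, the server or one of its dependencies is the place to look.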

It looks like it’s hitting 500 errors on the GMS side. Have you inspected the GMS logs to see what the source of the 500 error is?

Hi <@UV5UEC3LN>! Sorry for the delay here. Got distracted with some other data engineering objectives…

Here’s what I’m seeing in the logs on the GMS side: [attachment]

Hmm, there isn’t a stacktrace in the area of this log?

Doesn’t seem to be, sadly.

I may need to log into the deployment and see if I can get more detail that way.

Any hypotheses as to what could be happening here? May take me a bit to dig in here.

Unfortunately, no. Basically, all this tells me is that something went wrong with the ingestion and the server threw an error when trying to process it… which you already know :sweat_smile: 500 is a generic server error, so there should be a stacktrace thrown during execution of the ingestion that would give more information

It could be anything: the server isn’t correctly configured, it’s experiencing problems with the storage layer, the message is malformed, etc.
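Ruling out the first of those is cheap: GMS typically answers /config (and, in most deployments, /health) on the same port the ingestion sink talks to, so hitting those from the ingestion environment confirms the server is up and serving its expected configuration. A rough sketch, assuming the service name and port from the error message:

```python
# Quick liveness/config probe against GMS from wherever the ingestion runs.
# The host/port come from the error message above; adjust for your deployment.
import requests

base = "http://datahub-datahub-gms:8080"
for path in ("/health", "/config"):
    resp = requests.get(f"{base}{path}", timeout=10)
    print(path, resp.status_code, resp.text[:200])
```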

Good to know. I’ll update this thread once I find valuable troubleshooting data

Ah, ha! Found it, <@UV5UEC3LN>. Seems like it’s an issue with MetadataChangeLog, I think? [attachment]

Interesting, seems like an intermittent problem with your Kafka connection. The producer is failing to produce messages to the MCL topic

Your ingestion speeds are really slow, which makes me think there is a long network delay between GMS and your Kafka instance. Are GMS and Kafka in the same region? Or is this a cross-region cluster?
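One way to test the network-delay hypothesis is to measure produce round-trip latency to the brokers from the same network GMS runs in. A rough sketch using confluent-kafka follows; the bootstrap address is a placeholder, and it deliberately writes to a throwaway probe topic rather than the real MetadataChangeLog topic so downstream consumers never see a non-Avro message.

```python
# Measures per-message delivery (ack) latency from this host to the Kafka brokers.
# bootstrap.servers and the topic name are placeholders; run this from the same
# network/namespace as GMS for a meaningful comparison.
import time
from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({"bootstrap.servers": "your-kafka-broker:9092"})
TOPIC = "gms_latency_probe"  # throwaway topic; do not probe the real MCL topic

for i in range(10):
    start = time.time()
    result = {}

    def on_delivery(err, msg, start=start, result=result):
        result["err"] = err
        result["latency"] = time.time() - start

    producer.produce(TOPIC, value=b"probe", on_delivery=on_delivery)
    producer.flush(timeout=30)  # block until the broker acks (or the wait times out)
    print(f"message {i}: err={result.get('err')} latency={result.get('latency')}")
```

Within a single region you would typically expect acks in milliseconds; latencies anywhere near the multi-second request times below would point at the network path or an overloaded broker.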

```
2023-12-19 17:03:19,400 [pool-13-thread-2] INFO c.l.m.filter.RestliLoggingFilter:55 - POST /entities?action=ingest - ingest - 200 - 5984ms
```
^ This successful one took 6s

and the one that failed did so after 13s, which might be hitting the backoff/retry limit or something

All in the same region, us-east4.

That is, both GMS and Kafka are in the same region.

Weird. Are they in different VPCs, or do they have to cross any other network boundary?

If it’s not explained by the network, definitely check the broker’s metrics and see if it’s getting overloaded

Our deployment is in GCP.

What’s really weird is that when I initially set up the Looker ingestion, it worked just fine, and then all of a sudden it just started failing. Not great intel for troubleshooting, but again, it’s worth noting that when it was set up initially, everything was fine.

How do I check the broker metrics?
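The authoritative numbers live in the brokers’ own metrics (JMX beans such as request latency and request-handler idle percent, or whatever dashboard your Kafka deployment on GCP exposes), but a quick first pass can be done from any machine that can reach the brokers. Here is a hedged sketch with the confluent-kafka AdminClient; the bootstrap address is a placeholder and the topic name assumes DataHub’s default versioned MCL topic.

```python
# First-pass broker/topic health check: list brokers and flag under-replicated
# partitions on the MCL topic, a common symptom of an overloaded broker.
# bootstrap.servers is a placeholder; the topic name is DataHub's usual default.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "your-kafka-broker:9092"})
metadata = admin.list_topics(timeout=10)

print("brokers:", [f"{b.id}@{b.host}:{b.port}" for b in metadata.brokers.values()])

topic = metadata.topics.get("MetadataChangeLog_Versioned_v1")
if topic is None:
    print("MCL topic not found under the default name; check your deployment's topic config")
else:
    for pid, part in sorted(topic.partitions.items()):
        under_replicated = len(part.isrs) < len(part.replicas)
        print(f"partition {pid}: leader={part.leader} replicas={part.replicas} "
              f"isrs={part.isrs} under_replicated={under_replicated}")
```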