Troubleshooting Looker Ingestion Error: Max Retries Exceeded on DataHub GMS

Original Slack Thread

Hey folks! I have a Looker ingestion job that previously ran, but now I’m getting the following error:

```
    'info': {'message': "HTTPConnectionPool(host='datahub-datahub-gms', port=8080): Max retries exceeded with url: "
                        "/entities?action=ingest (Caused by ResponseError('too many 500 error responses'))",
             'id': 'urn:li:dataset:(urn:li:dataPlatform:looker,shiphero.explore.fulfillment_members,PROD)'}}]
```
Anyone encounter something similar before?
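One quick way to narrow a failure like this down is to send a single trivial write to GMS and see whether it also returns a 500, which separates "GMS is rejecting everything" from "GMS is choking on the Looker payloads." Below is a minimal sketch, assuming the acryl-datahub client is installed and the GMS address from the error message above; the dataset URN is a made-up placeholder.

```python
# Minimal write test against GMS. Names here are illustrative: adjust
# gms_server to your deployment; the URN is a throwaway placeholder.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import DatasetPropertiesClass

emitter = DatahubRestEmitter(gms_server="http://datahub-datahub-gms:8080")
emitter.test_connection()  # raises if GMS is unreachable or misconfigured

mcp = MetadataChangeProposalWrapper(
    entityUrn="urn:li:dataset:(urn:li:dataPlatform:looker,smoke_test.dummy,PROD)",
    aspect=DatasetPropertiesClass(description="ingestion smoke test"),
)
emitter.emit(mcp)  # a 500 here points at GMS itself rather than the Looker source
```

If this tiny emit succeeds while the Looker run keeps failing, the problem is more likely specific to those payloads; if it also fails, the server or one of its dependencies is the place to look.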

It looks like it’s hitting 500 errors on the GMS side. Have you inspected the GMS logs to see what the source of the 500 error is?

Hi <@UV5UEC3LN>! Sorry for the delay here. Got distracted with some other data engineering objectives…

Here’s what I’m seeing in the logs on the GMS side: [attachment]

Hmm, there isn’t a stacktrace in the area of this log?

Doesn’t seem to be, sadly.

I may need to log into the deployment and see if I can get more detail that way.

Any hypotheses as to what could be happening here? May take me a bit to dig in here.

Unfortunately, no. Basically, all this tells me is that something went wrong with the ingestion and the server threw an error when trying to process it… which you already know :sweat_smile: 500 is a generic server error, so there should be a stacktrace thrown during execution of the ingestion that would give more information

It could be anything: the server isn’t correctly configured, it’s experiencing problems with the storage layer, the message is malformed, etc.
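Ruling out the first of those is cheap: GMS typically answers /config (and, in most deployments, /health) on the same port the ingestion sink talks to, so hitting those from the ingestion environment confirms the server is up and serving its expected configuration. A rough sketch, assuming the service name and port from the error message:

```python
# Quick liveness/config probe against GMS from wherever the ingestion runs.
# The host/port come from the error message above; adjust for your deployment.
import requests

base = "http://datahub-datahub-gms:8080"
for path in ("/health", "/config"):
    resp = requests.get(f"{base}{path}", timeout=10)
    print(path, resp.status_code, resp.text[:200])
```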

Good to know. I’ll update this thread once I find valuable troubleshooting data

Ah, ha! Found it, <@UV5UEC3LN>. Seems like it’s an issue with MetadataChangeLog, I think? [attachment]

Interesting, seems like an intermittent problem with your Kafka connection. The producer is failing to produce messages to the MCL topic

Your ingestion speeds are really slow, which makes me think there is a long network delay between GMS and your Kafka instance. Are GMS and Kafka in the same region? Or is this a cross-region cluster?
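One way to test the network-delay hypothesis is to measure produce round-trip latency to the brokers from the same network GMS runs in. A rough sketch using confluent-kafka follows; the bootstrap address is a placeholder, and it deliberately writes to a throwaway probe topic rather than the real MetadataChangeLog topic so downstream consumers never see a non-Avro message.

```python
# Measures per-message delivery (ack) latency from this host to the Kafka brokers.
# bootstrap.servers and the topic name are placeholders; run this from the same
# network/namespace as GMS for a meaningful comparison.
import time
from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({"bootstrap.servers": "your-kafka-broker:9092"})
TOPIC = "gms_latency_probe"  # throwaway topic; do not probe the real MCL topic

for i in range(10):
    start = time.time()
    result = {}

    def on_delivery(err, msg, start=start, result=result):
        result["err"] = err
        result["latency"] = time.time() - start

    producer.produce(TOPIC, value=b"probe", on_delivery=on_delivery)
    producer.flush(timeout=30)  # block until the broker acks (or the wait times out)
    print(f"message {i}: err={result.get('err')} latency={result.get('latency')}")
```

Within a single region you would typically expect acks in milliseconds; latencies anywhere near the multi-second request times below would point at the network path or an overloaded broker.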

```
2023-12-19 17:03:19,400 [pool-13-thread-2] INFO c.l.m.filter.RestliLoggingFilter:55 - POST /entities?action=ingest - ingest - 200 - 5984ms
```
^ This successful one took 6s

and the one that failed did so after 13s, which might be hitting the backoff/retry limit or something

All in the same region, us-east4.

That is, both GMS and Kafka are in the same region.

Weird. Are they in different VPCs, or do they have to cross any other network boundary?

If it’s not explained by the network, definitely check the broker’s metrics and see if it’s getting overloaded

Our deployment is in GCP.

What’s really weird is that when I initially set up the Looker ingestion, it worked just fine, and then all of a sudden it just started failing. Not great intel for troubleshooting, but again, it’s worth noting that when it was set up initially, everything was fine.

How do I check the broker metrics?
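The authoritative numbers live in the brokers’ own metrics (JMX beans such as request latency and request-handler idle percent, or whatever dashboard your Kafka deployment on GCP exposes), but a quick first pass can be done from any machine that can reach the brokers. Here is a hedged sketch with the confluent-kafka AdminClient; the bootstrap address is a placeholder and the topic name assumes DataHub’s default versioned MCL topic.

```python
# First-pass broker/topic health check: list brokers and flag under-replicated
# partitions on the MCL topic, a common symptom of an overloaded broker.
# bootstrap.servers is a placeholder; the topic name is DataHub's usual default.
from confluent_kafka.admin import AdminClient

admin = AdminClient({"bootstrap.servers": "your-kafka-broker:9092"})
metadata = admin.list_topics(timeout=10)

print("brokers:", [f"{b.id}@{b.host}:{b.port}" for b in metadata.brokers.values()])

topic = metadata.topics.get("MetadataChangeLog_Versioned_v1")
if topic is None:
    print("MCL topic not found under the default name; check your deployment's topic config")
else:
    for pid, part in sorted(topic.partitions.items()):
        under_replicated = len(part.isrs) < len(part.replicas)
        print(f"partition {pid}: leader={part.leader} replicas={part.replicas} "
              f"isrs={part.isrs} under_replicated={under_replicated}")
```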