Troubleshooting BigQuery Ingestion Failure with 'Failed to produce MCLs' Error

Original Slack Thread

Hi everyone,

Our scheduled ingestion for BigQuery started failing with `Failed to produce MCLs` after many days of successful runs, with no changes on our side:

```
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS',
  {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
   'message': 'java.lang.RuntimeException: Failed to produce MCLs',
   'status': 500, 'stackTrace': ...})

com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: java.lang.RuntimeException: Failed to produce MCLs
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)
	at com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:206)
	at jdk.internal.reflect.GeneratedMethodAccessor233.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)
	at com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)
	at com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:753)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:731)
	at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2108)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)
	at com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:270)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:294)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:262)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:232)
	at com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:215)
	at com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:171)
	at com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132)
	at com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)
	at com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
	at com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
	at com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:106)
	at com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)
	at com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)
	at com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)
	at com.linkedin.r2.transport.http.server.AbstractR2Servlet.service(AbstractR2Servlet.java:105)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at com.linkedin.restli.server.RestliHandlerServlet.service(RestliHandlerServlet.java:21)
	at com.linkedin.restli.server.RestliHandlerServlet.handleRequest(RestliHandlerServlet.java:26)
	at org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:73)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1631)
	at com.datahub.auth.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:102)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: Failed to produce MCLs
	at com.linkedin.metadata.entity.EntityServiceImpl.emitMCL(EntityServiceImpl.java:684)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestAspects(EntityServiceImpl.java:543)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposalSync(EntityServiceImpl.java:855)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposal(EntityServiceImpl.java:757)
	at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$2(AspectResource.java:225)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
	... 88 more
```
In the ingestion debug logs there are some messages like `{datahub.ingestion.source.bigquery_v2.usage:794} - Unable to find destination table in event AuditEvent`, which might be related, but I'm not sure.
Any ideas on the error cause and how to fix it are appreciated. Thanks in advance :slightly_smiling_face:!

DataHub is deployed on Kubernetes using helm chart version `0.2.188`.
Here is the ingestion source configuration (we are managing these programmatically using the Python SDK): ![attachment](https://files.slack.com/files-pri/TUMKD5EGJ-F05TR1DM7FX/ingestion-source-config.json?t=xoxe-973659184562-6705490291811-6708051934148-dd1595bd5f63266bc09e6166373c7a3c)

I am seeing this exact same error, but I am not seeing the destination table error. I am seeing this error for Looker + Snowflake ingestions managed through the UI. Our global DataHub version is 0.11.0, and the actions pod is on 0.0.11.

<@U04JQ3J6HJN> we are also using 0.0.11 for actions.
Did you find out what caused the error and were you able to fix it?

no. <@U01GZEETMEZ> helped me with something similar previously - is 0.0.11 the best version of the actions pod to use??

also as of this morning this error is occurring on ALL of my ingestion sources

i updated actions to v0.1.13 and this did not resolve the issue

although it seems like the issue is actually with the GMS pod… the actions pod is sending a request to the GMS pod to ingest the metadata, and the GMS pod for some reason is unable to process said request (specifically, it is unable to produce MCLs as per [this article](https://datahubproject.io/docs/advanced/mcp-mcl)) and so it returns the 500.

here’s our GMS values, version is v0.11.0. it seems like it has plenty of memory, but maybe it needs more memory or cpu?

```yaml
  enabled: true
  image:
    repository: linkedin/datahub-gms
    # tag: "v0.10.0"  # defaults to .global.datahub.version
  imagePullSecrets:
    - name: docker-io-creds
  resources:
    limits:
      memory: 5Gi
    requests:
      cpu: 100m
      memory: 4Gi
  # Optionally set a GMS-specific SQL login (defaults to global login)
  # sql:
  #   datasource:
  #     username: "gms-login"
  #     password:
  #       secretRef: gms-secret
  #       secretKey: gms-password
```

tagging my colleagues <@U05TT312DN2> and <@U04PSSFGYL9> for visibility

I think I may have been able to trace the error back to this:

```
org.apache.kafka.common.errors.RecordTooLargeException: The message is 2225348 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
2023-09-26 18:37:03,523 [qtp350068407-272] ERROR c.l.m.entity.EntityServiceImpl:678 - Failed to produce MCLs: [UpsertBatchItem{urn=urn:li:dataHubExecutionRequest:e59963c1-b7fd-4cb2-8c5d-2ea80292eeb5, aspectName='dataHubExecutionRequestResult', systemMetadata={lastObserved=1695753423409, lastRunId=no-run-id-provided, runId=no-run-id-provided}, aspect={report=~~~~ Execution Summary - RUN_INGEST ~~~~
```
so we need to let Kafka accept larger messages
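The error above is simple arithmetic: the serialized aspect is 2,225,348 bytes, while the Kafka producer's default `max.request.size` is 1,048,576 bytes (1 MiB). As a rough illustration (the function name and structure here are hypothetical, not part of DataHub), a payload can be size-checked before emitting:

```python
import json

# Kafka's default producer limit for max.request.size is 1 MiB.
DEFAULT_MAX_REQUEST_SIZE = 1_048_576

def exceeds_kafka_limit(payload: dict, max_request_size: int = DEFAULT_MAX_REQUEST_SIZE) -> bool:
    """Rough check: does the JSON-serialized payload exceed the producer limit?

    This slightly under-counts (Kafka adds record headers and batching
    overhead), so treat the result as a lower bound.
    """
    serialized = json.dumps(payload).encode("utf-8")
    return len(serialized) > max_request_size

# A huge ingestion report, like the dataHubExecutionRequestResult aspect in the
# log above, blows past the default: ~2.2 MB > 1,048,576 bytes.
big_report = {"report": "x" * 2_225_000}
print(exceeds_kafka_limit(big_report))  # True
print(exceeds_kafka_limit({"report": "ok"}))  # False
```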

Thanks for the updates and for cross-posting <@U04JQ3J6HJN>!
I’d be interested to hear whether increasing resources solves the issue for you.
Looking at our resource monitoring, all pods are within their set limits during the ingestion run.

cc <@U03MF8MU5P0> - do we have docs on managing the max kafka message size?

Separately, we have some fixes in progress that will help solve the root cause, which is that the ingestion logs are too large

well, I saw in another thread about updating this variable in the gms pod:

```yaml
    - name: SPRING_KAFKA_PRODUCER_PROPERTIES_MAX_REQUEST_SIZE
      value: "5242880"
```
and I updated that, but it didn't seem to be enough, because now I'm getting basically the same error except it says "The request included a message larger than the max message size the server will accept."

Yeah there’s a few different things to configure - you might also need to configure the kafka broker to accept larger messages

how do I do that?? we are using the Kafka from datahub-prerequisites and I don't see a lot of things I can pass through there:

```yaml
kafka:
  enabled: true
  imagePullSecrets:
    - name: docker-io-creds
```

<@U04JQ3J6HJN> in our gms logs we are actually getting the same `RecordTooLargeException: The request included a message larger than the max message size the server will accept.`

Checking the Artifact Hub page for the chart and `charts/prerequisites/values.yaml` in the acryldata/datahub-helm repo on GitHub, it seems we can pass `extraConfig` to set `maxMessageBytes` (I haven't tried it yet though).
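For reference, a hypothetical sketch of what that could look like in the prerequisites values (unverified; this assumes the chart forwards `extraConfig` to the underlying Bitnami Kafka chart, which appends it to the broker's `server.properties`):

```yaml
kafka:
  enabled: true
  imagePullSecrets:
    - name: docker-io-creds
  # Assumed pass-through to the Bitnami Kafka chart's broker config;
  # 5242880 bytes = 5 MiB, matching the producer-side setting above.
  extraConfig: |
    message.max.bytes=5242880
    replica.fetch.max.bytes=5242880
```

The broker's `message.max.bytes` (and `replica.fetch.max.bytes`) must be at least as large as the producer's `max.request.size`, otherwise the broker rejects what the producer is now willing to send.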

yeah i tried that… didnt seem to work haha

a couple of other folks are experiencing this issue, we should consolidate threads

Increasing the message size limit on the Kafka brokers is one option; another might be to enable message compression by adding the Kafka environment variable on all message producers (GMS at least, or the mce/mae consumers too if running standalone). The env var is `KAFKA_PRODUCER_COMPRESSION_TYPE`, documented [here](https://www.confluent.io/blog/apache-kafka-message-compression/#configuring-compression-type), and compression can also be configured on the server side per Confluent's docs.
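Compression helps here because the oversized payloads are ingestion reports, which are highly repetitive text. A quick stdlib sketch (using gzip purely as an illustration; Kafka producers would use whatever codec `KAFKA_PRODUCER_COMPRESSION_TYPE` names, e.g. snappy or lz4):

```python
import gzip
import json

# A ~2.2 MB execution report, like the one in the RecordTooLargeException above.
report = json.dumps(
    {"report": "Execution Summary - RUN_INGEST\n" * 70_000}
).encode("utf-8")
compressed = gzip.compress(report)

# Raw payload exceeds the 1 MiB default; the compressed form fits easily.
print(len(report) > 1_048_576)      # True
print(len(compressed) < 1_048_576)  # True
```

Note that with compression the broker-side `message.max.bytes` check applies to the compressed size, so repetitive reports that blow the limit raw can slip under it compressed.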

I’m running into very much the same issue (while trying to ingest clickhouse-usage configured via the UI). I’ve increased the max message size to 5MB on all 3 parts (producer, consumer & Kafka itself) and I no longer get the RecordTooLargeException, but I do get the exact same stack trace on the gms that Mona reported initially:

```
	at com.linkedin.metadata.entity.EntityServiceImpl.emitMCL(EntityServiceImpl.java:684)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestAspects(EntityServiceImpl.java:543)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposalSync(EntityServiceImpl.java:855)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposal(EntityServiceImpl.java:757)
	at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$2(AspectResource.java:225)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
	... 88 common frames omitted
```
Any ideas/suggestions? <@U05Q7965PQA> <@U04JQ3J6HJN> did you already find a solution for your cases?