Troubleshooting BigQuery Ingestion Failure with 'Failed to produce MCLs' Error

Original Slack Thread

Hi everyone,

Our scheduled ingestion for BigQuery started failing with `Failed to produce MCLs` after many days of successful runs, with no changes on our side:

```
datahub.configuration.common.OperationalError: ('Unable to emit metadata to DataHub GMS',
  {'exceptionClass': 'com.linkedin.restli.server.RestLiServiceException',
   'message': 'java.lang.RuntimeException: Failed to produce MCLs',
   'status': 500, 'stackTrace': ...})

com.linkedin.restli.server.RestLiServiceException [HTTP Status:500]: java.lang.RuntimeException: Failed to produce MCLs
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:42)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:50)
	at com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:206)
	at jdk.internal.reflect.GeneratedMethodAccessor233.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177)
	at com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333)
	at com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:753)
	at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:731)
	at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2108)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72)
	at com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55)
	at com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:270)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:294)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:262)
	at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:232)
	at com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:215)
	at com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:171)
	at com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132)
	at com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70)
	at com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
	at com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55)
	at com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50)
	at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146)
	at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132)
	at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62)
	at com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:106)
	at com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75)
	at com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61)
	at com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101)
	at com.linkedin.r2.transport.http.server.AbstractR2Servlet.service(AbstractR2Servlet.java:105)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at com.linkedin.restli.server.RestliHandlerServlet.service(RestliHandlerServlet.java:21)
	at com.linkedin.restli.server.RestliHandlerServlet.handleRequest(RestliHandlerServlet.java:26)
	at org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:73)
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
	at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:799)
	at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1631)
	at com.datahub.auth.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:102)
	at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:193)
	at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1601)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:548)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:600)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1624)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:233)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1440)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:188)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:501)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1594)
	at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:186)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1355)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:191)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
	at org.eclipse.jetty.server.Server.handle(Server.java:516)
	at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:487)
	at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:732)
	at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:479)
	at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:277)
	at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
	at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:105)
	at org.eclipse.jetty.io.ChannelEndPoint$1.run(ChannelEndPoint.java:104)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:338)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:315)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:173)
	at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:131)
	at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:409)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:883)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1034)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.RuntimeException: Failed to produce MCLs
	at com.linkedin.metadata.entity.EntityServiceImpl.emitMCL(EntityServiceImpl.java:684)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestAspects(EntityServiceImpl.java:543)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposalSync(EntityServiceImpl.java:855)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposal(EntityServiceImpl.java:757)
	at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$2(AspectResource.java:225)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
	... 88 more
```
In the ingestion debug logs there are some messages like `{datahub.ingestion.source.bigquery_v2.usage:794} - Unable to find destination table in event AuditEvent`, which might be related, but I'm not sure.
Any ideas on the error cause and how to fix it are appreciated. Thanks in advance :slightly_smiling_face:!

DataHub is deployed on Kubernetes using helm chart version `0.2.188`.
Here is the ingestion source configuration (we are managing these programmatically using the Python SDK): ![attachment](https://files.slack.com/files-pri/TUMKD5EGJ-F05TR1DM7FX/ingestion-source-config.json?t=xoxe-973659184562-6705490291811-6708051934148-dd1595bd5f63266bc09e6166373c7a3c)

I am seeing this exact same error, but I am not seeing the destination table error. I am seeing this error for Looker + Snowflake ingestions managed through the UI. Our global DataHub version is 0.11.0, and the actions pod is on 0.0.11.

<@U04JQ3J6HJN> we are also using 0.0.11 for actions.
Did you find out what caused the error and were you able to fix it?

no. <@U01GZEETMEZ> helped me with something similar previously - is 0.0.11 the best version of the actions pod to use??

also as of this morning this error is occurring on ALL of my ingestion sources

i updated actions to v0.1.13 and this did not resolve the issue

although it seems like the issue is actually with the GMS pod… the actions pod is sending a request to the GMS pod to ingest the metadata, and the GMS pod for some reason is unable to process said request (specifically, it is unable to produce MCLs as per [this article](https://datahubproject.io/docs/advanced/mcp-mcl)) and so it returns the 500.

here’s our GMS values, version is v0.11.0. it seems like it has plenty of memory, but maybe it needs more memory or cpu?

```yaml
  enabled: true
  image:
    repository: linkedin/datahub-gms
    # tag: "v0.10.0"  # defaults to .global.datahub.version
  imagePullSecrets:
    - name: docker-io-creds
  resources:
    limits:
      memory: 5Gi
    requests:
      cpu: 100m
      memory: 4Gi
  # Optionally set a GMS-specific SQL login (defaults to global login)
  # sql:
  #   datasource:
  #     username: "gms-login"
  #     password:
  #       secretRef: gms-secret
  #       secretKey: gms-password
```

tagging my colleagues <@U05TT312DN2> and <@U04PSSFGYL9> for visibility

I think I may have been able to trace the error back to this:

```
org.apache.kafka.common.errors.RecordTooLargeException: The message is 2225348 bytes when serialized which is larger than 1048576, which is the value of the max.request.size configuration.
2023-09-26 18:37:03,523 [qtp350068407-272] ERROR c.l.m.entity.EntityServiceImpl:678 - Failed to produce MCLs: [UpsertBatchItem{urn=urn:li:dataHubExecutionRequest:e59963c1-b7fd-4cb2-8c5d-2ea80292eeb5, aspectName='dataHubExecutionRequestResult', systemMetadata={lastObserved=1695753423409, lastRunId=no-run-id-provided, runId=no-run-id-provided}, aspect={report=~~~~ Execution Summary - RUN_INGEST ~~~~
```
so we need to let Kafka accept larger messages
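The error above is simple arithmetic: the serialized aspect is 2,225,348 bytes, while the Kafka producer's default `max.request.size` is 1,048,576 bytes (1 MiB). As a rough illustration (the function name and structure here are hypothetical, not part of DataHub), a payload can be size-checked before emitting:

```python
import json

# Kafka's default producer limit for max.request.size is 1 MiB.
DEFAULT_MAX_REQUEST_SIZE = 1_048_576

def exceeds_kafka_limit(payload: dict, max_request_size: int = DEFAULT_MAX_REQUEST_SIZE) -> bool:
    """Rough check: does the JSON-serialized payload exceed the producer limit?

    This slightly under-counts (Kafka adds record headers and batching
    overhead), so treat the result as a lower bound.
    """
    serialized = json.dumps(payload).encode("utf-8")
    return len(serialized) > max_request_size

# A huge ingestion report, like the dataHubExecutionRequestResult aspect in the
# log above, blows past the default: ~2.2 MB > 1,048,576 bytes.
big_report = {"report": "x" * 2_225_000}
print(exceeds_kafka_limit(big_report))  # True
print(exceeds_kafka_limit({"report": "ok"}))  # False
```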

Thanks for the updates and for cross-posting <@U04JQ3J6HJN>!
I’d be interested to hear whether increasing resources solves the issue for you.
Looking at our resource monitoring, all pods are within their set limits during the ingestion run.

cc <@U03MF8MU5P0> - do we have docs on managing the max kafka message size?

Separately, we have some fixes in progress that will help solve the root cause, which is that the ingestion logs are too large

well, I saw in another thread about updating this variable in the gms pod:

```yaml
    - name: SPRING_KAFKA_PRODUCER_PROPERTIES_MAX_REQUEST_SIZE
      value: "5242880"
```
and I updated that, but it didn't seem to be enough, because now I'm getting basically the same error except it says "The request included a message larger than the max message size the server will accept."

Yeah there’s a few different things to configure - you might also need to configure the kafka broker to accept larger messages

how do I do that?? we are using the Kafka from datahub-prerequisites and I don't see a lot of things I can pass through there:

```yaml
kafka:
  enabled: true
  imagePullSecrets:
    - name: docker-io-creds
```

<@U04JQ3J6HJN> in our gms logs we are actually getting the same `RecordTooLargeException: The request included a message larger than the max message size the server will accept.`

Checking the Artifact Hub page for the chart and `charts/prerequisites/values.yaml` in the acryldata/datahub-helm repo on GitHub, it seems we can pass `extraConfig` to set `maxMessageBytes` (I haven't tried it yet though).
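For reference, a hypothetical sketch of what that could look like in the prerequisites values (unverified; this assumes the chart forwards `extraConfig` to the underlying Bitnami Kafka chart, which appends it to the broker's `server.properties`):

```yaml
kafka:
  enabled: true
  imagePullSecrets:
    - name: docker-io-creds
  # Assumed pass-through to the Bitnami Kafka chart's broker config;
  # 5242880 bytes = 5 MiB, matching the producer-side setting above.
  extraConfig: |
    message.max.bytes=5242880
    replica.fetch.max.bytes=5242880
```

The broker's `message.max.bytes` (and `replica.fetch.max.bytes`) must be at least as large as the producer's `max.request.size`, otherwise the broker rejects what the producer is now willing to send.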

yeah i tried that… didnt seem to work haha

a couple of other folks are experiencing this issue, we should consolidate threads

Increasing the message size limit on the Kafka brokers is one option; another might be to enable message compression by adding the Kafka environment variable on all message producers (GMS at least, or the mce/mae consumers too if running standalone). The env var is `KAFKA_PRODUCER_COMPRESSION_TYPE`, documented [here](https://www.confluent.io/blog/apache-kafka-message-compression/#configuring-compression-type), and compression can also be configured on the server side per Confluent's docs.
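Compression helps here because the oversized payloads are ingestion reports, which are highly repetitive text. A quick stdlib sketch (using gzip purely as an illustration; Kafka producers would use whatever codec `KAFKA_PRODUCER_COMPRESSION_TYPE` names, e.g. snappy or lz4):

```python
import gzip
import json

# A ~2.2 MB execution report, like the one in the RecordTooLargeException above.
report = json.dumps(
    {"report": "Execution Summary - RUN_INGEST\n" * 70_000}
).encode("utf-8")
compressed = gzip.compress(report)

# Raw payload exceeds the 1 MiB default; the compressed form fits easily.
print(len(report) > 1_048_576)      # True
print(len(compressed) < 1_048_576)  # True
```

Note that with compression the broker-side `message.max.bytes` check applies to the compressed size, so repetitive reports that blow the limit raw can slip under it compressed.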

I’m running into very much the same issue (while trying to ingest clickhouse-usage configured via the UI). I’ve increased the max message size to 5MB on all 3 parts (producer, consumer & Kafka itself) and I no longer get the RecordTooLargeException, but I do get the exact same stack trace on the gms that Mona reported initially:

```
	at com.linkedin.metadata.entity.EntityServiceImpl.emitMCL(EntityServiceImpl.java:684)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestAspects(EntityServiceImpl.java:543)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposalSync(EntityServiceImpl.java:855)
	at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposal(EntityServiceImpl.java:757)
	at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$2(AspectResource.java:225)
	at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30)
	... 88 common frames omitted
```
Any ideas/suggestions? <@U05Q7965PQA> <@U04JQ3J6HJN> did you already find a solution for your cases?