<@U06TM7M40RK> I am using the datahub-kafka sink with BigQuery as the source. The events are generated by the source, but they are never sent to Kafka. The ingestion gets stuck and shows this output:
[2025-01-07 11:14:41,144] ERROR {datahub.ingestion.run.pipeline:78} - failed to write record with workunit urn:li:container:01af53505d15f99d1e0986cd301ebc1d-containerProperties with KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="[Errno 110] Connection timed out"} and info {'error': ValueSerializationError(KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="[Errno 110] Connection timed out"}), 'msg': 'Failed to write record: KafkaError{code=_VALUE_SERIALIZATION,val=-161,str="[Errno 110] Connection timed out"}'}
Pipeline running with at least 113 failures so far; produced 0 events in 12 hours, 32 minutes and 12.83 seconds.
The error you are encountering is a ValueSerializationError (Kafka error code _VALUE_SERIALIZATION) wrapping a connection timeout ([Errno 110]). Because the Kafka sink serializes records with Avro and must fetch or register schemas against the schema registry during serialization, a connection timeout raised at this stage usually means the producer cannot reach the schema registry, rather than the Kafka brokers themselves.
Here are some steps to troubleshoot and resolve this issue:
Check Kafka Configuration: Ensure that your Kafka configuration is correct, including the bootstrap servers and schema_registry_url. Verify that the Kafka brokers are reachable from the machine where the ingestion is running.
Schema Registry: If you are using a schema registry, ensure that it is correctly configured and accessible. The schema registry stores the schemas used to serialize and deserialize messages, and the producer must reach it in order to serialize each record, so an unreachable registry surfaces exactly as a serialization error.
Network Connectivity: Verify that there are no network issues preventing the ingestion job from reaching the Kafka brokers. This includes checking firewall rules, network policies, and ensuring that the correct ports are open.
Serialization Format: Ensure that the serialization format used in your Kafka configuration matches the format expected by the Kafka topic. If you are using Avro, make sure the Avro schema is correctly defined and registered in the schema registry.
Logs and Debugging: Check the logs of the Kafka brokers and the schema registry for any errors or warnings that might provide more context on what might be going wrong.
Increase Timeout: If the connection timeout is due to network latency, consider increasing the timeout settings in your Kafka producer configuration.
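The configuration steps above come together in the sink section of the ingestion recipe. Below is a minimal sketch expressed as the Python dict the YAML recipe deserializes into; the broker and registry hosts are placeholders, and the timeout values are arbitrary illustrations (`request.timeout.ms` and `message.timeout.ms` are standard librdkafka producer settings passed through `producer_config`):

```python
# Sketch of a datahub-kafka sink configuration (equivalent to the YAML recipe).
# Hostnames are placeholders; verify them against your environment.
kafka_sink = {
    "type": "datahub-kafka",
    "config": {
        "connection": {
            # Must be reachable from the host running ingestion
            "bootstrap": "broker-1:9092,broker-2:9092",
            # Serialization depends on this endpoint being reachable
            "schema_registry_url": "http://schema-registry:8081",
            # librdkafka producer overrides: raise these if the
            # timeouts stem from network latency rather than outages
            "producer_config": {
                "request.timeout.ms": 60000,
                "message.timeout.ms": 120000,
            },
        }
    },
}

# Sanity-check the overrides before writing them into the recipe
producer = kafka_sink["config"]["connection"]["producer_config"]
print(producer["request.timeout.ms"], producer["message.timeout.ms"])
```

If the registry host resolves but times out, test it directly from the ingestion host (e.g. with curl against the `/subjects` endpoint) before touching the producer timeouts.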
If you need more detailed steps on any of these points or further assistance on a specific part of the process, please let me know.
Sources:
<Slack Thread on Kafka Issues>
Hi <@U086EAJ6RR8>, how did my answer work out for you? Need any more assistance? You can also tag <@U07FV3UHE4D> to let your fellow community members know you’re still looking for guidance.
<@U06TM7M40RK>
2025-01-07 13:48:25,096 [Thread-5543] ERROR c.l.m.s.e.query.ESSearchDAO:150 - Search query failed
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-11 [ACTIVE]
    at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:937)
    at org.opensearch.client.RestClient.performRequest(RestClient.java:333)
    at org.opensearch.client.RestClient.performRequest(RestClient.java:321)
    at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1918)
    at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1884)
    at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1852)
    at org.opensearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1095)
    at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:138)
    at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.search(ESSearchDAO.java:308)
    at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:163)
    at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:145)
    at com.linkedin.metadata.client.JavaEntityClient.search(JavaEntityClient.java:458)
    at com.linkedin.datahub.graphql.resolvers.ingest.source.ListIngestionSourcesResolver.lambda$get$1(ListIngestionSourcesResolver.java:83)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-11 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    ... 1 common frames omitted
2025-01-07 13:48:25,096 [Thread-5543] ERROR c.l.d.g.e.DataHubDataFetcherExceptionHandler:45 - Failed to execute
java.util.concurrent.CompletionException: java.lang.RuntimeException: Failed to list ingestion sources
    at java.base/java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:315)
    at java.base/java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:320)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1770)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: java.lang.RuntimeException: Failed to list ingestion sources
    at com.linkedin.datahub.graphql.resolvers.ingest.source.ListIngestionSourcesResolver.lambda$get$1(ListIngestionSourcesResolver.java:119)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1768)
    ... 1 common frames omitted
Caused by: com.datahub.util.exception.ESQueryException: Search query failed:
    at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:151)
    at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.search(ESSearchDAO.java:308)
    at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:163)
    at com.linkedin.metadata.search.elasticsearch.ElasticSearchService.search(ElasticSearchService.java:145)
    at com.linkedin.metadata.client.JavaEntityClient.search(JavaEntityClient.java:458)
    at com.linkedin.datahub.graphql.resolvers.ingest.source.ListIngestionSourcesResolver.lambda$get$1(ListIngestionSourcesResolver.java:83)
    ... 2 common frames omitted
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-11 [ACTIVE]
    at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:937)
    at org.opensearch.client.RestClient.performRequest(RestClient.java:333)
    at org.opensearch.client.RestClient.performRequest(RestClient.java:321)
    at org.opensearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1918)
    at org.opensearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1884)
    at org.opensearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1852)
    at org.opensearch.client.RestHighLevelClient.search(RestHighLevelClient.java:1095)
    at com.linkedin.metadata.search.elasticsearch.query.ESSearchDAO.executeAndExtract(ESSearchDAO.java:138)
    ... 7 common frames omitted
Caused by: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-11 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    ... 1 common frames omitted
The error you are encountering is a SocketTimeoutException: a search query issued by DataHub's GMS against OpenSearch timed out after 30 seconds, which in turn caused the "Failed to list ingestion sources" GraphQL error. This suggests the query is taking too long to execute, possibly due to network issues, high load on the OpenSearch cluster, or inefficient query execution.
Here are some steps to troubleshoot and resolve this issue:
Check OpenSearch Load: Ensure that your OpenSearch cluster is not under heavy load. High CPU or memory usage can cause queries to take longer to execute. You can monitor the cluster’s health and performance metrics to identify any bottlenecks.
Optimize Queries: Review the queries being executed to ensure they are optimized. Complex queries or those that scan large datasets can take longer to execute. Consider adding indices or optimizing existing ones to improve query performance.
Network Connectivity: Verify that there are no network issues between the DataHub application and the OpenSearch cluster. Ensure that the network latency is low and there are no intermittent connectivity issues.
Increase Timeout: If the query execution time is expected to be long, consider increasing the timeout setting in your OpenSearch client configuration. This can be done by adjusting the timeout parameter in the client settings.
Review OpenSearch Logs: Check the OpenSearch logs for any errors or warnings that might provide more context on what might be causing the query to time out.
Load Balancer: If there is a load balancer in front of OpenSearch, ensure that it is configured correctly and not introducing additional latency.
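As a starting point for the first and fifth steps above, OpenSearch's `_cluster/health` endpoint reports the cluster status and pending-task backlog. A minimal sketch that classifies such a response (the classification thresholds are illustrative assumptions, and the sample JSON at the bottom is a fabricated example payload, not output from a real cluster):

```python
import json

def summarize_cluster_health(payload: str) -> str:
    """Classify an OpenSearch GET _cluster/health JSON response.

    'status' and 'number_of_pending_tasks' are standard fields of
    that endpoint; the thresholds here are illustrative.
    """
    health = json.loads(payload)
    status = health.get("status", "unknown")
    pending = health.get("number_of_pending_tasks", 0)
    if status == "red" or pending > 100:
        return "unhealthy"   # likely the cause of query timeouts
    if status == "yellow":
        return "degraded"    # worth investigating replicas/shards
    return "healthy"         # look at the query or the network instead

# Fabricated example response for illustration only
sample = '{"status": "yellow", "number_of_pending_tasks": 3}'
print(summarize_cluster_health(sample))
```

If the cluster reports healthy, that shifts suspicion toward the network path or the query itself rather than cluster load.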
If you need more detailed steps on any of these points or further assistance on a specific part of the process, please let me know.
Sources:
<Slack Thread on OpenSearch Timeout Issues>
<@U06TM7M40RK>
2025-01-07 14:09:53,969 [ThreadPoolTaskExecutor-1] INFO c.l.m.s.e.update.ESBulkProcessor urn=none aspect=none entityType=none changeType=none - Added request id: PageViewEvent_urn%3Ali%3Acorpuser%3Aflavio.leal_1736258993864_05259, operation type: CREATE, index: datahub_usage_event
2025-01-07 14:10:24,510 [I/O dispatcher 2] ERROR c.l.m.s.e.update.BulkListener urn=none aspect=none entityType=none changeType=none - Error feeding bulk request 140. No retries left. Request: Failed to perform bulk request: index [datahub_usage_event], optype: [CREATE], type [CREATE], id [PageViewEvent_urn%3Ali%3Acorpuser%3Aflavio.leal_1736258992571_05258];Failed to perform bulk request: index [datahub_usage_event], optype: [CREATE], type [CREATE], id [PageViewEvent_urn%3Ali%3Acorpuser%3Aflavio.leal_1736258993864_05259]
java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-35 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    at java.base/java.lang.Thread.run(Thread.java:840)
The error you are encountering is again a SocketTimeoutException, this time from DataHub's ESBulkProcessor: a bulk indexing request against the datahub_usage_event index timed out after 30 seconds and exhausted its retries. This suggests the bulk operation is taking too long to execute, possibly due to network issues, high load on the OpenSearch cluster, or oversized bulk requests.
Here are some steps to troubleshoot and resolve this issue:
Check OpenSearch Load: Ensure that your OpenSearch cluster is not under heavy load. High CPU or memory usage can cause bulk operations to take longer to execute. You can monitor the cluster’s health and performance metrics to identify any bottlenecks.
Optimize Bulk Requests: Review the size and complexity of the bulk requests being sent. Large bulk requests can take longer to process. Consider breaking down large bulk requests into smaller batches to improve performance.
Network Connectivity: Verify that there are no network issues between the DataHub application and the OpenSearch cluster. Ensure that the network latency is low and there are no intermittent connectivity issues.
Increase Timeout: If the bulk operation time is expected to be long, consider increasing the timeout setting in your OpenSearch client configuration. This can be done by adjusting the timeout parameter in the client settings.
Review OpenSearch Logs: Check the OpenSearch logs for any errors or warnings that might provide more context on what might be causing the bulk request to time out.
Load Balancer: If there is a load balancer in front of OpenSearch, ensure that it is configured correctly and not introducing additional latency.
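The second step above (splitting oversized bulk requests into smaller batches) can be sketched as a generic batching helper; the batch size of 500 is an arbitrary illustration, and the right value depends on document size and cluster capacity:

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive batches of at most batch_size items, so that
    each bulk request sent to the cluster stays small enough to
    complete within the client timeout."""
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch

# 1,050 documents split into batches of at most 500
sizes = [len(b) for b in batched(range(1050), 500)]
print(sizes)  # [500, 500, 50]
```

Smaller batches trade a little throughput for predictable per-request latency, which is usually the right trade when the cluster is already timing out.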
If you need more detailed steps on any of these points or further assistance on a specific part of the process, please let me know.
Sources:
<Slack Thread on OpenSearch Timeout Issues>