Discussing Connection Pool Exhaustion Issue in DataHub Project

Original Slack Thread

Hi,
I’d like to push this issue a bit as this could lead to critical situations. Or does anyone has a quick fix in mind?
https://github.com/datahub-project/datahub/issues/10354

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Which DataHub version are you using? (e.g. 0.12.0)
  2. Please post any relevant error logs on the thread!

After doing upgrade yesterday I am seeing this issue too. In our case SendMAEStep fails too.


[main] ERROR c.l.d.u.restoreindices.SendMAEStep:78 - Error iterating futures```
<@U05HUB7E31U> <@U06M21PS798>

Do you see any EbeanAspectDao related messages such as:

WARN c.l.m.entity.ebean.EbeanAspectDao:747 - Retryable PersistenceException: java.sql.SQLException: Unsuccessfully waited [1000] millis for a connection to be returned. No connections are free. You need to Increase the max connections of [50] or look for a connection pool leak using datasource.xxx.capturestacktrace=true

<@U05SKM6KGGK> yes we do - not entirely same but similar
2024-04-29 04:09:57,453 [qtp133250414-78321] WARN c.l.m.entity.ebean.EbeanAspectDao:720 - Retryable PersistenceException: Error when batch flush on sql: update metadata_aspect_v2 set metadata=?, createdOn=?, createdBy=?, createdFor=?, systemmetadata=? where urn=? and aspect=? and version=?

org.postgresql.util.PSQLException: This connection has been closed. at org.postgresql.jdbc.PgConnection.checkClosed(PgConnection.java:883) at org.postgresql.jdbc.PgConnection.clearWarnings(PgConnection.java:776) at io.ebean.datasource.pool.PooledConnection.close(PooledConnection.java:436) at io.ebean.util.JdbcClose.close(JdbcClose.java:50) at io.ebeaninternal.server.transaction.TransactionFactoryBasic.createReadOnlyTransaction(TransactionFactoryBasic.java:31) at io.ebeaninternal.server.transaction.TransactionManager.createReadOnlyTransaction(TransactionManager.java:337) at io.ebeaninternal.server.core.DefaultServer.createReadOnlyTransaction(DefaultServer.java:2177) at io.ebeaninternal.server.core.OrmQueryRequest.initTransIfRequired(OrmQueryRequest.java:221) at io.ebeaninternal.server.core.DefaultServer.findList(DefaultServer.java:1462) at io.ebeaninternal.server.core.DefaultServer.findList(DefaultServer.java:1442) at io.ebeaninternal.server.querydefn.DefaultOrmQuery.findList(DefaultOrmQuery.java:1483) at com.linkedin.metadata.entity.ebean.EbeanAspectDao.batchGetUnion(EbeanAspectDao.java:443) at com.linkedin.metadata.entity.ebean.EbeanAspectDao.batchGet(EbeanAspectDao.java:358) at com.linkedin.metadata.entity.ebean.EbeanAspectDao.batchGet(EbeanAspectDao.java:334) at com.linkedin.metadata.entity.EntityServiceImpl.exists(EntityServiceImpl.java:1812) at com.linkedin.metadata.entity.EntityService.exists(EntityService.java:68) at com.linkedin.metadata.entity.EntityService.exists(EntityService.java:78) at com.linkedin.metadata.aspect.utils.DefaultAspectsUtil.getAdditionalChanges(DefaultAspectsUtil.java:60) at com.linkedin.metadata.entity.EntityServiceImpl.ingestAspects(EntityServiceImpl.java:631) at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposalSync(EntityServiceImpl.java:1103) at com.linkedin.metadata.entity.EntityServiceImpl.ingestProposal(EntityServiceImpl.java:947) at com.linkedin.metadata.resources.entity.AspectResource.lambda$ingestProposal$2(AspectResource.java:250) at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:30) at com.linkedin.metadata.restli.RestliUtil.toTask(RestliUtil.java:51) at com.linkedin.metadata.resources.entity.AspectResource.ingestProposal(AspectResource.java:242) at jdk.internal.reflect.GeneratedMethodAccessor174.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:568) at com.linkedin.restli.internal.server.RestLiMethodInvoker.doInvoke(RestLiMethodInvoker.java:177) at com.linkedin.restli.internal.server.RestLiMethodInvoker.invoke(RestLiMethodInvoker.java:333) at com.linkedin.restli.internal.server.filter.FilterChainDispatcherImpl.onRequestSuccess(FilterChainDispatcherImpl.java:47) at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:86) at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.lambda$onRequest$0(RestLiFilterChainIterator.java:73) at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757) at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735) at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182) at com.linkedin.restli.internal.server.filter.RestLiFilterChainIterator.onRequest(RestLiFilterChainIterator.java:72) at com.linkedin.restli.internal.server.filter.RestLiFilterChain.onRequest(RestLiFilterChain.java:55) at com.linkedin.restli.server.BaseRestLiServer.handleResourceRequest(BaseRestLiServer.java:270) at com.linkedin.restli.server.RestRestLiServer.handleResourceRequestWithRestLiResponse(RestRestLiServer.java:294) at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:262) at com.linkedin.restli.server.RestRestLiServer.handleResourceRequest(RestRestLiServer.java:232) at com.linkedin.restli.server.RestRestLiServer.doHandleRequest(RestRestLiServer.java:215) at com.linkedin.restli.server.RestRestLiServer.handleRequest(RestRestLiServer.java:171) at com.linkedin.restli.server.RestLiServer.handleRequest(RestLiServer.java:132) at com.linkedin.restli.server.DelegatingTransportDispatcher.handleRestRequest(DelegatingTransportDispatcher.java:70) at com.linkedin.r2.filter.transport.DispatcherRequestFilter.onRestRequest(DispatcherRequestFilter.java:70) at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76) at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146) at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132) at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62) at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55) at com.linkedin.r2.filter.transport.ServerQueryTunnelFilter.onRestRequest(ServerQueryTunnelFilter.java:58) at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76) at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146) at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132) at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62) at com.linkedin.r2.filter.TimedNextFilter.onRequest(TimedNextFilter.java:55) at com.linkedin.r2.filter.message.rest.RestFilter.onRestRequest(RestFilter.java:50) at com.linkedin.r2.filter.TimedRestFilter.onRestRequest(TimedRestFilter.java:76) at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:146) at com.linkedin.r2.filter.FilterChainIterator$FilterChainRestIterator.doOnRequest(FilterChainIterator.java:132) at com.linkedin.r2.filter.FilterChainIterator.onRequest(FilterChainIterator.java:62) at com.linkedin.r2.filter.FilterChainImpl.onRestRequest(FilterChainImpl.java:106) at com.linkedin.r2.filter.transport.FilterChainDispatcher.handleRestRequest(FilterChainDispatcher.java:75) at com.linkedin.r2.util.finalizer.RequestFinalizerDispatcher.handleRestRequest(RequestFinalizerDispatcher.java:61) at com.linkedin.r2.transport.http.server.HttpDispatcher.handleRequest(HttpDispatcher.java:101) at com.linkedin.r2.transport.http.server.AbstractJakartaR2Servlet.service(AbstractJakartaR2Servlet.java:96) at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587) at com.linkedin.restli.server.RestliHandlerServlet.service(RestliHandlerServlet.java:20) at com.linkedin.restli.server.RestliHandlerServlet.handleRequest(RestliHandlerServlet.java:26) at org.springframework.web.context.support.HttpRequestHandlerServlet.service(HttpRequestHandlerServlet.java:67) at jakarta.servlet.http.HttpServlet.service(HttpServlet.java:587) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:764) at org.eclipse.jetty.servlet.ServletHandler$ChainEnd.doFilter(ServletHandler.java:1665) at com.datahub.auth.authentication.filter.AuthenticationFilter.doFilter(AuthenticationFilter.java:106) at org.eclipse.jetty.servlet.FilterHolder.doFilter(FilterHolder.java:202) at org.eclipse.jetty.servlet.ServletHandler$Chain.doFilter(ServletHandler.java:1635) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:527) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:131) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:598) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:223) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1580) at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:221) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1381) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:176) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:484) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1553) at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:174) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1303) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:129) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:51) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:122) at org.eclipse.jetty.server.Server.handle(Server.java:563) at org.eclipse.jetty.server.HttpChannel$RequestDispatchable.dispatch(HttpChannel.java:1598) at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:753) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:501) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:287) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:314) at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:100) at org.eclipse.jetty.io.SelectableChannelEndPoint$1.run(SelectableChannelEndPoint.java:53) at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.runTask(AdaptiveExecutionStrategy.java:421) at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.consumeTask(AdaptiveExecutionStrategy.java:390) at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.tryProduce(AdaptiveExecutionStrategy.java:277) at org.eclipse.jetty.util.thread.strategy.AdaptiveExecutionStrategy.run(AdaptiveExecutionStrategy.java:199) at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:411) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:969) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.doRunJob(QueuedThreadPool.java:1194) at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:1149) at java.base/java.lang.Thread.run(Thread.java:840)

We have this

	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at com.linkedin.datahub.upgrade.restoreindices.SendMAEStep.iterateFutures(SendMAEStep.java:75)
	at com.linkedin.datahub.upgrade.restoreindices.SendMAEStep.lambda$executable$0(SendMAEStep.java:198)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeStepInternal(DefaultUpgradeManager.java:125)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:76)
Caused by: javax.persistence.PersistenceException: java.sql.SQLException: Unsuccessfully waited [1000] millis for a connection to be returned. No connections are free. You need to Increase the max connections of [50] or look for a connection pool leak using datasource.xxx.capturestacktrace=true
	at io.ebeaninternal.server.transaction.TransactionFactoryBasic.createTransaction(TransactionFactoryBasic.java:49)
	at io.ebeaninternal.server.transaction.TransactionManager.createTransaction(TransactionManager.java:330)
	at io.ebeaninternal.server.transaction.TransactionManager.createTransaction(TransactionManager.java:322)
	at io.ebeaninternal.server.transaction.TransactionManager.beginScopedTransaction(TransactionManager.java:623)
	at io.ebeaninternal.server.core.DefaultServer.beginTransaction(DefaultServer.java:791)
	at com.linkedin.metadata.entity.ebean.EbeanAspectDao.runInTransactionWithRetryUnlocked(EbeanAspectDao.java:724)
	... 21 common frames omitted
Caused by: java.sql.SQLException: Unsuccessfully waited [1000] millis for a connection to be returned. No connections are free. You need to Increase the max connections of [50] or look for a connection pool leak using datasource.xxx.capturestacktrace=true
	at io.ebean.datasource.pool.PooledConnectionQueue._getPooledConnectionWaitLoop(PooledConnectionQueue.java:272)
	at io.ebean.datasource.pool.PooledConnectionQueue._getPooledConnection(PooledConnectionQueue.java:249)
	at io.ebean.datasource.pool.PooledConnectionQueue.getPooledConnection(PooledConnectionQueue.java:196)
	at io.ebean.datasource.pool.ConnectionPool.getPooledConnection(ConnectionPool.java:715)
	at io.ebean.datasource.pool.ConnectionPool.getConnection(ConnectionPool.java:705)
	at io.ebeaninternal.server.transaction.TransactionFactoryBasic.createTransaction(TransactionFactoryBasic.java:42)
	... 26 common frames omitted```

Yeah so the NullPointerException is just a side-effect of the pool exhaustion as you can see from this ^ last exception

ok found connection exhaustion![attachment]({‘ID’: ‘F0719D9UKHT’, ‘EDITABLE’: False, ‘IS_EXTERNAL’: False, ‘USER_ID’: ‘U06FXLKSNHJ’, ‘CREATED’: ‘2024-04-30 15:42:44+00:00’, ‘PERMALINK’: ‘Slack’, ‘EXTERNAL_TYPE’: ‘’, ‘TIMESTAMPS’: ‘2024-04-30 15:42:44+00:00’, ‘MODE’: ‘hosted’, ‘DISPLAY_AS_BOT’: False, ‘PRETTY_TYPE’: ‘PNG’, ‘NAME’: ‘image.png’, ‘IS_PUBLIC’: True, ‘PREVIEW_HIGHLIGHT’: None, ‘MIMETYPE’: ‘image/png’, ‘PERMALINK_PUBLIC’: ‘https://slack-files.com/TUMKD5EGJ-F0719D9UKHT-e4e3b5c507’, ‘FILETYPE’: ‘png’, ‘EDIT_LINK’: None, ‘URL_PRIVATE’: ‘Slack’, ‘HAS_RICH_PREVIEW’: False, ‘TITLE’: ‘image.png’, ‘IS_STARRED’: False, ‘PREVIEW_IS_TRUNCATED’: None, ‘URL_PRIVATE_DOWNLOAD’: ‘Slack’, ‘PREVIEW’: None, ‘PUBLIC_URL_SHARED’: False, ‘MESSAGE_TS’: ‘1714491769.494729’, ‘PARENT_MESSAGE_TS’: ‘1714461624.761899’, ‘MESSAGE_CHANNEL_ID’: ‘C029A3M079U’, ‘_FIVETRAN_DELETED’: False, ‘LINES_MORE’: None, ‘LINES’: None, ‘SIZE’: 40012, ‘_FIVETRAN_SYNCED’: ‘2024-05-05 12:59:46.009000+00:00’})

Suggestion here is to increase batchDelayMs to 1 second or higher. Ref: https://github.com/acryldata/datahub-helm/blob/89c92c8ac73b4dc371d647216d60dff28cc7c9ae/charts/datahub/values.yaml#L280

ok great. let me try that.

(Increasing the size of the connection pool could help too, but not sure if we have helm chart value for this)

ok. I can try with delay first. but good to know this connection issue. In future I can watch out too.

<@U05SKM6KGGK> adjusted batchDelayMs to 1 second and then 2.5 seconds too but still same issue both times. Looks like after 49000 records this issue happens continuously.![attachment]({‘ID’: ‘F0716816NAJ’, ‘EDITABLE’: False, ‘IS_EXTERNAL’: False, ‘USER_ID’: ‘U06FXLKSNHJ’, ‘CREATED’: ‘2024-04-30 16:51:21+00:00’, ‘PERMALINK’: ‘Slack’, ‘EXTERNAL_TYPE’: ‘’, ‘TIMESTAMPS’: ‘2024-04-30 16:51:21+00:00’, ‘MODE’: ‘hosted’, ‘DISPLAY_AS_BOT’: False, ‘PRETTY_TYPE’: ‘PNG’, ‘NAME’: ‘image.png’, ‘IS_PUBLIC’: True, ‘PREVIEW_HIGHLIGHT’: None, ‘MIMETYPE’: ‘image/png’, ‘PERMALINK_PUBLIC’: ‘https://slack-files.com/TUMKD5EGJ-F0716816NAJ-712eee6f8c’, ‘FILETYPE’: ‘png’, ‘EDIT_LINK’: None, ‘URL_PRIVATE’: ‘Slack’, ‘HAS_RICH_PREVIEW’: False, ‘TITLE’: ‘image.png’, ‘IS_STARRED’: False, ‘PREVIEW_IS_TRUNCATED’: None, ‘URL_PRIVATE_DOWNLOAD’: ‘Slack’, ‘PREVIEW’: None, ‘PUBLIC_URL_SHARED’: False, ‘MESSAGE_TS’: ‘1714495898.647019’, ‘PARENT_MESSAGE_TS’: ‘1714461624.761899’, ‘MESSAGE_CHANNEL_ID’: ‘C029A3M079U’, ‘_FIVETRAN_DELETED’: False, ‘LINES_MORE’: None, ‘LINES’: None, ‘SIZE’: 226300, ‘_FIVETRAN_SYNCED’: ‘2024-05-05 12:59:46.167000+00:00’})

try going to 10 seconds

<@U05SKM6KGGK> Still same. Is there a way to change connection pool manually anywhere ?

I manually increase this in restore indices job. EBEAN_MAX_CONNECTIONS to 250 but it reaches way beyond it.![attachment]({‘ID’: ‘F071GGP2G5A’, ‘EDITABLE’: False, ‘IS_EXTERNAL’: False, ‘USER_ID’: ‘U06FXLKSNHJ’, ‘CREATED’: ‘2024-04-30 19:53:48+00:00’, ‘PERMALINK’: ‘Slack’, ‘EXTERNAL_TYPE’: ‘’, ‘TIMESTAMPS’: ‘2024-04-30 19:53:48+00:00’, ‘MODE’: ‘hosted’, ‘DISPLAY_AS_BOT’: False, ‘PRETTY_TYPE’: ‘PNG’, ‘NAME’: ‘image.png’, ‘IS_PUBLIC’: True, ‘PREVIEW_HIGHLIGHT’: None, ‘MIMETYPE’: ‘image/png’, ‘PERMALINK_PUBLIC’: ‘https://slack-files.com/TUMKD5EGJ-F071GGP2G5A-4b2be372b3’, ‘FILETYPE’: ‘png’, ‘EDIT_LINK’: None, ‘URL_PRIVATE’: ‘Slack’, ‘HAS_RICH_PREVIEW’: False, ‘TITLE’: ‘image.png’, ‘IS_STARRED’: False, ‘PREVIEW_IS_TRUNCATED’: None, ‘URL_PRIVATE_DOWNLOAD’: ‘Slack’, ‘PREVIEW’: None, ‘PUBLIC_URL_SHARED’: False, ‘MESSAGE_TS’: ‘1714506896.937689’, ‘PARENT_MESSAGE_TS’: ‘1714461624.761899’, ‘MESSAGE_CHANNEL_ID’: ‘C029A3M079U’, ‘_FIVETRAN_DELETED’: False, ‘LINES_MORE’: None, ‘LINES’: None, ‘SIZE’: 38481, ‘_FIVETRAN_SYNCED’: ‘2024-05-05 12:59:46.262000+00:00’})

idle connections never gets closed.![attachment]({‘ID’: ‘F071JGL2XEF’, ‘EDITABLE’: False, ‘IS_EXTERNAL’: False, ‘USER_ID’: ‘U06FXLKSNHJ’, ‘CREATED’: ‘2024-05-01 16:46:29+00:00’, ‘PERMALINK’: ‘Slack’, ‘EXTERNAL_TYPE’: ‘’, ‘TIMESTAMPS’: ‘2024-05-01 16:46:29+00:00’, ‘MODE’: ‘hosted’, ‘DISPLAY_AS_BOT’: False, ‘PRETTY_TYPE’: ‘PNG’, ‘NAME’: ‘image.png’, ‘IS_PUBLIC’: True, ‘PREVIEW_HIGHLIGHT’: None, ‘MIMETYPE’: ‘image/png’, ‘PERMALINK_PUBLIC’: ‘https://slack-files.com/TUMKD5EGJ-F071JGL2XEF-b06a959258’, ‘FILETYPE’: ‘png’, ‘EDIT_LINK’: None, ‘URL_PRIVATE’: ‘Slack’, ‘HAS_RICH_PREVIEW’: False, ‘TITLE’: ‘image.png’, ‘IS_STARRED’: False, ‘PREVIEW_IS_TRUNCATED’: None, ‘URL_PRIVATE_DOWNLOAD’: ‘Slack’, ‘PREVIEW’: None, ‘PUBLIC_URL_SHARED’: False, ‘MESSAGE_TS’: ‘1714581996.092929’, ‘PARENT_MESSAGE_TS’: ‘1714461624.761899’, ‘MESSAGE_CHANNEL_ID’: ‘C029A3M079U’, ‘_FIVETRAN_DELETED’: False, ‘LINES_MORE’: None, ‘LINES’: None, ‘SIZE’: 99478, ‘_FIVETRAN_SYNCED’: ‘2024-05-05 12:59:46.288000+00:00’})

we are running with

batchDelayMs: 250```
and can also confirm the connection pool issue

increasing batchDelayMs does not really help. Now we have a lot more active DB connections but if fails the same way like before :confused: