java.util.ConcurrentModificationException in DataHub v0.10.4 - Possible solution with GMS Authentication

Original Slack Thread

Hello Community,
We are using DataHub v0.10.4 and observed this exception:

2023-08-18 00:38:32,971 [Thread-25027] ERROR c.d.authorization.AuthorizerChain:74 - Caught exception while attempting to authorize request using Authorizer com.datahub.authorization.DataHubAuthorizer. Skipping authorizer.
java.util.ConcurrentModificationException: null
at java.base/java.util.ArrayList$Itr.checkForComodification(ArrayList.java:1043)
at java.base/java.util.ArrayList$Itr.next(ArrayList.java:997)
at com.datahub.authorization.DataHubAuthorizer.authorize(DataHubAuthorizer.java:95)
at com.datahub.authorization.AuthorizerChain.authorize(AuthorizerChain.java:60)
at com.linkedin.datahub.graphql.resolvers.AuthUtils.isAuthorized(AuthUtils.java:24)
at com.linkedin.datahub.graphql.resolvers.ingest.IngestionAuthUtils.canManageIngestion(IngestionAuthUtils.java:15)
at com.linkedin.datahub.graphql.resolvers.MeResolver.lambda$get$0(MeResolver.java:66)
at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
at java.base/java.lang.Thread.run(Thread.java:829)

Our guess (based on “at com.datahub.authorization.DataHubAuthorizer.authorize(DataHubAuthorizer.java:95)”)
is that the periodic refresh of the _policyCache (DataHubAuthorizer.java:53) is conflicting with GMS Authentication.

One possible fix: set a ‘flag = <in-progress>’ while _policyCache is being refreshed, so that GMS Authentication waits until ‘flag = <done>’.

Let me know if I should open an incident, or if you have other ideas about the cause of the problem.
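The in-progress/done flag proposed above is roughly what a read-write lock provides: readers wait while a refresh holds the write lock, and proceed concurrently otherwise. Below is a minimal sketch of that idea. The class GuardedPolicyCache and its methods are hypothetical and are not taken from DataHub's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a policy cache guarded by a read-write lock.
class GuardedPolicyCache {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Map<String, List<String>> cache = new HashMap<>();

    // Refresh: the write lock plays the role of "flag = in-progress".
    void refresh(Map<String, List<String>> fresh) {
        lock.writeLock().lock();
        try {
            cache.clear();
            for (Map.Entry<String, List<String>> e : fresh.entrySet()) {
                cache.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
        } finally {
            lock.writeLock().unlock(); // "flag = done"
        }
    }

    // Read: blocks only while a refresh is running. Returns a copy so the
    // caller can iterate after the lock has been released.
    List<String> policiesFor(String privilege) {
        lock.readLock().lock();
        try {
            List<String> found = cache.getOrDefault(privilege, Collections.<String>emptyList());
            return new ArrayList<>(found);
        } finally {
            lock.readLock().unlock();
        }
    }

    // Small usage example: refresh once, then read back the policy count.
    static int demo() {
        Map<String, List<String>> fresh = new HashMap<>();
        fresh.put("MANAGE_INGESTION", Arrays.asList("policy-1", "policy-2"));
        GuardedPolicyCache guarded = new GuardedPolicyCache();
        guarded.refresh(fresh);
        return guarded.policiesFor("MANAGE_INGESTION").size();
    }

    public static void main(String[] args) {
        System.out.println("policies found: " + demo());
    }
}
```

Returning a copy from the read path matters here: the lock protects the lookup, but the caller iterates after the lock is released.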

cc: <@U040BNMTGSF>, <@U03FR4Q3M1P>, <@U04096SS05D>

<@UV5UEC3LN> might be able to speak to this!

Hey! Thanks for pointing out this bug. It’s not caused by the _policyCache directly, but by adds to the ArrayList value stored within the policy cache. addPolicyToCache, which runs as part of the async runner, does a getOrDefault + add on a list that another thread may be iterating over in a loop at the same time. Will get started working on a fix, but this should be an extremely infrequent issue.
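To illustrate the getOrDefault + add pattern described above, here is a minimal sketch that reproduces the exception deterministically and shows one possible way to avoid it. The class PolicyCacheSketch, its methods, and the use of CopyOnWriteArrayList are assumptions for illustration only. They are not DataHub's actual code or its eventual fix. The interleaving is simulated on one thread, since ArrayList's iterator throws whenever the list is structurally modified during iteration, whichever thread does it.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for the policy cache: privilege -> list of policy names.
class PolicyCacheSketch {

    // Unsafe pattern: getOrDefault + add mutates a plain ArrayList in place,
    // even if a reader is iterating over that same list.
    static void addUnsafe(Map<String, List<String>> cache, String privilege, String policy) {
        List<String> policies = cache.getOrDefault(privilege, new ArrayList<>());
        policies.add(policy);
        cache.put(privilege, policies);
    }

    // One possible alternative: store CopyOnWriteArrayList values. Their iterators
    // work on a snapshot and never throw ConcurrentModificationException.
    static void addSafe(Map<String, List<String>> cache, String privilege, String policy) {
        cache.computeIfAbsent(privilege, k -> new CopyOnWriteArrayList<String>()).add(policy);
    }

    // Simulates a reader loop that is interrupted by one cache add.
    // Returns true if the iteration failed with ConcurrentModificationException.
    static boolean iterationFails(boolean safe) {
        Map<String, List<String>> cache = new ConcurrentHashMap<>();
        if (safe) {
            addSafe(cache, "MANAGE_INGESTION", "policy-1");
        } else {
            addUnsafe(cache, "MANAGE_INGESTION", "policy-1");
        }
        boolean added = false;
        try {
            for (String policy : cache.get("MANAGE_INGESTION")) {
                if (!added) {
                    // Stands in for the async refresh adding a policy mid-iteration.
                    if (safe) {
                        addSafe(cache, "MANAGE_INGESTION", "policy-2");
                    } else {
                        addUnsafe(cache, "MANAGE_INGESTION", "policy-2");
                    }
                    added = true;
                }
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("ArrayList iteration fails: " + iterationFails(false));
        System.out.println("CopyOnWriteArrayList iteration fails: " + iterationFails(true));
    }
}
```

CopyOnWriteArrayList copies the backing array on every add, which is usually acceptable when reads far outnumber writes. Building a fresh cache and swapping it in atomically would be another option.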