Troubleshooting GMS Backend Connectivity Issue After DataHub Upgrade

Original Slack Thread

After upgrading to the latest DataHub version, we cannot get the GMS backend up and running. It fails with this error:

```
2023-08-17 13:18:57,324 [pool-15-thread-1] ERROR c.l.m.boot.OnBootApplicationListener:76 - Failed to bootstrap DataHub, OpenAPI servlet was not ready after 30 seconds
```
I did find this bug/thread (https://datahubspace.slack.com/archives/CV2UVAPPG/p1690545008697309) where AWS Glue as schema registry was accidentally commented out. We solved this by setting some extra_envs ourselves:
```
KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME
KAFKA_SCHEMAREGISTRY_AWSGLUE_REGION
```
Despite setting those environment variables ourselves and using the Glue schema registry, it looks like the GMS component is trying to connect to port 8081, which is the port of the Confluent schema registry. Any idea how to solve this?

<@UV5UEC3LN> could you look into this?

What version did you upgrade from?

Did the SystemUpdate job run successfully?

Are there any other errors indicating why the servlet did not start up successfully? What else are the logs saying?

We upgraded from version 0.10.1 to 0.10.5. The system update job ran successfully.

To me the logs don’t say much, here is an extract:

```
2023-08-21 15:25:39.321:INFO:oejs.Server:main: Started @33218ms
2023-08-21 15:25:39,806 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:39,806 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:40,807 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:40,807 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:41,808 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:41,808 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:42,809 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:42,809 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:43,810 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:43,810 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:44,811 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:44,811 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:45,812 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:45,812 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:46,812 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:46,813 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:47,813 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:47,813 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:48,814 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:48,814 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:49,815 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:49,815 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:50,815 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:50,816 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:51,816 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:51,816 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:52,817 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:52,817 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:53,818 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:53,818 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:54,819 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:54,819 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:55,820 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:55,820 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:56,820 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:56,821 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:57,821 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:57,822 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:58,822 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:58,822 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:25:59,823 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:25:59,823 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:26:00,824 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:26:00,824 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:26:01,825 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:26:01,825 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:26:02,826 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:26:02,826 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:63 - Sleeping for 1 second
2023-08-21 15:26:03,826 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: Connect to localhost:8081 [localhost/127.0.0.1] failed: Connection refused (Connection refused)
2023-08-21 15:26:03,827 [pool-15-thread-1] ERROR c.l.m.boot.OnBootApplicationListener:76 - Failed to bootstrap DataHub, OpenAPI servlet was not ready after 30 seconds
2023-08-21 15:26:03.836:INFO:oejs.AbstractConnector:JettyShutdownThread: Stopped ServerConnector@7fc229ab{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
2023-08-21 15:26:03.836:INFO:oejs.session:JettyShutdownThread: node0 Stopped scavenging
2023-08-21 15:26:03.837:INFO:oejshC.ROOT:JettyShutdownThread: Destroying Spring FrameworkServlet 'schemaRegistryServlet'
2023-08-21 15:26:03.839:INFO:oejshC.ROOT:JettyShutdownThread: Destroying Spring FrameworkServlet 'openapiServlet'
2023-08-21 15:26:03.840:INFO:oejshC.ROOT:JettyShutdownThread: Destroying Spring FrameworkServlet 'healthServlet'
2023-08-21 15:26:03.842:INFO:oejshC.ROOT:JettyShutdownThread: Destroying Spring FrameworkServlet 'authServlet'
2023-08-21 15:26:03.844:INFO:oejshC.ROOT:JettyShutdownThread: Destroying Spring FrameworkServlet 'apiServlet'
2023-08-21 15:26:03.851:INFO:oejshC.ROOT:JettyShutdownThread: Closing Spring root WebApplicationContext
2023-08-21 15:26:04.157:INFO:oejsh.ContextHandler:JettyShutdownThread: Stopped o.e.j.w.WebAppContext@5f058f00{Open source GMS,/,null,STOPPED}{file:///datahub/datahub-gms/bin/war.war}
2023/08/21 15:26:04 Command exited with error: exit status 1
```
[attachment: image.png]

Are you deploying with helm? Looks like you haven’t configured the schema registry url and it is defaulting to localhost:8081 which is probably invalid
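
The fallback suggested in this reply can be pictured with a small Python sketch. It is not DataHub's code: the function name is hypothetical, and the assumption (taken from the reply and from the localhost:8081 in the logs) is that the polled URL comes from KAFKA_SCHEMAREGISTRY_URL with a localhost default, independent of any Glue variables.

```python
from typing import Mapping

# Assumed default, inferred from the "localhost:8081" seen in the logs.
DEFAULT_SCHEMA_REGISTRY_URL = "http://localhost:8081"


def resolve_schema_registry_url(env: Mapping[str, str]) -> str:
    """Return the URL the readiness check would poll (hypothetical helper)."""
    # Only KAFKA_SCHEMAREGISTRY_URL is consulted in this sketch; the
    # Glue-specific variables carry no host, so they cannot change the result.
    return env.get("KAFKA_SCHEMAREGISTRY_URL", DEFAULT_SCHEMA_REGISTRY_URL)


# Placeholder values, for illustration only.
glue_only = {
    "KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME": "my-registry",
    "KAFKA_SCHEMAREGISTRY_AWSGLUE_REGION": "eu-west-1",
}
print(resolve_schema_registry_url(glue_only))  # prints http://localhost:8081
```

If this guess is right, setting only the two Glue variables would still leave the check pointed at localhost:8081, which matches the symptom reported above.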

We are deploying with helm, and we are using the AWS Glue schema registry. Because of the bug (thread here: https://datahubspace.slack.com/archives/CV2UVAPPG/p1690545008697309), it fails to look at the AWS Glue configuration and tries to default to localhost:8081 (Confluent schema registry). We were able to solve that in all the other pods by providing some extra_envs, but this isn’t working for GMS.

Example:

```
  image:
    repository: ${local.ecr_image_prefix}/acryldata/datahub-upgrade
  extraEnvs: # Workaround for bug: https://datahubspace.slack.com/archives/CV2UVAPPG/p1690545008697309
    - name: KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME
      value: ${aws_glue_registry.kafka.registry_name}
    - name: KAFKA_SCHEMAREGISTRY_AWSGLUE_REGION
      value: ${var.aws_region}
```

Our datahub-gms value in the values.yml:

```
  enabled: true
  image:
    repository: ${local.ecr_image_prefix}/linkedin/datahub-gms
  service:
    type: ClusterIP
  extraEnvs: # Workaround for bug: https://datahubspace.slack.com/archives/CV2UVAPPG/p1690545008697309
    - name: KAFKA_SCHEMAREGISTRY_AWSGLUE_REGISTRYNAME
      value: ${aws_glue_registry.kafka.registry_name}
    - name: KAFKA_SCHEMAREGISTRY_AWSGLUE_REGION
      value: ${var.aws_region}
    - name: KAFKA_SCHEMAREGISTRY_TYPE
      value: AWS_GLUE
```
global.kafka settings:
```
global:
  kafka:
    bootstrap:
      server: ${aws_msk_cluster.datahub.bootstrap_brokers}
    zookeeper:
      server: ${aws_msk_cluster.datahub.zookeeper_connect_string}
    partitions: 3
    replicationFactor: 2
    schemaregistry:
      type: AWS_GLUE
      glue:
        region: ${var.aws_region}
        registry: ${aws_glue_registry.kafka.registry_name}
```

Ah looks like a bug in the helm chart :disappointed: https://github.com/acryldata/datahub-helm/blob/master/charts/datahub/subcharts/datahub-gms/templates/deployment.yaml#L168-L174

Try setting the schema registry url as an extra env as well

KAFKA_SCHEMAREGISTRY_URL is the environment property.

What should I put there in the case of AWS Glue schema registry? You don’t get a host

I checked the working version on production (still v0.10.1), and there it is just the default value from the prereqs, which is not used (we use MSK + Glue SR). Let me try to put that. [attachment]

That didn’t work. It tried to connect to the host provided in KAFKA_SCHEMAREGISTRY_URL, but it didn’t detect the use of the Glue schema registry:

```
2023-08-21 19:35:32,170 [pool-15-thread-1] INFO  c.l.m.boot.OnBootApplicationListener:71 - Failed to connect to open servlet: prerequisites-cp-schema-registry
2023-08-21 19:35:32,170 [pool-15-thread-1] ERROR c.l.m.boot.OnBootApplicationListener:76 - Failed to bootstrap DataHub, OpenAPI servlet was not ready after 30 seconds
2023-08-21 19:35:32.178:INFO:oejs.AbstractConnector:JettyShutdownThread: Stopped ServerConnector@7fc229ab{HTTP/1.1, (http/1.1)}{0.0.0.0:8080}
```
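
The log extracts in this thread show the same pattern: a failed probe, a one-second sleep, and a final error once the servlet "was not ready after 30 seconds". Here is a minimal Python sketch of such a loop, assuming a fixed number of attempts. The names are hypothetical and this is not the actual OnBootApplicationListener code.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    max_attempts: int = 30,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or the attempts run out."""
    for attempt in range(1, max_attempts + 1):
        if probe():
            return True
        # Mirrors the logs: a failed probe is followed by a sleep,
        # except after the last attempt, where the error follows at once.
        if attempt < max_attempts:
            sleep(interval_seconds)
    return False


# Nothing listens on the target, so every probe fails.
# A no-op sleep keeps the example instant.
ready = wait_until_ready(lambda: False, sleep=lambda _seconds: None)
print("ready" if ready else "not ready, giving up")
```

A loop of this shape fails the same way whichever registry type is configured, as long as the URL it polls is unreachable.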

I think I found the problem. It looks like the “isSchemaRegistryAPIServeletReady” check is started by any Spring event originating from the WebApplicationContext. You can see this happening in the screenshot of the GMS component’s logs, and also in the code: https://github.com/datahub-project/datahub/blob/master/metadata-service/factories/src/main/java/com/linkedin/metadata/boot/OnBootApplicationListener.java#L51

The check is getting initialized even before the “schemaRegistryServlet” is initialized. I also don’t get why this schemaRegistryServlet bean is getting registered: it has the condition @ConditionalOnProperty(name = "kafka.schemaRegistry.type", havingValue = InternalSchemaRegistryFactory.TYPE) on it, while I have
- name: KAFKA_SCHEMAREGISTRY_TYPE
  value: AWS_GLUE
as an environment variable. [attachment]

Where are you seeing that the bean is loaded? The issue you’re still having above is that the prerequisites URL isn’t returning a 2XX status code. Though I’m guessing this check shouldn’t even execute for Glue if it doesn’t give you a URL.

Do you actually have the prerequisites schema registry running?

This is where I see it. And no, I do not have it running; we don’t use any of the prerequisites helm charts. The check should indeed not execute for AWS Glue.
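
Here is a Python sketch of what skipping the check for AWS Glue could look like. It assumes the check is only meaningful for the registry served by GMS itself, and that INTERNAL is the value of KAFKA_SCHEMAREGISTRY_TYPE that selects it. The function name is hypothetical, and this describes the behaviour wished for in the thread, not how GMS 0.10.5 behaves.

```python
from typing import Mapping

# Assumed value that selects the schema registry served by GMS itself.
INTERNAL_TYPE = "INTERNAL"


def should_wait_for_schema_registry_servlet(env: Mapping[str, str]) -> bool:
    """Decide whether the servlet readiness check applies (hypothetical)."""
    registry_type = env.get("KAFKA_SCHEMAREGISTRY_TYPE", "")
    # For AWS_GLUE there is no registry host to poll, so skip the check.
    return registry_type == INTERNAL_TYPE


print(should_wait_for_schema_registry_servlet({"KAFKA_SCHEMAREGISTRY_TYPE": "AWS_GLUE"}))  # prints False
```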

Judging from the application.yml values, it looks like AWS_GLUE is not supported at all anymore: https://github.com/datahub-project/datahub/blob/v0.10.5/metadata-service/configuration/src/main/resources/application.yml#L227

Do you recommend switching kafka.schemaRegistry.type to INTERNAL?