Troubleshooting Slow and Failing DataHub Search on an EKS Deployment

Original Slack Thread

Hi all,
ISSUE: DataHub search is slow and sometimes fails.
We have deployed DataHub in production on EKS with 3 nodes, each with 2 vCPUs and 8 GB RAM.
We currently have around 4k datasets, and that number will grow as we plan to add a few more sources.
We have kept the Elasticsearch replica count at 1.
Does anyone have any idea how to resolve this issue?

<@U05CJD391ND> might be able to speak to this!

Hi <@U05JSEQTJH5> - I’m sorry to hear search is giving you trouble. I’d be curious to know how often you’re seeing the slowness/failures as it could be related to elasticsearch or it could be some other code path causing slowness.

  1. Is it only the search page where you’re seeing slowness, or are other pages slow too?
  2. Is it the entire page that appears to load slowly, or is it a specific network request in the browser’s <https://developer.chrome.com/docs/devtools/network/|network tab>? For example, does the searchAcrossEntities GraphQL query appear to be the culprit, or is it others?
  3. Do you have any monitoring on your Elasticsearch instance to help narrow this down? Tagging in <@UV5UEC3LN> to see if you have any other troubleshooting tips.
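
One way to answer question 2 without eyeballing the waterfall is to export the network tab as a HAR file and break each matching request into its timing phases. The sketch below assumes a HAR 1.2 export; the helper names, the file path passed to `report`, and the choice to match requests by looking for the GraphQL operation name in the request body are all assumptions made for illustration, not something DataHub prescribes.

```python
import json

# HAR 1.2 timing phases in milliseconds; -1 means "not applicable".
# "ssl" is left out because the HAR spec already counts it inside "connect".
PHASES = ("blocked", "dns", "connect", "send", "wait", "receive")


def phase_breakdown(entry):
    """Return {phase: ms} for one HAR entry; missing or -1 phases count as 0."""
    timings = entry.get("timings", {})
    result = {}
    for phase in PHASES:
        value = timings.get(phase)
        result[phase] = value if isinstance(value, (int, float)) and value > 0 else 0
    return result


def slowest_phase(entry):
    """Name of the phase that took the longest for this entry."""
    breakdown = phase_breakdown(entry)
    return max(breakdown, key=breakdown.get)


def entries_for_operation(har, operation_name):
    """Yield entries whose request body mentions the given operation name."""
    for entry in har["log"]["entries"]:
        body = (entry["request"].get("postData") or {}).get("text") or ""
        if operation_name in body:
            yield entry


def report(har_path, operation_name):
    """Print total time and dominant phase for each matching request."""
    with open(har_path, encoding="utf-8") as f:
        har = json.load(f)
    for entry in entries_for_operation(har, operation_name):
        print(round(entry["time"]), "ms total,", slowest_phase(entry), "dominates:",
              phase_breakdown(entry))
```

Calling `report("datahub.har", "searchAcrossEntities")` on an exported file (the file name is hypothetical) lists each matching request. A large `wait` points at time spent on the server side; a large `receive` points at time spent transferring the response.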

Your GMS is probably overprovisioned for the size of your data, and I would not expect it to be the bottleneck here. Definitely look into your ES metrics to see how hard you’re pushing the resources there. What is the instance size of your single ES node?
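
Without a monitoring stack in place, a quick first look at those ES metrics can come from the Elasticsearch node stats API. The sketch below is a minimal example, assuming Elasticsearch is reachable at `http://localhost:9200` (for instance through a port-forward) and does not require authentication; both are assumptions, as are the helper names.

```python
import json
import urllib.request


def fetch_node_stats(base_url="http://localhost:9200"):
    """GET _nodes/stats, limited to the jvm, os and thread_pool sections.

    base_url is an assumption; point it at wherever your cluster is reachable.
    """
    url = base_url.rstrip("/") + "/_nodes/stats/jvm,os,thread_pool"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_node_stats(stats):
    """Pull heap, CPU and search thread pool figures out of a node stats response."""
    rows = []
    for node in stats.get("nodes", {}).values():
        search_pool = node.get("thread_pool", {}).get("search", {})
        rows.append({
            "name": node.get("name"),
            "heap_used_percent": node.get("jvm", {}).get("mem", {}).get("heap_used_percent"),
            "cpu_percent": node.get("os", {}).get("cpu", {}).get("percent"),
            "search_queue": search_pool.get("queue"),
            "search_rejected": search_pool.get("rejected"),
        })
    return rows
```

Printing each row of `summarize_node_stats(fetch_node_stats())` gives one line per node. Heap usage that stays high, or a `search_rejected` count that keeps growing, would suggest the node is under pressure; low values across the board point away from Elasticsearch.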

What level of user traffic are you seeing and are you simultaneously running ingestions or other API calls from external services?

<@U05CJD391ND>.

  1. The slowness is mostly in the search bar (the getAutoCompleteMultipleResults operation). We also see some delay in the Lineage tab of datasets, but not at a concerning level. Searches sometimes take around 9-10 seconds.
  2. Currently we are facing the issue only in the getAutoCompleteMultipleResults operation.
  3. Currently we don’t have any monitoring for Elasticsearch; we will look into this.

<@UV5UEC3LN>,

  1. For ES we have a dedicated node with 2 vCPUs and 8 GB RAM.
  2. As for user traffic, there are only 4-5 users as of now, but I see the issue even when there is a single user.
  3. Currently we are running ingestion from only one source (Snowflake).

Hmm, yeah, that’s surprising. And the slowness isn’t spiky, it’s just constant?

What’s the rough count of datasets?

The ES node is a little small, but with a low level of ingestion, low volume of data, and low volume of user traffic I’d expect it to be fine. Did you also take a look at the ES dashboards to see if it’s spiking on resources?

<@UV5UEC3LN>,
Yeah, the slowness is constant for any new word. Even within a session, if I search the same term again after some time, it is slow again. We haven’t enabled any caching explicitly.
We have around 4k datasets ingested.
We haven’t set up any separate ES monitoring, but looking at the AWS cloud metrics for the node, I can see that max CPU utilisation is around 13%, and only at the time of our ingestion runs.
When typing a word in the search bar, I can see that some of the requests fail (is that normal?).
I have also attached a screenshot showing the time taken for a successful response.

Hi <@UV5UEC3LN>,
Could you help with this?

It kinda seems like your internet connection is a bit slow. The above network timing shows 7 seconds were spent downloading the response from the server. Unless the response is dozens of MB, which would not be in line with your data size, that seems extraordinarily slow to me.
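
To put rough numbers on that reasoning: the transfer rate implied by a download phase is just size divided by time. The sketch below works through it; the 200 kB response size and the 50 Mbit/s link speed are hypothetical figures chosen for illustration, not measurements from this thread.

```python
def implied_throughput_mbps(response_bytes, download_seconds):
    """Transfer rate in Mbit/s implied by moving response_bytes in download_seconds."""
    return response_bytes * 8 / download_seconds / 1_000_000


def payload_mb_at(link_mbps, seconds):
    """Megabytes that a link of link_mbps Mbit/s can move in the given seconds."""
    return link_mbps * seconds / 8


# Hypothetical: a 200 kB autocomplete response that took 7 s to download.
print(f"{implied_throughput_mbps(200_000, 7):.2f} Mbit/s")  # prints "0.23 Mbit/s"

# Hypothetical: how much a 50 Mbit/s link could move in those same 7 s.
print(f"{payload_mb_at(50, 7):.2f} MB")  # prints "43.75 MB"
```

Under these assumed figures, a 7-second download is only plausible on a healthy link if the response is tens of megabytes; for a small response it implies a link running at a fraction of 1 Mbit/s.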

<@UV5UEC3LN> I will look into the data size (everyone else is facing the same issue), but isn’t 12 seconds slow for an API response (search autocomplete)?
Will get back to you with more details.

If your network is slow in general, then it could just be the network hops in between that are slow rather than the actual server processing time. Check the server logs for timings to see when the server actually receives the requests.
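
From the client side, a rough way to separate server time from transfer time is to time the arrival of the response headers and the download of the body separately. This is a minimal sketch using only the Python standard library; the helper name is made up for illustration, and a real DataHub GraphQL call would also need a request body and authentication, which are left out here.

```python
import time
import urllib.request


def time_request(url, data=None, headers=None, timeout=30):
    """Time a single HTTP request in two parts.

    ttfb_s covers everything up to the response headers arriving
    (DNS, connect, TLS, sending the request, and server processing).
    download_s covers reading the response body.
    """
    req = urllib.request.Request(url, data=data, headers=headers or {})
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        headers_at = time.perf_counter()  # status line and headers received
        body = resp.read()                # pull the full body over the wire
        done_at = time.perf_counter()
    return {
        "ttfb_s": headers_at - start,
        "download_s": done_at - headers_at,
        "bytes": len(body),
    }
```

If `ttfb_s` is small and `download_s` is large, the server answered quickly and the time went into moving bytes across the network. Running the same call from a machine close to the cluster and from a user's laptop makes the comparison more telling.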