Fixing Entity Publishing Issues and Restoring Indices in DataHub Production

Original Slack Thread

Hello,
We are experiencing issues publishing new entities to DataHub using the Python SDK since upgrading to 0.11.0.
In version 0.10.4, we pushed a small number of containers and datasets. All of the containers are in a single domain, and each container has one dataset associated with it.
In version 0.11.0, we pushed one new container in the same domain, with one dataset that is part of the container, using the same code and process. We are using version 0.11.0 of the acryl-datahub PyPI package (a minimal sketch of the SDK calls involved follows this message). This is what we observe about the new entities:
• They cannot be found using the search bar in the DataHub UI. However, using the URN to go directly to the entity URL works.
• The original domain shows all of the original entities that we ingested in 0.10.4, but it does not show the new container pushed in 0.11.0. However, the new container shows that it is part of the original domain and has a correct URL link to it.
• The new container does not show that it has any entities. However, the new dataset shows the container in its breadcrumbs and has a correct link to the container.
Further, this issue only exists in our production DataHub instance, not in our development instance. Production holds much more data, but is otherwise used and configured similarly.
So it seems the entity relationships are only working in one direction. Any suggestions on what to look at or what the problem might be would be welcome.
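For reference, here is a minimal sketch of the kind of SDK calls involved; all URNs, names, and the GMS address are placeholders, not our real values.
```
# Minimal sketch of pushing a container, its domain membership, and a dataset's
# container membership with the acryl-datahub Python SDK. All URNs/names and
# the GMS address are placeholders.
from datahub.emitter.mce_builder import make_dataset_urn, make_domain_urn
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.rest_emitter import DatahubRestEmitter
from datahub.metadata.schema_classes import (
    ContainerClass,
    ContainerPropertiesClass,
    DomainsClass,
)

emitter = DatahubRestEmitter(gms_server="http://localhost:8080")  # placeholder GMS URL

container_urn = "urn:li:container:example-container"  # hypothetical container
dataset_urn = make_dataset_urn(platform="postgres", name="schema.table", env="PROD")
domain_urn = make_domain_urn("example-domain")  # the pre-existing domain

mcps = [
    # Container display properties
    MetadataChangeProposalWrapper(
        entityUrn=container_urn,
        aspect=ContainerPropertiesClass(name="Example Container"),
    ),
    # Container -> domain membership
    MetadataChangeProposalWrapper(
        entityUrn=container_urn,
        aspect=DomainsClass(domains=[domain_urn]),
    ),
    # Dataset -> container membership
    MetadataChangeProposalWrapper(
        entityUrn=dataset_urn,
        aspect=ContainerClass(container=container_urn),
    ),
]

for mcp in mcps:
    emitter.emit(mcp)  # each of these POSTs returned HTTP 200
```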

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Which DataHub version are you using? (e.g. 0.12.0)

  2. Please post any relevant error logs on the thread!

I had the SDK log level set to debug and there are no errors in the logs. All of the POSTs the SDK made returned 200.
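For reference, the SDK uses the standard Python logging module, so debug output was enabled with roughly the following:
```
# The acryl-datahub SDK logs through the standard Python logging module, so
# debug output (including the underlying HTTP calls) can be enabled like this:
import logging

logging.basicConfig(level=logging.DEBUG)
# or, more narrowly, just the datahub loggers:
logging.getLogger("datahub").setLevel(logging.DEBUG)
```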

<@U03MF8MU5P0> Could you look into this? thanks!

An update on this: We have fixed the problem for new entities. It seems to be related to issues we were having with Kafka serialization. Those have been fixed and all new entities can be found using search, and all the relations are correct (domains show the expected entities and so on).

However, we have several weeks’ worth of entities that were automatically ingested during this time and are not discoverable using search or browsing. We’ve verified that upserting over a “broken” entity doesn’t fix it, but deleting and recreating it does. Is there a better way to fix the “broken” entities than deleting all of them and recreating them?
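For reference, the delete-and-recreate workaround looks roughly like the sketch below. It assumes the SDK’s DataHubGraph client (which exposes delete_entity in recent versions); the URN and server address are placeholders.
```
# Rough sketch of the delete-and-recreate workaround, using the SDK's graph
# client. The URN and server address are placeholders.
from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig
from datahub.metadata.schema_classes import ContainerPropertiesClass

graph = DataHubGraph(DatahubClientConfig(server="http://localhost:8080"))

broken_urn = "urn:li:container:example-container"  # placeholder

# Hard-delete the broken entity...
graph.delete_entity(urn=broken_urn, hard=True)

# ...then re-emit its aspects exactly as in the original ingestion, e.g.:
graph.emit_mcp(
    MetadataChangeProposalWrapper(
        entityUrn=broken_urn,
        aspect=ContainerPropertiesClass(name="Example Container"),
    )
)
```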

To clarify the serialization issue we resolved: we were seeing errors in the GMS backend pod from consumers trying to deserialize messages from the Metadata Change Log topics (versioned and timeseries). We fixed it by switching our schema registry back to Kafka, as we had it in 0.10.4 and earlier versions, and then resetting the consumer offsets to latest. So I assume there are a large number of messages in those topics that were never successfully processed.

Restoring indices for the non-timeseries (a.k.a. versioned) messages can be done via the Rest.li API (https://datahubproject.io/docs/api/restli/restore-indices/) or via the restore indices upgrade job (https://datahubproject.io/docs/how/restore-indices/).
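For a single URN, the Rest.li call looks roughly like the sketch below; the exact endpoint path, auth, and parameters depend on your deployment and DataHub version, so check the linked doc for the exact contract.
```
# Rough sketch of calling the restoreIndices Rest.li action for one URN on GMS.
# Endpoint path, auth, and parameters are deployment/version dependent -- see
# the linked restore-indices API doc for the exact contract.
import requests

GMS = "http://localhost:8080"        # placeholder: direct GMS address
TOKEN = "<personal-access-token>"    # only needed if metadata service auth is enabled

resp = requests.post(
    f"{GMS}/operations?action=restoreIndices",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Content-Type": "application/json",
    },
    json={"urn": "urn:li:container:example-container"},  # placeholder URN
)
resp.raise_for_status()
print(resp.json())
```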

Thanks <@U03MF8MU5P0>! This looks like exactly what we need.

Hello,
Resurrecting this because we have not been able to run the Restore Search and Graph Indices job successfully.
We are following the instructions for Kubernetes deployments at https://datahubproject.io/docs/how/restore-indices/.
This works fine in our smaller dev environment, but in our production environment, which has over 25 million aspects, the job runs out of JVM memory early on. We bumped the pod’s memory from the default up to 4 GB and verified that the JVM max heap size grew, but that only got us a little farther: with a 4 GB pod, the job starts throwing repeated JVM out-of-memory errors at about 7% completion. We’ve tried other configuration adjustments, such as changing the batch size, without success. Is there a way to configure this job to run successfully on a DB of this size, other than brute-forcing it with a very large pod?

Running DataHub version 0.11.0. We’ve noticed that a new urnBasedPagination argument (https://github.com/datahub-project/datahub/blob/cca1e9dd495e85fecaa5c128b5e9848a8e931e9f/datahub-upgrade/src/main/java/com/linkedin/datahub/upgrade/restoreindices/SendMAEStep.java#L145C30-L145C30) was recently added to this job. Our assessment is that running the job this way should work better on a DB of this size (it appears to keep only one batch in scope at a time, as opposed to how the job works without this argument set). This change doesn’t appear to be in the 0.12.0 release. Any other suggestions for how we can run the job in the meantime? Are we correct in our assessment that this new option will work better in our environment?

Exact error:

	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
	at com.linkedin.datahub.upgrade.restoreindices.SendMAEStep.iterateFutures(SendMAEStep.java:71)
	at com.linkedin.datahub.upgrade.restoreindices.SendMAEStep.lambda$executable$0(SendMAEStep.java:138)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeStepInternal(DefaultUpgradeManager.java:110)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:68)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.executeInternal(DefaultUpgradeManager.java:42)
	at com.linkedin.datahub.upgrade.impl.DefaultUpgradeManager.execute(DefaultUpgradeManager.java:33)
	at com.linkedin.datahub.upgrade.UpgradeCli.run(UpgradeCli.java:80)
	at org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:768)
	at org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:752)
	at org.springframework.boot.SpringApplication.run(SpringApplication.java:314)
	at org.springframework.boot.builder.SpringApplicationBuilder.run(SpringApplicationBuilder.java:164)
	at com.linkedin.datahub.upgrade.UpgradeCliApplication.main(UpgradeCliApplication.java:23)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:49)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:108)
	at org.springframework.boot.loader.Launcher.launch(Launcher.java:58)
	at org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:65)

You might try running the restore indices job with the Docker image from that PR. The commit was merged to master with tag a29fce9, and the corresponding image is https://hub.docker.com/layers/acryldata/datahub-upgrade/a29fce9/images/sha256-fcf2b7ef3059112a2a035bd68d1ce0fe19d26ff05a29965047aef41904dec062?context=explore. I would recommend trying it on a test system first.