Recovering Lost Information in DataHub and Preventing Data Loss Scenarios

Original Slack Thread

Hi, team. We’ve found that if the mae-consumer fails to propagate some information from an MCL event to ES/Neo4j because of an unexpected situation, there is no way to recover it, so that information is lost forever. Is there a way to recover in this situation?
For example, let’s assume the following:

  1. Neo4j is down for some reason.
  2. A new dataset is ingested into GMS, and an MCL event is published.
  3. The mae-consumer consumes that MCL event but fails to update Neo4j.
  4. Case closed. When Neo4j recovers, that dataset is absent from Neo4j.

If you have any tips for this situation, please let us know…

There are a few options:

  1. Re-ingest the data.
  2. Run restore indices via the job (https://datahubproject.io/docs/how/restore-indices/) or the API (https://datahubproject.io/docs/api/restli/restore-indices).
  3. Use the Kafka CLI to reset the Kafka offsets for the mae-consumer to a time prior to the issue.
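For option 2, a minimal sketch of invoking the restore-indices API from Python. The endpoint path and request field shown here follow the linked restore-indices docs, but verify them against your DataHub version; the GMS host, token, and URN are placeholders:

```python
import json
import urllib.request

def build_restore_indices_request(gms_host, token, urn):
    """Build (but do not send) a restore-indices request for one entity.

    The endpoint path and body field are assumptions based on the
    linked restore-indices docs; verify against your DataHub version.
    """
    url = f"{gms_host}/aspects?action=restoreIndices"
    body = json.dumps({"urn": urn}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )

req = build_restore_indices_request(
    "http://localhost:8080",    # GMS host (placeholder)
    "<personal-access-token>",  # placeholder
    "urn:li:dataset:(urn:li:dataPlatform:hive,SampleHiveDataset,PROD)",
)
# Actually sending it requires a running GMS:
# urllib.request.urlopen(req)
```

This only rebuilds the search/graph indexes from the versioned aspects stored in the primary database, so it can backfill an index entry the mae-consumer failed to write.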

Thank you for your kind reply. If you don’t mind, could I take this question a little further?
Even given a way to restore the Neo4j and ES indexes, there are two more concerns:

  1. MCL is not only used to create indexes. As far as I know, time-series aspects are also stored via MCL, so if the mae-consumer fails to store a time-series aspect value from an MCL event, that time-series aspect will be lost forever. Is there some way to prevent this situation in DataHub?
  2. Our team has a plan to use DataHub as the core engine of our data catalog system, so we want to expand the mae-consumer’s role to trigger various serverless applications that generate more knowledge based on MCL events. In this case, when the mae-consumer fails, I think there is no way to recover some events. Is there a way to prevent this situation in DataHub?
Lastly, I know I asked a lot of things, but our team is a big fan of DataHub. DataHub is really amazing to us. :grinning:
  1. Option 3 would recreate the time-series aspects (as long as the messages are still in the Kafka topic).
  2. Depending on your requirements and knowledge, there are a few different ways to process MCL without depending on the mae-consumer. The first is to build your own consumer and read from the same topics with a different consumer group. Another option is the Actions framework, which similarly consumes MCL messages independently of the mae-consumer using a Python-based framework: https://datahubproject.io/docs/actions/
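To illustrate the custom-consumer route, here is a minimal sketch of the dispatch side of such a consumer. The handler, handler table, and simplified record shape are hypothetical; a real deployment would feed deserialized MetadataChangeLog records into `process_mcl()` from a Kafka consumer (e.g. confluent-kafka) subscribed to the MCL topics under its own consumer group, so its offsets are tracked independently of the mae-consumer:

```python
# Sketch of the processing side of a custom MCL consumer.
# In production, a Kafka consumer with its own consumer group would
# deserialize MetadataChangeLog records and pass them to process_mcl().

def trigger_serverless_app(event):
    """Placeholder for invoking your downstream application."""
    return f"triggered for {event['entityUrn']}"

# Route only the events you care about; ignore the rest.
HANDLERS = {
    ("dataset", "UPSERT"): trigger_serverless_app,
}

def process_mcl(event):
    """Dispatch one deserialized MCL record to a handler, if any."""
    key = (event.get("entityType"), event.get("changeType"))
    handler = HANDLERS.get(key)
    if handler is None:
        return None  # not an event this consumer acts on
    return handler(event)

# Example record shape (simplified from a real MCL message):
sample = {
    "entityType": "dataset",
    "changeType": "UPSERT",
    "entityUrn": "urn:li:dataset:(urn:li:dataPlatform:hive,db.table,PROD)",
    "aspectName": "datasetProperties",
}
```

Because this consumer commits its own offsets, a failure in the mae-consumer does not affect it, and after an outage of your downstream systems you can replay by resetting only this group's offsets.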