Predicting and Managing Kafka Topic Volumes: `DataHubUpgradeHistory` and `MetadataChangeLog_Timeseries`

Original Slack Thread

What type of information can we use to predict and project the volume of specific topics? Specifically, I’m curious about DataHubUpgradeHistory and MetadataChangeLog_Timeseries, but wondering about all of them.

The size of most topics is highly dependent on how frequently you are performing ingestions and the scale of each one. Kafka gives metrics on lag and byte throughput, and if your ingestion jobs are relatively consistent, you should be able to predict the volume through those.
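For example, consumer lag and per-topic byte throughput can be watched with the standard Kafka tooling; a minimal sketch, assuming a broker at localhost:9092 and DataHub’s default consumer group and topic names (check your own deployment for the real values):

```bash
# Consumer lag per partition for the metadata change log consumer
# (group name is an assumption -- list the groups first to confirm).
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group generic-mae-consumer-job-client

# Bytes-in per second for a specific topic, read from the broker's JMX metrics
# (assumes JMX is exposed on port 9999).
kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=MetadataChangeLog_Timeseries_v1' \
  --reporting-interval 10000
```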

For the two you called out, DataHubUpgradeHistory will always be extremely low volume. MetadataChangeLog_Timeseries may be fairly hefty depending on usage.

thanks <@UV5UEC3LN>, that’s helpful.

Can you help me understand when these two topics have messages in them?

Guessing purely from the name, is DataHubUpgradeHistory used whenever the datahubUpgrade job is run? That would explain the low volume.

Is MetadataChangeLog_Timeseries used whenever an ingestion job is run?

For Upgrade History, yes. MCL Timeseries is a bit more nuanced than just ingestion jobs, but generally that will be what puts events in there. It’s also possible for timeseries aspects to be generated from other processes, though. Basically, whenever a timeseries aspect is ingested, an event will land on that topic. Timeseries aspects are defined by the @Aspect annotation in the PDL model files with type=timeseries.
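A quick way to see which aspects fall into that category is to search the PDL models for that annotation; a sketch, assuming the open-source repo layout (metadata-models/src/main/pegasus):

```bash
# List PDL model files that declare a timeseries aspect.
grep -rl --include='*.pdl' '"type": "timeseries"' \
  metadata-models/src/main/pegasus/
```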

Sorry - can you say more about what a Time Series Aspect is? I’m not sure I follow.

https://datahubproject.io/docs/graphql/interfaces/#timeseriesaspect

So essentially, every time a timeseries aspect gets ingested, there’ll be some Kafka messages sent to MetadataChangeLog_Timeseries.

I’m trying to understand what sort of retention should exist on these topics. Is this a scenario where Kafka is being used as storage for these datasets? Or is it truly being used for message processing, in that the data in these messages gets processed into a persistent datastore (like the DB / ElasticSearch)?

Timeseries information is only persisted into ElasticSearch and not the relational DB. ES is generally pretty reliable, but it can be helpful to keep retention on the Kafka topics to cover the window between ES snapshots in disaster scenarios. We have put in reasonable defaults for retention in the topic setup scripts.

Yeah, I went hunting for those and found that the retention is 90 days for MetadataChangeLog_Timeseries: https://github.com/datahub-project/datahub/blob/master/docker/kafka-setup/kafka-setup.sh#L124

And unlimited for DataHubUpgradeHistory: https://github.com/datahub-project/datahub/blob/master/docker/kafka-setup/kafka-setup.sh#L140
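Current retention can be checked, and overridden, per topic with kafka-configs; a sketch, assuming a broker at localhost:9092 (topic names may differ in your deployment; recent releases suffix them with _v1):

```bash
# Inspect the current retention overrides on the timeseries change log topic.
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
  --entity-name MetadataChangeLog_Timeseries_v1 --describe

# Example override: drop retention to 7 days (604800000 ms).
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
  --entity-name MetadataChangeLog_Timeseries_v1 \
  --alter --add-config retention.ms=604800000
```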

The team that manages our Kafka infrastructure is worried about these retention settings, because sometimes those are configured when folks are attempting to misuse Kafka for persistent storage. If we configured retention to something more like ~1 week, would that have any impact on the application other than disaster recovery? It sounds like it would not.

<@UV5UEC3LN> - polite bump on this question above

For timeseries I think it should be okay; it just reduces your options for disaster recovery. For the Upgrade History topic, if you change the retention to a week and then lose the GMS pod, it will not start back up until you run another upgrade, unless you configure it to skip the upgrade check (not recommended).

Ah, good to know. Is that because it uses the Kafka topic to determine the last upgrade that was run?

Yep
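If you want to see what GMS looks for, the upgrade-history messages can be read back with the console consumer; a sketch, assuming a broker at localhost:9092 and the default topic name (the payloads are Avro-encoded, so this is mainly useful to confirm that messages exist):

```bash
# Dump the upgrade-history events GMS checks at startup.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic DataHubUpgradeHistory_v1 --from-beginning --timeout-ms 10000
```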

If we do get into a state where the GMS can’t determine the last upgrade run, how do we unstick ourselves?

~Do we have to temporarily disable the upgrade check?~ This doesn’t really make sense

Re-running the upgrade job is probably your best option; it should be essentially a no-op if your system is in the correct state.
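In a Helm-based deployment, re-running it might look roughly like this (release name, namespace, and job name are assumptions; adjust to your setup):

```bash
# Find and clear the completed upgrade job, then let Helm re-create and re-run it.
kubectl get jobs -n datahub | grep -i upgrade
kubectl delete job datahub-datahub-upgrade-job -n datahub
helm upgrade datahub datahub/datahub -n datahub --reuse-values
```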

Ah, so it’s the GMS that reads the topic to determine the last upgrade run. If the upgrade is always run prior to a deploy, it should be fine? Even if no upgrade is necessary, it’ll no-op but write the correct messages into the Kafka topic to let GMS start up. Is that accurate?

Yeah, that’s correct