Predicting and Managing Kafka Topic Volumes: `DataHubUpgradeHistory` and `MetadataChangeLog_Timeseries`

Original Slack Thread

What type of information can we use to predict and project the volume of specific topics? Specifically, I’m curious about DataHubUpgradeHistory and MetadataChangeLog_Timeseries, but wondering about all of them.

The size of most topics is highly dependent on how frequently you are performing ingestions and the scale of each one. Kafka gives metrics on lag and byte throughput, and if your ingestion jobs are relatively consistent, you should be able to predict the volume through those.
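For example, consumer lag and per-topic byte throughput can be watched with the standard Kafka tooling; a minimal sketch, assuming a broker at localhost:9092 and DataHub’s default consumer group and topic names (check your own deployment for the real values):

```bash
# Consumer lag per partition for the metadata change log consumer
# (group name is an assumption -- list the groups first to confirm).
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group generic-mae-consumer-job-client

# Bytes-in per second for a specific topic, read from the broker's JMX metrics
# (assumes JMX is exposed on port 9999).
kafka-run-class.sh kafka.tools.JmxTool \
  --jmx-url service:jmx:rmi:///jndi/rmi://localhost:9999/jmxrmi \
  --object-name 'kafka.server:type=BrokerTopicMetrics,name=BytesInPerSec,topic=MetadataChangeLog_Timeseries_v1' \
  --reporting-interval 10000
```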

For the two you called out, DataHubUpgradeHistory will always be extremely low volume. MetadataChangeLog_Timeseries may be fairly hefty depending on usage.

thanks <@UV5UEC3LN>, that’s helpful.

Can you help me understand when these two topics have messages in them?

Guessing purely from the name, is DataHubUpgradeHistory used whenever the datahubUpgrade job is run? That would explain the low volume.

Is MetadataChangeLog_Timeseries used whenever an ingestion job is run?

For Upgrade History, yes. MCL Timeseries is a bit more nuanced than just ingestion jobs, but generally that will be what puts events in there. It’s also possible for timeseries aspects to be generated from other processes, though. Basically, whenever a timeseries aspect is ingested, an event will land on that topic. Timeseries aspects are defined by the @Aspect annotation in the PDL model files with type=timeseries.
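A quick way to see which aspects fall into that category is to search the PDL models for that annotation; a sketch, assuming the open-source repo layout (metadata-models/src/main/pegasus):

```bash
# List PDL model files that declare a timeseries aspect.
grep -rl --include='*.pdl' '"type": "timeseries"' \
  metadata-models/src/main/pegasus/
```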

Sorry - can you say more about what a Time Series Aspect is? I’m not sure I follow.

https://datahubproject.io/docs/graphql/interfaces/#timeseriesaspect

So essentially, every time a timeseries aspect gets ingested, there’ll be some Kafka messages sent to MetadataChangeLog_Timeseries.

I’m trying to understand what sort of retention should exist on these topics. Is this a scenario where Kafka is being used as storage for these datasets? Or is it truly being used for message processing, in that the data in these messages gets processed into a persistent datastore (like the DB / ElasticSearch)?

Timeseries information is only persisted into ElasticSearch and not the relational DB. ES is generally pretty reliable, but it can be helpful to keep retention on the Kafka topics to cover the window between ES snapshots in disaster scenarios. We have put in reasonable defaults for retention in the topic setup scripts.

Yeah, I went hunting for those and found that the retention is 90 days for MetadataChangeLog_Timeseries: https://github.com/datahub-project/datahub/blob/master/docker/kafka-setup/kafka-setup.sh#L124

And unlimited for DataHubUpgradeHistory: https://github.com/datahub-project/datahub/blob/master/docker/kafka-setup/kafka-setup.sh#L140
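Current retention can be checked, and overridden, per topic with kafka-configs; a sketch, assuming a broker at localhost:9092 (topic names may differ in your deployment; recent releases suffix them with _v1):

```bash
# Inspect the current retention overrides on the timeseries change log topic.
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
  --entity-name MetadataChangeLog_Timeseries_v1 --describe

# Example override: drop retention to 7 days (604800000 ms).
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics \
  --entity-name MetadataChangeLog_Timeseries_v1 \
  --alter --add-config retention.ms=604800000
```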

The team that manages our Kafka infrastructure is worried about these retention settings, because sometimes those are configured when folks are attempting to misuse Kafka for persistent storage. If we configured retention to something more like ~1 week, would that have any impact on the application other than disaster recovery? It sounds like it would not.

<@UV5UEC3LN> - polite bump on this question above

For timeseries I think it should be okay; it just reduces your options for disaster recovery. For the Upgrade History topic, if you change the retention to a week and then lose the GMS pod, it will not start back up until you run another upgrade, unless you configure it to skip the upgrade check (not recommended).

Ah, good to know. Is that because it uses the Kafka topic to determine the last upgrade that was run?

Yep
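If you want to see what GMS looks for, the upgrade-history messages can be read back with the console consumer; a sketch, assuming a broker at localhost:9092 and the default topic name (the payloads are Avro-encoded, so this is mainly useful to confirm that messages exist):

```bash
# Dump the upgrade-history events GMS checks at startup.
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic DataHubUpgradeHistory_v1 --from-beginning --timeout-ms 10000
```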

If we do get into a state where the GMS can’t determine the last upgrade run, how do we unstick ourselves?

~Do we have to temporarily disable the upgrade check?~ This doesn’t really make sense

Re-running the upgrade job is probably your best option; it should be essentially a no-op if your system is in the correct state.
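In a Helm-based deployment, re-running it might look roughly like this (release name, namespace, and job name are assumptions; adjust to your setup):

```bash
# Find and clear the completed upgrade job, then let Helm re-create and re-run it.
kubectl get jobs -n datahub | grep -i upgrade
kubectl delete job datahub-datahub-upgrade-job -n datahub
helm upgrade datahub datahub/datahub -n datahub --reuse-values
```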

Ah, so it’s the GMS that reads the topic to determine the last upgrade run. If the upgrade is always run prior to a deploy, it should be fine? Even if no upgrade is necessary, it’ll no-op but write the correct messages into the Kafka topic to let GMS start up. Is that accurate?

Yeah, that’s correct