Diagnosing and Improving DataHub Kafka Topic Lag Post-Upgrade from 0.12.0 to 0.13.0

Original Slack Thread

We recently upgraded from DataHub 0.12.0 to DataHub 0.13.0, and we have several ingestion processes. I have noticed that the lag in the “MetadataChangeLog_Versioned_v1” Kafka topic is ever increasing, as if processing is slower than ingestion. Is there a way to alleviate it? Otherwise it can take days for changes to be reflected in the UI.

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)



To address the issue of increasing lag in the MetadataChangeLog_Versioned_v1 topic after upgrading to DataHub 0.13.0, you can take several steps to diagnose and improve the performance of your ingestion processes. Here’s a summary of potential solutions:

  1. Check Kafka Consumer Configuration:

    • Ensure that your Kafka consumer configuration is optimized. You might need to adjust settings such as max.poll.interval.ms and max.poll.records so the consumer can reliably finish each batch within the poll interval.
  2. Use Standalone Consumers:

    • Running standalone MAE (Metadata Audit Event) and MCE (Metadata Change Event) consumers can help alleviate the load on the consumers embedded in GMS (Generalized Metadata Service). This can improve the processing speed of messages in the Kafka topic.
  3. Scale Resources:

    • Ensure that your DataHub instance has sufficient resources (CPU, memory) allocated. Sometimes, performance issues can be mitigated by scaling up the resources.
  4. Review and Optimize Ingestion Processes:

    • Review your ingestion processes to ensure they are not introducing unnecessary load. Optimize the ingestion recipes and configurations for better performance.
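Before applying any of the fixes above, it helps to confirm where the lag actually is. The `kafka-consumer-groups.sh` CLI that ships with Kafka reports per-partition lag for a consumer group (the group id for the MAE consumer is commonly `generic-mae-consumer-job-client`, but verify against your deployment, and note that the column layout varies by Kafka version). A small sketch that sums the LAG column from that output; the sample data below is illustrative:

```python
# Sum per-partition lag from `kafka-consumer-groups.sh --describe` output.
# The sample output is illustrative; in practice, capture real output from:
#   kafka-consumer-groups.sh --bootstrap-server <broker:9092> \
#     --describe --group <your-mae-consumer-group>
sample = """\
TOPIC                          PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG
MetadataChangeLog_Versioned_v1 0         1200           5400           4200
MetadataChangeLog_Versioned_v1 1         980            5100           4120
"""

def total_lag(describe_output: str, topic: str) -> int:
    """Sum the LAG column (5th field here) for the given topic."""
    lag = 0
    for line in describe_output.splitlines():
        fields = line.split()
        if fields and fields[0] == topic:
            lag += int(fields[4])
    return lag

print(total_lag(sample, "MetadataChangeLog_Versioned_v1"))  # 8320
```

If total lag grows steadily over several samples, consumption is falling behind production and the tuning steps below apply.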

Here are the detailed steps and configurations you can try:

1. Adjust Kafka Consumer Configuration

Increase the max.poll.interval.ms and set max.poll.records to a lower value to ensure the consumer can process messages within the poll interval.

# Example Kafka consumer configuration (map these properties to your
# deployment's Kafka consumer settings)
    max.poll.interval.ms: 300000  # allow up to 5 minutes per poll loop
    max.poll.records: 20          # process fewer records per poll
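The interaction between these two settings is easy to reason about: the consumer must finish processing one poll's worth of records before max.poll.interval.ms elapses, or the broker considers it dead and triggers a group rebalance, which makes lag worse. A quick back-of-the-envelope check using the example values above:

```python
# Worst-case per-record processing budget before the consumer is
# evicted from the group and a rebalance is triggered.
max_poll_interval_ms = 300_000  # 5 minutes, as in the example above
max_poll_records = 20

budget_per_record_ms = max_poll_interval_ms / max_poll_records
print(budget_per_record_ms)  # 15000.0 ms, i.e. up to 15 s per record
```

Lowering max.poll.records therefore trades raw batch throughput for a larger per-record safety margin, which is usually the right trade when individual messages are slow to process.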

2. Use Standalone Consumers

Running standalone MAE and MCE consumers can help distribute the load more effectively. You can enable standalone consumers by setting the appropriate configuration in your deployment.

# Example Helm values.yaml configuration to enable standalone consumers
    datahub-mae-consumer:
      enabled: true
    datahub-mce-consumer:
      enabled: true

3. Scale Resources

Ensure that your DataHub instance has sufficient resources. You might need to increase the number of replicas for GMS or allocate more CPU and memory.

# Example Kubernetes resource configuration for the GMS container
    resources:
      limits:
        cpu: "4"
        memory: "8Gi"
      requests:
        cpu: "2"
        memory: "4Gi"
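When deciding how far to scale, a rough capacity check helps: lag only shrinks when consumption outpaces production, and consumer parallelism is capped by the topic's partition count. A sketch of the arithmetic, with all numbers illustrative rather than measured:

```python
# Rough estimate of how long it takes to drain existing consumer lag.
# All inputs are illustrative -- substitute your own measurements.
current_lag = 2_000_000           # messages behind
produce_rate = 300.0              # msgs/s written by ingestion
consume_rate_per_replica = 250.0  # msgs/s one consumer replica handles
partitions = 4                    # parallelism is capped by partition count

def drain_seconds(replicas: int) -> float:
    """Time to clear the backlog at a given replica count."""
    effective = min(replicas, partitions) * consume_rate_per_replica
    net = effective - produce_rate
    if net <= 0:
        return float("inf")  # consumption never catches up
    return current_lag / net

print(drain_seconds(1))                   # inf: 250 < 300, lag keeps growing
print(round(drain_seconds(2) / 3600, 2))  # ~2.78 hours at 2 replicas
```

Note that adding replicas beyond the partition count buys nothing; if you are already at one consumer per partition, the remaining levers are faster per-message processing or more partitions.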

4. Optimize Ingestion Processes

Review your ingestion recipes and configurations to ensure they are optimized for performance. For example, you can adjust the number of threads for ingestion.

# Example Looker ingestion recipe
    source:
      type: looker
      config:
        base_url: "https://your.looker.instance"
        client_id: "your_client_id"
        client_secret: "your_client_secret"
        number_of_threads: 10  # Adjust the number of threads for better performance

    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"

Additional Questions

To provide more specific guidance, could you please answer the following:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you currently using (e.g., v0.13.0)?
  3. What data sources are you integrating with DataHub (e.g., dbt, Airflow, Snowflake, Looker)?


Feel free to provide more details or ask follow-up questions!