Troubleshooting Kafka Consumer Issues in DataHub Acryl Action Pod

Original Slack Thread

<@U06TM7M40RK> what’s the problem here? I got this log from the pod datahub acryl action

```
[2024-08-08 08:39:06,004] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:39:08,240] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:39:10,388] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:39:15,031] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:39:16,794] WARNING  {urllib3.connectionpool:826} - Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))': /aspects?action=ingestProposal
[2024-08-08 08:39:21,635] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:39:22,256] WARNING  {datahub_actions.plugin.source.kafka.utils:20} - kafka event source: exhausted all attempts.
[2024-08-08 08:39:23,821] ERROR    {datahub_actions.plugin.source.kafka.kafka_event_source:173} - Kafka consume error: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (10000ms) exceeded by 295ms"}
Traceback (most recent call last):
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/datahub_actions/plugin/source/kafka/kafka_event_source.py", line 171, in events
    msg = self.consumer.poll(timeout=2.0)
  File "/datahub-ingestion/.venv/lib/python3.10/site-packages/confluent_kafka/deserializing_consumer.py", line 104, in poll
    raise ConsumeError(msg.error(), kafka_message=msg)
confluent_kafka.error.ConsumeError: KafkaError{code=_MAX_POLL_EXCEEDED,val=-147,str="Application maximum poll interval (10000ms) exceeded by 295ms"}
%4|1723106384.476|MAXPOLL|rdkafka#consumer-1| [thrd:main]: Application maximum poll interval (10000ms) exceeded by 69ms (adjust max.poll.interval.ms for long-running message processing): leaving group
[2024-08-08 08:39:58,271] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:40:00,174] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:40:03,804] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:40:07,162] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:40:13,882] ERROR    {datahub_actions.plugin.source.kafka.utils:16} - KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"}
[2024-08-08 08:40:15,485] WARNING  {datahub_actions.plugin.source.kafka.utils:20} - kafka event source: exhausted all attempts.
```

The log messages you provided indicate that the Kafka consumer in your DataHub Acryl Action pod is experiencing issues related to the max.poll.interval.ms configuration. Specifically, the consumer is exceeding the maximum poll interval, which causes it to leave the consumer group and subsequently fail to commit offsets. This results in the UNKNOWN_MEMBER_ID errors.

Key Issues Identified:

  1. Maximum Poll Interval Exceeded:

    • The message Application maximum poll interval (10000ms) exceeded by 295ms indicates that the application is taking longer than the configured max.poll.interval.ms (10,000 ms) to process messages between successive poll() calls.
    • This causes the consumer to leave the group, leading to the UNKNOWN_MEMBER_ID errors when trying to commit offsets.
  2. Commit Failed: Broker: Unknown Member:

    • The repeated KafkaError{code=UNKNOWN_MEMBER_ID,val=25,str="Commit failed: Broker: Unknown member"} errors indicate that the consumer is trying to commit offsets after it has been removed from the consumer group due to the poll interval being exceeded.
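To make the chain of events concrete, here is a minimal, hypothetical simulation (not DataHub or librdkafka code) of the broker-side rule that evicts a consumer which stops polling; all names in it are illustrative:

```python
# Hypothetical sketch: models the group coordinator's eviction rule.
# A consumer that does not call poll() within max.poll.interval.ms is
# removed from the group; its later offset commits then fail with
# UNKNOWN_MEMBER_ID, which is exactly the pattern in the log above.

MAX_POLL_INTERVAL_MS = 10_000  # the 10000ms value from the log

def member_evicted(last_poll_ms: int, now_ms: int,
                   max_poll_interval_ms: int = MAX_POLL_INTERVAL_MS) -> bool:
    """True once the gap between polls exceeds max.poll.interval.ms."""
    return (now_ms - last_poll_ms) > max_poll_interval_ms

# The log reports the interval exceeded by 295ms, i.e. 10295ms since the
# last poll:
print(member_evicted(0, 10_295))  # True  -> consumer leaves the group
print(member_evicted(0, 9_500))   # False -> still a group member
```

Once the first commit fails this way, every subsequent commit from the stale session fails too, which is why the same error repeats until the consumer rejoins.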

Suggested Solutions:

  1. Increase max.poll.interval.ms:

    • Increase the max.poll.interval.ms configuration to allow more time for message processing. This can be done by setting the environment variable SPRING_KAFKA_PROPERTIES_MAX_POLL_INTERVAL_MS to a higher value.
    • Example:
      - name: SPRING_KAFKA_PROPERTIES_MAX_POLL_INTERVAL_MS
        value: "30000"  # Increase to 30 seconds
      
  2. Reduce max.poll.records:

    • Reduce the max.poll.records configuration to limit the number of records returned in each poll, which can help in processing messages within the poll interval.
    • Example:
      - name: SPRING_KAFKA_PROPERTIES_MAX_POLL_RECORDS
        value: "10"
      
  3. Optimize Message Processing:

    • Review and optimize the message processing logic to ensure it completes within the allowed poll interval.
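One common pattern for keeping poll() responsive is to hand slow work to a background worker so the poll loop itself stays fast. The sketch below is hypothetical (not datahub-actions internals) and simulates messages with a plain list instead of a real Kafka consumer:

```python
import queue
import threading
import time

def slow_handler(msg: str) -> str:
    """Stand-in for long-running message processing."""
    time.sleep(0.001)
    return msg.upper()

def run(messages: list[str]) -> list[str]:
    # A single worker thread drains the queue in FIFO order while the
    # "poll loop" below only enqueues, so the time between loop
    # iterations stays tiny regardless of handler latency.
    work: queue.Queue = queue.Queue()
    results: list[str] = []

    def worker() -> None:
        while True:
            msg = work.get()
            if msg is None:  # sentinel: no more messages
                break
            results.append(slow_handler(msg))

    t = threading.Thread(target=worker)
    t.start()
    for msg in messages:  # stands in for repeated consumer.poll() calls
        work.put(msg)
    work.put(None)
    t.join()
    return results

print(run(["a", "b", "c"]))  # ['A', 'B', 'C']
```

Note that with this pattern offsets should only be committed after the worker has actually processed a message, otherwise a crash can lose in-flight events.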

Example Configuration:
Here is an example of how you can set these configurations in your environment variables:

```yaml
- name: SPRING_KAFKA_PROPERTIES_MAX_POLL_INTERVAL_MS
  value: "30000"  # 30 seconds

- name: SPRING_KAFKA_PROPERTIES_MAX_POLL_RECORDS
  value: "10"
```

Would you like more details on any specific part of the solution?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?



I’m here to help with questions about DataHub! Could you please provide more details about your specific issue or question? Specifically, it would be helpful to know:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g., 0.12.0)?
  3. What data source(s) are you integrating with DataHub? (e.g., BigQuery)?

This information will help me assist you more effectively.

<@U06TM7M40RK> which config must I change? I’m using Kubernetes as the deployment and I’m using v0.13.3

Sure thing! I’m looking through the available data to answer your question. Hold on a second… :hourglass_flowing_sand:

<@U06TM7M40RK> are you able to respond?