Adding and Utilizing Multiple Kafka Instances for MSK Event Ingestion

Original Slack Thread

<@U06TM7M40RK> add multiple instances of Kafka: I need 3 MSK instances. Irrespective of whether or not I have Kafka instances, the same number of events/resources is ingested. First, how do I add Kafka instances? Second, how do I check whether MSK events are added?

data:
  {{.recipe}}.yaml: |
    source:
      type: kafka-connect
      config:
        connect_uri: valid_uri
        platform_instance: cdc-connect
        platform_instance_map:
          dynamodb: dynamodb
          documentdb: documentdb
          postgres: rds
          mongodb: mongo
          s3: s3
          kafka: kafka-msk-cdc-us-east-1-dev

To add multiple Kafka instances, specifically for MSK (Managed Streaming for Apache Kafka), and to verify if MSK events are being ingested, you can follow these steps:

  1. Add Multiple Kafka Instances:

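    • In the kafka-connect source, the keys of platform_instance_map are platform names (kafka, postgres, s3, and so on), and each platform maps to a single instance. Made-up keys such as kafka_instance_1 do not correspond to any platform, so adding them would not be expected to change what is ingested, which would explain seeing the same number of resources either way. To assign a different Kafka instance per connector, the kafka-connect source documents a connect_to_platform_map option keyed by connector name. Here’s an example configuration (the connector names are placeholders for your own):

      source:
        type: kafka-connect
        config:
          connect_uri: valid_uri
          platform_instance: cdc-connect
          platform_instance_map:
            dynamodb: dynamodb
            documentdb: documentdb
            postgres: rds
            mongodb: mongo
            s3: s3
            kafka: kafka-msk-cdc-us-east-1-dev
          connect_to_platform_map:
            connector_1:  # placeholder: name of a connector writing to MSK instance 1
              kafka: kafka-msk-instance-1
            connector_2:  # placeholder
              kafka: kafka-msk-instance-2
            connector_3:  # placeholder
              kafka: kafka-msk-instance-3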
      
    • Ensure that each Kafka instance is correctly configured and accessible from your environment.

  2. Check if MSK Events are Added:

    • To verify if events from your MSK instances are being ingested, you can check the logs of your DataHub ingestion process. Look for any errors or warnings that might indicate issues with the ingestion.
    • Additionally, you can use monitoring tools or dashboards provided by AWS MSK to track the flow of events and ensure that they are being processed as expected.
    • You can also verify the ingestion by checking the DataHub UI or API to see if the expected datasets and metadata are being populated.
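One way to make the UI/API check concrete is to look at the dataset URNs that ingestion produced. The following is a minimal sketch, assuming DataHub's usual dataset URN layout, in which a configured platform instance is prefixed to the dataset name with a dot. The helper names and topic names are hypothetical, and the list of URNs is assumed to have been exported from DataHub by whatever means you already use.

```python
def kafka_dataset_urn(topic, platform_instance=None, env="PROD"):
    """Build the URN a Kafka topic is expected to have (assumed URN layout)."""
    name = f"{platform_instance}.{topic}" if platform_instance else topic
    return f"urn:li:dataset:(urn:li:dataPlatform:kafka,{name},{env})"


def missing_instances(urns, expected_instances):
    """Return the expected platform instances that no Kafka dataset URN refers to."""
    prefix = "urn:li:dataset:(urn:li:dataPlatform:kafka,"
    seen = set()
    for urn in urns:
        if not urn.startswith(prefix):
            continue  # not a Kafka dataset
        # Dataset name is the part between the platform and the environment.
        name = urn[len(prefix):].split(",", 1)[0]
        for instance in expected_instances:
            if name.startswith(instance + "."):
                seen.add(instance)
    return sorted(set(expected_instances) - seen)


if __name__ == "__main__":
    # Hypothetical URNs, standing in for an export from DataHub.
    exported = [
        kafka_dataset_urn("orders", "kafka-msk-instance-1"),
        kafka_dataset_urn("users", "kafka-msk-instance-2"),
    ]
    expected = ["kafka-msk-instance-1", "kafka-msk-instance-2", "kafka-msk-instance-3"]
    print("Instances with no datasets:", missing_instances(exported, expected))
```

If an instance shows up as missing, the mapping for that instance is probably not being applied, or no connector writes to it.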

Would you like more detailed step-by-step guidance on any of these points?

<@U06TM7M40RK> how can I make sure that all kafka instances are used?

To ensure that all Kafka instances are being utilized effectively, you can follow these steps:

  1. Configure Load Balancing:

    • Within a single MSK cluster, list several broker addresses in the bootstrap.servers configuration of your Kafka clients (producers and consumers). The client discovers the remaining brokers from cluster metadata and then talks to whichever brokers lead the partitions it uses. Separate MSK clusters each need their own client configuration, because bootstrap.servers cannot span clusters.
  2. Monitor Kafka Cluster:

    • Use monitoring tools to observe the load distribution across your Kafka instances. AWS provides CloudWatch metrics for MSK, which can help you monitor the performance and utilization of each broker in your Kafka cluster.
    • Check metrics such as BytesInPerSec, BytesOutPerSec, and MessagesInPerSec to ensure that traffic is evenly distributed across all instances.
  3. Partitioning Strategy:

    • Ensure that your topics are configured with an appropriate number of partitions. The number of partitions should be a multiple of the number of brokers to allow for even distribution of data.
    • Use a partitioning strategy that distributes messages evenly across partitions. This can be achieved by using a key that results in a uniform hash distribution.
  4. Consumer Group Configuration:

    • Consumers in a group split a topic's partitions among themselves, so run enough consumers (up to the partition count) for reads to be spread across all brokers that lead those partitions.
    • Monitor consumer lag to ensure that consumers are keeping up with the data being produced.
  5. Test and Validate:

    • Perform load testing to validate that all Kafka instances are being utilized. You can use tools like Apache JMeter or custom scripts to simulate load and observe the distribution of traffic across your Kafka cluster.
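The "evenly distributed" check in step 2 can be turned into a number. This is a minimal sketch, assuming you have already pulled one BytesInPerSec value per broker from CloudWatch. The broker IDs, the sample values, and both function names are hypothetical, and the skew ratio is just one simple way to summarize imbalance.

```python
from statistics import mean


def traffic_skew(bytes_in_per_broker):
    """Return max/mean of per-broker inbound throughput (1.0 means perfectly even)."""
    values = list(bytes_in_per_broker.values())
    if not values:
        raise ValueError("no broker metrics supplied")
    average = mean(values)
    if average == 0:
        return 1.0  # no traffic at all, so nothing is skewed
    return max(values) / average


def idle_brokers(bytes_in_per_broker, threshold=0.0):
    """Return broker IDs whose inbound throughput is at or below the threshold."""
    return sorted(b for b, v in bytes_in_per_broker.items() if v <= threshold)


if __name__ == "__main__":
    # Hypothetical per-broker BytesInPerSec samples.
    sample = {"broker-1": 300.0, "broker-2": 0.0, "broker-3": 0.0}
    print("skew:", traffic_skew(sample))
    print("idle:", idle_brokers(sample))
```

A skew close to 1.0 with no idle brokers suggests all brokers are in use. A large skew usually points back to the partition count or the key distribution described in step 3.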

Would you like more detailed steps or further assistance on a specific part of the process?
