Ingesting Connectors from Multiple Kafka Connect Clusters with Different URIs

Original Slack Thread

I have multiple connect clusters for each kafka cluster. How do I ingest all these connectors for a given cluster? Each connect cluster has its own connect uri <@U06TM7M40RK>

To ingest connectors from multiple Kafka Connect clusters, each with its own connect URI, you can configure your DataHub ingestion recipe to include each connect URI separately. Here’s a general approach based on the available data sources:

  1. Define Multiple Sources: Define one kafka-connect source per Kafka Connect cluster, each with its own connect_uri. Since a DataHub recipe runs a single source, in practice this usually means one recipe file (or scheduled ingestion run) per Connect cluster.

  2. Use Platform Instance Mapping: If you have multiple instances of Kafka or source/sink systems, you can use the platform_instance_map or connect_to_platform_map to specify the platform instance for each connector. This helps in generating the correct lineage edges.

  3. Example Configuration: Below is an example configuration that demonstrates how to set up multiple Kafka Connect sources in a single ingestion recipe:

source:
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect-cluster1:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        mysql: "mysql_platform_instance1"
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect-cluster2:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        mysql: "mysql_platform_instance2"
  4. Advanced Configurations: If you have specific configurations provided from external sources, you can include them in the provided_configs section of your recipe. This is useful if your Kafka Connect setup uses pluggable configuration providers.

  5. Platform Instance Precedence: Note that the platform_instance specified in connect_to_platform_map takes precedence over platform_instance_map if both are used.
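As a sketch of points 4 and 5 above (the connector names, instance names, and hosts are placeholders; field names follow the DataHub kafka-connect source docs, but verify against your DataHub version):

```
source:
  type: "kafka-connect"
  config:
    connect_uri: "http://connect-cluster1:8083"
    # Resolve values injected by a pluggable ConfigProvider
    provided_configs:
      - provider: "env"
        path_key: "DB_CONNECTION_URL"
        value: "jdbc:mysql://mysql-host:3306/db"
    # Default instance for every mysql-based connector on this cluster...
    platform_instance_map:
      mysql: "mysql_platform_instance1"
    # ...overridden per connector here; connect_to_platform_map wins when both are set
    connect_to_platform_map:
      my_mysql_source_connector:
        mysql: "mysql_platform_instance2"
```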

For more detailed guidance, you can refer to the Kafka Connect | DataHub Connect Source Configuration documentation.

Would you like more detailed step-by-step guidance on configuring these settings?

Sources:

  • Kafka Connect | DataHub Connect Source Configuration

say I have connect_uri_1, connect_uri_2 etc and then I have a single instance kafka-msk_platform-instance-1

give me an example recipe. One connector could be s3 sink, the other could be source-dynamodb connect or source-postgres-connect. Help me with the recipe here <@U06TM7M40RK>

To create an ingestion recipe for multiple Kafka Connect clusters with different connectors (e.g., S3 sink, DynamoDB source, Postgres source), you can define each connect URI separately in your recipe. Here’s an example configuration:

source:
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_1:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        s3: "s3_platform_instance"
      connect_to_platform_map:
        s3_sink_connector:
          kafka: "kafka-msk_platform-instance-1"

  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_2:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        dynamodb: "dynamodb_platform_instance"
      connect_to_platform_map:
        dynamodb_source_connector:
          kafka: "kafka-msk_platform-instance-1"

  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_3:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        postgres: "postgres_platform_instance"
      connect_to_platform_map:
        postgres_source_connector:
          kafka: "kafka-msk_platform-instance-1"

sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-gms:8080"

Explanation:

  • Multiple Sources: Each kafka-connect source block corresponds to a different Kafka Connect cluster, specified by its connect_uri.
  • Platform Instance Mapping: The platform_instance_map maps each source/sink platform (e.g., s3, dynamodb, postgres) to its platform instance; it applies to all connectors on that cluster that use the platform.
  • Connect to Platform Map: The connect_to_platform_map specifies the mapping of each connector to the Kafka platform instance. This ensures that the correct lineage is generated for each connector.
  • Sink Configuration: The datahub-rest sink is used to send the ingested metadata to the DataHub server.

This setup allows you to ingest metadata from multiple Kafka Connect clusters, each with different types of connectors, into a single DataHub instance.

Would you like more detailed step-by-step guidance on any specific part of this configuration?


<@U06TM7M40RK> each connector uri might have multiple connectors few of which might correspond to a different kafka cluster

To handle a scenario where each Kafka Connect URI might have multiple connectors, and some of these connectors correspond to different Kafka clusters, you can use the connect_to_platform_map to specify the platform instance for each connector. This allows you to map each connector to its respective Kafka cluster.

Here’s an example configuration that demonstrates how to set this up:

source:
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_1:8083"
      username: "admin"
      password: "password"
      connect_to_platform_map:
        s3_sink_connector:
          kafka: "kafka-msk_platform-instance-1"
        dynamodb_source_connector:
          kafka: "kafka-msk_platform-instance-2"

  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_2:8083"
      username: "admin"
      password: "password"
      connect_to_platform_map:
        postgres_source_connector:
          kafka: "kafka-msk_platform-instance-1"
        another_connector:
          kafka: "kafka-msk_platform-instance-3"

sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-gms:8080"

Explanation:

  • Multiple Sources: Each kafka-connect source block corresponds to a different Kafka Connect cluster, specified by its connect_uri.
  • Connect to Platform Map: The connect_to_platform_map is used to map each connector to its respective Kafka platform instance. This ensures that the correct lineage is generated for each connector, even if they correspond to different Kafka clusters.
  • Sink Configuration: The datahub-rest sink is used to send the ingested metadata to the DataHub server.

This setup allows you to manage multiple connectors within each Kafka Connect cluster, mapping them to the appropriate Kafka clusters as needed.

Would you like more detailed step-by-step guidance on any specific part of this configuration?

Sources:

  • Kafka Connect | DataHub Connect Source Configuration

can

```
platform_instance_map:
  s3: "s3_platform_instance"
connect_to_platform_map:
  s3_sink_connector:
    kafka: "kafka-msk_platform-instance-1"
```

be

```
platform_instance_map:
  s3: "s3_platform_instance"
  kafka: "kafka-msk_platform-instance-1"
```

<@U06TM7M40RK>

Yes, you can use the platform_instance_map to specify a single platform instance for each platform across all connectors if the same instance is used consistently. This approach simplifies the configuration when you have a uniform platform instance for a given platform across multiple connectors.

Here’s how you can configure it:

platform_instance_map:
  s3: "s3_platform_instance"
  kafka: "kafka-msk_platform-instance-1"

Explanation:

  • Platform Instance Map: This map specifies the platform instance to use for each platform (e.g., S3, Kafka) across all connectors. It is useful when the same instance is used consistently for a platform.
  • Simplification: By using platform_instance_map, you avoid the need to specify the platform instance for each connector individually in connect_to_platform_map, provided that the same instance applies to all connectors for that platform.

This configuration is appropriate when you have a consistent platform instance for each platform across all connectors. If you have connectors that need to map to different instances, you would use connect_to_platform_map for those specific cases.
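The two maps can also be combined in one source. A hypothetical sketch where most connectors share one Kafka instance but a single sink is pinned to another (all names are placeholders):

```
source:
  type: "kafka-connect"
  config:
    connect_uri: "http://connect_uri_1:8083"
    # Cluster-wide defaults, applied to any connector not listed below
    platform_instance_map:
      kafka: "kafka-msk_platform-instance-1"
      s3: "s3_platform_instance"
    # Per-connector override; takes precedence over platform_instance_map
    connect_to_platform_map:
      special_s3_sink_connector:
        kafka: "kafka-msk_platform-instance-2"
```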

Would you like more detailed step-by-step guidance on any specific part of this configuration?

Sources:

  • Kafka Connect | DataHub Connect Source Configuration

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><