Ingesting Connectors from Multiple Kafka Connect Clusters with Different URIs

Original Slack Thread

I have multiple connect clusters for each kafka cluster. How do I ingest all these connectors for a given cluster? Each connect cluster has its own connect uri <@U06TM7M40RK>

To ingest connectors from multiple Kafka Connect clusters, each with its own connect URI, you can configure your DataHub ingestion recipe to include each connect URI separately. Here’s a general approach based on the available data sources:

  1. Define Multiple Sources: Define one kafka-connect source per Kafka Connect cluster, each with its own connect_uri. Since a DataHub recipe runs a single source, in practice this usually means one recipe file (or scheduled ingestion run) per Connect cluster.

  2. Use Platform Instance Mapping: If you have multiple instances of Kafka or source/sink systems, you can use the platform_instance_map or connect_to_platform_map to specify the platform instance for each connector. This helps in generating the correct lineage edges.

  3. Example Configuration: Below is an example configuration that demonstrates how to set up multiple Kafka Connect sources in a single ingestion recipe:

source:
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect-cluster1:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        mysql: "mysql_platform_instance1"
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect-cluster2:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        mysql: "mysql_platform_instance2"
  4. Advanced Configurations: If you have specific configurations provided from external sources, you can include them in the provided_configs section of your recipe. This is useful if your Kafka Connect setup uses pluggable configuration providers.

  5. Platform Instance Precedence: Note that the platform_instance specified in connect_to_platform_map takes precedence over platform_instance_map if both are used.
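As a sketch of points 4 and 5 above (the connector names, instance names, and hosts are placeholders; field names follow the DataHub kafka-connect source docs, but verify against your DataHub version):

```
source:
  type: "kafka-connect"
  config:
    connect_uri: "http://connect-cluster1:8083"
    # Resolve values injected by a pluggable ConfigProvider
    provided_configs:
      - provider: "env"
        path_key: "DB_CONNECTION_URL"
        value: "jdbc:mysql://mysql-host:3306/db"
    # Default instance for every mysql-based connector on this cluster...
    platform_instance_map:
      mysql: "mysql_platform_instance1"
    # ...overridden per connector here; connect_to_platform_map wins when both are set
    connect_to_platform_map:
      my_mysql_source_connector:
        mysql: "mysql_platform_instance2"
```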

For more detailed guidance, you can refer to the Kafka Connect | DataHub Connect Source Configuration documentation.

Would you like more detailed step-by-step guidance on configuring these settings?

Sources:

  • Kafka Connect | DataHub Connect Source Configuration

say I have connect_uri_1, connect_uri_2 etc and then I have a single instance kafka-msk_platform-instance-1

give me an example recipe. One connector could be s3 sink, the other could be source-dynamodb connect or source-postgres-connect. Help me with the recipe here <@U06TM7M40RK>

To create an ingestion recipe for multiple Kafka Connect clusters with different connectors (e.g., S3 sink, DynamoDB source, Postgres source), you can define each connect URI separately in your recipe. Here’s an example configuration:

source:
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_1:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        s3: "s3_platform_instance"
      connect_to_platform_map:
        s3_sink_connector:
          kafka: "kafka-msk_platform-instance-1"

  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_2:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        dynamodb: "dynamodb_platform_instance"
      connect_to_platform_map:
        dynamodb_source_connector:
          kafka: "kafka-msk_platform-instance-1"

  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_3:8083"
      username: "admin"
      password: "password"
      platform_instance_map:
        postgres: "postgres_platform_instance"
      connect_to_platform_map:
        postgres_source_connector:
          kafka: "kafka-msk_platform-instance-1"

sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-gms:8080"

Explanation:

  • Multiple Sources: Each kafka-connect source block corresponds to a different Kafka Connect cluster, specified by its connect_uri.
  • Platform Instance Mapping: The platform_instance_map maps each source/sink platform (e.g., s3, dynamodb, postgres) to its platform instance; it applies to all connectors on that cluster that use the platform.
  • Connect to Platform Map: The connect_to_platform_map specifies the mapping of each connector to the Kafka platform instance. This ensures that the correct lineage is generated for each connector.
  • Sink Configuration: The datahub-rest sink is used to send the ingested metadata to the DataHub server.

This setup allows you to ingest metadata from multiple Kafka Connect clusters, each with different types of connectors, into a single DataHub instance.

Would you like more detailed step-by-step guidance on any specific part of this configuration?


<@U06TM7M40RK> each connector uri might have multiple connectors few of which might correspond to a different kafka cluster

To handle a scenario where each Kafka Connect URI might have multiple connectors, and some of these connectors correspond to different Kafka clusters, you can use the connect_to_platform_map to specify the platform instance for each connector. This allows you to map each connector to its respective Kafka cluster.

Here’s an example configuration that demonstrates how to set this up:

source:
  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_1:8083"
      username: "admin"
      password: "password"
      connect_to_platform_map:
        s3_sink_connector:
          kafka: "kafka-msk_platform-instance-1"
        dynamodb_source_connector:
          kafka: "kafka-msk_platform-instance-2"

  - type: "kafka-connect"
    config:
      connect_uri: "http://connect_uri_2:8083"
      username: "admin"
      password: "password"
      connect_to_platform_map:
        postgres_source_connector:
          kafka: "kafka-msk_platform-instance-1"
        another_connector:
          kafka: "kafka-msk_platform-instance-3"

sink:
  type: "datahub-rest"
  config:
    server: "http://datahub-gms:8080"

Explanation:

  • Multiple Sources: Each kafka-connect source block corresponds to a different Kafka Connect cluster, specified by its connect_uri.
  • Connect to Platform Map: The connect_to_platform_map is used to map each connector to its respective Kafka platform instance. This ensures that the correct lineage is generated for each connector, even if they correspond to different Kafka clusters.
  • Sink Configuration: The datahub-rest sink is used to send the ingested metadata to the DataHub server.

This setup allows you to manage multiple connectors within each Kafka Connect cluster, mapping them to the appropriate Kafka clusters as needed.

Would you like more detailed step-by-step guidance on any specific part of this configuration?

Sources:

  • Kafka Connect | DataHub Connect Source Configuration

can

```
platform_instance_map:
  s3: "s3_platform_instance"
connect_to_platform_map:
  s3_sink_connector:
    kafka: "kafka-msk_platform-instance-1"
```

be

```
platform_instance_map:
  s3: "s3_platform_instance"
  kafka: "kafka-msk_platform-instance-1"
```

<@U06TM7M40RK>

Yes, you can use the platform_instance_map to specify a single platform instance for each platform across all connectors if the same instance is used consistently. This approach simplifies the configuration when you have a uniform platform instance for a given platform across multiple connectors.

Here’s how you can configure it:

platform_instance_map:
  s3: "s3_platform_instance"
  kafka: "kafka-msk_platform-instance-1"

Explanation:

  • Platform Instance Map: This map specifies the platform instance to use for each platform (e.g., S3, Kafka) across all connectors. It is useful when the same instance is used consistently for a platform.
  • Simplification: By using platform_instance_map, you avoid the need to specify the platform instance for each connector individually in connect_to_platform_map, provided that the same instance applies to all connectors for that platform.

This configuration is appropriate when you have a consistent platform instance for each platform across all connectors. If you have connectors that need to map to different instances, you would use connect_to_platform_map for those specific cases.
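The two maps can also be combined in one source. A hypothetical sketch where most connectors share one Kafka instance but a single sink is pinned to another (all names are placeholders):

```
source:
  type: "kafka-connect"
  config:
    connect_uri: "http://connect_uri_1:8083"
    # Cluster-wide defaults, applied to any connector not listed below
    platform_instance_map:
      kafka: "kafka-msk_platform-instance-1"
      s3: "s3_platform_instance"
    # Per-connector override; takes precedence over platform_instance_map
    connect_to_platform_map:
      special_s3_sink_connector:
        kafka: "kafka-msk_platform-instance-2"
```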

Would you like more detailed step-by-step guidance on any specific part of this configuration?

Sources:

  • Kafka Connect | DataHub Connect Source Configuration

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><