Creating a YAML recipe for Confluent Kafka ingestion into DataHub

Original Slack Thread

<@U06TM7M40RK> give a sample recipe.yaml for creating Confluent Kafka as an ingestion source

To create a recipe for ingesting metadata from Confluent Kafka into DataHub, you can use the following YAML configuration. This example demonstrates how to configure the Kafka source with Confluent Cloud, including the necessary authentication details for both the Kafka broker and the schema registry.

```yaml
source:
  type: "kafka"
  config:
    platform_instance: "YOUR_CLUSTER_ID"
    connection:
      bootstrap: "abc-defg.eu-west-1.aws.confluent.cloud:9092"
      consumer_config:
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        sasl.username: "${CLUSTER_API_KEY_ID}"
        sasl.password: "${CLUSTER_API_KEY_SECRET}"
      schema_registry_url: "https://abc-defgh.us-east-2.aws.confluent.cloud"
      schema_registry_config:
        basic.auth.user.info: "${REGISTRY_API_KEY_ID}:${REGISTRY_API_KEY_SECRET}"
    topic_patterns:
      allow: ['.*']  # Adjust this to specify which topics to include
      deny: ['_confluent.*']  # Adjust this to specify which topics to exclude

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub GMS server URL
    token: "<Your DataHub API Token>"  # Replace with your DataHub API token if required
```

Key Points:

  • Bootstrap Server: The `bootstrap` field specifies the Kafka broker's address.
  • Consumer Config: The `consumer_config` block carries the security protocol and SASL authentication details.
  • Schema Registry: The `schema_registry_url` and `schema_registry_config` fields connect to the Confluent schema registry.
  • Topic Patterns: Use `allow` and `deny` regexes to filter which topics are ingested (see the sketch after this list).
  • Sink Configuration: The `datahub-rest` sink sends the extracted metadata to your DataHub instance.
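
For instance, to restrict ingestion to a couple of business domains while still skipping Confluent-internal topics, the patterns might look like the following sketch (the topic names here are hypothetical):

```yaml
# Hypothetical topic filter: ingest only "orders" and "payments" topics,
# and skip Confluent-internal topics such as _confluent-metrics.
topic_patterns:
  allow: ['^orders.*', '^payments.*']
  deny: ['^_confluent.*']
```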

This configuration is based on the guidelines in the [DataHub Kafka Source Documentation](https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/kafka#connecting-to-confluent-cloud) and the [GitHub documentation](https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/kafka/kafka.md).
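
The `${...}` placeholders above are resolved from environment variables when the recipe runs. As a minimal sketch (the `DATAHUB_TOKEN` variable name is an assumption, not a required name), the sink token can be supplied the same way, and the saved recipe can then be executed with the DataHub CLI:

```yaml
# Sketch: read the DataHub token from an environment variable instead of
# hard-coding it. Save the recipe as recipe.yaml and run it with:
#   datahub ingest -c recipe.yaml
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
    token: "${DATAHUB_TOKEN}"
```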

Would you like more detailed steps or additional information on any specific part of the process?

<@U06TM7M40RK> below is the YAML we got from GitHub:

  type: "kafka"
  config:
    platform_instance: "YOUR_CLUSTER_ID"
    connection:
      bootstrap: "abc-defg.eu-west-1.aws.confluent.cloud:9092"
      consumer_config:
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        sasl.username: "${CLUSTER_API_KEY_ID}"
        sasl.password: "${CLUSTER_API_KEY_SECRET}"
      schema_registry_url: "<https://abc-defgh.us-east-2.aws.confluent.cloud>"
      schema_registry_config:
        basic.auth.user.info: "${REGISTRY_API_KEY_ID}:${REGISTRY_API_KEY_SECRET}"

sink:
  # sink configs```
However, in our case we don't have a schema registry URL. What will the YAML file be for us?

If you do not have a schema registry URL, it seems likely that you can simply omit the schema registry settings from the configuration. The schema registry is only used to manage schemas for Kafka topics, so if you are not using one, those fields can be left out.

Here’s how you might adjust your YAML configuration:

```yaml
source:
  type: "kafka"
  config:
    platform_instance: "YOUR_CLUSTER_ID"
    connection:
      bootstrap: "abc-defg.eu-west-1.aws.confluent.cloud:9092"
      consumer_config:
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        sasl.username: "${CLUSTER_API_KEY_ID}"
        sasl.password: "${CLUSTER_API_KEY_SECRET}"
    topic_patterns:
      allow: ['.*']  # Adjust this to specify which topics to include
      deny: ['_confluent.*']  # Adjust this to specify which topics to exclude

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"  # Replace with your DataHub GMS server URL
    token: "<Your DataHub API Token>"  # Replace with your DataHub API token if required
```

Key Adjustments:

  • Schema Registry: Removed `schema_registry_url` and `schema_registry_config`, since you do not have a schema registry.
  • Topic Patterns: Still specify which topics to include or exclude using `allow` and `deny`.

This configuration assumes that your Kafka topics do not require schema management through a registry. If your topics are using JSON or another format that does not require a schema registry, this setup should work.
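
If you adopt a schema registry later, the two fields from the earlier recipe can simply be restored under `connection` (the values below are the same placeholders used above):

```yaml
# Re-adding schema registry support later: place these keys back under
# `connection` in the source config (placeholder values from the earlier recipe).
schema_registry_url: "https://abc-defgh.us-east-2.aws.confluent.cloud"
schema_registry_config:
  basic.auth.user.info: "${REGISTRY_API_KEY_ID}:${REGISTRY_API_KEY_SECRET}"
```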

If you need further customization or have specific requirements, please let me know!

Hi 🙂! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look 🙂

Hope this helps! ><