Configuring Kafka as a Source Platform in DataHub

Original Slack Thread

<@U06TM7M40RK> How would the schema metadata for a Kafka source look in DataHub? Can you find some screenshots from users who have onboarded Kafka as a source platform?
Also, share some examples of lineage involving Kafka topics.

I’m not able to find any data sources to answer your question. You might want to specify the version of DataHub you are using or the specific Kafka setup details. If this doesn’t help, please reach out to the DataHub support team for further assistance.

<@U06TM7M40RK> How do I onboard Kafka as a source platform? What are the required configurations?

To onboard Kafka as a source platform in DataHub, you need to configure the Kafka ingestion source with the necessary connection and authentication details. Here are the key steps and configurations required:

  1. Kafka Source Configuration: You need to define the Kafka source in your ingestion recipe. This includes specifying the Kafka bootstrap servers, schema registry URL, and any necessary authentication details.

  2. Connection Details:

    • Bootstrap Servers: Specify the Kafka bootstrap servers. You can list multiple servers separated by commas.
    • Schema Registry URL: Provide the URL for the schema registry if you are using one.
  3. Authentication: Depending on your Kafka setup, you may need to configure authentication such as SASL or Kerberos. Note that the ingestion source is built on the confluent-kafka (librdkafka) client, so for SASL_SSL you specify the security protocol, SASL mechanism, and SASL username and password in consumer_config; the Java-style sasl.jaas.config property does not apply here.

  4. Environment Variables: Separately, DataHub's own services read their Kafka settings from environment variables such as KAFKA_BOOTSTRAP_SERVER and KAFKA_SCHEMAREGISTRY_URL, plus SPRING_KAFKA_PROPERTIES_*-prefixed security properties. These configure the DataHub deployment itself, not the ingestion recipe; see the sketch after this list.
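
For illustration, here is a minimal sketch of how those variables could be wired into the quickstart docker-compose file; the service name, broker addresses, and credential placeholders are assumptions, and the SPRING_KAFKA_PROPERTIES_* names follow the Kafka configuration guide linked below:

services:
  datahub-gms:
    environment:
      # Point DataHub's own services at your Kafka cluster.
      - KAFKA_BOOTSTRAP_SERVER=broker1:9092,broker2:9092
      - KAFKA_SCHEMAREGISTRY_URL=http://schema-registry:8081
      # Security properties for a SASL_SSL-protected cluster. GMS is a
      # Java service, so the JAAS string applies here, unlike in the
      # ingestion recipe.
      - SPRING_KAFKA_PROPERTIES_SECURITY_PROTOCOL=SASL_SSL
      - SPRING_KAFKA_PROPERTIES_SASL_JAAS_CONFIG=org.apache.kafka.common.security.plain.PlainLoginModule required username='your-username' password='your-password';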

Here is an example configuration for Kafka ingestion with SASL_SSL authentication:

source:
  type: "kafka"
  config:
    platform_instance: "YOUR_CLUSTER_ID"
    connection:
      bootstrap: "broker1:9092,broker2:9092,broker3:9092"
      consumer_config:
        security.protocol: "SASL_SSL"
        sasl.mechanism: "PLAIN"
        # The source uses the confluent-kafka client, which takes the
        # username and password directly rather than a Java JAAS string.
        sasl.username: "your-username"
        sasl.password: "your-password"
      schema_registry_url: "YOUR_SCHEMA_REGISTRY_URL"

# Optional if the CLI already knows your DataHub server (datahub init);
# otherwise point the REST sink at it explicitly.
sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

  5. Deploying the Ingestion: Once the configuration is saved as a recipe file, deploy the ingestion job with the DataHub CLI (for example, datahub ingest -c kafka_recipe.yml) or via another orchestration tool.
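
Beyond the basics, the source supports further customization, such as restricting which topics are ingested and authenticating to the schema registry. A minimal sketch, assuming a basic-auth-protected registry; the regexes, URL, and API-key placeholders are illustrative:

source:
  type: "kafka"
  config:
    connection:
      bootstrap: "broker1:9092"
      schema_registry_url: "https://your-schema-registry"
      # Assumed: registry protected by basic auth (e.g. Confluent Cloud API keys).
      schema_registry_config:
        basic.auth.user.info: "REGISTRY_API_KEY:REGISTRY_API_SECRET"
    # Only ingest topics matching the allow patterns; skip internal ones.
    topic_patterns:
      allow:
        - "orders_.*"
      deny:
        - "_internal.*"

The topic_patterns block follows DataHub's standard allow/deny regex convention.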

For more detailed guidance, you can refer to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/kafka/kafka.md|Kafka Ingestion Source Configuration> and <https://datahubproject.io/docs/0.14.0/how/kafka-config|Configuring Kafka in DataHub>.

Would you like more detailed steps or further customization options?

