Installing DataHub Kafka Sink Plugin in Helm Configuration

Original Slack Thread

<@U06TM7M40RK> how to install datahub-kafka sink plugin in helm?

To install the datahub-kafka sink plugin in a Helm deployment of DataHub, you need to ensure that the necessary configurations are set up in your Helm chart. Here are the steps to achieve this:

  1. Add DataHub Helm Repository: First, ensure that you have added the DataHub Helm repository to your Helm setup.

    helm repo add datahub https://helm.datahubproject.io/
    helm repo update
    
  2. Modify values.yaml: Add the datahub-kafka sink plugin to the extraPipPackages section of the Helm chart configuration so that it is pip-installed when the container starts. Make sure you set it on the component that actually executes your ingestion runs.

    Here is an example configuration:

    datahub:
      gms:
        extraPipPackages:
          - "acryl-datahub[datahub-kafka]"
    
  3. Deploy DataHub with Helm: After updating the values.yaml file, deploy or upgrade your DataHub Helm chart with the modified configuration.

    helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
    
  4. Verify Installation: Verify that the datahub-kafka plugin is installed by checking the container logs or by listing the installed plugins with the DataHub CLI (a pod-level check is also sketched after this list).

    kubectl exec -it <gms-pod-name> -- datahub check plugins --verbose | grep -i kafka
    

This setup ensures that the datahub-kafka sink plugin is included in your DataHub deployment on Kubernetes using Helm.
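
If you prefer a pod-level check, here is a minimal sketch, assuming pip and the DataHub CLI are available inside the container you added extraPipPackages to (the pod name is a placeholder):

    # List pods and pick the one configured with extraPipPackages
    kubectl get pods
    # Confirm the package was pip-installed at container startup
    kubectl exec -it <pod-name> -- pip show acryl-datahub
    # Or query the plugin registry directly
    kubectl exec -it <pod-name> -- datahub check plugins --verbose | grep -i kafka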

Would you like more detailed steps or further customization options?

<@U06TM7M40RK> how to deploy datahub helm and pass recipes yaml on startup installation?

To deploy DataHub using Helm and pass ingestion recipes as YAML files during startup, you can follow these steps:

  1. Add DataHub Helm Repository: Ensure you have added the DataHub Helm repository to your Helm setup.

    helm repo add datahub https://helm.datahubproject.io/
    helm repo update
    
  2. Create a ConfigMap for Your Recipe: Store your ingestion recipe in a Kubernetes ConfigMap. This ConfigMap will be used by the DataHub ingestion cron job.

    Here is an example of how to define a ConfigMap in a YAML file:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: recipe-config
    data:
      my_recipe.yml: |-
        source:
          type: mysql
          config:
            host_port: "localhost:3306"
            database: "your_database"
            username: "your_username"
            password: "your_password"
        sink:
          type: datahub-rest
          config:
            server: "http://datahub-gms:8080"  # Ensure this matches your DataHub GMS service
    

    Apply this ConfigMap to your Kubernetes cluster:

    kubectl apply -f path/to/your/configmap.yaml
    
  3. Modify values.yaml for the Ingestion Cron Job: Update your values.yaml file so the ingestion cron job uses the ConfigMap (a sketch for manually triggering the resulting CronJob follows this list).

    Here is an example configuration:

    datahub-ingestion-cron:
      enabled: true
      crons:
        my_ingestion_job:
          schedule: "0 * * * *" # Adjust the schedule as needed
          recipe:
            configmapName: recipe-config
            fileName: my_recipe.yml
    
  4. Deploy DataHub with Helm: Deploy or upgrade your DataHub Helm chart with the modified configuration.

    helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
    

This setup ensures that your ingestion recipes are passed to the DataHub deployment during startup using a ConfigMap and a cron job.
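
Rather than waiting for the next scheduled run, you can trigger the CronJob once by hand to validate the recipe. A minimal sketch, assuming the chart created a CronJob in the current namespace (its exact name depends on your release, so look it up first):

    # Find the CronJob created by the chart
    kubectl get cronjobs
    # Run it once as a one-off Job and follow the logs
    kubectl create job --from=cronjob/<cronjob-name> recipe-test-run
    kubectl logs -f job/recipe-test-run

If the logs show the pipeline completing successfully, the scheduled runs will execute the same recipe from the ConfigMap.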

Would you like more detailed steps or further customization options?

<@U06TM7M40RK> datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered sink for type datahub-kafka: datahub-kafka is disabled; try running: pip install 'acryl-datahub[datahub-kafka]'

The error message you’re encountering indicates that the datahub-kafka sink is not installed or enabled in your environment. To resolve this issue, you need to ensure that the datahub-kafka plugin is installed correctly. Here are the steps to address this:

  1. Install the datahub-kafka Plugin: Install the plugin with pip by running the following command:

    pip install 'acryl-datahub[datahub-kafka]'
    

    This command will install the necessary dependencies for using the datahub-kafka sink.

  2. Verify Installation: After installation, verify that the plugin is correctly installed by checking the active plugins. You can do this using the DataHub CLI:

    datahub check plugins
    

    This command should list datahub-kafka as one of the installed plugins.

  3. Update Helm Chart (if applicable): If you are deploying DataHub using Helm, ensure that your Helm chart is configured to include the datahub-kafka plugin. You can do this by adding it to the extraPipPackages section in your values.yaml file:

    datahub:
      gms:
        extraPipPackages:
          - "acryl-datahub[datahub-kafka]"
    

    Then, redeploy your Helm chart:

    helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
    
  4. Check for Configuration Issues: Ensure that your ingestion recipe is correctly configured to use the datahub-kafka sink. Here is an example sink configuration (a complete runnable recipe is sketched below):

    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: 'localhost:9092'
          schema_registry_url: 'http://localhost:8081'
    

By following these steps, you should be able to resolve the PipelineInitError related to the datahub-kafka sink.
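
To confirm the fix end to end, you can run a small recipe through the CLI. Here is a minimal sketch that reuses the MySQL source from earlier in this thread; the host, credentials, and Kafka/schema-registry addresses are placeholders for your environment:

    # recipe.yml -- minimal source-to-kafka pipeline for testing the sink
    source:
      type: mysql
      config:
        host_port: "localhost:3306"
        database: "your_database"
        username: "your_username"
        password: "your_password"
    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: "localhost:9092"
          schema_registry_url: "http://localhost:8081"

Run it with datahub ingest -c recipe.yml; if the sink initializes without raising the PipelineInitError, the plugin is installed correctly.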

Would you like more detailed steps or further assistance with any specific part of the setup?

<@U06TM7M40RK> datahub is taking too long to finish ingestion from bigquery using GMS as sink.

To address the issue of slow ingestion from BigQuery using GMS as the sink in DataHub, there are several steps and considerations you can take to diagnose and improve the performance:

  1. Check Server-Side Processing: Server-side processing is a common bottleneck; community reports on Slack note that it can significantly impact ingestion time. Ensure the server has sufficient CPU and memory allocated, and upgrade it if necessary.

  2. Review Configuration and Resources: Ensure that your DataHub instance has sufficient resources allocated; scaling them up can mitigate performance issues. Also check whether specific flags in your recipe are extending run times, since some ingestion settings need to be tuned for optimal performance.

  3. Scaling and Ingestion Strategy: Scaling the number of GMS replicas or using standalone consumers can improve ingestion throughput; this may require some familiarity with scaling Kafka topic consumption. Community threads on Slack report that these scaling strategies help (a values.yaml sketch follows this list).

  4. Optimize BigQuery Ingestion: For performance reasons, DataHub only profiles the latest partition for partitioned tables and the latest shard for sharded tables. Adjusting the partition explicitly with the partition.partition_datetime property can help. More details can be found in the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/bigquery/bigquery_pre.md|BigQuery Ingestion Documentation>.

  5. Example Configuration Adjustment: Whether you run ingestion from the UI or the CLI, make sure your BigQuery recipe settings are tuned. Here is an example configuration:

    source:
      type: bigquery
      config:
        project_id: "your_project_id"
        credentials_path: "/path/to/credentials.json"
        profiling:
          enabled: true
          partition:
            partition_datetime: "2023-01-01T00:00:00Z"  # Adjust this value based on your needs
        number_of_threads: 10  # Adjust the number of threads for better performance
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    
  6. Additional Questions: To provide more specific guidance, could you please answer the following:
    • How much data are you ingesting?
    • Are there any specific parts of the ingestion process that seem to be slower than others?
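
For item 3, here is a minimal values.yaml sketch for scaling. replicaCount follows the chart's usual convention, but the standalone-consumer subchart names and keys are assumptions, so verify them against your chart version before applying:

    datahub-gms:
      replicaCount: 2  # run multiple GMS replicas (assumes the chart's standard replicaCount key)
    # Standalone consumers offload Kafka topic consumption from GMS.
    # Subchart names and keys below are assumptions; check your chart's values.yaml.
    datahub-mae-consumer:
      enabled: true
    datahub-mce-consumer:
      enabled: true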

Would you like more detailed steps on any of these points or further assistance on a specific part of the process?
