Installing DataHub Kafka Sink Plugin in Helm Configuration

Original Slack Thread

<@U06TM7M40RK> how to install datahub-kafka sink plugin in helm?

To install the datahub-kafka sink plugin in a Helm deployment of DataHub, you need to ensure that the necessary configurations are set up in your Helm chart. Here are the steps to achieve this:

  1. Add DataHub Helm Repository: First, ensure that you have added the DataHub Helm repository to your Helm setup.

    helm repo add datahub https://helm.datahubproject.io/
    helm repo update
    
  2. Modify values.yaml: Add the datahub-kafka sink plugin to the extraPipPackages section of the Helm chart configuration so that it is pip-installed when the container starts. Make sure you set it on the component that actually executes your ingestion runs.

    Here is an example configuration:

    datahub:
      gms:
        extraPipPackages:
          - "acryl-datahub[datahub-kafka]"
    
  3. Deploy DataHub with Helm: After updating the values.yaml file, deploy or upgrade your DataHub Helm chart with the modified configuration.

    helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
    
  4. Verify Installation: Verify that the datahub-kafka plugin is installed by checking the container logs or by listing the installed plugins with the DataHub CLI (a pod-level check is also sketched after this list).

    kubectl exec -it <gms-pod-name> -- datahub check plugins --verbose | grep -i kafka
    

This setup ensures that the datahub-kafka sink plugin is included in your DataHub deployment on Kubernetes using Helm.
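
If you prefer a pod-level check, here is a minimal sketch, assuming pip and the DataHub CLI are available inside the container you added extraPipPackages to (the pod name is a placeholder):

    # List pods and pick the one configured with extraPipPackages
    kubectl get pods
    # Confirm the package was pip-installed at container startup
    kubectl exec -it <pod-name> -- pip show acryl-datahub
    # Or query the plugin registry directly
    kubectl exec -it <pod-name> -- datahub check plugins --verbose | grep -i kafka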

Would you like more detailed steps or further customization options?

<@U06TM7M40RK> how to deploy datahub helm and pass recipes yaml on startup installation?

To deploy DataHub using Helm and pass ingestion recipes as YAML files during startup, you can follow these steps:

  1. Add DataHub Helm Repository: Ensure you have added the DataHub Helm repository to your Helm setup.

    helm repo add datahub https://helm.datahubproject.io/
    helm repo update
    
  2. Create a ConfigMap for Your Recipe: Store your ingestion recipe in a Kubernetes ConfigMap. This ConfigMap will be used by the DataHub ingestion cron job.

    Here is an example of how to define a ConfigMap in a YAML file:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: recipe-config
    data:
      my_recipe.yml: |-
        source:
          type: mysql
          config:
            host_port: "localhost:3306"
            database: "your_database"
            username: "your_username"
            password: "your_password"
        sink:
          type: datahub-rest
          config:
            server: "http://datahub-gms:8080"  # Ensure this matches your DataHub GMS service
    

    Apply this ConfigMap to your Kubernetes cluster:

    kubectl apply -f path/to/your/configmap.yaml
    
  3. Modify values.yaml for the Ingestion Cron Job: Update your values.yaml file so the ingestion cron job uses the ConfigMap (a sketch for manually triggering the resulting CronJob follows this list).

    Here is an example configuration:

    datahub-ingestion-cron:
      enabled: true
      crons:
        my_ingestion_job:
          schedule: "0 * * * *" # Adjust the schedule as needed
          recipe:
            configmapName: recipe-config
            fileName: my_recipe.yml
    
  4. Deploy DataHub with Helm: Deploy or upgrade your DataHub Helm chart with the modified configuration.

    helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
    

This setup ensures that your ingestion recipes are passed to the DataHub deployment during startup using a ConfigMap and a cron job.
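
Rather than waiting for the next scheduled run, you can trigger the CronJob once by hand to validate the recipe. A minimal sketch, assuming the chart created a CronJob in the current namespace (its exact name depends on your release, so look it up first):

    # Find the CronJob created by the chart
    kubectl get cronjobs
    # Run it once as a one-off Job and follow the logs
    kubectl create job --from=cronjob/<cronjob-name> recipe-test-run
    kubectl logs -f job/recipe-test-run

If the logs show the pipeline completing successfully, the scheduled runs will execute the same recipe from the ConfigMap.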

Would you like more detailed steps or further customization options?

<@U06TM7M40RK> datahub.ingestion.run.pipeline.PipelineInitError: Failed to find a registered sink for type datahub-kafka: datahub-kafka is disabled; try running: pip install 'acryl-datahub[datahub-kafka]'

The error message you’re encountering indicates that the datahub-kafka sink is not installed or enabled in your environment. To resolve this issue, you need to ensure that the datahub-kafka plugin is installed correctly. Here are the steps to address this:

  1. Install the datahub-kafka Plugin: Install the plugin with pip by running the following command:

    pip install 'acryl-datahub[datahub-kafka]'
    

    This command will install the necessary dependencies for using the datahub-kafka sink.

  2. Verify Installation: After installation, verify that the plugin is correctly installed by checking the active plugins. You can do this using the DataHub CLI:

    datahub check plugins
    

    This command should list datahub-kafka as one of the installed plugins.

  3. Update Helm Chart (if applicable): If you are deploying DataHub using Helm, ensure that your Helm chart is configured to include the datahub-kafka plugin. You can do this by adding it to the extraPipPackages section in your values.yaml file:

    datahub:
      gms:
        extraPipPackages:
          - "acryl-datahub[datahub-kafka]"
    

    Then, redeploy your Helm chart:

    helm upgrade --install datahub datahub/datahub -f path/to/your/values.yaml
    
  4. Check for Configuration Issues: Ensure that your ingestion recipe is correctly configured to use the datahub-kafka sink. Here is an example sink configuration (a complete runnable recipe is sketched below):

    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: 'localhost:9092'
          schema_registry_url: 'http://localhost:8081'
    

By following these steps, you should be able to resolve the PipelineInitError related to the datahub-kafka sink.
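
To confirm the fix end to end, you can run a small recipe through the CLI. Here is a minimal sketch that reuses the MySQL source from earlier in this thread; the host, credentials, and Kafka/schema-registry addresses are placeholders for your environment:

    # recipe.yml -- minimal source-to-kafka pipeline for testing the sink
    source:
      type: mysql
      config:
        host_port: "localhost:3306"
        database: "your_database"
        username: "your_username"
        password: "your_password"
    sink:
      type: datahub-kafka
      config:
        connection:
          bootstrap: "localhost:9092"
          schema_registry_url: "http://localhost:8081"

Run it with datahub ingest -c recipe.yml; if the sink initializes without raising the PipelineInitError, the plugin is installed correctly.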

Would you like more detailed steps or further assistance with any specific part of the setup?

<@U06TM7M40RK> datahub is taking too long to finish ingestion from bigquery using GMS as sink.

To address the issue of slow ingestion from BigQuery using GMS as the sink in DataHub, there are several steps and considerations you can take to diagnose and improve the performance:

  1. Check Server-Side Processing: Server-side processing is a common bottleneck; community reports on Slack note that it can significantly impact ingestion time. Ensure the server has sufficient CPU and memory allocated, and upgrade it if necessary.

  2. Review Configuration and Resources: Ensure that your DataHub instance has sufficient resources allocated; scaling them up can mitigate performance issues. Also check whether specific flags in your recipe are extending run times, since some ingestion settings need to be tuned for optimal performance.

  3. Scaling and Ingestion Strategy: Scaling the number of GMS replicas or using standalone consumers can improve ingestion throughput; this may require some familiarity with scaling Kafka topic consumption. Community threads on Slack report that these scaling strategies help (a values.yaml sketch follows this list).

  4. Optimize BigQuery Ingestion: For performance reasons, DataHub only profiles the latest partition for partitioned tables and the latest shard for sharded tables. Adjusting the partition explicitly with the partition.partition_datetime property can help. More details can be found in the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/sources/bigquery/bigquery_pre.md|BigQuery Ingestion Documentation>.

  5. Example Configuration Adjustment: Whether you run ingestion from the UI or the CLI, make sure your BigQuery recipe settings are tuned. Here is an example configuration:

    source:
      type: bigquery
      config:
        project_id: "your_project_id"
        credentials_path: "/path/to/credentials.json"
        profiling:
          enabled: true
          partition:
            partition_datetime: "2023-01-01T00:00:00Z"  # Adjust this value based on your needs
        number_of_threads: 10  # Adjust the number of threads for better performance
    
    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
    
  6. Additional Questions: To provide more specific guidance, could you please answer the following:
    • How much data are you ingesting?
    • Are there any specific parts of the ingestion process that seem to be slower than others?
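
For item 3, here is a minimal values.yaml sketch for scaling. replicaCount follows the chart's usual convention, but the standalone-consumer subchart names and keys are assumptions, so verify them against your chart version before applying:

    datahub-gms:
      replicaCount: 2  # run multiple GMS replicas (assumes the chart's standard replicaCount key)
    # Standalone consumers offload Kafka topic consumption from GMS.
    # Subchart names and keys below are assumptions; check your chart's values.yaml.
    datahub-mae-consumer:
      enabled: true
    datahub-mce-consumer:
      enabled: true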

Would you like more detailed steps on any of these points or further assistance on a specific part of the process?
