Troubleshooting Slow Ingestion Performance After Upgrading DataHub

Original Slack Thread

Hey guys,
I have upgraded DataHub from v0.12 to v0.13.2 and my dbt, Airflow, and Snowflake ingestions from the DataHub UI are running super slow. The Looker ingestion used to take approx. 30 min and now it’s around 1 hour.
Any suggestion or thing I should check?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

<@U06TM7M40RK> Can you help with the questions in this thread?


To address the issue of slow ingestion after upgrading DataHub from v0.12 to v0.13.2, let’s break down the potential causes and solutions:

Summary
The ingestion performance degradation could be due to several factors: server-side processing bottlenecks, insufficient resources, configuration issues, or changes introduced by the new DataHub version itself.

Steps to Diagnose and Improve Performance

  1. Check Server-Side Processing:

    • According to <Slack conversation>, server-side processing can be a bottleneck, especially for Looker ingestion; upgrading or scaling the DataHub server (GMS) might improve performance.
  2. Review Configuration and Resources:

    • Ensure that your DataHub instance has sufficient resources (CPU, memory) allocated; performance issues can often be mitigated by scaling up (see the values-file sketch after this list).
    • Check whether any specific flags or configurations are adding run time; certain ingestion settings may need to be adjusted for optimal performance.
  3. Scaling and Ingestion Strategy:

    • As per <Slack thread>, scaling the number of GMS replicas or using standalone consumers can improve ingestion performance (a second sketch after this list shows the relevant values); this requires some familiarity with scaling Kafka topic consumption.
  4. Upgrade and Compatibility:

    • Ensure that all components (e.g., DataHub, Kafka, Elasticsearch) are compatible and properly configured. Sometimes, version mismatches or misconfigurations can lead to performance issues.
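
For item 2, here is a minimal sketch of raising CPU/memory for the DataHub server (GMS), assuming a deployment via the DataHub Helm chart; the keys and values below are illustrative and should be checked against your chart version’s values.yaml:

# values.yaml (illustrative): give the GMS pod more headroom
datahub-gms:
  resources:
    requests:
      cpu: "1"
      memory: "2Gi"
    limits:
      cpu: "2"
      memory: "4Gi"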
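
For item 3, a hedged sketch of the same values file with more GMS replicas and standalone Kafka consumers enabled; the exact flag names (in particular the standalone-consumer toggle) vary between chart versions, so treat them as assumptions to verify:

# values.yaml (illustrative): scale GMS and run MAE/MCE consumers as separate pods
datahub-gms:
  replicaCount: 2

global:
  # Assumed toggle name; confirm it in your chart's values.yaml
  datahub_standalone_consumers_enabled: true

datahub-mae-consumer:
  enabled: true
datahub-mce-consumer:
  enabled: true

Running the consumers standalone decouples Kafka event processing from the GMS request path, which is usually what relieves ingestion back-pressure.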

Example Configuration Adjustment

If you are using the CLI for ingestion, you might want to adjust the number of threads or other performance-related settings in your ingestion recipes. Here’s an example of how you might configure a Looker ingestion recipe:

source:
  type: looker
  config:
    base_url: "https://your.looker.instance"
    client_id: "your_client_id"
    client_secret: "your_client_secret"
    # Adjust the number of threads for better performance
    number_of_threads: 10

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"

Additional Questions
To provide more specific guidance, could you please answer the following:

  1. Are you using the UI or CLI for ingestion?
  2. Which DataHub version are you currently using (e.g., v0.13.2)?
  3. What data sources are you integrating with DataHub (e.g., dbt, Airflow, Snowflake, Looker)?

Sources

  • <Slack GMS replicas and ingestion strategy>
  • <Slack processing bottlenecks>
  • <Slack large result sets in Snowflake ingestion>

Feel free to provide more details or ask follow-up questions! :bulb: Hint: Mention @runllm in the thread for follow-ups.

I’m using the UI for the ingestion, I’m on version v0.13.2, and I’m integrating dbt, Snowflake, and Looker.