Troubleshooting DataHub ingestion issues due to resource constraints on a laptop

Original Slack Thread

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g., <@U06TM7M40RK>) to have it continue generating responses!

It looks like ingestion is overwhelming the DataHub REST endpoint because your laptop is short on resources. Here are some steps you can take to address this:

  1. Reduce Parallelism: You can reduce the number of threads used for ingestion to decrease the load on your system. According to the <https://datahubproject.io/docs/0.14.0/metadata-ingestion/sink_docs/datahub|DataHub documentation>, you can adjust the max_threads parameter in your ingestion recipe. Try setting it to a lower value, such as 5 or even 1, to see if it helps.

    Example configuration:

    sink:
      type: "datahub-rest"
      config:
        server: "http://localhost:8080"
        max_threads: 5
    
  2. Increase Retry Limits: If reducing the number of threads doesn’t fully resolve the issue, you might also consider increasing the retry_max_times parameter to allow more retries for failed requests.

  3. Monitor Resource Usage: Keep an eye on your system’s CPU and memory usage. If possible, close other applications to free up resources for the ingestion process.

  4. Batch Size and Delay: If the above steps do not resolve the issue, consider adjusting the batch size and delay settings in your ingestion configuration to further reduce the load.
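
To see why a lower max_threads reduces load, here is a minimal sketch that runs simulated requests on a bounded thread pool and records how many are in flight at once. It is an illustration only, not DataHub's implementation: peak_concurrency and fake_request are hypothetical names, and the sleep stands in for a real HTTP call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(num_requests: int, max_threads: int) -> int:
    """Run simulated requests on a bounded pool; return the most in flight at once."""
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def fake_request(_: int) -> None:
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        time.sleep(0.01)  # stand-in for an HTTP call to the server
        with lock:
            in_flight -= 1

    # The pool never runs more than max_threads requests at the same time
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        list(pool.map(fake_request, range(num_requests)))
    return peak


if __name__ == "__main__":
    for threads in (1, 5, 15):
        print(threads, "threads -> peak in flight:", peak_concurrency(30, threads))
```

The server never sees more simultaneous requests than the pool size, which is why dropping the thread count is usually the first thing to try on a constrained machine.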

Would you like more detailed steps on any of these points or further assistance on a specific part of the process?

<@U06TM7M40RK> Where would I set retry_max_times in point 2? How would I set batch size and delay settings in point 4?

To configure retry_max_times, batch size, and delay settings in your DataHub ingestion setup, you need to adjust the configuration in your ingestion recipe. Here’s how you can do it:

Setting retry_max_times

The retry_max_times parameter can be set in the sink configuration of your ingestion recipe. This parameter specifies the maximum number of times to retry if an HTTP request fails. Here’s an example of how to set it:

sink:
  type: "datahub-rest"
  config:
    server: "http://localhost:8080"
    retry_max_times: 5  # Set the maximum number of retries
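
To make the effect of a retry limit concrete, here is a conceptual sketch of retrying a failed request with an increasing delay. It is not DataHub's actual retry logic: call_with_retries, base_delay, and the choice of ConnectionError are assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    request: Callable[[], T],
    retry_max_times: int,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call request(); on ConnectionError retry up to retry_max_times times."""
    attempt = 0
    while True:
        try:
            return request()
        except ConnectionError:
            if attempt >= retry_max_times:
                raise  # retries exhausted, surface the original error
            # Wait longer after each failure: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

A higher limit gives an overloaded server more chances to recover, at the cost of a longer wait before a genuinely failing request is reported.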

Configuring Batch Size and Delay

The provided data sources do not explicitly mention recipe parameters for batch size or delay, so I can’t point you to a specific setting. In general, you can still reduce the load by having the ingestion process handle smaller batches and pause between them.

For example, if you are using a custom script or tool for ingestion, you might introduce a delay using a sleep function or adjust the batch size by limiting the number of records processed at a time.

Here’s a conceptual example of how you might configure these settings in a Python script:

import time

# Placeholder: replace with your own logic for sending a batch of records
def process_batch(batch):
    print(f"Processing {len(batch)} records")

# Process data in fixed-size batches, pausing between batches
def process_data_in_batches(data, batch_size, delay):
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        process_batch(batch)
        # Pause between batches, but not after the last one
        if i + batch_size < len(data):
            time.sleep(delay)

# Example usage
data = list(range(250))  # Stand-in for your records to be ingested
batch_size = 100  # Number of records per batch
delay = 2  # Delay in seconds between batches

process_data_in_batches(data, batch_size, delay)

Additional Considerations

  • Monitor Performance: After making these changes, monitor the performance of your ingestion process to ensure that it is running smoothly and not overwhelming your system.
  • Adjust as Needed: You may need to experiment with different values for retry_max_times, batch size, and delay to find the optimal configuration for your setup.
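
For a quick check of whether the laptop itself is the bottleneck, here is a small sketch using only the Python standard library. The helper names and the threshold of 1.0 are assumptions, and os.getloadavg is only available on Unix-like systems, so the function returns None elsewhere.

```python
import os
from typing import Optional


def load_per_core() -> Optional[float]:
    """Return the 1-minute load average divided by CPU count, or None if unavailable."""
    try:
        one_minute, _, _ = os.getloadavg()
    except (AttributeError, OSError):
        return None  # e.g. Windows, where os.getloadavg does not exist
    cores = os.cpu_count() or 1
    return one_minute / cores


def is_overloaded(threshold: float = 1.0) -> bool:
    """True when the load per core is known and exceeds the threshold."""
    load = load_per_core()
    return load is not None and load > threshold


if __name__ == "__main__":
    print("load per core:", load_per_core(), "overloaded:", is_overloaded())
```

Running this before and during ingestion gives a rough signal for whether lowering max_threads or closing other applications is making a difference.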

Would you like more detailed steps or further assistance on any specific part of the process?

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you require a person to take a look :slight_smile:

Hope this helps! ><