Managing Large Log Files Generated by LDAP Source Recipe in Airflow DAG

Original Slack Thread

<@U06TM7M40RK> The LDAP source recipe creates a lot of logs; the file size is more than 8GB.
The recipe is called in an Airflow DAG.
Airflow is unable to render logs of this size.
What can be done?
Here’s how the recipe is called:

pipeline = Pipeline.create(pipeline_config)
pipeline.run()
pipeline.pretty_print_summary()
pipeline.raise_from_status()

The large log size you’re seeing with the LDAP source recipe in your Airflow DAG is likely due to the verbosity of the logging configuration. To address this, you might consider adjusting the logging level or configuration to reduce the amount of log data generated.

Based on the linked Slack discussion, one approach is to modify the logback configuration file used by DataHub’s Java server components. You can replace the logback configuration file in the Docker image or supply a new one at runtime by adding -Dlogback.configurationFile=/mycustom/logback.xml to the JVM parameters via the JAVA_OPTS environment variable.

Additionally, you can consider the following steps:

  1. Reduce Log Level: Change the log level to a less verbose level such as WARN or ERROR to reduce the amount of log data generated. For the Python ingestion pipeline itself, this is done through Python’s standard logging configuration (see the sketch after this list); for DataHub’s server components, it belongs in the custom logback file mentioned above.

  2. Log Rotation: Implement log rotation in your logging setup so that old logs are archived and no single log file grows unbounded.

  3. Filter Logs: If possible, filter out unnecessary log messages that are not critical for your monitoring or debugging purposes.

  4. Use Airflow’s Log Management: Ensure that Airflow’s log management settings are optimized for handling large logs. This might include configuring Airflow to write task logs to remote storage (for example, S3 or GCS via Airflow’s remote logging support) so that a single oversized file does not have to be rendered in the UI.
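
As a rough illustration of steps 1–3, the sketch below could run inside the Airflow task before the recipe starts. It assumes the LDAP source logs through Python’s standard logging under the "datahub" logger hierarchy; the logger name, filter pattern, file path, and size limits are all illustrative and should be adapted to your setup.

import logging
from logging.handlers import RotatingFileHandler

# Assumption: the DataHub ingestion framework emits its logs under the
# standard Python "datahub" logger hierarchy.
datahub_logger = logging.getLogger("datahub")

# Step 1 - reduce the log level: keep warnings and errors, drop per-record
# INFO/DEBUG output that inflates the Airflow task log.
datahub_logger.setLevel(logging.WARNING)

# Step 2 - log rotation: if the ingestion logs are also written to a file,
# cap its size instead of letting one file grow past 8GB
# (the path and limits below are examples).
rotating_handler = RotatingFileHandler(
    "/tmp/ldap_ingestion.log", maxBytes=100 * 1024 * 1024, backupCount=3
)
rotating_handler.setLevel(logging.WARNING)
datahub_logger.addHandler(rotating_handler)

# Step 3 - filter logs: drop messages that are not useful for monitoring
# (the substring below is purely illustrative).
class DropNoisyRecords(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        return "processed" not in record.getMessage()

datahub_logger.addFilter(DropNoisyRecords())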

Would you like more detailed guidance on any of these steps or further assistance with configuring the logging setup?

Sources:

  • Slack discussion

<@U06TM7M40RK> The recipe is called using

pipeline = Pipeline.create(pipeline_config)
pipeline.run()
pipeline.pretty_print_summary()
pipeline.raise_from_status()

The DataHub documentation I have ingested does not specifically cover how to handle large log files generated by a recipe invoked through pipeline = Pipeline.create(pipeline_config), pipeline.run(), pipeline.pretty_print_summary(), and pipeline.raise_from_status(). You might want to adjust the logging configuration to reduce verbosity or implement log rotation to manage log file sizes. If this doesn’t help, please reach out to the DataHub support team for further assistance.
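
As a minimal sketch of how the log-level change could be wired around the calls shown above (assuming the recipe runs in a PythonOperator-style callable and that WARNING-level output is enough for monitoring; the function name is illustrative, and the Pipeline import path is the one typically used for programmatic DataHub ingestion):

import logging

from datahub.ingestion.run.pipeline import Pipeline

def run_ldap_recipe(pipeline_config: dict) -> None:
    # Quiet the ingestion framework before the recipe runs; WARNING keeps
    # errors and warnings in the Airflow task log while dropping verbose INFO output.
    logging.getLogger("datahub").setLevel(logging.WARNING)

    pipeline = Pipeline.create(pipeline_config)
    pipeline.run()
    pipeline.pretty_print_summary()
    pipeline.raise_from_status()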