Managing MySQL Disk Space and Log Files in DataHub

Original Slack Thread

<@U06TM7M40RK> what is this urn for: urn:li:query:xxxxxx? I found that a lot of records have that urn format

The URN format urn:li:query:xxxxxx is used in DataHub to uniquely identify a query entity. According to the <https://datahubproject.io/docs/0.14.0/what-is-datahub/datahub-concepts|DataHub Concepts documentation>, a URN (Uniform Resource Name) is a type of URI used to uniquely identify any resource in DataHub. The general format is urn:<Namespace>:<Entity Type>:<ID>. In this case, li is the namespace, query is the entity type, and xxxxxx is the unique identifier for a specific query.

If you are seeing many records with this URN format, it indicates that these records are related to queries that have been cataloged or tracked within your DataHub instance. Queries in DataHub can be associated with datasets and are often used to track how data is being accessed or manipulated.
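If you want to see how many of these query records exist and how much they contribute, you can count them in DataHub's primary aspect table directly. This is a sketch: metadata_aspect_v2 is DataHub's default table name, but run it against whatever schema your deployment uses.

```
SELECT COUNT(DISTINCT urn) AS query_entities,
       COUNT(*)            AS aspect_rows
FROM metadata_aspect_v2
WHERE urn LIKE 'urn:li:query:%';
```

Note this only counts rows; the actual on-disk footprint also includes indexes and InnoDB overhead.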

Would you like more detailed guidance on how these query URNs are used or managed within DataHub?


<@U06TM7M40RK> but why is it needed? what is it for? I believe those records have bloated up the MySQL storage

The urn:li:query:xxxxxx records in DataHub are used to track and manage metadata related to queries executed on datasets. These records are important for several reasons:

  1. Data Observability: They help in understanding how data is being accessed and used across your data ecosystem. This can be crucial for auditing, compliance, and optimizing data usage.

  2. Data Discovery: By cataloging queries, DataHub can provide insights into which datasets are frequently accessed and how they are being queried, which can aid in data discovery and understanding data dependencies.

  3. Governance: Tracking queries is part of federated governance, allowing organizations to enforce policies and ensure that data usage aligns with governance standards.

However, as you mentioned, these records can contribute to storage bloat, especially if there are a large number of queries being tracked. To manage storage concerns, DataHub provides retention policies that can be configured to limit the number of versions or the age of metadata records stored in the database. According to the <https://github.com/datahub-project/datahub/blob/master/docs/advanced/db-retention.md|Database Retention documentation>, you can configure retention policies to keep only the latest versions or to delete records older than a certain age.

Would you like more detailed guidance on configuring retention policies to manage storage usage?


<@U06TM7M40RK> why is it that when I run this query to get the DB size
```
SELECT table_schema,
       ROUND(SUM(data_length + index_length) / 1024 / 1024, 1) "DB Size in MB"
FROM information_schema.tables
GROUP BY table_schema;
```
it only shows 1.8 GB of MySQL DB size

but when I run this command
```kubectl exec prerequisites-mysql-0 -n datahub -- df -hT```
it shows `/dev/nvme2n1   ext4      32G   21G   11G  67% /bitnami/mysql`?

The discrepancy between the database size reported by your SQL query and the disk usage reported by the df -hT command can be attributed to several factors:

  1. Database Files and Overhead: The SQL query you ran calculates the size of the database by summing up the data and index lengths from information_schema.tables. This only accounts for the actual data and indexes stored in the tables. However, the MySQL data directory (/bitnami/mysql in your case) may contain additional files such as logs, temporary files, and other metadata that are not accounted for in the SQL query. These files can significantly increase the disk usage.

  2. File System Overhead: The file system itself may have overhead that contributes to the total disk usage. This includes metadata and other file system structures that are not part of the database size reported by MySQL.

  3. Unused Space: MySQL may allocate more space than is currently used by the data and indexes to accommodate future growth. This pre-allocated space can also contribute to the difference in reported sizes.

  4. Retention and Logs: If there are retention policies or logs that are not being cleaned up, they can accumulate and take up additional space on the disk.

To address this, you might want to check for large log files or temporary files in the MySQL data directory and consider configuring log rotation or cleanup policies. Additionally, reviewing the MySQL configuration for any settings related to storage and retention might help in managing disk usage.
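To pin down where the extra space is going, you can compare the data directory's contents with the binary log list. The pod and namespace names below are taken from this thread; the MYSQL_ROOT_PASSWORD variable is an assumption for your setup.

```
# Break down disk usage inside the MySQL data directory
kubectl exec prerequisites-mysql-0 -n datahub -- du -sh /bitnami/mysql/data
# List binary logs and their sizes from inside MySQL
kubectl exec prerequisites-mysql-0 -n datahub -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" -e 'SHOW BINARY LOGS;'
```

If the binlog total is large, that usually explains most of the gap between the SQL-reported size and df.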

Would you like more detailed guidance on managing MySQL storage in a Kubernetes environment?

<@U06TM7M40RK> how can I solve that issue in DataHub? it eats up a lot of MySQL disk space

To address the issue of MySQL disk usage in DataHub, you can take several steps to manage and reduce the storage consumption:

  1. Configure Database Retention Policies: Implement retention policies to limit the number of metadata versions stored in the database. This can help reduce the amount of historical data that is retained. According to the <https://github.com/datahub-project/datahub/blob/master/docs/advanced/db-retention.md|Database Retention documentation>, you can configure version-based or time-based retention policies to keep only the most recent versions or to delete records older than a certain age.

    Example YAML configuration for retention:

    - entity: "*"
      aspect: "*"
      config:
        retention:
          version:
            maxVersions: 20
          time:
            maxAgeInSeconds: 2592000 # 30 days
    
  2. Log Rotation and Cleanup: Ensure that MySQL logs are rotated and old logs are deleted. This can be configured in MySQL’s configuration files. Additionally, check for any temporary files or binlogs that can be cleaned up.

  3. Increase PVC Size: If the above measures are not sufficient, consider increasing the Persistent Volume Claim (PVC) size for MySQL. This can be done by updating your Kubernetes configuration. As mentioned earlier in this thread, you can update the values.yaml file to increase the PVC size.

    Example configuration:

    mysql:
      enabled: true
      primary:
        resources:
          requests:
            cpu: 100m
            memory: 1024Mi
          limits:
            cpu: 200m
            memory: 2048Mi
        persistence:
          size: 16Gi
    
  4. Regular Maintenance: Set up regular maintenance tasks to clean up old data and optimize the database. This can include tasks like OPTIMIZE TABLE to reclaim unused space.
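A hedged sketch of what that maintenance step can look like in SQL. metadata_aspect_v2 is DataHub's default aspect table; the datahub schema name is an assumption for your deployment.

```
-- See how much reclaimable space each table is holding
SELECT table_name,
       ROUND(data_free / 1024 / 1024, 1) AS free_mb
FROM information_schema.tables
WHERE table_schema = 'datahub';

-- Rebuild the table to reclaim it (locks briefly; run off-peak)
OPTIMIZE TABLE metadata_aspect_v2;
```

On InnoDB, OPTIMIZE TABLE maps to ALTER TABLE ... FORCE, which rebuilds the table and compacts free space.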

Would you like more detailed guidance on any of these steps?


<@U06TM7M40RK> I already have retention applied. how do I set up log rotation for MySQL in DataHub?

To set up log rotation for MySQL in your DataHub deployment, you can configure MySQL’s built-in log rotation features or use external tools like logrotate. Here are the steps to configure log rotation for MySQL:

  1. MySQL Built-in Log Rotation:
    MySQL provides options to manage binary logs and error logs. You can configure these settings in the MySQL configuration file (usually my.cnf or my.ini).

    • Binary Log Rotation:
      You can set the expire_logs_days parameter to automatically purge binary logs older than a specified number of days. Note that expire_logs_days is deprecated as of MySQL 8.0 and removed in MySQL 8.4; on those versions use binlog_expire_logs_seconds instead.

      [mysqld]
      expire_logs_days = 7
      # MySQL 8.0+ equivalent (7 days):
      # binlog_expire_logs_seconds = 604800
      
    • Error Log Rotation:
      MySQL does not automatically rotate error logs, but you can use the FLUSH LOGS command to manually rotate them. You can automate this with a cron job.

  2. Using logrotate:
    If you have access to the server’s file system, you can use the logrotate utility to manage log rotation for MySQL logs. Here’s how you can set it up:

    • Create a logrotate configuration file for MySQL, typically located at /etc/logrotate.d/mysql.
    /var/log/mysql/*.log {
        daily
        rotate 7
        missingok
        compress
        delaycompress
        notifempty
        create 640 mysql adm
        sharedscripts
        postrotate
            # just if mysqld is really running
            if test -x /usr/bin/mysqladmin && \
               /usr/bin/mysqladmin ping >/dev/null 2>&1
            then
               /usr/bin/mysqladmin flush-logs
            fi
        endscript
    }
    
    • This configuration rotates MySQL logs daily, keeps 7 days of logs, compresses them, and flushes logs after rotation.
  3. Automate Log Rotation:

    • Ensure that the logrotate service is running and configured to execute regularly (usually via a cron job).
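Before relying on the new rule, it is worth testing it. The path below is illustrative and assumes the configuration file described above.

```
# Dry-run: print what logrotate would do without touching any files
logrotate -d /etc/logrotate.d/mysql
# Force one rotation to verify permissions and the postrotate script
logrotate -f /etc/logrotate.d/mysql
```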

By setting up log rotation, you can manage the size of MySQL logs and prevent them from consuming excessive disk space. Would you like more detailed guidance on any specific part of this setup?


<@U06TM7M40RK> I found that the log files eat up a lot of disk space
SHOW BINARY LOGS;
is it okay to purge those logs? will it cause an issue to datahub?

Purging binary logs in MySQL can be done safely, but it is important to understand the implications and ensure that it does not disrupt any ongoing processes that rely on these logs. Here are some considerations and steps to safely purge binary logs:

  1. Purpose of Binary Logs: Binary logs in MySQL are used for replication and point-in-time recovery. They record all changes to the database, which can be used to replicate changes to a slave server or to restore the database to a specific point in time.

  2. DataHub Considerations: If your DataHub setup does not rely on MySQL replication or point-in-time recovery, purging binary logs should not directly affect DataHub’s functionality. However, if you have replication set up, ensure that all slave servers have processed the logs before purging them.

  3. Safe Purging: You can safely purge binary logs that are no longer needed. Use the PURGE BINARY LOGS command to delete logs up to a specific point.

    PURGE BINARY LOGS TO 'mysql-bin.000010';
    

    Or, to purge logs older than a specific date:

    PURGE BINARY LOGS BEFORE '2024-01-01 00:00:00';
    
  4. Backup: Before purging, ensure that you have a recent backup of your database. This is a precautionary step to prevent data loss in case of any issues.

  5. Monitoring: After purging, monitor your MySQL server to ensure that everything is functioning as expected.

If you are unsure about the impact of purging binary logs in your specific setup, it might be helpful to consult with your database administrator or review your DataHub and MySQL configurations to ensure that no critical processes depend on these logs.

Would you like more detailed guidance on any specific aspect of this process?


<@U06TM7M40RK> is there any method to make the log file size smaller?

To reduce the size of log files in MySQL, you can implement several strategies to manage and limit log growth:

  1. Log Rotation: As previously mentioned, configure log rotation to automatically manage the size of log files. This can be done using MySQL’s built-in features or external tools like logrotate.

  2. Adjust Log Level: If applicable, reduce the verbosity of logs by adjusting the log level. For MySQL, you can configure the general log and slow query log settings to capture only necessary information.

  3. Limit Binary Log Size: You can set a maximum size for binary logs using the max_binlog_size parameter in the MySQL configuration file (my.cnf or my.ini).

    [mysqld]
    max_binlog_size = 100M
    

    This setting limits the size of individual binary log files, causing MySQL to create a new log file once the current one reaches the specified size.

  4. Expire Logs: Use the expire_logs_days parameter to automatically delete binary logs older than a specified number of days (deprecated on MySQL 8.0+, where binlog_expire_logs_seconds replaces it).

    [mysqld]
    expire_logs_days = 7
    # MySQL 8.0+ equivalent (7 days):
    # binlog_expire_logs_seconds = 604800
    
  5. Disable Unnecessary Logs: If certain logs are not needed, consider disabling them. For example, if the general query log is not required, you can disable it to save space.

    [mysqld]
    general_log = 0
    
  6. Compress Logs: If you are using logrotate, you can configure it to compress old log files, which can significantly reduce their size.

    Example logrotate configuration:

    /var/log/mysql/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        create 640 mysql adm
        sharedscripts
        postrotate
            if test -x /usr/bin/mysqladmin && \
               /usr/bin/mysqladmin ping >/dev/null 2>&1
            then
               /usr/bin/mysqladmin flush-logs
            fi
        endscript
    }
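One caveat on max_binlog_size (item 3): it only caps the size of each individual file. Total binlog disk usage is roughly write volume multiplied by the retention window, which the back-of-the-envelope model below illustrates (plain arithmetic, not a MySQL API):

```python
def binlog_disk_usage_mb(write_mb_per_day: float, retention_days: float) -> float:
    """Approximate total binlog footprint: daily binlog volume times how
    long expired logs are kept. max_binlog_size does not appear here;
    it only decides how the total is split across files."""
    return write_mb_per_day * retention_days

# 500 MB/day of binlog writes retained for 7 days is ~3500 MB on disk,
# whether that is 35 files of 100 MB or a few files of 1 GB each.
print(binlog_disk_usage_mb(500, 7))
```

So shrinking the retention window (expire_logs_days, or binlog_expire_logs_seconds on MySQL 8.0+) is what actually reduces the total, not a smaller max_binlog_size.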
    

By implementing these strategies, you can effectively manage the size of MySQL log files and prevent them from consuming excessive disk space. Would you like more detailed guidance on any specific configuration?


<@U06TM7M40RK> I’m using DataHub helm deployment. how can I reduce the log size according to your answer above?

To reduce the log size in a DataHub deployment using Helm, you can configure MySQL and other components to manage and limit log growth. Here are the steps to achieve this:

  1. Modify MySQL Configuration:
    You can adjust the MySQL configuration by updating the values.yaml file used in your Helm deployment. This file allows you to set parameters for MySQL, including log management settings.

    Example configuration in values.yaml:

    mysql:
      enabled: true
      primary:
        configuration: |
          [mysqld]
          max_binlog_size = 100M
          expire_logs_days = 7
          general_log = 0
    
    • max_binlog_size: Limits the size of individual binary log files.
    • expire_logs_days: Automatically deletes binary logs older than the specified number of days (deprecated on MySQL 8.0+, where binlog_expire_logs_seconds is the replacement).
    • general_log: Disables the general query log if it’s not needed.
  2. Implement Log Rotation:
    If you have access to the underlying file system, you can use logrotate to manage log rotation for MySQL logs. This involves creating a logrotate configuration file for MySQL logs.

    Example logrotate configuration:

    /var/log/mysql/*.log {
        daily
        rotate 7
        compress
        delaycompress
        missingok
        notifempty
        create 640 mysql adm
        sharedscripts
        postrotate
            if test -x /usr/bin/mysqladmin && \
               /usr/bin/mysqladmin ping >/dev/null 2>&1
            then
               /usr/bin/mysqladmin flush-logs
            fi
        endscript
    }
    
  3. Apply Changes:
    After updating the values.yaml file, apply the changes by upgrading your Helm release. This will update the MySQL configuration with the new settings.

    helm upgrade datahub datahub/datahub --values path/to/your/values.yaml
    
  4. Verify Deployment:
    Check the status of the MySQL pods to ensure they are running correctly with the updated configuration.

    kubectl get pods -n datahub
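You can also check that the new settings are live inside MySQL. Pod and namespace names are taken from this thread; the MYSQL_ROOT_PASSWORD variable is an assumption for your setup.

```
kubectl exec prerequisites-mysql-0 -n datahub -- \
  mysql -uroot -p"$MYSQL_ROOT_PASSWORD" \
  -e "SHOW VARIABLES LIKE 'max_binlog_size';"
```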
    

By following these steps, you can effectively manage the size of MySQL logs in your DataHub Helm deployment. Would you like more detailed guidance on any specific part of this process?


<@U06TM7M40RK> if I put some values in primary.configuration, will it replace the whole my.cnf file, or it’ll only append?

Whether the values you put in primary.configuration replace or extend my.cnf depends on the chart. In the Bitnami MySQL chart used by the DataHub prerequisites, primary.configuration holds the entire my.cnf content as a single string value, and Helm does not merge string values: if you set it, it replaces the chart's default configuration file wholesale rather than appending to it.

Here is how to handle that safely:

  • Start from the defaults: Copy the default primary.configuration block from the chart's own values.yaml (you can render it with helm show values) and add your settings to it, so the defaults the image relies on are preserved.

  • Example Configuration:

    mysql:
      enabled: true
      primary:
        configuration: |
          [mysqld]
          # ...keep the chart's default [mysqld] settings here...
          max_binlog_size = 100M
          expire_logs_days = 7
          general_log = 0
    
  • Result: The rendered my.cnf contains exactly what you put in primary.configuration, so any default setting you omit is dropped. The chart also supports primary.existingConfigmap if you prefer to manage the full file in your own ConfigMap.

This approach lets you customize log management while retaining the default settings the deployment depends on.
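To confirm what actually ended up in the rendered my.cnf, inspect the ConfigMap the chart generates. The prerequisites release and prerequisites-mysql ConfigMap names follow the pod name seen in this thread and are assumptions for your deployment.

```
# Show the values currently applied to the release
helm get values prerequisites -n datahub
# Show the rendered MySQL configuration
kubectl get configmap prerequisites-mysql -n datahub -o yaml
```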

Would you like more detailed guidance on configuring MySQL in your Helm deployment?


<@U06TM7M40RK> but max_binlog_size just splits the log into smaller files, right? it'll still use the same total amount of storage. how can I reduce the total storage the log files take up?