Managing Database Growth in DataHub with Retention Policies

Original Slack Thread

<@U06TM7M40RK> Over time, the database will naturally grow because earlier versions of the metadata are kept. How should this be managed?

To limit database growth caused by accumulating metadata versions, you can configure retention policies. DataHub supports three types:

  1. Indefinite Retention: Keeps all versions of metadata aspects.
  2. Version-based Retention: Retains only the latest N versions of metadata aspects.
  3. Time-based Retention: Retains versions ingested within the last N seconds.

According to the <https://datahubproject.io/docs/0.14.0/advanced/db-retention|DataHub documentation on database retention>, these policies can be applied at the global, entity, or aspect level. Retention is enforced in two places: at GMS boot-up, and on every ingest that updates an existing aspect.

To configure these policies, you can create a YAML file specifying the retention settings and place it in the /etc/datahub/plugins/retention directory. For example, you can set a policy to keep only the latest 20 versions of all entity-aspect pairs:

- entity: "*"
  aspect: "*"
  config:
    retention:
      version:
        maxVersions: 20

Restart DataHub GMS after creating the YAML file so that the new retention policies are applied.
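
The wildcard rule shown earlier can sit alongside narrower rules in the same file, since policies can target a specific entity and aspect. The sketch below is only an illustration: the `dataset` / `datasetProperties` pair is an arbitrary example, and the `time` / `maxAgeInSeconds` keys are assumed to match the retention docs for your DataHub version, so verify them before relying on this.

```yaml
# Global default: keep the latest 20 versions of every entity-aspect pair.
- entity: "*"
  aspect: "*"
  config:
    retention:
      version:
        maxVersions: 20

# Aspect-level rule (illustrative pair): adds a time-based limit
# for one aspect of one entity type.
- entity: "dataset"
  aspect: "datasetProperties"
  config:
    retention:
      version:
        maxVersions: 20
      time:
        maxAgeInSeconds: 2592000  # 30 days; key name assumed, check the docs
```

Keeping a wildcard default and adding targeted rules only for the aspects that change most often keeps the file short while still bounding the fastest-growing tables.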

Would you like more detailed guidance on setting up these retention policies or any other aspect of managing database growth in DataHub?
