Soft Deletion and Its Impact on Elasticsearch and MySQL DB

Original Slack Thread

<@U06TM7M40RK> I want to know: if I (soft) delete entities, where and what values change in Elasticsearch and the MySQL DB?

When you soft delete entities in DataHub, changes occur in both Elasticsearch and the MySQL database, but in different ways:

  1. Elasticsearch:

    • Soft-deleted entities are removed from search and browse results. The documents typically still exist in the Elasticsearch indices, but they carry the “removed” flag from the entity’s “Status” aspect, and search queries filter them out so they no longer appear in results.
    • According to discussion in the DataHub Slack, Elasticsearch is the backend for all GraphQL and search queries, so deleting data from Elasticsearch would mean that assets won’t be found in the UI.
  2. MySQL Database:

    • The MySQL database stores the core metadata entities and aspects. When an entity is soft-deleted, a new version of its “Status” aspect is written; the underlying aspect rows are retained rather than physically deleted.
    • As per the DataHub documentation for the metadata service, the “Status” aspect represents the lifecycle state of an entity. For a soft deletion, the “removed” field in the “Status” aspect is set to true (a small CLI sketch follows below).
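
For illustration, here is a minimal sketch of how you could observe this yourself with the DataHub CLI, assuming the datahub CLI is installed and configured against your GMS instance (the URN below is only a placeholder):

# Soft delete one dataset (soft delete is the default; --soft makes it explicit)
datahub delete --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)" --soft

# Fetch the entity's "status" aspect back from GMS; after a soft delete, "removed" should be true
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD)" --aspect status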

Would you like more detailed guidance on any specific part of this process?


<@U06TM7M40RK> But if I (soft) delete a container that holds all the metadata about certain datasets, could that delete all of those datasets and stop them from being shown in the DataHub UI?

Soft deleting a container in DataHub does not automatically delete all datasets or metadata contained within that container. According to the <https://datahubproject.io/docs/0.14.0/how/delete-metadata|DataHub documentation>, when you delete an entity with children, such as a container, it will not delete those children. You will need to delete each child by URN in addition to deleting the parent.

Here’s what happens when you soft delete a container:

  1. Container Visibility: The container itself will be marked as removed, and it will not appear in search or browse results in the DataHub UI. However, the datasets and other metadata entities within the container will remain unaffected unless they are explicitly deleted.

  2. Child Entities: The datasets and other metadata entities within the container will still exist and be visible in the DataHub UI unless they are individually soft deleted. The soft deletion of a container does not cascade to its child entities.

  3. UI Behavior: The datasets will still be accessible and visible in the UI, provided you have direct links to them or search for them specifically. The container’s soft deletion only affects the container’s visibility, not the visibility of its contents.

If you want to ensure that datasets within a container are also soft deleted, you would need to perform a separate soft delete operation on each dataset.
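
If it helps, here is a minimal sketch of that per-dataset approach. It assumes you have already collected the child dataset URNs into a file (child_urns.txt is just a hypothetical name, e.g. gathered from search results or the container page) and that the datahub CLI is configured against your GMS:

# Soft delete every URN listed in child_urns.txt, one per line
while read -r urn; do
  datahub delete --urn "$urn" --soft --force
done < child_urns.txt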

Would you like more detailed guidance on how to perform these operations?


<@U06TM7M40RK> Then what should I do if I want to delete all metadata whose URNs match a pattern, e.g. BigQuery datasets that share the same project and dataset name but have different table names?

To delete all metadata entities in DataHub that match a specific pattern in their URN, such as having the same project and dataset name but different table names in BigQuery, you can use the DataHub CLI with filters. Here’s how you can achieve this:

  1. Identify the Pattern: Determine the pattern in the URNs that you want to match. For example, if your URNs are structured like urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.table,PROD), you might want to match all entities with project.dataset.

  2. Use the DataHub CLI with Filters: You can use the datahub delete command with the --platform and --entity-type filters to target specific datasets. Unfortunately, the CLI does not support regex or pattern matching directly, but you can filter by platform and entity type.

  3. Perform the Delete: Use the --dry-run flag first to ensure you are targeting the correct entities, and then perform the actual delete.

Here’s an example of how you might structure your commands:

# Dry run to see what entities will be deleted
datahub delete --platform bigquery --entity-type dataset --dry-run

# Soft delete the entities
datahub delete --platform bigquery --entity-type dataset
  4. Scripted Approach for Pattern Matching: If you need more granular control, such as matching specific parts of the URN, you might need to script the process. You can use a combination of GraphQL queries to fetch the URNs that match your pattern and then use the CLI to delete them (a sketch follows after the outline below).

Here’s a basic outline of how you might script this:

  • Use a GraphQL query to fetch all datasets and filter them based on your pattern.
  • Extract the URNs from the query results.
  • Use the datahub delete command to delete each URN.
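
Here is a rough sketch of that outline as a shell script, not an official recipe. It assumes your GMS is reachable at http://localhost:8080, that DATAHUB_TOKEN holds a personal access token, that jq is installed, and that project.dataset. is the URN fragment you want to match (all of these are placeholders to adapt):

# 1. Fetch candidate dataset URNs via the GraphQL search API (first 1000 results; paginate if you have more)
urns=$(curl -s -X POST http://localhost:8080/api/graphql \
  -H "Authorization: Bearer ${DATAHUB_TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"query": "{ searchAcrossEntities(input: {types: [DATASET], query: \"*\", start: 0, count: 1000}) { searchResults { entity { urn } } } }"}' \
  | jq -r '.data.searchAcrossEntities.searchResults[].entity.urn')

# 2. Keep only the URNs containing the pattern and soft delete each one
echo "$urns" | grep -F "project.dataset." | while read -r urn; do
  echo "Soft deleting: $urn"
  datahub delete --urn "$urn" --soft --force
done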

Would you like more detailed guidance on scripting this process or any other specific part of the operation?


<@U06TM7M40RK> How can I execute reindexing in Elasticsearch?

To execute reindexing in Elasticsearch for DataHub, you can follow these steps:

  1. Run the Upgrade Job: You can reindex Elasticsearch by running the DataHub upgrade job’s RestoreIndices step with the -a clean argument. This wipes the existing search indices and rebuilds them from the metadata stored in MySQL, which is useful for resolving inconsistencies or applying schema changes.

    Here’s a general example of how to execute the reindexing process for a Docker Compose deployment, run from the root of the DataHub repository:

    # Runs the RestoreIndices upgrade; -a clean wipes the indices before rebuilding them
    ./docker/datahub-upgrade/datahub-upgrade.sh -u RestoreIndices -a clean
    

    This command will initiate the reindexing process, which can help resolve inconsistencies and apply schema changes.
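
    If you run DataHub on Kubernetes via the Helm chart, the same reindex is usually triggered from the chart’s restore-indices cron job template rather than by exec-ing into a pod. A sketch, assuming the default release name datahub (check the exact template name with kubectl get cronjobs):

    # Launch an ad-hoc job from the restore-indices cron job template
    kubectl create job --from=cronjob/datahub-datahub-restore-indices-job-template datahub-restore-indices-adhoc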

  2. Reindex Specific Indices: If you want to reindex specific indices, you can use the submitReindex method from the ESIndexBuilder class. This method allows you to specify the source and destination indices, batch size, and other parameters.

    Here’s an example of how you might do this programmatically:

    // Convenience overload in ESIndexBuilder: reindexes a single source index into the
    // destination index, delegating to the array-based overload with a batch size of 2500
    // and defaults for the remaining parameters.
    private String submitReindex(String sourceIndex, String destinationIndex) throws IOException {
        return submitReindex(new String[] {sourceIndex}, destinationIndex, 2500, null, null);
    }
    

    This method is part of the ESIndexBuilder class, which handles reindexing tasks. You can find more details in the <https://github.com/datahub-project/datahub/blob/master/metadata-io/src/main/java/com/linkedin/metadata/search/elasticsearch/indexbuilder/ESIndexBuilder.java|ESIndexBuilder.java> file.

  3. Verify the Reindexing: After reindexing, you can verify that the indices have been updated by checking the document counts or using Elasticsearch’s _cat/indices API to list the indices and their statuses.
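
    For example, assuming Elasticsearch is reachable at localhost:9200 (adjust host, port, and credentials to your deployment):

    # List all indices with their health, document counts, and sizes
    curl -s 'http://localhost:9200/_cat/indices?v'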

Would you like more detailed steps on any of these methods or any other aspect of the reindexing process?


<@U06TM7M40RK> What data does the cleanUpJob (cron job) target for cleanup?

The cleanUpJob (cron job) in DataHub is designed to clean up legacy metadata from previous versions of DataHub. This job is particularly useful after performing an upgrade to ensure that the system is free of outdated or redundant data. Here are the key aspects of what the cleanup job targets:

  1. Legacy Metadata: The cleanup job focuses on removing metadata that is no longer relevant or has been superseded by newer versions. This includes metadata that might have been left behind after an upgrade.

  2. Outdated Indices: It may also target outdated indices in Elasticsearch that are no longer needed, ensuring that the search index is optimized and up-to-date.

  3. Redundant Data: Any redundant data that does not conform to the current schema or data model may be cleaned up to maintain the integrity and performance of the DataHub instance.

  4. Configuration and Execution: The cleanup job can be executed in different deployment environments, such as Docker Compose or Helm deployments. For Helm deployments, the cleanup job is deployed as a cron job template named datahub-datahub-cleanup-job-template. It is intended to be run in an ad-hoc fashion once you have validated that your DataHub deployment is healthy post-upgrade.

    • For Docker Compose, you can run the run_clean.sh script located under docker/datahub-upgrade/nocode.
    • For Helm, you can create a job from the cron job template using kubectl.

For more details on executing the cleanup job, you can refer to the DataHub documentation on the No Code Upgrade (In-Place Migration Guide) for Helm deployments.
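
For a Helm deployment, a minimal sketch of an ad-hoc run looks like this (the cron job template name assumes the default release name datahub; confirm it with kubectl get cronjobs):

# Launch an ad-hoc cleanup job from the cron job template
kubectl create job --from=cronjob/datahub-datahub-cleanup-job-template datahub-cleanup-adhoc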

Would you like more detailed information on how to execute the cleanup job or its specific functions?


you can soft delete all datasets within a given container using CLI. Docs here: https://datahubproject.io/docs/how/delete-metadata#selecting-entities-to-delete


<@U04583XMT5Z> Hi, but what if I want to preserve all the other data except the data whose URN matches a certain pattern? Do you mean soft deleting all datasets and then ingesting everything again from the beginning?

yes that would work. otherwise you might need to write a script to find the specific datasets and then soft-delete them

<@U04583XMT5Z> Thank you for your help. Actually, I had already deleted them one URN at a time. It took so much time haha.
Btw, have you ever experienced this error before?

https://datahubspace.slack.com/archives/C029A3M079U/p1729613040235639

I could find some cases in the channel that are similar to mine, but they’re not exactly the same, so…