Removing Old Items from DataHub by Query and Script

Original Slack Thread

Hi community,
Is there a straightforward way to remove old items (not synced for a certain period of time) from DataHub
with a single query or a script?

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

:information_source: Please continue to tag RunLLM (e.g <@U06TM7M40RK>) to have it continue generating responses!

To remove old items from DataHub that have not been synced for a certain period of time, you can use a combination of GraphQL queries and the DataHub CLI. Here’s a general approach:

  1. Identify Stale Entities: Use a GraphQL query to find entities that have not been ingested since a specific date. You can use the lastIngested field to filter these entities.

  2. Delete Entities: Use the DataHub CLI to soft-delete the identified entities by their URNs.

Step-by-Step Process

Step 1: Query Entities by Last Ingested Date

You can use a GraphQL query to find entities that haven’t been ingested since a certain date. Here’s an example query:

{
  search(input: {
    type: DATASET,
    query: "*",
    orFilters: [
      {
        and: [
          {
            field: "lastIngested",
            values: ["1672531200000"],
            condition: LESS_THAN
          }
        ]
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
      }
    }
  }
}

This query searches for datasets that were last ingested before January 1, 2023 (1672531200000 is that date as a Unix timestamp in milliseconds, which is how lastIngested is indexed). Adjust the type, values, and condition as needed; condition is a FilterOperator enum such as LESS_THAN or GREATER_THAN.
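The original question asks for items not synced for a certain period of time rather than before a fixed date. Since lastIngested is typically indexed as a Unix timestamp in milliseconds, a small sketch for computing the filter value from a rolling window (the 90-day window is just an example):

```python
from datetime import datetime, timedelta, timezone

def cutoff_millis(days: int) -> str:
    """Return the epoch-millisecond cutoff for 'not ingested in the
    last N days', as a string suitable for a GraphQL filter value."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return str(int(cutoff.timestamp() * 1000))

# e.g. filter value for entities not ingested in the last 90 days
print(cutoff_millis(90))
```

Substitute the returned value into the `values` field of the filter above.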

Step 2: Delete Entities

Once you have the URNs of the entities, use the DataHub CLI to soft-delete each one:

datahub delete --urn "<urn>" --soft

Example Script

You can automate this process using a script that combines the GraphQL query and the CLI command. Here’s a basic example in Python:

import requests
import subprocess

# DataHub GMS GraphQL endpoint
url = 'http://localhost:8080/api/graphql'

# Find datasets last ingested before Jan 1, 2023 (epoch milliseconds)
query = """
{
  search(input: {
    type: DATASET,
    query: "*",
    orFilters: [
      {
        and: [
          {
            field: "lastIngested",
            values: ["1672531200000"],
            condition: LESS_THAN
          }
        ]
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
      }
    }
  }
}
"""

# Execute the GraphQL query (add an Authorization header here if
# metadata service authentication is enabled)
response = requests.post(url, json={'query': query})
response.raise_for_status()
data = response.json()

# Extract URNs from the search results
urns = [result['entity']['urn']
        for result in data['data']['search']['searchResults']]

# Soft-delete each entity via the DataHub CLI
for urn in urns:
    subprocess.run(['datahub', 'delete', '--urn', urn, '--soft'], check=True)

Important Considerations

  • Dry Run: Always perform a dry run first (datahub delete ... --dry-run) to ensure you are deleting the correct entities.
  • Batch Processing: If you have a large number of entities, consider batching the deletions to avoid overloading the system.
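On the batch-processing point: the search input's start/count fields support pagination, so you can walk through large result sets page by page instead of fetching everything at once. A minimal sketch of such a pager, where fetch_page is a hypothetical callback standing in for the GraphQL call:

```python
from typing import Callable, Iterator, List

def iter_urns(fetch_page: Callable[[int, int], List[str]],
              page_size: int = 100) -> Iterator[str]:
    """Yield URNs page by page until a short or empty page signals the end.

    fetch_page(start, count) stands in for running the search query with
    those start/count values and extracting the URNs from the response.
    """
    start = 0
    while True:
        page = fetch_page(start, page_size)
        yield from page
        if len(page) < page_size:
            break
        start += page_size

# Example with a stub in place of the real GraphQL call:
fake_results = [f"urn:li:dataset:{i}" for i in range(250)]
fetch = lambda start, count: fake_results[start:start + count]
print(sum(1 for _ in iter_urns(fetch)))  # 250
```

In the real script, fetch_page would issue the POST request with the given start and count, and you could insert a short sleep between pages to avoid overloading the server.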

Would you like more detailed information on any specific part of this process?

Note that the original query was a bit outdated: the filters and entities fields are not part of the current SearchInput schema (it uses orFilters, and results are returned under searchResults).

Hi :slight_smile:! Did this interaction resolve your issue, or do you still require assistance?

Please tag @RunLLM if you still need further assistance!

Or tag @Datahub Community Support if you'd like a person to take a look :slight_smile:

Hope this helps ! ><