Hi community,
Is there a straightforward way to remove old items (ones not synced for a certain period of time) from DataHub
with a single query or script?
Hey there!
<@U06TM7M40RK> Can you help with the questions in this thread?
Please continue to tag RunLLM (e.g <@U06TM7M40RK>) to have it continue generating responses!
To remove old items from DataHub that have not been synced for a certain period of time, you can use a combination of GraphQL queries and the DataHub CLI. Here’s a general approach:
- Identify Stale Entities: Use a GraphQL query to find entities that have not been ingested since a specific date. You can use the lastIngested field to filter these entities.
- Delete Entities: Use the DataHub CLI to soft-delete the identified entities by their URNs.
Step-by-Step Process
Step 1: Query Entities by Last Ingested Date
You can use a GraphQL query to find entities that haven’t been ingested since a certain date. Here’s an example query:
    {
      search(input: {
        type: DATASET,
        query: "*",
        filters: [
          {
            field: "lastIngested",
            value: "2023-01-01T00:00:00.000Z",
            condition: "LT"
          }
        ],
        start: 0,
        count: 100
      }) {
        entities {
          entity {
            urn
          }
        }
      }
    }
This query searches for datasets that were last ingested before January 1, 2023. Adjust the type, value, and condition as needed.
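For a rolling window ("not synced for N days") the cutoff can be computed rather than hard-coded. The following is a minimal sketch; `stale_cutoff` is a hypothetical helper, not part of DataHub. It returns the cutoff both as an ISO-8601 string (the shape used in the query above) and as epoch milliseconds, since which form the filter accepts may depend on your DataHub version.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def stale_cutoff(days: int, now: Optional[datetime] = None) -> Tuple[str, int]:
    """Return the cutoff `days` before `now` as (ISO-8601 string, epoch milliseconds)."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now.astimezone(timezone.utc) - timedelta(days=days)
    # Same shape as the literal in the query above, e.g. 2023-01-01T00:00:00.000Z
    iso = cutoff.isoformat(timespec="milliseconds").replace("+00:00", "Z")
    epoch_ms = int(cutoff.timestamp() * 1000)
    return iso, epoch_ms
```

Either value can then be substituted into the filter's value field in place of the fixed date.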
Step 2: Delete Entities
Once you have the URNs of the entities, use the DataHub CLI to soft-delete them:
datahub delete --urn "<urn1>" --urn "<urn2>" --soft
Example Script
You can automate this process using a script that combines the GraphQL query and the CLI command. Here’s a basic example in Python:
    import requests
    import json
    import subprocess

    # GraphQL endpoint and query
    url = 'http://localhost:8080/api/graphql'
    query = """
    {
      search(input: {
        type: DATASET,
        query: "*",
        filters: [
          {
            field: "lastIngested",
            value: "2023-01-01T00:00:00.000Z",
            condition: "LT"
          }
        ],
        start: 0,
        count: 100
      }) {
        entities {
          entity {
            urn
          }
        }
      }
    }
    """

    # Execute GraphQL query
    response = requests.post(url, json={'query': query})
    data = response.json()

    # Extract URNs
    urns = [entity['entity']['urn'] for entity in data['data']['search']['entities']]

    # Soft delete entities
    for urn in urns:
        subprocess.run(['datahub', 'delete', '--urn', urn, '--soft'])
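The script above only reads the first page of results (start: 0, count: 100). One way to cover larger result sets is to page through the search before deleting anything. This is a rough sketch: `collect_urns` and the `fetch_page` callable are hypothetical names, and `fetch_page(start, count)` is assumed to wrap the GraphQL request and return the list of URNs for that page.

```python
from typing import Callable, List


def collect_urns(fetch_page: Callable[[int, int], List[str]], page_size: int = 100) -> List[str]:
    """Call fetch_page(start, count) repeatedly until a short or empty page comes back."""
    urns: List[str] = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        urns.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            break
        start += page_size
    return urns
```

Collecting every URN first and deleting afterwards keeps the offsets stable, since deleting while paging could shift later results.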
Important Considerations
- Dry Run: Always perform a dry run first to ensure you are deleting the correct entities.
- Batch Processing: If you have a large number of entities, consider batching the deletions to avoid overloading the system.
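The two considerations above could be sketched as follows. `batched` and `delete_command` are hypothetical helpers, and the `--dry-run` flag is an assumption about the CLI that should be confirmed with `datahub delete --help` for your version. The sketch only builds and prints the commands so they can be reviewed before anything is run.

```python
import shlex
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield list(items[i:i + size])


def delete_command(urn: str, dry_run: bool = True) -> List[str]:
    """Build the argv for soft-deleting one URN."""
    cmd = ["datahub", "delete", "--urn", urn, "--soft"]
    if dry_run:
        # Assumed flag name; verify against your CLI version.
        cmd.append("--dry-run")
    return cmd


if __name__ == "__main__":
    # Placeholder URNs for illustration only.
    urns = ["urn:li:dataset:a", "urn:li:dataset:b", "urn:li:dataset:c"]
    for batch in batched(urns, 2):
        for urn in batch:
            # Print for review; pass the list to subprocess.run once verified.
            print(shlex.join(delete_command(urn)))
```

Defaulting `dry_run` to True means a real deletion has to be requested explicitly.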
Would you like more detailed information on any specific part of this process?
Sources:
- Removing Metadata from DataHub: https://datahubproject.io/docs/0.13.1/how/delete-metadata
The query is a bit outdated. The filters and entities fields do not match the schema of the input.
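For reference, a sketch of how the request might look against a newer schema. Everything named here is an assumption to check against your instance's GraphQL schema: that the search input takes orFilters (a list of `and` groups whose filters use `values` and a `condition` such as LESS_THAN), that results come back under searchResults, and that lastIngested is a filterable field at all. `build_payload` and `extract_urns` are hypothetical helpers.

```python
from typing import Any, Dict, List

SEARCH_QUERY = """
query staleDatasets($input: SearchInput!) {
  search(input: $input) {
    total
    searchResults {
      entity {
        urn
      }
    }
  }
}
"""


def build_payload(cutoff: str, start: int = 0, count: int = 100) -> Dict[str, Any]:
    """Build the JSON body for the GraphQL request (field and condition names are assumed)."""
    return {
        "query": SEARCH_QUERY,
        "variables": {
            "input": {
                "type": "DATASET",
                "query": "*",
                "orFilters": [
                    {"and": [{"field": "lastIngested", "values": [cutoff], "condition": "LESS_THAN"}]}
                ],
                "start": start,
                "count": count,
            }
        },
    }


def extract_urns(response_json: Dict[str, Any]) -> List[str]:
    """Pull the URNs out of a response shaped like the query above."""
    results = response_json["data"]["search"]["searchResults"]
    return [result["entity"]["urn"] for result in results]
```

The payload would be sent with `requests.post(url, json=build_payload(cutoff))`, and `extract_urns(response.json())` would replace the list comprehension over `entities`.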
Hi! Did this interaction resolve your issue, or do you still require assistance?
Please tag @RunLLM if you still need further assistance!
Or tag @Datahub Community Support if you require a person to take a look.
Hope this helps! ><