To filter and soft-delete entities based on their last sync date, you can leverage the lastIngested timestamp in DataHub. The DataHub CLI has no direct command to filter by sync date, but you can combine a GraphQL query that identifies the entities with the datahub delete command to soft-delete them.
Step-by-Step Process
- Query Entities by Last Ingested Date: Use a GraphQL query to fetch entities that have not been ingested since a certain date.
- Delete Entities: Use the datahub delete command with the URNs obtained from the GraphQL query to soft-delete those entities.
Step 1: Query Entities by Last Ingested Date
You can use the lastIngested field in a GraphQL query to find entities that haven’t been ingested since a specific date.
Here is an example GraphQL query to find such entities:
{
  search(input: {
    type: DATASET,
    query: "*",
    filters: [
      {
        field: "lastIngested",
        value: "2023-01-01T00:00:00.000Z",
        condition: LESS_THAN
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
      }
    }
  }
}
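This query searches for datasets that were last ingested before January 1, 2023. You can adjust the type, value, and condition as needed. Whether lastIngested is accepted as a search filter, and in which timestamp format, can differ between DataHub versions, so check the query against your instance's GraphQL schema first.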
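Rather than hard-coding the cutoff, the filter value can be computed relative to today. The following is a minimal sketch; the helper names cutoff_iso and cutoff_epoch_ms are hypothetical, and the epoch-milliseconds variant is included only on the assumption that your DataHub version expects a numeric timestamp instead of an ISO string.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def _cutoff(days: int, now: Optional[datetime] = None) -> datetime:
    """Return the timezone-aware UTC datetime `days` before `now`."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc) - timedelta(days=days)


def cutoff_iso(days: int, now: Optional[datetime] = None) -> str:
    """Format the cutoff like the filter value above (millisecond precision, trailing Z)."""
    cutoff = _cutoff(days, now)
    millis = cutoff.microsecond // 1000
    return cutoff.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


def cutoff_epoch_ms(days: int, now: Optional[datetime] = None) -> int:
    """Same cutoff as epoch milliseconds (assumption: some fields expect this form)."""
    return int(_cutoff(days, now).timestamp() * 1000)
```

For example, cutoff_iso(90) gives a value that selects entities not ingested in the last 90 days, provided the filter works as described above. Pass a timezone-aware datetime as now if you need a fixed reference point.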
Step 2: Delete Entities
Once you have the URNs of the entities, you can use the datahub delete command to soft-delete them.
Here is an example command to soft-delete one URN; repeat it for each URN returned by the query:
datahub delete --urn "<urn>" --soft
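If you drive the CLI from a script, it can help to build the argument list in one place so a dry run and the real deletion use the same command. This is a sketch only: build_delete_command is a hypothetical helper, and it assumes your CLI version accepts --dry-run alongside --urn and --soft (check datahub delete --help to confirm).

```python
import shlex
from typing import List


def build_delete_command(urn: str, dry_run: bool = True) -> List[str]:
    """Build the argv for soft-deleting a single URN with the datahub CLI."""
    cmd = ["datahub", "delete", "--urn", urn, "--soft"]
    if dry_run:
        # Assumption: the installed CLI supports --dry-run for delete.
        cmd.append("--dry-run")
    return cmd


def preview(urn: str, dry_run: bool = True) -> str:
    """Return the command as a shell-quoted string, for logging or manual review."""
    return shlex.join(build_delete_command(urn, dry_run))
```

Defaulting dry_run to True means a deletion only happens when the caller explicitly asks for it. The returned list can be passed directly to subprocess.run.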
Example Script
You can automate this process using a script that combines the GraphQL query and the CLI command. Here is a basic example in Python:
import subprocess

import requests

# GraphQL endpoint (adjust host and port for your deployment) and query
url = 'http://localhost:8080/api/graphql'
query = """
{
  search(input: {
    type: DATASET,
    query: "*",
    filters: [
      {
        field: "lastIngested",
        value: "2023-01-01T00:00:00.000Z",
        condition: LESS_THAN
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
      }
    }
  }
}
"""

# Execute the GraphQL query and fail early on HTTP or GraphQL errors
response = requests.post(url, json={'query': query})
response.raise_for_status()
data = response.json()
if data.get('errors'):
    raise RuntimeError(f"GraphQL query failed: {data['errors']}")

# Extract URNs from the search results
urns = [result['entity']['urn'] for result in data['data']['search']['searchResults']]

# Soft-delete each entity; check=True stops the loop if the CLI exits with an error
for urn in urns:
    subprocess.run(['datahub', 'delete', '--urn', urn, '--soft'], check=True)
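The script above only reads the first 100 results (start: 0, count: 100). A rough sketch of paging through all matches follows. The names collect_urns and fetch_page are hypothetical: fetch_page stands for any function you write that runs the search query with the given start and count and returns the URNs on that page.

```python
from typing import Callable, List


def collect_urns(fetch_page: Callable[[int, int], List[str]], page_size: int = 100) -> List[str]:
    """Call fetch_page(start, count) repeatedly until a short or empty page is returned."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    urns: List[str] = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        urns.extend(page)
        # A page shorter than page_size means there are no further results.
        if len(page) < page_size:
            return urns
        start += page_size
```

Collect the full list before deleting anything: soft-deleting entities while still paging could shift the result set under the start offset and cause some matches to be skipped.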
Important Considerations
- Dry Run: Always perform a dry run first (for example, by adding --dry-run to the datahub delete command) to ensure you are deleting the correct entities.
- Batch Processing: If you have a large number of entities, consider batching the deletions to avoid overloading the system.
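One way to apply the batching advice is to split the URN list into fixed-size chunks and pause between them. This is only a sketch: batched and delete_in_batches are hypothetical helpers, delete_one is whatever callable you use to delete a single URN (for example, a wrapper around the datahub delete command), and the batch size and pause are placeholder values to tune for your deployment.

```python
import time
from typing import Callable, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for i in range(0, len(items), size):
        yield items[i:i + size]


def delete_in_batches(
    urns: List[str],
    delete_one: Callable[[str], object],
    batch_size: int = 50,
    pause_seconds: float = 1.0,
) -> int:
    """Call delete_one for every URN, sleeping between batches; return the count processed."""
    processed = 0
    for batch in batched(urns, batch_size):
        for urn in batch:
            delete_one(urn)
            processed += 1
        # Give the server time to catch up before the next batch.
        time.sleep(pause_seconds)
    return processed
```

Passing print as delete_one is a cheap way to rehearse the run and review the URNs before substituting a function that actually deletes.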
Would you like more detailed information on any specific part of this process?