Stateful Ingestion Behavior in DataHub 0.13.3 with Metabase Integration

Original Slack Thread

Hi! DataHub 0.13.3. UI. Metabase ingestion.

We modified the Metabase recipe to:

```
        stateful_ingestion:
            enabled: true
            remove_stale_metadata: true
        ...
```
in order to enable stateful ingestion, but it doesn't always work.
I created and deleted a new question in Metabase, and ok, it is ingested and then soft-deleted in DataHub:
```
"soft_deleted_stale_entities": [
        "urn:li:chart:(metabase,3813)"
      ],
```
But I also have many charts that were ingested before the upgrade to 0.13.3 and are not soft-deleted.
Is this the expected behavior? Shouldn't stateful ingestion soft-delete every chart not found in the last run?

<@U06TM7M40RK> Can you help with the questions in this thread?


To address your question about stateful ingestion in DataHub 0.13.3 for Metabase, it seems like you are encountering an issue where charts ingested before the upgrade are not being soft-deleted as expected. Here are a few points to consider:

  1. Stateful Ingestion Behavior: According to <https://datahubproject.io/docs/0.13.1/metadata-ingestion/docs/dev_guides/stateful|DataHub documentation>, stateful ingestion works by comparing the current state with the previous state. If an entity (e.g., a chart) is not found in the current state but was present in the previous state, it should be soft-deleted.

  2. Previous State Consideration: As per a previous Slack conversation, if stateful ingestion was not enabled before, DataHub cannot compare the current state with any previous state files, and therefore, it cannot soft-delete entities that were ingested before stateful ingestion was enabled.

  3. Soft-Deletion Confirmation: In another Slack discussion, it was confirmed that Metabase ingestion does not perform soft deletes for archived questions by default, and the stateful_ingestion.remove_stale_metadata option is necessary to enable this functionality.

Given these points, it appears that the behavior you are experiencing is expected. Stateful ingestion can only soft-delete entities that were ingested after stateful ingestion was enabled because it relies on comparing the current state with the state from previous runs.
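The comparison described above amounts to a set difference between two checkpoints. Here is a minimal sketch of that idea, not DataHub's actual implementation; the function name find_stale_urns and the URN for chart 3812 are hypothetical, chosen only for illustration:

```python
def find_stale_urns(previous_checkpoint, current_run):
    """Return URNs recorded in the previous checkpoint but missing from the current run."""
    return sorted(set(previous_checkpoint) - set(current_run))


# Only URNs that a stateful run has already recorded can be detected as stale.
# A chart ingested before stateful ingestion was enabled, and deleted in
# Metabase before the first stateful run, never appears in any checkpoint.
previous_checkpoint = [
    "urn:li:chart:(metabase,3812)",
    "urn:li:chart:(metabase,3813)",
]
current_run = ["urn:li:chart:(metabase,3812)"]

stale = find_stale_urns(previous_checkpoint, current_run)
print(stale)  # ['urn:li:chart:(metabase,3813)']
```

With an empty previous checkpoint the function returns an empty list, which matches the behavior seen for charts ingested before the upgrade.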

Would you like more details on how to handle entities ingested before enabling stateful ingestion, or any other specific aspect of stateful ingestion?

<@U06TM7M40RK>, how can I manually delete (or soft-delete) entities not synced in the last run?

To manually delete or soft-delete entities that were not synced in the last run, you can use the DataHub CLI. Here are the steps to achieve this:

  1. Identify the Entities to Delete: You can use the datahub delete command with the --dry-run option to list entities that would be deleted without actually performing the deletion. This helps you verify the entities before proceeding.

  2. Soft Delete the Entities: Once you have verified the entities, you can run the datahub delete command without the --dry-run option to perform the actual soft delete.

Here is an example of how to perform these steps:

Step 1: Dry Run to Identify Entities

datahub delete --platform metabase --soft --dry-run

This command lists every entity from the Metabase platform that would be soft-deleted, without changing anything. (The --only-soft-deleted flag restricts the match to entities that are already soft-deleted, which is useful when preparing a hard delete but not for finding entities that are still active.)

Step 2: Soft Delete the Entities

datahub delete --platform metabase --soft

This command performs a soft delete on all entities from the Metabase platform. The --platform filter does not distinguish stale entities from current ones, so narrow the selection (for example by URN) if only some charts should be removed.

Additional Options

  • Using Filters: You can further filter the entities by specifying additional parameters such as --entity-type, --env, etc.
  • Recursive Deletes: If you need to delete entities recursively, you can use the --recursive flag.
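To keep the dry run and the real run in sync, the command can be assembled in one place. This is a small sketch using only the flags mentioned in this thread; build_delete_command is a hypothetical helper, and the entity type value "chart" is an assumption about how Metabase questions are typed in your instance:

```python
import shlex


def build_delete_command(platform, entity_type=None, env=None, dry_run=True):
    """Assemble a `datahub delete` argument list from the filters named above."""
    cmd = ["datahub", "delete", "--platform", platform]
    if entity_type:
        cmd += ["--entity-type", entity_type]
    if env:
        cmd += ["--env", env]
    cmd.append("--soft")
    if dry_run:
        # Defaulting to a dry run makes the destructive call an explicit opt-in.
        cmd.append("--dry-run")
    return cmd


cmd = build_delete_command("metabase", entity_type="chart")
print(shlex.join(cmd))
# datahub delete --platform metabase --entity-type chart --soft --dry-run
```

Passing dry_run=False yields the same command without --dry-run, so the filters reviewed in the dry run are exactly the ones used for the real deletion.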

Would you like more details on any of these steps or additional examples?

<@U06TM7M40RK> please, provide the command to filter by sync date, in order to soft-delete any entities not found in the last sync

To filter and soft-delete entities based on their last sync date, you can leverage the lastIngested timestamp in DataHub. Unfortunately, there isn’t a direct command in the DataHub CLI to filter by sync date. However, you can achieve this by using a combination of GraphQL queries to identify the entities and then using the datahub delete command to soft-delete them.

Step-by-Step Process

  1. Query Entities by Last Ingested Date: Use a GraphQL query to fetch entities that have not been ingested since a certain date.
  2. Delete Entities: Use the datahub delete command with the URNs obtained from the GraphQL query to soft-delete those entities.

Step 1: Query Entities by Last Ingested Date

You can use the lastIngested field in a GraphQL query to find entities that haven’t been ingested since a specific date.

Here is an example GraphQL query to find such entities:

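{
  search(input: {
    type: CHART,
    query: "*",
    filters: [
      {
        field: "lastIngested",
        value: "2023-01-01T00:00:00.000Z",
        condition: LESS_THAN
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
      }
    }
  }
}

This query searches for charts (the entity type Metabase questions are ingested as) that were last ingested before January 1, 2023. You can adjust the type, value, and condition as needed. The condition is an unquoted enum value, and matches are returned under searchResults. Both the LESS_THAN operator and the use of lastIngested as a filter field are assumptions here, so confirm them against the GraphQL schema of your DataHub version.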
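The cutoff value does not have to be hard-coded. It can be derived from the time of the last successful sync, roughly as follows. This is a sketch: cutoff_timestamp is a hypothetical helper, and it only reproduces the timestamp format used in the query above; whether the search backend expects this format or epoch milliseconds is an assumption to verify.

```python
from datetime import datetime, timedelta, timezone


def cutoff_timestamp(now, max_age):
    """Format `now - max_age` as an ISO 8601 UTC string with millisecond precision."""
    cutoff = (now - max_age).astimezone(timezone.utc)
    millis = cutoff.microsecond // 1000
    return cutoff.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"


# Anything not ingested during the seven days before `now` counts as stale.
now = datetime(2023, 1, 8, tzinfo=timezone.utc)
print(cutoff_timestamp(now, timedelta(days=7)))  # 2023-01-01T00:00:00.000Z
```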

Step 2: Delete Entities

Once you have the URNs of the entities, you can use the datahub delete command to soft-delete them.

The --urn option takes a single URN, so run the command once per entity:

datahub delete --urn "<urn1>" --soft
datahub delete --urn "<urn2>" --soft

Example Script

You can automate this process using a script that combines the GraphQL query and the CLI command. Here is a basic example in Python:

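import subprocess

import requests

# GraphQL endpoint and query; adjust the host for your deployment
url = 'http://localhost:8080/api/graphql'
# LESS_THAN and the lastIngested filter field are assumptions; check your schema
query = """
{
  search(input: {
    type: CHART,
    query: "*",
    filters: [
      {
        field: "lastIngested",
        value: "2023-01-01T00:00:00.000Z",
        condition: LESS_THAN
      }
    ],
    start: 0,
    count: 100
  }) {
    searchResults {
      entity {
        urn
      }
    }
  }
}
"""

# Execute GraphQL query and fail early on HTTP or GraphQL errors
response = requests.post(url, json={'query': query})
response.raise_for_status()
data = response.json()
if data.get('errors'):
    raise RuntimeError(data['errors'])

# Extract URNs from the search results
urns = [result['entity']['urn'] for result in data['data']['search']['searchResults']]

# Soft delete entities one URN at a time, stopping if a command fails
for urn in urns:
    subprocess.run(['datahub', 'delete', '--urn', urn, '--soft'], check=True)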
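
The query above returns at most 100 results (count: 100), so a larger catalog needs paging through start and count. The following sketch shows the loop on its own; collect_urns and fake_fetch are hypothetical names, and fake_fetch stands in for a function that would run the GraphQL query for a given start and count and return the URNs of that page:

```python
def collect_urns(fetch_page, page_size=100):
    """Call fetch_page(start, count) until a short or empty page is returned."""
    urns = []
    start = 0
    while True:
        page = fetch_page(start, page_size)
        urns.extend(page)
        if len(page) < page_size:
            return urns
        start += page_size


# Stand-in for the GraphQL call: serves slices of an in-memory list.
fake_index = [f"urn:li:chart:(metabase,{i})" for i in range(250)]


def fake_fetch(start, count):
    return fake_index[start:start + count]


all_urns = collect_urns(fake_fetch)
print(len(all_urns))  # 250
```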

Important Considerations

  • Dry Run: Always perform a dry run first to ensure you are deleting the correct entities.
  • Batch Processing: If you have a large number of entities, consider batching the deletions to avoid overloading the system.
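
Both considerations can be built into the deletion step. Below is a sketch under a few assumptions: batched and soft_delete are hypothetical helpers, the batch size and pause are arbitrary, and the runner parameter exists only so the commands can be recorded instead of executed while testing.

```python
import subprocess
import time


def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def soft_delete(urns, dry_run=True, batch_size=50, pause_seconds=0.0,
                runner=subprocess.run):
    """Soft-delete URNs one at a time, pausing between batches.

    Returns the list of commands that were issued.
    """
    issued = []
    for batch in batched(urns, batch_size):
        for urn in batch:
            cmd = ["datahub", "delete", "--urn", urn, "--soft"]
            if dry_run:
                cmd.append("--dry-run")
            runner(cmd, check=True)
            issued.append(cmd)
        time.sleep(pause_seconds)  # give the server a break between batches
    return issued


# Record the commands instead of executing them.
recorded = []
issued = soft_delete(
    ["urn:li:chart:(metabase,3813)"],
    runner=lambda cmd, check: recorded.append(list(cmd)),
)
print(recorded)
```

Because dry_run defaults to True, the real deletion only happens when the caller passes dry_run=False after reviewing the dry-run output.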

Would you like more detailed information on any specific part of this process?
