Addressing Stale Views in DataHub through Stateful Ingestion

Original Slack Thread

Hey team. We are using DataHub version 0.13.1.3. While ingesting Postgres data into DataHub, I also allowed views to be ingested. After ingestion, I realised that I wanted to remove them, so I set include_views to False (along with the related view lineage options). I also enabled stateful_ingestion so that stale data would be removed. However, after ingestion, no stale data was removed. I expected the views to be removed, since they were “stale” in the new ingestion run with include_views: False set. Could someone help me with this?

Hey there! :wave:

<@U06TM7M40RK> Can you help with the questions in this thread?

To address the issue of stale views not being removed after setting include_views: False and enabling stateful_ingestion, here are some key points to consider:

  1. Stateful Ingestion Configuration: Ensure that stateful ingestion is properly configured in your ingestion recipe. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/stateful.md|Stateful Ingestion documentation>, you need to enable stateful ingestion and set remove_stale_metadata to true.

  2. Pipeline Name: Stateful ingestion relies on pipeline_name to track state between runs. Ensure that pipeline_name is consistent across runs; if it changes, the previous state will not be available for comparison.

Here is an example configuration for enabling stateful ingestion for Postgres:

source:
  type: "postgres"
  config:
    username: <user_name>
    password: <password>
    host_port: <host_port>
    database: <database_name>
    include_tables: True
    include_views: False
    stateful_ingestion:
      enabled: True
      remove_stale_metadata: True
pipeline_name: "your_pipeline_name"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'

  3. Soft Deletion: If you have already ingested views without stateful ingestion and then disabled the inclusion of views, you can perform a one-time soft delete of all entities for the sources as follows:

datahub delete --platform postgres --soft

After performing the soft delete, run the ingestion with stateful ingestion enabled. Entities that still exist in the source will reappear in the UI, and the ones that are no longer ingested will remain soft-deleted.
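
To illustrate the pipeline_name point above, here is a minimal Python sketch of state that is keyed by pipeline name. It is not DataHub's actual implementation: the `checkpoints` dictionary, the `save_state` and `load_previous_state` helpers, and the pipeline and entity names are all hypothetical stand-ins chosen for illustration.

```python
from typing import Dict, Optional, Set

# Hypothetical in-memory stand-in for the state kept between ingestion runs.
# Keys are pipeline names; values are the entity names seen in the last run.
checkpoints: Dict[str, Set[str]] = {}


def save_state(pipeline_name: str, entities: Set[str]) -> None:
    """Record the entities produced by the latest run of this pipeline."""
    checkpoints[pipeline_name] = set(entities)


def load_previous_state(pipeline_name: str) -> Optional[Set[str]]:
    """Return the last recorded entities, or None if this name has no state."""
    return checkpoints.get(pipeline_name)


# A first run under one pipeline name records a table and a view.
save_state("postgres_prod", {"public.orders", "public.orders_view"})

# Same name on the next run: the previous state is found.
same_name = load_previous_state("postgres_prod")

# Different name on the next run: there is nothing to compare against.
renamed = load_previous_state("postgres_prod_v2")

print(same_name is not None)
print(renamed)
```

In this sketch, a run under a renamed pipeline behaves like a first run, which is why a changed pipeline_name leaves nothing to mark as stale.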

Would you like more detailed steps or further assistance on this?

<@U04RU5J1XU2>, was stateful ingestion enabled when you did the initial ingestion?
Stateful ingestion compares the previous run's entities with the current run's and removes those that existed only in the previous run.
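
The comparison described above can be sketched as a set difference. This is a simplified illustration in Python rather than DataHub's real code: the `find_stale` function and the entity names are hypothetical, and the sketch assumes that a run with no recorded previous state removes nothing.

```python
from typing import Optional, Set


def find_stale(previous: Optional[Set[str]], current: Set[str]) -> Set[str]:
    """Return entities that were in the previous run but not in the current one.

    Assumption for this sketch: with no previous state there is nothing to
    compare against, so nothing is treated as stale.
    """
    if previous is None:
        return set()
    return previous - current


# Entities from a run that included views, and from a later run without them.
run_with_views = {"public.orders", "public.orders_view"}
run_without_views = {"public.orders"}

# No state was recorded for the first run: the view is not detected as stale.
no_state_result = find_stale(None, run_without_views)

# State was recorded for the first run: the view is detected as stale.
with_state_result = find_stale(run_with_views, run_without_views)

print(no_state_result)
print(with_state_result)
```

Under these assumptions, the first case matches the situation in this thread: the views were ingested before any state was recorded, so the later run had nothing to diff against.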

Ahh yes. Could be the reason. I didn’t have it enabled :slightly_smiling_face:

I deleted everything and reingested it with stateful ingestion enabled :slightly_smiling_face:
