Hey team. We are using DataHub version 0.13.1.3. While ingesting Postgres data into DataHub, I also allowed views to be ingested. After the ingestion, I realised that I wanted to remove them, so I set include_views to False (along with the related view lineage options). I also enabled stateful_ingestion in order to remove the stale data. However, after the next ingestion run, no stale data was removed. I expected the views to be removed, since they were “stale” in the new run with include_views: False set. Could someone help me with this?
Hey there!
<@U06TM7M40RK> Can you help with the questions in this thread?
To address the issue of stale views not being removed after setting include_views: False and enabling stateful_ingestion, here are some key points to consider:
- Stateful Ingestion Configuration: Ensure that stateful ingestion is properly configured in your ingestion recipe. According to the <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/stateful.md|Stateful Ingestion documentation>, you need to enable stateful ingestion and set remove_stale_metadata to true.
- Pipeline Name: Stateful ingestion relies on the pipeline_name to track state. Ensure that the pipeline_name is consistent across runs. If the pipeline_name is changed, the previous state will not be available for comparison.
Here is an example configuration for enabling stateful ingestion for Postgres:
source:
  type: "postgres"
  config:
    username: <user_name>
    password: <password>
    host_port: <host_port>
    database: <database_name>
    include_tables: True
    include_views: False
    stateful_ingestion:
      enabled: True
      remove_stale_metadata: True
pipeline_name: "your_pipeline_name"
sink:
  type: "datahub-rest"
  config:
    server: 'http://localhost:8080'
- Soft Deletion: If you have already ingested views without stateful ingestion and then disabled the inclusion of views, you can perform a one-time soft delete of all entities for the sources as follows:
datahub delete --platform postgres --soft
After performing the soft delete, run the ingestion again with stateful ingestion enabled. Entities that still exist in the source will reappear in the UI, and the ones that are no longer ingested will remain soft-deleted.
Would you like more detailed steps or further assistance on this?
Sources:
- <https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/docs/dev_guides/stateful.md|Stateful Ingestion documentation>
<@U04RU5J1XU2>, was stateful ingestion enabled when you did the initial ingestion?
Stateful ingestion compares the entities from the previous run with those from the current run and removes the ones that existed only in the previous run.
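To make that comparison concrete, here is a minimal Python sketch. It is not DataHub's actual implementation: the `checkpoints` dict, the `run_ingestion` function, the pipeline name, and the table and view names are all hypothetical. It only illustrates why a first run with no saved state cannot flag anything as stale.

```python
from typing import Dict, Set

# Hypothetical in-memory stand-in for the state kept per pipeline_name.
checkpoints: Dict[str, Set[str]] = {}


def run_ingestion(pipeline_name: str, emitted_urns: Set[str]) -> Set[str]:
    """Return the URNs that this run would treat as stale."""
    previous = checkpoints.get(pipeline_name)
    # With no earlier checkpoint there is nothing to compare against,
    # so nothing can be identified as stale.
    stale = set() if previous is None else previous - emitted_urns
    # Save the current run's entities for the next comparison.
    checkpoints[pipeline_name] = set(emitted_urns)
    return stale


# Illustrative URNs (made-up database, schema, and object names).
table = "urn:li:dataset:(urn:li:dataPlatform:postgres,db.public.orders,PROD)"
view = "urn:li:dataset:(urn:li:dataPlatform:postgres,db.public.orders_view,PROD)"

first = run_ingestion("postgres_prod", {table, view})  # tables and views
second = run_ingestion("postgres_prod", {table})       # views excluded
```

In this sketch `first` is empty because no state existed yet, while `second` contains only the view. Views ingested before any state was recorded never enter the comparison, which matches the behaviour described in this thread.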
Ahh yes, that could be the reason. I didn’t have it enabled.
I deleted everything and re-ingested it with stateful ingestion enabled.