Searching Users by Email IDs on Datahub and Deleting Ingested Data

Original Slack Thread

Hi All,

Can users be searched by their email IDs? Right now I am not able to search for users on DataHub.

Hey there, it looks like we do index users based on their emails in search: https://github.com/datahub-project/datahub/blob/3acd25ba1d2881597e5a0574331b6b81f7375d94/metadata-models/src/main/pegasus/com/linkedin/identity/CorpUserInfo.pdl#L37-L44

I took a look at the <DataHub site> users and groups feature, and after typing and deleting a character I can see the users that are in the system. I’m wondering whether your user ingestion worked correctly?

Do you see them on your users page? (the equivalent of this page: https://demo.datahubproject.io/settings/identities/users for your deployment)

I do see it, thanks <@U04UKA5L5LK> !

I do have one more question: does deleting an ingestion remove all of its data from DataHub? For example, if I synced a Snowflake DB that I no longer want in DataHub, does deleting the ingestion get rid of the data?

If not, what’s the best way to delete it? cc: <@U04UKA5L5LK>

Hey, we should have a script to do bulk deletes like this! Tagging <@U04N9PYJBEW> who would be the most familiar.

Deleting an ingestion source will not remove any of the data associated with it. See https://datahubproject.io/docs/next/how/delete-metadata/#delete-cli-usage for information on the delete CLI, which can perform this bulk deletion. Very soon we’ll support deleting all urns within a container, like a Snowflake database, assuming you’ve ingested those urns recently (within roughly the past month). If you really need to delete data from just one specific ingestion run and can’t use the filters described in that doc, then let me know; you’ll need to run a more complex script.

Thanks for the response! I need to delete data from a specific ingestion.

Did you use stateful ingestion (via stateful_ingestion: enabled in your recipe) for that ingestion source? If not, we may need to find a workaround

`stateful_ingestion: enabled` is there in my recipe. Sorry for the delayed response; I have been unwell.

You can run something like this:

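```
import time

import progressbar
from datahub.ingestion.graph.client import DataHubGraph, DatahubClientConfig

pipeline_name = "<pipeline_name>"
platform = "<platform>"  # assumption: the source platform, e.g. "snowflake"; not defined in the original snippet
graph = DataHubGraph(DatahubClientConfig(server=..., token=...))  # fill in your server URL and token
# Fetch the most recent stateful-ingestion checkpoint for this pipeline.
checkpoint = graph.get_latest_pipeline_checkpoint(pipeline_name, platform)
if checkpoint:
    urns = checkpoint.state.urns
    timestamp = int(time.time() * 1000)
    run_id = f"soft-delete-by-pipeline-{timestamp}"
    # Soft-delete every urn recorded in the checkpoint.
    for urn in progressbar.progressbar(urns):
        graph.soft_delete_urn(urn, run_id=run_id)
```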
Where `pipeline_name` is the name of the ingestion source you are deleting. This may be specified in the recipe, but if not, then you can find it in your logs after the line:
> `Committing ingestion checkpoint for pipeline`
It should look something like `urn:li:dataHubIngestionSource:<uuid>`
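
Since a loop like the one above soft-deletes every urn in the checkpoint, it may be worth previewing the list first. Below is a minimal sketch of one way to wrap that loop with a dry-run switch. The helper name `soft_delete_urns` and its `dry_run` parameter are hypothetical (they are not part of DataHub); the only client call it makes is `soft_delete_urn(urn, run_id=...)`, exactly as used in the snippet above.

```python
import time
from typing import Iterable, List


def soft_delete_urns(graph, urns: Iterable[str], dry_run: bool = True) -> List[str]:
    """Hypothetical helper: soft-delete each urn, or only list them when dry_run is True.

    `graph` is assumed to be any object exposing soft_delete_urn(urn, run_id=...),
    such as the DataHubGraph client from the snippet above.
    """
    # Same run_id scheme as the snippet above, so all deletes share one identifier.
    run_id = f"soft-delete-by-pipeline-{int(time.time() * 1000)}"
    affected: List[str] = []
    for urn in urns:
        if not dry_run:
            graph.soft_delete_urn(urn, run_id=run_id)
        affected.append(urn)
    return affected
```

With this, `soft_delete_urns(graph, checkpoint.state.urns)` would only return the urns that would be affected, and passing `dry_run=False` would perform the soft deletes.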

Thank you! We will try this