Searching Users by Email IDs on Datahub and Deleting Ingested Data

Original Slack Thread

Hi All,

Can users be searched by their email IDs? Right now I am not able to search for users on DataHub.

Hey there, it looks like we do index users by their emails in search: […]-models/src/main/pegasus/com/linkedin/identity/CorpUserInfo.pdl

I took a look at the users and groups feature on the <DataHub site>, and after typing and deleting a character, I can see the users that are in the system. I’m wondering if your user ingestion worked correctly?

Do you see them on the users page of your own deployment?

I do see it, thanks <@U04UKA5L5LK> !

I do have one more question: does deleting an ingestion remove all of its data from DataHub? For example, if I synced a Snowflake DB that I no longer want, does deleting the ingestion get rid of the data?

If not, what’s the best way to delete that data? cc: <@U04UKA5L5LK>

Hey, we should have a script to do bulk deletes like this! Tagging <@U04N9PYJBEW> who would be the most familiar.

Deleting an ingestion source will not remove any of the associated data. See the delete CLI documentation for information on how to perform this bulk deletion. Very soon we’ll be able to support deleting all urns within a container, like a Snowflake database, assuming you’ve ingested those urns recently (within roughly the past month). If you really need to delete data from just one specific ingestion run and can’t use the filters described in that doc, then let me know; you’ll need to run a more complex script.

Thanks for the response! I need to delete data from a specific ingestion.

Did you use stateful ingestion (via `stateful_ingestion: enabled` in your recipe) for that ingestion source? If not, we may need to find a workaround.

Yes, `stateful_ingestion: enabled` is in my recipe. Sorry for the delayed response; I have been unwell.

You can run something like this:

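```python
import time

import progressbar  # provided by the progressbar2 package
from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

pipeline_name = "<pipeline_name>"
platform = "<platform>"  # placeholder, e.g. "snowflake"; the original snippet left this undefined
graph = DataHubGraph(DatahubClientConfig(server=..., token=...))  # fill in your server URL and token
checkpoint = graph.get_latest_pipeline_checkpoint(pipeline_name, platform)
if checkpoint:
    urns = checkpoint.state.urns
    timestamp = int(time.time() * 1000)
    run_id = f"soft-delete-by-pipeline-{timestamp}"
    # Soft-delete every urn recorded in the latest checkpoint.
    for urn in progressbar.progressbar(urns):
        graph.soft_delete_urn(urn, run_id=run_id)
```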
Where `pipeline_name` is the name of the ingestion source whose data you are deleting. This may be specified in the recipe, but if not, then you can find it in your logs after the line:
> `Committing ingestion checkpoint for pipeline`
It should look something like `urn:li:dataHubIngestionSource:<uuid>`.
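
If you have the ingestion logs saved as text, a small helper along these lines could pull the pipeline name out for you. This is only a sketch: `find_pipeline_name` is a hypothetical helper, and it assumes the urn appears on the same line as the `Committing ingestion checkpoint for pipeline` message or on a later one, with a uuid made of hex digits and dashes. Check it against your actual log output before relying on it.

```python
import re
from typing import Optional

# Log message quoted in the thread above.
MARKER = "Committing ingestion checkpoint for pipeline"
# Assumption: the uuid part consists only of hex digits and dashes.
URN_PATTERN = re.compile(r"urn:li:dataHubIngestionSource:[0-9a-fA-F-]+")


def find_pipeline_name(log_text: str) -> Optional[str]:
    """Return the first ingestion-source urn at or after the marker line, if any."""
    seen_marker = False
    for line in log_text.splitlines():
        if MARKER in line:
            seen_marker = True
        if seen_marker:
            match = URN_PATTERN.search(line)
            if match:
                return match.group(0)
    return None
```

For example, `find_pipeline_name(open("ingestion.log").read())` would return the urn to use as `pipeline_name`, or `None` if the marker line never shows up (`ingestion.log` is just an illustrative file name).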

Thank you! We will try this
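
Before running the deletion for real, it may be worth previewing which urns would be affected. The wrapper below is a minimal sketch of that idea, not part of the DataHub SDK: `soft_delete_urns` and its `dry_run` flag are hypothetical names, and the only thing it assumes about `graph` is that it exposes the `soft_delete_urn(urn, run_id=...)` method used in the snippet above.

```python
import time
from typing import Any, Iterable, List


def soft_delete_urns(graph: Any, urns: Iterable[str], dry_run: bool = True) -> List[str]:
    """Soft-delete each urn through graph.soft_delete_urn; with dry_run, only list them."""
    run_id = f"soft-delete-by-pipeline-{int(time.time() * 1000)}"
    targets = sorted(set(urns))  # de-duplicate and give a stable order for review
    for urn in targets:
        if dry_run:
            print(f"[dry run] would soft-delete {urn}")
        else:
            graph.soft_delete_urn(urn, run_id=run_id)
    return targets
```

You could call `soft_delete_urns(graph, checkpoint.state.urns)` first to review the list, then repeat the call with `dry_run=False` once it looks right.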