Checking Dataset URN Existence before Ingestion into DataHub

Original Slack Thread

#metadata-ingestion
Hello everyone. I’m implementing a version of this Python snippet to add a tag to a dataset:
https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/examples/library/dataset_add_tag.py
I’m expanding this to cover various datasets and tables and attach different tags to them, but I noticed that if the URN of the dataset/table doesn’t exist (meaning that I haven’t previously ingested the corresponding dataset/table), DataHub will create the asset. I don’t want this behaviour: I want to check that the dataset_urn given in the code (line 21) is a URN that already exists in my DataHub instance. How can I achieve that?

Hey there! :wave: Make sure your message includes the following information if relevant, so we can help more effectively!

  1. Are you using UI or CLI for ingestion?
  2. Which DataHub version are you using? (e.g. 0.12.0)
  3. What data source(s) are you integrating with DataHub? (e.g. BigQuery)

Version of DataHub: 0.13
Data Source: BigQuery

You might use GraphQL with a dataset query. It returns a dataset by URN, and if it does not find anything, you know the dataset does not exist.

Hello David, can you give an example? I’m working with the Python SDK and I have no idea whether I can do what you are suggesting with Python.

We are just doing a POST request against the /api/graphql endpoint. Something like this:

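```
import json
import requests

# graphql_endpoint and headers are defined elsewhere.
# Assumption: the query opening and kwargs (e.g. {"urn": dataset_urn}) are reconstructed.
payload = {"query": """
query ($urn: String!) {
  dataset(urn: $urn) {
    urn
    name
  }
}""".strip(), "variables": kwargs}
response = requests.post(
    url=graphql_endpoint,
    data=bytes(json.dumps(payload), "utf-8"),
    headers=headers,
    timeout=120,
)
response.raise_for_status()
```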
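
To turn that response into a yes/no answer, something like the following might work. It is only a sketch: it assumes, as described earlier in the thread, that the `dataset` field comes back as null when the URN is unknown, which is worth verifying against your DataHub version. The helper name `dataset_exists_in_response` is hypothetical.

```python
from typing import Any, Mapping


def dataset_exists_in_response(body: Mapping[str, Any]) -> bool:
    """Return True if the parsed GraphQL response body contains a dataset.

    Assumes an unknown URN yields {"data": {"dataset": null}}.
    """
    if body.get("errors"):
        # GraphQL reports query errors in the body, often with HTTP 200,
        # so raise_for_status() alone would not catch them.
        raise RuntimeError(f"GraphQL errors: {body['errors']}")
    data = body.get("data") or {}
    return data.get("dataset") is not None
```

It would be called as `dataset_exists_in_response(response.json())` after the POST request.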

You can also use graph.exists(urn)
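
A minimal sketch of how `graph.exists(urn)` could gate the tagging step, assuming the `acryl-datahub` Python package is installed. The server address and the helper names `connect` and `split_by_existence` are illustrative and not from the thread.

```python
from typing import Iterable, List, Optional, Tuple


def split_by_existence(graph, urns: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition URNs into (existing, missing) using graph.exists(urn).

    `graph` is expected to be a DataHubGraph, but any object with an
    exists(urn) -> bool method works.
    """
    existing: List[str] = []
    missing: List[str] = []
    for urn in urns:
        (existing if graph.exists(urn) else missing).append(urn)
    return existing, missing


def connect(server: str, token: Optional[str] = None):
    """Build a DataHubGraph client for the given GMS server (assumed URL)."""
    # Imported lazily so split_by_existence can be used without DataHub installed.
    from datahub.ingestion.graph.client import DatahubClientConfig, DataHubGraph

    return DataHubGraph(DatahubClientConfig(server=server, token=token))


# Example usage (requires a reachable DataHub instance):
# graph = connect("http://localhost:8080")
# existing, missing = split_by_existence(graph, [dataset_urn])
```

Tags would then be emitted only for the URNs in `existing`, while those in `missing` are logged or skipped, so no new assets get created by accident.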

Ah nice. We implemented it at a time when the graph interface wasn’t ready, I think :sweat_smile: