Limiting Datahub Ingestion Access in Athena and Usage of Metadata

Original Slack Thread

Hello all! We have one question regarding the Athena ingestion. Does the Athena ingestion need to be able to run queries against every dataset on the lake? Where can I find a list of all the queries that are being run by the Athena ingestion? Thank you!

It is in our interest to limit the access of the DataHub ingestion to the minimum possible, and not just grant access to everything. So more concretely, my question is: can we grant access to metadata only, and not to data, for this ingestion?

Looking at the code, it seems that the only query done is this one? https://github.com/datahub-project/datahub/blob/master/metadata-ingestion/src/datahub/ingestion/source/sql/athena.py#L76C1-L76C66

This is the only query run directly from DataHub.
However, DataHub uses PyAthena to retrieve all kinds of information about Athena entities, so there are definitely many more queries.
I can’t tell you which queries are necessary, but have you seen the prerequisites section in the docs?
https://datahubproject.io/docs/generated/ingestion/sources/athena/#prerequisities
This might give you an idea of what is required and what can be restricted for your use case.

Ahh I see. Thank you! I would still hope it is not necessary to give the DataHub role access to the data, if possible.

It might be possible as long as you don’t want to use the profiling feature, but I can’t tell you for sure :slightly_smiling_face:
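
For readers following along: profiling is the feature that would need to read the data itself, so a metadata-only setup would presumably leave it switched off. Below is a minimal sketch of an Athena recipe with profiling disabled, written as a Python dict and run through DataHub's `Pipeline` API. The region, work group, S3 result location, and server URL are placeholders, and the config field names are assumed from recent DataHub releases, so check them against the docs for your version.

```python
from typing import Any, Dict


def build_athena_recipe(
    aws_region: str,
    work_group: str,
    query_result_location: str,
    datahub_server: str,
) -> Dict[str, Any]:
    """Build an ingestion recipe for Athena with profiling turned off.

    All argument values are placeholders supplied by the caller; the field
    names are assumptions and may differ between DataHub versions.
    """
    return {
        "source": {
            "type": "athena",
            "config": {
                "aws_region": aws_region,
                "work_group": work_group,
                "query_result_location": query_result_location,
                # Profiling reads table data, so keep it disabled when the
                # ingestion role is meant to see metadata only.
                "profiling": {"enabled": False},
            },
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": datahub_server},
        },
    }


def run_ingestion(recipe: Dict[str, Any]) -> None:
    """Run the recipe; requires the DataHub ingestion package with Athena support."""
    # Imported lazily so the recipe can be built and inspected without DataHub installed.
    from datahub.ingestion.run.pipeline import Pipeline

    pipeline = Pipeline.create(recipe)
    pipeline.run()
    pipeline.raise_from_status()
```

Calling `run_ingestion(build_athena_recipe(...))` with real values would start the ingestion; whether it succeeds without SELECT access depends on the permissions discussed in this thread.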

I will try to dig deeper into this. Thanks a lot!

Would appreciate a follow-up, in case you get it working :blush:

So it seems that I was wrong. In our org, we use Lake Formation to manage access to our lake. It seems that the ingestion only needs DESCRIBE permissions and not SELECT permissions. For some reason we had the wrong idea about this. In fact, I remember us testing this in one of the earlier versions (around v0.10), and the Athena ingestion failed without the SELECT permissions. Super cool that this is no longer necessary!
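
For anyone wanting to try the same setup, here is a minimal sketch of DESCRIBE-only grants using boto3's Lake Formation client. The role ARN, database name, and region are placeholders. The thread does not confirm that exactly these two grants (database plus all tables in it) are sufficient for every DataHub version, so treat this as a starting point rather than a verified policy.

```python
from typing import Any, Dict, List


def describe_only_grants(principal_arn: str, database: str) -> List[Dict[str, Any]]:
    """Build Lake Formation grant requests that give DESCRIBE but not SELECT.

    One request covers the database itself, the other covers every table in
    it through a table wildcard. Both arguments are caller-supplied placeholders.
    """
    principal = {"DataLakePrincipalIdentifier": principal_arn}
    return [
        {
            "Principal": principal,
            "Resource": {"Database": {"Name": database}},
            "Permissions": ["DESCRIBE"],
        },
        {
            "Principal": principal,
            "Resource": {"Table": {"DatabaseName": database, "TableWildcard": {}}},
            "Permissions": ["DESCRIBE"],
        },
    ]


def apply_grants(grants: List[Dict[str, Any]], region_name: str) -> None:
    """Send each grant to Lake Formation; requires boto3 and AWS credentials."""
    # Imported lazily so the grant requests can be built and reviewed offline.
    import boto3

    client = boto3.client("lakeformation", region_name=region_name)
    for grant in grants:
        client.grant_permissions(**grant)
```

Building the requests separately from applying them makes it easy to review exactly what the ingestion role will receive before anything is changed in the account.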