Hello all! We have one question regarding the Athena ingestion. Does the Athena ingestion need to be able to run queries against every dataset on the lake? Where can I find a list of all the queries that are being run by the Athena ingestion? Thank you!
It is in our interest to limit the access of the Datahub ingestion to the minimum possible, and not just grant access to everything. So, more concretely, my question is: is it possible to grant access only to metadata, and not to data, for this ingestion?
This is the only query run directly from datahub.
However, datahub uses PyAthena to retrieve all kinds of information about Athena entities, so there are definitely many more queries.
I can’t tell you which queries are necessary, but have you seen the prerequisites section in the docs? https://datahubproject.io/docs/generated/ingestion/sources/athena/#prerequisities
This might give you an idea of what is required and what can be restricted for your use case.
So it seems that I was wrong. In our org, we use Lake Formation to manage access to our lake. It seems that the ingestion only needs DESCRIBE permissions and not SELECT permissions. For some reason we had the wrong idea about this. In fact, I remember us testing this in one of the earlier versions (around v0.10), and the Athena ingestion failed without the SELECT permissions. Super cool that this is no longer necessary.
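For anyone landing here later, this is a rough sketch of what a metadata-only (DESCRIBE, no SELECT) Lake Formation grant could look like with boto3. It is not taken from the DataHub docs. The role ARN, database names, and region are placeholders, and the helper names `build_describe_grants` and `apply_grants` are made up for illustration. It only covers the Lake Formation side; any IAM permissions listed in the prerequisites section of the docs would still need to be handled separately.

```python
from __future__ import annotations

from typing import Any


def build_describe_grants(principal_arn: str, databases: list[str]) -> list[dict[str, Any]]:
    """Build Lake Formation grant requests that give DESCRIBE only (no SELECT).

    Hypothetical helper: returns keyword arguments for the boto3
    Lake Formation `grant_permissions` call, two per database.
    """
    principal = {"DataLakePrincipalIdentifier": principal_arn}
    requests: list[dict[str, Any]] = []
    for db in databases:
        # DESCRIBE on the database itself.
        requests.append({
            "Principal": principal,
            "Resource": {"Database": {"Name": db}},
            "Permissions": ["DESCRIBE"],
        })
        # DESCRIBE on every table in that database (wildcard), still no SELECT.
        requests.append({
            "Principal": principal,
            "Resource": {"Table": {"DatabaseName": db, "TableWildcard": {}}},
            "Permissions": ["DESCRIBE"],
        })
    return requests


def apply_grants(requests: list[dict[str, Any]], region_name: str) -> None:
    """Send the grants to Lake Formation. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("lakeformation", region_name=region_name)
    for request in requests:
        client.grant_permissions(**request)
```

Usage would be something like `apply_grants(build_describe_grants("<ingestion role ARN>", ["<database>"]), "<region>")`. Keeping the request building separate from the API call makes it easy to review exactly what will be granted before anything is sent.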