Hello! Did something change in the Athena ingestion? We upgraded from 13.1 to 14.0.2, and now it seems like the Athena ingestion tries to run SELECT statements on our tables (we are blocking this via Lake Formation). I don't think this behaviour was there before. What was the reason for the change? Thank you
<@U06TM7M40RK> Can you help with the questions in this thread?
It appears that the behavior of Athena ingestion changed between versions 13.1 and 14.0.2, and it now runs SELECT statements on tables. This change might be related to the profiling feature, which uses SQL queries on the whole table and can be an expensive operation.
According to the <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/athena|Athena ingestion documentation>, profiling is optionally enabled via configuration and involves running SQL queries on the entire table. This could explain why you are seeing SELECT statements being executed.
Additionally, the Athena page of the DataHub docs lists the required permissions for Athena ingestion, which include athena:StartQueryExecution and athena:GetQueryResults. These permissions are necessary for running queries and retrieving results, which aligns with the behavior you are observing.
If you want to avoid running SELECT statements, you might need to disable the profiling feature in your ingestion configuration. Here is an example of how you can configure the Athena source without enabling profiling:
source:
  type: athena
  config:
    aws_region: my_aws_region
    work_group: primary
    query_result_location: "s3://my_staging_athena_results_bucket/results/"
    profiling:
      enabled: false
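If it helps to double-check a recipe before running it, here is a minimal Python sketch that mirrors the recipe above as a plain dict and reports whether profiling is switched on. `profiling_enabled` is a hypothetical helper, not part of DataHub, and it assumes that a missing `profiling` block means profiling is off.

```python
# Recipe mirroring the YAML above, expressed as a plain dict.
recipe = {
    "source": {
        "type": "athena",
        "config": {
            "aws_region": "my_aws_region",
            "work_group": "primary",
            "query_result_location": "s3://my_staging_athena_results_bucket/results/",
            "profiling": {"enabled": False},
        },
    }
}


def profiling_enabled(recipe: dict) -> bool:
    """Hypothetical helper: report whether the recipe turns profiling on.

    Assumption: a missing `profiling` block is treated as "profiling off".
    """
    config = recipe.get("source", {}).get("config", {})
    return bool(config.get("profiling", {}).get("enabled", False))


print(profiling_enabled(recipe))
```

This only inspects the profiling setting; it says nothing about other queries the source may issue during ingestion.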
Would you like more details on how to configure the ingestion to avoid running SELECT statements, or do you have any other specific questions?
Sources:
- <https://datahubproject.io/docs/0.14.0/generated/ingestion/sources/athena|Athena ingestion documentation>
We don’t have profiling enabled, maybe I should have mentioned that <@U06TM7M40RK>
Is it possible this query causes the issue on your side?
https://github.com/datahub-project/datahub/blob/fa1164aa6393a44465cbe82660ca08725c8ae93d/metadata-ingestion/src/datahub/ingestion/source/sql/athena.py#L480
We try to get the partition info for Athena tables and this was added recently
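To illustrate why a partition lookup can trip a rule that blocks SELECT: in Athena, partition values can be read by querying a table's `$partitions` view, and that is itself a SELECT statement. A rough sketch of such a query follows. `build_partitions_query` is a hypothetical helper, and the actual SQL in the linked athena.py may well differ.

```python
def build_partitions_query(schema: str, table: str, limit: int = 1) -> str:
    """Hypothetical helper: build a SELECT against Athena's "$partitions" view.

    Assumption: the partition lookup goes through the "$partitions" view;
    this is not copied from DataHub's athena.py.
    """
    # The identifier is double-quoted because it contains a "$".
    return f'SELECT * FROM "{schema}"."{table}$partitions" LIMIT {limit}'


print(build_partitions_query("my_schema", "my_table"))
```

A role that is not allowed to SELECT from the table in Lake Formation would likely see a query of this shape fail, even with profiling turned off.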
Hi <@UV14447EU>, yes, this seems to be the case. Is there any way to disable this behavior? We only need this because we don’t allow our DataHub AWS role to read actual data from our lake
I will check and add a way to disable this. Sorry for the inconvenience
No problem at all! Thanks for the quick response. Please let me know when this is fixed, so I can upgrade the CLI version.