Why this Issue is created?
In my work I've had requirements to create SQLs that run against AWS Glue Data Catalog Tables where they need to get data from the latest partition only.
Something like :-
SELECT * FROM db.table WHERE partition_col = (SELECT max(partition_col) FROM db.table)
But the issue with above approach is I've seen that pushdown predicate doesn't always read the metadata only and reads the data itself which shouldn't happen.
In Athena, to solve this the query can be written as :-
SELECT * FROM db.table WHERE partition_col = (SELECT max(partition_col) FROM "db"."table$partitions")
But there isn't a similar approach in Spark for V1 tables and the only way is using SHOW PARTITIONS but it doesn't work as a subquery.
Why this Issue is created?
In my work I've had requirements to create SQLs that run against AWS Glue Data Catalog Tables where they need to get data from the latest partition only.
Something like :-
SELECT * FROM db.table WHERE partition_col = (SELECT max(partition_col) FROM db.table)
But the issue with above approach is I've seen that pushdown predicate doesn't always read the metadata only and reads the data itself which shouldn't happen.
In Athena, to solve this the query can be written as :-
SELECT * FROM db.table WHERE partition_col = (SELECT max(partition_col) FROM "db"."table$partitions")
But there isn't a similar approach in Spark for V1 tables and the only way is using SHOW PARTITIONS but it doesn't work as a subquery.