I am running a query through the Athena Query Editor on a table in the Glue Data Catalog and would like to understand why it takes so long to do a simple select * from this data.
Our data is stored in an S3 bucket that is partitioned by year/month/day/hour, with 80 Snappy-compressed Parquet files per partition, each anywhere between 1 and 10 MB in size. When I run the following query:
select stringA, stringB, timestampA, timestampB, bigintA, bigintB
from tableA
where year='2021' and month='2' and day = '2'
It scans 700 MB but takes over 3 minutes to display the results in Athena. I feel that we have already optimized the file format and partitioning for this data, so I am unsure how else we can improve performance if we are just trying to select this data and display it in a tool like QuickSight.
The select * performance was limited by the number of files that needed to be scanned, all of which were relatively small. Repartitioning the data and removing the hour partition improved both runtime (14% reduction) and data scanned (26% reduction), because Snappy compression achieves better gains on larger files.
Source: https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/
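For reference, one way to do that kind of compaction is an Athena CTAS that rewrites the data into day-level partitions with a capped number of output files per partition. This is only a sketch: the output table name, bucket column, and S3 location below are placeholders, and a single CTAS writes at most 100 partitions, so a large backfill has to be batched:
CREATE TABLE tablea_compacted
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-bucket/tablea-compacted/',  -- placeholder location
    partitioned_by = ARRAY['year', 'month', 'day'],          -- hour partition dropped
    bucketed_by = ARRAY['stringA'],                          -- placeholder bucket column
    bucket_count = 8                                         -- ~8 larger files per partition instead of 80 small ones
) AS
SELECT stringA, stringB, timestampA, timestampB, bigintA, bigintB,
       year, month, day                                      -- partition columns go last
FROM tableA;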
Related
I have an ETL application where I read messages in 1k batches from a queue & do the following operations:
Parse the messages, create 20 CSV files for 20 different tables, and gzip-compress them
Copy these files to S3
Create temp tables and load them from S3 with the Redshift COPY command
Compare the master table against the temp table for duplicates and delete the duplicates from the master table
Copy the rows from the temp table to the master table.
Everything was working fine until we noticed a huge drop in message processing throughput. On the Redshift console I see that step 4 for certain tables is now taking more than 10 minutes (these tables have billions of rows now).
Is this expected behaviour?
i.e., as the table size grows, do the dedup operations take longer? Are there any alternatives to this pattern?
EDIT: looks like the size of the tables in question is in the TBs (5.10 TB for one of the tables)
EDIT 2: I have added a WHERE clause to step 4. Earlier this step used to scan billions of rows and TBs of data; now it scans only a couple of thousand rows and MBs of data, and the time has come down as well.
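For reference, a minimal sketch of that restricted step 4, assuming a hypothetical key column event_id and timestamp column event_time (neither appears in the original post); the extra predicate keeps the DELETE from scanning the whole multi-TB master table:
BEGIN;

DELETE FROM master_table
USING temp_table
WHERE master_table.event_id = temp_table.event_id
  AND master_table.event_time >= DATEADD(day, -2, GETDATE());  -- only look at recent rows

INSERT INTO master_table
SELECT * FROM temp_table;   -- step 5: move the deduplicated batch into the master table

DROP TABLE temp_table;

COMMIT;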
I have S3 with compressed JSON data partitioned by year/month/day.
I was thinking that it might reduce the amount of scanned data if I construct the query with filtering that looks something like this:
...
AND year = 2020
AND month = 10
AND day >= 1
ORDER BY year, month, day DESC
LIMIT 1
Is this combination of partitioning, order and limit an effective measure to reduce the amount of data being scanned per query?
Partitioning is definitely an effective way to reduce the amount of data that is scanned by Athena. A good article that focuses on performance optimization can be found here: https://aws.amazon.com/de/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/ - and better performance mostly comes from reducing the amount of data that is scanned.
It's also recommended to store the data in a columnar format like Parquet and additionally compress it. If you store data like that, you can optimize queries by selecting only the columns you need (there is a difference between select * and select col1, col2, ... in this case).
ORDER BY definitely doesn't limit the data that is scanned: to order the rows, all of the columns in the ORDER BY clause need to be scanned. Since you have JSON as the underlying storage, it will most likely read all of the data.
LIMIT will potentially reduce the amount of data that is read, but it depends on the overall size of the data: if the limit is much smaller than the overall row count, it will help.
In general I recommend testing queries in the Athena interface in AWS; it will tell you the amount of scanned data after a successful execution. I tested on one of my partitioned tables (based on compressed Parquet):
Partition columns in the WHERE clause reduce the amount of scanned data
LIMIT further reduces the amount of scanned data in some cases
ORDER BY leads to reading all of the partitions again, because the data otherwise can't be sorted
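If the data is still JSON, one way to get the columnar format recommended above is a one-off Athena CTAS conversion to partitioned Parquet. This is only a sketch; the table names, columns, and S3 location are placeholders:
CREATE TABLE my_data_parquet
WITH (
    format = 'PARQUET',
    write_compression = 'SNAPPY',
    external_location = 's3://my-bucket/my-data-parquet/',  -- placeholder location
    partitioned_by = ARRAY['year', 'month', 'day']
) AS
SELECT col1, col2,        -- select only the columns you actually query
       year, month, day   -- partition columns must come last
FROM my_json_table;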
We are planning to use Athena as a backend service for our data (stored as partitioned Parquet files) in S3.
One of the things we are interested to find out is how adding additional columns in the WHERE clause of the query affects the query runtime.
For example, we have 10 million records in one Hive partition (partitioned on the column 'date'), and all of the queries below return the same volume - 10 million rows. Would all these queries take the same time, or does adding columns to the WHERE clause reduce the query runtime (since Parquet is a columnar format)?
I tried to test this, but the results were not consistent, as there was some queuing time as well, I guess.
select * from table where date='20200712'
select * from table where date='20200712' and type='XXX'
select * from table where date='20200712' and type='XXX' and subtype='YYY'
Parquet files contain page "indexes" (min/max statistics and Bloom filters). If you sort the data by the columns in question during insert, for example like this:
insert overwrite table mytable partition (dt)
select col1, -- some columns
       type,
       subtype,
       dt
from source_table -- placeholder source table; the original snippet omitted the FROM clause
distribute by dt
sort by type, subtype
then these indexes may work efficiently, because data with the same type and subtype will be written into the same pages, and data pages will be selected using the indexes. See some benchmarks here: https://blog.cloudera.com/speeding-up-select-queries-with-parquet-page-indexes/
Also switch on predicate pushdown: https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/cdh_ig_predicate_pushdown_parquet.html
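As a sketch, in Hive these are typically the switches involved (exact property names and defaults depend on your distribution's version, so check the Cloudera link above):
-- enable predicate pushdown so row groups / pages that cannot match the filter are skipped
SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.optimize.index.filter=true;
With those enabled, queries like the ones above that filter on type and subtype can skip pages whose min/max ranges rule the values out.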
We are using GCP logs, which are being exported into BigQuery using a log sink.
We don't have a huge amount of logs but each record seems to be fairly large.
Running a simple query seems to take a lot of time with BigQuery. We wonder whether this is normal or whether we are doing something wrong... and whether there is anything we can do to make it a bit more practical to analyze...
For example, query
SELECT
FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) as Melb_time,
jsonPayload.lg.a,
jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
ORDER BY timestamp DESC
LIMIT 100
takes
Query complete (44.2 sec elapsed, 35.2 MB processed)
Thank you!
Try adding this to your query:
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
It will filter to get tables with a TABLE_SUFFIX from within the last 3 days only - instead of having BigQuery look at each table from maybe many years of history.
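Applied to the query above, that would look something like this (same dataset and fields as in the question):
SELECT
  FORMAT_DATETIME("%Y-%m-%d %H:%M:%S", DATETIME(timestamp, "Australia/Melbourne")) as Melb_time,
  jsonPayload.lg.a,
  jsonPayload.lg.p
FROM `XXX.webapp_usg_logs.webapp_*`
WHERE _TABLE_SUFFIX > FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
ORDER BY timestamp DESC
LIMIT 100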
I have a table on Athena partitioned by day (huge table, TB of data). There's no day column on the table, at least not explicitly. I would expect that a query like the following:
select max(day) from my_table
would scan virtually no data. However, Athena reports that several hundreds of GB are scanned. Any idea why?
===== EDIT 2021-01-14 =====
I've recently bumped into this issue again. It turns out that when the underlying data is Parquet, operations on partition columns don't consume scanned data. For other data formats that I've tried (including ORC) there is an associated data cost. It doesn't make any sense to me.
I don't know the answer for a fact but I guesstimate:
Athena just does not have the optimization of looking only at the partition names when only they are queried. This is clear from its behaviour. So it scans everything.
Parquet has min/max for every column whereas ORC does it only if an index is present, AFAIU. Thus for Parquet Athena's query optimizer directs it to look directly at these rollup values, i.e., no scan is performed. It's different for ORC.
I know it is a little late to answer this question for you, Nicolas, but it is important to also keep some possible solutions here.
Unfortunately, this is the way Athena works: Athena will read all the data as a table scan just to list the partition values.
A possible workaround that works perfectly here is to use the partition metadata instead of the table data, for example:
Instead of using this syntax:
select max(day) from my_table
Try to use this syntax:
SELECT day FROM my_schema."my_table$partitions" ORDER BY day DESC LIMIT 1
This second statement reads just metadata information and returns the same value you need.
It does not depend on the format but on the compression algorithm used: mostly Snappy for ORC and GZIP for Parquet. This is what makes the difference.