How to avoid AWS Athena CTAS query creating small files?


I can't figure out what is wrong with my CTAS query: it breaks the data into small files inside each partition, even though I haven't specified any bucketing columns. Is there a way to avoid these small files and store a single file per partition? Files smaller than 128 MB cause additional overhead.
CREATE TABLE sampledb.yellow_trip_data_parquet
WITH (
format = 'PARQUET',
parquet_compression = 'GZIP',
external_location = 's3://mybucket/Athena/tables/parquet/',
partitioned_by = ARRAY['year','month']
)
AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime,'%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;

I was able to overcome the issue by adding a bucketing column, month_a. The code is below:
CREATE TABLE sampledb.yellow_trip_data_avro
WITH (
format = 'AVRO',
external_location='s3://a4189e1npss3001/Athena/internal_tables/avro/',
partitioned_by=ARRAY['year','month'],
bucketed_by=ARRAY['month_a'],
bucket_count=12
) AS SELECT
VendorID,
tpep_pickup_datetime,
tpep_dropoff_datetime,
passenger_count,
trip_distance,
RatecodeID,
store_and_fwd_flag,
PULocationID,
DOLocationID,
payment_type,
fare_amount,
extra,
mta_tax,
tip_amount,
tolls_amount,
improvement_surcharge,
total_amount,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month_a,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%Y') AS year,
date_format(date_parse(tpep_pickup_datetime, '%Y-%c-%d %k:%i:%s'),'%c') AS month
FROM sampleDB.yellow_trip_data_raw;

Athena is a distributed system, and it scales the execution of your query by a mechanism you cannot observe. It looks like it decided to use five workers for your CTAS query, which results in five files in each partition.
You could try explicitly specifying a bucket count of one (bucket_count = 1), but you might still get multiple files, if I remember correctly.
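If you go the bucketing route, the bucket count can be derived from the partition size rather than guessed. The sketch below is only an illustration: it assumes Athena writes roughly one file per bucket in each partition, and the helper name estimate_bucket_count and the 128 MB target (taken from the question) are assumptions, not anything Athena exposes.

```python
import math

# Target file size in bytes; 128 MB is the threshold mentioned in the question.
TARGET_FILE_BYTES = 128 * 1024 * 1024


def estimate_bucket_count(partition_bytes: int,
                          target_file_bytes: int = TARGET_FILE_BYTES) -> int:
    """Return a bucket count that keeps each file near target_file_bytes.

    Hypothetical helper: assumes about one output file per bucket per partition.
    """
    if partition_bytes <= 0:
        return 1
    return max(1, math.ceil(partition_bytes / target_file_bytes))


# A 50 MB partition fits in one file; a 1 GB partition needs eight.
small = estimate_bucket_count(50 * 1024 * 1024)
large = estimate_bucket_count(1024 * 1024 * 1024)
print(small, large)
```

The result would go into the bucket_count property of the CTAS statement; whether Athena then produces exactly that many files is not guaranteed.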

Related

Athena ignore LIMIT in some queries

I have a table with a lot of partitions (something that we're working on reducing)
When I query :
SELECT * FROM mytable LIMIT 10
I get :
"HIVE_EXCEEDED_PARTITION_LIMIT: Query over table 'mytable' can potentially read more than 1000000 partitions"
Why isn't the "LIMIT 10" part of the query sufficient for Athena to return a result without reading more than 1 or 3 partitions?
ANSWER :
During the query planning phase, Athena attempts to list all partitions potentially needed to answer the query.
Since Athena doesn't know which partitions actually contain data (that is, which are not empty), it adds all partitions to the list.

Athena plans a query and then executes it. During planning it lists the partitions and all the files in those partitions. However, it does not know anything about the files, how many records they contain, etc.
When you say LIMIT 10 you're telling Athena you want at most 10 records in the result, and since you don't have any grouping or ordering you want 10 arbitrary records.
However, during the planning phase Athena can't know which partitions have files in them, and how many of those files it will need to read to find 10 records. Without listing the partition locations it can't know they're not all empty, and without reading the files it can't know they're not all empty too.
Therefore Athena first has to get the list of partitions, then list each partition's location on S3, even if you say you only want 10 arbitrary records.
In this case there are so many partitions that Athena short-circuits and says that you probably didn't mean to run this kind of query. If the table had fewer partitions Athena would execute the query and each worker would read as little as possible to return 10 records and then stop – but each worker would produce up to 10 records, because a worker can't assume that other workers will return any records. Finally, the coordinator picks 10 records out of the results from all workers and returns them as the final result.
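The worker/coordinator behaviour described above can be modelled in a few lines. This is a toy simulation, not Athena's actual implementation; worker_scan, coordinator_limit and the three splits are hypothetical.

```python
from itertools import islice
from typing import Iterable, List


def worker_scan(rows: Iterable[int], limit: int) -> List[int]:
    """A worker stops reading as soon as it has `limit` rows."""
    return list(islice(rows, limit))


def coordinator_limit(worker_results: List[List[int]], limit: int) -> List[int]:
    """The coordinator keeps the first `limit` rows across all workers."""
    merged = (row for result in worker_results for row in result)
    return list(islice(merged, limit))


# Three hypothetical workers; the second one finds only empty files.
splits = [range(0, 1000), range(0), range(5000, 5004)]
per_worker = [worker_scan(split, 10) for split in splits]
final = coordinator_limit(per_worker, 10)
print([len(rows) for rows in per_worker], len(final))
```

Each worker has to apply the limit on its own because, as in the second split here, any other worker may come back empty-handed.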

LIMIT works on the display operation only, if I am not wrong, so the query will still read everything but only display 10 records.
Try to limit the data using a WHERE condition; that should solve the issue.

I think Athena's workers try to read max number of the partitions (relative to the partition size of the table) to get that random chunk of data and stop when query is fulfilled (which in your case, is the specification of the limit).
In your case, it's not even starting to execute the above process because of too many partitions involved. Therefore, if Athena is not planning your random data selection query, you have to explicitly plan it and hand it over to the execution engine.
Something like:
select * from mytable
where (
partition_column in (
select partition_column from mytable limit cast(10 * rand() as integer)
)
)
limit 100

Athena Partitioning limitation and how to best approach the problem I am describing

Here is what is happening:
I have a Lambda function which reads a file of a certain size and pushes it to a server (this is the limitation, as the server has limited TPS).
The Lambda function therefore cannot read a large file on S3.
I am doing a CTAS and calculating the bucket count myself. For example, if I have 140M records (S) and I need n records in a file of size s, my bucket count is S/s.
However, Athena complains that it cannot do more than 100 partitions (which is confusing, since I am doing bucketing and not partitioning), and my bucket count comes to 75K.
How do I handle this situation? The options I can think of are:
Have a Spark job which does the repartitioning again.
Manipulate Glue to somehow allow more than 100 partitions.
Neither approach appeals to me. There must be a simpler way.

CTAS queries are limited to writing at most 100 partitions. You need to split your query into a first CTAS writing up to 100 partitions, followed by INSERT INTO queries that each also write up to 100 partitions. An alternative approach is to create the table directly in Glue and only run INSERT INTO queries (always writing at most 100 partitions each).
Doc here https://docs.aws.amazon.com/athena/latest/ug/ctas-insert-into.html
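A minimal sketch of that split, generating one CTAS followed by INSERT INTO statements in batches of 100 partition values. The names target_table, source_table and the partition column dt are hypothetical, and SELECT * assumes the partition column is the last column of the source table, as CTAS requires.

```python
from typing import List

# Limit on partitions written per CTAS / INSERT INTO query, per the answer above.
MAX_PARTITIONS_PER_QUERY = 100


def chunk(values: List[str], size: int = MAX_PARTITIONS_PER_QUERY) -> List[List[str]]:
    """Split the partition values into batches of at most `size`."""
    return [values[i:i + size] for i in range(0, len(values), size)]


def build_statements(partition_values: List[str]) -> List[str]:
    """First batch becomes the CTAS, every later batch an INSERT INTO."""
    statements = []
    for index, batch in enumerate(chunk(partition_values)):
        in_list = ", ".join(f"'{value}'" for value in batch)
        # Hypothetical table and column names.
        select = f"SELECT * FROM source_table WHERE dt IN ({in_list})"
        if index == 0:
            statements.append(
                "CREATE TABLE target_table "
                "WITH (format = 'PARQUET', partitioned_by = ARRAY['dt']) "
                f"AS {select}"
            )
        else:
            statements.append(f"INSERT INTO target_table {select}")
    return statements


values = [f"p{i:03d}" for i in range(250)]
statements = build_statements(values)
print(len(statements))
```

The statements have to run one after another, since the INSERT INTO queries depend on the table created by the first one.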

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is a fairly high number of partitions.
The old version had a database and tables with only one partition level, say id=x. Take one table as an example, where we store payment parameters per id (product), and there are not many IDs, say around 1000-5000. When querying that table with the id in the WHERE clause, like ".. where id = 10", the queries returned pretty fast. Assume we update the data twice a day.
Lately, we've been thinking of adding another partition level for the day, like "../id=x/dt=yyyy-mm-dd/..". This means the number of partitions grows by the number of IDs every day: with 3000 IDs, we'd get approximately 3000x30=90000 partitions a month. That is rapid growth in the number of partitions.
On, say, 3 months of data (~270k partitions), we'd like a query like the following to return in at most 20 seconds or so.
select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'
This takes like a minute.
The Real Case
It turns out Athena first fetches all partitions (metadata) and S3 paths, regardless of the WHERE clause, and then filters down to the S3 paths matching the WHERE condition. The first part (fetching all S3 paths by partition) takes time proportional to the number of partitions.
The more partitions you have, the slower the query executes.
Intuitively, I expected Athena to fetch only the S3 paths matching the WHERE clause; that would be the whole point of partitioning. Maybe it fetches all paths.
Does anybody know a workaround, or are we using Athena the wrong way?
Should Athena be used only with a small number of partitions?
Edit
In order to clarify the statement above, I add a piece from support mail.
from Support
...
You mentioned that your new system has 360000 which is a huge number.
So when you are doing select * from <partitioned table>, Athena first download all partition metadata and searched S3 path mapped with
those partitions. This process of fetching data for each partition
lead to longer time in query execution.
...
Update
An issue has been opened on the AWS forums.
Thanks.

This is impossible to properly answer without knowing the amount of data, what file formats, and how many files we're talking about.
TL;DR: I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.
For any data set that grows over time you should have temporal partitioning, on date or even time, depending on query patterns. Whether you should partition on other properties depends on a lot of factors, and in the end it often turns out that not partitioning is better. Not always, but often.
Using reasonably sized (~100 MB) Parquet files can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.
When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, and will help a bit in pruning the list of partitions – so to the best of my knowledge it's not true that Athena reads all partition metadata.
When it has the partitions it will issue LIST operations to the partition locations to gather the files that are involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there are more than 1000 files in a partition, because that's the page size of S3's list operations, and multiple requests will have to be made sequentially.
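To get a feel for the listing cost, you can count the LIST requests implied by a given layout. This is a back-of-the-envelope sketch, assuming one paginated listing per partition location and the 1000-key page size mentioned above; the file counts are made up.

```python
import math
from typing import Iterable

# Keys returned per S3 LIST request, as stated above.
S3_LIST_PAGE_SIZE = 1000


def list_requests(files_per_partition: Iterable[int],
                  page_size: int = S3_LIST_PAGE_SIZE) -> int:
    """Total LIST requests needed; even an empty location costs one request."""
    return sum(max(1, math.ceil(count / page_size)) for count in files_per_partition)


# Hypothetical partitions holding 5, 1000, 1001 and 4500 files.
total = list_requests([5, 1000, 1001, 4500])
print(total)
```

Within a single partition the pages are fetched one after another, so a partition with 4500 files costs five sequential round trips before any data is read.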
With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.
Only after all of that work is done the actual query processing starts. Depending on the total number of splits and the amount of available capacity in the Athena cluster your query will be allocated resources and start executing.
If your data was in Parquet format, and there was one or a few files per partition, the count query in your question should run in a second or less. Parquet has enough metadata in the files that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.
Since it takes about a minute, I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is the time it takes to run all the list and get operations in S3.

AWS Athena MSCK REPAIR TABLE takes too long for a small dataset

I am having issues with Amazon Athena. I have a small bucket (36430 objects, 9.7 MB) with 4 levels of partitioning (my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv), but when I run the command
MSCK REPAIR TABLE db.table
it takes over 25 minutes. I plan to put terabytes of data on Athena, and I won't do it if this issue remains.
Does anybody know why it is taking so long?
Thanks in advance

MSCK REPAIR TABLE can be a costly operation, because it needs to scan the table's sub-tree in the file system (the S3 bucket). Multiple levels of partitioning can make it more costly, as it needs to traverse additional sub-directories. Assuming all potential combinations of partition values occur in the data set, this can turn into a combinatorial explosion.
If you are adding new partitions to an existing table, then you may find that it's more efficient to run ALTER TABLE ADD PARTITION commands for the individual new partitions. This avoids the need to scan the table's entire sub-tree in the file system. It is less convenient than simply running MSCK REPAIR TABLE, but sometimes the optimization is worth it. A viable strategy is often to use MSCK REPAIR TABLE for an initial import, and then use ALTER TABLE ADD PARTITION for ongoing maintenance as new data gets added into the table.
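As an illustration of the ongoing-maintenance approach, the statement for one new partition can be generated from its key/value pairs. This is a sketch: add_partition_sql is a hypothetical helper, and it assumes the Hive-style key=value path layout shown in the question.

```python
from typing import Dict


def add_partition_sql(table: str, location_root: str, partition: Dict[str, str]) -> str:
    """Build an ALTER TABLE ADD PARTITION statement for one partition.

    Assumes the data lives under <location_root>/<key>=<value>/... in key order.
    """
    spec = ", ".join(f"{key} = '{value}'" for key, value in partition.items())
    path = "/".join(f"{key}={value}" for key, value in partition.items())
    root = location_root.rstrip("/")
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{root}/{path}/'"
    )


sql = add_partition_sql(
    "db.table",
    "s3://my-bucket",
    {"p1": "ab", "p2": "cd", "p3": "ef", "p4": "gh"},
)
print(sql)
```

Running one such statement per newly written partition touches only that location, instead of walking the whole bucket as MSCK REPAIR TABLE does.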
If it's really not feasible to use ALTER TABLE ADD PARTITION to manage the partitions directly, then the execution time might be unavoidable. Reducing the number of partitions might reduce execution time, because it won't need to traverse as many directories in the file system. Of course, then the partitioning is different, which might impact query execution time, so it's a trade-off.

While the marked answer is technically correct, it doesn't address your real issue, which is that you have too many files.
I have a small bucket ( 36430 objects , 9.7 mb ) with 4 levels of
partition ( my-bucket/p1=ab/p2=cd/p3=ef/p4=gh/file.csv )
For such a small table, 36430 files creates a huge amount of overhead on S3, and the partitioning with 4 levels is super-overkill. The partitioning has hindered query performance rather than optimizing it. MSCK is slow because it is waiting for S3 listing among other things.
Athena would read the entire 9.7MB table if it were in one file faster than it would be able to list that huge directory structure.
I recommend removing the partitions completely, or if you really must have them then remove p2, p3 and p4 levels. Also consider processing it into another table to compact the files into larger ones.
Some suggest optimal file sizes are between 64MB and 4GB, which relates to the native block sizes on S3. It's also helpful to have a number of files that is some multiple of the workers in the cluster, although that is unknown with Athena. Your data is smaller than that range, so 1 or perhaps 8 files at most would be appropriate.
Some references:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/#OptimizeFileSizes
https://www.upsolver.com/blog/small-file-problem-hdfs-s3

Use Athena partition projection to manage partitions automatically.
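Partition projection is configured through table properties. The sketch below generates them for partition columns whose values are known up front, using the enum projection type; the column values are made up, and a real table may need a different projection type (for example date or integer ranges) for some columns.

```python
from typing import Dict, List


def projection_properties(enum_columns: Dict[str, List[str]]) -> Dict[str, str]:
    """Table properties enabling enum-type partition projection per column."""
    props = {"projection.enabled": "true"}
    for column, values in enum_columns.items():
        props[f"projection.{column}.type"] = "enum"
        props[f"projection.{column}.values"] = ",".join(values)
    return props


def set_tblproperties_sql(table: str, props: Dict[str, str]) -> str:
    """Render the properties as an ALTER TABLE ... SET TBLPROPERTIES statement."""
    body = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES ({body})"


# Hypothetical values for two of the partition columns from the question.
props = projection_properties({"p1": ["ab", "xy"], "p2": ["cd"]})
statement = set_tblproperties_sql("db.table", props)
print(statement)
```

With projection enabled, Athena computes the partition locations from these properties, so neither MSCK REPAIR TABLE nor ALTER TABLE ADD PARTITION is needed for the projected columns.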

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, ten 1 GB files, which might make copying to Redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything

There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:
set mapred.reduce.tasks=10;
Another way you could split into multiple output files would be to have Hive insert the results of your query into a partitioned table. This would result in at least one file per partition. For this to make sense you must have some reasonable column to partition on. For example, you wouldn't want to partition on a unique id column or you would have one file for each record. This approach will guarantee at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode; it needs to be set for this query to work).
SET hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE table_to_export_to_redshift (
id INT,
value INT
)
PARTITIONED BY (country STRING);
INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
To get more fine-grained control, you can write your own reduce script to pass to Hive and have that reduce script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, Pig, or pull them into Linux and break them apart however you like.
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: Splitting files into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.) you might see a performance loss if the files are too small (though 1 GB would be fine). Also make sure that the extra processing and developer time is worth the time you save by parallelizing your Redshift data load.
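For the last option, breaking the output apart yourself after Hive is done, here is a minimal sketch that splits one plain-text file into a fixed number of parts. It assumes the file has already been copied to the local file system; split_file and the part-NNNNN naming are hypothetical choices, not a Hive or Redshift convention.

```python
import os
import tempfile
from typing import List


def split_file(src_path: str, out_dir: str, parts: int) -> List[str]:
    """Split a text file into `parts` files, distributing lines round-robin."""
    out_paths = [os.path.join(out_dir, f"part-{i:05d}") for i in range(parts)]
    outputs = [open(path, "w", encoding="utf-8") for path in out_paths]
    try:
        with open(src_path, encoding="utf-8") as src:
            for line_no, line in enumerate(src):
                outputs[line_no % parts].write(line)
    finally:
        for out in outputs:
            out.close()
    return out_paths


# Demo on a throwaway 25-line file split into 10 parts.
with tempfile.TemporaryDirectory() as work_dir:
    source = os.path.join(work_dir, "export.txt")
    with open(source, "w", encoding="utf-8") as handle:
        handle.writelines(f"{i}\t{i * 2}\n" for i in range(25))
    paths = split_file(source, work_dir, 10)
    line_counts = [sum(1 for _ in open(path, encoding="utf-8")) for path in paths]
    print(len(paths), sum(line_counts))
```

Round-robin distribution keeps the parts close to equal in size without needing to know the line count in advance, which suits a parallel load where the slowest file determines the total time.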
