AWS Athena - how to process huge results file

I'm looking for a way to process a ~4 GB file that is the result of an Athena query, and I'm trying to find out:
Is there some way to split Athena's query result file into small pieces? As I understand it, this is not possible on the Athena side. It also doesn't seem possible to split it with Lambda - the file is too large, and s3.open(input_file, 'r') does not seem to work in Lambda :(
Is there some other AWS service that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB) to send them to an external source (POST requests)

You can use CTAS with Athena and its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to have the newly created table in a compressed format such as Parquet; however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which is charged by data scanned. The meaningful partitioning is based on your use case and the way you want to split your data. The most common way is using time partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you would like to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will see a long list of folders and files based on the selected partitions.
Note that you might end up with different file sizes based on the amount of data in each partition. If this is a problem, or you don't have any meaningful partition logic, you can add a random column to the data and partition by it:
substr(to_base64(sha256(some_column_in_your_data)), 1, 1) as partition_char
Or you can use bucketing and specify how many buckets you want:
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100)
AS SELECT ...
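If the goal is then to push each resulting piece to an external endpoint with POST requests, the follow-up step is just iterating over the files the CTAS statement wrote. A minimal sketch in Python, assuming boto3 and requests are available and that https://example.com/upload is a placeholder for your endpoint:

import boto3
import requests

s3 = boto3.client("s3")
bucket, prefix = "bucket", "folder/"  # the CTAS external_location from above

# Iterate over every object the CTAS statement wrote under the external_location
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        # POST each small output file to the external service (placeholder URL)
        requests.post("https://example.com/upload", data=body)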

You won't be able to do this with Lambda, as its memory is capped at around 3 GB and its file system storage at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is recurring, needs to be automated, and you still want to be "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
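If you would rather stay in Python than shell out to split, the equivalent row-based splitting is a short script. A rough sketch; the 50,000 rows-per-part value is only an illustrative guess and would need tuning until the parts land around 3-4 MB:

import csv

ROWS_PER_PART = 50000  # illustrative value; tune until each part is roughly 3-4 MB

def split_csv(path, rows_per_part=ROWS_PER_PART):
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return
        out, writer, part = None, None, 0
        for i, row in enumerate(reader):
            if i % rows_per_part == 0:
                if out:
                    out.close()
                part += 1
                out = open(f"part-{part:03d}.csv", "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)  # repeat the header in every part
            writer.writerow(row)
        if out:
            out.close()

split_csv("athena_result.csv")  # hypothetical local copy of the query result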

You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive), for example bytes=0-999 to get the first 1,000 bytes.
If you want to process the whole file in the same Lambda invocation you can request a range that is about what you think you can fit in memory, process it, and then request the next. Stop at the last line break you see in each chunk and prepend the remaining partial line to the next chunk. As long as you make sure that the previous chunk gets garbage collected and you don't aggregate a huge data structure you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work for the parallel case, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line you can pad each chunk with that amount (both at the beginning and end). When you read a chunk you skip ahead until you see the last line break in the start padding, and you skip everything after the first line break inside the end padding – with special handling of the first and last chunk, obviously.
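A minimal sketch of the sequential variant in Python with boto3; the bucket, key, chunk size, and process() are all placeholders:

import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "bucket", "folder/result.csv"   # placeholders
CHUNK = 100 * 1024 * 1024                     # how many bytes to pull per request

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
offset, carry = 0, b""

while offset < size:
    end = min(offset + CHUNK, size) - 1
    rng = f"bytes={offset}-{end}"             # inclusive byte range
    data = carry + s3.get_object(Bucket=BUCKET, Key=KEY, Range=rng)["Body"].read()
    # Keep everything up to the last line break; carry the partial line over
    cut = data.rfind(b"\n")
    complete, carry = (data[:cut + 1], data[cut + 1:]) if cut != -1 else (b"", data)
    for line in complete.splitlines():
        process(line)                         # placeholder for your per-row handling
    offset = end + 1

if carry:
    process(carry)                            # last line without a trailing newline

For the parallel variant you would compute the same chunk ranges up front (from the ContentLength) and hand each range to its own invocation, using the padding trick described above to handle lines that cross chunk boundaries.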

Related

How can I explicitly specify the size of the files to be split or the number of files?

Situation:
If I only specify the partition clause, the data is divided into multiple files, each less than 1 MB (~40 files).
What I am thinking of:
I want to explicitly specify the size of the files to be split or the number of files when registering data with CTAS or INSERT INTO.
I have read this article: https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Problem:
Using the bucketing method (as described in the article above) lets me specify the number of files or the file size. However, the article also says "Note: The INSERT INTO statement isn't supported on bucketed tables", and I would like to register data daily with Athena's INSERT INTO in the data mart.
What is the best way to build a partitioned data mart without compromising search efficiency? Is it best to register the data with Glue and save it as one file?

How to set an upper bound to BigQuery's extracted file parts?

Say I have a BigQuery table that contains 3M rows, and I want to export it to GCS.
What I do is the standard bq extract <flags> ... <project_id>:<dataset_id>.<table_id> gs://<bucket>/file_name_*.<extension>
I am bound by a limit on the number of rows a file (part) can have. Is there a way to set a hard limit on the size of a file part?
For example, can I require each part to be no larger than 10 MB, or even better, set the maximum number of rows allowed to go into a file part? The documentation doesn't seem to mention any flags for this purpose.
You can't do this with the BigQuery extract API.
But you can script it (perform the export a few thousand rows at a time in a loop), although you will have to pay for the processed data (only the extract itself is free). You can also set up a Dataflow job for this (but it's not free either).
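A rough sketch of the scripted approach with the Python client (google-cloud-bigquery), paging with LIMIT/OFFSET so each output file has a bounded row count; the table name, row cap, and naive CSV formatting are placeholders, and as noted above every page query is billed for the data it scans:

from google.cloud import bigquery

client = bigquery.Client()
TABLE = "project.dataset.table"   # placeholder
ROWS_PER_FILE = 100000            # hard cap on rows per output file

offset, part = 0, 0
while True:
    # NOTE: add an ORDER BY on a unique column for stable paging in real use
    rows = list(client.query(
        f"SELECT * FROM `{TABLE}` LIMIT {ROWS_PER_FILE} OFFSET {offset}"
    ).result())
    if not rows:
        break
    with open(f"export_part_{part:04d}.csv", "w") as f:
        for row in rows:
            # naive CSV formatting; use the csv module for proper quoting
            f.write(",".join(str(v) for v in row.values()) + "\n")
    offset += ROWS_PER_FILE
    part += 1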

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is a quite high number of partitions.
The old version had a database and tables with only one partition level, say id=x. Take one table, for example, where we store payment parameters per id (product); there are not that many IDs, around 1,000-5,000. When querying that table with an id in the where clause, like ".. where id = 10", the queries actually returned pretty fast. Assume we update the data twice a day.
Lately, we've been thinking of adding another partition level for the day, like "../id=x/dt=yyyy-mm-dd/..". This means the number of partitions grows by the number of IDs each day; if we have 3,000 IDs, we'd get approximately 3,000 x 30 = 90,000 partitions a month. Thus, a rapid growth in the number of partitions.
On, say, 3 months of data (~270k partitions), we'd like a query like the following to return in at most 20 seconds or so.
select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'
This takes like a minute.
The Real Case
It turns out Athena first fetches all the partitions (metadata) and S3 paths (regardless of the where clause) and then filters those S3 paths down to the ones you asked for in the where condition. The first part (fetching all S3 paths by partition) takes time proportional to the number of partitions.
The more partitions you have, the slower the query executes.
Intuitively, I expected Athena to fetch only the S3 paths matching the where clause; I mean, that would be part of the magic of partitioning. Instead it seems to fetch all paths.
Does anybody know a workaround, or are we using Athena in the wrong way?
Should Athena be used only with a small number of partitions?
Edit
To clarify the statement above, I'm adding a piece from the support mail.
from Support
...
You mentioned that your new system has 360,000 partitions, which is a huge number. So when you are doing select * from <partitioned table>, Athena first downloads all partition metadata and searches for the S3 paths mapped to those partitions. This process of fetching data for each partition leads to a longer query execution time.
...
Update
An issue has been opened on the AWS forums.
Thanks.
This is impossible to properly answer without knowing the amount of data, what file formats, and how many files we're talking about.
TL;DR I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.
For any data set that grows over time you should have temporal partitioning, on date or even time, depending on query patterns. Whether you should also partition on other properties depends on a lot of factors, and in the end it often turns out that not partitioning is better. Not always, but often.
Using reasonably sized (~100 MB) Parquet files can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.
When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, and will help a bit in pruning the list of partitions – so to the best of my knowledge it's not true that Athena reads all partition metadata.
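You can see that pruning for yourself by asking Glue for the partitions that match a filter expression, for example with boto3 (the database and table names and the predicate are placeholders):

import boto3

glue = boto3.client("glue")

# Ask Glue only for the partitions matching the query's predicate
matching = []
for page in glue.get_paginator("get_partitions").paginate(
    DatabaseName="db",                               # placeholder
    TableName="table",                               # placeholder
    Expression="id = '10' AND dt = '2019-01-01'",    # same predicate as the where clause
):
    matching.extend(page["Partitions"])

print(len(matching), "partitions would be considered for the query")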
When it has the partitions it will issue LIST operations to the partition locations to gather the files that are involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there are more than 1,000 files in a partition, because that's the page size of S3's list operations, and multiple requests will have to be made sequentially.
With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.
Only after all of that work is done the actual query processing starts. Depending on the total number of splits and the amount of available capacity in the Athena cluster your query will be allocated resources and start executing.
If your data was in Parquet format, and there was one or a few files per partition, the count query in your question should run in a second or less. Parquet has enough metadata in the files that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.
Since it takes two minutes I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is that it takes too much time to run all the list and get operations in S3.
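A quick way to check that suspicion is to list one partition's location and count the objects, for example (bucket and prefix are placeholders for one of your partition locations):

import boto3

s3 = boto3.client("s3")
BUCKET = "bucket"                              # placeholder
PREFIX = "table/id=10/dt=yyyy-mm-dd/"          # placeholder partition location

count, total_bytes = 0, 0
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        count += 1
        total_bytes += obj["Size"]

print(f"{count} files, {total_bytes / 1024 ** 2:.1f} MB in this partition")

If this prints hundreds or thousands of small files per partition, compacting them into a few ~100 MB Parquet files will likely help more than any partitioning change.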

Redshift unload's file name

I'm running a Redshift unload command, but am not getting the name I desire. The command is:
UNLOAD ('select * from foo')
TO 's3://mybucket/foo'
CREDENTIALS 'xxxxxx'
GZIP
NULL AS 'NULL'
DELIMITER as '\t'
allowoverwrite
parallel off
The result is mybucket/foo-000.gz. I don't want the slice number at the end of the file name (it'd be great if it could be eliminated completely), and I want to add a file extension at the end of the file name. I'd like to see either of the following:
mybucket/foo-000.txt.gz
mybucket/foo.txt.gz
Is there any way to do this (without writing a lambda post process renamer script)?
TL;DR
No.
Explanation:
As the Amazon Redshift UNLOAD documentation says, if you do not want the output split into several parts, you can use PARALLEL FALSE, but it is strongly recommended to leave it enabled. Even then, the file will always include the 000[.EXT] suffix (where the [EXT] exists only when compression is enabled), because there is a limit to the file size that Redshift can output, as the documentation says:
By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. The default option is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used. The maximum size for a data file is 6.2 GB. So, for example, if you unload 13.4 GB of data, UNLOAD creates the following three files.
s3://mybucket/key000 6.2 GB
s3://mybucket/key001 6.2 GB
s3://mybucket/key002 1.0 GB
Therefore, it will always add at least the 000 suffix, because Redshift doesn't know in advance what size of file it is going to output, so it adds this suffix in case the output reaches the 6.2 GB limit.
If you ask why the use of PARALLEL FALSE is not recommended, I'll try to explain it in several points:
The most important reason is the way a Redshift cluster is designed. Each cluster includes at least two servers, one of which is the leader node and the rest are data nodes. The purpose of the leader node is to control the data nodes; it holds the information necessary to work with all the data in Redshift, for both reads and writes.
When you unload data from Redshift while the PARALLEL flag is TRUE, it will create at least X files, where X is the number of data nodes you chose when constructing the Redshift cluster in the first place. This means the data is written directly by the data nodes themselves, which is much faster because it happens in parallel and skips the leader node.
When you turn this flag off, all the data is gathered from the data nodes onto a single node, the leader node, because it needs to reorganize the sorting of the output rows and also compress them, if needed, as a single stream. This causes your data to be written much more slowly.
This also significantly decreases the Redshift cluster's performance for reading and writing data, because everything (read and write queries) goes through the leader node, and as said above, when the leader node is overloaded there will be a performance problem.
The COPY and UNLOAD queries work directly with the data nodes and therefore behave almost the same as if you were using PARALLEL TRUE. By contrast, queries like SELECT, UPDATE, DELETE, and INSERT are processed by the leader node, which is why they suffer when the leader node is under load.
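If you do end up accepting the post-process rename you wanted to avoid, it is only a copy-and-delete on the S3 side rather than a real rename; a minimal sketch with boto3 (the bucket and key names are taken from the question and are placeholders):

import boto3

s3 = boto3.client("s3")
BUCKET = "mybucket"  # placeholder

# S3 has no rename operation: copy the unloaded object to the desired key, then delete the original
s3.copy_object(
    Bucket=BUCKET,
    Key="foo.txt.gz",
    CopySource={"Bucket": BUCKET, "Key": "foo-000.gz"},
)
s3.delete_object(Bucket=BUCKET, Key="foo-000.gz")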

Hive -- split data across files

Is there a way to instruct Hive to split data into multiple output files? Or maybe cap the size of the output files.
I'm planning to use Redshift, which recommends splitting data into multiple files to allow parallel loading http://docs.aws.amazon.com/redshift/latest/dg/t_splitting-data-files.html
We preprocess all our data in Hive, and I'm wondering if there's a way to create, say, ten 1 GB files, which might make copying to Redshift faster.
I was looking at https://cwiki.apache.org/Hive/adminmanual-configuration.html and https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties but I can't find anything
There are a couple of ways you could go about splitting Hive output. The first and easiest way is to set the number of reducers. Since each reducer writes to its own output file, the number of reducers you specify will correspond to the number of output files written. Note that some Hive queries will not result in the number of reducers you specify (for example, SELECT COUNT(*) FROM some_table always results in one reducer). To specify the number of reducers, run this before your query:
set mapred.reduce.tasks=10
Another way you could split into multiple output files would be to have Hive insert the results of your query into a partitioned table. This would result in at least one file per partition. For this to make sense you must have some reasonable column to partition on. For example, you wouldn't want to partition on a unique id column, or you would have one file for each record. This approach will guarantee at least one output file per partition, and at most numPartitions * numReducers. Here's an example (don't worry too much about hive.exec.dynamic.partition.mode; it just needs to be set for this query to work).
set hive.exec.dynamic.partition.mode=nonstrict;
CREATE TABLE table_to_export_to_redshift (
id INT,
value INT
)
PARTITIONED BY (country STRING);
INSERT OVERWRITE TABLE table_to_export_to_redshift
PARTITION (country)
SELECT id, value, country
FROM some_table;
To get more fine-grained control, you can write your own reduce script to pass to Hive and have that reduce script write to multiple files. Once you are writing your own reducer, you can do pretty much whatever you want.
Finally, you can forgo trying to maneuver Hive into outputting your desired number of files and just break them apart yourself once Hive is done. By default, Hive stores its tables uncompressed and in plain text in its warehouse directory (e.g., /apps/hive/warehouse/table_to_export_to_redshift). You can use Hadoop shell commands, a MapReduce job, Pig, or pull them into Linux and break them apart however you like.
I don't have any experience with Redshift, so some of my suggestions may not be appropriate for consumption by Redshift for whatever reason.
A couple of notes: splitting files into more, smaller files is generally bad for Hadoop. You might get a speed increase for Redshift, but if the files are consumed by other parts of the Hadoop ecosystem (MapReduce, Hive, Pig, etc.) you might see a performance loss if the files are too small (though 1 GB would be fine). Also make sure that the extra processing/developer time is worth the time savings you get from parallelizing your Redshift data load.