Read Large CSV from S3 using Lambda

I have multiple gzip-compressed (.gz) CSV files in S3 which I would like to parse, preferably with Lambda. The largest compressed file seen so far is 80 MB; decompressed, it is 1.6 GB. I estimate that a single uncompressed file can reach approximately 2 GB (the files are stored compressed in S3).
After parsing, I am only interested in selected rows from the CSV file. I do not expect the memory used by the filtered rows to exceed 200 MB.
However, given Lambda's limits on time (15 min) and memory (3 GB), is Lambda a feasible option for such a use case in the longer run? Are there any alternatives to consider?
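Since only the filtered rows need to stay in memory, one feasible pattern is to stream-decompress the object and filter it row by row instead of materializing the whole 1.6 GB. A minimal sketch, assuming boto3 and hypothetical bucket, key, and column names:
import csv
import gzip
import io

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Hypothetical bucket/key; replace with your own.
    obj = s3.get_object(Bucket="my-bucket", Key="data/large-file.csv.gz")
    # Wrap the streaming body so gzip/csv read it incrementally instead of
    # loading the full decompressed file into memory at once.
    with gzip.GzipFile(fileobj=obj["Body"]) as gz:
        reader = csv.DictReader(io.TextIOWrapper(gz, encoding="utf-8"))
        # Keep only the rows of interest; 'status' is a made-up column.
        selected = [row for row in reader if row.get("status") == "ACTIVE"]
    return {"matched_rows": len(selected)}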

Related

How can I explicitly specify the size of the files to be split or the number of files?

Situation:
If I only specify the partition clause, the output is divided into multiple files, each smaller than 1 MB (about 40 files).
What I am thinking of:
I want to explicitly specify the size of the files to be split or the number of files when registering data with CTAS or INSERT INTO.
I have read this article: https://aws.amazon.com/premiumsupport/knowledge-center/set-file-number-size-ctas-athena/
Problem:
Using the bucketing method (as described in the article above, and sketched at the end of this question) lets me specify the number of files or the file size. However, the article also says: "Note: The INSERT INTO statement isn't supported on bucketed tables". I would like to register data into the data mart daily with Athena's INSERT INTO.
What is the best way to build a partitioned data mart without compromising search efficiency? Is it best to register the data with Glue and save it as one file?
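For reference, a bucketed CTAS of the kind described in the linked article, issued from Python with boto3, might look roughly like the sketch below (the table, column, and bucket names plus the S3 locations are hypothetical; as the article notes, the resulting table cannot then be extended with INSERT INTO):
import boto3

athena = boto3.client("athena")

# Hypothetical names; bucketing controls the number of output files,
# but the resulting table does not support INSERT INTO afterwards.
ctas = """
CREATE TABLE datamart.daily_sales_bucketed
WITH (
    format = 'PARQUET',
    external_location = 's3://my-datamart-bucket/daily_sales/',
    bucketed_by = ARRAY['customer_id'],
    bucket_count = 10
) AS
SELECT * FROM staging.daily_sales
"""

athena.start_query_execution(
    QueryString=ctas,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)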

How to set a specific compression value in AWS Glue? If possible, can the compression level and partitions be determined manually in AWS Glue?

I am looking to ingest data from a source into S3 using AWS Glue.
Is it possible to compress the ingested data in Glue to a specified size? For example: compress the data into ~500 MB files and also be able to partition the data based on that size? If yes, how do I enable this? I am writing the Glue script in Python.
Compression and grouping are related but distinct things. Compression happens with Parquet output. However, you can use 'groupSize': '31457280' (30 MB) to specify the size of the DynamicFrame groups, which is also the size of most of the output files (the last file gets whatever remainder is left).
You also need to be careful with, and take advantage of, the Glue worker type and capacity, e.g. Maximum capacity 10, Worker type Standard.
G.2X workers tend to create too many small files (it all depends on your situation and inputs).
If you do nothing but read many small files and write them unchanged as a large group, they will be "default compressed/grouped" into the groupSize. If you want to see drastic reductions in the size of the written files, format the output as Parquet:
dyf = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    format="json",
    connection_options={
        "paths": ["s3://yourbucketname/folder_name/2021/01/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "31457280",
    },
)
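Building on the read above, a rough sketch of the corresponding Parquet write (the output path is hypothetical):
glueContext.write_dynamic_frame.from_options(
    frame=dyf,  # the grouped frame created by the read above
    connection_type="s3",
    connection_options={"path": "s3://yourbucketname/output_folder/"},
    format="parquet",
)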

S3 Select Result/Response size

The AWS documentation mentions: "The maximum length of a record in the input or result is 1 MB." https://docs.aws.amazon.com/AmazonS3/latest/dev/selecting-content-from-objects.html
However, I am able to fetch a 2.4 GB result when running an S3 Select query through a Python Lambda, and I have seen people working with even larger result sizes.
Can someone please explain the significance of the 1 MB mentioned in the AWS documentation and what it actually means?
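For context, the kind of call being discussed looks roughly like the sketch below (hypothetical bucket, key, and column names); the multi-GB result comes back as a stream of Records events rather than as one response body:
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket, key, and columns.
response = s3.select_object_content(
    Bucket="my-bucket",
    Key="data/large-file.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.id, s.amount FROM s3object s WHERE s.status = 'ACTIVE'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# The result is streamed as events; each Records payload can be processed
# without holding the full (potentially multi-GB) result in memory.
for event in response["Payload"]:
    if "Records" in event:
        chunk = event["Records"]["Payload"].decode("utf-8")
        # process chunk...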
Background:
I recently faced the same question regarding the 1 MB limit. I'm dealing with a large gzip-compressed CSV file and had to figure out whether S3 Select would be an alternative to processing the file myself. My research suggests that the author of the previous answer misunderstood the question.
The 1 MB limit referenced by the current AWS S3 Select documentation is referring to the record size:
... The maximum length of a record in the input or result is 1 MB.
The SQL query itself is not the "input" (the query has its own, smaller limit):
... The maximum length of a SQL expression is 256 KB.
Question Response:
I interpret this 1 MB limit the following way:
One row of the queried CSV file (the uncompressed input) can't be larger than 1 MB.
One result record (a result row returned by S3 Select) also can't be larger than 1 MB.
To put this in a practical perspective, consider the string size in bytes in Python, assuming UTF-8 encoding.
This means len(row.encode('utf-8')) (the string size in bytes) must be <= 1024 * 1024 bytes for each CSV row of the input file, represented as a UTF-8 encoded string.
Likewise, len(response_json.encode('utf-8')) must be <= 1024 * 1024 bytes for each returned response record (in my case the JSON result).
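A quick way to sanity-check your own data against that limit, as a sketch assuming a local uncompressed copy of the CSV named sample.csv:
# Flag rows of a local CSV copy that would exceed the 1 MB record limit.
LIMIT = 1024 * 1024

with open("sample.csv", encoding="utf-8") as f:
    oversized = [i for i, line in enumerate(f, 1)
                 if len(line.encode("utf-8")) > LIMIT]

print(f"{len(oversized)} rows exceed 1 MB: {oversized[:10]}")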
Note:
In my case, the 1 MB limit works fine. However, this depends a lot on the amount of data in your input (and potentially extra, static columns you might add via SQL).
If the 1 MB limit is exceeded and you want to query files without involving a database, the more expensive AWS Athena might be a solution.
Could you point us to the part of the documentation that talks about this 1 MB limit?
I have never seen a 1 MB limit. Downloading an object is just downloading, and you can download files of practically unlimited size.
AWS uploads files with multipart upload, which supports objects up to terabytes in size and object parts up to gigabytes.
The docs are here: https://docs.aws.amazon.com/AmazonS3/latest/dev/qfacts.html
Response to the question
As per the author's comment below my post, the limit is described here:
https://docs.aws.amazon.com/AmazonS3/latest/dev/querying-glacier-archives.html
That documentation refers to querying archived objects, so you can run a query on the data without retrieving it from Glacier first.
The input query cannot exceed 1 MB, and the output of that query cannot exceed 1 MB:
the input is the SQL query,
the output is the list of files.
Find more info here: https://docs.aws.amazon.com/amazonglacier/latest/dev/s3-glacier-select-sql-reference-select.html
So this limit is not for files but for SQL-like queries.

AWS Athena - how to process huge results file

I am looking for a way to process a ~4 GB file which is the result of an Athena query, and I am trying to find out:
Is there some way to split Athena's query result file into smaller pieces? As I understand it, this is not possible from the Athena side. It also looks like it is not possible to split it with Lambda: the file is too large, and s3.open(input_file, 'r') does not seem to work in Lambda :(
Is there some other AWS service that can solve this issue? I want to split this CSV file into small pieces (about 3-4 MB) to send them to an external source (POST requests).
You can use CTAS with Athena and its built-in partitioning capabilities.
A common way to use Athena is to ETL raw data into a more optimized and enriched format. You can turn every SELECT query that you run into a CREATE TABLE ... AS SELECT (CTAS) statement that will transform the original data into a new set of files in S3 based on your desired transformation logic and output format.
It is usually advised to have the newly created table in a compressed format such as Parquet, however, you can also define it to be CSV ('TEXTFILE').
Lastly, it is advised to partition a large table into meaningful partitions to reduce the cost of querying the data, especially in Athena, which charges by data scanned. What is meaningful depends on your use case and the way you want to split your data; the most common approach is time partitions, such as yearly, monthly, weekly, or daily. Use the logic by which you want to split your files as the partition key of the newly created table.
CREATE TABLE random_table_name
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
partitioned_by = ARRAY['year','month'])
AS SELECT ...
When you go to s3://bucket/folder/ you will have a long list of folders and files based on the selected partition.
Note that you might have different sizes of files based on the amount of data in each partition. If this is a problem or you don't have any meaningful partition logic, you can add a random column to the data and partition with it:
substr(to_base64(sha256(to_utf8(some_column_in_your_data))), 1, 1) as partition_char
(to_utf8 is needed because Athena's sha256 expects varbinary rather than varchar input)
Or you can use bucketing and provide how many buckets you want:
WITH (
format = 'TEXTFILE',
external_location = 's3://bucket/folder/',
bucketed_by = ARRAY['column_with_high_cardinality'],
bucket_count = 100
)
You won't be able to do this with Lambda as your memory is maxed out around 3GB and your file system storage is maxed out at 512 MB.
Have you tried just running the split command on the filesystem (if you are using a Unix based OS)?
If this job is reoccurring and needs to be automated and you wanted to still be "serverless", you could create a Docker image that contains a script to perform this task and then run it via a Fargate task.
As for the specifics of how to use split, this other Stack Overflow question may help:
How to split CSV files as per number of rows specified?
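If you would rather stay in Python than shell out to split, a rough equivalent (a sketch; the file names and the 4 MB target are placeholders):
CHUNK_BYTES = 4 * 1024 * 1024  # target roughly 4 MB per output piece

def split_csv(path, prefix):
    with open(path, newline="", encoding="utf-8") as src:
        header = src.readline()
        part, out, written = 0, None, 0
        for line in src:
            # Start a new output file once the current one passes the target size.
            if out is None or written > CHUNK_BYTES:
                if out:
                    out.close()
                part += 1
                out = open(f"{prefix}_{part:04d}.csv", "w", encoding="utf-8")
                out.write(header)
                written = 0
            out.write(line)
            written += len(line.encode("utf-8"))
        if out:
            out.close()

split_csv("athena_result.csv", "chunk")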
You can ask S3 for a range of the file with the Range option. This is a byte range (inclusive at both ends), for example bytes=0-999 to get the first 1000 bytes.
If you want to process the whole file in the same Lambda invocation, you can request a range that is roughly what you think will fit in memory, process it, and then request the next one. Process up to the last line break you see, and prepend the trailing partial line to the next chunk. As long as you make sure the previous chunk gets garbage collected and you don't accumulate a huge data structure, you should be fine.
You can also run multiple invocations in parallel, each processing its own chunk. You could have one invocation check the file size and then invoke the processing function as many times as necessary to ensure each gets a chunk it can handle.
Just splitting the file into equal parts won't work, though: you have no way of knowing where lines end, so a chunk may split a line in half. If you know the maximum byte size of a line, you can pad each chunk by that amount (both at the beginning and the end). When you read a chunk, you skip ahead until you see the last line break in the start padding, and you ignore everything after the first line break inside the end padding, with special handling of the first and last chunk, obviously.
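A minimal sketch of the single-invocation variant described above (hypothetical bucket and key; the 64 MB chunk size is arbitrary):
import boto3

s3 = boto3.client("s3")
BUCKET, KEY = "my-bucket", "results/query-output.csv"  # hypothetical
CHUNK = 64 * 1024 * 1024  # fetch roughly 64 MB per Range request

size = s3.head_object(Bucket=BUCKET, Key=KEY)["ContentLength"]
offset, leftover = 0, b""

while offset < size:
    end = min(offset + CHUNK, size) - 1
    body = s3.get_object(Bucket=BUCKET, Key=KEY,
                         Range=f"bytes={offset}-{end}")["Body"].read()
    data = leftover + body
    # Only process complete lines; carry the trailing partial line forward.
    cut = data.rfind(b"\n") + 1
    complete, leftover = data[:cut], data[cut:]
    for line in complete.splitlines():
        pass  # process line...
    offset = end + 1

# leftover holds the final line if the file doesn't end with a newline.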

Apache Hadoop: Insert compress data into HDFS

I need to upload 100 text files into HDFS to do some data transformation with Apache Pig.
In your opinion, what is the best option:
a) Compress all the text files and upload only one file,
b) Load all the text files individually?
It depends on your file sizes, cluster parameters, and processing methods.
If your text files are comparable in size to the HDFS block size (e.g. block size = 256 MB, file size = 200 MB), it makes sense to load them as is.
If your text files are very small, you would run into the typical HDFS small-files problem: each file occupies one HDFS block (not physically), so the NameNode (which handles the metadata) suffers some overhead from managing lots of blocks. To solve this you could either merge your files into a single one (a rough sketch follows this answer), use Hadoop archives (HAR), or use some custom file format (SequenceFiles, for example).
If a custom format is used, you will have to do extra work in processing: custom input formats will be required.
In my opinion, 100 files is not enough to significantly affect NameNode performance, so both options seem viable.
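The merge option mentioned above could look roughly like this sketch (local paths and the HDFS destination are hypothetical; it concatenates the text files locally and uploads the single result with the hdfs CLI):
import glob
import subprocess

# Merge the small text files into one local file.
with open("merged.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("input_files/*.txt")):
        with open(path, encoding="utf-8") as src:
            out.write(src.read())

# Upload the single merged file to HDFS.
subprocess.run(["hdfs", "dfs", "-put", "merged.txt", "/data/merged.txt"], check=True)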