Merge S3 files into multiple <1GB S3 files - amazon-web-services

I have multiple S3 files in a bucket.
Input S3 bucket :
File1 - 2GB data
File 2 - 500MB data
File 3 - 1Gb Data
file 4 - 2GB data
and so on. Assume there are 50 such files. Data within files is of same schema, lets say attribute1, attribute 2.
I want to merge these files and output into a new bucket as follows, such that each file is less than 1GB in same schema as before.
Files 1 - < 1GB
Files 2 - < 1GB
Files 3 - < 1GB
I am looking for AWS based solutions which I can deliver using AWS CDK. I was considering following two solutions :
AWS Athena - reads and writes to S3 but not sure if I can set up a 1GB limit while writing.
AWS Lambda - read file sequentially, store in memory, when size is near 1GB, write to new file in s3 bucket. Repeat until all files completed. I'm worried about the 15 min timeout, not sure if lambda will be able to process.
Expected scales -> Overall file input size sum : 1 TB
What would be a good way to go about implementing this? Hope I have phrased the question right, I'd be happy to comment if any doubts.
Thanks!
Edit :
Based on a comment ->
Apologies for calling it a merge. More of a reset. All files have the same schema, placed in csv files. In terms of pseudo code
List<Files> listOfFiles = ReadFromS3(key)
New file named temp.csv
for each file : listOfFiles :
append file to temp.csv
List<1GBGiles> finalList = Break down temp.csv into sets of 1GB each
for(File file : finalList)
writeToS3(finalList)

Amazon Athena can run a query across multiple objects in a given Amazon S3 path, as long as they all have the same format (eg same columns in a CSV file).
It can store the result in a new External Table, with a location pointing to an S3 bucket, by using a CREATE TABLE AS command and a LOCATION parameter.
The size of the output files can be controlled by setting the number of output buckets (which is not the same as an S3 bucket).
See:
Bucketing vs partitioning - Amazon Athena
Set the number or size of files for a CTAS query in Amazon Athena

If your process includes ETL(Extraction Transformation Load) post process, you could use AWS GLUE
Please find here an example for Glue using s3 as a source.
If you’d like to use it with Java SDK, the best starting points are:
the Glue GitHub repo
The aws Java code sample catalog for Glue
Out of all of them your the Tutorial to create a crawler (that you can find in GitHub as per above url) should match your case as it crawls an S3 bucket and put it in a glue catalog for transformation.

Related

Are parquet files splittable when stored in AWS S3?

I know that parquet files are splittable if they are stored in block storage. E.g stored on HDFS
Are they also splittable when stored in object storage such as AWS s3?
This confuses me because, object storage is supposed to be atomic. You either access the entire file or none of the file. You can't even change meta data on an S3 file without rewriting the entire file. On the other hand, AWS reccomends using splittable file formats in S3 to improve the performance of Athena and other frameworks in the hadoop ecosystem.
Yes, Parquet files are splittable.
S3 supports positioned reads (range requests), which can be used to read only selected portions of the input file (object).
I'm not 100% sure what you mean here, but generally (I think), you have parquet partition on partition keys and save columns into blocks of rows. When I have used in it AWS S3 it has saved like:
|-Folder
|--Partition Keys
|---Columns
|----Rows_1-100.snappy.parquet
|----Rows_101-200.snappy.parquet
This handles the splitting efficiencies you mention.

S3 avoid loading of duplicate files

I have the following work flow.
I need to identify duplicate files on S3 in order to avoid duplicates on my destination ( Redshift ).
Load files to S3 every 4 hours from FTP Server ( File storage structure : year/month/date/hour/minute/filename)
Load S3 to Redshift once all of the files are pulled ( for that interval )
This is a continuous job that will be running every 4 hour.
Problem :
Some times the files with same content but different file names are present on S3. These files can belong to different intervals or different days. For example if a files arrives say one.csv on 1st Oct 2018 and contains 1,2.3,4 as a content then it is possible that on 10th Oct 2018 a file may arrive with same content 1,2,3,4 but with different file name.
I want to avoid to load this file to S3 if the contents are same.
I know that I can use file hash to identify the two identical files, But my problem is how to achieve this on S3 and that too with so much of files.
What will be the best approach to proceed ?
Bascially, I want to avoid loading of data to S3 that is already present.
You can add another table in redshift ( or anywhere else actually like MySQL or dynamodb ) which will contain Etag/md5 hash of files uploaded.
You might already be having a script which is running every 4 hours and is loading data into redshift. In this same script, after data is loaded successfully into redshift; just make an entry into this table. Also, put a check in this same script(from this new table) before loading data into Redshift.
You need to make sure, that you load this new table with all the Etags of files you have already loaded into redshift.

does the pricing for s3 data transfer out of the internet includes for reading file contents

I have a web app with a download buttons to download objects from s3 buckets. I also have plot buttons to read the contents of csv files in s3 bucket using pandas read_csv to read the columns and make visualizations. I wanted to understand if the price for s3 data transfer out of the internet is only for actually download of files or it also includes just reading the contents too because the bytes are transferred over the internet in that case as well.
S3 does not operate like a file system. There is no notion of reading and writing portions of files as you would to a local or remote drive. To read a file you must always download the entire file and then read portions as needed. That is why AWS only shows pricing for data transfer.

AWS Athena Returning Zero Records from Tables Created from GLUE Crawler input csv from S3

Part One :
I tried glue crawler to run on dummy csv loaded in s3 it created a table but when I try view table in athena and query it it shows Zero Records returned.
But the demo data of ELB in Athena works fine.
Part Two (Scenario:)
Suppose I Have a excel file and data dictionary of how and what format data is stored in that file , I want that data to be dumped in AWS Redshift What would be best way to achieve this ?
I experienced the same issue. You need to give the folder path instead of the real file name to the crawler and run it. I tried with feeding folder name to the crawler and it worked. Hope this helps. Let me know. Thanks,
I experienced the same issue. try creating separate folder for single table in s3 buckets than rerun the glue crawler.you will get a new table in glue data catalog which has the same name as s3 bucket folder name .
Delete Crawler ones again create Crawler(only one csv file should be not more available in s3 and run the crawler)
important note
one CSV file run it we can view the records in Athena.
I was indeed providing the S3 folder path instead of the filename and still couldn't get Athena to return any records ("Zero records returned", "Data scanned: 0KB").
Turns out the problem was that the input files (my rotated log files automatically uploaded to S3 from Elastic Beanstalk) start with underscore (_), e.g. _var_log_nginx_rotated_access.log1534237261.gz! Apparently that's not allowed.
The structure of the s3 bucket / folder is very important :
s3://<bucketname>/<data-folder>/
/<type-1-[CSVs|Parquets etc]>/<files.[csv or parquet]>
/<type-2-[CSVs|Parquets etc]>/<files.[csv or parquet]>
...
/<type-N-[CSVs|Parquets etc]>/<files.[csv or parquet]>
and specify in the "include path" of the Glue Crawler:
s3://<bucketname e.g my-s3-bucket-ewhbfhvf>/<data-folder e.g data>
Solution: Select path of folder even if within folder you have many files. This will generate one table and data will be displayed.
So in many such cases using EXCLUDE PATTERN in Glue Crawler helps me.
This is sure that instead of directly pointing the crawler to the file, we should point it to the directory and even by doing so when we do not get any records, Exclude Pattern comes to rescue.
You will have to devise some pattern by which only the file which u want gets crawled and rest are excluded. (suggesting to do this instead of creating different directories for each file and most of the times in production bucket, doing such changes is not feasible )
I was having data in S3 bucket ! There were multiple directories and inside each directory there were snappy parquet file and json file. The json file was causing the issue.
So i ran the crawler on the master directory that was containing many directories and in the EXCLUDE PATTERN i gave - * / *.json
And this time, it did no create any table for the json file and i was able to see the records of the table using Athena.
for reference - https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html
Pointing glue crawler to the S3 folder and not the acutal file did the trick.
Here's what worked for me: I needed to move all of my CSVs into their own folders, just pointing Glue Crawler to the parent folder ('csv/' for me) was not enough.
csv/allergies.csv -> fails
csv/allergies/allergies.csv -> succeeds
Then, I just pointed AWS Glue Crawler to csv/ and everything was parsed out well.

download, process, upload large number of s3 files with spark

I have a large amount of files (~500k hdf5) inside a s3 bucket which I need to process and reupload to another s3 bucket.
I am pretty new to such tasks, so I am not quite sure if my approach is correct here. I do the following:
I use boto to get the list of keys inside the bucket and parallelize it with spark:
s3keys = bucket.list()
data = sc.parallelize(s3keys)
data = data.map(lambda x: download_process_upload(x))
result = data.collect()
where download_process_upload is a function which downloads the file specified by the key, does some processing on it and re-uploads it to another bucket (returning 1 if everything was successful, and 0 if there was an error)
So in the end I could do
success_rate = sum(result) / float(len(s3keys))
I have read that spark map statements should be stateless, while my custom map function definitely is not stateless. It downloads the file to disk and then loads it into memory etc.
So is this the proper way to do such a task?
I've successfully used your methodology to download and process data from S3. I have not tried to upload the data from within a map statement. But, I see no reason why you wouldn't be able to read the file from s3, process it, and then upload it to a new location.
Also, you can save a few keystrokes and take the explicit lambda out of the map statement like this data = data.map(download_process_upload)