Storing many small files (on S3)? - amazon-web-services

I have 2 million zipped HTML files (100-150KB each) being added each day that I need to store for a long time.
Hot data (the most recent 70-150 million files) is accessed semi-regularly; anything older than that is barely ever accessed.
This means each day I'm storing an additional 200-300GB worth of files.
Now, Standard storage costs $0.023 per GB per month, versus $0.004 for Glacier.
While Glacier is cheap, the problem with it is that it has additional costs, so it would be a bad idea to dump 2 million files into Glacier:
PUT requests to Glacier $0.05 per 1,000 requests
Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
Is there a way of gluing the files together, but keeping them accessible individually?

An important point: if you need to provide quick access to these files, Glacier can take up to 12 hours to give you access to a file. So the best you can do is use S3 Standard – Infrequent Access ($0.0125 per GB, with millisecond access) instead of S3 Standard, and perhaps Glacier for data that really isn't being used. But it still depends on how fast you need that data.
Given that, I'd suggest the following:
as HTML (text) files compress well, you can pack historical data into big zip files (daily, weekly, or monthly), since together they can achieve even better compression;
build an index file or database recording where each HTML file is stored;
read only the desired HTML files from the archives without unpacking the whole zip file; a sketch in Python of how to implement that follows below.
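A minimal sketch of that last point with Python's standard zipfile module; the zip format's central directory lets you read one member without decompressing the rest (the archive and member names here are made up):

import zipfile

# Open the archive and pull out a single member; only that member's
# compressed bytes are read and inflated, not the whole archive.
with zipfile.ZipFile("archive_2019-06-01.zip") as zf:
    with zf.open("page_12345.html") as member:
        html = member.read()

print(f"read {len(html)} bytes")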

Glacier would be extremely cost-sensitive when it comes to the number of files. The best method would be to create a Lambda function that handles the zip and unzip operations for you.
Consider this approach:
Lambda creates archive_date_hour.zip of the ~2 million files from that day, batched by hour; this solves the "per object" cost problem by creating 24 large archive files per day.
Set a lifecycle policy on the S3 bucket to transition objects older than 1 day to Glacier (a sketch of this follows after this list).
Use an unzipping Lambda function to fetch and extract potentially hot items from within the zip files in the Glacier bucket.
Keep the main S3 bucket for hot files with frequent access, as a working directory for the zip/unzip operations, and for collecting new files daily.
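A minimal sketch of that lifecycle rule via boto3, assuming a hypothetical bucket and an archives/ prefix for the zips (the same rule can be created in the console):

import boto3

# Transition the daily/hourly zip archives to Glacier one day after creation.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="my-archive-bucket",  # hypothetical name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archives-to-glacier",
            "Filter": {"Prefix": "archives/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
        }]
    },
)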

Your files are just too small. You will need to combine them, probably in an ETL pipeline such as Glue. You can also use the Range header, i.e. Range: bytes=1000-2000, to download part of an object on S3.
If you do that, you'll need to figure out the best way to track the byte ranges: for example, after combining the files, record the range for each one, and change the clients to use ranges as well.
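For illustration, a ranged GET with boto3 might look like this (bucket, key, and offsets are made up):

import boto3

# Fetch only bytes 1000-2000 of a combined object instead of the whole thing.
s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-bucket",
    Key="combined/blob-0001",
    Range="bytes=1000-2000",
)
part = resp["Body"].read()  # 1001 bytes, since the range is inclusive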
The right approach, though, depends on how this data is accessed; figure out the patterns. If somebody who looks at TinyFileA also looks at TinyFileB, you could combine them together and just send them both, along with other files they are likely to use. I would figure out logical groupings of files which make sense to consumers and will reduce the number of requests they need, without sending too much irrelevant data.

Related

AWS: Speed up copy of large number of very small files

I have a single bucket with a large number of very small text files (between 500 bytes and 1.2KB). This bucket currently contains over 1.7 million files and will be ever increasing.
The way that I add data to this bucket is by generating batches of files (on the order of 50,000 files) and transferring those files into the bucket.
Now the problem is this: if I transfer the files one by one in a loop, it takes an unbearably long amount of time. So if all the files are in a directory origin_directory, I would do
aws s3 cp origin_directory/filename_i s3://my_bucket/filename_i
I would run this command 50,000 times.
Right now I'm testing this on a set of about 280K files. Doing this would take approximately 68 hours, according to my calculations. However, I found out that I can sync:
aws s3 sync origin_directory s3://my_bucket/
Now this works much, much faster (about 5 hours, according to my calculations). However, the sync needs to figure out what to copy (files present in the directory and not present in the bucket). Since the files in the bucket will be ever increasing, I'm thinking this will take longer and longer as time moves on.
On top of that, since I delete the files after every sync, I know the sync operation will need to transfer all the files in that directory anyway.
So my question is, is there a way to start a "batch copy" similar to the sync, without actually doing the sync?
You can use:
aws s3 cp --recursive origin_directory/ s3://my_bucket/
This is the same as a sync, but it will not check whether the files already exist.
Also, see Use of Exclude and Include Filters to learn how to specify wildcards (e.g., all *.txt files).
When copying a large number of files using aws s3 sync or aws s3 cp --recursive, the AWS CLI will parallelize the copying, making it much faster. You can also play with the AWS CLI S3 Configuration to potentially optimize it for your typical types of files (e.g., copy more files simultaneously).
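If you'd rather drive the parallelism yourself, here is a minimal sketch in Python with boto3, using the same hypothetical directory and bucket names as above:

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import boto3

s3 = boto3.client("s3")

def upload(path):
    # Key each object by its filename, as the aws s3 cp example above does.
    s3.upload_file(str(path), "my_bucket", path.name)

# Upload many small files concurrently, much like the CLI does internally.
files = [p for p in Path("origin_directory").iterdir() if p.is_file()]
with ThreadPoolExecutor(max_workers=20) as pool:
    list(pool.map(upload, files))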
Try using https://github.com/mondain/jets3t. It performs the same function but works in parallel, so it will complete the job much faster.

Amazon AWS S3 Glacier: is there a file hierarchy

Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?
For example, in AWS S3, objects are given hierarchy via /. For example: all_logs/some_sub_category/log.txt
I am storing multiple .tar.gz files, and would like:
All files in the same Vault
Within the Vault, files are grouped into several categories (as opposed to a flat structure)
I could not find how to do this documented anywhere. If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?
Does Amazon AWS S3 Glacier support some semblance of file hierarchy inside a Vault for Archives?
No, there's no hierarchy other than "archives exist inside a vault".
For example, in AWS S3, objects are given hierarchy via /. For example: all_logs/some_sub_category/log.txt
This is actually incorrect.
S3 doesn't have any inherent hierarchy. The character / is no different from any other character that is valid in the key of an S3 object.
The S3 Console, and most S3 client tools including AWS's CLI, treat the / character in a special way. But notice that it is a client-side thing: the client makes sure that listing happens in such a way that / behaves as most people would expect, that is, as a "hierarchy separator".
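You can see this client-side convention with boto3: asking the listing API to group keys on the / delimiter yields folder-like "common prefixes" (the bucket name is made up):

import boto3

# List "subfolders" under all_logs/ by grouping keys on the "/" delimiter.
# The hierarchy exists only in how the listing is grouped, not in storage.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(
    Bucket="my-bucket",
    Prefix="all_logs/",
    Delimiter="/",
)
for cp in resp.get("CommonPrefixes", []):
    print(cp["Prefix"])  # e.g. all_logs/some_sub_category/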
If file hierarchy inside S3 Glacier is possible, can you provide brief instructions for how to do so?
You need to keep track of your hierarchy separately. For example, when you store an archive in Glacier, you could write metadata about that archive in a database (RDS, DynamoDB, etc).
As a side note, be careful about .tar.gz in Glacier, especially if you're talking about (1) a very large archive (2) that is composed of a large number of small individual files (3) which you may want to access individually.
If those conditions are met (and in my experience they often are in real-world scenarios), then using .tar.gz will often lead to excessive costs when retrieving data.
The reason is that you pay per request as well as per byte retrieved. So while having one huge .tar.gz file may reduce your costs in terms of number of requests, the fact that gzip uses DEFLATE, which is a non-splittable compression format, means that you'll have to retrieve the entire .tar.gz archive, decompress it, and only then get the one file that you actually want.
An alternative approach that solves the problem I described above — and that, at the same time, relates back to your question and my answer — is to actually first gzip the individual files, and then tar them together. The reason this solves the problem is that when you tar the files together, the individual files actually have clear bounds inside the tarball. And then, when you request a retrieval from glacier, you can request only a range of the archive. E.g., you could say, "Glacier, give me bytes between 105MB and 115MB of archive X". That way you can (1) reduce the total number of requests (since you have a single tar file), and (2) reduce the total size of the requests and storage (since you have compressed data).
Now, to know which range you need to retrieve, you'll need to store metadata somewhere, usually the same place where you keep your hierarchy (as mentioned above: RDS, DynamoDB, Elasticsearch, etc.).
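Here is a rough sketch of the gzip-then-tar idea in Python, recording each member's byte range for later ranged retrieval. The file names are made up, and it leans on tarfile's TarInfo.offset_data attribute to locate each member's payload:

import gzip
import shutil
import tarfile

# gzip each file individually, then tar the compressed files together.
with tarfile.open("archive.tar", "w") as tar:
    for name in ["log1.txt", "log2.txt"]:
        gz_name = name + ".gz"
        with open(name, "rb") as src, gzip.open(gz_name, "wb") as dst:
            shutil.copyfileobj(src, dst)
        tar.add(gz_name)

# Re-read the tar to record where each member's bytes live in the archive.
index = {}
with tarfile.open("archive.tar", "r") as tar:
    for m in tar.getmembers():
        index[m.name] = (m.offset_data, m.offset_data + m.size - 1)

# Persist this index (e.g. in DynamoDB); later, pass a member's byte range
# to a Glacier archive retrieval to fetch just that one compressed file.
# (Glacier requires megabyte-aligned ranges, so round outward as needed.)
print(index)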
Anyways, just an optimization that could save a tremendous amount of money in the future (and I've worked with a ton of customers who wasted a lot of money because they didn't know about this).

How to diff very large buckets in Amazon S3?

I have a use case where I have to back up a 200+TB, 18M-object S3 bucket that changes often (it is used in batch processing of critical data) to another account. I need to add a verification step, but due to the buckets' size, object count, and frequency of change, this is tricky.
My current thought is to pull the eTags from the original bucket and the archive bucket, and then write a streaming diff tool to compare the values. Has anyone here had to approach this problem, and if so, did you come up with a better answer?
Firstly, if you wish to keep two buckets in sync (once you've done the initial sync), you can use Cross-Region Replication (CRR).
To do the initial sync, you could try using the AWS Command-Line Interface (CLI), which has an aws s3 sync command. However, it might have some difficulties with a large number of files -- I suggest you give it a try. It uses keys, dates, and file size to determine which files to sync.
If you do wish to create your own sync app, then the eTag is certainly a definitive way to compare files.
To make things simple, activate Amazon S3 Inventory, which can provide a daily listing of all files in a bucket, including eTag. You could then do a comparison between the Inventory files to discover which remaining files require synchronization.
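As a rough sketch of that comparison in Python, assuming both inventories have been downloaded and decompressed into local CSV files with key and etag columns (real S3 Inventory output is gzipped CSV whose column order is given in the manifest file):

import csv

def load_inventory(path):
    # Map object key -> eTag. The column names are an assumption here;
    # take the actual field order from the inventory's manifest.json.
    with open(path, newline="") as f:
        return {row["key"]: row["etag"] for row in csv.DictReader(f)}

source = load_inventory("source_inventory.csv")
backup = load_inventory("backup_inventory.csv")

# Objects missing from the backup, or present with different content.
out_of_sync = [key for key, etag in source.items()
               if backup.get(key) != etag]
print(f"{len(out_of_sync)} objects still need to be copied")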
For anyone looking for a way to solve this problem in an automated way (as was I),
I created a small Python script that leverages S3 Inventories and Athena to do the comparison somewhat efficiently. (This is basically an automation of John Rotenstein's suggestion.)
You can find it here https://github.com/forter/s3-compare

How to use S3 and EBS in tandem for cost effective analytics on AWS?

I receive very large (5TB) .csv files from my clients on S3 buckets. I have to process these files, add columns to them and store them back.
I might need to work with the files in the same way as I increase the number of features for future improved models.
Clearly, because S3 stores data as objects, every time I make a change I have to read and write 5TB of data.
What is the best approach I can take to process this data cost-effectively and promptly?
Store the 5TB file on S3 as a single object; every time, read the object, do the processing, and save the result back to S3.
Store the 5TB on S3 as an object; read the object, chunk it into smaller objects, and save them back to S3 as multiple objects, so that in future I can work with just the chunks I am interested in.
Save everything on EBS from the start, mount it to an EC2 instance, and do the processing there.
Thank you
First, a warning -- the maximum size of an object in Amazon S3 is 5TB. If you are going to add information that results in a larger object, then you will likely hit that limit.
The smarter way of processing this amount of data is to do it in parallel and preferably in multiple, smaller files rather than a single 5TB file.
Amazon EMR (effectively, a managed Hadoop environment) is excellent for performing distributed operations across large data sets. It can process data from many files in parallel and can compress/decompress data on-the-fly. It's complex to learn, but very efficient and capable.
If you are sticking with your current method of processing the data, I would recommend:
If your application can read directly from S3, use that as the source. Otherwise, copy the file(s) to EBS.
Process the data
Store the output locally in EBS, preferably in smaller files (GBs rather than TBs)
Copy the files to S3 (or keep them on EBS if that meets your needs)
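If you take the chunking route, here is a rough sketch of streaming a large CSV from S3 and writing it back as smaller objects, without ever holding the whole file in memory (bucket, key, and chunk size are made up):

import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="client-bucket", Key="input.csv")["Body"]

MAX_BYTES = 1024 ** 3  # ~1GB per chunk object

def flush(lines, chunk_id):
    s3.put_object(Bucket="client-bucket",
                  Key=f"chunks/part-{chunk_id:05d}.csv",
                  Body=b"\n".join(lines) + b"\n")

# Stream line by line; botocore's StreamingBody supports iter_lines().
lines, size, chunk_id = [], 0, 0
for line in body.iter_lines():
    lines.append(line)
    size += len(line) + 1
    if size >= MAX_BYTES:
        flush(lines, chunk_id)
        lines, size, chunk_id = [], 0, chunk_id + 1

if lines:
    flush(lines, chunk_id)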

AWS S3 Write At Offset

Is there any possibility to write at some offset inside a file stored on S3? We really, really don't want to download it for read-modify-write all the time, because the files are rather big (a few GBs each).
There is no way to append data in S3.
One possible workaround could be to create new files every time (possibly using Kinesis Firehose) and run EMR jobs (possibly using Data Pipeline) to merge these small files at an hourly or daily cadence, as needed.
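For instance, a minimal sketch of the "write new records instead of appending" idea with boto3 and Kinesis Data Firehose; the delivery stream name is made up, and the stream is assumed to be configured with your S3 bucket as its destination:

import boto3

# Instead of appending to an existing S3 object, hand new records to
# Firehose; it buffers them and periodically delivers new S3 objects.
firehose = boto3.client("firehose")
firehose.put_record(
    DeliveryStreamName="my-append-stream",
    Record={"Data": b"row that would otherwise have been appended\n"},
)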