I have just started learning about AWS and have come to know that AWS S3 offers several storage classes (for example, S3 Standard and S3 Glacier). In general, I believe storage from which we can retrieve files faster is more expensive, and storage with slower retrieval is cheaper.
I would like to know how this works in terms of technology. Why is the cost (presumably Amazon's cost) lower if the read speed is lower, and vice versa?
Glacier is an archival data store. It is meant for data that is accessed very rarely, primarily for backup purposes. The exact details of how AWS stores the data are unknown, but it is speculated that it is stored and shelved on custom tape storage, or something similar, which is much cheaper than the regular hard drives used for frequently accessed data. Wikipedia writes:
The Register claimed that Glacier runs on Spectra T-Finity tape libraries with LTO-6 tapes.[10][11] Others have conjectured Amazon using off-line shingled magnetic recording hard drives, multi-layer Blu-ray optical discs, or an alternative proprietary storage technology.
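To make the trade-off concrete from the user's side: whatever hardware sits behind it, the storage class is just a per-object attribute you choose at upload time, and AWS prices the classes differently behind that label. A minimal sketch, assuming the Python boto3 SDK (the bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client('s3')

    # Frequently read object: fast access, higher per-GB storage price.
    s3.put_object(
        Bucket='my-example-bucket',
        Key='frequently-read/report.csv',
        Body=b'...some data...',
        StorageClass='STANDARD',
    )

    # Archival object: slow/expensive to retrieve, cheap per GB to keep.
    s3.put_object(
        Bucket='my-example-bucket',
        Key='archive/report-2015.csv',
        Body=b'...some data...',
        StorageClass='GLACIER',
    )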
In the Dask distributed documentation, they have the following information:
For example Dask developers use this ability to build in data locality when we communicate to data-local storage systems like the Hadoop File System. When users use high-level functions like dask.dataframe.read_csv('hdfs:///path/to/files.*.csv') Dask talks to the HDFS name node, finds the locations of all of the blocks of data, and sends that information to the scheduler so that it can make smarter decisions and improve load times for users.
However, it seems that get_block_locations() was removed from the HDFS filesystem backend, so my question is: what is the current state of Dask regarding HDFS? Does it send computation to nodes where the data is local? Does the scheduler still take data locality on HDFS into account?
Quite right, with the appearance of arrow's HDFS interface, which is now preferred over hdfs3, the consideration of block locations is no longer part of workloads accessing HDFS, since arrow's implementation doesn't include the get_block_locations() method.
However, we had already wanted to remove the somewhat convoluted code which made this work, because we found that the inter-node bandwidth on test HDFS deployments was good enough that it made little practical difference in most workloads. The extra constraints on the size of the blocks versus the size of the partitions you would like in memory added another layer of complexity.
By removing the specialised code, we could avoid the very special case that was being made for HDFS as opposed to external cloud storage (s3, gcs, azure) where it didn't matter which worker accessed which part of the data.
In short, yes the docs should be updated.
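For reference, the high-level usage from the quoted docs still looks the same from the user's side. A small sketch, assuming a reachable HDFS cluster with the arrow-based HDFS support installed (the path and column name are hypothetical):

    import dask.dataframe as dd

    # Read all matching CSV blocks from HDFS. With arrow's HDFS interface,
    # no block-location hints are passed to the scheduler, so tasks may run
    # on any worker regardless of where the blocks physically live.
    df = dd.read_csv('hdfs:///path/to/files.*.csv')

    # Computation proceeds as with any Dask dataframe.
    result = df.groupby('some_column').size().compute()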
Here is the case - I have a large dataset, temporarily retained in AWS SQS (around 200GB).
My main goal is to store the data so I can access it for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket. And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there? Amazon has so many different solutions and ways of integration that it is kind of confusing.
Thanks for your help!
for building a machine learning model, also using AWS. I believe I should transfer the data to an S3 bucket.
IMHO a good idea. Indeed, S3 is the best option to retain the data and be able to reuse it (unlike SQS). AWS tools (SageMaker, ML) can directly use content stored in S3. Most machine learning frameworks can read files, so you can easily copy files from S3 or mount a bucket as a filesystem (not my favourite option, but possible).
And while it is straightforward when you deal with small datasets, I am not sure what the best way to handle large ones is.
It depends on what data you have and how you want to store and process the data files.
If you plan to have a file for each SQS message, I'd suggest creating a Lambda function (assuming you can read and store each message reasonably fast); see the sketch below.
If you want to aggregate and/or concatenate the source messages, or if processing a message would take too long, you may rather write a script to read and process the data on a server.
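A rough illustration of the Lambda option mentioned above, assuming an SQS-triggered Lambda and the boto3 SDK (the bucket name and key scheme are hypothetical):

    import boto3

    s3 = boto3.client('s3')
    BUCKET = 'my-ml-dataset-bucket'  # hypothetical bucket name

    def handler(event, context):
        # For an SQS trigger, each record's body holds one message payload.
        for record in event['Records']:
            key = f"raw/{record['messageId']}.json"
            s3.put_object(Bucket=BUCKET, Key=key,
                          Body=record['body'].encode('utf-8'))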
There is no way I can do it locally on my laptop, is there? So, do I create an EC2 instance and process the data there?
Well, in theory you can do it on your laptop, but it would mean downloading 200GB and then uploading 200GB (not counting overhead and network latency).
Your intuition is IMHO good; having an EC2 instance in the same region would be most practical, accessing all the data almost locally.
Amazon has so many different solutions and ways of integration that it is kind of confusing.
You have many feasible options for different use cases, often overlapping, so indeed it may look confusing.
I'm using CloudFront to cache and distribute all of my thumbnails currently stored on S3 in Standard storage class. Since CloudFront caches originals and accesses them only every 24 hours, it makes sense to use a cheaper storage class than Standard: either Standard Infrequent Access (IA) or Reduced Redundancy (RR). But I'm not sure which one would be more suitable and cost effective.
Standard-IA has the cheapest storage of all (58% cheaper than the Standard class and 47% cheaper than RR), but its requests are 60% more expensive than both Standard and RR. However, all files under 128 KB stored in the Standard-IA class are rounded up to 128 KB when calculating cost, which would apply to most of my thumbnail images.
Meanwhile, storage in RR class is only 20% cheaper than Standard, but its request cost is 60% cheaper than that of Standard-IA.
I'm unsure which one would be most cost effective in practice and would appreciate anyone with experience using both to give some feedback.
There's a problem with the premise of your question. The fact that CloudFront may cache your objects for some period of time actually has little relevance when selecting an S3 storage class.
REDUCED_REDUNDANCY is sometimes less expensive¹ because S3 stores your data on fewer physical devices, reducing the reliability somewhat in exchange for lower pricing... and in the event of failures, the object is statistically more likely to be lost by S3. If S3 loses the object because of the reduced redundancy, CloudFront will at some point begin returning errors.
The deciding factor in choosing this storage class is whether the object is easily replaced.
Reduced Redundancy Storage (RRS) is an Amazon S3 storage option that enables customers to reduce their costs by storing noncritical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. It provides a cost-effective, highly available solution for distributing or sharing content that is durably stored elsewhere, or for storing thumbnails, transcoded media, or other processed data that can be easily reproduced.
https://aws.amazon.com/s3/reduced-redundancy/
STANDARD_IA (infrequent access) is less expensive for a different reason: the storage savings are offset by retrieval charges. If an object is downloaded more than once per month, the combined charge will exceed the cost of STANDARD. It is intended for objects that will genuinely be accessed infrequently. Since CloudFront has multiple edge locations, each with its own independent cache,² whether an object is "currently stored in" CloudFront is not a question with a simple yes/no answer. It is also not possible to "game the system" by specifying large Cache-Control: max-age values. CloudFront has no charge for its cache storage, so it's only sensible that an object can be purged from the cache before the expiration time you specify. Indeed, anecdotal observations confirm what the docs indicate, that objects are sometimes purged from CloudFront due to a relative lack of "popularity."
The deciding factor in choosing this storage class is whether the increased data transfer (retrieval) charges will be low enough to justify the storage charge savings that they offset. Unless the object is expected to be downloaded less than once or twice a month, this storage class does not represent a cost savings.
Standard/Infrequent Access should be reserved for things you really don't expect to be needed often, like tarballs and database dumps and images unlikely to be reviewed after they are first accessed, such as (borrowing an example from my world) a proof-of-purchase/receipt scanned and submitted by a customer for a rebate claim. Once the rebate has been approved, it's very unlikely we'll need to look at that receipt again, but we do need to keep it on file. Hello, Standard_IA. (Note that S3 does this automatically for me, after the file has been stored for 30 days, using a lifecycle policy on the bucket).
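For reference, the kind of lifecycle rule mentioned above can be expressed roughly like this. A sketch assuming boto3 (the bucket name and prefix are hypothetical):

    import boto3

    s3 = boto3.client('s3')

    # Transition objects under a prefix to STANDARD_IA 30 days after creation.
    s3.put_bucket_lifecycle_configuration(
        Bucket='my-receipts-bucket',                 # hypothetical
        LifecycleConfiguration={
            'Rules': [{
                'ID': 'receipts-to-ia-after-30-days',
                'Filter': {'Prefix': 'receipts/'},
                'Status': 'Enabled',
                'Transitions': [{'Days': 30, 'StorageClass': 'STANDARD_IA'}],
            }]
        },
    )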
Standard - IA is ideally suited for long-term file storage, older data from sync and share, backup data, and disaster recovery files.
https://aws.amazon.com/s3/faqs/#sia
Side note: one alternative mechanism for saving some storage cost is to gzip -9 the content before storing, and set Content-Encoding: gzip. I have been doing this for years with S3 and am still waiting for my first support ticket to come in reporting a browser that can't handle it. Even content that is allegedly already compressed -- such as .xlsx spreadsheets -- will often shrink a little bit, and every byte you squeeze out means slightly lower storage and download bandwidth charges.
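A sketch of that side note's approach, assuming boto3 (names are hypothetical): compress locally and set Content-Encoding so that browsers and CloudFront decompress transparently.

    import gzip
    import boto3

    s3 = boto3.client('s3')

    with open('report.xlsx', 'rb') as f:
        compressed = gzip.compress(f.read(), compresslevel=9)   # like gzip -9

    s3.put_object(
        Bucket='my-example-bucket',          # hypothetical
        Key='reports/report.xlsx',
        Body=compressed,
        ContentEncoding='gzip',
        ContentType='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet',
    )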
Fundamentally, if your content is easily replaceable, such as resized images where you still have the original... or reports that can easily be rerun from source data... or content backed up elsewhere (AWS is essentially always my first choice for cloud services, but I do have backups of my S3 assets stored in another cloud provider's storage service, for example)... then reduced redundancy is a good option.
¹ REDUCED_REDUNDANCY is sometimes less expensive only in some regions as of late 2016. Prior to that, it was priced lower than STANDARD, but in an odd quirk of the strange world of competitive pricing, as a result of S3 price reductions announced in November, 2016, in some AWS regions, the STANDARD storage class is now slightly less expensive than REDUCED_REDUNDANCY ("RRS"). For example, in us-east-1, Standard was reduced from $0.03/GB to $0.023/GB, but RRS remained at $0.024/GB... leaving no obvious reason to ever use RRS in that region. The structure of the pricing pages leaves the impression that RRS may no longer be considered a current-generation offering by AWS. Indeed, it's an older offering than both STANDARD_IA and GLACIER. It is unlikely to ever be fully deprecated or eliminated, but they may not be inclined to reduce its costs to a point that lines up with the other storage classes if it's no longer among their primary offerings.
² "CloudFront has multiple edge locations, each with its own independent cache" is still a technically true statement, but CloudFront quietly began to roll out and then announced some significant architectural changes in late 2016, with the introduction of the regional edge caches. It is now, in a sense, "less true" that the global edge caches are indepenent. They still are, but it makes less of a difference, since CloudFront is now a two-tier network, with the global (outer tier) edge nodes sometimes fetching content from the regional (inner tier) edge nodes, instead of directly from the origin server. This should have the impact of increasing the likelihood of an object being considered to be "in" the cache, since a cache miss in the outer tier might be transformed into a hit by the inner tier, which is also reported to have more available cache storage space than some or all of the outer tier. It is not yet clear from external observations how much of an impact this has on hit rates on S3 origins, as the documentation indicates the regional edges are not used for S3 (only custom origins) but it seems less than clear that this universally holds true, particularly with the introduction of Lambda#Edge. It might be significant, but as of this writing, I do not believe it to have any material impact on my answer to the question presented here.
Since CloudFront caches originals and accesses them only every 24 hours
You can actually make CloudFront cache things for much longer if you want. You just need to add metadata to your objects that sets a Cache-Control header, and according to the S3 documentation you can specify an age of up to 100 years. You simply set a max-age in seconds, so if you really want to have your objects cached for 100 years:
Cache-Control: max-age=3153600000
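When uploading through an SDK rather than the console, that header is simply object metadata set at upload time. A sketch assuming boto3 (bucket and key names are hypothetical):

    import boto3

    s3 = boto3.client('s3')
    with open('img123.jpg', 'rb') as f:
        s3.put_object(
            Bucket='my-thumbnails-bucket',            # hypothetical
            Key='thumbs/img123.jpg',
            Body=f.read(),
            ContentType='image/jpeg',
            CacheControl='max-age=3153600000',        # ~100 years, as above
        )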
As for your main question regarding SIA vs. RR, you've pretty much hit on all the differences between the two. It's just a matter of calculating the costs of using one vs. the other. You'll just need to run some calculations and see what the cost estimates are. If you have 100 thumbnails all under 128K then SIA will charge you for 100 * 128K bytes, whereas RR will just charge you for the costs of the total size of those 100 thumbnails. Similarly, if you set a fairly high cache timeout in CloudFront then you may see only 10 fetches from S3 each day, so SIA would charge you for retrieval of 10 * 128K bytes each day while RR would only charge you for the cost of the size of those 10 thumbnails.
Using some real numbers based on the size & quantity of your thumbnails and the amount of traffic you anticipate it should be pretty easy to come up with cost estimates.
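If it helps, the comparison boils down to a small calculation like the following sketch. All per-GB and per-request prices here are placeholders to be filled in from the current S3 pricing page, and the thumbnail counts and sizes are made-up examples; the 128 KB minimum for Standard-IA is the key structural difference.

    # Rough comparison sketch; all prices below are placeholders -- fill them
    # in from the current S3 pricing page. Note that Standard-IA also adds a
    # per-GB retrieval charge that is not modelled here.
    GB = 1024 ** 3

    def monthly_cost(num_thumbs, avg_size_bytes, fetches_per_day,
                     price_per_gb, price_per_1k_get, min_billable_bytes=0):
        # Standard-IA bills each object as at least 128 KB; RR bills actual size.
        billable = max(avg_size_bytes, min_billable_bytes) * num_thumbs
        storage = billable / GB * price_per_gb
        requests = fetches_per_day * 30 / 1000 * price_per_1k_get
        return storage + requests

    # Hypothetical example: 100,000 thumbnails averaging 40 KB, ~1,000
    # CloudFront origin fetches per day.
    ia = monthly_cost(100_000, 40 * 1024, 1000, price_per_gb=0.0125,
                      price_per_1k_get=0.001, min_billable_bytes=128 * 1024)
    rr = monthly_cost(100_000, 40 * 1024, 1000, price_per_gb=0.024,
                      price_per_1k_get=0.0004)
    print(f"Standard-IA ~ ${ia:.2f}/month, RR ~ ${rr:.2f}/month")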
FYI, you might also want to take a look at some of these slideshows and/or these videos. These are all from Amazon's re:Invent conferences, and these links should provide you with S3-specific presentations at those conferences.
I'm building a browser game and I have a lot of small files that need to be transferred between my EC2 instance and S3 when players perform some key actions.
Although transferring a single big file is fairly fast, transferring multiple small files is extremely slow. I'm using Amazon's PHP SDK.
Is there a way to overcome this weakness in S3? Thanks.
It looks like combining the two solutions below is the way to go.
http://improve.dk/archive/2011/11/07/pushing-the-limits-of-amazon-s3-upload-performance.aspx
http://gearman.org/
If this transfer has to be made from an EC2 instance to S3, then maybe you can try using s3fuse, which will basically mount your S3 bucket as a storage volume on the EC2 instance.
The performance of S3 is not constant and can be quite slow sometimes. If you need real-time performance for a shared object I would take a look at the AWS memcached service although I have not used it.
How exactly are you uploading the files? Is there a multithreaded method in the SDK? I'm asking because I've had to implement my own method for downloading stuff faster than the SDK.
Do you need to read those files right away? How many events do you have per second? Do you need them ordered?
My first thought would be to make a local buffer that uploads batches every once in a while.
Then, if that's too slow, I'd store them in a fast buffer first, instead of S3, and flush it every once in a while. My choices would be simple stuff like SQS or Redis. SQS has theoretically unlimited throughput for standard queues and 300 batches per second (1 batch = 1..10 messages = 0..256 KB) for FIFO queues - which you can further increase.
Then you have streams, Lambda and whatever.
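The question is about the PHP SDK, but the batching-and-parallel-upload idea suggested above looks roughly like this in Python. A sketch assuming boto3 (the bucket name and key scheme are hypothetical):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client('s3')
    BUCKET = 'my-game-assets'          # hypothetical bucket name

    def upload_one(item):
        key, payload = item
        s3.put_object(Bucket=BUCKET, Key=key, Body=payload)

    def flush_batch(batch):
        # Each small object still costs one PUT round-trip; doing them in
        # parallel hides most of the per-request latency.
        with ThreadPoolExecutor(max_workers=16) as pool:
            list(pool.map(upload_one, batch))

    # Usage idea: accumulate (key, bytes) pairs from player actions, then
    # flush every few seconds or when the buffer reaches a certain size.
    flush_batch([(f'events/evt-{i}.json', b'{"action": "jump"}')
                 for i in range(100)])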
I'm trying to understand the complicated Amazon Glacier pricing model. I don't want to store a huge amount of data, just a few GBs, say 10. I hope never to download the files, and if I did need to, I wouldn't care how long it takes.
Is there a cost per file I upload? Is it cheaper to zip lots of tiny files and upload them in a few chunks, or does having, say, 10,000 images not matter? (I cannot get a straight answer to this from searching.)
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
I know this is a bit old, but you may still find my answer helpful (I hope). The other answer is based on S3, which wasn't your question, I believe.
Glacier is intended for rare file access. With that in mind, they sort of punish you if you need to retrieve many files at once. In your particular case I would suggest uploading 10,000 separate files instead of, let's say, 100 ZIP files with 100 files each. The reason is very simple. Glacier will let you download for free only 5% of your total archive per month, prorated daily. So if, for example, you need to download 10 photos you took on a weekend, you would be able to get those 10 photos for free if they are spread across the vault. On the other hand, if you have a ZIP file that has 100 photos inside, you'll be forced to download that ZIP, which will probably be more than 5% of the total archive, meaning you'll be paying some fees for the retrieval.
The only reason it makes sense to upload fewer files is to avoid a high number of upload requests (10,000 files usually mean 10,000 requests). Requests are charged at $0.05 per 1,000. These fees are much lower than the retrieval fees (taking into account the limits imposed), which is why I would always recommend uploading separate files. Of course you may zip files that make sense to keep together.
Retrieval costs are very complex in Amazon Glacier. They have a good explanation here:
http://aws.amazon.com/glacier/faqs/#How_much_data_can_I_retrieve_for_free
But even there you'll need to pay attention to the calculations to get a clear idea of how the costs are billed.
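As a rough illustration of the (pre-2016) 5% rule described above, a sketch with made-up example numbers, not current pricing:

    # Under the old policy you could retrieve 5% of your average monthly
    # storage for free, prorated daily.
    total_stored_gb = 100                        # example vault size
    free_per_month_gb = 0.05 * total_stored_gb   # 5 GB per month
    free_per_day_gb = free_per_month_gb / 30     # ~0.17 GB per day
    # Retrieving 10 loose photos of a few MB each fits under the daily
    # allowance; pulling one 2 GB ZIP in a single day would not.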
Regarding this question:
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
Requests are file-by-file, although you can select many files at once and download them all together.
Deciding whether to use S3 or Glacier really depends on your file-access needs. If you will rarely need access to your files, then Glacier is your answer. Otherwise, for 10GB, S3 can still be cheap and more flexible than Glacier.
In my case I find family photos to be a very precious thing. That's why I have a 100GB backup on Glacier with all my family photos. I don't intend to access it unless there is some kind of disaster at home. In that case, I think I would not mind the retrieval cost if it saved something I really care about. But that's just me.
Detailed pricing information for S3 is available here. Specifics of the API functions available are here.
For S3, you are mostly charged for upload bandwidth (bytes sent TO S3), download bandwidth (bytes received FROM S3), and storage (bytes IN S3). You are also charged for the number and type of API calls.
So, if you upload your 10GB of data to S3 in 10,000 1MB files, store it for a month, and then download each of the files once, you'll be charged:
$0.00 for upload bandwidth (this is free)
$0.10 for the 10,000 PUT requests to upload the files
$0.95 for storing the 10GB for a month
$1.08 for 10GB download bandwidth (the first is free, then $0.12/GB)
$0.01 for the 10,000 GET requests to download the files
That's $2.14. If you uploaded and downloaded once each, but kept the data for a year, only the storage cost would go up to 12 * $0.95, or $11.40. If your files averaged only 100KB, so you had 100,000 of them, you'd pay 10 times as much for the PUT and GET requests, or $1.10 instead of $0.11.
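The arithmetic above, spelled out as a small sketch using the same example prices quoted in this answer (current prices differ):

    # Example prices from this answer (historic; check current pricing).
    put_per_1k, get_per_1k = 0.01, 0.001
    storage_per_gb_month = 0.095
    download_per_gb = 0.12              # after the first free GB

    files, gb = 10_000, 10
    cost = (files / 1000 * put_per_1k          # $0.10 in PUTs
            + gb * storage_per_gb_month        # $0.95 storage for a month
            + (gb - 1) * download_per_gb       # $1.08 download bandwidth
            + files / 1000 * get_per_1k)       # $0.01 in GETs
    print(round(cost, 2))                      # 2.14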
You can only upload and download a single file per operation. If you combined your files into one using Zip, you'd only save by using fewer operations, which, as you can see, are pretty cheap to start with.
There is one quirk here, though. I'm pretty sure you are charged for all bandwidth usage when uploading and downloading, including request headers, not just the bodies containing your data. So if your files were really tiny the request headers might become significant, perhaps as much as the files themselves. In that case your bandwidth costs would double.
Glacier pricing is more complicated, and I've never used it myself. Basically, it reduces storage cost by almost ten-fold, leaving the other costs the same, and adding costs to archive and restore per object. Those costs seem to be significant if you have a lot of small objects, need to get a lot of your files at a time, or get files frequently. Glacier seems to be best when you have a lot of data (terabytes or more, not just gigabytes), but few operations. Given that you only have 10GB of data, S3 is so inexpensive it doesn't seem worth it to consider Glacier.
Finally, AWS has a free usage tier for the first year, which looks like it would cover all your costs except for half the storage charges.
Better to use a few larger files than lots of small ones.
There are two approaches to putting files into Amazon Glacier. You either interact with vaults directly, or use S3 as a frontend.
I am using S3 (and the AWS Management Console) so that I am able to see the content of the archive and at the same time have it stored cheaply in Glacier.
This approach has one drawback - storing any piece of information in Glacier has some per-object data overhead (which you pay for too), so there is logically a break-even point. Before the 2014-04 price reduction I did the calculation, and the critical size was about 16 kB; storing smaller files in Glacier (using AWS S3 as a frontend) was more expensive than keeping them only on S3. With the price reduction for S3 storage (Glacier did not change), the break-even point went even higher.
I guess that even without S3 as a frontend, the situation will be similar, though a bit more friendly to smaller files.
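A sketch of that break-even calculation. The per-object overhead figures (roughly 8 KB billed at the S3 rate plus 32 KB billed at the Glacier rate for each archived object) and the per-GB prices below are assumptions for illustration; check the current docs and pricing page before relying on them.

    # Break-even sketch for archiving small S3 objects to Glacier via a
    # lifecycle rule. Overhead figures and prices are placeholder assumptions.
    GB = 1024 ** 3

    def cost_s3(size_bytes, s3_per_gb=0.03):
        return size_bytes / GB * s3_per_gb

    def cost_glacier_via_s3(size_bytes, s3_per_gb=0.03, glacier_per_gb=0.01,
                            s3_overhead=8 * 1024, glacier_overhead=32 * 1024):
        return (s3_overhead / GB * s3_per_gb
                + (size_bytes + glacier_overhead) / GB * glacier_per_gb)

    # Find the smallest size where archiving to Glacier becomes cheaper.
    size = 1024
    while cost_glacier_via_s3(size) >= cost_s3(size):
        size += 1024
    print(f"break-even around {size // 1024} kB")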
Since November 21, 2016, Amazon has updated the free tier policy for Glacier retrievals, replacing the "5% of your average monthly storage" policy with a flat 10GB free per month. However, if your retrieval policy was set prior to that day, then you're still on the "5%" policy and the other answers here still apply to you.
If your retrieval policy was set after Nov 21, 2016, and you're in the OP's shoes:
You're only storing 10GB, so you could retrieve all of your data for free once per month using Standard retrievals. It would make no difference if all 10,000 photos are zipped into one zip file or not (for retrievals).
The only variable in this scenario is number of upload requests. 10,000 requests at a price of $0.05 per 1,000 is only $0.50 and that's a one time fee for your specific case.
More pricing info at AWS Glacier FAQ
UPDATE:
Glacier docs recommend using multipart upload for files larger than 100MB.
I came to this conclusion independently after a couple timeouts when trying to upload an 8GB file.
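If you are using the S3-frontend route described earlier in this thread, the SDK's managed transfer can handle the multipart split for you. A sketch assuming boto3 (file, bucket, and key names are hypothetical); the native Glacier vault API has its own multipart calls and tree-hash checksums, which this does not cover.

    import boto3
    from boto3.s3.transfer import TransferConfig

    s3 = boto3.client('s3')

    # Split uploads into 100 MB parts, per the recommendation above.
    config = TransferConfig(multipart_threshold=100 * 1024 * 1024,
                            multipart_chunksize=100 * 1024 * 1024)

    s3.upload_file(
        'backup-2016.tar',                 # hypothetical local file
        'my-archive-bucket',               # hypothetical bucket
        'backups/backup-2016.tar',
        ExtraArgs={'StorageClass': 'GLACIER'},
        Config=config,
    )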