AWS S3 batch operation - Got Dinged Pretty Hard

We used the newly introduced AWS S3 batch operation to back up our S3 bucket, which had about 15 TB of data, to S3 Glacier. Prior to backing up, we had estimated the bandwidth and storage costs and also taken into account the mandatory 90-day storage requirement for Glacier.
However, the actual costs turned out to be massive compared to our estimate. We somehow overlooked the UPLOAD request cost, which runs at $0.05 per 1,000 requests. We have many millions of files, each file upload counted as a request, and we are looking at several thousand dollars' worth of spend :(
I am wondering if there was any way to avoid this?

The concept of "backup" is quite interesting.
Traditionally, where data was stored on one disk, a backup was imperative because it's not good to have a single point-of-failure.
Amazon S3, however, stores data on multiple devices across multiple Availability Zones (effectively multiple data centers), which is how they get their 99.999999999% durability and 99.99% availability. (Note that durability means the likelihood of retaining the data, which isn't quite the same as availability which means the ability to access the data. I guess the difference is that during a power outage, the data might not be accessible, but it hasn't been lost.)
Therefore, the traditional concept of taking a backup in case of device failure has already been handled in S3, all for the standard cost. (There is an older Reduced Redundancy option that only copied to 2 AZs instead of 3, but that is no longer recommended.)
Next comes the concept of backup in case of accidental deletion of objects. When an object is deleted from a bucket that does not have versioning enabled, it is not recoverable. However, enabling versioning on a bucket will retain multiple versions, including those of deleted objects. This is great where previous histories of objects need to be kept, or where deletions might need to be undone. The downside is that storage costs include all versions that are retained.
There is also the new Object Lock capability in S3, where objects can be locked for a period of time (eg 3 years) without the ability to delete them. This is ideal for situations where information must be retained for a period, and it avoids accidental deletion. (There is also a legal hold capability that works the same way, but can be turned on/off if you have appropriate permissions.)
Finally, there is the potential for deliberate malicious deletion if an angry staff member decides to take revenge on your company for not stocking their favourite flavour of coffee. If an AWS user has the necessary permissions, they can delete the data from S3. To guard against this, you should limit who has such permissions and possibly combine it with versioning (so they can delete the current version of an object, but it is actually retained by the system).
This can also be addressed by using Cross-Region Replication of Amazon S3 buckets. Some organizations use this to copy data to a bucket owned by a different AWS account, such that nobody has the ability to delete data from both accounts. This is closer to the concept of a true backup because the copy is kept separate (account-wise) from the original. The extra cost of storage is minimal compared to the potential costs if the data was lost. Plus, if you configure the replica bucket to use the Glacier Deep Archive storage class, the costs can be quite low.
Your copy to Glacier is another form of backup (and offers cheaper storage than S3 in the long term), but it would need to be updated on a regular basis to be a continuous backup (eg by using backup software that understands S3 and Glacier). The "5c per 1000 requests" cost means that it is better used for archives (eg large zip files) rather than many small files.
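To illustrate why archives are cheaper to upload than many small files, here is a rough sketch of the request-charge arithmetic. The $0.05 per 1,000 requests price comes from the question; the file count and the number of files per archive are hypothetical figures chosen only for illustration.

```python
import math

# Price quoted in the question: $0.05 per 1,000 upload requests.
PRICE_PER_1000_REQUESTS = 0.05


def upload_request_cost(num_requests: int,
                        price_per_1000: float = PRICE_PER_1000_REQUESTS) -> float:
    """Return the request charge in dollars for a number of upload requests."""
    return num_requests * price_per_1000 / 1000


def bundled_request_count(num_files: int, files_per_archive: int) -> int:
    """Number of uploads needed if files are first packed into archives."""
    return math.ceil(num_files / files_per_archive)


# Hypothetical figures: 50 million small files, 10,000 files per zip archive.
num_files = 50_000_000
individual = upload_request_cost(num_files)
bundled = upload_request_cost(bundled_request_count(num_files, 10_000))

print(f"one request per file:  ${individual:,.2f}")
print(f"bundled into archives: ${bundled:,.2f}")
```

With these assumed numbers the per-file upload comes to about $2,500 in request charges, while the bundled upload costs about $0.25. The trade-off is that restoring a single file then means retrieving a whole archive.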
Bottom line: Your need for a backup might be as simple as turning on Versioning and limiting which users can totally delete an object (including all past versions) from the bucket. Or, create a bucket replica and store it in Glacier Deep Archive storage class.
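As a minimal sketch of "limiting which users can totally delete an object", the snippet below builds an IAM policy document that denies the `s3:DeleteObjectVersion` action on a versioned bucket. The bucket name and the statement ID are hypothetical, and which users or groups the policy is attached to is left to you; treat it as a starting point rather than a complete permission model.

```python
import json


def deny_version_delete_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that denies permanent deletion of object versions.

    With versioning enabled, a plain delete only adds a delete marker;
    removing a specific version requires s3:DeleteObjectVersion.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyPermanentVersionDelete",  # hypothetical statement ID
                "Effect": "Deny",
                "Action": "s3:DeleteObjectVersion",
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# "example-backup-bucket" is a placeholder bucket name.
print(json.dumps(deny_version_delete_policy("example-backup-bucket"), indent=2))
```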

Related

Intelligent Tiering Issues

I was hoping to get others' perspectives on something that we've done and are beginning to realize was not the best idea.
Here's some information about our "environment":
Account A: We have an AWS account that acts as a data lake (we upload transaction data to S3 daily)
Account B: We have another AWS account that our business partners use to access the data in Account A
A few months back, we enabled Intelligent Tiering in S3, where objects are moved to Archive and Deep Archive in 90 and 180 days, respectively. We're now seeing the downside of this decision. Our business partners are unable to query data (in account A) from 3 months ago in Athena (account B). Oof.
I guess we did not understand the purpose of intelligent tiering and had hoped that Athena would be able to move tiered objects back into standard s3 when someone queries the data (as in instant retrieval).
There are definitely some use cases that we missed in vetting intelligent tiering.
I am curious how others are leveraging intelligent tiering. Are you only tiering objects that your business partners do not need as "instant retrieval"?
If your goal is to reduce storage costs, it is worth investigating and understanding the various Storage Classes offered by Amazon S3.
They generally fall into three categories:
Instantly available: This is the 'standard' class
Instantly available, lower storage cost but higher retrieval cost: This is the 'Infrequent Access' classes. They can be cheaper for data that is only accessed once per month or less. If they are accessed more often, then the Request charges outweigh the savings in Storage costs.
Archived: This is typically the Glacier classes. Avoid them if you want to use Amazon Athena.
See the table in: Comparing the Amazon S3 storage classes
For your use-case, you might consider keeping data in Standard by default (since it is heavily accessed), and then move data older than 90 days to S3 One Zone - Infrequent Access. It will still be accessible, but will have a lower storage cost if rarely used.
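A lifecycle rule for that suggestion could look roughly like the sketch below. It only builds the configuration as a Python dict; the rule ID and the empty prefix (meaning the whole bucket) are assumptions. The resulting dict is in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument.

```python
def one_zone_ia_lifecycle(days: int = 90, prefix: str = "") -> dict:
    """Build a lifecycle configuration moving objects to One Zone-IA after `days` days."""
    return {
        "Rules": [
            {
                "ID": f"to-onezone-ia-after-{days}-days",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # "" applies the rule to every object
                "Transitions": [
                    {"Days": days, "StorageClass": "ONEZONE_IA"},
                ],
            }
        ]
    }


config = one_zone_ia_lifecycle()
print(config)

# Applying it would look something like this (needs boto3 and credentials):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="your-data-lake-bucket", LifecycleConfiguration=config)
```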
I would also recommend converting your data to Snappy-compressed Parquet format (preferably partitioned), which will reduce the amount of storage required and will allow Athena to selectively pick which objects it needs to access. It will also make Athena run faster and reduce the cost of Athena queries.
See: Top 10 Performance Tuning Tips for Amazon Athena | AWS Big Data Blog

S3 Standard to Glacier - Lifecycle Transition Cost

I wanted to confirm that my understanding of the cost of lifecycle-policy-based transition of files from Standard to Glacier is correct, using the example below.
Per 1000 files transferred, we get charged $0.06 (ap-south-1 region) to transition to Glacier.
Eg:
Bucket A: Has 1 million files (3TB total size). If we move all the objects to Glacier, we will be charged 1000000*0.06/1000 = $60
Bucket B: Has 300 files (3TB total size). If we move all the objects to Glacier, we will be charged $0.06 or less (as it has less than 1000 files in the bucket)
Yes, the transition costs are indeed driven by the number of files being moved. It is similar to performing a new PUT operation to S3. You pay based on the number of requests being made. Once the data (files) are part of that storage class, then you are charged for the storage based on the class.
As you may note, transition to Glacier (or a PUT to Glacier) is around 12 times costlier than a corresponding PUT to S3 Standard. In ap-south-1, S3 PUT is charged at $0.005 per 1000 requests, while Glacier transition (or Glacier PUT) is charged at $0.06 per 1000 requests (as of May 2020).
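The two buckets from the question can be worked through with a few lines of arithmetic. This sketch uses the ap-south-1 prices quoted above (as of May 2020) and assumes the charge scales with the exact number of requests rather than being rounded up to a full block of 1,000.

```python
# Prices per 1,000 requests in ap-south-1, as quoted above (May 2020).
GLACIER_TRANSITION_PRICE = 0.06
STANDARD_PUT_PRICE = 0.005


def transition_charge(num_objects: int,
                      price_per_1000: float = GLACIER_TRANSITION_PRICE) -> float:
    """Request charge in dollars for transitioning `num_objects` objects."""
    return num_objects * price_per_1000 / 1000


bucket_a = transition_charge(1_000_000)  # 1 million files, 3 TB
bucket_b = transition_charge(300)        # 300 files, 3 TB

print(f"Bucket A: ${bucket_a:.2f}")
print(f"Bucket B: ${bucket_b:.3f}")
print(f"Glacier vs Standard PUT price ratio: {GLACIER_TRANSITION_PRICE / STANDARD_PUT_PRICE:.0f}x")
```

Both buckets hold the same 3 TB, yet Bucket A costs about $60 to transition and Bucket B about $0.018, because only the object count matters for this charge.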
Also, there are additional costs that need to be considered while moving data from S3 to Glacier. Hence it is always a good idea to do a cost analysis of whether it makes sense to move data from S3 to Glacier and determine when, if at all, you would see any savings.
I have covered such a cost analysis, with the various costs involved, in great detail in a blog post in case you are interested.
http://pragmaticnotes.com/2020/04/22/s3-to-glacier-lifecycle-transition-see-if-its-worth-it
Hope this helps!

How to migrate millions of files in aws s3 bucket from one account to another really fast

I have an s3 bucket in account A with millions of files that take up many GBs
I want to migrate all this data into a new bucket in account B
So far, I've given account B permissions to run s3 commands on the bucket in account A.
I am able to get some results with the aws s3 sync command with the setting aws configure set default.s3.max_concurrent_requests 100
It's fast, but it only reaches a speed of some 20,000 parts per minute.
Is there an approach to sync/move data across aws buckets in different accounts REALLY fast?
I tried to do aws transfer acceleration but it seems that that is good for uploading and downloading from the buckets and I think it works within an aws account.
20,000 parts per minute.
That's > 300/sec, so, um... that's pretty fast. It's also 1.2 million per hour, which is also pretty respectable.
S3 Request Rate and Performance Considerations implies that 300 PUT req/sec is something of a default performance threshold.
At some point, make too many requests too quickly and you'll overwhelm your index partition and you'll start encountering 503 Slow Down errors -- though hopefully aws-cli will handle that gracefully.
The idea, though, seems to be that S3 will scale up to accommodate the offered workload, so if you leave this process running, you may find that it actually does get faster with time.
Or...
If you expect a rapid increase in the request rate for a bucket to more than 300 PUT/LIST/DELETE requests per second or more than 800 GET requests per second, we recommend that you open a support case to prepare for the workload and avoid any temporary limits on your request rate.
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
Note, also, that it says "temporary limits." This is where I come to the conclusion that, all on its own, S3 will -- at some point -- provision more index capacity (presumably this means a partition split) to accommodate the increased workload.
You might also find that you get away with a much higher aggregate trx/sec if you run multiple separate jobs, each handling a different object prefix (e.g. asset/1, asset/2, asset/3, etc., depending on how the keys are designed in your bucket), because you're not creating such a hot spot in the object index.
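One way to run separate jobs per prefix is sketched below. It assumes the keys are laid out under directory-like prefixes such as asset/1/ and asset/2/; the bucket names, the prefixes, and the worker count are all hypothetical. Each job is an ordinary aws s3 sync invocation started from Python.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import List


def build_sync_commands(source_bucket: str, dest_bucket: str,
                        prefixes: List[str]) -> List[List[str]]:
    """Build one `aws s3 sync` command per key prefix."""
    return [
        ["aws", "s3", "sync",
         f"s3://{source_bucket}/{prefix}",
         f"s3://{dest_bucket}/{prefix}"]
        for prefix in prefixes
    ]


def run_all(commands: List[List[str]], max_workers: int = 4) -> List[int]:
    """Run the commands in parallel and return their exit codes in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


if __name__ == "__main__":
    # Placeholder bucket names and prefixes; adjust to your own key layout.
    commands = build_sync_commands(
        "source-bucket-account-a", "dest-bucket-account-b",
        ["asset/1/", "asset/2/", "asset/3/"],
    )
    for command in commands:
        print(" ".join(command))
    # run_all(commands)  # uncomment to actually start the copies
```

Threads are enough here because each worker only waits on an external process; the copying itself is done by the aws-cli processes.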
The copy operation going on here is an internal S3-to-S3 copy. It isn't download + upload. Transfer Acceleration only applies to actual uploads and downloads between a client and S3.

Deleting a large number of Versions from Amazon S3 Bucket

Ok, so I have a slight problem. I have had a backup program running on a NAS to an Amazon S3 bucket, with versioning enabled on the bucket. The NAS stores around 900GB of data.
I've had this running for a number of months now, and have been watching the bill go up and up for the cost of Amazon's Glacier service (which my versioning lifecycle rules stored objects in). The cost eventually got so high that I had to suspend versioning on the bucket in an effort to stop any more costs.
I now have a large number of versions on all our objects.
I have two questions:
I'm currently looking for a way to delete this large number of versioned files. From Amazon's own documentation, it would appear I have to delete each version individually. Is this correct? If so, what is the best way to achieve this? I assume it would be some kind of script which would have to list each item in a bucket and issue a DELETEVERSION to each versioned object? This would be a lot of requests, and I guess that leads on to my next question.
What are the cost implications of deleting a large number of Glacier objects in this way? It seems that deleting objects in Glacier is expensive. Does this also apply to versions created in S3?
Happy to provide more details if needed,
Thanks
Deletions from S3 are free, even if S3 has migrated the object to glacier, unless the object has been in glacier for less than 3 months, because glacier is intended for long-term storage. In that case, only, you're billed for the amount of time left (e.g., for an object stored for only 2 months, you will be billed an early deletion charge equal to 1 more month).
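The early deletion rule can be written as a small calculation. This sketch treats the 3-month minimum as 90 days (the figure mentioned for Glacier elsewhere on this page) and simply returns how many days of storage would still be billed; the actual price per day is not modelled.

```python
MINIMUM_STORAGE_DAYS = 90  # assumed equivalent of the 3-month minimum


def early_deletion_billable_days(days_stored: int,
                                 minimum_days: int = MINIMUM_STORAGE_DAYS) -> int:
    """Days of storage still charged when an object is deleted after `days_stored` days."""
    return max(0, minimum_days - days_stored)


for stored in (60, 90, 120):
    remaining = early_deletion_billable_days(stored)
    print(f"deleted after {stored} days: billed for {remaining} more days")
```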
You will still have to identify and specify the versions to delete, but S3 accepts up to 1000 objects or versions (max 1k entities) in a single multi-delete request.
http://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html
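A rough sketch of grouping versions into multi-delete requests follows. It only builds the request payloads from (key, version ID) pairs; the sample keys and version IDs are made up. With boto3, the pairs would typically come from the `Versions` and `DeleteMarkers` entries returned by `list_object_versions`, and each payload would be passed as the `Delete` argument of `delete_objects`.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

MAX_KEYS_PER_DELETE = 1000  # limit per multi-delete request, as stated above


def delete_batches(versions: Iterable[Tuple[str, str]],
                   batch_size: int = MAX_KEYS_PER_DELETE) -> Iterator[dict]:
    """Yield multi-delete payloads of at most `batch_size` (key, version_id) pairs."""
    batch: List[Dict[str, str]] = []
    for key, version_id in versions:
        batch.append({"Key": key, "VersionId": version_id})
        if len(batch) == batch_size:
            yield {"Objects": batch, "Quiet": True}
            batch = []
    if batch:  # final, partially filled batch
        yield {"Objects": batch, "Quiet": True}


# Made-up sample data: 2,500 versions of hypothetical backup files.
sample = [(f"backup/file-{i}.bin", f"version-{i}") for i in range(2500)]
payloads = list(delete_batches(sample))
print([len(p["Objects"]) for p in payloads])

# Sending one payload would look something like this (needs boto3 and credentials):
#   boto3.client("s3").delete_objects(Bucket="your-bucket", Delete=payload)
```

For 2,500 versions this produces three requests instead of 2,500 individual deletes.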

Expiry date for Glacier backups

Is there a way to set an expiry date in Amazon Glacier? I want to copy in weekly backup files, but I don't want to hang on to more than 1 year's worth.
Can the files be set to "expire" after one year, or is this something I will have to do manually?
While not available natively within Amazon Glacier, AWS has recently enabled Archiving Amazon S3 Data to Amazon Glacier, which makes working with Glacier much easier in the first place already:
[...] Amazon S3 was designed for rapid retrieval. Glacier, in
contrast, trades off retrieval time for cost, providing storage for as
little at $0.01 per Gigabyte per month while retrieving data within
three to five hours.
How would you like to have the best of both worlds? How about rapid
retrieval of fresh data stored in S3, with automatic, policy-driven
archiving to lower cost Glacier storage as your data ages, along with
easy, API-driven or console-powered retrieval? [emphasis mine]
[...] You can now use Amazon Glacier as a storage option for Amazon S3.
This is enabled by facilitating Amazon S3 Object Lifecycle Management, which not only drives the mentioned Object Archival (Transition Objects to the Glacier Storage Class) but also includes optional Object Expiration, which allows you to achieve what you want as outlined in section Before You Decide to Expire Objects within Lifecycle Configuration Rules:
The Expiration action deletes objects
You might have objects in Amazon S3 or archived to Amazon Glacier. No
matter where these objects are, Amazon S3 will delete them. You will
no longer be able to access these objects. [emphasis mine]
So at the small price of having your objects stored in S3 for a short time (which actually eases working with Glacier a lot due to removing the need to manage archives/inventories) you gain the benefit of optional automatic expiration.
You can do this in the AWS Command Line Interface.
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
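For the weekly backups in the question, a lifecycle rule combining archival and expiration might be sketched as follows. The prefix, the rule ID, and the one-day delay before archiving are assumptions; the dict is in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as `LifecycleConfiguration`.

```python
def archive_then_expire_lifecycle(prefix: str,
                                  archive_after_days: int = 1,
                                  expire_after_days: int = 365) -> dict:
    """Build a lifecycle configuration that archives objects to Glacier, then deletes them."""
    if expire_after_days <= archive_after_days:
        raise ValueError("expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-weekly-backups",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"},
                ],
                # The Expiration action deletes the object wherever it is stored.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# "weekly-backups/" is a placeholder prefix.
print(archive_then_expire_lifecycle("weekly-backups/"))
```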