S3 intelligent tiering based on percentage

Is there a way to maintain the "last" 20% of the objects stored in a specific bucket in S3 Standard and the rest in Standard-IA?
I might be wrong, but it looks like Intelligent-Tiering only lets me auto-transition objects based on the last time they were accessed.
Side note - aws documentation is hell on earth.

Intelligent-Tiering transitions objects based only on when they were last accessed. The idea is that the files you access more frequently stay in the Standard tier, so you aren't charged the access fee that you would be charged for Infrequent Access.
So, if you want to transition based on creation time, you have to go the standard route and create lifecycle rules.
There's no way to do this based on the number (or percentage) of objects in any scenario, only based on time that has passed.
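If lifecycle rules fit your case, a minimal sketch with boto3 might look like this (the bucket name and the 30-day threshold are made-up values; 30 days is also the minimum age S3 allows before a transition to Standard-IA):

    import boto3

    s3 = boto3.client("s3")

    # Transition every object to Standard-IA 30 days after creation.
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-example-bucket",  # hypothetical bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "standard-to-ia-after-30-days",
                    "Filter": {"Prefix": ""},  # apply to the whole bucket
                    "Status": "Enabled",
                    "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
                }
            ]
        },
    )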

Intelligent Tiering Issues

I was hoping to get other's perspective on something that we've done and beginning to realize it was not the best idea.
Here's some information about our "environment":
Account A: We have an AWS account that acts as a data lake (we upload transaction data to S3 daily)
Account B: We have another AWS account that our business partners use to access the data in Account A
A few months back, we enabled Intelligent-Tiering in S3 so that objects are moved to the Archive and Deep Archive tiers after 90 and 180 days, respectively. We're now seeing the downside of this decision: our business partners are unable to query data (in Account A) from 3 months ago in Athena (Account B). Oof.
I guess we did not understand the purpose of intelligent tiering and had hoped that Athena would be able to move tiered objects back into standard s3 when someone queries the data (as in instant retrieval).
There's definitely some use cases that we missed in vetting intelligent tiering.
I am curious: how are others leveraging Intelligent-Tiering? Are you only tiering objects that your business partners do not need as "instant retrieval"?
If your goal is to reduce storage costs, it is worth investigating and understanding the various storage classes offered by Amazon S3.
They generally fall into three categories:
Instantly available: This is the 'Standard' class.
Instantly available, lower storage cost but higher retrieval cost: These are the 'Infrequent Access' classes. They can be cheaper for data that is accessed once per month or less; if it is accessed more often, the request charges outweigh the savings in storage costs.
Archived: These are the Glacier classes. Avoid them if you want to query the data with Amazon Athena.
See the table in: Comparing the Amazon S3 storage classes
For your use-case, you might consider keeping data in Standard by default (since it is heavily accessed), and then move data older than 90 days to S3 One Zone - Infrequent Access. It will still be accessible, but will have a lower storage cost if rarely used.
I would also recommend converting your data to Snappy-compressed Parquet format (preferably partitioned), which will reduce the amount of storage required and will allow Athena to selectively pick which objects it needs to access. It will also make Athena run faster and reduce the cost of Athena queries.
See: Top 10 Performance Tuning Tips for Amazon Athena | AWS Big Data Blog
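One way to do that conversion is an Athena CTAS query kicked off with boto3; a rough sketch, where the database, table, partition column and bucket names are all made-up placeholders:

    import boto3

    athena = boto3.client("athena")

    # CTAS: rewrite the raw table as partitioned, Snappy-compressed Parquet.
    # Note: partition columns must come last in the SELECT list.
    ctas = """
    CREATE TABLE datalake.transactions_parquet
    WITH (
        format = 'PARQUET',
        write_compression = 'SNAPPY',
        external_location = 's3://my-example-bucket/transactions_parquet/',
        partitioned_by = ARRAY['txn_date']
    ) AS
    SELECT * FROM datalake.transactions_raw
    """

    athena.start_query_execution(
        QueryString=ctas,
        ResultConfiguration={"OutputLocation": "s3://my-example-bucket/athena-results/"},
    )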

AWS S3 batch operation - Got Dinged Pretty Hard

We used the newly introduced AWS S3 batch operations feature to back up our S3 bucket, which had about 15 TB of data, to S3 Glacier. Prior to backing up, we had estimated the bandwidth and storage costs and also taken into account the mandatory 90-day storage requirement for Glacier.
However, the actual costs turned out to be massive compared to our estimate. We somehow overlooked the upload request costs, which run at $0.05 per 1,000 requests. We have many millions of files, each file upload counted as a request, and we are looking at several thousand dollars' worth of spend :(
I am wondering if there was any way to avoid this?
The concept of "backup" is quite interesting.
Traditionally, where data was stored on one disk, a backup was imperative because it's not good to have a single point-of-failure.
Amazon S3, however, stores data on multiple devices across multiple Availability Zones (effectively multiple data centers), which is how they get their 99.999999999% durability and 99.99% availability. (Note that durability means the likelihood of retaining the data, which isn't quite the same as availability which means the ability to access the data. I guess the difference is that during a power outage, the data might not be accessible, but it hasn't been lost.)
Therefore, the traditional concept of taking a backup in case of device failure has already been handled in S3, all for the standard cost. (There is an older Reduced Redundancy option that only copied to 2 AZs instead of 3, but that is no longer recommended.)
Next comes the concept of backup in case of accidental deletion of objects. When an object is deleted in S3, it is not recoverable. However, enabling versioning on a bucket will retain multiple versions including deleted objects. This is great where previous histories of objects need to be kept, or where deletions might need to be undone. The downside is that storage costs include all versions that are retained.
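Turning versioning on is a single API call; a minimal sketch with boto3 (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")

    # Keep every version of every object; a delete just adds a delete marker
    # on top, so the previous versions remain recoverable.
    s3.put_bucket_versioning(
        Bucket="my-example-bucket",  # hypothetical bucket name
        VersioningConfiguration={"Status": "Enabled"},
    )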
There is also the new object lock capabilities in S3 where objects can be locked for a period of time (eg 3 years) without the ability to delete them. This is ideal for situations where information must be retained for a period and it avoids accidental deletion. (There is also a legal hold capability that is the same, but can be turned on/off if you have appropriate permissions.)
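Setting a default retention period looks roughly like this, assuming the bucket was created with Object Lock enabled (the bucket name and the 3-year period are illustrative):

    import boto3

    s3 = boto3.client("s3")

    # Default retention: objects cannot be deleted or overwritten for 3 years.
    # COMPLIANCE mode cannot be overridden, even by the root user; use
    # GOVERNANCE mode if privileged users should be able to bypass it.
    s3.put_object_lock_configuration(
        Bucket="my-example-bucket",  # hypothetical; Object Lock must be enabled on the bucket
        ObjectLockConfiguration={
            "ObjectLockEnabled": "Enabled",
            "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 3}},
        },
    )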
Finally, there is the potential for deliberate malicious deletion if an angry staff member decides to take revenge on your company for not stocking their favourite flavour of coffee. If an AWS user has the necessary permissions, they can delete the data from S3. To guard against this, you should limit who has such permissions and possibly combine it with versioning (so they can delete the current version of an object, but it is actually retained by the system).
This can also be addressed by using Cross-Region Replication of Amazon S3 buckets. Some organizations use this to copy data to a bucket owned by a different AWS account, such that nobody has the ability to delete data from both accounts. This is closer to the concept of a true backup because the copy is kept separate (account-wise) from the original. The extra cost of storage is minimal compared to the potential costs if the data was lost. Plus, if you configure the replica bucket to use the Glacier Deep Archive storage class, the costs can be quite low.
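A replication rule that lands the replicas in Glacier Deep Archive might look roughly like this; a sketch only, assuming versioning is already enabled on both buckets and that the role ARN and bucket names shown are placeholders (a cross-account setup also needs a bucket policy on the destination):

    import boto3

    s3 = boto3.client("s3")

    # Replicate everything to a second bucket, storing the replicas
    # in the Glacier Deep Archive storage class.
    s3.put_bucket_replication(
        Bucket="my-source-bucket",  # hypothetical source bucket
        ReplicationConfiguration={
            "Role": "arn:aws:iam::111122223333:role/s3-replication-role",  # placeholder
            "Rules": [
                {
                    "ID": "replicate-to-deep-archive",
                    "Status": "Enabled",
                    "Priority": 1,
                    "Filter": {},
                    "DeleteMarkerReplication": {"Status": "Disabled"},
                    "Destination": {
                        "Bucket": "arn:aws:s3:::my-backup-bucket",  # placeholder
                        "StorageClass": "DEEP_ARCHIVE",
                    },
                }
            ],
        },
    )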
Your copy to Glacier is another form of backup (and offers cheaper storage than S3 in the long term), but it would need to be updated on a regular basis to be a continuous backup (eg by using backup software that understands S3 and Glacier). The "5c per 1000 requests" cost means that it is better used for archives (eg large zip files) rather than many small files.
Bottom line: Your need for a backup might be as simple as turning on Versioning and limiting which users can totally delete an object (including all past versions) from the bucket. Or, create a bucket replica and store it in Glacier Deep Archive storage class.

AWS S3 how to get prefix cost in period

I have a bucket that receives something around 20 new prefixes in a day.
The prefixes contain files that are our products, and we need to know how much each product costs to keep online.
I was researching how to get the total cost of each product (storage and data transfer) with Cost Explorer and CloudWatch.
The first does not seem to help me, while CloudWatch does have prefix and tag options, but I would need to specify in advance which prefixes to watch.
Is there a way to get this cost without previous configuration?
Storage cost is easy, since it is based on the volume of data. Use Amazon S3 Inventory to obtain a daily listing of the bucket's contents.
Access costs are not available broken down by prefix. Instead, use Amazon S3 Server Access Logging to break down access by object and, therefore, by prefix, then allocate the billed data transfer costs amongst prefixes, using the Bytes Sent field to determine volume.
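For a smaller bucket you can also approximate storage per top-level prefix straight from the List API; a rough sketch (the bucket name is a placeholder, and this covers storage only, not data transfer):

    import boto3
    from collections import defaultdict

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")

    # Sum object sizes per top-level prefix ("product").
    bytes_per_prefix = defaultdict(int)
    for page in paginator.paginate(Bucket="my-product-bucket"):  # hypothetical bucket
        for obj in page.get("Contents", []):
            prefix = obj["Key"].split("/", 1)[0]
            bytes_per_prefix[prefix] += obj["Size"]

    # Multiply GB-months by the per-GB price of the storage class to estimate cost.
    for prefix, total in sorted(bytes_per_prefix.items()):
        print(f"{prefix}: {total / 1024 ** 3:.2f} GiB")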

Deleting a large number of Versions from Amazon S3 Bucket

OK, so I have a slight problem. I have had a backup program running from a NAS to an Amazon S3 bucket, with versioning enabled on the bucket. The NAS stores around 900 GB of data.
I've had this running for a number of months now, and have been watching the bill go up and up for the cost of Amazon's Glacier service (which my versioning lifecycle rules store objects in). The cost has eventually got so high that I have had to suspend versioning on the bucket in an effort to stop any more costs.
I now have a large number of versions on all our objects (screenshot example of one file omitted).
I have two questions:
I'm currently looking for a way to delete this large number of versioned files. From Amazon's own documentation it would appear I have to delete each version individually; is this correct? If so, what is the best way to achieve this? I assume it would be some kind of script which would have to list each item in the bucket and issue a delete request for each version, which would be a lot of requests, and I guess that leads on to my next question.
What are the cost implications of deleting a large number of Glacier objects in this way? Deleting objects from Glacier seems to be expensive; does this also apply to versions created in S3?
Happy to provide more details if needed,
Thanks
Deletions from S3 are free, even if S3 has migrated the object to Glacier, unless the object has been in Glacier for less than 3 months, because Glacier is intended for long-term storage. Only in that case are you billed for the time remaining (e.g., for an object stored for only 2 months, you will be billed an early deletion charge equal to 1 more month).
You will still have to identify and specify the versions to delete, but S3 accepts up to 1,000 keys or versions in a single multi-object delete request.
http://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html
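A rough sketch of that approach with boto3 (the bucket name is a placeholder; this permanently deletes every non-current version and old delete marker, so test it carefully first):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-example-bucket"  # hypothetical bucket name

    def flush(batch):
        # Multi-object delete: up to 1,000 keys/versions per request.
        if batch:
            s3.delete_objects(Bucket=bucket, Delete={"Objects": batch, "Quiet": True})

    batch = []
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        # Collect non-current versions and non-current delete markers;
        # the current version of each object is left untouched.
        for v in page.get("Versions", []):
            if not v["IsLatest"]:
                batch.append({"Key": v["Key"], "VersionId": v["VersionId"]})
        for m in page.get("DeleteMarkers", []):
            if not m["IsLatest"]:
                batch.append({"Key": m["Key"], "VersionId": m["VersionId"]})
        if len(batch) >= 1000:
            flush(batch[:1000])
            batch = batch[1000:]

    flush(batch)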

can i meter or set a size limit to an s3 folder

I'd like to set up a separate s3 bucket folder for each of my mobile app users for them to store their files. However, I also want to set up size limits so that they don't use up too much storage. Additionally, if they do go over the limit I'd like to offer them increased space if they sign up for a premium service.
Is there a way I can set folder file-size limits through S3 configuration or the API? If not, would I have to use the APIs somehow to calculate folder size on every upload? I know that there is the DevPay feature in Amazon, but it might be a hassle for users to sign up with Amazon if they just want to use a small amount of free space.
There does not appear to be a way to do this, probably at least in part because there is actually no such thing as "folders" in S3. There is only the appearance of folders.
Amazon S3 does not have concept of a folder, there are only buckets and objects. The Amazon S3 console supports the folder concept using the object key name prefixes.
— http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
All of the keys in an S3 bucket are actually in a flat namespace, with the / delimiter used as desired to conceptually divide objects into logical groupings that look like folders, but it's only a convenient illusion. It seems impossible that S3 would have a concept of the size of a folder, when it has no actual concept of "folders" at all.
If you don't maintain an authoritative database of what's been stored by clients (which suggests that all uploads should pass through an app server rather than going directly to S3, which is the only approach that makes sense to me at all), then your only alternative is to poll S3 to discover what's there. An imperfect shortcut would be for your application to read the S3 bucket logs to discover what had been uploaded, but that is only provided on a best-effort basis; it should be reliable but is not guaranteed to be perfect.
This service provides a best effort attempt to log all access of objects within a bucket. Please note that it is possible that the actual usage report at the end of a month will slightly vary.
Your other option is to develop your own service that sits between users and Amazon S3, that monitors all requests to your buckets/objects.
— http://aws.amazon.com/articles/1109#13
Again, having your app server mediate all requests seems to be the logical approach, and would also allow you to detect immediately (as opposed to "discover later") that a user had exceeded a threshold.
I would maintain a separate database in the cloud to hold each user's total storage usage. It's easy to maintain that count via S3 event notifications, which can trigger a Lambda function that in turn updates the database.
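A minimal sketch of that Lambda, assuming each object key starts with the user's ID and that a hypothetical DynamoDB table named user_storage (partition key user_id) holds the running totals:

    import boto3
    from urllib.parse import unquote_plus

    dynamodb = boto3.client("dynamodb")
    TABLE = "user_storage"  # hypothetical table name

    def handler(event, context):
        # Invoked by S3 event notifications for s3:ObjectCreated:* events.
        for record in event["Records"]:
            key = unquote_plus(record["s3"]["object"]["key"])  # e.g. "user123/photos/cat.jpg"
            size = record["s3"]["object"]["size"]              # size in bytes
            user_id = key.split("/", 1)[0]                     # assumes keys start with the user ID

            # Atomically add the new object's size to the user's running total.
            dynamodb.update_item(
                TableName=TABLE,
                Key={"user_id": {"S": user_id}},
                UpdateExpression="ADD bytes_used :sz",
                ExpressionAttributeValues={":sz": {"N": str(size)}},
            )

    # ObjectRemoved events don't include the object size, so handling deletions
    # would need a lookup (or sizes stored per key) before decrementing the total.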