Intelligent Tiering Issues - amazon-web-services

I was hoping to get others' perspectives on something we've done that we're beginning to realize was not the best idea.
Here's some information about our "environment":
Account A: We have an AWS account that acts as a data lake (we upload transaction data to S3 daily)
Account B: We have another AWS account that our business partners use to access the data in Account A
A few months back, we enabled Intelligent Tiering in S3 so that objects are moved to the Archive and Deep Archive tiers after 90 and 180 days, respectively. We're now seeing the downside of this decision. Our business partners are unable to query data (in Account A) from 3 months ago in Athena (Account B). Oof.
I guess we did not understand the purpose of intelligent tiering and had hoped that Athena would be able to move tiered objects back into standard s3 when someone queries the data (as in instant retrieval).
There are definitely some use cases that we missed when vetting Intelligent Tiering.
I am curious how others are leveraging Intelligent Tiering. Are you only tiering objects that your business partners do not need instant retrieval for?

If your goal is to reduce storage costs, it is worth investigating and understanding the various storage classes offered by Amazon S3.
They generally fall into three categories:
Instantly available: This is the 'standard' class
Instantly available, lower storage cost but higher retrieval cost: These are the 'Infrequent Access' classes. They can be cheaper for data that is only accessed once per month or less. If the data is accessed more often, the request charges outweigh the savings in storage costs.
Archived: These are typically the Glacier classes. Archived objects generally have to be restored before they can be read, so avoid them if you want to use Amazon Athena.
See the table in: Comparing the Amazon S3 storage classes
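To find out which objects have already slipped into an archived state, the metadata returned for each object can be inspected. The following is a minimal sketch, assuming the input is a dict shaped like a boto3 `head_object` response (with the optional `StorageClass`, `ArchiveStatus` and `Restore` keys); the function name `needs_restore` is hypothetical.

```python
# Intelligent-Tiering archive tiers, as reported in the ArchiveStatus field.
ARCHIVED_TIERS = {"ARCHIVE_ACCESS", "DEEP_ARCHIVE_ACCESS"}
# Storage classes that require a restore before the object can be read.
ARCHIVED_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}


def needs_restore(head: dict) -> bool:
    """Return True if the object cannot be read with a plain GET.

    `head` is assumed to look like a boto3 head_object response.
    Objects in the Standard class carry no StorageClass key at all.
    """
    if head.get("ArchiveStatus") in ARCHIVED_TIERS:
        return True
    if head.get("StorageClass") in ARCHIVED_CLASSES:
        # A completed restore exposes a temporary readable copy.
        return 'ongoing-request="false"' not in head.get("Restore", "")
    return False


if __name__ == "__main__":
    tiered = {"StorageClass": "INTELLIGENT_TIERING", "ArchiveStatus": "ARCHIVE_ACCESS"}
    print(needs_restore(tiered))
```

Running a check like this over the keys behind an Athena table would show how much of the data is currently out of reach.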
For your use-case, you might consider keeping data in Standard by default (since it is heavily accessed), and then move data older than 90 days to S3 One Zone - Infrequent Access. It will still be accessible, but will have a lower storage cost if rarely used.
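As a rough illustration of that suggestion, a lifecycle rule can be described as a plain dict and handed to boto3. This is only a sketch: the rule ID, the empty prefix and the helper names are assumptions, and `put_bucket_lifecycle_configuration` replaces whatever lifecycle configuration the bucket already has.

```python
def one_zone_ia_rule(days: int = 90, prefix: str = "") -> dict:
    """Build a lifecycle configuration that moves objects to One Zone-IA.

    `days` counts from object creation; an empty prefix matches every object.
    """
    return {
        "Rules": [
            {
                "ID": f"to-onezone-ia-after-{days}d",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": "ONEZONE_IA"}],
            }
        ]
    }


def apply_rule(bucket: str, config: dict) -> None:
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    print(one_zone_ia_rule())
```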
I would also recommend converting your data to Snappy-compressed Parquet format (preferably partitioned), which will reduce the amount of storage required and allow Athena to selectively pick which objects it needs to access. It will also make Athena run faster and reduce the cost of Athena queries.
See: Top 10 Performance Tuning Tips for Amazon Athena | AWS Big Data Blog

Related

S3 intelligent tiering based on percentage

Is there a way to maintain the "last" 20% of the objects stored in a specific bucket in a S3 Standard and the rest in Standard-IA?
I might be wrong, but it looks like intelligent tiering only lets me auto-transition objects based on the last time they were accessed.
Side note - aws documentation is hell on earth.
Intelligent tiering transitions objects based on last access time only. The idea is that the files you access more frequently stay in the frequent access tier, so you aren't charged the retrieval fee that you would pay with the Infrequent Access classes.
So, if you want to do it based on creation time, you would have to go the standard way, by making lifecycle rules.
There's no way to do this based on number of files in any scenario, only based on time that has passed.

AWS S3 batch operation - Got Dinged Pretty Hard

We used the newly introduced AWS S3 batch operation to back up our S3 bucket, which had about 15 TB of data, to S3 Glacier. Prior to backing up, we had estimated the bandwidth and storage costs and had also taken into account the mandatory 90-day storage requirement for Glacier.
However, the actual costs turned out to be massive compared to our estimate. We somehow overlooked the upload request costs, which run at $0.05 per 1,000 requests. We have many millions of files, each file upload counted as a request, and we are looking at several thousand dollars' worth of spend :(
I am wondering whether there is any way to avoid this?
The concept of "backup" is quite interesting.
Traditionally, where data was stored on one disk, a backup was imperative because it's not good to have a single point-of-failure.
Amazon S3, however, stores data on multiple devices across multiple Availability Zones (effectively multiple data centers), which is how they get their 99.999999999% durability and 99.99% availability. (Note that durability means the likelihood of retaining the data, which isn't quite the same as availability which means the ability to access the data. I guess the difference is that during a power outage, the data might not be accessible, but it hasn't been lost.)
Therefore, the traditional concept of taking a backup in case of device failure has already been handled in S3, all for the standard cost. (There is an older Reduced Redundancy option that only copied to 2 AZs instead of 3, but that is no longer recommended.)
Next comes the concept of backup in case of accidental deletion of objects. When an object is deleted in S3, it is not recoverable. However, enabling versioning on a bucket will retain multiple versions including deleted objects. This is great where previous histories of objects need to be kept, or where deletions might need to be undone. The downside is that storage costs include all versions that are retained.
There are also the new Object Lock capabilities in S3, where objects can be locked for a period of time (e.g. 3 years) without the ability to delete them. This is ideal for situations where information must be retained for a period, and it avoids accidental deletion. (There is also a legal hold capability that does the same, but can be turned on/off if you have appropriate permissions.)
Finally, there is the potential for deliberate malicious deletion if an angry staff member decides to take revenge on your company for not stocking their favourite flavour of coffee. If an AWS user has the necessary permissions, they can delete the data from S3. To guard against this, you should limit who has such permissions and possibly combine it with versioning (so they can delete the current version of an object, but it is actually retained by the system).
This can also be addressed by using Cross-Region Replication of Amazon S3 buckets. Some organizations use this to copy data to a bucket owned by a different AWS account, such that nobody has the ability to delete data from both accounts. This is closer to the concept of a true backup because the copy is kept separate (account-wise) from the original. The extra cost of storage is minimal compared to the potential costs if the data was lost. Plus, if you configure the replica bucket to use the Glacier Deep Archive storage class, the costs can be quite low.
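A replication setup along these lines could be described with a configuration dict such as the one below. Treat it as a sketch under assumptions: the role ARN, bucket ARN, account ID, rule ID and helper names are placeholders, versioning must already be enabled on both buckets, and the IAM role and destination bucket policy are not shown.

```python
def replica_backup_config(role_arn: str, dest_bucket_arn: str, dest_account_id: str) -> dict:
    """Build a replication configuration that copies every object to a
    bucket in another account and stores the replica in Deep Archive."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes to replicate objects
        "Rules": [
            {
                "ID": "backup-to-deep-archive",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix = whole bucket
                # Do not propagate deletes, so the replica survives them.
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": dest_bucket_arn,
                    "Account": dest_account_id,
                    # Hand ownership of replicas to the destination account.
                    "AccessControlTranslation": {"Owner": "Destination"},
                    "StorageClass": "DEEP_ARCHIVE",
                },
            }
        ],
    }


def apply_replication(source_bucket: str, config: dict) -> None:
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("s3").put_bucket_replication(
        Bucket=source_bucket, ReplicationConfiguration=config
    )
```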
Your copy to Glacier is another form of backup (and offers cheaper storage than S3 in the long term), but it would need to be updated on a regular basis to be a continuous backup (e.g. by using backup software that understands S3 and Glacier). The "5c per 1000 requests" cost means that it is better used for archives (e.g. large zip files) rather than many small files.
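The request charge is easy to estimate up front. The sketch below uses the $0.05 per 1,000 requests figure quoted in the question as its default; the object counts in the example are made up for illustration, and current prices should be checked before relying on the result.

```python
def upload_request_cost(num_objects: int, price_per_1000: float = 0.05) -> float:
    """Estimate the request charge for uploading `num_objects` objects,
    assuming one request per object."""
    return num_objects / 1000 * price_per_1000


if __name__ == "__main__":
    # Hypothetical comparison: 20 million small files uploaded one by one,
    # versus the same data bundled into 20,000 archives first.
    print(f"individual files: ${upload_request_cost(20_000_000):,.2f}")
    print(f"bundled archives: ${upload_request_cost(20_000):,.2f}")
```

Because the charge depends on the number of requests rather than the number of bytes, bundling small files before upload is what brings it down.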
Bottom line: Your need for a backup might be as simple as turning on Versioning and limiting which users can totally delete an object (including all past versions) from the bucket. Or, create a bucket replica and store it in Glacier Deep Archive storage class.

Related to AWS S3 storage

When we choose Reduced Redundancy Storage or Infrequent Access storage, what does AWS do on its side? Are they using a different type of storage to reduce the cost?
AWS does not reveal details of the inner workings of their systems.
Instead, they provide an API and documented explanations of the services provided. This means that the internal methods can change, while still supporting the published API.
RRS S3 vs Standard S3:
Durability: As far as RRS is concerned, the reduced cost is due to inferior durability. Standard S3 provides 99.999999999% durability, but RRS provides 99.99%, i.e. on average, 1 object out of 10,000 (0.01%) may be expected to be lost per year.
Facility Tolerance: Objects stored in Standard S3 are replicated across more facilities than RRS objects. The facility tolerance factor for RRS is 1, i.e. it can withstand the loss of up to 1 facility.
Performance: Performance-wise, RRS provides the same latency and throughput as Standard S3. If your data is critical and you need very high redundancy, it is recommended to go with Standard S3. Any non-critical and easily reproducible data can be stored in RRS.
Pricing: The main advantage of RRS used to be the cost. Now that Amazon has removed the cost advantage, there is not much difference in price between RRS S3 and Std S3 (maybe some minute difference in some regions). Why pay the same price for a lesser benefit? So you might actually prefer Std S3 in this respect. AWS might as well deprecate this feature, as it's of limited use now.
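The durability figures above translate directly into an expected number of lost objects per year. A small sketch of that arithmetic, using `Decimal` so the many-nines percentages are not distorted by floating point (the function name is hypothetical):

```python
from decimal import Decimal


def expected_annual_losses(num_objects: int, durability_percent: str) -> Decimal:
    """Expected number of objects lost per year for a durability
    given as a percentage string, e.g. "99.99"."""
    loss_fraction = (Decimal(100) - Decimal(durability_percent)) / Decimal(100)
    return loss_fraction * num_objects


if __name__ == "__main__":
    print("RRS, 10,000 objects:     ", expected_annual_losses(10_000, "99.99"))
    print("Standard, 10,000 objects:", expected_annual_losses(10_000, "99.999999999"))
```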
Std Infrequent Access S3 vs Standard S3:
Performance and durability are the same for both classes; Standard-IA is designed for slightly lower availability (99.9% versus 99.99%).
Purpose: The main difference between these two types depends on your need. As the name suggests, Std IA S3 is more suitable for less frequently accessed, long-term data (which is still available and retrievable immediately when required). This is also one of the factors that differentiates Std IA S3 from Glacier.
Pricing: The pricing model is also defined according to the purpose of use. IA S3 offers a low cost for storing objects (almost 50% of that of Std S3), but charges almost double for operations (put/copy/retrieve) compared to Std S3. So if you are storing data in IA and accessing it frequently, you may end up paying more than if you had used Std S3. Database backups/DR data can be stored in IA. You don't access this data regularly, but it will be available immediately during critical scenarios. IA S3 also has a minimum billable object size of 128 KB. Smaller objects will be charged for 128 KB of storage.
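The minimum object size matters most for buckets full of small files. A minimal sketch of the rule as stated above (the helper name is made up):

```python
MIN_BILLABLE_BYTES = 128 * 1024  # 128 KB minimum for Infrequent Access


def ia_billable_bytes(object_sizes: list[int]) -> int:
    """Total bytes billed in IA when every object is charged for
    at least 128 KB, whatever its real size."""
    return sum(max(size, MIN_BILLABLE_BYTES) for size in object_sizes)


if __name__ == "__main__":
    sizes = [10 * 1024] * 1000  # one thousand 10 KB objects
    print("stored bytes:", sum(sizes))
    print("billed bytes:", ia_billable_bytes(sizes))
```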

Deleting a large number of Versions from Amazon S3 Bucket

OK, so I have a slight problem. I have had a backup program running on a NAS, backing up to an Amazon S3 bucket, and have had versioning enabled on the bucket. The NAS stores around 900 GB of data.
I've had this running for a number of months now, and have been watching the bill go up and up for the cost of Amazon's Glacier service (which my versioning lifecycle rules stored objects in). The cost eventually got so high that I had to suspend versioning on the bucket in an effort to stop any more costs.
I now have a large number of versions of all our objects.
I have two questions:
I'm currently looking for a way to delete this large number of versioned files. From Amazon's own documentation, it would appear I have to delete each version individually. Is this correct? If so, what is the best way to achieve this? I assume it would be some kind of script that would have to list each item in the bucket and issue a DELETEVERSION for each versioned object? This would be a lot of requests, and I guess that leads on to my next question.
What are the cost implications of deleting a large number of Glacier objects in this way? It seems deleting objects in Glacier is expensive. Does this also apply to versions created in S3?
Happy to provide more details if needed,
Thanks
Deletions from S3 are free, even if S3 has migrated the object to glacier, unless the object has been in glacier for less than 3 months, because glacier is intended for long-term storage. In that case, only, you're billed for the amount of time left (e.g., for an object stored for only 2 months, you will be billed an early deletion charge equal to 1 more month).
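The early deletion rule boils down to a one-line calculation. A sketch using the 3-month minimum described above (the function name is hypothetical, and the charge is expressed in months of storage rather than dollars):

```python
def early_deletion_months(months_stored: float, minimum_months: float = 3) -> float:
    """Months of storage still billed when an object is deleted
    before the minimum storage duration has passed."""
    return max(0, minimum_months - months_stored)


if __name__ == "__main__":
    for stored in (1, 2, 3, 6):
        print(stored, "month(s) stored ->", early_deletion_months(stored), "month(s) charged")
```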
You will still have to identify and specify the versions to delete, but S3 accepts up to 1,000 objects or versions (max 1,000 entities) in a single multi-delete request.
http://docs.aws.amazon.com/AmazonS3/latest/API/multiobjectdeleteapi.html
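A script along those lines might look like the following sketch. It assumes boto3 and suitable credentials, the helper names are made up, and it removes every version and delete marker in the bucket, so it should be tried on a throwaway bucket first.

```python
from typing import Iterable, Iterator


def batches(items: Iterable[dict], size: int = 1000) -> Iterator[list]:
    """Group items into lists of at most `size` (the multi-delete limit)."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def iter_versions(s3, bucket: str) -> Iterator[dict]:
    """Yield a Key/VersionId pair for every version and delete marker."""
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        for entry in page.get("Versions", []) + page.get("DeleteMarkers", []):
            yield {"Key": entry["Key"], "VersionId": entry["VersionId"]}


def delete_all_versions(bucket: str) -> None:
    """Delete everything in the bucket, 1,000 entries per request."""
    import boto3  # imported lazily so the helpers above work without it

    s3 = boto3.client("s3")
    for batch in batches(iter_versions(s3, bucket)):
        s3.delete_objects(Bucket=bucket, Delete={"Objects": batch, "Quiet": True})
```

To keep the current version of each object, entries whose `IsLatest` field is true could be skipped inside `iter_versions`.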

Expiry date for Glacier backups

Is there a way to set an expiry date in Amazon Glacier? I want to copy in weekly backup files, but I don't want to hang on to more than 1 year's worth.
Can the files be set to "expire" after one year, or is this something I will have to do manually?
While this is not available natively within Amazon Glacier, AWS has recently enabled Archiving Amazon S3 Data to Amazon Glacier, which makes working with Glacier much easier in the first place:
[...] Amazon S3 was designed for rapid retrieval. Glacier, in contrast, trades off retrieval time for cost, providing storage for as little at $0.01 per Gigabyte per month while retrieving data within three to five hours.
How would you like to have the best of both worlds? How about rapid retrieval of fresh data stored in S3, with automatic, policy-driven archiving to lower cost Glacier storage as your data ages, along with easy, API-driven or console-powered retrieval? [emphasis mine]
[...] You can now use Amazon Glacier as a storage option for Amazon S3.
This is enabled by facilitating Amazon S3 Object Lifecycle Management, which not only drives the mentioned Object Archival (Transition Objects to the Glacier Storage Class) but also includes optional Object Expiration, which allows you to achieve what you want as outlined in section Before You Decide to Expire Objects within Lifecycle Configuration Rules:
The Expiration action deletes objects
You might have objects in Amazon S3 or archived to Amazon Glacier. No matter where these objects are, Amazon S3 will delete them. You will no longer be able to access these objects. [emphasis mine]
So at the small price of having your objects stored in S3 for a short time (which actually eases working with Glacier a lot due to removing the need to manage archives/inventories) you gain the benefit of optional automatic expiration.
You can do this in the AWS Command Line Interface.
http://docs.aws.amazon.com/AmazonS3/latest/dev/object-lifecycle-mgmt.html
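The same rule can also be set from Python with boto3. The sketch below builds a lifecycle configuration that archives objects to Glacier and expires them after a year; the `backups/` prefix, the rule ID, the day counts and the helper names are assumptions chosen for illustration, and applying the configuration replaces any lifecycle rules the bucket already has.

```python
def archive_then_expire(
    prefix: str = "backups/",
    archive_after_days: int = 1,
    expire_after_days: int = 365,
) -> dict:
    """Build a lifecycle configuration: transition to Glacier, then delete."""
    if expire_after_days <= archive_after_days:
        raise ValueError("objects must expire after they are archived")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": archive_after_days, "StorageClass": "GLACIER"}
                ],
                # Expiration deletes the object wherever it is stored.
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


def apply_rule(bucket: str, config: dict) -> None:
    """Send the configuration to S3 (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above works without it

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```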