Related to AWS S3 storage

When we choose Reduced Redundancy Storage or Infrequent Access storage, what action does AWS take on their side? Are they using a different type of storage to reduce the cost?

AWS does not reveal details of the inner workings of their systems.
Instead, they provide an API and documented explanations of the services provided. This means that the internal methods can change, while still supporting the published API.

RRS S3 vs Standard S3:
Durability: As far as RRS is concerned, the reduced cost comes from lower durability. Standard S3 provides 99.999999999% durability, but RRS provides only 99.99%, i.e. on average 1 object out of every 10,000 (0.01%) may be expected to be lost per year.
Facility Tolerance: Objects stored in Standard S3 are replicated across more facilities than RRS objects. The facility tolerance factor for RRS is 1, i.e. it can withstand the loss of up to 1 facility.
Performance: Performance-wise, RRS provides the same latency and throughput as Standard S3. If your data is critical and you need very high redundancy, it is recommended to go with Standard S3. Non-critical and easily reproducible data can be stored in RRS.
Pricing: The main advantage of RRS used to be the cost. Now that Amazon has removed the cost advantage, there is not much difference in price between RRS and Standard S3 (maybe a minute difference in some regions). Why pay the same price for less benefit? So you might actually prefer Standard S3 in this respect. AWS might as well deprecate this feature, as it is of limited use now.
Std Infrequent Access S3 vs Standard S3:
Performance-, durability- and availability-wise, both storage classes offer the same package.
Purpose: The main difference between these two classes comes down to your need. As the name suggests, Standard-IA is more suitable for less frequently accessed, long-term data (but it is still available and retrievable immediately when required). This is also one of the factors that differentiates Standard-IA from Glacier.
Pricing: The pricing model is also defined according to the purpose of use. Standard-IA offers a lower cost for storing objects (roughly 50% of Standard S3), but charges almost double for operations (PUT/COPY/retrieve) compared to Standard S3. So if you store data in IA and access it frequently, you may end up paying more than if you had used Standard S3. Database backups/DR data can be stored in IA: you don't access this data regularly, but it is available immediately in critical scenarios. Standard-IA also has a minimum billable object size of 128 KB; smaller objects are charged as 128 KB of storage.
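For reference, the storage class is just a parameter on the upload request; here is a minimal boto3 sketch (the bucket, key and file names are made up for illustration):

```python
import boto3

s3 = boto3.client("s3")

# Upload an easily reproducible object directly into Standard-IA
# (remember: objects smaller than 128 KB are still billed as 128 KB in this class).
with open("summary.csv", "rb") as body:
    s3.put_object(
        Bucket="my-example-bucket",
        Key="reports/2020/summary.csv",
        Body=body,
        StorageClass="STANDARD_IA",  # or "REDUCED_REDUNDANCY" for the legacy RRS class
    )
```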

Intelligent Tiering Issues

I was hoping to get others' perspective on something that we've done and are beginning to realize was not the best idea.
Here's some information about our "environment":
Account A: We have an AWS account that acts as a data lake (we upload transaction data to S3 daily)
Account B: We have another AWS account that our business partners use to access the data in Account A
A few months back, we enabled Intelligent-Tiering in S3, where objects are moved to Archive and Deep Archive after 90 and 180 days, respectively. We're now seeing the downside of this decision. Our business partners are unable to query data (in Account A) from 3 months ago in Athena (Account B). Oof.
I guess we did not understand the purpose of Intelligent-Tiering and had hoped that Athena would be able to move tiered objects back into Standard S3 when someone queries the data (as in instant retrieval).
There's definitely some use cases that we missed in vetting intelligent tiering.
I am curious how others are leveraging Intelligent-Tiering. Are you only tiering objects that your business partners do not need as "instant retrieval"?
If your goal is to reduce Storage costs, it is worth investigating and understanding the various Storage Classes offered by Amazon S3.
They generally fall into three categories:
Instantly available: This is the 'standard' class
Instantly available, lower storage cost but higher retrieval cost: These are the 'Infrequent Access' classes. They can be cheaper for data that is only accessed once per month or less. If objects are accessed more often, the Request charges outweigh the savings in Storage costs.
Archived: These are typically the Glacier classes. Avoid them if you want to use Amazon Athena.
See the table at: Comparing the Amazon S3 storage classes
For your use-case, you might consider keeping data in Standard by default (since it is heavily accessed), and then move data older than 90 days to S3 One Zone - Infrequent Access. It will still be accessible, but will have a lower storage cost if rarely used.
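If you go that route, the transition can be automated with a lifecycle rule rather than done by hand. A rough boto3 sketch, where the bucket name and the data/ prefix are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Move objects under data/ to One Zone-IA once they are 90 days old.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "older-than-90-days-to-onezone-ia",
                "Status": "Enabled",
                "Filter": {"Prefix": "data/"},
                "Transitions": [{"Days": 90, "StorageClass": "ONEZONE_IA"}],
            }
        ]
    },
)
```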
I would also recommend converting your data to Snappy-compressed Parquet format (preferably partitioned), which will reduce the amount of storage required and will allow Athena to selectively pick which objects it needs to access. It will also make Athena run faster and reduce the cost of Athena queries.
See: Top 10 Performance Tuning Tips for Amazon Athena | AWS Big Data Blog
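As a rough illustration of the Parquet suggestion (the column names, the year/month partition scheme and the pyarrow/s3fs dependencies are assumptions, not details from your setup):

```python
import pandas as pd

# A daily transaction extract loaded as a DataFrame (schema is hypothetical).
df = pd.read_csv("transactions_2021-01-15.csv", parse_dates=["transaction_date"])
df["year"] = df["transaction_date"].dt.year
df["month"] = df["transaction_date"].dt.month

# Write Snappy-compressed, partitioned Parquet straight to S3
# (requires the pyarrow and s3fs packages to be installed).
df.to_parquet(
    "s3://my-data-lake/transactions/",
    engine="pyarrow",
    compression="snappy",
    partition_cols=["year", "month"],
)
```

With a partitioned layout like this, Athena only reads the year/month prefixes a query actually touches instead of scanning every object.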

How would you program a strong read-after-write consistency in a distributed system?

Recently, S3 announced strong read-after-write consistency. I'm curious as to how one can program that. Doesn't it violate the CAP theorem?
In my mind, the simplest way is to wait for the replication to happen and then return, but that would result in performance degradation.
AWS says that there is no performance difference. How is this achieved?
Another thought is that Amazon has a giant index table that keeps track of all S3 objects and where they are stored (triple replication, I believe), and it would need to update this index on every PUT/DELETE. Is that technically feasible?
As indicated by Martin above, there is a link to Reddit which discusses this. The top response from u/ryeguy gave this answer:
If I had to guess, s3 synchronously writes to a cluster of storage nodes before returning success, and then asynchronously replicates it to other nodes for stronger durability and availability. There used to be a risk of reading from a node that didn't receive a file's change yet, which could give you an outdated file. Now they added logic so the lookup router is aware of how far an update is propagated and can avoid routing reads to stale replicas.
I just pulled all this out of my ass and have no idea how s3 is actually architected behind the scenes, but given the durability and availability guarantees and the fact that this change doesn't lower them, it must be something along these lines.
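To make that guess concrete, here is a toy, in-memory model of "propagation-aware" read routing. It is purely illustrative of the idea above and says nothing about how S3 is actually built:

```python
# Toy model: a router tracks how far each key's latest version has propagated
# and only serves reads from replicas that already have that version.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def apply(self, key, version, value):
        self.store[key] = (version, value)

    def version_of(self, key):
        return self.store.get(key, (0, None))[0]


class Router:
    def __init__(self, replicas):
        self.replicas = replicas
        self.latest = {}  # key -> latest acknowledged version

    def put(self, key, value):
        version = self.latest.get(key, 0) + 1
        # Write synchronously to a subset of replicas before acknowledging...
        for replica in self.replicas[:2]:
            replica.apply(key, version, value)
        self.latest[key] = version
        # ...the remaining replicas would catch up asynchronously (not modelled).
        return version

    def get(self, key):
        wanted = self.latest.get(key, 0)
        # Route the read only to a replica that has seen the latest version.
        for replica in self.replicas:
            if replica.version_of(key) >= wanted:
                return replica.store.get(key, (0, None))[1]
        raise RuntimeError("no up-to-date replica available")


router = Router([Replica() for _ in range(3)])
router.put("photos/cat.jpg", b"v1")
print(router.get("photos/cat.jpg"))  # always reflects the latest acknowledged write
```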
Better answers are welcome.
Our usual assumptions do not hold in cloud systems. There are many factors involved in the risk analysis, such as availability, consistency, disaster recovery, backup mechanisms, maintenance burden, charges, etc. Also, theorems are only a reference point when designing; we can create our own approach by combining several of them. So I would like to share the link provided by AWS, which illustrates the process in detail.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
When you create a cluster with consistent view enabled, Amazon EMR uses an Amazon DynamoDB database to store object metadata and track consistency with Amazon S3. You must grant the EMRFS role permissions to access DynamoDB. If consistent view determines that Amazon S3 is inconsistent during a file system operation, it retries that operation according to rules that you can define. By default, the DynamoDB database has 400 read capacity units and 100 write capacity units. You can configure read/write capacity settings depending on the number of objects that EMRFS tracks and the number of nodes concurrently using the metadata. You can also configure other database and operational parameters. Using consistent view incurs DynamoDB charges, which are typically small, in addition to the charges for Amazon EMR.
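For completeness, consistent view is switched on through the emrfs-site classification when the cluster is created. A trimmed boto3 sketch, where the release label, instance types and roles are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Launch a cluster with EMRFS consistent view enabled via the emrfs-site classification.
emr.run_job_flow(
    Name="cluster-with-consistent-view",
    ReleaseLabel="emr-5.30.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    Configurations=[
        {
            "Classification": "emrfs-site",
            "Properties": {
                "fs.s3.consistent": "true",
                "fs.s3.consistent.retryCount": "5",
                "fs.s3.consistent.metadata.read.capacity": "400",
                "fs.s3.consistent.metadata.write.capacity": "100",
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```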

Costs of backing up S3 bucket

I have a very large bucket: about 10 million files of 1 MB each, for a total of 10 TB.
Files are continuously added to it (never modified). Let's say 1 TB per month.
I back up this bucket to a different one in the same region using a Replication configuration.
I don't use Glacier, for various availability and cost considerations.
I'm wondering if I should use Standard or Infrequent Access storage, as there is a very large number of files and I'm not sure how the COPY request cost will affect the total.
What is the difference in cost between the options? The cost of storage is quite clear, but for COPY and other operations it's not very clear.
A good rule-of-thumb is that Infrequent Access and Glacier are only cheaper if the objects are accessed less than once per month.
This is because those storage classes have a charge for data retrieval.
Let's say data is retrieved once per month:
Standard = $0.023/GB/month
Standard - Infrequent Access = $0.0125/GB/month plus $0.01/GB for retrieval = $0.0225
Glacier = $0.004/GB/month plus ~ $0.01/GB = $0.014 -- a good price, but slow to retrieve
Glacier Deep Archive = $0.00099/GB/month + $0.02 = $0.021
Therefore, if the backup data is infrequently accessed (less than once per month), it would be a significant saving to use a different storage class. The Same-Region Replication configuration can automatically change the Storage Class when copying the objects.
The Request charges would be insignificant compared to these cost savings.
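Plugging in the numbers from the question (10 TB, essentially never read back), the rough monthly comparison looks like this. The per-GB figures are the ones quoted above and will vary by region:

```python
# Back-of-the-envelope monthly cost for 10 TB (10,240 GB), using the prices above.
size_gb = 10 * 1024
retrievals_per_month = 0  # a backup copy that is almost never read back

standard     = size_gb * 0.023
standard_ia  = size_gb * 0.0125  + retrievals_per_month * size_gb * 0.01
glacier      = size_gb * 0.004   + retrievals_per_month * size_gb * 0.01
deep_archive = size_gb * 0.00099 + retrievals_per_month * size_gb * 0.02

print(f"Standard:     ${standard:,.2f}/month")      # ~ $235
print(f"Standard-IA:  ${standard_ia:,.2f}/month")   # ~ $128
print(f"Glacier:      ${glacier:,.2f}/month")       # ~ $41
print(f"Deep Archive: ${deep_archive:,.2f}/month")  # ~ $10
```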

S3 intelligent tiering based on percentage

Is there a way to maintain the "last" 20% of the objects stored in a specific bucket in S3 Standard and the rest in Standard-IA?
I might be wrong, but it looks like Intelligent-Tiering only allows me to auto-transition objects based on the last time they were accessed.
Side note - aws documentation is hell on earth.
Intelligent-Tiering transitions objects based on last access time only. The idea is that the files you access more frequently stay in the Standard tier, so you aren't charged the retrieval fee (as you are for the Infrequent Access tier).
So, if you want to do it based on creation time, you would have to go the standard way, by creating lifecycle rules.
There's no built-in way to do this based on the number of files in any scenario, only based on time that has passed.
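If the "last 20%" behaviour is genuinely needed, the only option I can see is a custom script that you run yourself. A rough boto3 sketch, where the bucket name is made up (and note that every re-copy is itself a billable request):

```python
import boto3

s3 = boto3.client("s3")
bucket = "my-example-bucket"  # hypothetical bucket name

# Collect every object together with its LastModified timestamp.
objects = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
    objects.extend(page.get("Contents", []))

# Keep the newest 20% in Standard; move the remaining 80% to Standard-IA.
objects.sort(key=lambda o: o["LastModified"], reverse=True)
cutoff = len(objects) // 5
for obj in objects[cutoff:]:
    if obj["StorageClass"] != "STANDARD_IA" and obj["Size"] >= 128 * 1024:
        s3.copy_object(
            Bucket=bucket,
            Key=obj["Key"],
            CopySource={"Bucket": bucket, "Key": obj["Key"]},
            StorageClass="STANDARD_IA",
            MetadataDirective="COPY",  # copy in place, changing only the storage class
        )
```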

AWS S3 batch operation - Got Dinged Pretty Hard

We used the newly introduced S3 Batch Operations feature to back up our S3 bucket, which had about 15 TB of data, to S3 Glacier. Prior to backing up, we had estimated the bandwidth and storage costs and also taken into account the mandatory 90-day storage requirement for Glacier.
However, the actual costs turned out to be massive compared to our estimate. We somehow overlooked the upload request cost, which runs at $0.05 per 1,000 requests. We have many millions of files, each file upload counted as a request, and we are looking at several thousand dollars' worth of spend :(
I am wondering if there was any way to avoid this?
The concept of "backup" is quite interesting.
Traditionally, where data was stored on one disk, a backup was imperative because it's not good to have a single point-of-failure.
Amazon S3, however, stores data on multiple devices across multiple Availability Zones (effectively multiple data centers), which is how they get their 99.999999999% durability and 99.99% availability. (Note that durability means the likelihood of retaining the data, which isn't quite the same as availability which means the ability to access the data. I guess the difference is that during a power outage, the data might not be accessible, but it hasn't been lost.)
Therefore, the traditional concept of taking a backup in case of device failure has already been handled in S3, all for the standard cost. (There is an older Reduced Redundancy option that only copied to 2 AZs instead of 3, but that is no longer recommended.)
Next comes the concept of backup in case of accidental deletion of objects. When an object is deleted in S3, it is not recoverable. However, enabling versioning on a bucket will retain multiple versions including deleted objects. This is great where previous histories of objects need to be kept, or where deletions might need to be undone. The downside is that storage costs include all versions that are retained.
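Turning versioning on is a single call per bucket. A minimal boto3 sketch (the bucket name is illustrative):

```python
import boto3

s3 = boto3.client("s3")

# Keep every version of every object, including "deleted" ones, in this bucket.
s3.put_bucket_versioning(
    Bucket="my-example-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)
```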
There are also the newer Object Lock capabilities in S3, where objects can be locked for a period of time (eg 3 years) without the ability to delete them. This is ideal for situations where information must be retained for a period and it avoids accidental deletion. (There is also a legal hold capability that works the same way, but can be turned on/off if you have appropriate permissions.)
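A sketch of the Object Lock idea: the bucket has to be created with Object Lock enabled, and the names and the 3-year retention period below are only examples:

```python
import boto3

s3 = boto3.client("s3")

# Object Lock can only be enabled at bucket creation time.
s3.create_bucket(Bucket="my-locked-bucket", ObjectLockEnabledForBucket=True)

# Default retention: object versions cannot be deleted for 3 years.
s3.put_object_lock_configuration(
    Bucket="my-locked-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 3}},
    },
)

# A legal hold is the on/off variant mentioned above.
s3.put_object_legal_hold(
    Bucket="my-locked-bucket",
    Key="contracts/2021/agreement.pdf",
    LegalHold={"Status": "ON"},
)
```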
Finally, there is the potential for deliberate malicious deletion if an angry staff member decides to take revenge on your company for not stocking their favourite flavour of coffee. If an AWS user has the necessary permissions, they can delete the data from S3. To guard against this, you should limit who has such permissions and possibly combine it with versioning (so they can delete the current version of an object, but it is actually retained by the system).
This can also be addressed by using Cross-Region Replication of Amazon S3 buckets. Some organizations use this to copy data to a bucket owned by a different AWS account, such that nobody has the ability to delete data from both accounts. This is closer to the concept of a true backup because the copy is kept separate (account-wise) from the original. The extra cost of storage is minimal compared to the potential costs if the data was lost. Plus, if you configure the replica bucket to use the Glacier Deep Archive storage class, the costs can be quite low.
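A sketch of that replication setup, assuming versioning is already enabled on both buckets and that the role ARN, account ID and bucket names are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Replicate everything to a bucket in a second account, landing it in Deep Archive.
s3.put_bucket_replication(
    Bucket="my-data-lake",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
        "Rules": [
            {
                "ID": "backup-to-second-account",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::backup-bucket-in-account-b",
                    "Account": "444455556666",
                    "AccessControlTranslation": {"Owner": "Destination"},
                    "StorageClass": "DEEP_ARCHIVE",
                },
            }
        ],
    },
)
```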
Your copy to Glacier is another form of backup (and offers cheaper storage than S3 in the long term), but it would need to be updated on a regular basis to be a continuous backup (eg by using backup software that understands S3 and Glacier). The "5c per 1000 requests" cost means that it is better used for archives (eg large zip files) rather than many small files.
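A quick illustration of why the request charge favours fewer, larger archives. The object counts here are made-up examples, not your real numbers:

```python
# PUT/COPY/transition requests cost roughly $0.05 per 1,000 requests.
cost_per_1000_requests = 0.05

small_files = 20_000_000  # e.g. millions of small objects copied one by one
archives    = 20_000      # the same data bundled into larger zip archives

print(f"Per-object copy: ${small_files / 1000 * cost_per_1000_requests:,.2f}")  # $1,000.00
print(f"Zipped archives: ${archives / 1000 * cost_per_1000_requests:,.2f}")     # $1.00
```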
Bottom line: Your need for a backup might be as simple as turning on Versioning and limiting which users can totally delete an object (including all past versions) from the bucket. Or, create a bucket replica and store it in Glacier Deep Archive storage class.