I'm trying to find out whether storing objects with randomized keys and no "prefix" will give me the S3 maximum performance of 5,500 GET requests per second per object, or whether, since there is no prefix, all those objects fall into a "no-prefix" category and share the 5,500 limit.
Example: The following objects are stored directly in a bucket
njfoia74G.obj
njfoia74G.obj
njfoia74G.obj
Will I get 5,500 GET/sec for each object, or do they share that?
The S3 documentation suggests that keys are not part of the prefix, so I'm not sure how to calculate throughput for those objects.
https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingMetadata.html#object-keys
Has anyone done a benchmark or have documentation that can answer this?
From Request Rate and Performance Guidelines - Amazon Simple Storage Service:
Your application can achieve at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix in a bucket.
The root of a bucket is effectively an empty prefix, so all objects in the root would share the limit.
By the way, very few systems would approach anywhere near these volumes. If you have millions of users (causing over 10 million requests per hour, roughly 2,800 per second), then definitely implement some of the recommended techniques. But the vast majority of sites will never need to worry about it.
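To make the reasoning above concrete, here is a minimal sketch. It assumes the prefix is everything up to and including the last "/" in the key; that delimiter convention, the helper names and the second sample key are my own assumptions for illustration, not an S3 API. Root-level keys all map to the empty prefix, and the last helper converts the hourly figure into a per-second rate for comparison with the 5,500 GET limit.

```python
from collections import defaultdict


def prefix_of(key, delimiter="/"):
    """Return the part of the key up to and including the last delimiter.

    Keys stored directly in the bucket root have no delimiter, so they
    all map to the empty prefix "".
    """
    head, sep, _ = key.rpartition(delimiter)
    return head + sep


def group_by_prefix(keys):
    """Group keys by prefix; each group shares one per-prefix request budget."""
    groups = defaultdict(list)
    for key in keys:
        groups[prefix_of(key)].append(key)
    return dict(groups)


def per_second(requests_per_hour):
    """Convert an hourly request volume into requests per second."""
    return requests_per_hour / 3600


# "a8Kd02mQ.obj" and "logs/2020/app.log" are made-up keys for illustration.
keys = ["njfoia74G.obj", "a8Kd02mQ.obj", "logs/2020/app.log"]
groups = group_by_prefix(keys)
```

Under this assumption both root-level objects end up in the same group, which is why they would share the limit rather than each getting their own.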
Related
I've recently deleted around 90 million objects (around 100TB of data) from a "Nearline" GCS bucket, and now that I have an almost-empty bucket it takes >5 seconds to list the single remaining file. Standard buckets of ours that have only a dozen files take ~1s to list.
This occurs consistently with both gsutil and Go-based tooling that we've written. It has been tested from multiple VMs of various sizes within GCP, in the same region as the buckets. All buckets are single-region; the only difference is that the slower one is Nearline and the others are Standard. Is it really possible that simply listing the files in a bucket takes more than 5 seconds on Nearline?
Since this smells like a garbage collection/vacuum-related slowdown and we've been using it for almost 5 years now I'm inclined to simply delete the bucket and recreate it, but it'd be good to know if anyone has done an accurate characterization of GCP bucket performance with high churn over time.
In the 24h since I posted this, the performance of this bucket has returned to what I'd consider normal. gsutil ls takes ~1.2s, and my custom Go code for listing the bucket takes ~0.15s.
By simply waiting and trying again I've answered my own question: yes, the database of bucket keys does seem to have variable (reduced) performance depending on content churn, but it's something that resolves itself relatively quickly.
S3's throughput limits are per-prefix, not per-object:
your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per prefix in a bucket
AWS Docs
It seems to follow that if I place each of my s3 objects in a prefix (by itself), I'll effectively be able to have the above throughput per-object.
However I'm a bit suspicious of this conclusion. If it is true, why would someone consider s3 write sharding which is a bit more complicated?
To clarify, what I'm considering is this: whenever I'm about to save an object in S3 (e.g. foo/bar/baz.txt), I add a folder so that the file has its own prefix (e.g. foo/bar/baz.txt/baz.txt). Now I can have 5,500 reads per second on the object baz.txt (without the extra prefix, those 5,500 reads per second would be shared across all objects in foo/bar/).
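For reference, the key mapping described above can be sketched as follows. The function name is hypothetical, and this only shows how the key is rewritten; whether S3 actually grants a separate request budget to each such prefix is exactly the open question here.

```python
import posixpath


def own_prefix_key(key):
    """Nest the object under a "folder" named after the full original key.

    foo/bar/baz.txt -> foo/bar/baz.txt/baz.txt
    """
    name = posixpath.basename(key)  # last path component, e.g. "baz.txt"
    return f"{key}/{name}"
```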
I have a bucket in S3 from which I want to delete all objects with a particular extension.
The easiest solution is to list all keys, check whether each one ends with the extension, and delete it if so, but this solution is very costly. Can anyone suggest an efficient way to achieve this?
Look at S3 Inventory report, if you do not need up-to-the minute accuracy.
Alternatively, you might have to create an index of your S3 objects in DynamoDB or elsewhere so that you can easily find objects with a given suffix. Or even consider restructuring your keys so that they begin with the file extension, then you can list a prefix such as csv/ (obviously this might have negative consequences elsewhere in your application so is not necessarily a good solution).
Note that the price of listing objects in S3 Standard is $0.005 per 1,000 requests and each of those requests will return up to 1,000 S3 keys. I'm not sure how many keys you would be listing but that's $0.005 per million objects.
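A rough sketch of the arithmetic above, plus the plain "list and filter by suffix" approach from the question. The prices are the figures quoted in this answer and may have changed; the function names are my own. The batch size of 1,000 reflects the per-request limit of S3's multi-object delete.

```python
import math
from itertools import islice

LIST_PRICE_PER_1000_REQUESTS = 0.005  # USD, figure quoted above (assumption)
KEYS_PER_LIST_REQUEST = 1000          # a list request returns up to 1,000 keys
MAX_KEYS_PER_DELETE = 1000            # limit of one multi-object delete request


def listing_cost(num_keys):
    """Estimated USD cost of listing num_keys keys once."""
    requests = math.ceil(num_keys / KEYS_PER_LIST_REQUEST)
    return requests / 1000 * LIST_PRICE_PER_1000_REQUESTS


def delete_batches(keys, suffix, batch_size=MAX_KEYS_PER_DELETE):
    """Yield lists of keys ending with suffix, at most batch_size per list."""
    matching = (key for key in keys if key.endswith(suffix))
    while True:
        batch = list(islice(matching, batch_size))
        if not batch:
            return
        yield batch
```

With boto3, the keys could come from the list_objects_v2 paginator and each batch could be passed to delete_objects; that part is left out here so the sketch runs without AWS credentials.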
I have 2 million zipped HTML files (100-150KB) being added each day that I need to store for a long time.
Hot data (70-150 million) is accessed semi regularly, anything older than that is barely ever accessed.
This means each day I'm storing an additional 200-300GB worth of files.
Now, Standard storage costs $0.023 per GB and $0.004 for Glacier.
While Glacier is cheap, the problem with it is that it has additional costs, so it would be a bad idea to dump 2 million files into Glacier:
PUT requests to Glacier $0.05 per 1,000 requests
Lifecycle Transition Requests into Glacier $0.05 per 1,000 requests
Is there a way of gluing the files together, but keeping them accessible individually?
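To put numbers on this, a quick back-of-the-envelope calculation using the figures quoted above (prices as stated in the question, decimal units, helper names my own):

```python
FILES_PER_DAY = 2_000_000
PRICE_PER_1000_REQUESTS = 0.05  # USD, PUT or lifecycle transition into Glacier


def daily_volume_gb(files, size_kb):
    """Daily upload volume in GB (decimal units: 1 GB = 1,000,000 KB)."""
    return files * size_kb / 1_000_000


def request_cost(num_requests, price_per_1000=PRICE_PER_1000_REQUESTS):
    """USD cost of num_requests requests at the given price per 1,000."""
    return num_requests / 1000 * price_per_1000


low_gb = daily_volume_gb(FILES_PER_DAY, 100)   # 200 GB per day
high_gb = daily_volume_gb(FILES_PER_DAY, 150)  # 300 GB per day
per_file_cost = request_cost(FILES_PER_DAY)    # about $100 per day in requests
```

So moving every file individually would cost roughly $100 per day in request fees alone, which is why gluing the files together matters.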
An important point: if you need quick access to these files, be aware that Glacier can take up to 12 hours to give you access to a file. So the best you can do is to use S3 Standard-Infrequent Access (0.0125 USD per GB, with millisecond access) instead of S3 Standard, and perhaps Glacier for data that is really never used. But it still depends on how fast you need that data.
With that in mind, I'd suggest the following:
as HTML (text) files compress well, you can pack historical data into big zip files (daily, weekly or monthly), since together they can compress even better;
make some index file or database to know where each HTML file is stored;
read only the desired HTML files from the archives without unpacking the whole zip file. See the example in Python of how to implement that.
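The steps above can be sketched with Python's standard zipfile module. This is a minimal illustration that keeps the archive in memory; the file names, the archive name and the index layout are made up. With an archive stored on S3 you would additionally need ranged reads (or a download) to get at the bytes.

```python
import io
import zipfile


def build_archive(files):
    """Pack a {name: bytes} mapping into one compressed zip, returned as bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_member(archive, name):
    """Extract a single member; the other members are not decompressed."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read(name)


# Hypothetical sample data and index: which archive holds which HTML file.
pages = {"page-1.html": b"<html>one</html>", "page-2.html": b"<html>two</html>"}
archive = build_archive(pages)
index = {name: "2020-01-01.zip" for name in pages}
```

zipfile locates the member through the central directory at the end of the archive, so only the requested entry is read and decompressed.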
Glacier's cost is extremely sensitive to the number of files. The best method would be to create a Lambda function that handles the zip and unzip operations for you.
Consider this approach:
Lambda creates archive_date_hour.zip files from that day's 2 million files, grouped by hour. This solves the "per object" cost problem by creating 24 giant archive files.
Set a lifecycle policy on the S3 bucket to move objects over 1 day old to Glacier.
Use an unzipping Lambda function to fetch and extract potential hot items from the glacier bucket from within the zip files.
Keep the main s3 bucket for hot files with high frequent access, as a working directory for the zip/unzip operations, and for collecting new files daily
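As a rough sketch of the hourly grouping step only (no Lambda or S3 calls), assuming each object comes with an upload timestamp. The exact date format in the archive name and the function names are my own choices, following the archive_date_hour.zip pattern mentioned above.

```python
from collections import defaultdict
from datetime import datetime, timezone


def archive_name(ts):
    """Name of the hourly archive a timestamp falls into."""
    return ts.strftime("archive_%Y-%m-%d_%H.zip")


def group_by_hour(objects):
    """Map archive name -> keys, given (key, upload_time) pairs."""
    groups = defaultdict(list)
    for key, ts in objects:
        groups[archive_name(ts)].append(key)
    return dict(groups)


# Made-up sample objects.
sample = [
    ("a.html", datetime(2020, 1, 1, 13, 5, tzinfo=timezone.utc)),
    ("b.html", datetime(2020, 1, 1, 13, 59, tzinfo=timezone.utc)),
    ("c.html", datetime(2020, 1, 1, 14, 0, tzinfo=timezone.utc)),
]
hourly = group_by_hour(sample)
```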
Your files are just too small. You will probably need to combine them, for example in an ETL pipeline such as Glue. You can also use the Range header (e.g. --range bytes=1000-2000 in the AWS CLI) to download part of an object on S3.
If you do that, you'll need to figure out the best way to track the byte ranges, such as recording the range for each file after combining them, and changing the clients to use the ranges as well.
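A minimal sketch of the range bookkeeping, done on in-memory bytes so it runs without S3. The function names and sample files are hypothetical. Ranges are stored as inclusive (start, end) pairs because that is how the HTTP Range header counts bytes.

```python
def pack(files):
    """Concatenate a {name: bytes} mapping into one blob.

    Returns (blob, index) where index[name] = (start, end), both inclusive.
    Assumes every file is non-empty.
    """
    combined = bytearray()
    index = {}
    for name, data in files.items():
        start = len(combined)
        combined.extend(data)
        index[name] = (start, len(combined) - 1)
    return bytes(combined), index


def range_header(start, end):
    """Value for an HTTP Range header covering bytes start..end inclusive."""
    return f"bytes={start}-{end}"


def read_range(blob, start, end):
    """Local stand-in for a ranged GET against the combined object."""
    return blob[start:end + 1]


blob, index = pack({"a.html": b"AAAA", "b.html": b"BBBBBB"})
```

A client would look up the file in the index, build the header value with range_header, and send it with its GET request for the combined object.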
The right approach though depends on how this data is accessed and figuring out the patterns. If somebody who looks at TinyFileA also looks at TinyFileB you could combine them together and just send them both along with other files they are likely to use. I would be figuring out logical groupings of files which make sense to consumers and will reduce the number of requests they need, without sending too much irrelevant data.
S3 FAQ mentions that "Amazon S3 buckets in all Regions provide read-after-write consistency for PUTS of new objects and eventual consistency for overwrite PUTS and DELETES." However, I don't know how long it takes to get eventual consistency. I tried to search for this but couldn't find an answer in S3 documentation.
Situation:
We have a website that consists of 7 steps. When the user clicks save in each step, we want to save a JSON document (containing the information from all 7 steps) to Amazon S3. Currently we plan to:
Create a single S3 bucket to store all json documents.
When user saves step 1 we create a new item in S3.
When user saves steps 2-7 we overwrite the existing item.
After the user saves a step and refreshes the page, they should be able to see the information they just saved, i.e. we want to make sure that we always read after write.
The full json document (all 7 steps completed) is around 20 KB.
After users click the save button, we can freeze the page for some time so that they cannot make other changes until the save is finished.
Question:
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
Is there a function to calculate save/load time based on item size?
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
I wanted to add to @error2007's answer.
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
It's not only that you will not find the exact time anywhere; there's actually no such thing as an exact time. That's just what "eventual consistency" is all about: consistency will be achieved eventually. You can't know when.
If somebody gave you an upper bound for how long a system would take to achieve consistency, then you wouldn't call it "eventually consistent" anymore. It would be "consistent within X amount of time".
The problem now becomes, "How do I deal with eventual consistency?" (instead of trying to "beat it")
To really find the answer to that question, you need to first understand what kind of consistency you truly need, and how exactly the eventual consistency of S3 could affect your workflow.
Based on your description, I understand that you would write a total of 7 times to S3, once for each step you have. For the first write, as you correctly cited the FAQs, you get strong consistency for any reads after that. For all the subsequent writes (which are really "replacing" the original object), you might observe eventual consistency - that is, if you try to read the overwritten object, you might get the most recent version, or you might get an older version. This is what is referred to as "eventual consistency" on S3 in this scenario.
A few alternatives for you to consider:
don't write to S3 on every single step; instead, keep the data for each step on the client side, and then only write 1 single object to S3 after the 7th step. This way, there's only 1 write, no "overwrites", so no "eventual consistency". This might or might not be possible for your specific scenario, you need to evaluate that.
alternatively, write to S3 objects with different names for each step. E.g., something like: after step 1, save that to bruno-preferences-step-1.json; then, after step 2, save the results to bruno-preferences-step-2.json; and so on, then save the final preferences file to bruno-preferences.json, or maybe even bruno-preferences-step-7.json, giving yourself the flexibility to add more steps in the future. Note that the idea here is to avoid overwrites, which could cause eventual consistency issues. Using this approach, you only write new objects, you never overwrite them.
finally, you might want to consider Amazon DynamoDB. It's a NoSQL database, you can securely connect to it directly from the browser or from your server. It provides you with replication, automatic scaling, load distribution (just like S3). And you also have the option to tell DynamoDB that you want to perform strongly consistent reads (the default is eventually consistent reads; you have to change a parameter to get strongly consistent reads). DynamoDB is typically used for "small" records, 20kB is definitely within the range -- the maximum size of a record would be 400kB as of today. You might want to check this out: DynamoDB FAQs: What is the consistency model of Amazon DynamoDB?
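The second alternative (one new object per step, never an overwrite) could look roughly like this. The helper names and the regular expression are assumptions made for illustration; only the bruno-preferences-step-N.json naming comes from the answer above.

```python
import re

# Matches keys such as "bruno-preferences-step-3.json".
STEP_KEY = re.compile(r"^(?P<user>.+)-preferences-step-(?P<step>\d+)\.json$")


def key_for_step(user, step):
    """Key for the document saved after a given step; each step gets a new key."""
    return f"{user}-preferences-step-{step}.json"


def latest_step_key(keys, user):
    """Return the key of the highest step saved for user, or None if there is none."""
    best = None
    for key in keys:
        match = STEP_KEY.match(key)
        if match and match.group("user") == user:
            step = int(match.group("step"))
            if best is None or step > best[0]:
                best = (step, key)
    return best[1] if best else None
```

Steps are compared as integers, so step 10 sorts after step 2 if more steps are added later.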
How long does it take for AWS S3 to save and load an item? (We can freeze our website when document is being saved to S3)
You will not find the exact time anywhere. If you ask AWS, they will give you approximate timings. Your file is 20 KB, so in my experience with S3 the time will be more or less 60-90 sec.
Is there a function to calculate save/load time based on item size?
No, there is no function you can use to calculate this.
Is the save/load time gonna be different if I choose another S3 region? If so which is the best region for Seattle?
For Seattle, US West (Oregon) will work with no problem.
You can also take a look at this experiment for comparison https://github.com/andrewgaul/are-we-consistent-yet
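Since there is no formula for the save/load time, one option is to measure it yourself from where your servers run. Below is a minimal timing harness; it is a sketch under the assumption that you pass in your own save or load callable (for example a function wrapping your S3 upload). The fake_save function is only a stand-in so the example runs without AWS.

```python
import statistics
import time


def measure(operation, repeats=5):
    """Call operation repeatedly and summarize wall-clock durations in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def fake_save():
    """Stand-in for a real upload; replace with your own S3 call."""
    time.sleep(0.01)


timings = measure(fake_save, repeats=3)
```

Running the same measurement against buckets in different regions would also give a direct answer to the region question for your location.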