S3 - What Exactly Is A Prefix? And What Rate Limits Apply?

I was wondering if anyone knew what exactly an S3 prefix is and how it interacts with Amazon's published S3 rate limits:
Amazon S3 automatically scales to high request rates. For example,
your application can achieve at least 3,500 PUT/POST/DELETE and 5,500
GET requests per second per prefix in a bucket. There are no limits to
the number of prefixes in a bucket.
While that sounds clear enough, I'm not quite certain what a prefix actually is.
Does a prefix require a delimiter?
If we have a bucket where we store all files at the "root" level (completely flat, without any prefixes/delimiters), does that count as a single "prefix", and is it subject to the rate limits posted above?
The way I'm interpreting Amazon's documentation suggests to me that this IS the case, and that the flat structure would be considered a single "prefix" (i.e., it would be subject to the published rate limits above).
Suppose that your bucket (admin-created) has four objects with the
following object keys:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
The s3-dg.pdf key does not have a prefix, so its object appears
directly at the root level of the bucket. If you open the Development/
folder, you see the Projects.xlsx object in it.
In the above example, would s3-dg.pdf be subject to a different rate limit (5,500 GET requests/second) than each of the other prefixes (Development/, Finance/, Private/)?
What's more confusing is that I've read a couple of blog posts about Amazon using the first N bytes of the key as a partition key and encouraging the use of high-cardinality prefixes; I'm just not sure how that interacts with a bucket that has a "flat" file structure.

You're right, the announcement seems to contradict itself. It's just not written properly, but the information is correct. In short:
Each prefix can achieve up to 3,500/5,500 requests per second, so for many purposes the assumption is that you won't need to use several prefixes at all.
Prefixes are considered to be the whole path (up to the last '/') of an object's location, and are no longer hashed only by the first 6-8 characters. Therefore it is enough to just split the data between any two "folders" to double the maximum requests per second (assuming requests are divided evenly between the two).
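As a rough illustration of that second point (the bucket name, folder names, and hashing scheme below are my own assumptions, not anything AWS prescribes), spreading writes across two top-level "folders" could look like this:

import hashlib
import boto3

s3 = boto3.client("s3")

def put_spread(key: str, body: bytes, bucket: str = "example-bucket") -> str:
    # Pick one of two hypothetical prefixes deterministically from the key
    # so the write load is split roughly evenly between them.
    folder = "a" if hashlib.md5(key.encode()).digest()[0] % 2 == 0 else "b"
    full_key = f"{folder}/{key}"
    s3.put_object(Bucket=bucket, Key=full_key, Body=body)
    return full_key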
For reference, here is a response from AWS support to my clarification request:
Hello Oren,
Thank you for contacting AWS Support.
I understand that you read AWS post on S3 request rate performance
being increased and you have additional questions regarding this
announcement.
Before this upgrade, S3 supported 100 PUT/LIST/DELETE requests per
second and 300 GET requests per second. To achieve higher performance,
a random hash / prefix schema had to be implemented. Since last year
the request rate limits increased to 3,500 PUT/POST/DELETE and 5,500
GET requests per second. This increase is often enough for
applications to mitigate 503 SlowDown errors without having to
randomize prefixes.
However, if the new limits are not sufficient, prefixes would need to
be used. A prefix has no fixed number of characters. It is any string
between a bucket name and an object name, for example:
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Prefixes of the object 'file' would be: /folder1/sub1/ ,
/folder1/sub2/, /1/, /2/. In this example, if you spread reads
across all four prefixes evenly, you can achieve 22,000 requests per
second.
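A minimal sketch of what that example might look like in practice (the bucket name, client setup, and thread pool are my assumptions, not part of the support reply) - spreading GETs across the four prefixes so each contributes its own request-rate budget:

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "example-bucket"  # assumed name
KEYS = ["folder1/sub1/file", "folder1/sub2/file", "1/file", "2/file"]

def fetch(key: str) -> bytes:
    # Each key lives under a different prefix, so the reads draw on
    # four separate per-prefix rate limits.
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

with ThreadPoolExecutor(max_workers=4) as pool:
    bodies = list(pool.map(fetch, KEYS))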

This looks like it is (somewhat obscurely) addressed in an Amazon release announcement:
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
Performance scales per prefix, so you can use as many prefixes as you
need in parallel to achieve the required throughput. There are no
limits to the number of prefixes.
This S3 request rate performance increase removes any previous
guidance to randomize object prefixes to achieve faster performance.
That means you can now use logical or sequential naming patterns in S3
object naming without any performance implications. This improvement
is now available in all AWS Regions. For more information, visit the
Amazon S3 Developer Guide.

S3 prefixes used to be determined by the first 6-8 characters;
This changed in mid-2018 - see the announcement:
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
But that is only half the truth. In practice, prefixes (in the old definition) still matter.
S3 is not traditional “storage” - each directory/filename is a separate object in a key/value object store, and the data has to be partitioned/sharded to scale to a huge number of objects. So yes, the new sharding is kind of “automatic”, but not really if you create a new process that writes to different subdirectories with heavy parallelism. Before S3 learns the new access pattern, you may run into throttling until it reshards/repartitions the data accordingly.
Learning new access patterns takes time. Repartitioning the data takes time.
Things did improve in mid-2018 (~10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if the data were partitioned properly. Although, to be fair, this may not apply to you if you don't have a ton of data, or if your access pattern is not hugely parallel (e.g. running a Hadoop/Spark cluster on many TBs of data in S3 with hundreds of tasks accessing the same bucket in parallel).
TLDR:
"Old prefixes" still do matter.
Write data to the root of your bucket, and the first-level directory there will determine the "prefix" (make it random, for example - see the sketch below).
"New prefixes" do work, but not initially. It takes time for S3 to adapt to the load.
PS. Another approach: you can reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to flood it soon.
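A minimal sketch of that "random first-level directory" idea (the bucket name and shard length are assumptions, not a recommendation from AWS):

import secrets
import boto3

s3 = boto3.client("s3")

def put_with_random_prefix(key: str, body: bytes, bucket: str = "example-bucket") -> str:
    # A 4-hex-char shard gives 65,536 possible first-level "directories",
    # i.e. 65,536 old-style prefixes to spread the write load over.
    shard = secrets.token_hex(2)   # e.g. "9f3a"
    full_key = f"{shard}/{key}"
    s3.put_object(Bucket=bucket, Key=full_key, Body=body)
    return full_key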

In order for AWS to handle billions of requests per second, they need to shard the data so they can optimise throughput. To do this, they split the data into partitions based on the first 6 to 8 characters of the object key. Remember that S3 is not a hierarchical filesystem; it is only a key-value store, though the key is often used like a file path for organising data (prefix + filename).
Now this is not an issue if you expect fewer than 100 requests per second, but if you have serious requirements beyond that, then you need to think about naming.
For maximum parallel throughput, you should consider how your data is consumed and use the most varying characters at the beginning of your key, or even generate 8 random characters for the first 8 characters of the key.
e.g. assuming first 6 characters define the partition:
files/user/bob would be bad, as all the objects would be on the one partition files/.
2018-09-21/files/bob would be almost as bad if only today's data is being read from partition 2018-0, but slightly better if objects from past years are also being read.
bob/users/files would be pretty good if different users are likely to be using the data at the same time from partition bob/us, but not so good if Bob is by far the busiest user.
3B6EA902/files/users/bob, where the first part is a random string, would be best for performance but more challenging to reference; the load would be spread pretty evenly.
Depending on your data, you need to think of any one point in time, who is reading what, and make sure that the keys start with enough variation to partition appropriately.
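A quick sketch of the last option above - eight random leading characters, with the first six shown as the hypothetical partition key (the helper name is mine):

import secrets

def randomized_key(original_key: str) -> str:
    # Eight random hex characters at the front of the key.
    return f"{secrets.token_hex(4)}/{original_key}"

key = randomized_key("files/users/bob")
print(key, "-> hypothetical partition:", key[:6])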
For your example, let's assume the partition is taken from the first 6 characters of the key:
for the key Development/Projects1.xls the partition key would be Develo
for the key Finance/statement1.pdf the partition key would be Financ
for the key Private/taxdocument.pdf the partition key would be Privat
for the key s3-dg.pdf the partition key would be s3-dg.

The upvoted answer on this was a bit misleading for me.
If these are the paths
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Your prefix for file would actually be
folder1/sub1/
folder1/sub2/
1/
2/
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
Please see the docs. I had issues with the leading '/' when trying to list keys with the Airflow S3Hook.
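For example, with boto3 (bucket name assumed), the same listing only matches when the prefix has no leading slash:

import boto3

s3 = boto3.client("s3")

# Matches objects such as folder1/sub1/file
ok = s3.list_objects_v2(Bucket="example-bucket", Prefix="folder1/sub1/")

# Returns no keys, because object keys do not start with '/'
empty = s3.list_objects_v2(Bucket="example-bucket", Prefix="/folder1/sub1/")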

In case you query S3 using Athena, EMR/Hive, or Redshift Spectrum, increasing the number of prefixes can mean adding more partitions (as the partition ID is part of the prefix). If you use a datetime as (one of) your partition keys, the number of partitions (and prefixes) will automatically grow as new data is added over time, and the total maximum S3 GETs per second grows as well.
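A small illustration of that idea (the bucket, table path, and Hive-style dt= layout are assumptions): each new day adds a new partition, and therefore a new prefix with its own GET budget.

from datetime import date, timedelta

start = date(2023, 5, 1)  # hypothetical start date
for i in range(3):
    d = start + timedelta(days=i)
    print(f"s3://example-bucket/events/dt={d.isoformat()}/part-0000.parquet")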

S3 - What Exactly Is A Prefix?
AWS recently updated the S3 documentation to better reflect this.
"A prefix is a string of characters at the beginning of the object key name. A prefix can be any length, subject to the maximum length of the object key name (1,024 bytes). "
From - https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
Note: "You can use another character as a delimiter. There is nothing unique about the slash (/) character, but it is a very common prefix delimiter."
As long as two objects have different prefixes, S3 will provide the documented throughput over time.
Update: https://docs.aws.amazon.com/general/latest/gr/glos-chap.html#keyprefix also reflects the updated definition.

Related

Can S3 ListObjectsV2 return the keys sorted newest to oldest?

I have AWS S3 buckets with hundreds of top-level prefixes (folders). Each prefix contains somewhere between five thousand and a few million files - most growing at a rate of 10-100k per year. 99% of the time, all I care about are the newest 1-2,000 or so in each folder...
Using ListObjectsV2 returns me 1,000 files, and that is the max (setting "MaxKeys" to a higher value still truncates the list at 1,000). This would be reasonably fine; however (per the documentation), it returns the file list in ascending alphabetical order (which, given that my keys/filenames have the date in them, effectively results in an oldest->newest sort)... which is considerably less useful than if it returned the NEWEST files (or reverse-alphabetical order).
One option is to use continuation tokens, allowing me to pull the entire prefix and then use the tail end of the whole array of keys as needed... but that would be (most importantly) slow for large 'folders'. A prefix with 2 million files would require 2,000 separate API calls just to get the newest few hundred filenames (not to mention the costs incurred by pulling the entire listing even though I'm only really interested in the newest 1-2,000 files).
Is there a way to have the ListObjectsV2 call (or any other S3 call) give me the list of the newest (or reverse-alphabetical) files? New files come in every few minutes, and the most important file is THE most recent file, so doing an S3 Inventory doesn't seem like it would do the trick.
(or, perhaps, a call that gives me filenames in a created-by date range...?)
Using JavaScript - but I'm sure every language has more-or-less the same features when it comes to listing objects from an S3 bucket.
Edit: weird idea: if AWS doesn't offer a 'sort' option on a basic API call for one of its most popular services... would it make sense to record all the filenames/keys in a DynamoDB table and query that instead?
No. ListObjectsV2() will always return up to 1,000 objects, alphabetically, within the requested Prefix.
You could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects.
If you need real-time or fairly fast access to a list of all available objects, your other option would be to trigger an AWS Lambda function whenever objects are created/deleted. The Lambda function would store/update information in a database (eg DynamoDB) that can provide very fast access to the list of objects. You would need to code this solution.
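A minimal sketch of that Lambda approach (the table name, key schema, and attribute names are assumptions): on every ObjectCreated event, record the key in DynamoDB with the creation time as the sort key, then query with ScanIndexForward=False to get the newest objects first.

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("s3-object-index")   # hypothetical table

def handler(event, context):
    # Triggered by S3 ObjectCreated events.
    for record in event["Records"]:
        obj = record["s3"]["object"]
        table.put_item(Item={
            "prefix": record["s3"]["bucket"]["name"],  # partition key (assumed schema)
            "created_at": record["eventTime"],         # sort key -> newest first on query
            "key": obj["key"],
            "size": obj.get("size", 0),
        })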

Is it possible to increase the S3 limits for read and write?

Is it possible to increase the S3 limits for read and write per second? I only found the current values in the documentation, but no indication that this limit can be increased. Does anyone know?
A good way to work within those limits is to take advantage of partitions. According to the documentation, those limits are applied per prefix inside your bucket; thus, the way you store your objects affects the maximum performance. As an example, suppose you use the bucket to store log files. One way you could store them is by putting everything in the root path, e.g.:
2022_02_11_log_a.txt
2022_02_11_log_b.txt
2022_02_11_log_c.txt
2022_02_11_log_d.txt
2022_02_12_log_a.txt
2022_02_12_log_b.txt
2022_02_12_log_c.txt
2022_02_12_log_d.txt
To S3, those objects all live under the same prefix; thus, together they get only the maximum throughput defined in the documentation. To improve on that, you could change the paths to the following:
2022_02_11/log_a.txt
2022_02_11/log_b.txt
2022_02_11/log_c.txt
2022_02_11/log_d.txt
2022_02_12/log_a.txt
2022_02_12/log_b.txt
2022_02_12/log_c.txt
2022_02_12/log_d.txt
Now you have two partitions - 2022_02_11 and 2022_02_12 - each with its own throughput limits.
You should check the access pattern of your files and define partitions that take advantage of it. If your access pattern is random, you could try using some hash pattern as part of your objects' paths.
See also the official documentation about object key naming.
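For completeness, a rough sketch of writing each day's logs under a date prefix (the bucket name and helper are just examples):

from datetime import date
import boto3

s3 = boto3.client("s3")

def upload_log(name: str, body: bytes, bucket: str = "example-bucket") -> str:
    # e.g. 2022_02_12/log_a.txt - one prefix (partition) per day
    key = f"{date.today().isoformat().replace('-', '_')}/{name}"
    s3.put_object(Bucket=bucket, Key=key, Body=body)
    return key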

AWS S3 - SlowDown: Please reduce your request rate

There are enough similar questions and answers on SO; however, little is said about prefixes.
First, randomization of prefixes is not needed anymore - see here:
This S3 request rate performance increase removes any previous
guidance to randomize object prefixes to achieve faster performance.
That means you can now use logical or sequential naming patterns in S3
object naming without any performance implications.
Now back to my problem. I still get "SlowDown" and I don't get why.
All my objects are distributed as follows:
/foo/bar/baz/node_1/folder1/file1.bin
/foo/bar/baz/node_1/folder1/file2.bin
/foo/bar/baz/node_1/folder2/file1.bin
/foo/bar/baz/node_2/folder1/file1.bin
/foo/bar/baz/node_2/folder1/file2.bin
Each node has its own prefix, followed by a "folder" name and then a "file" name. There are about 40 "files" in each "folder". Let's say I have ~20 nodes, about 200 "folders" under each node, and 40 "files" under each folder. In this case the prefix consists of the common part "/foo/bar/baz", the node, and the folder, so even if I upload all 40 files in parallel the pressure on a single prefix is 40, right? And even if I upload 40 files to each and every "folder" on all nodes, the pressure is still 40 per prefix. Is that correct? If yes, why do I get "SlowDown"? If not, how am I supposed to handle it? A custom RetryStrategy? Why doesn't the DefaultRetryStrategy, which employs exponential backoff, solve this problem?
EDIT001:
Here is the explanation of what a prefix means.
OK, after a month with the AWS support team, with assistance from the S3 engineering team, the short answer is: randomize prefixes the old-fashioned way.
The long answer: they indeed improved the performance of S3 as stated in the link in the original question; however, you can still bring S3 to its knees. The point is that internally they partition all objects stored in a bucket; the partitioning works on the bucket's prefixes and organizes them in lexicographical order, so no matter what, when you put a lot of files into different "folders", it still puts pressure on the outer part of the prefix, and then S3 tries to partition that outer part - and that is the moment you get "SlowDown". Well, you can back off exponentially with retries, but in my case even a 5-minute backoff didn't do the trick, so the last resort is to prepend the prefix with a random token that is, ideally, distributed evenly. That's it.
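A sketch of that last-resort randomization (the token length and key layout here are my own choices, not something AWS specified):

import secrets

def randomized(node: str, folder: str, filename: str) -> str:
    # An evenly distributed random token in front of the shared outer prefix.
    token = secrets.token_hex(2)   # e.g. "7c1f"
    return f"{token}/foo/bar/baz/{node}/{folder}/{filename}"

print(randomized("node_1", "folder1", "file1.bin"))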
In less aggressive cases, the S3 engineering team can check your usage and manually partition your bucket (this is done at the bucket level). It didn't work in our case.
And no, no amount of money can buy more requests per prefix, since, I guess, there is no entity that could pay Amazon to rewrite the S3 backend.
2020 UPDATE: Well, after implementing randomization of S3 prefixes I can say just one thing: if you try hard enough, no randomization will help. We are still getting SlowDown, but not as frequently as before. There is no other means of solving this problem except rescheduling the failed operations for later execution.
YET ANOTHER 2020 UPDATE: Hehe, the number of LIST requests you are making to your bucket "prevents us from partitioning the bucket properly". LOL

Does pseudorandom substring need to be at beginning of key to benefit from S3 partitioning

Based on this resource, adding a pseudo-random prefix to an S3 key will increase your GET performance over having a constant prefix.
So a key of the form:
bucket/$randomPrefix-key.txt
Will perform better in GETs than
bucket/$date-key.txt
It also implies that the common prefix portion doesn't matter. From the article:
You can optionally add more prefixes in your key name, before the hash string, to group objects. The following example adds animations/ and videos/ prefixes to the key names.
examplebucket/animations/232a-2013-26-05-15-00-00/cust1234234/animation1.obj
examplebucket/animations/7b54-2013-26-05-15-00-00/cust3857422/animation2.obj
examplebucket/animations/921c-2013-26-05-15-00-00/cust1248473/animation3.obj
examplebucket/videos/ba65-2013-26-05-15-00-00/cust8474937/video2.mpg
examplebucket/videos/8761-2013-26-05-15-00-00/cust1248473/video3.mpg
examplebucket/videos/2e4f-2013-26-05-15-00-01/cust1248473/video4.mpg
examplebucket/videos/9810-2013-26-05-15-00-01/cust1248473/video5.mpg
examplebucket/videos/7e34-2013-26-05-15-00-01/cust1248473/video6.mpg
examplebucket/videos/c34a-2013-26-05-15-00-01/cust1248473/video7.mpg
...
So a key of the form
bucket/foo/bar/baz/$randomPrefix-key.txt
Will apparently work just as well as the first form above.
My question: what if the pseudorandom prefix is in the middle of the key? Does that work just as well?
For example:
bucket/foo/bar/baz-$pseudoRandomString-key.txt
Your example is no different from the ones in the documentation, for an important reason: slashes (/) have no intrinsic meaning to S3.
There are no folders in S3. foo/bar.txt and foo/baz.jpg are not "in the same folder."
Technically, they are just two objects whose keys have a common prefix.
The console displays them in a folder, only for organizational convenience.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects.
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
Also:
The Amazon S3 data model does not natively support the concept of folders, nor does it provide any APIs for folder-level operations. But the Amazon S3 console supports folders to help you organize your data.
http://docs.aws.amazon.com/AmazonS3/latest/UG/about-using-console.html
Thus the / has no special meaning to the S3 index, and no special meaning relative to the placement of your random prefix.
However, it's important that the characters before the random prefix remain the same, so that partition splits can be accomplished right at the beginning of the random characters.
S3 must be able to split the list of keys beginning with the first random character and find a balance of work to the left of (<) and right of (>=) the split point.
If you have this...
fix/ed/chars/here-then-$random/anything/here
...then S3 says to itself "hmm... it looks like example-bucket/fixed/chars/here-then-* seems to be taking a lot of traffic, but it looks like the next character is always one of 0 1 2 3 4 5 6 7 8 9 a b c d e f and they're pretty well balanced, so I'm going to split it at "8," so that ...then-0* through ...then-7* is in one partition and ...then-8 through ...then-f in another" and #boom, potential performance bottleneck solved.
The partitioning is completely automatic and transparent.
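A hypothetical sketch of generating keys in that shape - the fixed part first, then the random characters right where a split can happen (the helper name and token length are mine):

import secrets

def make_key(suffix: str) -> str:
    # Constant characters up front, eight random hex chars where S3
    # could split the keyspace, then whatever comes after.
    return f"fix/ed/chars/here-then-{secrets.token_hex(4)}/{suffix}"

print(make_key("anything/here"))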
Here's an example of what not to do.
logs/2017-01-23/$random/...
logs/2017-01-24/$random/...
logs/2017-01-25/$random/...
Here, a hot spot develops in a different prefix each day, giving S3 no good options for creating effective partition splits to alleviate any overload. Any split would, at some point, end up to the left of (lexically less than) all future uploads in this case - so it is not an effective split. By contrast, the split above puts about half the workload < and the other half >= a split at a single character.
Also worth noting... if you don't expect a sustained workload of at least 100 req/sec, this isn't going to give you any benefit at all. Natural randomness in your keyspace may also suffice, and S3 reads can scale essentially indefinitely without these optimizations when coupled with CloudFront (and usually faster and often slightly cheaper, since CloudFront bandwidth pricing is slightly lower than S3's in some areas, presumably since it relieves potential congestion on the Internet connections at the S3 regions). When S3 is connected to CloudFront, S3 rates its bandwidth charges at $0.00/GB out to the Internet, and CloudFront bills that piece at its rates instead of S3.

S3 partitioning strategy

I am fully aware of the documentation about how to name an S3 object within a bucket to optimize performance, but I cannot understand the example in this article: https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
2134857/gamedata/start.png
2134857/gamedata/resource.rsrc
2134857/gamedata/results.txt
2134858/gamedata/start.png
2134858/gamedata/resource.rsrc
2134858/gamedata/results.txt
2134859/gamedata/start.png
2134859/gamedata/resource.rsrc
2134859/gamedata/results.txt
The article says "All these reads and writes will basically always go to the same partition", but we should have three partitions - 2134857, 2134858, and 2134859 - right?
If we reverse the IDs:
7584312/gamedata/start.png
7584312/gamedata/resource.rsrc
7584312/gamedata/results.txt
8584312/gamedata/start.png
8584312/gamedata/resource.rsrc
8584312/gamedata/results.txt
9584312/gamedata/start.png
9584312/gamedata/resource.rsrc
9584312/gamedata/results.txt
we also have three partitions: 7584312, 8584312, 9584312.
What is the difference?
What is the definition of a prefix, and what is its relationship to the partitioning strategy?
S3 partitioning does not (always) occur on the full ID; it will usually be some sort of partial match on the ID. It's likely your first example will end up on the same partition, with a partition key of, say, 2134, 21348, or 213485.
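A quick illustration of why reversing helps under that assumption (the 5-character slice is arbitrary): sequential IDs share their leading characters, while reversed IDs differ immediately, so they can land on different partitions.

ids = ["2134857", "2134858", "2134859"]
for i in ids:
    rev = i[::-1]
    print(f"{i} -> leading slice {i[:5]}   reversed {rev} -> leading slice {rev[:5]}")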
More info from the blog post you linked to:
As we said, S3 has automation that continually looks for areas of the
keyspace that need splitting. Partitions are split either due to
sustained high request rates, or because they contain a large number
of keys (which would slow down lookups within the partition). ... This split
operation happens dozens of times a day all over S3 and simply goes unnoticed
from a user performance perspective.