S3 partitioning strategy

I'm fully aware of the documentation on how to name S3 objects within a bucket to optimize performance, but I cannot understand the example in this article: https://aws.amazon.com/blogs/aws/amazon-s3-performance-tips-tricks-seattle-hiring-event/
2134857/gamedata/start.png
2134857/gamedata/resource.rsrc
2134857/gamedata/results.txt
2134858/gamedata/start.png
2134858/gamedata/resource.rsrc
2134858/gamedata/results.txt
2134859/gamedata/start.png
2134859/gamedata/resource.rsrc
2134859/gamedata/results.txt
The article says "All these reads and writes will basically always go to the same partition",
but shouldn't we have three partitions here:
2134857, 2134858, 2134859?
If we reverse the IDs:
7584312/gamedata/start.png
7584312/gamedata/resource.rsrc
7584312/gamedata/results.txt
8584312/gamedata/start.png
8584312/gamedata/resource.rsrc
8584312/gamedata/results.txt
9584312/gamedata/start.png
9584312/gamedata/resource.rsrc
9584312/gamedata/results.txt
we also have three partitions: 7584312, 8584312, 9584312.
So what is the difference?
And what exactly is the definition of a prefix, and how does it relate to the partitioning strategy?

S3 partitioning does not (always) occur on the full ID; it is usually a partial match on the leading part of the key. It's likely your first example would end up on a single partition keyed on, say, 2134, 21348, or 213485.
More info from the blog post you linked to:
As we said, S3 has automation that continually looks for areas of the
keyspace that need splitting. Partitions are split either due to
sustained high request rates, or because they contain a large number
of keys (which would slow down lookups within the partition). ... This split
operation happens dozens of times a day all over S3 and simply goes unnoticed
from a user performance perspective.
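To make the difference concrete, here's a tiny sketch (the helper name is mine) of how reversing a sequential ID changes the leading characters that this kind of leading-prefix partitioning keys on:

# Hypothetical helper: under leading-prefix partitioning, sequential IDs
# all share the leading characters ("2134..."), while reversed IDs start
# with 7, 8, 9 and can land on different partitions.
def game_key(game_id: int, filename: str, reverse_id: bool = False) -> str:
    id_str = str(game_id)
    if reverse_id:
        id_str = id_str[::-1]  # 2134857 -> 7584312
    return f"{id_str}/gamedata/{filename}"

for gid in (2134857, 2134858, 2134859):
    print(game_key(gid, "start.png"))                   # 2134857/..., 2134858/..., 2134859/...
    print(game_key(gid, "start.png", reverse_id=True))  # 7584312/..., 8584312/..., 9584312/...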

Related

AWS Athena partition fetch all paths

Recently, I've experienced an issue with AWS Athena when there is a fairly high number of partitions.
The old version had a database and tables with only one partition level, say id=x. Take one table, for example, where we store payment parameters per id (product); there aren't that many IDs, roughly 1,000-5,000. When querying that table with an id in the where clause, like ".. where id = 10", the queries actually returned pretty fast. Assume we update the data twice a day.
Lately, we've been thinking of adding another partition level for the day, like "../id=x/dt=yyyy-mm-dd/..". This means the number of partitions grows by the number of IDs every day; with 3,000 IDs we'd get approximately 3,000 x 30 = 90,000 partitions per month. Thus, a rapid growth in the number of partitions.
On, say, three months of data (~270k partitions), we'd like a query like the following to return in at most 20 seconds or so:
select count(*) from db.table where id = x and dt = 'yyyy-mm-dd'
This takes like a minute.
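For reference, this is roughly how we time the query with boto3 (bucket, database and table names are placeholders); the Statistics block of get_query_execution separates the engine execution time from the total wall-clock time, which shows how much time goes into planning and metadata work:

import time
import boto3

athena = boto3.client("athena")

# Placeholder names: db.table, plus an S3 location for Athena results.
qid = athena.start_query_execution(
    QueryString="select count(*) from db.table where id = 'x' and dt = 'yyyy-mm-dd'",
    QueryExecutionContext={"Database": "db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    execution = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]
    if execution["Status"]["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

stats = execution.get("Statistics", {})
print("engine execution ms:", stats.get("EngineExecutionTimeInMillis"))
print("total execution ms: ", stats.get("TotalExecutionTimeInMillis"))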
The Real Case
It turns out Athena first fetches all the partitions (metadata) and S3 paths (regardless of the where clause) and only then filters those S3 paths down to the ones matching the where condition. The first part (fetching all the S3 paths by partition) takes time proportional to the number of partitions.
The more partitions you have, the slower the query executes.
Intuitively, I expected Athena to fetch only the S3 paths matching the where clause; that would be the whole point of the partitioning. Instead, it apparently fetches all paths.
Does anybody know a workaround, or are we using Athena the wrong way?
Should Athena only be used with a small number of partitions?
Edit
To clarify the statement above, here is an excerpt from the support mail.
from Support
...
You mentioned that your new system has 360,000 partitions, which is a huge
number. So when you are doing select * from <partitioned table>, Athena
first downloads all the partition metadata and searches the S3 paths mapped
to those partitions. This process of fetching data for each partition leads
to longer query execution time.
...
Update
An issue has been opened on the AWS forums; it is linked here.
Thanks.
This is impossible to properly answer without knowing the amount of data, what file formats, and how many files we're talking about.
TL;DR: I suspect you have partitions with thousands of files and that the bottleneck is listing and reading them all.
For any data set that grows over time you should have temporal partitioning, on date or even time, depending on query patterns. Whether you should also partition on other properties depends on a lot of factors, and in the end it often turns out that not partitioning is better. Not always, but often.
Using reasonably sized (~100 MB) Parquet files can in many cases be more effective than partitioning. The reason is that partitioning increases the number of prefixes that have to be listed on S3, and the number of files that have to be read. A single 100 MB Parquet file can be more efficient than ten 10 MB files in many cases.
When Athena executes a query it will first load partitions from Glue. Glue supports limited filtering on partitions, and will help a bit in pruning the list of partitions – so to the best of my knowledge it's not true that Athena reads all partition metadata.
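For what it's worth, here is a rough boto3 sketch (database and table names assumed) of the kind of Glue call involved; passing an Expression lets the catalog prune partitions server-side instead of returning every one:

import boto3

glue = boto3.client("glue")

# Assumed names: database "db", table "table". Only partitions matching the
# Expression are returned, so the catalog does part of the pruning.
paginator = glue.get_paginator("get_partitions")
pages = paginator.paginate(
    DatabaseName="db",
    TableName="table",
    Expression="id = '10' AND dt = '2020-01-01'",
)

locations = [
    p["StorageDescriptor"]["Location"]
    for page in pages
    for p in page["Partitions"]
]
print(len(locations), "partition location(s) left to list on S3")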
When it has the partitions it will issue LIST operations to the partition locations to gather the files that are involved in the query – in other words, Athena won't list every partition location, just the ones in partitions selected for the query. This may still be a large number, and these list operations are definitely a bottleneck. It becomes especially bad if there are more than 1,000 files in a partition, because that's the page size of S3's list operations, and multiple requests will have to be made sequentially.
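As a sketch of why that hurts (bucket and prefix are assumptions): ListObjectsV2 returns at most 1,000 keys per response, so a partition with thousands of files turns into several sequential requests before any data is read:

import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

requests = keys = 0
# Assumed bucket/prefix pointing at a single partition location.
for page in paginator.paginate(Bucket="my-data-bucket", Prefix="table/id=10/dt=2020-01-01/"):
    requests += 1                   # one LIST request per page of <= 1,000 keys
    keys += page.get("KeyCount", 0)

print(f"{keys} objects took {requests} sequential LIST request(s)")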
With all files listed Athena will generate a list of splits, which may or may not equal the list of files – some file formats are splittable, and if files are big enough they are split and processed in parallel.
Only after all of that work is done the actual query processing starts. Depending on the total number of splits and the amount of available capacity in the Athena cluster your query will be allocated resources and start executing.
If your data were in Parquet format, with only one or a few files per partition, the count query in your question should run in a second or less. Parquet files have enough metadata that a count query doesn't have to read the data, just the file footer. It's hard to get any query to run in less than a second due to the multiple steps involved, but a query hitting a single partition should run quickly.
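As a quick illustration (file path assumed, and assuming pyarrow is available), the row count can be read straight from the Parquet footer without touching any column data:

import pyarrow.parquet as pq

# Only the footer is read here; no column data is scanned.
meta = pq.ParquetFile("part-00000-example.parquet").metadata
print("rows:", meta.num_rows, "row groups:", meta.num_row_groups)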
Since it takes two minutes I suspect you have hundreds of files per partition, if not thousands, and your bottleneck is that it takes too much time to run all the list and get operations in S3.

AWS S3 - SlowDown: Please reduce your request rate

There are enough similar questions and answers on SO; however, little is said about prefixes.
First, randomization of prefixes is not needed anymore; see here:
This S3 request rate performance increase removes any previous
guidance to randomize object prefixes to achieve faster performance.
That means you can now use logical or sequential naming patterns in S3
object naming without any performance implications.
Now back to my problem. I still get "SlowDown" and I don't get why.
All my objects are distributed as follows:
/foo/bar/baz/node_1/folder1/file1.bin
/foo/bar/baz/node_1/folder1/file2.bin
/foo/bar/baz/node_1/folder2/file1.bin
/foo/bar/baz/node_2/folder1/file1.bin
/foo/bar/baz/node_2/folder1/file2.bin
Each node has its own prefix, which is followed by a "folder" name and then a "file" name. There are about 40 "files" in each "folder". Let's say I have ~20 nodes, about 200 "folders" under each node, and 40 "files" under each folder. In this case, the prefix consists of the common part "/foo/bar/baz", the node, and the folder, so even if I upload all 40 files in parallel, the pressure on a single prefix is 40, right? And even if I upload 40 files to each and every "folder" across all nodes, the pressure is still 40 per prefix. Is that correct? If yes, how come I get the "SlowDown"? If not, how am I supposed to take care of it? A custom RetryStrategy? How come the DefaultRetryStrategy, which employs exponential backoff, does not solve this problem?
EDIT001:
Here is the explanation of what a prefix means.
OK, after a month with the AWS support team, with assistance from the S3 engineering team, the short answer is: randomize prefixes the old-fashioned way.
The long answer: they did indeed improve the performance of S3 as stated in the link in the original question; however, you can still bring S3 to its knees. The point is that internally they partition all the objects stored in a bucket. The partitioning works on the bucket's prefixes and organizes them in lexicographical order, so, no matter what, when you put a lot of files into different "folders" it still puts pressure on the outer part of the prefix; S3 then tries to partition that outer part, and that is the moment you get the "SlowDown". You can exponentially back off with retries, but in my case a 5-minute backoff didn't do the trick, so the last resort is to prepend the prefix with some random token, ideally evenly distributed. That's it.
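For illustration, here's a sketch of the randomization we ended up with. Using a short hash of the original key rather than a truly random token keeps the leading characters evenly distributed while staying deterministic, so the object can still be located later (the path layout mirrors the question):

import hashlib

def randomized_key(key: str, width: int = 4) -> str:
    # Deterministic, evenly distributed token derived from the key itself.
    token = hashlib.md5(key.encode("utf-8")).hexdigest()[:width]
    return f"{token}/{key}"

print(randomized_key("foo/bar/baz/node_1/folder1/file1.bin"))
# -> "<4 hex chars>/foo/bar/baz/node_1/folder1/file1.bin"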
In less aggressive cases, the S3 engineering team can check your usage and manually partition your bucket (this is done at the bucket level). It didn't work in our case.
And no, money can't buy more requests per prefix, since, I guess, there is no entity that can pay Amazon for rewriting the S3 backend.
2020 UPDATE: Well, after implementing randomization for S3 prefixes I can say just one thing: if you try hard enough, no randomization will help. We are still getting SlowDown, but not as frequently as before. There is no other means of solving this problem except rescheduling the failed operations for later execution.
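In practice, "rescheduling for later" looks roughly like this with boto3 (bucket and queue are placeholders): let botocore's adaptive retry mode do its thing first, and if S3 still answers with SlowDown, push the operation back onto an application-level queue:

import boto3
from botocore.config import Config
from botocore.exceptions import ClientError

# Adaptive retry mode adds client-side rate limiting on top of backoff.
s3 = boto3.client("s3", config=Config(retries={"max_attempts": 10, "mode": "adaptive"}))

def put_or_requeue(body: bytes, bucket: str, key: str, retry_queue: list) -> None:
    try:
        s3.put_object(Bucket=bucket, Key=key, Body=body)
    except ClientError as err:
        if err.response["Error"]["Code"] == "SlowDown":
            retry_queue.append((body, bucket, key))  # reschedule for later execution
        else:
            raise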
YET ANOTHER 2020 UPDATE: Hehe, the number of LIST requests you are making to your bucket prevents us from partitioning the bucket properly. LOL

S3 - What Exactly Is A Prefix? And what Ratelimits apply?

I was wondering if anyone knew what exactly an S3 prefix is and how it interacts with Amazon's published S3 rate limits:
Amazon S3 automatically scales to high request rates. For example,
your application can achieve at least 3,500 PUT/POST/DELETE and 5,500
GET requests per second per prefix in a bucket. There are no limits to
the number of prefixes in a bucket.
While that's really clear, I'm not quite certain what a prefix is.
Does a prefix require a delimiter?
If we have a bucket where we store all files at the "root" level (completely flat, without any prefixes/delimiters), does that count as a single "prefix", and is it subject to the rate limits posted above?
The way I'm interpreting Amazon's documentation suggests to me that this IS the case, and that the flat structure would be considered a single "prefix" (i.e. it would be subject to the published rate limits above).
Suppose that your bucket (admin-created) has four objects with the
following object keys:
Development/Projects1.xls
Finance/statement1.pdf
Private/taxdocument.pdf
s3-dg.pdf
The s3-dg.pdf key does not have a prefix, so its object appears
directly at the root level of the bucket. If you open the Development/
folder, you see the Projects.xlsx object in it.
In the above example, would s3-dg.pdf be subject to a different rate limit (5,500 GET requests/second) than each of the other prefixes (Development/, Finance/, Private/)?
What's more confusing is that I've read a couple of blog posts about Amazon using the first N bytes of the key as a partition key and encouraging the use of high-cardinality prefixes; I'm just not sure how that interacts with a bucket with a "flat" file structure.
You're right, the announcement seems to contradict itself. It's just not written properly, but the information is correct. In short:
Each prefix can achieve up to 3,500/5,500 requests per second, so for many purposes the assumption is that you won't need to use several prefixes.
Prefixes are considered to be the whole path (up to the last '/') of an object's location, and are no longer hashed only by the first 6-8 characters. Therefore it is enough to just split the data between any two "folders" to achieve twice the maximum requests per second (if requests are divided evenly between the two).
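To make that definition concrete, a tiny sketch of the "everything up to the last '/'" reading:

def prefix_of(key: str) -> str:
    # Everything up to and including the last '/'; root-level objects have no prefix.
    head, slash, _ = key.rpartition("/")
    return head + slash

for key in ("folder1/sub1/file", "folder1/sub2/file", "1/file", "s3-dg.pdf"):
    p = prefix_of(key)
    print(key, "->", p if p else "(bucket root)")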
For reference, here is a response from AWS support to my clarification request:
Hello Oren,
Thank you for contacting AWS Support.
I understand that you read AWS post on S3 request rate performance
being increased and you have additional questions regarding this
announcement.
Before this upgrade, S3 supported 100 PUT/LIST/DELETE requests per
second and 300 GET requests per second. To achieve higher performance,
a random hash / prefix schema had to be implemented. Since last year
the request rate limits increased to 3,500 PUT/POST/DELETE and 5,500
GET requests per second. This increase is often enough for
applications to mitigate 503 SlowDown errors without having to
randomize prefixes.
However, if the new limits are not sufficient, prefixes would need to
be used. A prefix has no fixed number of characters. It is any string
between a bucket name and an object name, for example:
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Prefixes of the object 'file' would be: /folder1/sub1/ ,
/folder1/sub2/, /1/, /2/. In this example, if you spread reads
across all four prefixes evenly, you can achieve 22,000 requests per
second.
This looks like it is obscurely addressed in an Amazon release communication:
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
Performance scales per prefix, so you can use as many prefixes as you
need in parallel to achieve the required throughput. There are no
limits to the number of prefixes.
This S3 request rate performance increase removes any previous
guidance to randomize object prefixes to achieve faster performance.
That means you can now use logical or sequential naming patterns in S3
object naming without any performance implications. This improvement
is now available in all AWS Regions. For more information, visit the
Amazon S3 Developer Guide.
S3 prefixes used to be determined by the first 6-8 characters.
This changed in mid-2018 - see the announcement:
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-s3-announces-increased-request-rate-performance/
But that is only half the truth. In practice, prefixes (in the old definition) still matter.
S3 is not traditional "storage" - each directory/filename is a separate object in a key/value object store. The data also has to be partitioned/sharded to scale to a quadzillion objects. So yes, this new sharding is kind of "automatic", but not really if you create a new process that writes to it with crazy parallelism to different subdirectories. Before S3 learns the new access pattern, you may run into S3 throttling until it reshards/repartitions the data accordingly.
Learning new access patterns takes time. Repartitioning of the data takes time.
Things did improve in mid-2018 (~10x throughput-wise for a new bucket with no statistics), but it's still not what it could be if the data were partitioned properly. Although, to be fair, this may not apply to you if you don't have a ton of data, or if the pattern in which you access the data is not hugely parallel (e.g. running a Hadoop/Spark cluster on many TBs of data in S3 with hundreds of tasks accessing the same bucket in parallel).
TL;DR:
"Old prefixes" still do matter.
If you write data to the root of your bucket, the first-level directory there will determine the "prefix" (make it random, for example).
"New prefixes" do work, but not initially. It takes time to accommodate to the load.
PS. Another approach - you can reach out to your AWS TAM (if you have one) and ask them to pre-partition a new S3 bucket if you expect a ton of data to be flooding it soon.
In order for AWS to handle billions of requests per second, they need to shard the data so they can optimise throughput. To do this they split the data into partitions based on the first 6 to 8 characters of the object key. Remember that S3 is not a hierarchical filesystem; it is only a key-value store, though the key is often used like a file path for organising data: prefix + filename.
Now this is not an issue if you expect less than 100 requests per second, but if you have serious requirements over that then you need to think about naming.
For maximum parallel throughput you should consider how your data is consumed and use the most varying characters at the beginning of your key, or even generate 8 random characters for the first 8 characters of the key.
e.g. assuming the first 6 characters define the partition:
files/user/bob would be bad, as all the objects would be on one partition, files/.
2018-09-21/files/bob would be almost as bad if only today's data is being read from partition 2018-0, but slightly better if the objects are read from past years.
bob/users/files would be pretty good if different users are likely to be using the data at the same time from partition bob/us, but not so good if Bob is by far the busiest user.
3B6EA902/files/users/bob, where the first part is a random string, would be best for performance but more challenging to reference; keys like this would be spread pretty evenly.
Depending on your data, you need to think about who is reading what at any one point in time, and make sure that the keys start with enough variation to partition appropriately.
For your example, let's assume the partition is taken from the first 6 characters of the key:
for the key Development/Projects1.xls the partition key would be Develo
for the key Finance/statement1.pdf the partition key would be Financ
for the key Private/taxdocument.pdf the partition key would be Privat
for the key s3-dg.pdf the partition key would be s3-dg.
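A small sketch tying both ideas together, under the same assumption that the first 6 characters form the partition key, plus a random 8-hex-character lead like 3B6EA902:

import secrets

def partition_key(key: str, width: int = 6) -> str:
    # Assumed rule from above: the partition key is the first 6 characters.
    return key[:width]

for key in ("Development/Projects1.xls", "Finance/statement1.pdf",
            "Private/taxdocument.pdf", "s3-dg.pdf"):
    print(key, "->", partition_key(key))

def random_lead_key(suffix: str) -> str:
    # 8 random hex characters up front (like 3B6EA902/...) spread writes
    # across many partition keys, at the cost of harder-to-reference paths.
    return f"{secrets.token_hex(4).upper()}/{suffix}"

print(random_lead_key("files/users/bob"))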
The upvoted answer on this was a bit misleading for me.
If these are the paths
bucket/folder1/sub1/file
bucket/folder1/sub2/file
bucket/1/file
bucket/2/file
Your prefixes for file would actually be
folder1/sub1/
folder1/sub2/
1/
2/
https://docs.aws.amazon.com/AmazonS3/latest/dev/ListingKeysHierarchy.html
Please see the docs. I had issues with the leading '/' when trying to list keys with the Airflow S3Hook.
In the case where you query S3 using Athena, EMR/Hive or Redshift Spectrum, increasing the number of prefixes could mean adding more partitions (as the partition id is part of the prefix). If you use a datetime as (one of) your partition keys, the number of partitions (and prefixes) will automatically grow as new data is added over time, and the total maximum S3 GETs per second will grow as well.
S3 - What Exactly Is A Prefix?
S3 recently updated their documentation to better reflect this.
"A prefix is a string of characters at the beginning of the object key name. A prefix can be any length, subject to the maximum length of the object key name (1,024 bytes). "
From - https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-prefixes.html
Note: "You can use another character as a delimiter. There is nothing unique about the slash (/) character, but it is a very common prefix delimiter."
As long as two objects have different prefixes, S3 will provide the documented throughput over time.
Update: https://docs.aws.amazon.com/general/latest/gr/glos-chap.html#keyprefix reflecting the updated definition.

Is there any real sense in uniform distributed partition keys for small applications using DynamoDB?

The Amazon DynamoDB documentation is focused on uniform distribution of the partition key as the most important point in creating a correct DB architecture.
On the other hand, when it comes to real numbers, you may find that your app will never grow out of one partition. That is, according to the docs:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
the partition calculation formula is:
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need demand of more than 1,000 writes per second (for 1 KB items) to grow out of one partition. But according to my calculations, for most small applications you don't even need the default 5 writes per second - 1 is enough. (To be precise, you can also grow out of one partition if your data exceeds 10 GB, but that's also a big number.)
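A quick sketch of that arithmetic (the 10 GB limit from the parenthesis is shown as a separate check, since it is a separate trigger for a split):

import math

def initial_partitions(read_capacity_units: int, write_capacity_units: int) -> int:
    # Quoted formula: (RCU / 3,000) + (WCU / 1,000), rounded up.
    return math.ceil(read_capacity_units / 3000 + write_capacity_units / 1000)

print(initial_partitions(5, 5))        # default small-app capacity -> 1 partition
print(initial_partitions(3000, 1000))  # -> 2 partitions

# Separately, a partition also splits once it holds more than ~10 GB of data.
table_size_gb = 4
print(table_size_gb > 10)  # False: still within a single partition by size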
The question becomes more important when you realize that creating any additional index requires an additional writes-per-second allocation.
Just imagine I have some data related to a particular user, for example "posts".
I create a "posts" table and then, according to Amazon guidelines, I choose the following key format:
partition: id, // post id like uuid
sort: // don't need it
Since no two posts have the same id, we don't need a sort key here. But then you realize that the most common operation you have is requesting the list of posts for a particular user. So you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units, so the cost of such a decision is doubled!
On the other hand, keeping in mind that you have only one partition, you could use this primary key from the start:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
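For concreteness, a boto3 sketch (table and attribute names are mine) of that single-table layout: userId as the partition key, the post id as the sort key, and the post list retrieved with a plain Query, no secondary index:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Assumed table/attribute names; capacity matches the "default 5" discussed above.
table = dynamodb.create_table(
    TableName="posts",
    KeySchema=[
        {"AttributeName": "userId", "KeyType": "HASH"},  # partition key
        {"AttributeName": "id", "KeyType": "RANGE"},     # sort key: post id (uuid)
    ],
    AttributeDefinitions=[
        {"AttributeName": "userId", "AttributeType": "S"},
        {"AttributeName": "id", "AttributeType": "S"},
    ],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
)
table.wait_until_exists()

# The most common operation - list a user's posts - needs no secondary index.
posts = table.query(KeyConditionExpression=Key("userId").eq("user-123"))
print(posts["Count"])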
So the question is: have I missed something? Maybe the partition key is much more effective than the sort key, even inside one partition?
Addition: you may say, "OK, having userId as the partition key for posts is fine now, but when you have 100,000 users in your app you'll run into trouble with scaling." But in reality the trouble only arises in a "transition" case - when you have just a few partitions, with one group of active users' posts all in one partition and the inactive ones in another. If you have thousands of users, it's natural that many of them have active posts; the impact of any single user is negligible, and statistically their posts are evenly distributed across many partitions thanks to the law of large numbers.
I think it's absolutely fine, as long as you make sure you won't exceed the partition limits through increased RCU/WCU or growth of your data. Moreover, the best practices guide says:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.

Does pseudorandom substring need to be at beginning of key to benefit from S3 partitioning

Based on this resource, adding a pseudo-random prefix to an S3 key will increase your GET performance over having a constant prefix.
So a key of the form:
bucket/$randomPrefix-key.txt
will perform better in GETs than
bucket/$date-key.txt
It also implies that the common prefix portion doesn't matter. From the article:
You can optionally add more prefixes in your key name, before the hash string, to group objects. The following example adds animations/ and videos/ prefixes to the key names.
examplebucket/animations/232a-2013-26-05-15-00-00/cust1234234/animation1.obj
examplebucket/animations/7b54-2013-26-05-15-00-00/cust3857422/animation2.obj
examplebucket/animations/921c-2013-26-05-15-00-00/cust1248473/animation3.obj
examplebucket/videos/ba65-2013-26-05-15-00-00/cust8474937/video2.mpg
examplebucket/videos/8761-2013-26-05-15-00-00/cust1248473/video3.mpg
examplebucket/videos/2e4f-2013-26-05-15-00-01/cust1248473/video4.mpg
examplebucket/videos/9810-2013-26-05-15-00-01/cust1248473/video5.mpg
examplebucket/videos/7e34-2013-26-05-15-00-01/cust1248473/video6.mpg
examplebucket/videos/c34a-2013-26-05-15-00-01/cust1248473/video7.mpg
...
So a key of the form
bucket/foo/bar/baz/$randomPrefix-key.txt
will apparently work just as well as the first form above.
My question: what if the pseudorandom prefix is in the middle of the key? Does that work just as well?
For example:
bucket/foo/bar/baz-$pseudoRandomString-key.txt
Your example is no different than the ones in the documentation, for an important reason: slashes / have no intrinsic meaning to S3.
There are no folders in S3. foo/bar.txt and foo/baz.jpg are not "in the same folder."
Technically, they are just two objects whose keys have a common prefix.
The console displays them in a folder, only for organizational convenience.
Amazon S3 has a flat structure with no hierarchy like you would see in a typical file system. However, for the sake of organizational simplicity, the Amazon S3 console supports the folder concept as a means of grouping objects.
http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
Also:
The Amazon S3 data model does not natively support the concept of folders, nor does it provide any APIs for folder-level operations. But the Amazon S3 console supports folders to help you organize your data.
http://docs.aws.amazon.com/AmazonS3/latest/UG/about-using-console.html
Thus the / has no special meaning to the S3 index, and no special meaning relative to the placement of your random prefix.
However, it's important that the characters before the random prefix remain the same, so that partition splits can be accomplished right at the beginning of the random characters.
S3 must be able to split the list of keys beginning with the first random character and find a balance of work to the left of (<) and right of (>=) the split point.
If you have this...
fix/ed/chars/here-then-$random/anything/here
...then S3 says to itself "hmm... it looks like example-bucket/fix/ed/chars/here-then-* seems to be taking a lot of traffic, but it looks like the next character is always one of 0 1 2 3 4 5 6 7 8 9 a b c d e f and they're pretty well balanced, so I'm going to split it at '8', so that ...then-0* through ...then-7* is in one partition and ...then-8* through ...then-f* in another" and #boom, potential performance bottleneck solved.
The partitioning is completely automatic and transparent.
Here's an example of what not to do.
logs/2017-01-23/$random/...
logs/2017-01-24/$random/...
logs/2017-01-25/$random/...
Here, a hot spot develops in a different prefix each day, giving S3 no good options for creating effective partition splits to alleviate any overload. Any split would always end up to the left of (lexically less than) all future uploads, at some point, in this case -- so not an effective split. By contrast, the split, above, puts about half the workload < and the other half >= a split at a single character.
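To make the contrast concrete, a small sketch (layouts taken from the examples above) that generates keys in both shapes:

import datetime
import secrets

def good_key(suffix: str) -> str:
    # Constant lead, then the random part: S3 can split on the first random
    # character and roughly halve the load on the hot partition.
    return f"fix/ed/chars/here-then-{secrets.token_hex(2)}/{suffix}"

def bad_key(suffix: str) -> str:
    # Date before the random part: the hot prefix changes every day, so any
    # split ends up lexically behind tomorrow's uploads.
    day = datetime.date.today().isoformat()
    return f"logs/{day}/{secrets.token_hex(2)}/{suffix}"

print(good_key("anything/here"))  # e.g. fix/ed/chars/here-then-9f3a/anything/here
print(bad_key("anything/here"))   # e.g. logs/2017-01-25/9f3a/anything/here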
Also worth noting... if you don't expect a sustained workload of at least > 100 req/sec, this isn't going to give you any benefit at all. Natural randomness in your keyspace may also suffice, and S3 reads can scale essentially indefinitely without these optimizations when coupled with CloudFront (and usually faster and often slightly cheaper, since CloudFront bandwidth pricing is slightly lower than S3's in some areas, presumably because it relieves potential congestion on the Internet connections at the S3 regions). When S3 is connected to CloudFront, S3 rates its bandwidth charges at $0.00/GB out to the Internet, and CloudFront bills that piece, at its rates, instead of S3.