Performance of listing S3 bucket with prefix and delimiter - amazon-web-services

According to the listing documentation it is possible to navigate a large number of keys as though they were hierarchical. I am planning to store a large number of keys (let's say a few hundred million), distributed over a sensibly sized 'hierarchy'.
What is the performance of listing with a prefix and delimiter? Does it require a full enumeration of keys on the S3 side, making it an O(n) operation? I have no idea whether keys are stored in a big hash table, whether they have indexing data structures, or whether they're stored in a tree or something else entirely.
I want to avoid the situation where I have a very large number of keys and navigating the 'hierarchy' suddenly becomes difficult.
So if I have the following keys:
abc/def/ghi/0
abc/def/ghi/1
abc/def/ghi/...
abc/def/ghi/100,000,000,000
Will it affect the speed of the query Delimiter='/', Prefix='abc/def'?
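For reference, the call in question would look something like this in boto3 (the bucket name here is made up):

import boto3

s3 = boto3.client("s3")

# List one 'directory' level under abc/def/ using the delimiter.
# Deeper keys are rolled up into CommonPrefixes; results page at 1,000 keys per call.
resp = s3.list_objects_v2(
    Bucket="my-example-bucket",   # hypothetical bucket name
    Prefix="abc/def/",
    Delimiter="/",
)

for p in resp.get("CommonPrefixes", []):
    print(p["Prefix"])            # e.g. abc/def/ghi/
for obj in resp.get("Contents", []):
    print(obj["Key"])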

Aside from the Request Rate and Performance Considerations document that Sandeep referenced (which is not applicable to your use case), AWS hasn't published much about S3 performance. It's probably proprietary, so I doubt you'll find much information unless you can get it from AWS directly.
However, some things to keep in mind:
Amazon S3 is built for massive scale. Millions of companies are using S3 with millions of keys in millions of buckets.
AWS promotes the prefix + delimiter as a very valid use case.
There are common data structures and algorithms used in computer science that AWS is probably using behind the scenes to efficiently retrieve keys. One such data structure is called a Trie or Prefix Tree.
Based on all of the above, chances are that retrieving a listing of keys is much better than O(n). I think you are safe to use prefixes and delimiters for your hierarchy.
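To illustrate why a prefix listing need not touch every key, here is a toy prefix tree in Python. This is purely illustrative; it is not a claim about how S3 actually stores its index.

class TrieNode:
    def __init__(self):
        self.children = {}   # one child per character
        self.is_key = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key = True

    def keys_with_prefix(self, prefix):
        # Walking down the prefix costs O(len(prefix)), not O(total keys).
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Only the subtree under the prefix is visited when collecting results.
        out, stack = [], [(node, prefix)]
        while stack:
            node, path = stack.pop()
            if node.is_key:
                out.append(path)
            for ch, child in node.children.items():
                stack.append((child, path + ch))
        return out

t = Trie()
for k in ["abc/def/ghi/0", "abc/def/ghi/1", "xyz/0"]:
    t.insert(k)
print(t.keys_with_prefix("abc/def/"))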

As long as you are not using a continuous sequence (such as the dates 2016-08-13, 2016-08-14, and so on) in the prefix, you shouldn't face any problems. If your keys are auto-generated as a continuous sequence, prepend a randomly generated hash to them (e.g. aidk-2016-08-13, ujlk-2016-08-14).
The Amazon documentation says:
Amazon S3 maintains an index of object key names in each AWS region. Object keys are stored in UTF-8 binary ordering across multiple partitions in the index. The key name dictates which partition the key is stored in. Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition. If you introduce some randomness in your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
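A minimal sketch of that key-randomization idea (the four-character prefix length is arbitrary, and note the later update below saying this is no longer necessary):

import hashlib

def randomized_key(sequential_id: str) -> str:
    # Prepend a short hash derived from the id so that sequential ids
    # spread across index partitions instead of piling up on one.
    prefix = hashlib.md5(sequential_id.encode()).hexdigest()[:4]
    return f"{prefix}-{sequential_id}"

print(randomized_key("2016-08-13"))   # something like 'a1b2-2016-08-13'
print(randomized_key("2016-08-14"))   # a different, unrelated prefix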

Amazon indicates that prefix naming strategies, such as randomized hashing, no longer influence S3 lookup performance.
https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html

Related

Storing a very large array of strings in AWS

I want to store a large array of strings in AWS to be used from my application. The requirements are as follows:
During normal operations, string elements will be added to the array and the array size will continue to grow
I need to enforce uniqueness - i.e. the same string cannot be stored twice
I will have to retrieve the entire array periodically - most probably to put it in a file and use it from the application
I need to backup the data (or at least be convinced that there is a good built-in backup system as part of the features)
I looked at the following:
RDS (MySQL) - this may be overkill and also may become uncomfortably large for a single table (millions of records).
DynamoDB - This is intended for key/value pairs, but I have only a single value per record. Also, and more importantly, retrieving a large number of records seems to be an issue in DynamoDB as the scan operation needs paging and also can be expensive in terms of capacity units, etc.
Single S3 file - This could be a practical solution except that I may need to write to the file (append) concurrently, and that is not a feature available in S3. It would also be hard to enforce element uniqueness.
DocumentDB - This seems to be too expensive and overkill for this purpose
ElastiCache - I don't have a lot of experience with this, and I wonder whether it would be a good fit for my requirements and whether it's practical to have it backed up periodically. It also uses key/value pairs, and reading millions of records (the entire dataset) at once is not advisable.
Any insights or recommendations would be helpful.
Update:
I don't know why people are voting to close this. It is definitely a programming related question and I have already gotten extremely useful answers and comments that will help me and hopefully others in the future. Why is there such an obsession with opinionated closure of useful posts on SO?
DynamoDB might be a good fit.
It doesn't matter that you don't have any "value" to your "key". Just use the string as the primary key. That will also enforce uniqueness.
You get on-demand and continuous backups. I don't have experience with these so I can only point you to the documentation.
The full retrieval of the data might be the biggest hassle. A full-table Scan is not recommended in DynamoDB and can get expensive. There is a way to use Data Pipeline to do an export (I have not used it either). Alternatively, you could put together a system yourself using DynamoDB Streams, e.g. push the stream to Kinesis and then to S3.
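For the uniqueness requirement, here is a hedged sketch of the conditional write (table and attribute names are made up):

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("strings")          # hypothetical table with partition key 'value'

def add_string(s: str) -> bool:
    try:
        table.put_item(
            Item={"value": s},
            # Reject the write if an item with this key already exists.
            ConditionExpression="attribute_not_exists(#v)",
            ExpressionAttributeNames={"#v": "value"},
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False                   # duplicate string, not written
        raise

print(add_string("hello"))   # True the first time, False on repeats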

Cheapest way of getting multiple items from DynamoDB

I see "BatchGetItem" API from AWS, that can be used to retrieve multiple items from dynamodb. But the capacity consumed is n units for n items even if they are just 10 bytes long. Is there a cheaper way of retrieving those items? I see the "Query" API but it probably doesn't support ORing of the keys and even the IN operator.
The cheapest way to get a large number of small items out of DynamoDB is a Query.
BatchGetItem costs the same as the equivalent number of GetItem calls. Only the latency is lower; if you have a large number of items to get from different partitions, and a language that handles parallelism well, you'll actually get better performance doing a large number of parallel GetItem calls.
One way to re-organize your data would be to store all the small items under the same partition key and use a Query to get all the items inside that partition. You'll have to keep the per-partition write throughput limits in mind, but if you're adding items infrequently and you're more worried about reads, this should work great.
In a Query, DynamoDB charges based on the size of data returned, irrespective of the number of items returned. But this seems to be the only place where these economics apply.
Based on what I know of DynamoDB, this makes sense. In a BatchGetItems call, the items can easily be from different partitions (and even different tables), so there's not much efficiency — each item must be individually looked up and fetched from its partition on the DynamoDB network. If all the keys in the batch are from the same partition and sequential on the range key, that's a lucky coincidence, but DynamoDB can't exploit it.
In a Query on the other hand, DynamoDB has perfect knowledge that it's only going to be talking directly to one single partition, and doing a sequential read inside that partition, so that call is much simpler and cheaper with savings passed on to customers.
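A minimal boto3 sketch of that single-partition Query, with paging (table and key names are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("small-items")      # hypothetical table with partition key 'pk'

def get_all_items(partition_value: str):
    items, start_key = [], None
    while True:
        kwargs = {"KeyConditionExpression": Key("pk").eq(partition_value)}
        if start_key:
            kwargs["ExclusiveStartKey"] = start_key
        resp = table.query(**kwargs)
        items.extend(resp["Items"])        # billed by data size read, not item count
        start_key = resp.get("LastEvaluatedKey")
        if not start_key:
            return items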

Will an incrementing integer PK produce uniform workload in DynamoDB

I am looking to index some data in DynamoDB and would like to key on an incrementing integer ID. The higher IDs will get most of the traffic; however, this will be spread evenly across tens of thousands of the highest IDs. Will this create the uniform data access that is important for DynamoDB?
AWS don't seem to publish details on the hashing algorithm they use to distribute primary keys. I am assuming it is something akin to MD5 where, for example, the hash for 3000 is completely different from the hashes for 3001, 3002, and 3003, and therefore it will result in a uniformly distributed workload.
The reason I ask, is that I know this is not the case in S3 where they suggest reversing auto incrementing IDs in cases like this.
DynamoDB doesn't seem to expose the internal workings of its hashing in the documentation. A lot of places seem to quote MD5, but I am not sure they can be considered authoritative.
An interesting study of the distribution of hashes for number sequences is available here. The interesting data sets are Dataset 4 and Dataset 5, which deal with sequences of numbers. Most hashing functions (and MD5 even more so) seem to be distributed satisfactorily from the viewpoint of partitioning.
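AWS don't document the hash function, but the distribution claim is easy to illustrate. The snippet below uses MD5 purely as a stand-in; it is not a statement about DynamoDB's internals.

import hashlib

# Sequential ids produce wildly different digests, so a hash-partitioned
# store would scatter them across partitions.
for i in range(3000, 3004):
    print(i, hashlib.md5(str(i).encode()).hexdigest()[:8])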
AWS have confirmed that using an incrementing integer ID will create an even workload:
If you are using incrementing numbers as the hash key, they will be distributed equally among the hash key space.
Source: https://forums.aws.amazon.com/thread.jspa?threadID=189362&tstart=0

S3 performance for LIST by prefix with millions of objects in a single bucket

I have a project where there will be about 80 million objects in an S3 bucket. Every day, I will be deleting about 4 million and adding 4 million. The object names will be in a pseudo directory structure:
/012345/0123456789abcdef0123456789abcdef
For deletion, I will need to list all objects with a prefix of 012345/ and then delete them. I am concerned about the time this LIST operation will take. While it seems clear that S3's access time for an individual object does not increase with the number of objects in the bucket, I haven't found anything definitive that says a LIST over 80MM objects, searching for 10 objects that all have the same prefix, will remain fast in such a large bucket.
In a side comment on a question about the maximum number of objects that can be stored in a bucket (from 2008):
In my experience, LIST operations do take (linearly) longer as object count increases, but this is probably a symptom of the increased I/O required on the Amazon servers, and down the wire to your client.
From the Amazon S3 documentation:
There is no limit to the number of objects that can be stored in a bucket and no difference in performance whether you use many buckets or just a few. You can store all of your objects in a single bucket, or you can organize them across several buckets.
While I am inclined to believe the Amazon documentation, it isn't entirely clear which operations that statement covers.
Before committing to this expensive plan, I would like to definitively know if LIST operations when searching by prefix remain fast when buckets contain millions of objects. If someone has real-world experience with such large buckets, I would love to hear your input.
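For concreteness, the planned operation would look something like this in boto3 (the bucket name is hypothetical); each page of the listing returns up to 1,000 keys:

import boto3

s3 = boto3.client("s3")
bucket = "my-80mm-object-bucket"           # hypothetical bucket name

def delete_prefix(prefix: str):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        keys = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if keys:
            # DeleteObjects accepts up to 1,000 keys per call, matching the page size.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": keys})

delete_prefix("012345/")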
Prefix searches are fast, if you've chosen the prefixes correctly. Here's an explanation: https://cloudnative.io/blog/2015/01/aws-s3-performance-tuning/
I've never seen a problem, but why would you ever list a million files just to pull a few out of the list? It's not an S3 performance issue; it's more likely just the call itself taking longer.
Why not store the file names in a database, index them, and query from there? That would be a better solution, I think.

Why does the AWS DynamoDB SDK not provide a means to store objects larger than 64KB?

I had a use case where I wanted to store objects larger than 64kb in Dynamo DB. It looks like this is relatively easy to accomplish if you implement a kind of "paging" functionality, where you partition the objects into smaller chunks and store them as multiple values for the key.
This got me thinking however. Why did Amazon not implement this in their SDK? Is it somehow a bad idea to store objects bigger than 64kb? If so, what is the "correct" infrastructure to use?
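The 'paging' workaround described above might look roughly like this (table name, key names, and the chunk size are all assumptions; the chunk size only has to stay under the item limit):

import boto3

dynamodb = boto3.resource("dynamodb")
# Hypothetical table with partition key 'blob_id' and numeric sort key 'part'.
table = dynamodb.Table("blobs")

CHUNK = 60 * 1024                          # stay under the (then) 64KB item limit

def put_blob(blob_id: str, data: bytes):
    parts = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    for n, part in enumerate(parts):
        table.put_item(Item={"blob_id": blob_id, "part": n, "data": part})
    # Record the part count so readers know when they have the whole object.
    table.put_item(Item={"blob_id": blob_id, "part": -1, "total_parts": len(parts)})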
In my opinion, it's an understandable trade-off DynamoDB made. To be highly available and redundant, they need to replicate data. To get super-low latency, they allowed inconsistent reads. I'm not sure of their internal implementation, but I would guess that the higher this 64KB cap is, the longer your inconsistent reads might be out of date with the actual current state of the item. And in a super low-latency system, milliseconds may matter.
This pushes the problem of an inconsistent Query returning chunk 1 and 2 (but not 3, yet) to the client-side.
As per question comments, if you want to store larger data, I recommend storing in S3 and referring to the S3 location from an attribute on an item in DynamoDB.
For the record, the maximum item size in DynamoDB is now 400KB, rather than 64KB as it was when the question was asked.
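A sketch of that S3-pointer pattern (bucket, table, and attribute names are made up):

import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("documents")   # hypothetical table

def put_large_object(doc_id: str, payload: bytes):
    # Keep the large payload in S3...
    s3.put_object(Bucket="my-payload-bucket", Key=doc_id, Body=payload)
    # ...and store only the pointer plus small metadata in DynamoDB.
    table.put_item(Item={"doc_id": doc_id, "s3_key": doc_id, "size": len(payload)})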
From a design perspective, I think a lot of cases where you would model your problem with >64KB chunks could also be translated into models where you split those chunks into <64KB pieces, and it is most often a better design choice to do so.
E.g. if you store a complex object, you could probably split it into a number of collections, each of which encodes one of the various facets of the object.
This way you probably get better, more predictable performance for large datasets as querying for an object of any size will involve a defined number of API calls with a low, predictable upper bound on latency.
Very often, service operations people struggle to get this predictability out of a system so as to guarantee a given latency at the 90th/95th/99th percentile of traffic. AWS just chose to build this constraint into the API, as they probably already do for their own website and internal development.
Also, of course, from an (AWS) implementation and tuning perspective, it is quite convenient to assume a 64KB cap, as it allows for predictable memory paging in/out, upper bounds on network round trips, etc.