I am looking to index some data in DynamoDB and would like to key on an incrementing integer ID. The higher IDs will get most of the traffic; however, this traffic will be spread evenly across tens of thousands of the highest IDs. Will this create the uniform data access that is important for DynamoDB?
AWS don't seem to publish details of the hashing algorithm they apply to primary keys. I am assuming it is something akin to MD5 where, for example, the hash for 3000 is completely different from those for 3001, 3002 and 3003, and therefore it will result in a uniformly distributed workload.
The reason I ask, is that I know this is not the case in S3 where they suggest reversing auto incrementing IDs in cases like this.
DynamoDB doesn't seem to expose the internal workings of the hashing in documentation. A lot of places seem to quote MD5, but I am not sure if they can be considered authoritative.
An interesting study of the distribution of hashes for number sequences is available here. The interesting data sets are Dataset 4 and Dataset 5, which deal with sequences of numbers. Most hashing functions (and MD5 more so) seem to be distributed satisfactorily from the viewpoint of partitioning.
AWS have confirmed that using an incrementing integer ID will create an even workload:
If you are using incrementing numbers as the hash key, they will be distributed equally among the hash key space.
Source: https://forums.aws.amazon.com/thread.jspa?threadID=189362&tstart=0
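For illustration only (DynamoDB's internal partitioning hash is not published), here is a quick Python sketch showing how a cryptographic hash such as MD5 scatters consecutive integer IDs; adjacent IDs produce completely unrelated digests, which is the property that spreads a sequential workload across the key space:

```python
import hashlib

# Illustration only: DynamoDB's internal partitioning hash is not published.
# MD5 here just demonstrates that adjacent integer IDs map to completely
# unrelated digests, which spreads a sequential workload across the key space.
for item_id in (3000, 3001, 3002, 3003):
    digest = hashlib.md5(str(item_id).encode("utf-8")).hexdigest()
    print(item_id, digest)
```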
Related
I'm designing a database schema where there are only 2 partition keys; however, both partitions will be accessed equally: every time I access partition A, I will also access partition B. Will this design suffer from hot partitioning?
From my understanding, hot partitioning occurs because of uneven accesses to the partitions resulting in some partitions being "hot". So I'm thinking cardinality does not matter as long as the accesses are even. Is this correct?
Don't worry about uneven access here.
Put very simply, each partition key may have a certain amount of underlying capacity allocated to it. The more partition keys, the easier it is for the system to add capacity as you need it. Does your database need a lot of capacity? If so, you may want to design with a different (more diverse) key strategy. If your needs are basic, don't overthink it and just get going.
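If you do end up needing a more diverse key strategy, one common approach is write sharding: append a bounded random suffix to the partition key on writes and fan reads out across the suffixes. A minimal sketch, with hypothetical key names and shard count:

```python
import random

# Hypothetical write-sharding sketch: a single hot logical key is split into
# NUM_SHARDS sub-keys. Writes pick a suffix at random; reads query every
# suffix and merge the results.
NUM_SHARDS = 10

def write_key(logical_key: str) -> str:
    return f"{logical_key}#{random.randint(0, NUM_SHARDS - 1)}"

def read_keys(logical_key: str) -> list[str]:
    return [f"{logical_key}#{i}" for i in range(NUM_SHARDS)]

# write: put_item(pk=write_key("A"), ...)
# read:  query each key from read_keys("A") and merge the items
```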
According to the listing documentation it is possible to navigate a large number of keys as though they were hierarchical. I am planning to store a large number of keys (let's say a few hundred million), distributed over a sensibly sized 'hierarchy'.
What is the performance of using a prefix and delimiter? Does it require a full enumeration of keys at the S3 end, and therefore be an O(n) operation? I have no idea whether keys are stored in a big hash table, or whether they have indexing data structures, or if they're stored in a tree or what.
I want to avoid the situation where I have a very large number of keys and navigating the 'hierarchy' suddenly becomes difficult.
So if I have the following keys:
abc/def/ghi/0
abc/def/ghi/1
abc/def/ghi/...
abc/def/ghi/100,000,000,000
Will it affect the speed of the query Delimiter='/', Prefix='abc/def'?
Aside from the Request Rate and Performance Considerations document that Sandeep referenced (which is not applicable to your use case), AWS hasn't publicized very much regarding S3 performance. It's probably private intellectual property. So I doubt you'll find very much information unless you can get it somehow from AWS directly.
However, some things to keep in mind:
Amazon S3 is built for massive scale. Millions of companies are using S3 with millions of keys in millions of buckets.
AWS promotes the prefix + delimiter as a very valid use case.
There are common data structures and algorithms used in computer science that AWS is probably using behind the scenes to efficiently retrieve keys. One such data structure is called a Trie or Prefix Tree.
Based on all of the above, chances are that it's much better than an O(n) algorithm when you retrieve a listing of keys. I think you are safe to use prefixes and delimiters for your hierarchy.
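As a concrete sketch of that use case, here is roughly what the prefix + delimiter listing from the question looks like with boto3; the bucket name is made up:

```python
import boto3

# Sketch of the prefix + delimiter listing from the question (bucket name is
# hypothetical). The paginator handles result sets larger than the
# per-request limit of 1,000 keys.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="my-bucket", Prefix="abc/def", Delimiter="/"):
    # "Sub-directories" one level below the prefix come back as CommonPrefixes.
    for cp in page.get("CommonPrefixes", []):
        print("prefix:", cp["Prefix"])
    # Objects directly under the prefix (not grouped by the delimiter).
    for obj in page.get("Contents", []):
        print("key:", obj["Key"])
```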
As long as you are not using a continuous sequence (such as dates 2016-08-13, 2016-08-14 and so on) in the prefix, you shouldn't face any problem. If your keys are auto-generated as a continuous sequence, then prepend a randomly generated prefix to the keys (aidk-2016-08-13, ujlk-2016-08-14).
The amazon documentation says:
Amazon S3 maintains an index of object key names in each AWS region. Object keys are stored in UTF-8 binary ordering across multiple partitions in the index. The key name dictates which partition the key is stored in. Using a sequential prefix, such as timestamp or an alphabetical sequence, increases the likelihood that Amazon S3 will target a specific partition for a large number of your keys, overwhelming the I/O capacity of the partition. If you introduce some randomness in your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.
http://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html
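One possible way to implement the randomised-prefix idea is sketched below; deriving the prefix from a hash of the key itself keeps it reproducible, and the four-character length is an arbitrary choice:

```python
import hashlib

# Sketch only: derive a short pseudo-random prefix from the key itself so the
# full key stays reproducible. Four hex characters is an arbitrary choice.
def prefixed_key(sequential_key: str) -> str:
    prefix = hashlib.md5(sequential_key.encode("utf-8")).hexdigest()[:4]
    return f"{prefix}-{sequential_key}"

print(prefixed_key("2016-08-13"))  # something like "xxxx-2016-08-13"
```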
Amazon indicates that prefix naming strategies, such as randomized hashing, no longer influence S3 lookup performance.
https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html
I see "BatchGetItem" API from AWS, that can be used to retrieve multiple items from dynamodb. But the capacity consumed is n units for n items even if they are just 10 bytes long. Is there a cheaper way of retrieving those items? I see the "Query" API but it probably doesn't support ORing of the keys and even the IN operator.
The cheapest way to get a large number of small items out of DynamoDB is a Query.
BatchGetItem costs the same as the equivalent number of GetItem calls; only the latency is lower. But if you have a large number of items to get in different partitions, and you are working in a language that handles parallelism well, you'll actually get better performance doing a large number of parallel GetItem calls.
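A rough sketch of the parallel-GetItem approach in Python; the table name, key schema and worker count are all assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

# Sketch of parallel GetItem calls (table and key names are hypothetical).
# The cost is the same as BatchGetItem; the point is latency. boto3 clients
# are thread-safe, so one client can be shared across the worker threads.
dynamodb = boto3.client("dynamodb")

def get_one(item_id: int):
    resp = dynamodb.get_item(
        TableName="my-table",
        Key={"item_id": {"N": str(item_id)}},
    )
    return resp.get("Item")

with ThreadPoolExecutor(max_workers=25) as pool:
    items = [i for i in pool.map(get_one, range(1000, 1100)) if i is not None]
```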
One way to re-organize your data would be to store all the small items under the same partition key and use a Query to get all the items inside that partition. You'll have to keep the per-partition write throughput limits in mind, but if you're adding new items infrequently and you're more worried about reads, this should work well.
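A minimal sketch of that layout, assuming a hypothetical batch_id partition key; a paginated Query pulls back the whole group:

```python
import boto3
from boto3.dynamodb.conditions import Key

# Sketch of the reorganised layout: all the small items share one partition
# key ("batch_id" is a hypothetical name) and a Query pulls the whole group,
# charged by the total size of data returned rather than the item count.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")

items = []
kwargs = {"KeyConditionExpression": Key("batch_id").eq("batch-42")}
while True:
    resp = table.query(**kwargs)
    items.extend(resp["Items"])
    if "LastEvaluatedKey" not in resp:
        break
    kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]
```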
In a Query, DynamoDB charges based on the size of data returned, irrespective of the number of items returned. But this seems to be the only place where these economics apply.
Based on what I know of DynamoDB, this makes sense. In a BatchGetItems call, the items can easily be from different partitions (and even different tables), so there's not much efficiency — each item must be individually looked up and fetched from its partition on the DynamoDB network. If all the keys in the batch are from the same partition and sequential on the range key, that's a lucky coincidence, but DynamoDB can't exploit it.
In a Query on the other hand, DynamoDB has perfect knowledge that it's only going to be talking directly to one single partition, and doing a sequential read inside that partition, so that call is much simpler and cheaper with savings passed on to customers.
I want to put a large number of items into dynamodb (probably around 100k per day. But this could scale upwards in the future).
A small percentage of these will get a lot more hits than the others (not sure on the exact figure, say 2%-5%). I won't be able to determine which in advance.
The hash key for each is simply a unique positive integer (item_id), and I need the range key to be a Unix timestamp.
The problem is, will I run into a hot key situation with this setup? I'm not sure whether a partition is created for every single hash key value, or whether hash keys are hashed into different partitions.
If it's the latter I should be safe, because the items with more hits will be randomly distributed across the partitions. But if it's the former, then some partitions will get a lot more hits than others.
Don't be discouraged; no DynamoDB table has the perfectly distributed access pattern the documentation suggests. You'll have some hot spots; it's normal and OK. You may have to increase your read/write throughput to accommodate the hot spots, and depending on how hot they are, that might make a difference in cost. But at the modest throughput levels you describe, it isn't going to make DynamoDB unusable or anything.
I recommend converting your capacity requirements into the per-second throughput metrics DynamoDB uses (a rough conversion sketch follows the questions below). Will the 100,000 per day really be evenly distributed at roughly 1 per second?
How many are reads vs. writes?
How big are they in 1K capacity chunks?
Is there a big difference between peak and trough usage?
Can caching be employed to smooth out the read pattern?
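A back-of-the-envelope conversion, with the item size and peak-to-average ratio purely assumed for illustration:

```python
import math

# Back-of-the-envelope only: item size and peak-to-average ratio are assumed.
ITEMS_PER_DAY = 100_000
SECONDS_PER_DAY = 24 * 60 * 60
ITEM_SIZE_KB = 0.5        # assumed average item size
PEAK_FACTOR = 5           # assumed peak-to-average ratio

writes_per_second = ITEMS_PER_DAY / SECONDS_PER_DAY   # ~1.2/s if perfectly flat
wcu_per_write = math.ceil(ITEM_SIZE_KB / 1.0)          # each write of up to 1 KB = 1 WCU

print(f"average: {writes_per_second:.2f} writes/s -> ~{writes_per_second * wcu_per_write:.1f} WCU")
print(f"peak:    ~{writes_per_second * wcu_per_write * PEAK_FACTOR:.1f} WCU")
```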
Yes, the hash keys will be distributed across partitions. Partitions do not correspond to individual items, but to allocations of read/write capacity and storage (Understanding Partition Behavior).
So I see that a yelp url looks like this
http://www.yelp.com/biz_attribute?biz_id=doldrYLTdR9aYckHIsv55Q
The biz_id is a hash-like string, instead of the more commonly seen integer or Mongo ObjectID. Aside from obfuscation, are there other reasons why one would use a hash as an ID instead of the ID in the database?
One can imagine several reasons, but a good one that relates to MongoDB is using a hash as the ID for a shard key. A good shard key is one that distributes writes across separate shards, thereby achieving good write scaling. A hash is a good way to ensure writes are well distributed to separate shards.
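For example, a minimal sketch of sharding on a hashed key with pymongo; the database, collection and field names are hypothetical, and it assumes you are connected to a mongos router in a sharded cluster:

```python
from pymongo import MongoClient

# Hypothetical sketch: database, collection and field names are made up, and
# the connection must go through a mongos router in a sharded cluster.
# MongoDB hashes the field value, so even monotonically increasing IDs end
# up spread evenly across shards.
client = MongoClient("mongodb://localhost:27017")

client.admin.command("enableSharding", "yelp")
client.admin.command(
    "shardCollection",
    "yelp.businesses",
    key={"biz_id": "hashed"},
)
```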