UUID as primary key in DynamoDB -- good or bad idea? - amazon-web-services

In a new DynamoDB table, my use cases are already fulfilled by the following key schema design:
partition key: user_id
sort key: entity_id
Basically, access patterns are:
Get specific post by a specific user.
Get specific comment by a specific user.
List all posts by specific user.
List all comments by specific user.
List all entities (post or comment) by a specific user.
What benefits do I get if I use a more random ID as partition key instead and simply use GSIs for my access patterns above?
partition key: pseudo_random_id (in reality this will be a UUID; the name in the illustration is just a placeholder)
GSI:
partition key: user_id
sort key: entity_id
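For concreteness, here is a minimal boto3 sketch of the UUID-keyed variant being asked about; the "Entities" table name and "user-entity-index" GSI name are placeholders, not part of the question:

import boto3

dynamodb = boto3.client("dynamodb")

# Base table keyed only by the random ID; the original access patterns
# are served by a GSI on (user_id, entity_id).
dynamodb.create_table(
    TableName="Entities",
    AttributeDefinitions=[
        {"AttributeName": "pseudo_random_id", "AttributeType": "S"},
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "entity_id", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "pseudo_random_id", "KeyType": "HASH"}],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "user-entity-index",
            "KeySchema": [
                {"AttributeName": "user_id", "KeyType": "HASH"},
                {"AttributeName": "entity_id", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)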

You don’t need UUIDs or any pseudo-random ID.
It was once possible that you could have a hot partition if one user is particularly active, but hot partitions are basically a non-issue now because of DynamoDB’s adaptive capacity. Furthermore, you should probably be limiting how fast users can create comments/posts, which would prevent hot partitions even if adaptive capacity didn’t exist.
(Why should you limit the rate a user can post? You don’t want a malicious actor to be able to create a new post every few milliseconds—you should have some sort of rate limit as a protection against denial of service attacks.)
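As a rough sketch of such a rate limit (my own illustration, not from the original answer), one option is a fixed-window counter kept in DynamoDB itself; the "RateLimits" table name and the 10-posts-per-minute limit are made up:

import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

def allow_post(user_id: str, limit: int = 10) -> bool:
    window = int(time.time()) // 60  # current one-minute window
    try:
        dynamodb.update_item(
            TableName="RateLimits",
            Key={"pk": {"S": f"{user_id}#{window}"}},
            # Increment the counter, but only while it is under the limit.
            UpdateExpression="ADD post_count :one",
            ConditionExpression="attribute_not_exists(post_count) OR post_count < :limit",
            ExpressionAttributeValues={
                ":one": {"N": "1"},
                ":limit": {"N": str(limit)},
            },
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # user is over the limit for this window
        raise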

Using a UUID doesn't do anything for you...
It doesn't matter how random the partition key is. All that matters is how many distinct partition keys you have and the volume/velocity of entries for that partition key.
In other words, a unique value is a unique value. Dynamo doesn't care if it's 16 bytes, 36 bytes or 128 bytes.
Dynamo applies its own hash to the partition key to determine which partition the data will be placed in.

If you are looking for unique + sequence numbers in DynamoDB, it is worth reading about atomic counters as an option. An atomic counter maintains a counter value in a table item. It can become a problem for a high-load application that requests IDs, though, because UpdateItem calls are serialized per item.
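For illustration, a minimal boto3 sketch of the atomic-counter pattern; the "Counters" table and attribute names are assumptions:

import boto3

dynamodb = boto3.client("dynamodb")

def next_sequence_number(counter_name="post_id"):
    # UpdateItem with ADD increments the counter and returns the new value
    # in one round trip. These updates serialize per item, which is the
    # contention caveat mentioned above.
    resp = dynamodb.update_item(
        TableName="Counters",
        Key={"pk": {"S": counter_name}},
        UpdateExpression="ADD current_value :one",
        ExpressionAttributeValues={":one": {"N": "1"}},
        ReturnValues="UPDATED_NEW",
    )
    return int(resp["Attributes"]["current_value"]["N"])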

Related

Parallel insertion in DynamoDB

I have 3 clients. All of them want to insert items in the same database.
Whenever a client sends a request,
I need to read the last entered record in ddb.
Increase its id by 1.
Push this new request in the ddb with the increased id.
What's the best AWS-based architecture to implement this?
What if there were 100 clients?
What use is it to have an increasing Id as a partition key, assuming that's the use-case?
Unlike relational databases, where this would be a reasonable pattern, in a key-value store it typically is not, as it leads to difficulty reading the data back.
My suggestion would be to use a meaningful id that is already known to your application, so you can read the items back efficiently. If those known values are not unique on their own, add a sort key; the partition key and sort key together form the composite primary key that defines uniqueness.
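As a concrete illustration of that suggestion, a minimal boto3 sketch; the "Requests" table and attribute names are placeholders, not from the original answer:

import boto3

dynamodb = boto3.client("dynamodb")

def insert_item(client_id: str, request_id: str, payload: str) -> None:
    dynamodb.put_item(
        TableName="Requests",
        Item={
            "client_id": {"S": client_id},    # partition key known to the app
            "request_id": {"S": request_id},  # sort key, e.g. a UUID
            "payload": {"S": payload},
        },
        # Fail instead of silently overwriting an existing item with the
        # same composite primary key.
        ConditionExpression="attribute_not_exists(client_id)",
    )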

Query all users in DynamoDB with a single-table design

I have a simple single-table design that I want to keep flexible for the future. I currently have two entity types: users and videos. Users have a 1:n relationship to videos.
The table's partition key is pk and sort key is sk.
Users: pk=u#<id> and sk=u#<id>, entityType: user
Videos: pk=u#<id> and sk=v#<id>, entityType: video
If I want to fetch all users, does it make sense to create a GSI with PK=entityType and SK=sk?
No, because then all user writes would go to the same partition key, which isn't ideal. Instead, set up a GSI with a GSI1PK attribute holding your user ID, and do a scan against it. Project in the essential attributes, and set GSI1PK only on user entity types so it's a sparse GSI.
That is one approach you could take and it would get the job done, but it has a few drawbacks/side effects:
You would also replicate all videos in that GSI, which increases the storage and throughput cost of it
You would create a potentially huge item collection that contains all users, which could lead to a hot partition and may not scale well.
Instead, consider splitting up the huge user partition in the GSI into multiple ones with predictable keys.
If you plan to list your users by username later, you could take the first letter of their username as the partition key and thereby create around 26 (depending on capitalization and character set) different partitions, which would spread out the load a lot better. To list all users, you'd have to issue queries on all the partitions, which is annoying at small sizes, but will be more scalable.
Another option would be to define that you want to spread the users out among n partitions and then use something like hash(user_id) mod n to get a partition key for the GSI. That way you'd have to do n queries to get the values of all partitions.
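A rough sketch of that hash-and-mod approach; the GSI1/GSI1PK names, the table name, and the shard count are assumptions for illustration:

import hashlib
import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 8
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("app-table")

def gsi1pk_for_user(user_id: str) -> str:
    # Deterministic shard assignment: hash the user id, then mod n.
    shard = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_SHARDS
    return f"users#{shard}"

def list_all_users():
    # Listing every user means one Query per shard partition
    # (pagination via LastEvaluatedKey omitted for brevity).
    users = []
    for shard in range(NUM_SHARDS):
        resp = table.query(
            IndexName="GSI1",
            KeyConditionExpression=Key("GSI1PK").eq(f"users#{shard}"),
        )
        users.extend(resp["Items"])
    return users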

DynamoDB partitions and custom sharding

Let's assume I have some data I want to store in my DynamoDB table, and I want to use the following structure as the primary key: {timestamp}_{short_uuid}, e.g. "1643207769_123423-ab31d-12345d-12355". I want to ensure good distribution of these items across the partitions.
I'm wondering if "enforcing" sharding of the data by introducing the hash-range key with a specific range (like 1-20) is a good idea? This means my Primary Key would consist of:
"Partition Key" = "range(1-20)" and "Sort Key": "{timestamp}_{short_uuid}".
In other words, will the hash-range key provide better distribution than a simple partition key (despite the high cardinality in my example)? Ultimately, I don't care which partition an item ends up on; I just want to avoid a potential hot-partition problem.
With thanks to Alex DeBrie's Everything you need to know about DynamoDB Partitions for much of this information.
Some NoSQL databases expose the partition hashing algorithm and/or the cluster topology, but DynamoDB does not. So, you don't know what it is and you can't control it.
Prior to 2018 you needed to be much more aware of how your items were sharded because DynamoDB shared your table's provisioned read/write capacity evenly across all partitions.
In 2018, AWS introduced adaptive capacity and made it instant in May 2019. So, now your table's provisioned read/write capacity shifts to the partitions where it's needed and, as well as being able to add new partitions as needed, DynamoDB will also split highly-active partitions to provide consistent performance.
The upshot is that as long as you stay within an individual partition's size and throughput limits, you should not worry about primary keys too much.
DynamoDB's hash function (which AWS does not disclose) will distribute items better than you can, because DynamoDB is aware of its own topology (and your proposed range(1-20) scheme gives the partition key low cardinality).
I'm not sure about your usage, but if you want sorting, use a sort key.
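For example, a hedged sketch of that sort-key approach: give items a natural partition key and keep the timestamp in the sort key, then query a time range. The "Events" table and attribute names are made up for illustration:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
events = dynamodb.Table("Events")

# Fetch all items for one source within a time window; DynamoDB returns
# them ordered by the sort key.
resp = events.query(
    KeyConditionExpression=(
        Key("source_id").eq("sensor-42")
        & Key("created_at").between("1643200000", "1643299999")
    )
)
for item in resp["Items"]:
    print(item["created_at"])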

How do I avoid hot partitions when using DynamoDB row-level access control?

I’m looking at adding row-level permissions to a DynamoDB table using dynamodb:LeadingKeys to restrict access per Provider ID. Currently I have only one provider ID, but I know I will have more. However, the providers will vary in size, with those sizes being very unbalanced.
If I use Provider ID as my partition key, it seems to me like my DB will end up with very hot partitions for the large providers and mostly unused ones for the smaller providers. Prior to adding the row-level access control, I was using deviceId as the partition key; since it is effectively random, it partitions well. But now I think I have to move it to the sort key.
Current partitioning that works well:
HASHKEY: DeviceId
With permissions I think I need to go to:
HASHKEY: ProviderID (only a handful of them)
RangeKey: DeviceId
Any suggestions as to a better way to set this up?
In general, you no longer need to worry about hot partitions in DynamoDB, especially if the partition keys which are being requested the most remain relatively constant.
More Info: https://aws.amazon.com/blogs/database/how-amazon-dynamodb-adaptive-capacity-accommodates-uneven-data-access-patterns-or-why-what-you-know-about-dynamodb-might-be-outdated/
Expanding on Michael's comment...
If you don't need a range key now...why add one?
The only reason to have a range key is that you need to Query DDB and return multiple records.
If all you ever need is a single record using GetItem, then you don't need a range key.
Simply concatenate ${ProviderId}.${DeviceId} together to make up your hash key.
Edit
Since you want to be able to list device Ids for a single provider, then you do need providerID as the partition key and deviceID as the range key.
As Icehorn's answer mentions, "hot partitions" aren't as big a deal as they used to be. Unless you expect the data for a single providerID to go over 10 GB, I'd start with the simple implementation of hashKey(providerID).
If you expect more than 10 GB of data, or you end up with a hot partition, then consider concatenating an integer (1..n) to the providerID.
This will mean that you'd have to query multiple partitions to get all the deviceIDs.
This approach is detailed in Multi-Tenant SaaS Storage Strategies.
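A rough sketch of that shard-suffix idea; the "Devices" table name, attribute names, and shard count are illustrative:

import boto3
from boto3.dynamodb.conditions import Key

NUM_SHARDS = 4
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Devices")

def list_devices(provider_id: str):
    # One large provider is spread across n partition keys, so listing all
    # of its deviceIDs means fanning out one Query per shard
    # (pagination omitted for brevity).
    devices = []
    for shard in range(1, NUM_SHARDS + 1):
        resp = table.query(
            KeyConditionExpression=Key("pk").eq(f"{provider_id}#{shard}")
        )
        devices.extend(resp["Items"])
    return devices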

DynamoDB table/index schema design for querying multi-valued attributes

I'm building a DynamoDB app that will eventually serve a large number (millions) of users. Currently the app's item schema is simple:
{
  userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
  email: "foo@foo.com",
  ... other attributes ...
}
When a new user signs up, or if a user wants to find another user by email address, we'll need to look up users by email instead of by userId. With the current schema that's easy: just use a global secondary index with email as the Partition Key.
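For reference, that single-email lookup is just a Query against the email GSI; a minimal boto3 sketch, with the "Users" table and "email-index" names assumed:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")

# Look up a user by email via the GSI instead of scanning the table.
resp = users.query(
    IndexName="email-index",
    KeyConditionExpression=Key("email").eq("foo@foo.com"),
)
user = resp["Items"][0] if resp["Items"] else None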
But we want to enable multiple email addresses per user, and the DynamoDB Query operation doesn't support a List-typed KeyConditionExpression. So I'm weighing several options to avoid an expensive Scan operation every time a user signs up or wants to find another user by email address.
Below is what I'm planning to change to enable additional emails per user. Is this a good approach? Is there a better option?
Add a sort key column (e.g. itemTypeAndIndex) to allow multiple items per userId.
{
  userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
  itemTypeAndIndex: "main", // sort key
  email: "foo@foo.com",
  ... other attributes ...
}
If the user adds a second, third, etc. email, then add a new item for each email, like this:
{
  userId: "08074c7e0c0a4453b3c723685021d0b6", // partition key
  itemTypeAndIndex: "Email-2", // sort key
  email: "bar@bar.com"
  // no more attributes
}
The same global secondary index (with email as the Partition Key) can still be used to find both primary and non-primary email addresses.
If a user wants to change their primary email address, we'd swap the email values in the "primary" and "non-primary" items. (Now that DynamoDB supports transactions, doing this will be safer than before!)
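A hedged sketch of that transactional swap with boto3, mirroring the item layout above (the "Users" table name is an assumption):

import boto3

dynamodb = boto3.client("dynamodb")

def swap_primary_email(user_id: str, old_primary: str, new_primary: str) -> None:
    # Exchange the email values on the "main" and "Email-2" items in one
    # atomic TransactWriteItems call; the conditions guard against a
    # concurrent change to either item.
    dynamodb.transact_write_items(
        TransactItems=[
            {
                "Update": {
                    "TableName": "Users",
                    "Key": {
                        "userId": {"S": user_id},
                        "itemTypeAndIndex": {"S": "main"},
                    },
                    "UpdateExpression": "SET email = :new",
                    "ConditionExpression": "email = :old",
                    "ExpressionAttributeValues": {
                        ":new": {"S": new_primary},
                        ":old": {"S": old_primary},
                    },
                }
            },
            {
                "Update": {
                    "TableName": "Users",
                    "Key": {
                        "userId": {"S": user_id},
                        "itemTypeAndIndex": {"S": "Email-2"},
                    },
                    "UpdateExpression": "SET email = :old",
                    "ConditionExpression": "email = :new",
                    "ExpressionAttributeValues": {
                        ":new": {"S": new_primary},
                        ":old": {"S": old_primary},
                    },
                }
            },
        ]
    )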
If we need to delete a user, we'd have to delete all the items for that userId. If we need to merge two users then we'd have to merge all items for that userId.
The same approach (new items with same userId but different sort keys) could be used for other 1-user-has-many-values data that needs to be Query-able
Is this a good way to do it? Is there a better way?
Justin, for searching on attributes I would strongly advise against using DynamoDB alone. I am not saying you can't achieve this; however, I see a few problems that will eventually come up if you go this route.
Using a sort key on email-id will result in duplicate records for the same user, i.e. if a user has registered 5 emails, that implies 5 records in your table with the same schema and attributes except the email-id attribute.
What if a new use case comes along in the future where you also want to search for a user by some other attribute (for example a cell phone number, assuming a user may have more than one cell phone number)?
DynamoDB limits the number of secondary indexes you can create for a table (a hard limit of 5 local secondary indexes, and a default quota of 20 global secondary indexes).
Thus, as your search criteria grow, this solution will easily become a bottleneck for your system; as a result, it may not scale well.
To the best of my knowledge, I can suggest a few options that you may choose from, based on your requirements and budget, to address this problem using a combination of databases.
Option 1. DynamoDB as a primary store and AWS Elasticsearch as secondary storage [Preferred]
Store the user records in a DynamoDB table (let's call it UserTable) as and when users register.
Enable DynamoDB Streams on the UserTable table.
Build an AWS Lambda function that reads from the table's stream and persists the records in AWS Elasticsearch (a sketch follows this list).
Now, in your application, use DynamoDB for fetching user records by id. For all other search criteria (like searching on emailId, phone number, zip code, location, etc.) fetch the records from AWS Elasticsearch. AWS Elasticsearch by default indexes all the attributes of your record, so you can search on any field with millisecond latency.
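A minimal sketch of such a stream-consuming Lambda; the Elasticsearch endpoint and index name are placeholders, and production code would also sign requests (e.g. with SigV4) and handle REMOVE events:

import json
import urllib.request

ES_ENDPOINT = "https://search-mydomain.us-east-1.es.amazonaws.com"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            user_id = new_image["userId"]["S"]
            # Naive flattening of the DynamoDB attribute-value map into
            # plain JSON; sufficient for simple S/N attributes.
            doc = {k: list(v.values())[0] for k, v in new_image.items()}
            req = urllib.request.Request(
                url=f"{ES_ENDPOINT}/users/_doc/{user_id}",
                data=json.dumps(doc).encode(),
                headers={"Content-Type": "application/json"},
                method="PUT",
            )
            urllib.request.urlopen(req)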
Option 2. Use AWS Aurora [Less preferred solution]
If your application has a relational use-case where data are related, you may consider this option. Just to call out, Aurora is a SQL database.
Since this is a relational storage, you can opt for organizing the records in multiple tables and join them based on the primary key of those tables.
I suggest the 1st option because:
DynamoDB will provide you durable, highly available, low latency primary storage for your application.
AWS Elasticsearch will act as secondary storage, which is also durable, scalable and low latency storage.
With AWS Elasticsearch, you can run any search query on your data, and you can also do analytics on it. A Kibana UI is provided out of the box, which you can use to plot analytical data on a dashboard (how user growth is trending, how many users belong to a specific location, user distribution by city/state/country, etc.).
With DynamoDB Streams and AWS Lambda, you will be syncing these two databases in near real time [within a few milliseconds].
Your application will be scalable and the search feature can further be enhanced to do filtering on multi-level attributes. [One such example: search all users who belong to a given city]
Having said that, now I will leave this up to you to decide. 😊