DynamoDB query all users sorted by name

I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
Users can number in the millions; projects per user can number in the thousands.
My access pattern is also rather simple:
1. Get a user by id
2. Get a paginated list of users sorted by name or creation date
3. Get a project by id
4. Get projects by user, sorted by date
My single table for this data model is the following:
I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated sorted users, as bucket PKs would possibly need to be random, and I would have to query all buckets to be able to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all the docs and best practices and watched many online videos, I find myself stuck with the most basic use case, which DynamoDB does not seem to support well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.
What would others do in this case? I would really like to use DynamoDB for all the benefits that it gives, especially in terms of cost.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.

I also thought that bucket PKs could be alphabetical letters, but
that could create hot partitions as well, as the letter "A" would
probably be hit quite hard.
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
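To make the name-prefix idea a bit more concrete, here is a minimal in-memory sketch in Python. Everything in it is an assumption made for illustration: the `USERS#` key format, the `sort_key`, `bucket_key`, `build_index` and `next_page` helpers, and the plain dict standing in for the table. It normalizes each name into a sort key, uses the first letter(s) of that sort key as the bucket, and pages through buckets in alphabetical order.

```python
import unicodedata
from collections import defaultdict

PREFIX_LENGTH = 1  # assumption: widen to 2 or 3 if single letters are too skewed


def sort_key(name):
    """Normalize a name so that 'Émile' and 'emile' sort together."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Upper-case and drop accents, spaces and punctuation.
    return "".join(ch for ch in decomposed.upper() if ch.isalnum())


def bucket_key(name, prefix_length=PREFIX_LENGTH):
    """Hypothetical partition key: the first letter(s) of the sort key."""
    return "USERS#" + sort_key(name)[:prefix_length]


def build_index(names):
    """In-memory stand-in for the table: bucket -> rows ordered by sort key."""
    index = defaultdict(list)
    for name in names:
        index[bucket_key(name)].append((sort_key(name), name))
    for rows in index.values():
        rows.sort()
    return index


def next_page(index, page_size, start_after=None):
    """Return (names, cursor); pass the cursor back in to get the next page."""
    page = []
    for bucket in sorted(index):  # buckets are visited in alphabetical order
        for key, name in index[bucket]:
            if start_after is not None and key <= start_after:
                continue  # already returned on an earlier page
            page.append(name)
            if len(page) == page_size:
                return page, key
    return page, None
```

Because the bucket is a prefix of the sort key, walking the buckets alphabetically yields a globally sorted list without merging. Against a real table you would presumably jump straight to the cursor's bucket rather than skipping rows, and append the user id to the sort key so that two users with the same name do not collide.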
You might also consider using a tool like Elasticsearch, which could support your second access pattern and more.

Related

DynamoDB Schema - Retrieve All Users

I am trying to model my data. As you see the partition key is the user email. In the global secondary index I have a PK of "US", which stands for "User". If I want to get all of the enabled users I just have to query the GSI where GSI1PK = "US" and GSI1SK Starts with "Enabled".
My concern is that all of the users in the app would have the same GSI1PK. Will this be a problem? Can a GSI's PK have problems with "hot partitions"? I have been Googling this and I do not see a clear answer. There is only one answer here on StackOverflow that says it will be a problem, but there are other places that say it will not. I am kind of confused.
What would be the best way to structure the data in my table so I can access all of the users without causing hot partition issues?
Placing a potentially large item collection in a single partition will likely lead to a hot partition. Ideally, your chosen partition keys evenly distribute data across partitions. However, it may not always be clear how to achieve this.
You might consider splitting your large partition into smaller partitions on write (aka write sharding), and re-combining them when reading. For example, when creating GSIPK, you could introduce a randomly generated integer between 1 and 4 in the partition key:
And your GSI would look like this
Now your User data is more evenly distributed across partitions. When reading users from your table, you would pull from all the partitions at once. This could be done in parallel for faster performance.
In this example, I chose a random number to "write shard" the data into separate partitions. However, your data may lend itself to a more natural division (e.g. by country, enabled status, time zone, etc). What I want to highlight is that your strategy to distribute data across partitions can be separate from the data model you use to support your application access patterns.
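As a rough illustration of the write-sharding idea, here is a small Python sketch. It does not call DynamoDB: `FakeGsi` is a hypothetical in-memory stand-in for the index, and the `US#<n>` key format, `sharded_gsi_pk` and `query_all_shards` are names I made up for the example. It picks a random shard between 1 and 4 on write, then queries all four shards in parallel and merges the already-sorted results on read.

```python
import heapq
import random
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

SHARD_COUNT = 4  # matches the "between 1 and 4" example above


def sharded_gsi_pk(rng=random):
    """Hypothetical GSI partition key, e.g. 'US#3'."""
    return f"US#{rng.randint(1, SHARD_COUNT)}"


class FakeGsi:
    """In-memory stand-in for a GSI: each partition keeps rows ordered by sort key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def put(self, pk, sk, item):
        self._partitions[pk].append((sk, item))
        self._partitions[pk].sort(key=lambda row: row[0])

    def query(self, pk, sk_prefix=""):
        rows = self._partitions.get(pk, [])
        return [(sk, item) for sk, item in rows if sk.startswith(sk_prefix)]


def query_all_shards(gsi, sk_prefix=""):
    """Query every shard in parallel, then merge the sorted per-shard results."""
    pks = [f"US#{n}" for n in range(1, SHARD_COUNT + 1)]
    with ThreadPoolExecutor(max_workers=SHARD_COUNT) as pool:
        per_shard = list(pool.map(lambda pk: gsi.query(pk, sk_prefix), pks))
    merged = heapq.merge(*per_shard, key=lambda row: row[0])
    return [item for _, item in merged]
```

Something like `query_all_shards(gsi, "Enabled")` would then return all enabled users in sort-key order, assuming sort keys of the form `Enabled#<email>`.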

Should I really use one DynamoDB table for all data?

The DynamoDB best practice documentation has this line:
You should maintain as few tables as possible in a DynamoDB application. Most well designed applications require only one table.
It's the last line that confuses me the most.
Take an example photo storage application. Does this mean that I should store user accounts (account ID, password, email) and photos (owner ID, photo location, metadata) in the same table?
If so I assume the primary key should be the account/owner ID, and the sort key would be the type of object it is (e.g. account or photo).
Should I be using one table like this instead of two tables (one for accounts, one for photos)?
It is generally recommended to use as few tables as possible, and very often a single table unless you have a really good reason to use more than one. Chances are you won't have a good reason to use more than one - except for old habits.
It seems counter-intuitive if you are coming from a traditional database background (like me), but it is in fact best practice.
The primary key could become a combination of the 'row'/object type and another value, stored in a single field, i.e. 'account#12345' for an account object with a unique id of 12345 and 'photo#67890' for a photo object with a unique id of 67890.
If you are looking up an account by its id number, you would query with the 'account' prefix, and if you were looking for a photo, you would add the 'photo' prefix. This is a very simple example; your design may vary.
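A tiny Python sketch of that key convention, in case it helps. The helper names `make_key` and `parse_key` are hypothetical, and a plain dict stands in for the table.

```python
def make_key(entity_type, entity_id):
    """Build a composite primary key such as 'account#12345'."""
    return f"{entity_type}#{entity_id}"


def parse_key(key):
    """Split 'photo#67890' back into ('photo', '67890')."""
    entity_type, _, entity_id = key.partition("#")
    return entity_type, entity_id


# A dict standing in for the single table: both entity types live side by side.
table = {
    make_key("account", "12345"): {"email": "someone@example.com"},
    make_key("photo", "67890"): {"owner": "12345", "location": "photos/67890.jpg"},
}

# Looking up an account or a photo only differs in the prefix you add.
account = table[make_key("account", "12345")]
photo = table[make_key("photo", "67890")]
```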
The video recommended in the first comment on your question is excellent - watch it at 0.75 speed or slower, and watch it a few times.
The short answer is yes. But the way it would be designed would be highly specific to how your application interacts with the database.
I highly recommend that anyone still confused with how to design DynamoDB/NoSQL tables watches this video from re:Invent.

Better method for querying DynamoDB table randomly?

I've included some links along with our approaches to other answers, which seem to be the most optimal on the web right now.
Our records need to be categorized (e.g. "horror", "thriller", "tv"), and randomly accessible both in specific categories and across all/some categories. We generally need to access about 20 - 100 items at a time. We also have a smallish number of categories (fewer than 100).
We write to the database for uploading/removing content, although this is done in batches and does not need to be real time.
We have tried two different approaches, with two different data structures.
Approach 1
AWS DynamoDB - Pick a record/item randomly?
Help selecting nth record in query.
In short, using the category as a hash key, and a UUID as the sort key. Generate a random UUID, query Dynamo using greater than or less than, and limit to 1. This is even suggested by an AWS employee in the second link. (We've also tried increasing the limit to the number of items we need, but this increases the probability of the query failing the first time around).
Issues with this approach:
First query can fail if it is greater than/less than any of the UUIDs
Querying on any specific category will cause throttling at scale (Small number of partitions)
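For reference, the lookup in Approach 1 behaves roughly like the following Python sketch. It works on a sorted in-memory list rather than a real table, and `pick_nearest` is a hypothetical name. The fallback branch corresponds to the "first query can fail" issue: when the random probe is above every stored key, the sketch retries in the other direction.

```python
import bisect
import uuid


def pick_nearest(sorted_keys, probe=None):
    """Return the first key greater than a random probe, falling back to the
    closest key at or below it when the probe is above every stored key."""
    if not sorted_keys:
        return None
    if probe is None:
        probe = str(uuid.uuid4())  # random probe, as in Approach 1
    i = bisect.bisect_right(sorted_keys, probe)  # index of first key > probe
    if i < len(sorted_keys):
        return sorted_keys[i]
    # The "greater than" lookup found nothing: retry in the other direction.
    return sorted_keys[-1]
```

One thing to keep in mind is that this is not a perfectly uniform sample: a key that follows a large gap in the key space is more likely to be picked than one that follows a small gap.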
We've also considered adding a suffix to each category to artificially increase the number of partitions we have, as pointed out in the following link.
AWS Database Blog
Choosing the Right DynamoDB Partition Key
Approach 2
Amazon Web Services: How do we get random item from the dynamoDb's table?
Doing something similar to this, where we concatenate the category with a sequential number, and use this as the hash key. e.g. horror-000001.
By knowing the number of records in each category, we're able to perform random queries across our entire data set, while also avoiding hot partitions/keys.
Issues with this approach
We need a secondary data structure to manage the sequential counts across each category
Writing (especially deleting) is significantly more complex, although this doesn't need to happen in real time.
Conclusion
Both approaches solve our main use case of random queries on category/categories, but the cons they offer are really deterring us from using them. We're leaning more towards approach #1 using suffixes to solve the hot partitioning issue, although we would need the additional retry logic for failed queries.
Is there a better way of approaching this problem? Specifically looking for solutions capable of scaling well (No scan), without requiring extra resources be implemented. #1 fits the bill, but needing to manage suffixes and failed attempts really deters us from using it, especially when it is being called inside a lambda (billed for time used).
Thanks!
Follow Up
After more research and testing, my team has decided to move towards MySQL hosted on RDS for these tables. We learned that this is one of the few use cases where DynamoDB does not fit, and requires rewriting your use case to fit the DB (Bad).
We felt that the extra complexity required to integrate random sampling on DynamoDB wasn't worth it, and we were unable to come up with any comparable solutions. We are, however, sticking with DynamoDB for our tables that do not need random accessibility due to the price and response times.
For anyone wondering why we chose MySQL, it was largely due to the Node.js library available, great online resources (which DynamoDB definitely lacks), easy integration via RDS with our Lambdas, and the option to migrate to Amazon's Aurora database.
We also looked at PostgreSQL, but we weren't as happy with the client library or admin tools, and we believe that MySQL will suit our needs for these tables.
If anybody has anything else they'd like to add or a specific question please leave a comment or send me a message!
This was too long for a comment, and I guess it's pretty much a full-fledged answer now.
Approach 2
I've found that my typical time to get a single item from DynamoDB to a host in the same region is <10ms. As long as you're okay with at most 1-2 extra calls, you can quite easily implement approach 2.
If you use a keys-only GSI where the category is your hash key and the primary key of the table is your range key, you can quickly find the largest numbered single item within a category.
When you add a new item, find the largest number for that category from the GSI and then write the new item to the table with sequence number n+1.
When you delete, find the item with the largest sequence number for that category from the GSI, overwrite the item you are deleting, and then delete the now duplicated item from its position at the highest sequence number.
To randomly get an item, query the GSI to find the highest numbered item in the category, and then randomly pick a number since you now know the valid range.
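The add/delete/random-pick steps above can be sketched in Python as follows. This is an in-memory model only: `SequencedCategory` is a hypothetical class, the `count` attribute stands in for "the highest sequence number found through the GSI", and the key format follows the `horror-000001` example from the question.

```python
import random


class SequencedCategory:
    """In-memory model of one category whose items are numbered 1..count."""

    def __init__(self, category):
        self.category = category
        self.items = {}  # key such as 'horror-000001' -> payload
        self.count = 0   # stands in for the highest sequence number in the GSI

    def _key(self, n):
        return f"{self.category}-{n:06d}"

    def add(self, payload):
        """Write the new item at sequence number n + 1."""
        self.count += 1
        key = self._key(self.count)
        self.items[key] = payload
        return key

    def delete(self, key):
        """Overwrite the deleted slot with the highest-numbered item, then
        remove that item from its old position so the range stays dense."""
        if key not in self.items:
            raise KeyError(key)
        last_key = self._key(self.count)
        self.items[key] = self.items[last_key]
        del self.items[last_key]
        self.count -= 1

    def random_item(self, rng=random):
        """Pick a random sequence number in the known valid range."""
        if self.count == 0:
            return None
        return self.items[self._key(rng.randint(1, self.count))]
```

Two writers adding or deleting at the same time could race on the highest sequence number, so a real implementation would probably need conditional writes or some other guard. Since the writes here are batched and not real time, that may be acceptable.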
Approach 1
I'm not sure exactly what you mean when you say "without requiring extra resources to be implemented". If you're okay with using a managed resource (no dev work to implement), you can also make Approach 1 work by putting a DAX cluster in front of your DynamoDB table. Then you can query to your heart's content without really worrying about hot partitions. (Though the caching layer means that new/deleted items won't be reflected right away.)

How do I optimize my DynamoDB table secondary global index so that records are evenly distributed while still keeping all records sortable?

Related to this question, I'm looking for a more specific answer. In an effort to keep this non-subjective, here is a full thought process for creating an activities table with a stuck point that can be finished with a quick example answer.
In an effort to better understand DynamoDB, I'm creating a personal website that contains an activity feed from a DynamoDB table. The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part).
Different types of activities will include blog posts, projects, twitter post references, LinkedIn post references, etc. Using the activity type as a partition key would not be wise as my activity is highly weighted, mostly on the twitter side, hardly ever creating blog posts.
A unique activity id seems to be the best option for evenly distributing activities across DynamoDB partitions. However, this completely removes the ability to sort activities to start, as queries require a partition id to be known first. This is where a secondary global index (SGI) will be helpful. With this, a sort key will not be required on the primary partition key, but paired in an SGI.
This is the part where I'm stuck. What do I base the SGI partition key on? At the moment I'm thinking of a single value "activity" for all activities with a sort key of "date", but that is a single partition for all entries. Will a single SGI partition key value limit performance in this project?
Note that this is a small scale project. However, I'm thinking about large scale projects while building this one, attempting to create the best DynamoDB table possible in regards to optimized partition distribution, while still keeping it flexible for sorting all table records.
Treat a GSI (Global Secondary Index) the same way as the main table's index when designing your schema. GSIs also get read/write provisioning limits and are subject to hot-partition throttling, which puts back pressure on the main table. In other words, if your GSI gets throttled, your main table will start throttling requests too.
Will a single SGI partition key value limit performance in this project?
A single partition for the complete table is definitely a misuse of DDB's scalability.
The goal is to evenly distribute partition keys while still being able to sort across all partition keys (I'm struggling with this part).
You can sort across partitions using a GSI, but you will again need a partition key for your GSI, and if that partition key is not distributed enough, you get into the problems I mentioned above.
DDB is powerful for put/get operations if modeled right and for fairly simple queries with some filters. In general, you will utilize your throughput more efficiently as the ratio of partition key values accessed to the total number of partition key values in a table grows.
For your specific need, it's not directly possible to get a scalable solution from DDB, but we still have a few options:
Option 1:
We can model the data so that writes are fairly evenly distributed, at the cost of extra work when reading it back. This pattern is also known as Randomizing Across Multiple Partition Key Values. Since you don't need to access a specific item for a given time, this will work for us.
The idea is to create a fixed set of numbers (say 1 to 100), randomly pick one to append to the creation date (not the timestamp) to form the partition key, and use the creation timestamp as the sort key.
This will distribute your load across multiple random partitions, but it increases read complexity, as you will need to query all partitions and merge the results to get the final sorted view for that date.
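Here is a minimal Python sketch of Option 1, using a dict in place of the table. The `<date>.<n>` key format and the helper names `activity_pk`, `pks_for_date`, `write_activity` and `read_day` are assumptions for illustration. It also assumes all timestamps are in the same time zone, so their ISO strings sort chronologically.

```python
import heapq
import random

SUFFIX_COUNT = 100  # the fixed set "1 to 100" mentioned above


def activity_pk(created_at, rng=random):
    """Partition key: creation date plus a random suffix, e.g. '2024-01-15.37'."""
    return f"{created_at.date().isoformat()}.{rng.randint(1, SUFFIX_COUNT)}"


def pks_for_date(day):
    """Every partition key that may hold activities for the given date."""
    return [f"{day.isoformat()}.{n}" for n in range(1, SUFFIX_COUNT + 1)]


def write_activity(partitions, created_at, activity, rng=random):
    """Store the activity under a random partition, ordered by timestamp."""
    pk = activity_pk(created_at, rng)
    rows = partitions.setdefault(pk, [])
    rows.append((created_at.isoformat(), activity))  # timestamp is the sort key
    rows.sort(key=lambda row: row[0])
    return pk


def read_day(partitions, day):
    """Query all partitions for the date and merge them into one sorted view."""
    runs = [partitions.get(pk, []) for pk in pks_for_date(day)]
    merged = heapq.merge(*runs, key=lambda row: row[0])
    return [activity for _, activity in merged]
```

The cost is visible in `read_day`: one read per suffix for each date, which is the extra read complexity mentioned above. A smaller suffix set means fewer reads but less spreading of writes.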
Option 2:
Use multiple tables for hot and cold data, since this is time-series data. For more information, read:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.TimeSeriesDataAccessPatterns
Option 3:
Scan? Not a good choice if we talk about scalability and growing data, but for a fairly small data set it certainly helps, so I am mentioning it.
These are just examples; I am not saying any of them is a good fit for your use case.
So here is a thought exercise for you: write down all your use cases and access patterns. Figure out their importance, and which are fine with eventual consistency and which are not, and see whether DDB is a good fit for them in the first place. Don't be tempted to use DDB and then struggle with access pattern scalability.
Also read https://stackoverflow.com/a/38790120/962545 for more questions you should ask yourself before restricting yourself to the specific access pattern you want from DDB.
Don't forget to read best practices: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html

How do you implement multi-tenancy on Couchbase? Can it be performant?

I'm considering an app which will store customer data. Given the way buckets work in Couchbase, all customer data will be in one bucket. It appears that I have two choices:
Implement multi-tenancy in views, by assigning a field to each record that indicates the customer it belongs to.
Implement it by putting a factor on every key that is a customer ID.
It seems, though, that since I will be using views, I'll really want to do both. In case number 2, I need to have the data in the record so that it can be indexed on (or maybe I can pull out part of the key in the map phase and index on customer) and in option 1, I'd want it to be part of the key as a check when retrieving data to make sure I don't send the wrong customer's data down the line.
The problem is, this is a service where multiple customers will interact, and sometimes one customer will create some data and the other will view it, at the first customer's request. But putting an ACL on each record that lists everyone who's authorized to view it would be problematic, to say the least.
I bet there is a common methodology or design pattern to answer this question, and would appreciate some pointers to best practices.
I'm also concerned about the performance if the indexes are indexing both on the particular piece of relevant data, and the customer id... a large number of different customers would presumably make the indexes much less efficient. (but maybe not.)
Here are my thoughts on your questions:
[Concerning items #1 and 2] - It seems, though, that since I will be using views, I'll really want to do both.
This doesn't seem to make sense to me. In Couchbase, the map phase can include content from both the key and the value. It makes little sense to store the data in both the key and the value, as you are guaranteed to have 1:1 duplication there. Store it wherever it makes the most sense to store it; in this case, probably the value.
The problem is, this is a service where multiple customers will interact, and sometimes one customer will create some data and the other will view it, at the first customer's request. But putting an ACL on each record that lists everyone who's authorized to view it would be problematic, to say the least.
My site also has multi-tenant data stored in a single database. In my case, I use object unique identifiers as my keys. By default, customers can access all objects that belong to them (I have a user object, and the user is associated with a customer account). Users may also have additional permissions assigned to them, whereby a single object from another customer could be added to their user account, and they would thereby be granted access to view the object.
The alternative is "security through obscurity": use GUIDs as random identifiers, and give customers access to view any object whose GUID they have.
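To illustrate the permission model described above, here is a rough Python sketch. The `StoredObject` and `User` classes and the `can_view` function are hypothetical, not taken from my actual site. The point is that the grant lives on the user, not on the object.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    object_id: str    # unique identifier, also used as the document key
    customer_id: str  # the customer account that owns the object


@dataclass
class User:
    user_id: str
    customer_id: str  # the customer account the user belongs to
    # Ids of individual objects from other customers shared with this user.
    extra_object_ids: set = field(default_factory=set)


def can_view(user, obj):
    """A user sees everything owned by their own customer account, plus any
    single objects that were explicitly added to their user account."""
    if obj.customer_id == user.customer_id:
        return True
    return obj.object_id in user.extra_object_ids
```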
I would not, however, try to store the permissions on the objects themselves. That would quickly become unwieldy. You need to think about your specific use case, and decide what simple approach would work for the majority of the cases, and just not support the other 1-2% of the cases.