DynamoDB database design - amazon-web-services

I'm new to DynamoDB and NoSQL in general.
I have a users table and a notes table. A user can create notes and I want to be able to retrieve all notes associated with a user.
One solution I've thought of: every time a note is saved, its note ID is stored inside a 'notes' attribute in the user table. This will allow me to query the users table for all note IDs and then query the notes table using those IDs:
UserTable:
UserId: 123456789
notes: ['note-id-1', 'note-id-2']
NotesTable:
id: note-id-1
text: "Some note"
Is this the correct approach? The only other way I can think of is to give the notes table a userId attribute so I can then query the notes table based on that userId. Obviously this sort of approach is more relational.

I would take the approach at the end of your question: each note should have a userId attribute. Then create a global secondary index with userId as partition key and noteId as sort key. This way you can also query on userId, by doing a query on that index.
If you do it the way you suggested, you always need two queries to get the notes of a user (first get the note IDs from the user table, then query the notes table). Also, when someone has N notes you would need to do N queries; this is going to be expensive if N is large.
If you do it the way in this answer, you need one query to get all notes of a user (I'm assuming no pagination) and one to get the user information. It will never be more than 2.
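A minimal sketch of that approach in Python with boto3 (the table name "Notes" and the index name "userId-noteId-index" are illustrative assumptions, not anything defined in the question):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
notes_table = dynamodb.Table("Notes")

def get_notes_for_user(user_id):
    # One Query against the GSI returns all of the user's notes.
    response = notes_table.query(
        IndexName="userId-noteId-index",  # GSI: partition key userId, sort key noteId
        KeyConditionExpression=Key("userId").eq(user_id),
    )
    return response["Items"]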
General rule of thumb:
SQL: storage = expensive, computation = cheap
NoSQL: storage = cheap, computation = expensive
So always try to use as few queries as possible.

Related

GSI vs redundancy in DynamoDB

I have this scenario:
I have to save a lot of shops in a DynamoDB table. Every shop has an ID string, which is its PK.
Every shop has a field "category", a string that indicates its category (food, tattoo, ...).
So far everything is ok.
I have this use-case: "given a category, get all the stores of that category".
To accomplish this, two options came to my mind:
Create a GSI that has the category ID as its PK and the shop ID as an attribute.
This way, with the ID of the category I get all the IDs of the stores in that category, and then for each store ID I query the main table to get all the info of each single store (name, address, etc.).
Create in the main table items whose PK is "category_$id" (where $id is the category ID) and whose attribute is the ID of a store. As with the GSI, given a category ID I have the set of IDs of the shops, and then for each ID I execute a query on the same table to get all the info of that shop.
I wanted to know what the difference between these two options is in terms of cost/benefit, and which is the best.
They seem to me substantially the same thing (the only difference is that the first uses another table, i.e. the index, while the second uses the same table), but I await the opinion of someone more experienced than me.
One benefit of a GSI is that it will result in less management. Let's say you delete/add a record from/to the main table. This will automatically be reflected in your GSI.
In contrast, if you have two independent tables, you have to manage the synchronization between them yourself.
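For illustration, a sketch of the GSI option with boto3; the table name "Shops", the index name "category-index", the attribute names, and the billing mode are all assumptions:

import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="Shops",
    AttributeDefinitions=[
        {"AttributeName": "shopId", "AttributeType": "S"},
        {"AttributeName": "category", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "shopId", "KeyType": "HASH"}],  # main table PK
    GlobalSecondaryIndexes=[
        {
            "IndexName": "category-index",
            "KeySchema": [{"AttributeName": "category", "KeyType": "HASH"}],
            # Projecting ALL attributes means one query on the index returns
            # the full shop records, with no follow-up read per shop.
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)

Note that projecting ALL attributes into the index stores the data twice, essentially the same redundancy as option 2, except that DynamoDB maintains it for you.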

DynamoDB Many-to-Many relations

I have a problem modeling my data in DynamoDB. My app creates notes, with the possibility to share a note with another user and allow that user to update the note (as done by https://keep.google.com/).
As I need to share notes between users, I decided that my primary table key will be the identifier of a note.
Then I came up with the following data model for my DynamoDB tables:
Primary Table: (PK = NoteId, SK = Type)
Secondary Index (GSI): (PK = userId, SK = noteId)
The "Type" will indicate if it is the BODY of the note (where information regarding the note will be save) or an identifier that indicate if the note has been shared with other user.
But I do have a problem: I use the secondary global key to retrieve all the notes for a user.
Once I have the list of noteId(s), I will enquiry my primary table to get all shared-notes for the user (as the notes for the user are already present in the SGK).
However, for doing this I need to use the function: "BatchGetItem".
The problem is that it is only allow to get 100 items and 16MB data.
In case of more than 100 shared-notes I have to call this functions several times. Moreover in case the data exceeds 16MB I need to implement a mechanism to read the rest of the requested data.
This operation could get really slow depending on the data size and number of shareId.
As you can imagine this is easily solved using a RDB and "join".
But the idea here is to use DynamoDB.
Data Access patterns:
Get all Notes by userId (own and shared)
Add a share by userId and sharedId.
Get rights by noteId and userId.
Update a note by Id
Delete a note by Id
Any ideas of how I can change my data-model to improve the access pattern to read all notes?
Modelling your schema to utilise item collections will allow you to use the Query API, which has no limit on the number of items returned, only a 1 MB limit per page that still needs to be paged through.
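A minimal sketch of such an item collection with boto3, assuming a table (here called "NotesApp") whose partition key PK holds the user identifier so that all of a user's notes, own and shared, form one collection:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("NotesApp")

def get_all_notes(user_id):
    # Query the user's item collection, paging past the 1 MB limit.
    items = []
    kwargs = {"KeyConditionExpression": Key("PK").eq(f"USER#{user_id}")}
    while True:
        response = table.query(**kwargs)
        items.extend(response["Items"])
        if "LastEvaluatedKey" not in response:
            break
        kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]
    return items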

DynamoDB partition key choice for notes app

I want to create a DynamoDB table that allows me to save notes from users.
The attributes I have:
user_id
note_id (uuid)
type
text
The main queries I will need:
Get all notes of a certain user
Get a specific note
Get all notes of a certain type (the less used query)
I know that in terms of performance and DynamoDB partitions, note_id would be the right choice because the values are unique and would be distributed equally over the partitions, but on the other hand it is much harder to get all notes of a user without scanning all items or using a GSI. And if the keys are unique, I suppose it doesn't make any sense to have a sort key.
The other option would be to use user_id as partition key and note_id as sort key, but if I have certain users with a much larger number of notes than others, wouldn't that impact my performance?
Is it better to have a unique partition key (like note_id) to scale well with DynamoDB partitions and use GSIs for my queries, or to instead use a partition key suited to my main query (user_id)?
Thanks
Possibly the simplest and most cost-effective way would be a single table:
Table Structure
note_id (uuid) / hash key
user_id
type
text
Have two GSIs, one for "Get all notes of a certain user" and one for "Get all notes of a certain type (the less used query)":
GSI for "Get all notes of a certain user"
user_id / hash key
note_id (uuid) / range key
type
text
A little note on this - which of your queries is the most frequent: "Get all notes of a certain user" or "Get a specific note"? If it's the former, then you could swap the GSI keys for the table keys and vice-versa (if that makes sense - in essence, have your user_id + note_id as the key for your table and the note_id as the GSI key). This also depends upon how you structure your user_id - something I suspect you've already picked up on: make sure your user_id is not sequential - make it a UUID or similar.
GSI for "Get all notes of a certain type (the less used query)"
type / hash key
note_id (uuid) / range key
user_id
text
Depending upon the cardinality of the type field, you'll need to test whether a GSI will actually be of benefit here or not.
If the GSI is of little benefit and you need more performance, another option would be to store the type with an array of note_id in a separate table altogether. Beware of the 400 KB item size limit with this one and the fact that you'll need to perform another query to get the text of the note.
With this table structure and GSIs, you're able to make a single query for the information you're after, rather than making two if you have two tables.
Of course, you know your data best - it's best to start with what you think is best and then test it to ensure it meets what you're looking for. DynamoDB is priced by provisioned throughput plus the amount of indexed data stored, so with "fat" indexes that have many attributes projected, as above, if there is a lot of data it could become more cost-effective to perform two queries and store less indexed data.
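Putting the above together, a sketch of the table and both GSIs with boto3; every name and capacity value here is illustrative, not prescriptive:

import boto3

client = boto3.client("dynamodb")

client.create_table(
    TableName="notes",
    AttributeDefinitions=[
        {"AttributeName": "note_id", "AttributeType": "S"},
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "type", "AttributeType": "S"},
    ],
    KeySchema=[{"AttributeName": "note_id", "KeyType": "HASH"}],
    ProvisionedThroughput={"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
    GlobalSecondaryIndexes=[
        {
            # "Get all notes of a certain user"
            "IndexName": "user_id-note_id-index",
            "KeySchema": [
                {"AttributeName": "user_id", "KeyType": "HASH"},
                {"AttributeName": "note_id", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},  # a "fat" index
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        },
        {
            # "Get all notes of a certain type"
            "IndexName": "type-note_id-index",
            "KeySchema": [
                {"AttributeName": "type", "KeyType": "HASH"},
                {"AttributeName": "note_id", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
            "ProvisionedThroughput": {"ReadCapacityUnits": 5, "WriteCapacityUnits": 5},
        },
    ],
)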
I would use user_id as your primary partition (hash) key and note_id as your primary range (sort) key.
You have already noted that in an ideal situation each partition key is accessed with equal regularity to optimise performance (see Design For Uniform Data Access Across Items In Your Tables). The use of user_id is perfectly fine as long as you have a good spread of users who regularly log in. Indeed AWS specifically encourage this option (see the 'Choosing a Partition Key' table in the link above).
This approach will also make your application code much simpler than your alternative approach.
You then have a second choice, which is whether to apply a Global Secondary Index for your get-notes-by-type query. A GSI key, unlike a primary key, does not need to be unique (see the AWS GSI guide), therefore I suggest you would simply use type as your GSI partition key without a range key.
The obvious plus side to using a GSI is a faster result when you perform the note type query. However, you should be aware of the downsides also. A GSI has a separate throughput allowance from your table, so you need to provision this in addition to your table throughput (at extra cost). If you don't provision your GSI with enough read units it could end up slower than a scan on your table. If you don't provision enough write units, your table writes could be throttled, even if your table itself has enough write units.
Also, AWS warn that GSIs are updated asynchronously (usually within a fraction of a second, but it can be longer). This means queries on your GSI might return the 'wrong' result if you have table writes and index reads very close together. If this were a problem you would need to handle it in your application code.
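To make this concrete, a sketch of the three access patterns under this design with boto3 (the table name "notes", the index name "type-index", and the example values are assumptions):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
notes = dynamodb.Table("notes")

# Main query - all notes of a user. A Query needs only the partition key;
# the sort key (note_id) is not required.
user_notes = notes.query(
    KeyConditionExpression=Key("user_id").eq("some-user-uuid")
)["Items"]

# Get a specific note - GetItem needs the full primary key
# (this sketch assumes the note exists).
note = notes.get_item(
    Key={"user_id": "some-user-uuid", "note_id": "some-note-uuid"}
)["Item"]

# Less-used query - all notes of a type, via the GSI (partition key only).
typed_notes = notes.query(
    IndexName="type-index",
    KeyConditionExpression=Key("type").eq("reminder"),
)["Items"]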
I see this as 2 tables: users and notes, with a GSI on the notes table. Not sure how else you could do it. Using user_id as the partition key and note_id as the sort key means you can only retrieve an element when you know both the user_id and the note_id. With DynamoDB, if you're not scanning you have to satisfy all the elements of the primary key, so both the partition key and the sort key if there is one. Below is how I would do this.
Get all notes of a certain user
When a user creates a note, I would add its ID to the user's notes attribute in the users table. When you want to get all of a user's notes, retrieve the user and access the array/list of note_ids stored there.
{
  userId: xxx,
  notes: [note_id_1, note_id_2, note_id_3]
}
Get a specific note
A notes table with note_id as the primary key would make that easy.
{
  noteId: XXXX,
  note: "sfsfsfsfsfsf",
  type: "standard_note"
}
Get all notes of a certain type (the less used query)
I would use a GSI on the notes table for this with the attributes of "note_type" and note_id projected onto it.
Update
You can pull this off with one table and a GSI (see the other two answers for how), but I would not do it. Your data model is so simple; why make it more complicated than users and notes?

Efficient implementation of this simple relation in DynamoDB?

User has an email address and a display name.
Both of these must be unique.
Both of these must be updatable, as long as the new value is not already in use.
A User table will exist with additional non-key attributes and a guid ID.
How should I model this to support an efficient query that checks whether an email address or display name is already in use?
Should I create a table with the guid as key, no range key, and 2 separate GSIs, one for email and one for display name (each being that index's key)? Both would also have a second field with the guid ID of the user. Or should these be completely separate tables, or something else?
Thoughts, is there a better way?
Thanks.
There are 3 designs I can think of:
As you have mentioned, a table with guid as the key and 2 separate GSIs, one for email and the other for name.
You have stated that both fields have to be unique, so potentially you can make one of them the hash key and create a GSI for the other. (This will run into a problem, as you mention that you need to update email & name as well; for that you would have to delete the old record and add a new record with the same attributes and the updated hash key.)
The advantage of this would be that you pay less, as there will be only one GSI compared to two in #1.
Another option is to use CloudSearch. Your DynamoDB table can be integrated with CloudSearch; in this option you can simply create a table with guid and no GSIs, and whenever you want to search you can query CloudSearch to get the output.
One more advantage of CloudSearch is that you will be able to query on any attribute of the table and use different filters on them.
One thing you need to check is the price difference between #2 and #3; you can go with whichever is better suited in terms of price and functionality.
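As an illustration of how the existence check in #1 might look with boto3 (the table and index names are assumptions, and remember that GSI reads are eventually consistent):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
users = dynamodb.Table("Users")

def value_in_use(index_name, attribute, value):
    # Query the GSI for the value; Count > 0 means it is already taken.
    # A value written a moment ago may not be visible yet (eventual consistency).
    response = users.query(
        IndexName=index_name,
        KeyConditionExpression=Key(attribute).eq(value),
        Limit=1,
    )
    return response["Count"] > 0

email_taken = value_in_use("email-index", "email", "someone@example.com")
name_taken = value_in_use("name-index", "displayName", "SomeName")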
If you implement this in other ways, feel free to share.
Hope that helps

What's cheaper on DynamoDB (GSI vs multiple tables)

I have an issue making a username AND an email unique. It is quite easy with a relational database: just do 2 queries and check the count returned by each.
select count(email) from users where email = ?;
select count(username) from users where username = ?;
But in DynamoDB (NoSQL) is it better (i.e. cheaper) to have 2 tables like so:
username table (where username is the hash key) and check that table with a PUT and an attribute_not_exists condition
AND
email table (where email is the hash key) and check that table after the first one with a PUT and an attribute_not_exists condition
OR do I
email table (email as hash key) with username as a GSI on that table. Then query the GSI first, and if the username doesn't exist, do a PUT with email and username.
Which is better (cheaper)?
Two questions so I'll address them separately.
Which is cheaper?
You can run a single table with one GSI, or two tables, for the exact same cost if you want to, because throughput for GSIs is provisioned the same way the primary table's throughput is.
Cost should not be a deciding factor.
Which is better?
The difficulty of keeping a secondary attribute unique in DynamoDB is a common problem. Because of the asynchronous nature of GSIs, the HASH or HASH/RANGE combination for a GSI is not required to be unique. This can be taken advantage of in some circumstances.
If you use two tables, you are taking on the responsibility for keeping both tables in sync (something that is not easy to do in many situations). This comes with some important responsibilities (what happens if your app dies after writing to the first table but before it writes to the second?), but this additional responsibility could allow you to maintain the uniqueness you want.
To explain how you would actually accomplish the dual uniqueness while maintaining accuracy, you would want to take advantage of conditional writes. The following outline describes a series of steps that would ensure that you maintain uniqueness.
Write record to username table with condition that username is not in the table, but include a conditional flag set to false (if write fails, we bail)
Write record to email table with condition that email is not in the table (if write fails, we delete the previous username record)
Update the username record to set the conditional flag to true
The reason you would want to use a conditional flag on the username record, essentially marking it as not yet in a valid state, is to ensure you actually maintain the uniqueness.
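A sketch of those three steps in Python with boto3; the table names "usernames" and "emails" and the "valid" flag attribute are illustrative only:

import boto3

client = boto3.client("dynamodb")

def claim(username, email):
    # Step 1: claim the username with a flag marking it as not yet valid;
    # the conditional write fails if the username already exists.
    try:
        client.put_item(
            TableName="usernames",
            Item={"username": {"S": username}, "valid": {"BOOL": False}},
            ConditionExpression="attribute_not_exists(username)",
        )
    except client.exceptions.ConditionalCheckFailedException:
        return False  # username taken: bail

    # Step 2: claim the email; on failure, roll back the username record.
    try:
        client.put_item(
            TableName="emails",
            Item={"email": {"S": email}, "username": {"S": username}},
            ConditionExpression="attribute_not_exists(email)",
        )
    except client.exceptions.ConditionalCheckFailedException:
        client.delete_item(
            TableName="usernames",
            Key={"username": {"S": username}},
        )
        return False  # email taken

    # Step 3: flip the flag to mark the username record as valid.
    client.update_item(
        TableName="usernames",
        Key={"username": {"S": username}},
        UpdateExpression="SET valid = :t",
        ExpressionAttributeValues={":t": {"BOOL": True}},
    )
    return True

(On current DynamoDB you could also wrap both puts in a single TransactWriteItems request with the same attribute_not_exists conditions, which makes the pair atomic and removes the need for the flag and the manual rollback.)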