DynamoDB one-to-one - amazon-web-services

Hello stackoverflow community,
This question is about modeling one-to-one relationships with multiple entities involved.
Say we have an application about students. Each Student has:
Profile (name, birth date...)
Grades (math score, geography...)
Address (city, street...).
Requirements:
The Profile, Grades and Address each belong to only one Student (i.e. one-to-one).
A Student has to have all Profile, Grades and Address data present (there is no student without grades for example).
Updates can happen to all fields, but the profile data mostly remain untouched.
We access the data based on a Student and not by querying for the address or something else (a query could be "give me the grades of student John", or "give me profile and address of student John", etc).
All fields put together are below the 400 KB item size limit of DynamoDB.
The question is how would you design it? Put all data as a single row/item or split it to Profile, Grades and Address items?

My solution is to go with keeping all data in one row defined by the studentId as the PK and the rest of the data follow in a big set of columns. So one item looks like [studentId, name, birthDate, mathsGrade, geographyGrade, ..., city, street].
I find that this way I can have transactional inserts/updates (with the downside that I always have to work with the full item, of course), and while querying I can ask for just the subset of data needed each time.
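For illustration, here is roughly what a write and a projected read look like with the Node.js DocumentClient under this design (a sketch only; the table name "students" and the attribute names are placeholders, and the calls run inside an async function):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Write the whole student as a single item keyed by studentId.
await docClient.put({
    TableName: "students",
    Item: {
        studentId: "john",
        name: "John Smith",
        birthDate: "1999-04-12",
        mathsGrade: 96.5,
        geographyGrade: 88.0,
        city: "Fresno",
        street: "Main St"
    }
}).promise();

// Read back only the subset needed, e.g. just the grades.
const grades = await docClient.get({
    TableName: "students",
    Key: { studentId: "john" },
    ProjectionExpression: "mathsGrade, geographyGrade"
}).promise();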
On top of the above, this solution fits with two of the most important AWS guidelines about dynamo:
keep everything in a single table and
pre-join data whenever possible.
The reason for my question is that I could only find one topic on Stack Overflow about one-to-one modeling in DynamoDB, and the suggested solution (also heavily up-voted) was in favor of keeping the data in separate tables, something that reminds me of a relational-DB kind of design (see the solution here).
I understand that in that context the author tried to keep a more generic use case and probably support more complex queries, but it feels like the option of putting everything together was fully devalued.
For that reason I'd like to open that discussion here and listen to other opinions.

A Basic Implementation
Considering the data and access patterns you've described, I would set up a single student-data table with a partition key that allows me to query by the student, and a sort key that allows me to narrow down my results even further based on the entity I want to access. One way of doing that would be to use some kind of identifier for a student, say studentID, and then something more generalized for the sort key like entityID, or simply SK.
At the application layer, I would classify each Item under one possible entity (profile, grades, address) and store data relevant to that entity in any number of attributes that I would need on that Item.
An example of how that data might look for a student named john smith:
{ studentId: "john", entityId: "profile", firstName: "john", lastName: "smith" }
{ studentId: "john", entityId: "grades", math2045: 96.52, eng1021:89.93 }
{ studentId: "john", entityId: "address", state: "CA", city: "fresno" }
With this schema, all your access patterns are available:
"give me the math grades of student john"
PartitionKey = "john", SortKey = "grades"
and if you store the address within the student's profile entity, you can accomplish "give me profile and address of student John" in one shot (multiple queries should be avoided when possible)
PartitionKey = "john", SortKey = "profile"
Consider
Keep in mind, you need to take into account how frequently you are reading/writing data when designing your table. This is a very rudimentary design, and may need tweaking to ensure that you're not setting yourself up for major cost or performance issues down the road.
The basic idea that this implementation demonstrates is that denormalizing your data (in this case, across the different entities you've established) can be a very powerful way to leverage DynamoDB's speed, and also leave yourself with plenty of ways to access your data efficiently.
Problems & Limitations
Specific to your application, there is one potential problem that stands out: it seems very feasible that the grades Items will start to balloon to the point where they become unwieldy and expensive to read/write/update. As you start storing more and more students, and each student takes more and more courses, your grades entities will expand with them. If the average student takes anywhere from 35-40 classes and gets a grade for each of them, you don't want to manage 35-40 attributes on a single Item if you don't have to. You also may not want back every single grade every time you ask for a student's grades. Maybe you start storing more data on each grade entity like:
{ math1024Grade: 100, math1024Instructor: "Dr. Jane Doe", math1024Credits: 4 }
Now for each class, you're storing at least 2 extra attributes. That Item with 35-40 attributes just jumped up to 105-120 attributes.
On top of performance and cost issues, your access patterns could start to evolve and become more demanding. You may only want grades from the student's major, or a certain type of class like humanities, sciences, etc, which is currently unavailable. You will only ever be able to get every single grade from each student. You can apply a FilterExpression to your request and remove some of the unwanted Items, but you're still paying for all the data you've read.
With the current solution, we are leaving a lot on the table in terms of optimizations in performance, flexibility, maintainability, and cost.
Optimizations
One way to address the lack of flexibility in your queries, and possible bloating of grades entities, is the concept of a composite sort key.
Using a composite sort key can help you break down your entities even further, making them more manageable to update and providing you more flexibility when you're querying. Additionally, you would wind up with much smaller and more manageable items, and although the number of items you store would increase, you'll save on cost and performance. With more optimized queries, you'll get only the data you need back so you're not paying those extra read units for data you're throwing away. The amount of data a single Query request can return is limited as well, so you may cut down on the amount of roundtrips you are making.
That composite sort key could look something like this, for grades:
{ studentId: "john", entityId: "grades#MATH", math2045: 96.52, math3082:91.34 }
{ studentId: "john", entityId: "grades#ENG", eng1021:89.93, eng2203:93.03 }
Now, you get the ability to say "give me all of John's MATH course grades" while still being able to get all the grades (by using the begins_with operation on the sort key when querying).
If you think you'll want to start storing more course information under grades entities, you can suffix your composite sort key with the course name, number, identifier, etc. Now you can get all of a student's grades, all of a student's grades within a subject, and all the data about a student's grade within a subject, like its instructor, credits, year taken, semester, start date, etc.
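As a rough sketch (again assuming the hypothetical student-data table from above), "give me all of John's MATH course grades" then becomes a Query with begins_with on the sort key:

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

const mathGrades = await docClient.query({
    TableName: "student-data",
    KeyConditionExpression: "studentId = :sid AND begins_with(entityId, :prefix)",
    ExpressionAttributeValues: {
        ":sid": "john",
        ":prefix": "grades#MATH"   // use just "grades#" to get every grade
    }
}).promise();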
These optimizations are all possible solutions, but may not fit your application, so again keep that in mind.
Resources
Here are some resources that should help you come up with your own solution, or ways to tweak the ones I've provided above to better suit you.
AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304)
AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
Best Practices for Using Sort Keys to Organize Data
NoSQL Design For DynamoDB
And keep this one in mind especially when you are considering cost/performance implications for high-traffic applications:
Best Practices for Designing and Using Partition Keys Effectively

Related

Data modelling one to many relationship in DynamoDB

I am working in DynamoDB for the first time. My assignment is a Game Management System with 3 entities: Game, Tournament, and User. The relationships between the entities are:
1 Game has multiple users
1 User has multiple tournaments
1 Game has multiple tournaments
I have identified the following access patterns. A user can log in via gameId or userId. In either case we should retrieve the complete Game details, hence the data in DynamoDB is stored as below.
Fetch GameDetails by GameId
Fetch GameDetails by userId
The challenge here is that we will have either a GameId or a UserId at a time, and I should send back the complete GameDetail object. The GameDetail object should be consistent across the GameId. For example, if a user logs in using userId 101 and plays a game "abc", then all the items with that gameId should be updated.
Considering such a scenario, how should I model the data in DynamoDB? I thought of having gameId as the partition key and userId as the sort key, plus a GSI as an inverted index (userId as partition key and gameId as sort key).
As I mentioned earlier, the challenge is that we will have either a GameId or a userId at a time, in which case we cannot update without the sort key. Experienced people, please help.
There are three DynamoDB concepts that I think will help here.
The first relates to granularity of Items. In DynamoDB I like my individual Items to be fine-grained. Then if I need to assemble a more complex object, I can pull together a number of fine-grained Items. The advantage is that I can then pull properties of the fine-grained object out into attributes, and I can make Global Secondary Indexes to create different views of the object.
The second relates to immutability. In DynamoDB I like to use Items to represent true facts that have happened and will never change. Then if I need to determine the state of the system, I can pull together all of those immutable events, and perform computations on them. The advantage is that I don’t have to worry about conflicting updates to an Item or to worry about losing historical information.
The third relates to pulling complexity out of the database and into the application. In DynamoDB I like to move computations involving multiple Items out of the database, and perform them within my application. There is not much cost to doing so, since the amount of data for one domain object, like a game, an employee, or an order, is often manageable. The advantage is less complexity in the data model, and fewer opportunities for complicated operations performed within the database where they could affect its performance.
These relate to your system as follows.
For granularity, I would model the Score, not the Game. Assemble the state of the Game by pulling in all the Scores for all the Tournaments within the Game and doing computations on them.
For immutability, record a Score for a particular Tournament as an immutable event. Then you don’t have to worry about updating existing Items when a Score is determined for a Tournament. If the Score has to change, and you want to store the Score between changes, then record Points instead of Scores, and similar logic applies. Once a Point is scored, it forever has been scored and won’t change (but you can score negative Points), and you can determine the score for a Tournament by pulling together all the Point events.
For pulling out complexity, you will be performing things like determining the winner of a Game using computations in your application, either front-end or back-end, using a potentially large collection of Scores. This is generally fine, since compute resources and browsers can usually handle computations on data sets of the size that would be expected for a Game.
I have one note on terminology. In English usage in the United States, you would call a collection of games a tournament. It may be that in other parts of the world, or for particular sports I don't know about, a game is a collection of tournaments. Either way, we effectively have “Big Competitions” and “Little Competitions,” where a Big Competition is a collection of Little Competitions. In your post, you’re using Big Competition for Game, and Little Competition for Tournament. In my suggested table design below, I’m using generic terms BigComp to abbreviate “Big Competition” and LittleComp to abbreviate “Little Competition.” I’m calling this out, and using my generic terms, in case others are confused as I was by the notion of a game as a collection of tournaments.
I would model your system with the following Item types and Global Secondary Indexes.
Constructions like bigcomp#<BigComp Id> mean the literal string bigcomp# followed by a Game Id like 4839457. I do this so that I can use a PK for different types of Items without worrying about collisions in name.
Item type: Score
PK: bigcomp#<BigComp Id>
SK: littlecomp#<LittleComp Id>#userid#<User Id>#score
Attribute: ScoreUserId
Attribute: ScoreValue
Note: User Id and ScoreUserId are repetitions of the same info, the User Id.
Global Secondary Index UserScores
PK: ScoreUserId
Projections: All attributes
Item type: Login
PK: userid#<User Id>
SK: <sortable date and time string>
Note: This allows you to show the last login time.
Item type: BigComp Enrollment
PK: bigcomp#<BigComp Id>
SK: bigcompenrollment
Attribute: BigCompEnrolledUserId
Note: The attribute is the User Id of the user who has joined this BigComp.
GlobalSecondaryIndex BigCompEnrolledUserId-BigComp
PK: BigCompEnrolledUserId
Projections: All attributes
Item type: LittleComp Enrollment
PK: bigcomp#<BigComp Id>
SK: littlecompenrollment#littlecomp#<LittleComp Id>
Attribute: LittleCompEnrolledUserId
GlobalSecondaryIndex LittleCompEnrolledUserId-LittleComp
PK: LittleCompEnrolledUserId
Projections: All attributes
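To make the schema concrete, here is a rough sketch of recording a Score as an immutable event (the table name "competitions" and the Ids are placeholders for illustration):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// One Score event; it is never updated afterwards.
await docClient.put({
    TableName: "competitions",
    Item: {
        PK: "bigcomp#4839457",                  // the Game
        SK: "littlecomp#221#userid#101#score",  // the Tournament and the User
        ScoreUserId: "101",
        ScoreValue: 87
    }
}).promise();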
Here’s how some access patterns would work. I’ll use the convention of a Big Competition being a Game, and a Little Competition being a Tournament, since that’s how it is in the original post.
User joins a Game. Write Item of type BigComp Enrollment.
User joins a Tournament. Write Item of type LittleComp Enrollment.
Another User joins the Tournament. Write another Item of type LittleComp Enrollment.
The first User logs out, and then wants to re-join the Tournament based on the Game Id. Query PK: bigcomp#<BigComp Id>, SK: begins_with(littlecompenrollment#), with BigComp Id set to the known Game Id, and you’ll have a collection of all the Tournaments for this Game, and their Users. The application will have to handle authentication, to make sure the User is who they say they are.
Two Users compete in their Tournament, each getting a Score. Write two Items of type Score.
Who won a given Tournament? Query all the Score Items for the specific Game (BigComp) and Tournament (LittleComp).
Who won the Game? Get all the Scores for the specific Game (BigComp). Use logic in your application to figure out the winner.
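As a sketch of those last two access patterns (same assumed "competitions" table and key patterns as above), the query pulls the Scores and the winner computation stays in application code:

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Fetch every Score item for one Game. Narrow :prefix to
// "littlecomp#<LittleComp Id>#" if you only want one Tournament.
const res = await docClient.query({
    TableName: "competitions",
    KeyConditionExpression: "PK = :pk AND begins_with(SK, :prefix)",
    ExpressionAttributeValues: {
        ":pk": "bigcomp#4839457",
        ":prefix": "littlecomp#"
    }
}).promise();

// Sum the scores per user in the application and pick the highest total.
const totals = {};
for (const item of res.Items) {
    if (item.ScoreValue !== undefined) {
        totals[item.ScoreUserId] = (totals[item.ScoreUserId] || 0) + item.ScoreValue;
    }
}
const winner = Object.keys(totals).sort((a, b) => totals[b] - totals[a])[0];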

DynamoDB query all users sorted by name

I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
Users can number in the millions; projects per user can number in the thousands.
My access pattern is also rather simple:
Get a user by id
Get a list of paginated users sorted by name or creation date
Get a project by id
Get projects by user sorted by date
My single table for this data model is the following:
I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated sorted users, as bucket PKs would need to be possibly random, and I would have to query all buckets to be able to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all the docs and best practices and watched many online videos, I find myself stuck on the most basic use case, which DynamoDB does not seem to support well. I suppose it must be quite common to have to get a list of users in some sort of admin panel for practically any modern application.
What would others do in this case? I really want to use DynamoDB for all the benefits it gives, especially in terms of cost.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.
I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
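As a very rough sketch of how that could be queried (everything here is assumed: a namePrefix attribute on each user item, and a hypothetical GSI called byNamePrefix with namePrefix as its partition key and name as its sort key):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Returns one page of users for a given prefix, already sorted by name via the GSI sort key.
async function listUsersByPrefix(prefix, pageSize, startKey) {
    return docClient.query({
        TableName: "users",
        IndexName: "byNamePrefix",
        KeyConditionExpression: "namePrefix = :p",
        ExpressionAttributeValues: { ":p": prefix },
        Limit: pageSize,
        ExclusiveStartKey: startKey    // LastEvaluatedKey from the previous page, if any
    }).promise();
}

const page = await listUsersByPrefix("a", 50);

To paginate across the whole user base you would walk the prefixes in alphabetical order and carry both the current prefix and the LastEvaluatedKey in your page token.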
You might also consider using a tool like ElasticSearch, which could support your second access pattern and more.

AWS Personalize items attributes

I'm trying to implement personalization and I'm having problems with the Items schema.
Imagine I'm Amazon: I have products, their brands, and their categories. In what kind of Items schema should I include this information?
Should I include the brand name as a string categorical field? Should I rather include the brand ID as a string or as a number? Or should I include both?
What about categories? I have the same questions.
Metadata Fields: Metadata includes string or non-string fields that aren't required or don't use a reserved keyword. Metadata schemas have the following restrictions:
Users and Items schemas require at least one metadata field.
Users and Interactions datasets can contain up to five metadata fields. An Items dataset can contain up to 50 metadata fields.
If you add your own metadata field of type string, it must include the categorical attribute. Otherwise, Amazon Personalize won't use the field when training a model.
https://docs.aws.amazon.com/personalize/latest/dg/how-it-works-dataset-schema.html
There are simply 2 ways to include your metadata in Items/Users datasets:
If it can be represented as a number, then provide the actual numeric value if that makes sense.
If it can be represented as a string, then provide the string value and make sure that categorical is set to true.
But let's look at "Why do they need me to categorize my string metadata?" The answer is pretty simple.
Let's start with an example.
If your Items were Amazon.com products and you would like to provide a ratings metadata field, then:
You could take all of the ratings, including the full review text sent by clients, and simply put them in as a metadata field.
Or you could take just the star rating, calculate the average, and put that in as a metadata field.
The second one probably makes more sense in general. Having random, long product reviews as metadata changes pretty much nothing. Personalize doesn't understand whether the review itself is good or bad, or whether the author also recommends another product, so it doesn't really add anything to the recommendations.
However if you simply "cut" your dataset and calculate the average rating, like in the 2. point, then it makes a lot more sense. Maybe some of our customers like crappy products? Maybe they want to buy them, because they are famous YouTubers and they create videos about that? Based on their previous interactions and much more, Personalize will be able to perform just slightly better, because now it knows, that this product has rating of 5/5 or 3/5.
I wanted to show you that in some cases, providing Items metadata as a free-form string makes no sense. That's why your string metadata must be categorical: it should be a finite set of values, so it adds some knowledge for Personalize about a given Item and why some people might want to interact with it.
Going back to your question:
Should I include the brand name as a string categorical field? Should I rather include the brand ID as a string or as a number? Or should I include both?
I would simply go with the brand ID as a string. You could also go with the brand name, but a brand can be renamed while still being the same brand, so picking the ID is more stable. Also, two different brands could have the same name because they operate in different markets, so picking the ID solves that too.
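As an illustration, an Items schema along those lines could look like the following (ITEM_ID is the required field; BRAND_ID, CATEGORY_ID and AVERAGE_RATING are made-up names for this example):

{
    "type": "record",
    "name": "Items",
    "namespace": "com.amazonaws.personalize.schema",
    "fields": [
        { "name": "ITEM_ID", "type": "string" },
        { "name": "BRAND_ID", "type": "string", "categorical": true },
        { "name": "CATEGORY_ID", "type": "string", "categorical": true },
        { "name": "AVERAGE_RATING", "type": "float" }
    ],
    "version": "1.0"
}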
The "categorical": true switch in your schema just tells Personalize:
Hey, do you see that string field? It's a categorized, finite set of values. If you train a model for me, please include this one during training, it's important!
And as the documentation says, if you provide a string metadata field that is not marked as categorical, then Personalize will "think":
Hmm... this field is a string, it has pretty random values, and it's not marked as categorical. It's probably just a leftover from the Items export job. Let's ignore it.

Search dynamoDB using more than one attribute

I've created a skill with the help of a few people on this site.
I have a database, and what I want to do is ask Alexa to recall data from it, e.g. by asking for films from a certain date.
The issue I'm having at the moment is that I have defined my partition key and it works correctly for one of the items in my table, reading the message for that specific key, but anything else I search for gives me the same response as the one item that works. Any ideas on how to overcome this?
Here is how I have defined my request:
let handleCinemaIntent = (context, callback) => {
    let params = {
        TableName: "cinema",
        Key: {
            date: "2018-01-04"
        }
    };
    // params are then passed to a GetItem request for the single item with that key
};
Just as a side note, I will have the same date repeating in my partition key, and from what I understand the partition key needs to be unique, so I'd need to overcome this as well.
You have a few options for structuring your DynamoDB table but I think the most straightforward is the following:
You can set up your table with a partition key of "date" (like you have now), but also with a sort key which would be the film name, or some other identifier. This way, you can have all films for a particular date under one partition key and query them using a Query operation (as opposed to the GetItem that you've been using). You won't be able to modify the existing table to add a sort key though, so you will have to delete the existing table and recreate it with the different schema.
Since there is generally a rather limited number of films for each day, this partition scheme should work really well, assuming you always just query by day. Where this breaks down is if you need to search by just the film name (i.e. "give me the dates when this film will run"). If you need the latter, then you could create a GSI where the primary key is the film name, and the range key is the date.
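With that schema in place (date as partition key, film name as sort key), the "all films on a date" lookup could look roughly like this with the Node.js DocumentClient (a sketch; note that date has to be aliased in the expression because it collides with a reserved word):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

const films = await docClient.query({
    TableName: "cinema",
    KeyConditionExpression: "#d = :date",
    ExpressionAttributeNames: { "#d": "date" },
    ExpressionAttributeValues: { ":date": "2018-01-04" }
}).promise();
// films.Items -> every film item stored under that date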
However, you should pause a moment and consider whether DynamoDB is the right database for your needs. I say this because Dynamo is really good at access patterns where you know exactly what you are searching for and need to scale horizontally, whereas your use case is more of a fuzzy search.
As an alternative to Dynamo you might consider setting up an ElasticSearch cluster and throwing your film data in it. Then you can very trivially run queries like "what films will run on this day", "what days will this film run", "what films will run this week", "what action movies are coming this spring", "what animation films are playing today", or "what movies are playing near me".

AWS DynamoDB Table Design: Store two UserIDs and Details in Table

I'm building an app where two users can connect with each other, and I need to store that connection (e.g. a friendship) in a DynamoDB table. Basically, the connection table has two fields:
userIdA (hash key)
userIdB (sort key)
I was thinking of adding an index on userIdB to query on both fields. Should I store a connection as one record (ALICE, BOB) or as two records (ALICE, BOB; BOB, ALICE)? The first option needs one write operation and less space, but I have to query twice to get all connections of a user. The second option needs two write operations and more space, but I only have to query once for the userId.
The user table has details like name and email:
userId (hash key)
name (sort key)
email
In my app, I want to show all connections of a certain user with user details in a listview. That means I have two options:
Store the user details of the connected users also in the connection table, e.g. add two name fields to that table. This is fast, but if the user name changes (name and email are retrieved from Facebook), the details are invalid and I need to update all entries.
Query the user details of each userId with a Batch Get request to read multiple items. This may be slower, but I always have up to date user details and don't need to store them in the connection table.
So what is the better solution, or are there any other advantages/disadvantages that I may have overlooked?
EDIT
After some google research regarding friendship tables with NoSQL databases, I found the following two links:
How does Facebook maintain a list of friends for each user? Does it maintain a separate table for each user?
NoSQL Design Patterns for Relational Data
The first link suggests storing the connection (or friendship) in both directions with two records, because that makes it easier and faster to query:
Connections:
1 userIdA userIdB
2 userIdB userIdA
The second link suggests saving a subset of duplicated data ("summary") into the tables so it can be read faster with just one query. That would mean saving the user details into the connection table as well, and saving the userIds into an attribute of the user table:
Connections:
# userIdA userIdB userDetails status
1 123 456 { userId: 456, name: "Bob" } connected
2 456 123 { userId: 123, name: "Alice" } connected
Users:
# userId name connections
1 123 Alice { 456 }
2 456 Bob { 123 }
This database model makes it pretty easy to query connections, but it seems difficult to update if user details change. Also, I'm not sure I need the userIds within the user table again, because I can easily query on a userId.
What do you think about that database model?
In general, nosql databases are often combined with a couple of assumptions:
Eventual consistency is acceptable. That is, it's often acceptable in application design if, during an update, some of the intermediate answers aren't right. For example, it might be fine if, for a few seconds while Alice is becoming Bob's friend, "Is Alice Bob's friend?" returns true while "Is Bob Alice's friend?" returns false.
Performance is important. If you're using nosql it's generally because performance matters to you. It's also almost certainly because you care about the performance of operations that happen most commonly. (It's possible that you have a problem where the performance of some uncommon operation is so bad that you can't do it; nosql is not generally the answer in that situation)
You're willing to make uncommon operations slower to improve the performance of common operations.
So, how does that apply to your question? First, it suggests that ultimately the answer depends on performance. That is, no matter what people say here, the right answer depends on what you observe in practice. You can try multiple options and see what results you get.
With regard to the specific options you enumerated:
Assuming that performance is enough of a concern that nosql is a reasonable solution for your application, it's almost certainly query rather than update performance you care about. You probably will be happy if you make updates slower and more expensive so that queries can be faster. That's kind of the whole point.
You can likely handle updates out of band; that is, eventual consistency likely works for you. You could submit update operations to an SQS queue rather than handling them during your page load. So if someone clicks a confirm-friend button, you could queue a request to actually update your database. It is OK even if that involves rebuilding their user row, rebuilding the friend rows, and even updating some counts about how many friends they have.
It probably does make sense to store a friend row in each direction so you only need one query.
It probably does make sense to store the user information like Name and picture that you typically display in a friend list duplicated in the friendship rows. Note that whenever the name or picture changes you'll need to go update all those rows.
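A rough sketch of that double write (the "connections" table name and the duplicated display fields are assumptions, and two plain PutItem calls would work too if eventual consistency is enough):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

// Write the friendship in both directions, duplicating the display info needed for the friend list.
await docClient.transactWrite({
    TransactItems: [
        { Put: { TableName: "connections",
                 Item: { userIdA: "ALICE", userIdB: "BOB", friendName: "Bob", friendPicture: "bob.jpg" } } },
        { Put: { TableName: "connections",
                 Item: { userIdA: "BOB", userIdB: "ALICE", friendName: "Alice", friendPicture: "alice.jpg" } } }
    ]
}).promise();

// "All of Alice's connections" is then a single query on the hash key:
const friends = await docClient.query({
    TableName: "connections",
    KeyConditionExpression: "userIdA = :me",
    ExpressionAttributeValues: { ":me": "ALICE" }
}).promise();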
It's less clear that storing the friends in the user table makes sense. That could get big. Also, it could be tricky to guarantee eventual consistency. Consider what happens if you are processing updates to two users' friendships at the same time. It's very important that you not end up with inconsistency once all the dust has settled.
Whenever you have non-normalized data such as duplicating rows in each direction, or copying user info into friendship tables, you want some way to revalidate and fix your data. You want to write code that in the background can go scan your system for inconsistencies caused by bugs or crashed activities and fix them.
I suggest you have the following fields in the table:
userId (hash key)
name (sort key)
email
connections (comma-separated, or an array of userIds, assuming a user has multiple connections)
This structure can ensure consistency across your data.
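For example, adding a new connection under this structure could be a single UpdateItem on the user's row (a sketch assuming the table is named "users" and connections is stored as a list):

const AWS = require("aws-sdk");
const docClient = new AWS.DynamoDB.DocumentClient();

await docClient.update({
    TableName: "users",
    Key: { userId: "123", name: "Alice" },   // both hash and sort key are required
    UpdateExpression: "SET connections = list_append(if_not_exists(connections, :empty), :new)",
    ExpressionAttributeValues: {
        ":empty": [],
        ":new": ["456"]
    }
}).promise();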