I am working in DynamoDB for the first time. My assignment is a Game Management System with three entities: Game, Tournament and User. The relationships between the entities are:
1 Game has multiple users
1 User has multiple tournaments
1 Game has multiple tournaments
I have identified the following access patterns. A user can log in via either gameId or userId. In either case we should retrieve the complete Game details, hence the data is stored in DynamoDB as below:
Fetch GameDetails by GameId
Fetch GameDetails by userId
The challenge here is that we will have either GameId or UserId at a time, and I should send back the complete GameDetail object. The GameDetail object should be maintained across the GameId. For example, if a user logs in using userId 101 and plays a game "abc", then all the items with that gameId should be updated.
Considering such a scenario, how should I model the data in DynamoDB? I thought of having gameId as the partition key and userId as the sort key, plus a GSI as an inverted index (userId as partition key and gameId as sort key).
As I mentioned earlier, the challenge is that we will have either GameId or userId at a time, in which case we cannot update without the sort key. Experienced people, please help.
There are three DynamoDB concepts that I think will help here.
The first relates to granularity of Items. In DynamoDB I like my individual Items to be fine-grained. Then if I need to assemble a more complex object, I can pull together a number of fine-grained Items. The advantage is that I can then pull properties of the fine-grained object out into attributes, and I can make Global Secondary Indexes to create different views of the object.
The second relates to immutability. In DynamoDB I like to use Items to represent true facts that have happened and will never change. Then if I need to determine the state of the system, I can pull together all of those immutable events, and perform computations on them. The advantage is that I don’t have to worry about conflicting updates to an Item or about losing historical information.
The third relates to pulling complexity out of the database and into the application. In DynamoDB I like to move computations involving multiple Items out of the database, and perform them within my application. There is not much cost to doing so, since the amount of data for one domain object, like a game, an employee, or an order, is often manageable. The advantage is less complexity in the data model, and fewer opportunities for complicated operations performed within the database where they could affect its performance.
These relate to your system as follows.
For granularity, I would model the Score, not the Game. Assemble the state of the Game by pulling in all the Scores for all the Tournaments within the Game and doing computations on them.
For immutability, record a Score for a particular Tournament as an immutable event. Then you don’t have to worry about updating existing Items when a Score is determined for a Tournament. If the Score has to change, and you want to store the Score between changes, then record Points instead of Scores, and similar logic applies. Once a Point is scored, it forever has been scored and won’t change (but you can score negative Points), and you can determine the score for a Tournament by pulling together all the Point events.
For pulling out complexity, you will be performing things like determining the winner of a Game using computations in your application, either front-end or back-end, using a potentially large collection of Scores. This is generally fine, since compute resources and browsers can usually handle computations on data sets of the size that would be expected for a Game.
I have one note on terminology. In English usage in the United States, you would call a collection of games a tournament. It may be that in other parts of the world, or for particular sports I don't know about, a game is a collection of tournaments. Either way, we effectively have “Big Competitions” and “Little Competitions,” where a Big Competition is a collection of Little Competitions. In your post, you’re using Big Competition for Game, and Little Competition for Tournament. In my suggested table design below, I’m using generic terms BigComp to abbreviate “Big Competition” and LittleComp to abbreviate “Little Competition.” I’m calling this out, and using my generic terms, in case others are confused as I was by the notion of a game as a collection of tournaments.
I would model your system with the following Item types and Global Secondary Indexes.
Constructions like bigcomp#<BigComp Id> mean the literal string bigcomp# followed by a Game Id like 4839457. I do this so that I can use a PK for different types of Items without worrying about collisions in name.
Item type: Score
PK: bigcomp#<BigComp Id>
SK: littlecomp#<LittleComp Id>#userid#<User Id>#score
Attribute: ScoreUserId
Attribute: ScoreValue
**Note: User Id and ScoreUserId are repetitions of the same info, the User Id.**
Global Secondary Index UserScores
PK: ScoreUserId
Projections: All attributes
Item type: Login
PK: userid#<User Id>
SK: <sortable date and time string>
**Note: This allows you to show the last login time.**
Item type: BigComp Enrollment
PK: bigcomp#<BigComp Id>
SK: bigcompenrollment
Attribute: BigCompEnrolledUserId
**Note: The attribute is the User Id of the user who has joined this BigComp.**
Global Secondary Index BigCompEnrolledUserId-BigComp
PK: BigCompEnrolledUserId
Projections: All attributes
Item type: LittleComp Enrollment
PK: bigcomp#<BigComp Id>
SK: littlecompenrollment#littlecomp#<LittleComp Id>
Attribute: LittleCompEnrolledUserId
Global Secondary Index LittleCompEnrolledUserId-LittleComp
PK: LittleCompEnrolledUserId
Projections: All attributes
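To make the shape concrete, here is a minimal boto3 sketch of the base table plus one of the GSIs (UserScores). The table name, billing mode, and the choice to project all attributes are assumptions; the other two GSIs would be declared the same way.

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Hypothetical table name; the remaining GSIs are added analogously.
dynamodb.create_table(
    TableName="competitions",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "ScoreUserId", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},
        {"AttributeName": "SK", "KeyType": "RANGE"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "UserScores",
            "KeySchema": [{"AttributeName": "ScoreUserId", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        },
    ],
    BillingMode="PAY_PER_REQUEST",
)
```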
Here’s how some access patterns would work. I’ll use the convention of a Big Competition being a Game, and a Little Competition being a Tournament, since that’s how it is in the original post.
User joins a Game. Write Item of type BigComp Enrollment.
User joins a Tournament. Write Item of type LittleComp Enrollment.
Another User joins the Tournament. Write another Item of type LittleComp Enrollment.
The first User logs out, and then wants to re-join the Tournament based on the Game Id. Query PK: bigcomp#<BigComp Id>, SK: begins_with(littlecompenrollment#), with BigComp Id set to the known Game Id, and you’ll have a collection of all the Tournaments for this Game, and their Users. The application will have to handle authentication, to make sure the User is who they say they are.
Two Users compete in their Tournament, each getting a Score. Write two Items of type Score.
Who won a given Tournament? Query all the Score Items for the specific Game (BigComp) and Tournament (LittleComp).
Who won the Game? Get all the Scores for the specific Game (BigComp). Use logic in your application to figure out the winner.
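As a rough sketch of the last two patterns, assuming the hypothetical "competitions" table above, the application pulls every Score Item for a Game with one Query and reduces them in code (pagination omitted for brevity):

```python
import boto3
from boto3.dynamodb.conditions import Key
from collections import defaultdict

table = boto3.resource("dynamodb").Table("competitions")  # hypothetical table name

def game_winner(big_comp_id):
    """Total the Scores for one Game (BigComp) per user and pick the highest."""
    resp = table.query(
        KeyConditionExpression=Key("PK").eq(f"bigcomp#{big_comp_id}")
        & Key("SK").begins_with("littlecomp#")
    )
    totals = defaultdict(int)
    for item in resp["Items"]:
        # Only Score Items carry ScoreValue; enrollment Items don't match this SK prefix.
        if "ScoreValue" in item:
            totals[item["ScoreUserId"]] += item["ScoreValue"]
    return max(totals, key=totals.get) if totals else None
```

Restricting the winner to a single Tournament works the same way, with a longer begins_with prefix that includes the LittleComp Id.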
Related
Hello stackoverflow community,
This question is about modeling one-to-one relationships with multiple entities involved.
Say we have an application about students. Each Student has:
Profile (name, birth date...)
Grades (math score, geography...)
Address (city, street...).
Requirements:
The Profile, Grades and Address each belong to only one Student (i.e. one-to-one).
A Student has to have all Profile, Grades and Address data present (there is no student without grades for example).
Updates can happen to all fields, but the profile data mostly remain untouched.
We access the data based on a Student and not by querying for the address or something else (a query could be "give me the grades of student John", or "give me profile and address of student John", etc).
All fields put together are below the 400 KB item size limit of DynamoDB.
The question is how would you design it? Put all data as a single row/item or split it to Profile, Grades and Address items?
My solution is to keep all data in one row, with studentId as the PK and the rest of the data following in a big set of columns. So one item looks like [studentId, name, birthDate, mathsGrade, geographyGrade, ..., city, street].
I find that this way I can have transactional inserts/updates (with the downside that I always have to work with the full item, of course), and while querying I can ask for just the subset of data needed each time.
On top of the above, this solution fits with two of the most important AWS guidelines about dynamo:
keep everything in a single table and
pre-join data whenever possible.
The reason for my question is that I could only find one topic on stackoverflow about one-to-one modeling in DynamoDB, and the suggested solution (also heavily up-voted) was in favor of keeping the data in separate tables, something that reminds me of a relational-DB kind of design (see the solution here).
I understand that in that context the author tried to keep a more generic use case and probably support more complex queries, but it feels like the option of putting everything together was fully devalued.
For that reason I'd like to open that discussion here and listen to other opinions.
A Basic Implementation
Considering the data and access patterns you've described, I would set up a single student-data table with a partition key that allows me to query by the student, and a sort key that allows me to narrow down my results even further based on the entity I want to access. One way of doing that would be to use some kind of identifier for a student, say studentID, and then something more generalized for the sort key like entityID, or simply SK.
At the application layer, I would classify each Item under one possible entity (profile, grades, address) and store data relevant to that entity in any number of attributes that I would need on that Item.
An example of how that data might look for a student named john smith:
{ studentId: "john", entityId: "profile", firstName: "john", lastName: "smith" }
{ studentId: "john", entityId: "grades", math2045: 96.52, eng1021:89.93 }
{ studentId: "john", entityId: "address", state: "CA", city: "fresno" }
With this schema, all your access patterns are available:
"give me the math grades of student john"
PartitionKey = "john", SortKey = "grades"
and if you store the address within the student's profile entity, you can accomplish "give me profile and address of student John" in one shot (multiple queries should be avoided when possible)
PartitionKey = "john", SortKey = "profile"
Consider
Keep in mind, you need to take into account how frequently you are reading/writing data when designing your table. This is a very rudimentary design, and may need tweaking to ensure that you're not setting yourself up for major cost or performance issues down the road.
The basic idea that this implementation demonstrates is that denormalizing your data (in this case, across the different entities you've established) can be a very powerful way to leverage DynamoDB's speed, and also leave yourself with plenty of ways to access your data efficiently.
Problems & Limitations
Specific to your application, there is one potential problem that stands out: it seems very feasible that the grades Items will start to balloon to the point where they are impossible to manage and become expensive to read/write/update. As you store more and more students, and each student takes more and more courses, your grades entities will expand with them. If the average student takes anywhere from 35-40 classes and gets a grade for each of them, you don't want to have to manage 35-40 attributes on an item if you don't have to. You also may not want back every single grade every time you ask for a student's grades. Maybe you start storing more data on each grade entity, like:
{ math1024Grade: 100, math1024Instructor: "Dr. Jane Doe", math1024Credits: 4 }
Now for each class, you're storing at least 2 extra attributes. That Item with 35-40 attributes just jumped up to 105-120 attributes.
On top of performance and cost issues, your access patterns could start to evolve and become more demanding. You may only want grades from the student's major, or a certain type of class like humanities, sciences, etc, which is currently unavailable. You will only ever be able to get every single grade from each student. You can apply a FilterExpression to your request and remove some of the unwanted Items, but you're still paying for all the data you've read.
With the current solution, we are leaving a lot on the table in terms of optimizations in performance, flexibility, maintainability, and cost.
Optimizations
One way to address the lack of flexibility in your queries, and possible bloating of grades entities, is the concept of a composite sort key.
Using a composite sort key can help you break down your entities even further, making them more manageable to update and providing you more flexibility when you're querying. Additionally, you would wind up with much smaller and more manageable items, and although the number of items you store would increase, you'll save on cost and performance. With more optimized queries, you'll get only the data you need back so you're not paying those extra read units for data you're throwing away. The amount of data a single Query request can return is limited as well, so you may cut down on the amount of roundtrips you are making.
That composite sort key could look something like this, for grades:
{ studentId: "john", entityId: "grades#MATH", math2045: 96.52, math3082:91.34 }
{ studentId: "john", entityId: "grades#ENG", eng1021:89.93, eng2203:93.03 }
Now, you get the ability to say "give me all of John's MATH course grades" while still being able to get all the grades (by using the begins_with operation on the sort key when querying).
If you think you'll want to start storing more course information under grades entities, you can suffix your composite sort key with the course name, number, identifier, etc. Now you can get all of a student's grades, all of a student's grades within a subject, and all the data about a student's grade within a subject, like its instructor, credits, year taken, semester, start date, etc.
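A sketch of what those composite-sort-key queries could look like, reusing the hypothetical student-data table from above:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("student-data")  # hypothetical table name

# All of John's MATH course grades.
math_grades = table.query(
    KeyConditionExpression=Key("studentId").eq("john")
    & Key("entityId").begins_with("grades#MATH")
)["Items"]

# Still able to get every grades item across all subjects.
all_grades = table.query(
    KeyConditionExpression=Key("studentId").eq("john")
    & Key("entityId").begins_with("grades#")
)["Items"]
```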
These optimizations are all possible solutions, but may not fit your application, so again keep that in mind.
Resources
Here are some resources that should help you come up with your own solution, or ways to tweak the ones I've provided above to better suit you.
AWS re:Invent 2019: Data modeling with Amazon DynamoDB (CMY304)
AWS re:Invent 2018: Amazon DynamoDB Deep Dive: Advanced Design Patterns for DynamoDB (DAT401)
Best Practices for Using Sort Keys to Organize Data
NoSQL Design For DynamoDB
And keep this one in mind especially when you are considering cost/performance implications for high-traffic application:
Best Practices for Designing and Using Partition Keys Effectively
I've read the guidelines for secondary indexes but I'm not sure when the ability to search fast outweighs the disadvantage of scanning over attributes. Let me give you an example.
I am saving game progress data for users. The PK is user ID. I need to be able to:
Find out user progress about a particular game.
Get all finished/in progress games for a user.
Thus, I can design my SK as progress_{state} to be able to query all games by progress fast (state represents started/finished) or I can design my SK as progress_{gameId} to be able to query progress of a given game fast. However, I can't have both using just SK. When I chose one, the other operation will require a scan.
Therefore, I was thinking about using LSI which will add an overhead to the whole table as noted by Amazon here:
Every secondary index means more work for DynamoDB. When you add, delete, or replace items in a table that has local secondary indexes, DynamoDB will use additional write capacity units to update the relevant indexes.
I estimate a maximum of thousands of game types, and I wonder whether it's worth using an LSI or whether it's better to use scans for the other operation I choose.
Does anyone have any real experience with such a problem? I was not able to find anything on this topic.
When you are designing DynamoDB tables, the main cost factor comes with IOPS for reads and writes.
This is why avoiding scans is usually better. A scan consumes a significant amount of read IOPS, and this grows with the number of items in the table, since a scan needs to read all the items in the table before returning the matching ones.
Coming back to your use-case of using the SK for progress: it would be better to keep progress as a regular attribute and define a Secondary Index on it, since you will need to update the state later on (which is not possible when it is part of the PK or SK of the table).
So based on your use-case and the information given in the question, you can define the schema as:
PK: UserID
SK: GameID
GSI: Progress (as its partition key)
Query all games by progress fast:
Query the GSI with Progress as the partition key.
Note: if this is for a particular user, you can change it to an LSI on Progress.
Query the progress of a given game fast (assuming it is for a given user):
Query the table using UserID (PK) and GameID (SK).
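Assuming that schema lives in a table called, say, game-progress, with a GSI named ProgressIndex keyed on Progress, the two access patterns could be sketched like this (all names are placeholders):

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("game-progress")  # hypothetical names throughout

# All games in a given state, via the GSI.
finished = table.query(
    IndexName="ProgressIndex",
    KeyConditionExpression=Key("Progress").eq("finished"),
)["Items"]

# Progress of one game for one user, straight from the base table.
item = table.get_item(Key={"UserID": "user-123", "GameID": "game-42"}).get("Item")
```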
I have a customer table in DynamoDB with basic attributes like name, dob, zipcode, email, etc. I want to add another attribute to it which will keep increasing with time. For example, each time the user clicks on a product (item), I want to add that to the record so that I have the full snapshot of the customer's profile in a single value indexed by the customerId. So, my new attribute would be called viewedItems and would be a list of itemIds viewed (along with the timestamp).
However, given the 400 KB item size limit in DynamoDB, it is going to be surpassed over time as I keep adding the clicked products to the customer profile.
How can I best define my objects so as to perform the following?
Access the full profile of the customer by customerId, including the views.
Access a time-filtered profile of the customer (like all interactions in the last N days), in which case the viewed items should be filtered by the given time range.
Scan the entire table with a time filter on viewedItems.
The query needs to be performant as the profile could be pulled at request time.
Ability to update individual customer record (via a batch job, for example, that updates each customer's record if need be).
One way to do this would be to create a different table (say customer_viewed_items) with hash key customerId and a range key timestamp with value being the itemId that the customer viewed. But this looks like an increasingly complicated schema - not to mention twice the cost involved in accessing the item. If I have to create another attribute based on (say) "bought" items, then I'll need to create another table. So, the solution I have in mind does not seem good to me.
Would really appreciate if you could help suggest a better schema/approach.
Since you really don't know how many items will be viewed by a user (edge case: a user opens all items sequentially, multiple times), you cannot store this information in a single DynamoDB record.
The only solution is to normalize your database and create a separate table like you've described.
Now, the next question: how do you minimize retrieval cost in such a scheme? Usually you don't need to fetch all viewed items; you probably want to display just some of them, so you only need to fetch the last X.
You can cache such items in the main customer table, i.e. create a field "lastXviewedItems" and update it so that it contains only a limited number of items without breaking the size limit. Of course, for BI analysis you will still have to store them in the second table too.
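A rough sketch of that two-table approach (a customer_viewed_items table keyed on customerId plus a timestamp, and a cached lastXviewedItems list on the customer item). Table and attribute names are assumptions, and the read-modify-write on the cache would need a condition expression if concurrent updates are possible:

```python
import time
import boto3
from boto3.dynamodb.conditions import Key

ddb = boto3.resource("dynamodb")
views = ddb.Table("customer_viewed_items")  # hash: customerId, range: viewedAt
customers = ddb.Table("customers")          # hash: customerId

def record_view(customer_id, item_id, keep_last=20):
    now = int(time.time())
    # One immutable record per view goes to the detail table.
    views.put_item(Item={"customerId": customer_id, "viewedAt": now, "itemId": item_id})
    # Cache only the most recent views on the customer item, trimmed to keep_last.
    customer = customers.get_item(Key={"customerId": customer_id}).get("Item", {})
    recent = ([{"itemId": item_id, "viewedAt": now}] + customer.get("lastXviewedItems", []))[:keep_last]
    customers.update_item(
        Key={"customerId": customer_id},
        UpdateExpression="SET lastXviewedItems = :v",
        ExpressionAttributeValues={":v": recent},
    )

def views_since(customer_id, since_epoch):
    # Time-filtered reads hit the detail table with a range-key condition.
    return views.query(
        KeyConditionExpression=Key("customerId").eq(customer_id) & Key("viewedAt").gte(since_epoch)
    )["Items"]
```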
I'm building an app where two users can connect with each other and I need to store that connection (e.g. a friendship) in a DynamoDB table. Basically, the connection table has two fields:
userIdA (hash key)
userIdB (sort key)
I was thinking of adding an index on userIdB to query on both fields. Should I store a connection with one record (ALICE, BOB) or two records (ALICE, BOB; BOB, ALICE)? The first option needs one write operation and less space, but I have to query twice to get all connections of a user. The second option needs two write operations and more space, but I only have to query once per userId.
The user table has details like name and email:
userId (hash key)
name (sort key)
email
In my app, I want to show all connections of a certain user with user details in a listview. That means I have two options:
Store the user details of the connected users also in the connection table, e.g. add two name fields to that table. This is fast, but if the user name changes (name and email are retrieved from Facebook), the details are invalid and I need to update all entries.
Query the user details of each userId with a Batch Get request to read multiple items. This may be slower, but I always have up to date user details and don't need to store them in the connection table.
So what is the better solution, or are there any other advantages/disadvantages that I may have overlooked?
EDIT
After some google research regarding friendship tables with NoSQL databases, I found the following two links:
How does Facebook maintain a list of friends for each user? Does it maintain a separate table for each user?
NoSQL Design Patterns for Relational Data
The first link suggests storing the connection (or friendship) in both directions, with two records, because it makes it easier and faster to query:
Connections:
1 userIdA userIdB
2 userIdB userIdA
The second link suggests saving a subset of duplicated data (a “summary”) into the tables so it can be read faster with just one query. That would mean saving the user details into the connection table as well, and saving the userIds into an attribute of the user table:
Connections:

| # | userIdA | userIdB | userDetails                    | status    |
|---|---------|---------|--------------------------------|-----------|
| 1 | 123     | 456     | { userId: 456, name: "Bob" }   | connected |
| 2 | 456     | 123     | { userId: 123, name: "Alice" } | connected |

Users:

| # | userId | name  | connections |
|---|--------|-------|-------------|
| 1 | 123    | Alice | { 456 }     |
| 2 | 456    | Bob   | { 123 }     |
This database model makes it pretty easy to query connections, but it seems difficult to update if some user details change. Also, I'm not sure if I need the userIds within the user table again, because I can easily query on a userId.
What do you think about that database model?
In general, nosql databases are often combined with a couple of assumptions:
Eventual consistency is acceptable. That is, it's often acceptable in application design if, during an update, some of the intermediate answers aren't right. For example, it might be fine if, for a few seconds while Alice is becoming Bob's friend, "Is Alice Bob's friend?" returns true while "Is Bob Alice's friend?" returns false.
Performance is important. If you're using nosql it's generally because performance matters to you. It's also almost certainly because you care about the performance of operations that happen most commonly. (It's possible that you have a problem where the performance of some uncommon operation is so bad that you can't do it; nosql is not generally the answer in that situation)
You're willing to make uncommon operations slower to improve the performance of common operations.
So, how does that apply to your question? First, it suggests that ultimately the answer depends on performance. That is, no matter what people say here, the right answer depends on what you observe in practice. You can try multiple options and see what results you get.
With regard to the specific options you enumerated.
Assuming that performance is enough of a concern that nosql is a reasonable solution for your application, it's almost certainly query rather than update performance you care about. You probably will be happy if you make updates slower and more expensive so that queries can be faster. That's kind of the whole point.
You can likely handle updates out of band--that is, eventual consistency likely works for you. You could submit update operations to an SQS queue rather than handling them during your page load. So if someone clicks a confirm-friend button, you could queue a request to actually update your database. It is OK even if that involves rebuilding their user row, rebuilding the friend rows, and even updating some counts about how many friends they have.
It probably does make sense to store a friend row in each direction so you only need one query.
It probably does make sense to store the user information like Name and picture that you typically display in a friend list duplicated in the friendship rows. Note that whenever the name or picture changes you'll need to go update all those rows.
It's less clear that storing the friends in the user table makes sense. That could get big. Also, it could be tricky to guarantee eventual consistency. Consider what happens if you are processing updates to two users' friendships at the same time. It's very important that you not end up with inconsistency once all the dust has settled.
Whenever you have non-normalized data such as duplicating rows in each direction, or copying user info into friendship tables, you want some way to revalidate and fix your data. You want to write code that in the background can go scan your system for inconsistencies caused by bugs or crashed activities and fix them.
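To make the "row in each direction" idea concrete, here is a minimal sketch with made-up table and attribute names; note that the batch write is not atomic, which is exactly the eventual-consistency window and background-repair situation described above:

```python
import boto3

connections = boto3.resource("dynamodb").Table("connections")  # hash: userIdA, range: userIdB

def befriend(user_a, user_b, name_a, name_b):
    # Write the friendship in both directions so one query per user returns the whole list,
    # and denormalize the friend's display name onto each row for fast rendering.
    with connections.batch_writer() as batch:
        batch.put_item(Item={"userIdA": user_a, "userIdB": user_b,
                             "friendName": name_b, "status": "connected"})
        batch.put_item(Item={"userIdA": user_b, "userIdB": user_a,
                             "friendName": name_a, "status": "connected"})
```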
I suggest you have the following fields in the table:
userId (hash key)
name (sort key)
email
connections (Comma separated or an array of userId assuming you have multiple connections for a user)
This structure can ensure consistency across your data.
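If you go this route, DynamoDB's set type lets you append a connection without reading the item first; a minimal sketch, assuming a users table keyed on userId and name as proposed above:

```python
import boto3

users = boto3.resource("dynamodb").Table("users")  # hash: userId, range: name (as proposed)

# ADD on a string set creates the set if absent and ignores duplicates.
# The #c alias just sidesteps any reserved-word clash with the attribute name.
users.update_item(
    Key={"userId": "123", "name": "Alice"},
    UpdateExpression="ADD #c :c",
    ExpressionAttributeNames={"#c": "connections"},
    ExpressionAttributeValues={":c": {"456"}},
)
```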
I am creating a database for an achievement system (like something you would see in a Blizzard game). I would like to have a GUI that displays the current progress of all achievements in the game which means I will need to query the progress of all achievements for a user in order to populate the GUI. I plan on having somewhere around 100 achievements.
This brings about a design question. What is the best way to design the database and querying code to query the progress of ~100 bit fields?
It seems like the brute force method would be to get the entire row of achievements and then for each field in the row do some hardcoded string comparison to determine which achievement we are dealing with.
Another possible solution may be to have a big switch statement based on the column index of the table and handle each achievement for each case (requires not modifying the table or you have to refactor a lot of C++ code).
I'm curious to hear any other designs you guys may have for this.
Thanks!
I suggest building a solution using 3 tables: users, achievements and user_achievements. A user would be identified by a u_id in the users table. An achievement would be identified by an a_id in the achievements table. You would then keep track of users' achievements by inserting a row in the user_achievements table that includes a u_id to identify the user and an a_id to identify the achievement. The user_achievements table would also contain a column that specifies the % completion of that achievement for the given user.
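A minimal sketch of that three-table layout (SQLite is used here purely for illustration, and the column names are assumptions):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users        (u_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE achievements (a_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE user_achievements (
    u_id INTEGER REFERENCES users(u_id),
    a_id INTEGER REFERENCES achievements(a_id),
    completion REAL NOT NULL DEFAULT 0,  -- % progress toward this achievement
    PRIMARY KEY (u_id, a_id)
);
""")

# All achievement progress for one user, ready to populate the GUI.
rows = db.execute("""
    SELECT a.name, ua.completion
    FROM user_achievements ua JOIN achievements a ON a.a_id = ua.a_id
    WHERE ua.u_id = ?
""", (1,)).fetchall()
```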
Came across this question and even though it's 5 years old, perhaps someone would be interested in following approach.
Achievements are usually broken down into numbers (the rest, like the Name and Description of each achievement, can be kept in the site/app core to avoid bloating the DB).
Let's keep it simple: we are not FB and don't need a separate table for them, so in the "users" table we add just one column, "Achievements", a varchar(50). The number in brackets (50) will depend on your actual needs for this column (i.e. how much data it stores).
So you end up with a numerical sequence in each cell of the Achievements column: 10982039482084109384
Read this line of digits from left to right: the user has reached "1098 profile views", received "2039 likes", etc. Optionally, add a separator for easier distinction and to handle cases where the user first had 25 likes, then 125, then 2039 (2 digits, 3 digits, 4 digits; another alternative is to use 0025, then 0125, then 2039, given you know the maximum is 4 digits per achievement). But let's say we decide to use a separator, i.e. a comma:
1098,2039,4820,8410,9384
Then, once you need the data, just SELECT the Achievements value belonging to a specific userID and (if you added a separator) split it:
explode(',', $achievements)
Then your site's PHP core knows that the first value (1098) stands for "profile views", and let's say this means the user has a level 10 badge for profile views (1 badge level per 100 views).
From there, you can easily do operations with no further need for SQL queries. For example, if the user wants to know his progress toward a level 20 badge, you display: he has 1098/2000 (about 55%) progress.
At that point, the achievement Description, Name and level information is stored in the site core, while the percentage is calculated on the fly.
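A short sketch of that parsing and progress math (shown in Python for brevity; the field names and thresholds are assumptions):

```python
# Order of the packed values is fixed by the application core.
FIELDS = ["profile_views", "likes", "comments", "posts", "shares"]  # hypothetical names

def parse_achievements(packed):
    """Turn '1098,2039,4820,8410,9384' into a dict of counters."""
    return dict(zip(FIELDS, (int(v) for v in packed.split(","))))

def badge_progress(count, per_level=100, target_level=20):
    """1098 views at 100 views per level, aiming at level 20 -> 1098/2000 (~55%)."""
    needed = per_level * target_level
    return count, needed, round(100 * count / needed)

stats = parse_achievements("1098,2039,4820,8410,9384")
print(badge_progress(stats["profile_views"]))  # (1098, 2000, 55)
```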
Hope the logic is clear and that it may be useful to anyone in the community out there.
Cheers!