I am doing my best to understand DynamoDB data modeling, but I am struggling. I am looking for some help to build off what I have now. I feel like I have fairly simple data, but I can't see how to fit it into DynamoDB.
I have two different types of data: a Game object and a TeamStats object. A Game represents all of the data about the game that week, and TeamStats represents all of the stats about a given team per week.
A timeId is in the format of year-week (ex. 2020-9)
My Access patterns are
1) Retrieve all games per timeId
2) Retrieve all games per timeId and by TeamName
3) Retrieve all games per timeId and if value = true
4) Retrieve all teamStats per timeId
5) Retrieve all teamStats by timeId and TeamName
My attempt at modeling so far is:
PK: TeamName
SK: TimeId
This is leading me to have 2 copies of games, since there is a copy for each team. It also only allows me to scan for all teamStats by TimeId. Would something like a GSI help here? I've thought about maybe changing the PK to something like
PK: GA-${gameId} / TS-${teamId}
SK: TimeId
I'm just very confused, and the docs aren't helping me much.
Looking at your access patterns, this is a possible table design. I'm not sure if it's really going to work with your TimeId, especially for the Local Secondary Index (see note below), but I hope it's a good starting point for you.
# Table
-----------------------------------------------------------------
pk       | sk                   | value | other attributes
-----------------------------------------------------------------
TimeId   | GAME#TEAM{teamname}  | true  | ...
TimeId   | STATS#TEAM{teamname} |       | ...
GameId   | GAME                 |       | general game data (*)
TeamName | TEAM                 |       | general team data (*)
# Local Secondary Index
-------------------------------------------------------------------------------
pk from Table as pk | value from Table as sk | sk from Table + other attributes
-------------------------------------------------------------------------------
TimeId              | true                   | GAME#TEAM{teamname} | ...
With this Table and Local Secondary Index you can satisfy all access patterns with the following queries:
Retrieve all games per timeId:
Query table with pk: {timeId}, sk: begins_with 'GAME' (pk alone would also return the STATS items)
Retrieve all games per timeId and by TeamName:
Query table with pk: {timeId}, sk: GAME#TEAM{teamname}
Retrieve all games per timeId and if value = true:
Query LSI with pk: {timeId}, sk: true
Retrieve all teamStats per timeId:
Query table with pk: {timeId}, sk: begins_with 'STATS'
Retrieve all teamStats by timeId and TeamName:
Query table with pk: {timeId}, sk: STATS#TEAM{teamname}
*: I've also added the following two items, as I assume that there are cases where you want to retrieve general information about a specific game or team as well. This is just an assumption based on my experience and might be unnecessary in your case:
Retrieve general game information:
Query table with pk: {GameId}
Retrieve general team information:
Query table with pk: {TeamName}
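For reference, here is what one of these queries could look like in code. This is a minimal sketch using the AWS SDK for JavaScript DocumentClient; the table name "GamesTable" is my assumption:

const AWS = require('aws-sdk');
const ddb = new AWS.DynamoDB.DocumentClient();

// Access patterns 4 and 5: all teamStats for a timeId, optionally
// narrowed to one team by extending the sk prefix.
ddb.query({
  TableName: 'GamesTable',
  KeyConditionExpression: 'pk = :timeId AND begins_with(sk, :prefix)',
  ExpressionAttributeValues: {
    ':timeId': '2020-9',
    ':prefix': 'STATS'  // use 'STATS#TEAM{teamname}' for a single team
  }
}).promise().then(res => console.log(res.Items));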
Note: I don't know what value = true stands for, but for the secondary index to work in my model, you need to make sure that each combination of pk = TimeId and value = true is unique. Also keep in mind that index key attributes must be of type string, number, or binary, so value would have to be stored as, e.g., the string 'true' rather than a Boolean.
To learn more about single-table design on DynamoDB, please read Alex DeBrie's excellent article The What, Why, and When of Single-Table Design with DynamoDB.
I have a DynamoDB table with users and friends. The schema looks like below. Here user 1 (tom) and user 2 (bob) are friends.
+--------+---------+----------+
| PK | SK | UserName |
+--------+---------+----------+
| USER#1 | USER#1 | tom |
| USER#2 | USER#2 | bob |
| USER#3 | USER#3 | rob |
| FRD#1 | USER#2 | |
| FRD#2 | USER#1 | |
+--------+---------+----------+
Is it possible to get the names of user 1's (tom's) friends in a single query?
If not, what is an efficient way to query?
Any help would be really appreciated.
What I am doing currently is:
Step 1: Get all friends of user 1.
let frdParams = {
  TableName: "TABLE_NAME",
  IndexName: "SK-PK-index",
  KeyConditionExpression: "SK = :userId AND begins_with(PK, :friend)",
  ExpressionAttributeValues: {
    ":userId": {S: userId},  // e.g. "USER#1"
    ":friend": {S: "FRD#"}   // prefix must match the FRD# keys in the table
  }
};
const frdRes = await ddb.query(frdParams).promise();
Step 2: Once I have all the friend records, I run more queries in a loop.
for (const record of frdRes.Items) {
  let recordX = aws.DynamoDB.Converter.unmarshall(record);
  let friendId = recordX.PK.replace("FRD", "USER");
  let userParams = {
    TableName: "TABLE_NAME",
    KeyConditionExpression: "PK = :userId AND SK = :userId",
    ExpressionAttributeValues: {
      ":userId": {S: friendId}
    }
  };
  const userRes = await ddb.query(userParams).promise();
}
Data modeling in DynamoDB requires a different mindset than one might use when working with SQL databases. To get the most out of DynamoDB, you need to consider your application's access patterns and store your data in a way that supports those use cases.
It sounds like your access pattern is "fetch friends by user id". There are many ways to implement this access pattern, but I'll give you a few ideas of how it might be achieved.
Idea 1: Denormalize Your Data
You could create a list attribute on each user item and store that user's friends list there. This would make fetching friends by user super simple!
As with any access pattern, there are limitations with this approach. DynamoDB items have a maximum size of 400KB, so you'd be limited to a friends list that fits within that size. Also, you will not be able to perform queries based on the values of this attribute, so it would not support additional access patterns. But, it's super simple!
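As a rough illustration, a denormalized user item might look like this (the Friends attribute name is an assumption, not something from your table):

const userItem = {
  PK: "USER#1",
  SK: "USER#1",
  UserName: "tom",
  Friends: ["USER#2", "USER#3"]  // the whole list must fit in the 400KB item limit
};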
Idea 2: Build an item collection, storing friends within the USER#<id> partition.
This is a typical pattern to represent one-to-many relationships in DynamoDB. Let's say you define friendships with a PK of USER#<user_id> and an SK of FRIEND#<friend_id>. Your table would look like this:
+--------+-----------+----------+
| PK     | SK        | UserName |
+--------+-----------+----------+
| USER#1 | USER#1    | tom      |
| USER#1 | FRIEND#2  |          |
| USER#1 | FRIEND#3  |          |
+--------+-----------+----------+
You could fetch the friends of a given user by querying the user's partition for sort keys that begin with FRIEND, as in the sketch below.
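A minimal sketch of that query with the AWS SDK DocumentClient (the table name "FriendsTable" is an assumption):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Fetch all FRIEND# items in USER#1's partition.
docClient.query({
  TableName: 'FriendsTable',
  KeyConditionExpression: 'PK = :userId AND begins_with(SK, :prefix)',
  ExpressionAttributeValues: {
    ':userId': 'USER#1',
    ':prefix': 'FRIEND#'
  }
}).promise().then(res => console.log(res.Items));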
These are just two ideas, and there are many more (and likely better) ways to model friendships in DynamoDB. The examples I've given treat the relationship as one-to-many (one user has many friends). What's more likely is that you'd have a many-to-many relationship to model, which can be tricky in DynamoDB (and another topic altogether!)
If many-to-many sounds like what you have, AWS has an article describing modeling many-to-many relationships that may prove a good starting point.
I have worked out a fairly complex chain of DynamoDB resolvers on a GraphQL AppSync query. What I am curious to know is if I could have possibly designed this in a way to require fewer DynamoDB queries.
Here is my GraphQL Schema:
type Tag {
  PartitionKey: ID!
  SortKey: ID!
  TagName: String!
  TagType: String
}

type Model {
  PartitionKey: ID!
  Name: String
  Version: Int
  FBX: String
  # ms since epoch
  CreatedAt: AWSTimestamp
  Description: String
  Tags: [String]
}

type Query {
  GetAllModels(count: Int, nextToken: String): PaginatedModels!
}
This is the query that I am doing:
query GetAllModels {
  GetAllModels {
    Models {
      PartitionKey
      Name
      Version
      CreatedAt
      Description
      Tags {
        TagName
        TagType
      }
    }
  }
}
My DynamoDB table is set up like so:
PartitionKey | SortKey   | TagName | TagType | ModelName | Description
Model-0      | Model-0   |         |         | ModelZero | Blah Blah
Model-0      | Tag-Pine  |         |         |           |
Model-0      | Tag-Apple |         |         |           |
Tag-Pine     | Tag-Pine  | Pine    | Tree    |           |
Tag-Apple    | Tag-Apple | Apple   | Fruit   |           |
So in my resolvers I am doing the following:
GetAllModels will scan with two filters: one filter for PartitionKey beginning with 'Model-' and another filter for SortKey beginning with 'Model-'. This is to get all Models.
Next, there is a resolver attached to 'Tags' in the Model object. This will query with two expressions: one for PartitionKey = source.PartitionKey and a second for SortKey begins_with 'Tag-'. This gets me all of the tags on a model.
Next, there are two resolvers on the Tag object, one on TagName and another on TagType. These do a direct GetItem to get their appropriate value, with PartitionKey = source.SortKey and SortKey = source.SortKey set as the keys.
So each scanned Model ends up firing off 3 more queries to DynamoDB. This just seems a bit excessive to me. But I cannot see any other way to do this. Is there some way to be able to get both TagName and TagType in one query?
Is there a better way to approach this?
I see a few things that I would personally change. The first is that I would avoid the nested DynamoDB scan operations. At least one of these can be replaced with a much faster query operation. The second is that I would consider rethinking how you are storing the data. Currently, there is no good way to list model objects.
Why is there no good way to list model objects?
Assuming each model object will have multiple tags, you are going to have a table that is sparsely populated by model objects, i.e., out of 100 rows you may have 20-50 models depending on how many tags the average model has. In DynamoDB, a table is split up based on the partition key, causing rows that share the same partition key to be stored near each other to speed up query operations. With your setup, where the Partition Key is essentially the unique id of a single model object, this means that we can easily get a single model object. You can also quickly get the tags for a single object, since those records are nearby as well.
The issue.
The DynamoDB scan operation looks at each partition one at a time, reads as many records as the request's limit allows (or all of them if the limit is sufficiently large), and then, only after reading the records from the individual partitions, applies the filter expression before returning the final result. This means you may ask for the first 10 models but, since the limit is applied before the scan filter, you may very well get back only 1 model (if that one model had 9 or more tags, which would exhaust the limit while DynamoDB was reading the first partition). This may seem strange when coming from many other database systems and is an important consideration of DynamoDB's design.
Here are two solutions to address this concern:
1. Store Models in one table and Tags in another.
NoSQL databases like DynamoDB allow you to store many types of data in the same table, but there is nothing wrong with splitting them out. Traditionally it can be a pain to work with multiple tables in a NoSQL database that lacks a join operation or something similar, but fortunately for us we can use GraphQL to "join" data for us. With this approach, the Model table has a single partition key named "id", and your GetAllModels resolver is still a scan, but this time on the Model table. This way the table is not sparse, and you will get 10 models when you ask for 10 models. The Tag table should have a partition key of modelId and a sort key of tagId. You would then have a resolver on the Model.tags field that does a query against the Tag table and looks for rows with modelId == $ctx.source.id, as sketched below.
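For illustration, here is what the Model.tags lookup could look like, expressed with the AWS SDK for JavaScript rather than an AppSync mapping template; the table name "TagTable" is an assumption:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// All tags for one model, straight off the Tag table's partition key.
docClient.query({
  TableName: 'TagTable',
  KeyConditionExpression: 'modelId = :modelId',
  ExpressionAttributeValues: { ':modelId': 'Model-0' }
}).promise().then(res => console.log(res.Items));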
This is essentially how @model and @connection work in the new GraphQL transform tooling launched as part of the Amplify CLI. You can see more here, although the docs are, as of writing, still being improved: https://aws-amplify.github.io/amplify-js/media/api_guide
2. Store Models and Tags in the same table but change the key structure.
This approach works if you can reliably say that you will have less than 10GB of data per data type (e.g. Model & Tag). For this approach you have a single table with a Partition Key of Type and a Sort Key of id. When you create objects, you create them with a Type (e.g. "Tag" or "Model") and a unique id (like a UUID). To list objects of the same type, you do a DynamoDB query operation on the partition key of the type to list, e.g. "Tag" or "Model". You can then use GSIs to efficiently look up related objects. In your case you would store a "modelId" in every Tag object. You would then make a GSI using the "modelId" as the Partition Key. To list all the tags for a given model, you could then do a DynamoDB query operation against that GSI, as sketched below.
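A minimal sketch of both lookups under this layout; the table name "AppTable" and index name "modelId-index" are assumptions, and "Type" is a DynamoDB reserved word, hence the expression attribute name:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function listModelsAndTags() {
  // List all models by querying the "Model" type partition.
  const models = await docClient.query({
    TableName: 'AppTable',
    KeyConditionExpression: '#t = :type',
    ExpressionAttributeNames: { '#t': 'Type' },
    ExpressionAttributeValues: { ':type': 'Model' }
  }).promise();

  // List all tags for the first model through the assumed GSI.
  const tags = await docClient.query({
    TableName: 'AppTable',
    IndexName: 'modelId-index',
    KeyConditionExpression: 'modelId = :id',
    ExpressionAttributeValues: { ':id': models.Items[0].id }
  }).promise();

  return { models: models.Items, tags: tags.Items };
}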
I'm sure there are many more ways to do this but hopefully this helps point in the right direction.
This is a long-winded question about Cassandra schema design. I'm here to get input from you respected experts on a use case I'm working on. All inputs, suggestions, and critiques are welcome. Here goes my question.
We would like to collect REVIEWS from our USERS about some PAPERS we are about to publish. For each paper we seek 3 reviews, but we send out review invites to 3 * 2 = 6 users. All 6 users can submit their reviews to our system, but only the first 3 count, and these first 3 reviewers will be rewarded for their work.
In our Cassandra DB, there are three tables: USER, PAPER, and REVIEW. The USER and PAPER tables are simple: each user corresponds to a row in the USER table with a unique USER_ID; similarly, each paper has a unique PAPER_ID in the PAPER table.
The REVIEW table looks like this
CREATE TABLE REVIEW(
PAPER_ID uuid,
USER_ID uuid,
REVIEW_CONTENT text,
PRIMARY KEY(PAPER_ID, USER_ID)
);
We use PAPER_ID as the partition key of the REVIEW table so that all reviews of a given paper are stored in a single partition. For each paper we have, we pick 6 users, insert 6 entries into the REVIEW table, and send out 6 invites to those users. So, for paper "P1", there are 6 entries in the REVIEW table that look like this:
----------------------------------------------------
PAPER_ID | USER_ID | REVIEW_CONTENT |
----------------------------------------------------
P1 | U1 | null |
----------------------------------------------------
P1 | U2 | null |
----------------------------------------------------
P1 | U3 | null |
----------------------------------------------------
P1 | U4 | null |
----------------------------------------------------
P1 | U5 | null |
----------------------------------------------------
P1 | U6 | This paper ... |
---------------------------------------------------
... | ... | ... |
Users submit reviews via a web browser over HTTP. At the backend, we use the following process to handle submitted reviews (using paper "P1" as an example):
1. Use partition key "P1" to get all 6 entries out of the REVIEW table.
2. Find out how many of these 6 entries have non-null values in the REVIEW_CONTENT column (a non-null value indicates that the corresponding user has already submitted his review; for example, in the above table, user "U6" has submitted his review, while the other 5 have not yet).
3. If this number >= 3, we already have enough reviews; return to the current reviewer with a message like "Thanks, we already have enough reviews."
4. If this number < 2, save the current review to the corresponding entry in the REVIEW table and return to the reviewer with a message like "Your review has been accepted." (E.g. if the current reviewer is "U1", then fill the REVIEW_CONTENT column of the "P1, U1" entry with the current review content.)
5. If this number = 2, this is the most complicated case, as the current submission is the last one we'll accept. In this case, we first save the current review to the REVIEW table, then we find the ids of all three users that have submitted reviews (including the current user) and record their ids into a transaction table to pay them rewards later.
But this process does not work: it does not handle concurrent submissions correctly. Consider the following case: two users have already submitted their reviews, and meanwhile 3 other users are submitting their reviews via three concurrent executions of the process shown above. At step 5, each of the three will think he is the 3rd and last submitter and insert new records into the transaction table. This leads to double counting: a single user may be rewarded more than once for the same review he submitted.
Another problem with this process is that it may never reach step 5. Let's say there are no submissions in the REVIEW table, and 4 users submit their reviews at the same time. All of them save their reviews at step 4. After this, later submitters will always be rejected, as there are 4 accepted reviews already. But since we never reach step 5, no ids will be recorded in the transaction table, and the users will never get any rewards.
So here comes my question: how should I handle my use case using Cassandra as the back-end DB? Will a Cassandra COUNTER help? If so, how? I have not thought through how to use COUNTER yet, but this blog (http://aphyr.com/posts/294-call-me-maybe-cassandra) warned that Cassandra counters are not safe (quote: "Consequently, Cassandra counters will over- or under-count by a wide range during a network partition."). Will Cassandra's Compare and Set (CAS) feature help? If so, how? Again, the same blog warned that "Cassandra lightweight transactions are not even close to correct."
Rather than creating empty entries in your review table, I would consider leaving it empty and only filling it as the reviews are submitted. To handle concurrency, add a timeuuid field as a clustering key:
CREATE TABLE review(
paper_id uuid,
submission_time timeuuid,
user_id uuid,
content text,
PRIMARY KEY (paper_id, submission_time)
);
When a user makes their submission, add the entry to the table. Then, AFTER the write is successful, query the table (on only the paper_id) and find out whether the user's id is one of the first three. Respond to the user accordingly, as in the sketch below. Since you're committed to a small set of reviewers, the extra overhead of fetching all the reviews should be minimal (especially since you wouldn't need to include the content column in the query).
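A minimal sketch of that flow, assuming a Node.js backend using the cassandra-driver package (the contact point, data center, and keyspace names are placeholders):

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'reviews'
});

async function submitReview(paperId, userId, content) {
  // 1. Write first; the timeuuid orders submissions within the partition.
  const submissionTime = cassandra.types.TimeUuid.now();
  await client.execute(
    'INSERT INTO review (paper_id, submission_time, user_id, content) VALUES (?, ?, ?, ?)',
    [paperId, submissionTime, userId, content],
    { prepare: true }
  );

  // 2. AFTER the write succeeds, read the partition back in clustering
  //    order and check whether this user is among the first three.
  const result = await client.execute(
    'SELECT submission_time, user_id FROM review WHERE paper_id = ?',
    [paperId],
    { prepare: true }
  );
  const firstThree = result.rows.slice(0, 3).map(r => r.user_id.toString());
  return firstThree.includes(userId.toString());
}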
If you need to track who's reviewing the papers, add a set of user ids to the paper table and write the six user ids there.
I want to be able to record actions a logged-in user does... persists / updates etc.
I have set up discriminators etc. and it works perfectly; however, it only records newly persisted data...
So I have info in a table called user_actions:
1 - Added a new customer,
2 - Added a new memo
etc.
However, it doesn't record any updates to entities in my DB...
such as 1 - Updated user - id 1
...
I am thinking of dumping the discriminator superclass and using the old way to record... like creating a table with the fields:
id | action type | description | user ID | date
I'm not sure. What is the best way to log all transactions in Doctrine 2.1?
Thanks
Have you considered HasLifecycleCallbacks? You can track not only PostPersist but also PostUpdate and PostRemove (or even the Pre* events).
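For example, a minimal sketch with annotations (the entity and callback names are made up for illustration):

<?php
/**
 * @Entity
 * @HasLifecycleCallbacks
 */
class Customer
{
    /** @PostPersist */
    public function logCreated()
    {
        // write an "Added a new customer" row to user_actions here
    }

    /** @PostUpdate */
    public function logUpdated()
    {
        // write an "Updated customer - id X" row to user_actions here
    }
}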
I need to get entries from the database with comment counts. Can I do it with Django's comment framework? I am also using a voting application which does not use GenericForeignKeys; I get entries with scores like this:
from django.db import models
from django.db.models import Sum

class EntryManager(models.Manager):
    def get_queryset(self):
        return super(EntryManager, self).get_queryset().annotate(
            score=Sum("linkvote__value"))
But when generic foreign keys are involved, I am stuck. Do you have any ideas about that?
Extra explanation: I need to fetch entries like this:
id | body | vote_score | comment_score |
1  | foo  | 13         | 4             |
2  | bar  | 4          | 1             |
After doing that, I can order them by comment_score. :)
Thanks for all replies.
Apparently, annotating with reverse generic relations (or extra filters, in general) is still an open ticket (see also the corresponding documentation). Until this is resolved, I would suggest using raw SQL in an extra query, like this:
return super(EntryManager, self).get_queryset().annotate(
    vote_score=Sum("linkvote__value")).extra(select={
        'comment_score': """SELECT COUNT(*) FROM comments_comment
                            WHERE comments_comment.object_pk = yourapp_entry.id
                            AND comments_comment.content_type_id = %s"""
    }, select_params=(entry_type.id,))
Of course, you have to fill in the correct table names. Furthermore, entry_type is a "constant" that can be set outside your lookup function (see ContentTypeManager):
from django.contrib.contenttypes.models import ContentType
entry_type = ContentType.objects.get_for_model(Entry)
This is assuming you have a single model Entry that you want to calculate your scores on. Otherwise, things would get slightly more complicated: you would need a sub-query to fetch the content type id for the type of each annotated object.