DynamoDB GSI data modelling for an articles app

I want to create an articles application using serverless (AWS Lambda + DynamoDB + S3 for hosting the FE).
I have some questions regarding the "1 table approach".
The actions I want to follow:
1. Get the latest (6) articles sorted by date
2. Get an article by id
3. Get the prev/next article relative to the article opened (based on creation date)
4. Get related articles by tags
5. Get comments by article
I have created an initial spreadsheet for the information:
The first problem I have is with action no. 1: I cannot get all the articles by date. I've made the SK for articles a date, but because each article has its own PK (article-1, article-2, and so on), I don't know how to fetch all the articles by SK alone.
I then tried creating an LSI, but noticed that an LSI must have the same PK as the table. So I can select on the LSI with type = 'ARTICLE', but I still cannot select them ordered by date (the entities_sort value).
I know AWS says it's good for the PK to be unique, but then how do you group the data in this case?
I've created a GSI
This helps me get articles by type (GSI2PK) = 'ARTICLE' sorted by entities_sort (GSI2SK), but isn't there a better way of achieving this? Can I have the articles as the PK in a table and still get them sorted by date?
With GSI1PK and GSI1SK set up this way, I can get all the comments for an article using a reverse lookup, so that's good.
But I still don't know how to implement number 3, getting the prev/next article relative to the opened article (based on creation date): get an article by id, check its creation date (entities_sort), then somehow get the articles immediately before and after that creation date. Is there a function in DynamoDB that can do this for me?
In my approach I try to query/process as few items as possible, so I don't want to use filter expressions; I would rather partition my information.
My question is: how should I achieve 1 and 3? And isn't creating two GSIs for so few actions overkill?
What is the pattern for keeping articles under a unique PK per id while still being able to get them sorted by creation date?
Thank you

So what I've ended up doing is:
My access patterns in detail are:
Get any Article by Id (for edit/delete)
Get any Comment by Id (for edit/delete)
Get any Tag by Id (for edit/delete)
Get all Articles ordered by date
Get all the Tags of an Article
Get all comments for an article, sorted by date
Get all Articles that have a specific tag, ordered by date (because I want to show only the last 3 ones)
This is the way I've implemented my model, and I can get all the information I need.
Also, all my data is partitioned and the queries are really efficient: I always get exactly what I need, and the ScannedDocuments value is always the number of returned objects.
The Global Secondary Index helps me query by Article Id, and I get all the comments and tags of that Article.
I've solved the many-to-many relation between Tags and Articles by adding a new record type at the end:
tag_id, article_date, arct_id, tag_id
So, if I want all articles that have a specific tag, sorted by date, I can query the table by PK and sort by SK. If I want to get a single Tag (for edit/delete), I can use the GSI with article_id and tag_id, which gives me the relation between them.
To get all Articles sorted by date, I query PK: ARTICLE, with an optional condition on the SK if I only want the ones after a given date.
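As a rough sketch of that query (and of the prev/next lookup from the original question), the request parameters could be built like this. The table name `articles`, the key attribute names `pk` and `sk`, and the helper names are assumptions for illustration only, as is storing the date as an ISO-8601 string so that string order matches time order. The dictionaries are shaped for the low-level boto3 client, e.g. `boto3.client("dynamodb").query(**params)`.

```python
from typing import Any, Dict, Optional

TABLE_NAME = "articles"  # hypothetical table name
PK, SK = "pk", "sk"      # hypothetical key attribute names


def articles_query(limit: int, newest_first: bool = True,
                   after: Optional[str] = None,
                   before: Optional[str] = None) -> Dict[str, Any]:
    """Build Query parameters for the ARTICLE partition, ordered by date (SK)."""
    names = {"#pk": PK}
    values: Dict[str, Any] = {":pk": {"S": "ARTICLE"}}
    condition = "#pk = :pk"
    # A key condition allows at most one condition on the sort key.
    if after is not None:
        names["#sk"] = SK
        values[":d"] = {"S": after}
        condition += " AND #sk > :d"
    elif before is not None:
        names["#sk"] = SK
        values[":d"] = {"S": before}
        condition += " AND #sk < :d"
    return {
        "TableName": TABLE_NAME,
        "KeyConditionExpression": condition,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
        "ScanIndexForward": not newest_first,  # False = descending sort key
        "Limit": limit,
    }


def latest_articles(n: int = 6) -> Dict[str, Any]:
    """The n most recent articles."""
    return articles_query(n, newest_first=True)


def next_article(created: str) -> Dict[str, Any]:
    """The first article created after `created` (ascending, Limit 1)."""
    return articles_query(1, newest_first=False, after=created)


def previous_article(created: str) -> Dict[str, Any]:
    """The first article created before `created` (descending, Limit 1)."""
    return articles_query(1, newest_first=True, before=created)
```

With this shape, prev/next needs no filter expression: each lookup reads a single item from the ARTICLE partition.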
For all the comments and tags of an Article, I can use the GSI with article_link_pk: article_id, which returns all comments and tags. If I want only comments, I can use article_link_pk: article_id together with begins_with(article_link_sk, '2020'); this way I get only comments, without tags.
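A minimal sketch of that GSI lookup follows. The attribute names `article_link_pk` and `article_link_sk` come from the description above; the table name `articles`, the index name `article_link_index` and the function name are assumptions. The parameters are shaped for `boto3.client("dynamodb").query(**params)`.

```python
from typing import Any, Dict, Optional


def article_children(article_id: str,
                     sk_prefix: Optional[str] = None) -> Dict[str, Any]:
    """Build Query parameters for the reverse-lookup GSI of one article."""
    params: Dict[str, Any] = {
        "TableName": "articles",            # hypothetical table name
        "IndexName": "article_link_index",  # hypothetical index name
        "KeyConditionExpression": "article_link_pk = :a",
        "ExpressionAttributeValues": {":a": {"S": article_id}},
    }
    if sk_prefix is not None:
        # Narrow the result to sort keys that start with the given prefix.
        params["KeyConditionExpression"] += " AND begins_with(article_link_sk, :p)"
        params["ExpressionAttributeValues"][":p"] = {"S": sk_prefix}
    return params


# Comments and tags of one article:
everything = article_children("article-1")
# Only the items whose sort key starts with '2020' (the comments, in this model):
comments_2020 = article_children("article-1", sk_prefix="2020")
```

Note that a prefix of '2020' only matches sort keys from that year, so comments from other years would need a different prefix.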
The data model in NoSQL Developer looks like this:
The GSI reverse lookup looks like this:
It's been a journey, but I feel like I finally got a grasp on how to do data modelling in DynamoDB

Related

DynamoDB filter condition and Limit

Can anyone help me understand the best approach to handling condition filtering together with Limit?
I'm using DynamoDB to store some products and I want to paginate through them.
Because only some of the products are enabled, I have to filter based on a field.
No problem up to there.
When I use Limit, if I want to Limit the data to be returned the Limit will be applied after the query makes the filtering.
Here is the deal :
The query always returns the limited data from the last item up to the first one, and I get the key to continue the pagination toward the top.
If I want to retrieve from the top to the bottom, I set ScanIndexForward = true and do the same as in the other query.
But I would like to get random products and paginate through them. With the approach above I always get the same products, from the first one to the last or the other way round. What about the products in the middle?
I'm using a simple schema :
PK: PRODS
SK: ITEM#RamdomID
thanks a lot, I appreciate any helpful comment.
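For reference, the DynamoDB documentation describes Limit as the maximum number of items evaluated, with the filter expression applied afterwards, so a filtered page can contain fewer items than Limit. One common way to cope with that is to keep requesting pages until enough filtered items have been collected. The sketch below assumes a hypothetical `query_page(start_key)` callable that wraps the real Query call (passing `start_key` as `ExclusiveStartKey` when it is not None) and returns the raw response dictionary.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Page = Dict[str, Any]


def collect_filtered(query_page: Callable[[Optional[dict]], Page],
                     wanted: int,
                     start_key: Optional[dict] = None
                     ) -> Tuple[List[dict], Optional[dict]]:
    """Request pages until at least `wanted` filtered items are collected.

    Returns the collected items and the key to resume from (None when the
    partition is exhausted). The result may hold slightly more than `wanted`
    items, because whole pages are kept so that the resume key stays valid.
    """
    items: List[dict] = []
    key = start_key
    while True:
        page = query_page(key)
        items.extend(page.get("Items", []))
        key = page.get("LastEvaluatedKey")
        if key is None or len(items) >= wanted:
            return items, key
```

The returned key can be handed back to `collect_filtered` as `start_key` to fetch the next batch.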

DynamoDB query by 3 fields

Hi, I am struggling to construct my schema with three search fields.
The two main queries I will use are:
Get all files from a user within a specific folder ordered by date.
Get all files from a user ordered by date.
Maybe there will be additional queries where I want:
All files from a user within a folder ordered by date and itemType == X
All files from a user ordered by date and itemType == X
Given that, the userID has to be the partition key.
But what should I use as my sort key? I tried to use a composite sort key like FOLDER${folderID}#FILE${itemID}#TIME${timestamp}. As I don't know the itemID, I can't use the beginsWith expression, right?
What I could do is filter by beginsWith: folderID, but then a descending sort by date would not work.
Or should I move away from dynamoDB to a relationalDB with those query requirements in mind?

DynamoDB data modeling can be tough at first, but it sounds like you're off to a good start!
When you find yourself requiring an ID and sorting by time, you should know about KSUIDs. KSUIDs are unique IDs that can be lexicographically sorted by time. That means that you can sort KSUIDs and they will order by creation time. This is super useful in DynamoDB. Let's check out an example.
When modeling the one-to-many relationship between Users and Folders, you might do something like this:
In this example, User with ID 1 has three folders with IDs 1, 2, and 3. But how do we sort by time? Let's see what this same table looks like with KSUIDs for the Folder ID.
In this example, I replaced the plain ol' ID with a KSUID. Not only does this give me a unique identifier, but it also ensures my Folder items are sorted by creation date. Pretty neat!
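To illustrate why such IDs sort by creation time, here is a small stand-in. It is not the real KSUID format, just a zero-padded timestamp followed by random hex, and the function name `sortable_id` is made up for this sketch. In practice an existing KSUID library would be the safer choice.

```python
import os
import time
from typing import Optional


def sortable_id(timestamp: Optional[float] = None) -> str:
    """Return an ID whose string order follows creation time.

    NOT a real KSUID: a 10-digit zero-padded Unix timestamp in seconds,
    a dash, then 8 random bytes as hex to keep IDs unique within a second.
    """
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{ts:010d}-{os.urandom(8).hex()}"


# Sort keys built from these IDs come back from a Query in creation order.
older = f"FOLDER#{sortable_id(1_600_000_000)}"
newer = f"FOLDER#{sortable_id(1_600_000_060)}"
print(sorted([newer, older]) == [older, newer])
```

The fixed-width, zero-padded timestamp prefix is what makes plain string comparison agree with time order.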
There are several solutions to filtering by itemType, but I'd probably start with a global secondary index with a partition key of USER#user_id#itemType and FOLDER#folder_id as the sort key. Your base table would then look like this
and your index would look like this
This index allows you to fetch all items or a specific folder for a given user and itemType.
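A sketch of how the index keys and the matching query might be built is shown below. The key formats `USER#user_id#itemType` and `FOLDER#folder_id` follow the description above; the attribute names `GSI1PK` and `GSI1SK`, the table name `files`, the index name `GSI1` and the function names are assumptions. The parameters are shaped for `boto3.client("dynamodb").query(**params)`.

```python
from typing import Any, Dict, Optional


def index_keys(user_id: str, item_type: str, folder_id: str) -> Dict[str, str]:
    """Index key attributes to store on each file item."""
    return {
        "GSI1PK": f"USER#{user_id}#{item_type}",  # hypothetical attribute name
        "GSI1SK": f"FOLDER#{folder_id}",          # hypothetical attribute name
    }


def files_by_type(user_id: str, item_type: str,
                  folder_id: Optional[str] = None) -> Dict[str, Any]:
    """Query parameters: all items of one type for a user, optionally one folder."""
    names = {"#pk": "GSI1PK"}
    values: Dict[str, Any] = {":pk": {"S": f"USER#{user_id}#{item_type}"}}
    condition = "#pk = :pk"
    if folder_id is not None:
        names["#sk"] = "GSI1SK"
        values[":sk"] = {"S": f"FOLDER#{folder_id}"}
        condition += " AND #sk = :sk"
    return {
        "TableName": "files",  # hypothetical table name
        "IndexName": "GSI1",   # hypothetical index name
        "KeyConditionExpression": condition,
        "ExpressionAttributeNames": names,
        "ExpressionAttributeValues": values,
    }
```

For example, `files_by_type("1", "image")` covers "all files of a type for a user", and passing a `folder_id` narrows it to one folder.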
These examples might not perfectly match your access patterns, but I hope they can get your data modeling process un-stuck! I don't see any reason why your access patterns can't be implemented in DynamoDB.

If you are sure about using DynamoDB, you should analyze the access patterns for this table in advance and choose the partition key and sort key based on the most frequent pattern. For the other patterns, you should add a GSI for each pattern. See https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
Usually, if the access patterns are unknown, an RDBMS looks better. For high-load systems, use NoSQL for the high-load workloads and periodically upload the data to something like Amazon Redshift.

Model Post and Topic through DynamoDB

Here's the relation I'm trying to model in DynamoDB:
My service contains posts and topics. A post may belong to multiple topics. A topic may have multiple posts. All posts have an interest value which would be adjusted based on a combination of likes and time since posted, interest measures the popularity of a post at the current moment. If a post gets too old, its interest value will be 0 and stay that way forever (archival).
The REST API endpoints work like this:
GET /posts/{id} returns a post object containing title, text, author name, a link to the author's REST endpoint (doesn't matter for this example) and the number of likes (the interest value is not included)
GET /topics/{name} should return an object with both a list with the N newest posts of the topics as well as one for the N currently most interesting posts
POST /posts/ creates a new post where multiple topics can be specified
POST /topics/ creates a new topic
POST /likes/ creates a like for a specified post (does not actually create an object, just adds the user to the given post-object's list of likers, which is invisible to the users)
The problem now becomes: how do I create a relationship between topics and posts in DynamoDB?
I thought about adding a list of copies of posts to the tag entries in DynamoDB, where every tag has a list of both the newest and the most interesting posts.
One way I could do this is by creating a CloudWatch job that would run every 10 minutes and loop through every topic object, finding both the most interesting and newest entries and then replacing the old lists of the topic.
Another job would also have to regularly update the "interest" value of every non archived post (keep in mind both likes and time have an effect on the interest value).
One problem with this is that a lot of posts in the Tag list would be out of date for up to 10 minutes if the user changes or deletes the post. Likes would also not be properly tracked on the Tags post list. This could perhaps be solved with transactions, although DynamoDB limits the number of items per transaction (10 when transactions launched, later raised to 100).
Another problem is that it would require the add-posts-to-tags job to load all the non-archived posts into memory in order to manually sort them by both time and interest, split them up by tag and then add the first N of both sets to the tag lists every 10 minutes.
I also had another idea: by limiting the number of tags allowed on a post to 1, I could use the tag as the partition key, with the post time as the sort key, and use a GSI to add interest as a second sort key.
This does have several downsides though:
Very popular tags may be limited to a single partition since all the posts share a single partition key
Tag limit is 1
A CloudWatch job to adjust the interest value of posts may still be required
It would require use of a GSI, which may lead to dangerous race conditions
But it would have the advantage that there are no replications of the post objects aside from the GSI. It would also allow basically infinite paging of all posts by date instead of being limited to just the N newest posts.
So what is a good approach here? It seems both of my solutions have horrible dealbreakers. Is this just one of those problems that NoSQL simply can't solve?

You are trying to model relational data using a non-relational DB.
To do this I would use two types of DB.
I would store the post information in DynamoDB.
In your example that would cover:
GET /posts/{id}
POST /posts/
POST /likes/
For the topic-related information I would use Elasticsearch (Amazon Elasticsearch Service).
GET /topics/{name}: the search index would store the full topic info as well as the post IDs that belong to it, plus the relevant fields you want to search on (in your case, the update date to get the most recent posts).
This entails a background process (with DynamoDB this can be done via Streams) that picks up changes in DynamoDB, such as new posts and updates to the like count, and populates the search index.
Note: this can also be solved using a graph DB, but for scaling purposes it is better to separate the source of the data (posts) from the data relations (topics).

How to design a schema in DynamoDB for a reading comprehension quiz application where data would be heavy?

Please check the UML diagram.
What I want to know is: if there are 30 questions and their options in section 1, 20 questions in section 2, and 30 questions in section 3, how should I store them in the table? RC passages would have 300-400 words, so with the questions and options it would be around 700-800 words per question.
Should each question have one row in the table, or, per test, should I have a separate column for each section, with all questions and options saved in JSON format in one column (item for DynamoDB)?

I would follow these rules for DynamoDB table design:
Definitely keep everything in one table. It's rare for one application to need multiple tables. It is OK to have different items (rows) in DynamoDB represent different kinds of objects.
Start by identifying your access patterns, that is, what are the questions you need to ask of your data? This will determine your choice of partition key, sort key, and indexes.
Try to pick a partition key that will result in object accesses being spread somewhat evenly over your different partitions.
If you will have lots of different tests, with accesses spread somewhat evenly over the tests, then TestID could be a good partition key. You will probably want to pull up all the tests for a given instructor, so you could have a column InstructorID with a global secondary index pointing back to the primary key attributes.
Your sort key could be heterogeneous: it could be different depending on whether the item is a question or a student's answer. For questions, the sort key could be QuestionID with the content of the question stored as other attributes. For question options it could be QuestionID#OptionID, with something like an OptionDescription attribute for the content of the option. Keep in mind that it's OK to have sparse attributes: not every item needs something populated for every attribute, and it's OK to have attributes that are meaningless for many items. For answers, your sort key could be QuestionID#OptionID#StudentID, with the content of the student's answer stored as a StudentAnswer attribute.
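As a small illustration of that heterogeneous sort key, the items for one test might be shaped as below. TestID, OptionDescription and StudentAnswer come from the text above; the sort key attribute name `SK`, the `QuestionText` attribute, the helper names and the fixed-width IDs such as `Q001` are assumptions (fixed width keeps a prefix like `Q1` from also matching `Q10`). The `under_question` helper only mimics, in memory, what a `begins_with` key condition on the sort key would return.

```python
from typing import Dict, List

Item = Dict[str, str]


def question_item(test_id: str, question_id: str, text: str) -> Item:
    return {"TestID": test_id, "SK": question_id, "QuestionText": text}


def option_item(test_id: str, question_id: str, option_id: str,
                description: str) -> Item:
    return {"TestID": test_id, "SK": f"{question_id}#{option_id}",
            "OptionDescription": description}


def answer_item(test_id: str, question_id: str, option_id: str,
                student_id: str, answer: str) -> Item:
    return {"TestID": test_id,
            "SK": f"{question_id}#{option_id}#{student_id}",
            "StudentAnswer": answer}


def under_question(items: List[Item], question_id: str) -> List[Item]:
    """In-memory stand-in for begins_with(SK, question_id), in sort-key order."""
    matching = [i for i in items if i["SK"].startswith(question_id)]
    return sorted(matching, key=lambda i: i["SK"])


items = [
    option_item("T1", "Q001", "B", "Second option"),
    answer_item("T1", "Q001", "A", "S42", "A"),
    question_item("T1", "Q002", "Another question"),
    question_item("T1", "Q001", "What is the main idea of the passage?"),
    option_item("T1", "Q001", "A", "First option"),
]
print([i["SK"] for i in under_question(items, "Q001")])
```

Because the question, its options and the answers share a sort key prefix, one query on the TestID partition can return them together.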
Here is a guide on DynamoDB best practices. For something more digestible, search in YouTube for "aws reinvent dynamo rick houlihan." Rick Houlihan has some good talks about data modeling in DynamoDB. Here are a couple, and one more on data modeling:
https://www.youtube.com/watch?v=6yqfmXiZTlM&list=PL_EDAAla3DXWy4GW_gnmaIs0PFvEklEB7
https://www.youtube.com/watch?v=HaEPXoXVf2k
https://www.youtube.com/watch?v=DIQVJqiSUkE

The better approach is to store each question and its options as a row in the DynamoDB table. I would not suggest the second approach of storing the questions and answers as JSON, as the maximum size of a DynamoDB item is 400 KB. In such scenarios, using a document database is much more helpful.
Also try to come up with the types of queries that you will be running. Some typical ones are:
Get all questions in a section by SectionID
Get the details of a Question by Question Id
Get all questions
If you can provide some more information, I could guide you in data modelling.
Also, I did not see the UML diagram.

The following is my suggestion. Create the DynamoDB table:
Store each SectionID, question and its options as a row in the DynamoDB table
Partition Key: SectionID, Sort Key: QuestionId
Create a GSI on the table with Partition Key: QuestionId, Sort Key: OptionId

DynamoDb database design

I'm new to DynamoDb and noSql in general.
I have a users table and a notes table. A user can create notes and I want to be able to retrieve all notes associated with a user.
One solution I've thought of is that every time a note is saved, the note id is stored inside a 'notes' attribute in the user table. This will allow me to query the users table for all note ids and then query notes using those ids:
UserTable:
UserId: 123456789
notes: ['note-id-1', 'note-id-2']
NotesTable
id: note-id-1
text: "Some note"
Is this the correct approach? The only other way I can think of is for the notes table to have a userId attribute, so I can then query the notes table based on that userId. Obviously this sort of approach is more relational.

I would take the approach at the end of your question: each note should have a userId attribute. Then create a global secondary index with userId as partition key and noteId as sort key. This way you can also query on userId, by doing a query on that index.
If you do it the way you suggested, you always need two queries to get the notes of a user (first get the note ids from the user table and then query the notes table). Also, when someone has N notes you would need to do N queries, which is going to be expensive if N is large.
If you do it the way described in this answer, you need one query to get all notes of a user (I'm assuming no pagination) and one to get the user information. It will never be more than 2.
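A minimal sketch of the single index query is given below. NotesTable and userId come from the question; the index name `userId-noteId-index` and the function name are assumptions. The parameters are shaped for `boto3.client("dynamodb").query(**params)`, and the optional start key covers the case where the result does need pagination.

```python
from typing import Any, Dict, Optional


def notes_for_user(user_id: str,
                   start_key: Optional[dict] = None) -> Dict[str, Any]:
    """Build Query parameters that fetch all notes of one user via the GSI."""
    params: Dict[str, Any] = {
        "TableName": "NotesTable",
        "IndexName": "userId-noteId-index",  # hypothetical index name
        "KeyConditionExpression": "userId = :u",
        "ExpressionAttributeValues": {":u": {"S": user_id}},
    }
    if start_key is not None:
        # Continue from the LastEvaluatedKey of the previous response.
        params["ExclusiveStartKey"] = start_key
    return params


params = notes_for_user("123456789")
```

However many notes the user has, this stays one Query per page rather than one request per note.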
General rule of thumb:
SQL: storage = expensive, computation = cheap
NoSQL: storage = cheap, computation = expensive
So always try to need as few queries as possible.
