Here's the relation I'm trying to model in DynamoDB:
My service contains posts and topics. A post may belong to multiple topics, and a topic may have multiple posts. Every post has an interest value that is adjusted based on a combination of likes and time since posting; interest measures the popularity of a post at the current moment. Once a post gets too old, its interest value drops to 0 and stays that way forever (archival).
The REST API endpoints work like this:
GET /posts/{id} returns a post object containing the title, text, author name, a link to the author's REST endpoint (doesn't matter for this example), and the number of likes (the interest value is not included)
GET /topics/{name} should return an object with both a list of the topic's N newest posts and a list of its N currently most interesting posts
POST /posts/ creates a new post where multiple topics can be specified
POST /topics/ creates a new topic
POST /likes/ creates a like for a specified post (it does not actually create an object; it just adds the user to the given post object's list of likers, which is invisible to users)
The problem now becomes: how do I create a relationship between topics and posts in DynamoDB's NoSQL model?
I thought about adding a list of copies of posts to tag entries in DynamoDB, where every tag has a list of both the newest and the most interesting posts.
One way I could do this is by creating a CloudWatch job that runs every 10 minutes and loops through every topic object, finding both the most interesting and the newest entries and then replacing the topic's old lists.
Another job would also have to regularly update the "interest" value of every non-archived post (keep in mind both likes and time affect the interest value).
One problem with this is that many posts in the tag lists would be out of date for up to 10 minutes if the user edits or deletes a post. Likes would also not be properly tracked in the tags' post lists. This could perhaps be solved with transactions, although DynamoDB transactions are limited to 25 items.
Another problem is that the add-posts-to-tags job would have to load all non-archived posts into memory, manually sort them by both time and interest, split them up by tag, and then add the first N of each set to the tag lists every 10 minutes.
I also had another idea: by limiting a post to a single tag, I could use the tag as the partition key and the post time as the sort key, and use a GSI to add interest as a second sort key.
This does have several downsides though:
Very popular tags may be limited to a single partition, since all their posts share one partition key
Tag limit is 1
A CloudWatch job to adjust the interest value of posts may still be required
It would require use of a GSI which may lead to dangerous race conditions
But it would have the advantage that the post objects are not replicated anywhere except the GSI. It would also allow essentially infinite paging of all posts by date, instead of being limited to just the N newest posts. Roughly, the table I have in mind would look something like the sketch below (boto3; attribute names are just placeholders).
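# Rough sketch of the single-tag table idea; "tag", "posted_at" and "interest"
# are placeholder attribute names, not a final schema.
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="Posts",
    AttributeDefinitions=[
        {"AttributeName": "tag", "AttributeType": "S"},
        {"AttributeName": "posted_at", "AttributeType": "S"},
        {"AttributeName": "interest", "AttributeType": "N"},
    ],
    KeySchema=[
        {"AttributeName": "tag", "KeyType": "HASH"},         # the one allowed tag
        {"AttributeName": "posted_at", "KeyType": "RANGE"},  # post time as sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "tag-interest-index",
            "KeySchema": [
                {"AttributeName": "tag", "KeyType": "HASH"},
                {"AttributeName": "interest", "KeyType": "RANGE"},  # second sort order
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)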
So what is a good approach here? It seems both of my solutions have horrible dealbreakers. Is this just one of those problems that NoSQL simply can't solve?
You are trying to model relational data using a non-relational DB. To do this I would use two types of database.
I would store the post information in DynamoDB; in your example that would cover:
GET /posts/{id}
POST /posts/
POST /likes/
For the topic-related information I would use Elasticsearch (Amazon Elasticsearch Service).
GET /topics/{name}: the search index would store the full topic info as well as the post IDs, plus the relevant fields you want to search on (in your case the update date, to get the most recent posts).
What this will entail is a background process (in DynamoDB this can be done via Streams) that picks up changes to the DynamoDB table (new posts, updates to the like count, etc.) and populates the search index.
Note: this can also be solved using a graph DB, but for scaling purposes it is better to separate the source of the data (posts) from the data relations (topics). As a rough sketch (assuming a Lambda function subscribed to the table's stream with NEW_AND_OLD_IMAGES, and the elasticsearch Python client; index and field names are placeholders), the sync could look something like this:
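# Hypothetical Lambda handler: reads DynamoDB Stream records and mirrors
# post changes into an Elasticsearch index so topics can be queried there.
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://my-es-domain.example.com:443"])  # placeholder endpoint

def handler(event, context):
    for record in event["Records"]:
        if record["eventName"] in ("INSERT", "MODIFY"):
            new_image = record["dynamodb"]["NewImage"]
            post_id = new_image["post_id"]["S"]
            es.index(
                index="posts",
                id=post_id,
                body={
                    "topics": [t["S"] for t in new_image.get("topics", {}).get("L", [])],
                    "likes": int(new_image.get("likes", {"N": "0"})["N"]),
                    "created_at": new_image["created_at"]["S"],
                },
            )
        elif record["eventName"] == "REMOVE":
            post_id = record["dynamodb"]["Keys"]["post_id"]["S"]
            es.delete(index="posts", id=post_id, ignore=[404])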
Related
I am modelling the data of my application to use DynamoDB.
My data model is rather simple:
I have users and projects
Each user can have multiple projects
Users can number in the millions, and projects per user can number in the thousands.
My access pattern is also rather simple:
Get a user by id
Get a list of paginated users sorted by name or creation date
Get a project by id
Get projects by user, sorted by date
My single table for this data model is the following:
I can easily implement all my access patterns using table PK/SK and GSIs, but I have issues with number 2.
According to the documentation and best practices, to get a sorted list of paginated users:
I can't use a scan, as sorting is not supported
I should not use a GSI with a PK that would put all my users in the same partition (e.g. GSI PK = "sorted_user", SK = "name"), as that would make my single partition hot and would not scale
I can't create a new entity of type "organisation", put all users in there, and query by PK = "org", as that would have the same hot partition issue as above
I could bucket users and use write sharding, but I don't really know how I could practically query paginated sorted users, as bucket PKs would need to be more or less random, and I would have to query all buckets to be able to sort all users together. I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
My application model is rather simple. However, after having read all the docs and best practices and watched many online videos, I find myself stuck with the most basic use case, which DynamoDB does not seem to support well. I suppose it must be quite common to have to get lists of users in some sort of admin panel for practically any modern application.
What would others do in this case? I would really like to use DynamoDB for all the benefits it gives, especially in terms of cost.
Edit
Since I have been asked, in my app the main use case for 2) is something like this: https://stackoverflow.com/users?tab=Reputation&filter=all.
As to the sizing, it needs to scale well, at least to the tens of thousands.
I also thought that bucket PKs could be alphabetical letters, but that could create hot partitions as well, as the letter "A" would probably be hit quite hard.
I think this sounds like a reasonable approach.
The US Social Security Administration publishes data about names on its website. You can download the list of name data from as far back as 1879! I stumbled upon a website from data scientist and linguist Joshua Falk that charted the baby name data from the SSA, which can give us a hint of how names are distributed by their first letter.
Your users may not all be from the US, but this can give us an understanding of how names might be distributed if partitioned by the first letter.
While not exactly evenly distributed, perhaps it's close enough for your use case? If not, you could further distribute the data by using the first two (or three, or four...) letters of the name as your partition key.
1 million names likely amount to no more than a few MBs of data, which isn't very much. Partitioning based on name prefixes seems like a reasonable way to proceed.
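As a rough sketch (assuming boto3 and a hypothetical GSI named name-index whose partition key is the first letter of the name and whose sort key is the full name), a paginated, sorted listing could walk the letter partitions in order:

# Hypothetical sketch: list users sorted by name, one letter partition at a time.
# Assumes a GSI "name-index" with PK "name_first_letter" and SK "name".
import string
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Users")

def list_users_sorted_by_name(page_size=50):
    for letter in string.ascii_uppercase:          # walk partitions in alphabetical order
        kwargs = {
            "IndexName": "name-index",
            "KeyConditionExpression": Key("name_first_letter").eq(letter),
            "Limit": page_size,
        }
        while True:
            resp = table.query(**kwargs)
            yield from resp["Items"]
            if "LastEvaluatedKey" not in resp:
                break                              # this letter is exhausted, move on
            kwargs["ExclusiveStartKey"] = resp["LastEvaluatedKey"]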
You might also consider using a tool like ElasticSearch, which could support your second access pattern and more.
I want to create an articles application using serverless (AWS Lambda + DynamoDB + S3 for hosting the FE).
I have some questions regarding the "1 table approach".
The actions I want to support:
Get latest (6) articles sorted by date
Get an article by id
Get the prev/next article relative to the article opened (based on creation date)
Get related articles by tags
Get comments by article
I have created an initial spreadsheet for the information:
The first problem I have is that for action nr. 1 I cannot get all the articles based on date. I've added the SK for articles as a date, but because the PK holds separate articles, each with its own id (article-1, article-2, and so on), I don't know how to fetch all the articles by SK alone.
I then tried creating an LSI, but then I noticed that an LSI needs to have the same PK as the table, so I can select based on LSI type = 'ARTICLE' but still cannot select them ordered by date (the entities_sort value).
I know AWS says it's good for the PK to be unique, but then how do you group the data in this case?
I've created a GSI
This helps me get articles by type(GSI2PK)='ARTICLE' sorted by entities_sort (GSI2SK), but isn't there a better way of achieving this? Having your articles as a PK in a table, but somehow still being able to get them sorted by date?
Having GSI1PK, GSI1SK this way, I can get all the comments for an article using a reverse lookup, so that's good.
But I still don't know how to implement number 3, getting the prev/next article relative to the article opened (based on creation date): get an article by id, check its creation date (entities_sort), then somehow get the articles immediately before and after that creation date. Is there a function in DynamoDB that can do this for me?
In my approach I try to query/process as few items as possible, so I don't want to use filter functions; rather, I partition my information.
My question is, how should I achieve 1 and 3? And isn't creating 2 GSIs for so few actions overkill?
What is the pattern for having articles under a PK, each unique with its own id, while still being able to get them sorted by creation date?
Thank you
So what I've ended up doing is:
My access patterns in detail are:
Get any Article by Id (for edit/delete)
Get any Comment by Id (for edit/delete)
Get any Tag by Id (for edit/delete)
Get all Articles ordered by date
Get all the Tags of an Article
Get all comments for an article, sorted by date
Get all Articles that have a specific tag, ordered by date (because I want to show only the last 3 ones)
This is the way I've implemented my model, and I can get all the information I need.
Also, all my data is partitioned and the queries are really efficient; I always get exactly what I need, and the ScannedDocuments value is always the number of returned objects.
The Global Secondary Index helps me query by Article Id, and I get all the comments and tags of that Article.
I've solved the many-to-many between Tags and Articles with a new record at the end:
tag_id, article_date, arct_id, tag_id
So, if I want all articles that have a specific tag sorted by date, I can query the PK of the table and sort by the SK. If I want to get a single Tag (for edit/delete), I can use the GSI with article_id, tag_id, and I get the relation between them.
For getting all Articles sorted by date, I query PK: ARTICLE, and if I only want the ones after a certain date I can add a condition on the SK.
For all the comments and tags of an Article I can use the GSI with article_link_pk: article_id, and I get all comments and tags. If I want only the comments, I can query article_link_pk: article_id and article_link_sk: begins_with(article_link_sk, '2020'); that way I get only the comments, without the tags.
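As a sketch, those two queries would look roughly like this with boto3 (table, index, and attribute names follow the description above but are illustrative):

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("Articles")

# All articles after a given date: query the ARTICLE partition, condition the sort key.
articles = table.query(
    KeyConditionExpression=Key("PK").eq("ARTICLE") & Key("SK").gt("2020-01-01"),
    ScanIndexForward=False,  # newest first
)["Items"]

# Only the comments of one article: reverse lookup on the GSI, narrowing the sort key.
comments = table.query(
    IndexName="article_link_index",
    KeyConditionExpression=Key("article_link_pk").eq("article-1")
    & Key("article_link_sk").begins_with("2020"),
)["Items"]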
The data model in NoSQL Developer looks like this:
The GSI reverse lookup looks like this:
It's been a journey, but I feel like I've finally got a grasp on how to do data modelling in DynamoDB.
Firstly, let me know if I should place this in a different community. It is programming related, but less so than I would prefer.
I am creating a mobile app which I intend to base on AWS AppSync unless I can determine it is a poor fit.
I want to store a fairly large set of data, say a half million records.
From these records, I need to be able to grab all entries based on a tag and page them from the larger set.
An example of this data would be:
{
  "name": "Product123",
  "tags": [
    {
      "name": "1880",
      "type": "year",
      "value": 7092
    },
    {
      "name": "f",
      "type": "gender",
      "value": 4120692
    }
  ]
}
Various objects may or may not have a specific tag, but an object may have up to 500 tags or more (the initial seed data has 130 tags). My filter would ignore objects that do not match the tag but return the ones that do.
In reading about Query vs Scan on DynamoDB, I feel like my current data structure would require mostly scanning and be inefficient. Efficiency is only a real restriction due to cost.
With cost in mind, I will focus on the cost per user to access this data in filtered sets. Say 100,000 users for now each filtering and paging data many times a day.
Your concept of tags doesn't sound too different from the concept of Cognito User Pools' groups with AppSync (docs) - authentication based on groups will only return items allowed for groups that the user making the request is in. Cognito's default group limit is 25 per user pool, so while convenient out of the box, it wouldn't itself help you much. Instead, it's interesting just because it's similar conceptually, and can give you insight by looking at how it works internally.
If you go into the AppSync console and set up a request mapping template for groups auth, you'll see that it uses a scan and the contains operation. Doing something similar would probably be your best bet here, if you really want to use Dynamo. If you find that prohibitively costly, you could use a Lambda data source, which allows you to use any data store, if you have one in mind that's a little more flexible for this type of action.
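If you do stay on DynamoDB, a scan with a contains filter could look roughly like this (boto3; the table name and a flattened tag_names attribute are assumptions, not your actual schema):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("Products")

# Scan for items whose "tag_names" list/set contains the requested tag.
# Note: a Scan still reads (and bills for) every item it examines; the filter
# only trims what is returned, which is why this gets costly at scale.
resp = table.scan(FilterExpression=Attr("tag_names").contains("1880"))
items = resp["Items"]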
Say I have a general website that allows someone to download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned to the user from the server with only the N most recent posts across all of the pages subscribed to. Originally, when a user queried the server for a feed, the algorithm was as follows:
Look at all of the pages a user is subscribed to
Get the N most recent posts from each page
Sort all of the posts
Return the N most recent posts to the user as their feed
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. Thus, I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it is already created and does not have to be created upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as Facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself would be inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database in a way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
from django.db import models


class Page(models.Model):
    name = models.CharField(max_length=47)
    # Users following this page; reverse accessor: user.followed_pages
    followers = models.ManyToManyField('auth.User', related_name='followed_pages')


class Post(models.Model):
    title = models.CharField(max_length=147)
    # Reverse accessor: page.posts
    page = models.ForeignKey(Page, related_name='posts', on_delete=models.CASCADE)
    content = models.TextField()
    time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
Caching will be the solution here.
You will have to reduce the database reads, which are much slower than cache reads.
You can use something like Redis to cache the posts.
Here is an amazing answer for better understanding
Is Redis just a cache
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
You need not cache everything; just cache the most recent M posts, where M >> N and M is large enough to keep database calls rare. If a user requests posts beyond the latest M, those can be fetched directly from the DB.
Now, when you have to generate the feed, you can make a DB call to get all of the subscribed pages (or put those in the cache as well) and then just get the required number of posts from the cache.
The problem here would be keeping the cache up to date.
For that you can use something like django-signals. Whenever a new post is added, add it to the cache as well using the signal.
So for each DB write you will have to write to cache as well.
But then you will not have to read from the DB, and since Redis is an in-memory datastore it is pretty fast compared to standard relational databases.
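A minimal sketch of that idea, assuming redis-py, Django signals, and the Page/Post models from the earlier answer (key names and the M value are made up):

# Hypothetical: keep the latest M post ids per page in a Redis list,
# updated by a post_save signal so every DB write also updates the cache.
import json
import redis
from django.db.models.signals import post_save
from django.dispatch import receiver

M = 200  # cache well more than the N posts shown in a feed
r = redis.Redis(host="localhost", port=6379)

@receiver(post_save, sender=Post)
def cache_new_post(sender, instance, created, **kwargs):
    if not created:
        return
    key = f"page:{instance.page_id}:posts"
    r.lpush(key, json.dumps({"id": instance.id, "title": instance.title}))
    r.ltrim(key, 0, M - 1)  # keep only the most recent M entries

def feed_for(user, n=20):
    posts = []
    for page in user.followed_pages.all():          # pages the user subscribes to
        raw = r.lrange(f"page:{page.id}:posts", 0, n - 1)
        posts.extend(json.loads(item) for item in raw)
    # Entries were pushed newest-first; merge and keep the N most recent overall.
    posts.sort(key=lambda p: p["id"], reverse=True)
    return posts[:n]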
Edit:
These are a few more articles which can help for better understanding
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances
I'm seeing a discrepancy between the number of likes reported in the Graph API vs the number of entries in the "data" that has the name and ID of the people who liked a post.
When I view a certain post on Facebook, I see that it has 5 people who have liked it.
When I use the Graph API to fetch the post, the "likes" field has a "data" field with 3 entries in it, and a "count" field whose value is 5.
When I use the Graph API to fetch the likes for the post (eg, {post_id}/likes), I get a "data" field with 5 entries in it (and no "count" field).
Clearly the true answer to how many people have liked the post is 5. But then why is there only 3 entries in the "data" when I fetch the post object?
Here's another example of the same discrepancy:
https://graph.facebook.com/40796308305_10150394134258306 returns data for a post whose "likes/data" only has 1 entry in it, but whose "likes/count" says that there are 3. But https://graph.facebook.com/40796308305_10150394134258306/likes returns "data" with 3 entries. Finding that same entry on Coca-Cola's page finds that there are, in fact, 3 people who have liked it.
The documentation of the post object doesn't mention that the likes list may be incomplete, and the documentation of the FQL stream table explicitly says to use the post object to get the full list, so it's either a bug in the API or in the documentation.
I suspect it may be a deliberate but undesirable "feature" to limit the detailed list for performance reasons, as some posts may have hundreds or even thousands of likes.
It ends up actually causing a huge performance problem as I need to find all posts that have been liked by a particular user, and the only way to do that is to do a separate fetch of likes for each post in the list whose like count is higher than the like list length.
2 people have their privacy settings set to not show their name to people who are not their friends.