Google App Engine (NDB): One-To-One and One-To-Many relationship - django

I am building a web application (Django plus Google's NDB) and have to structure my models now. I have read many examples of how to implement one-to-many relationships but, to be honest, I'm even more confused now than before I read them.
Let me describe my problem with a similar example:
I have users and each user can read multiple books. A user can stop reading at any time and the progress is saved. Even if a book has been finished, the progress will be saved and never deleted.
I need to check the progress of all books a user has started reading all the time, so this has to be efficient and should require as few datastore reads as possible. The number of books is not too large (< 1000) and the books themselves are thin (say, just a title, an author and a single chapter). It's the mass of progress records and the constant lookup of progress that I'm worried about, since every user has their own progress for potentially every book.
How can I best structure my models for these requirements? If I use a StructuredProperty in my User model, will the size of the books that are referred to in Progress count towards the limit of X MB (I hope not)? If not, I guess something like this is the best way to go (I can read progress fast, without an additional lookup, and fetch the book from the datastore if necessary):
class Book(ndb.Model):
    name = ndb.StringProperty(required=True)
    ...

class Progress(ndb.Model):
    book = ndb.KeyProperty(kind="Book", required=True)
    last_page_read = ndb.IntegerProperty(required=True, default=0)
    ...

class User(ndb.Model):
    name = ndb.StringProperty(required=True)
    books_and_progress = ndb.StructuredProperty(Progress, repeated=True)
    ...

Your approach is correct. Since you're using a structured property, Progress instances are not separate datastore entities; they're stored inside the User entity, so no additional lookup is necessary to get the progress information for a given user. Once you have the user, you also have all the information about which books they're reading and on which page they left off. To put it another way, your User entities will contain the user's name and a list of "book key, last_page_read" pairs.
"will the size of the books that are referred to in Progress count towards the limit of X MB (hope not)?"
I don't know which limit you are referring to, but keep in mind that what you actually have in the User entity is the key of the Book entity, not the actual book data. So the size of your Book instances doesn't matter when you're retrieving User instances from the datastore.
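To make that concrete, here is a minimal sketch (with hypothetical helper names) of reading a user's progress without extra lookups, and batch-fetching the books only when their data is actually needed:

from google.appengine.ext import ndb

def get_progress_report(user_key):
    # One datastore read; the embedded Progress entries come along for free.
    user = user_key.get()
    for progress in user.books_and_progress:
        yield progress.book, progress.last_page_read

def get_books_for_user(user):
    # Only when the actual book data is needed: a single batched read.
    book_keys = [p.book for p in user.books_and_progress]
    return ndb.get_multi(book_keys)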

Related

Django Save Object "Receipts"

I'm building a Django web application, part of it involves an online ordering system for food. I want to make a "receipt" object to save transactions.
My concern, however, is this: say I have a Receipt object that relates to Orders, which in turn relate to Items. If the items get edited or change over time, the receipts will look different down the line. Is there a way to save these at the moment of a transaction?
I am implementing "soft deletion" in my models to avoid deletion issues, but I don't think this would protect against edits.
The only way I can think of to deal with this is to 'materialize' the Receipt. In other words, when a receipt is generated, use the Order and Item information current at that time and write the actual values, not the Order/Item ids, to a receipt table. So for an Item, write out the attributes you are interested in recording (description, price, qty, etc.) to the table, instead of just an Item id that points to a value which may change in the future.
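A rough sketch of that materialization, assuming hypothetical Receipt/ReceiptLine models and order fields (none of these names come from the original design):

from django.db import models

class Receipt(models.Model):
    created = models.DateTimeField(auto_now_add=True)
    # Values copied and frozen at transaction time; no FK back to Order.
    customer_name = models.CharField(max_length=200)
    total = models.DecimalField(max_digits=10, decimal_places=2)

class ReceiptLine(models.Model):
    receipt = models.ForeignKey(Receipt, related_name='lines',
                                on_delete=models.CASCADE)
    # Item attributes copied verbatim, so later edits to Item can't change them.
    description = models.CharField(max_length=200)
    unit_price = models.DecimalField(max_digits=10, decimal_places=2)
    quantity = models.PositiveIntegerField()

def materialize_receipt(order):
    # Freeze the order's current state into an immutable receipt.
    receipt = Receipt.objects.create(
        customer_name=order.customer_name,  # assumed Order field
        total=order.total(),                # assumed Order method
    )
    for entry in order.items.all():         # assumed related name
        ReceiptLine.objects.create(
            receipt=receipt,
            description=entry.item.description,
            unit_price=entry.item.price,
            quantity=entry.quantity,
        )
    return receipt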

Django custom creation manager logic for temporal database

I am trying to develop a Django application that has built-in logic around temporal states for objects. The desire is to be able to have a singular object representing a resource, while having attributes of that resource be able to change over time. For example, a desired use case is to query the owner of a resource at any given time (last year, yesterday, tomorrow, next year, ...).
Here is what I am working with...
class Resource(models.Model):
    id = models.AutoField(primary_key=True)

class ResourceState(models.Model):
    id = models.AutoField(primary_key=True)
    # Link the resource this state is applied to
    resource = models.ForeignKey(Resource, related_name='states', on_delete=models.CASCADE)
    # Track when this state is ACTIVE on a resource
    start_dt = models.DateTimeField()
    end_dt = models.DateTimeField()
    # Temporal fields, can change between ResourceStates
    owner = models.CharField(max_length=100)
    description = models.TextField(max_length=500)
I feel like I am going to have to create a custom interface to interact with this state. Some example use cases (interface is completely up in the air)...
# Get all of the states that were ever active on resource 1 (this is already possible)
Resource.objects.get(id=1).states.all()

# Get the owner of resource 1 from the state that was active yesterday; this is non-standard behavior
Resource.objects.get(id=1).states.at(YESTERDAY).owner

# Create a new state for resource 1, active between tomorrow and infinity (None == infinity)
# This is obviously non-standard if I want to enforce one-state-per-timepoint
Resource.objects.get(id=1).states.create(
    start_dt=TOMORROW,
    end_dt=None,
    owner="New Owner",
    description="New Description",
)
I feel the largest amount of custom logic will be required to do creates. I want to enforce that only one ResourceState can be active on a Resource for any given timepoint. This means that to create some ResourceState objects, I will need to adjust/remove others.
>>> resource = Resource.objects.get(id=1)
>>> resource.states.all()
[ResourceState(start_dt=None, end_dt=None, owner='owner1')]
>>> resource.states.create(start_dt=YESTERDAY, end_dt=TOMORROW, owner='owner2')
>>> resource.states.all()
[
    ResourceState(start_dt=None, end_dt=YESTERDAY, owner='owner1'),
    ResourceState(start_dt=YESTERDAY, end_dt=TOMORROW, owner='owner2'),
    ResourceState(start_dt=TOMORROW, end_dt=None, owner='owner1')
]
I know I will have to do most of the legwork around defining the logic, but is there any intuitive place where I should put it? Does Django provide an easy place for me to create these methods? If so, where is the best place to apply them? Against the Resource object? Using a custom Manager to deal with interacting with related 'ResourceState' objects?
Re-reading the above, it is a bit confusing, but this isn't a simple topic either! Please let me know if anyone has ideas for how to do something like this.
Thanks a ton!
Too long for a comment, and purely some thoughts rather than a full answer, but having dealt with many date-effective records in financial systems (not in Django), some things come to mind:
My gut would be to start by putting it on the save method of the resource model. You are probably right in needing a custom manager as well.
I'd probably also flirt with the idea of an is_current boolean field on the state model, but certain care would be needed with future-dated state records. If there is only one active state at a time, I'd also examine whether you need an end date at all. Having both start and end definitely makes the raw SQL queries (if ever needed) easier: date() between state.start and state.end gives the current record, and you can substitute any date to get that date's effective record.
Also, give some consideration to the open-ended end date, where you don't yet know when the state ends: your queries will have to handle the NULLs properly. You probably also need to consider the open-ended start date (say, for a load of historical data where the original start date is unknown). I'd suggest staying away from using some super-early date as a fill-in (and likewise a far-future date for unknown end dates); if you end up with lots of transactions, your query optimizer may thank you. However, I may be old and this may not matter anymore.
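For concreteness, here is a rough sketch of where that logic could live: a custom QuerySet/Manager on ResourceState, with an at() lookup and a create() that trims or splits overlapping states. It assumes the date fields allow NULL (meaning open-ended), and it is only a sketch of the approach, not production-ready code:

from django.db import models, transaction
from django.db.models import Q

class ResourceStateQuerySet(models.QuerySet):
    def at(self, timepoint):
        # NULL start/end dates are treated as open-ended intervals.
        return self.filter(
            Q(start_dt__lte=timepoint) | Q(start_dt__isnull=True),
            Q(end_dt__gt=timepoint) | Q(end_dt__isnull=True),
        ).get()

class ResourceStateManager(models.Manager.from_queryset(ResourceStateQuerySet)):
    @transaction.atomic
    def create(self, **kwargs):
        start, end = kwargs.get('start_dt'), kwargs.get('end_dt')

        # Existing states overlapping [start, end); None means +/- infinity.
        overlapping = self.all()
        if start is not None:
            overlapping = overlapping.filter(
                Q(end_dt__gt=start) | Q(end_dt__isnull=True))
        if end is not None:
            overlapping = overlapping.filter(
                Q(start_dt__lt=end) | Q(start_dt__isnull=True))

        for state in overlapping.select_for_update():
            keeps_left = start is not None and (
                state.start_dt is None or state.start_dt < start)
            keeps_right = end is not None and (
                state.end_dt is None or state.end_dt > end)
            if keeps_left and keeps_right:
                # The new interval splits the old state in two.
                tail = self.model(resource=state.resource, start_dt=end,
                                  end_dt=state.end_dt, owner=state.owner,
                                  description=state.description)
                tail.save()
                state.end_dt = start
                state.save()
            elif keeps_left:
                state.end_dt = start   # trim the old state's right edge
                state.save()
            elif keeps_right:
                state.start_dt = end   # trim the old state's left edge
                state.save()
            else:
                state.delete()         # fully covered by the new interval

        return super().create(**kwargs)

With objects = ResourceStateManager() on ResourceState (and null=True on both date fields), the related manager behind resource.states inherits these methods, so resource.states.at(YESTERDAY) and the splitting create() behave like the examples above.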
If you like to read about this stuff, I'd recommend a look at section 1.8 and chapter 6 of The Art of SQL (https://www.amazon.ca/Art-SQL-Stephane-Faroult/dp/0596008945/):
"But before settling for one solution, we must acknowledge that
valuation tables come in all shapes and sizes. For instance, those of
telecom companies, which handle tremendous amounts of data, have a
relatively short price list that doesn't change very often. By
contrast, an investment bank stores new prices for all the securities,
derivatives, and any type of financial product it may be dealing with
almost continuously. A good solution in one case will not necessarily
be a good solution in another.
Handling data that both accumulates and changes requires very careful
design and tactics that vary according to the rate of change."

Feed Algorithm + Database: Either too many rows or too slow retrieval

Say I have a general website that allows someone to download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned to the user from the server with only N of the most recent posts between all of the pages subscribed to. Originally when a user queried the server for a feed, the algorithm was as follows:
1. look at all of the pages the user is subscribed to
2. get the N most recent posts from each page
3. sort all of the posts
4. return the N most recent posts to the user as their feed
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. Thus, I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it is already created and does not have to be created upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as Facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself would be inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database in a way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
class Page(models.Model):
    name = models.CharField(max_length=47)
    followers = models.ManyToManyField('auth.User', related_name='followed_pages')

class Post(models.Model):
    title = models.CharField(max_length=147)
    page = models.ForeignKey(Page, related_name='posts', on_delete=models.CASCADE)  # on_delete added for modern Django
    content = models.TextField()
    time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
Caching will be the solution here.
You will have to reduce database reads, which are much slower than cache reads.
You can use something like Redis to cache the posts.
Here is an amazing answer for better understanding
Is Redis just a cache
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
You need not cache everything; just cache the most recent M posts, where M >> N, which is enough to cut down the database calls. If a user requests posts beyond the latest M, they can be fetched directly from the DB.
Then, when you have to generate the feed, you make one DB call to get all of the subscribed pages (or you can keep those in the cache as well) and just pull the required number of posts from the cache.
The problem here is keeping the cache up to date.
For that you can use Django's signals: whenever a new post is added, add it to the cache as well from a signal handler.
So for each DB write you will also write to the cache.
But then you will not have to read from the DB, and since Redis is an in-memory datastore, it is very fast compared to standard relational databases.
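A minimal sketch of that signal-plus-cache flow, assuming the redis-py client and a per-page Redis list keyed as "page:<id>:posts" (both the key scheme and M are assumptions, not from the original post):

import redis
from django.db.models.signals import post_save
from django.dispatch import receiver

r = redis.Redis()
M = 500  # cache the most recent M posts per page, with M >> N

@receiver(post_save, sender=Post)
def cache_new_post(sender, instance, created, **kwargs):
    if created:
        key = "page:%d:posts" % instance.page_id
        r.lpush(key, instance.id)  # newest post ids go to the head of the list
        r.ltrim(key, 0, M - 1)     # keep only the latest M entries

def feed_post_ids(user, n=20):
    # Collect the newest post ids across all pages the user follows.
    ids = []
    for page in user.followed_pages.all():  # this list could be cached too
        ids.extend(int(i) for i in r.lrange("page:%d:posts" % page.id, 0, n - 1))
    # Auto-increment ids stand in for timestamps here: sort the merged
    # ids by recency and keep the newest n.
    return sorted(ids, reverse=True)[:n]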
Edit:
These are a few more articles which can help for better understanding
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances

Concurrency in django admin change view

My model:
class Order(models.Model):
    property_a = models.CharField(max_length=100)  # max_length added; CharField requires it
    property_b = models.CharField(max_length=100)
    property_c = models.CharField(max_length=100)
Many users will access a given record in a short time frame via admin change page, so I am having concurrency issues:
User 1 and user 2 open the change page at the same time. Assume all values are blank when they load the page. User 1 sets property_a to "a" and property_b to "b", then saves. A second later, user 2 changes property_b and property_c and saves, quietly overwriting all the values from user 1: property_a goes back to being blank, and b and c become whatever user 2 entered.
I need recommendations on how to handle this. If I have to have a version field in the model, how do i pass it to the admin, where do I do the check so I can elegantly notify the user their changes can't be saved because another user has modified the record? Is there a more seamless way than just returning an error to the user?
The standard solution is to prevent your users from sharing a single record. It's not at all clear why so many users are messing with the exact same Order instance.
Consider that Order is probably a composite object and you've put too much into a single model. That's the first -- and best -- solution.
If (for inexplicable reasons) you won't decompose this, then you have to create a two-part update transaction.
1. Requery the data. Compare with the original query as done for this user's session.
2. If the data doesn't match the original query, then someone else changed it. The user's changes are invalidated, rolled back, wiped out, and the user sees a new query.
3. If the data does match, you can try to commit the change.
The above algorithm has a race condition, which is usually resolved via low-level SQL. Note that it invalidates a user's work, making it maximally irritating.
That's why your first choice is to decompose your models to eliminate the concurrency.
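If you do go the version-field route the question asks about, a minimal optimistic-locking sketch looks something like this (the version field and the exception are assumptions; Django has no built-in equivalent):

from django.db import models

class StaleOrderError(Exception):
    # Raised when another user saved the Order first.
    pass

class Order(models.Model):
    property_a = models.CharField(max_length=100, blank=True)
    property_b = models.CharField(max_length=100, blank=True)
    property_c = models.CharField(max_length=100, blank=True)
    version = models.IntegerField(default=0)  # assumed optimistic-lock counter

    def save(self, *args, **kwargs):
        if self.pk is None:
            return super().save(*args, **kwargs)
        # Compare-and-swap: succeeds only if nobody bumped the version since
        # this instance was loaded. Note that .update() bypasses signals.
        updated = Order.objects.filter(pk=self.pk, version=self.version).update(
            property_a=self.property_a,
            property_b=self.property_b,
            property_c=self.property_c,
            version=self.version + 1,
        )
        if updated == 0:
            raise StaleOrderError("Order was modified by another user.")
        self.version += 1

In the admin, version would travel as a hidden (or readonly) form field so the check runs against the value the user originally loaded, and the ModelAdmin could catch StaleOrderError and re-render the form with an error message.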
"my model has a miscellaneous notes field"
This is a bad design. (a) Concurrency is ruined by collisions on this field. (b) There's no log or history of comments.
Item (b) means that a badly-behaved user can maliciously corrupt this data. If you keep notes and comments as a log, you can -- in principle -- limit users to changing only their own comments.
[In most databases with "miscellaneous notes", the field has become a costly, hard-to-maintain liability full of important but impossible-to-parse data. Miscellaneous notes are where users invent their own processes outside the application software.]
"miscellaneous notes" must be treated like a log, with an unlimited number of notes -- date-stamped -- identified by user -- appended to the Order.
If you simply partition the design to put notes in a separate table, you solve your concurrency issues.
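A sketch of that partitioning, with hypothetical field names, turning notes into an append-only log instead of one shared text field:

from django.conf import settings
from django.db import models

class OrderNote(models.Model):
    # One note per row: date-stamped, identified by user, appended to the Order.
    order = models.ForeignKey('Order', related_name='notes',
                              on_delete=models.CASCADE)
    author = models.ForeignKey(settings.AUTH_USER_MODEL,
                               on_delete=models.PROTECT)
    created = models.DateTimeField(auto_now_add=True)
    text = models.TextField()

    class Meta:
        ordering = ['created']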

Designing a database for a user/points system? (in Django)

First of all, sorry if this isn't an appropriate question for StackOverflow. I've tried to make it as generalisable as possible.
I want to create a database (MySQL, site running Django) that has users, who can be allocated a certain number of points for various types of action - it's a collaborative game. My requirements are to obtain:
the number of points a user has
the user's ranking compared to all other users
and the overall leaderboard (i.e. all users ranked in order of points)
This is what I have so far, in my Django models.py file:
class SiteUser(models.Model):
    name = models.CharField(max_length=250)
    email = models.EmailField(max_length=250)
    date_added = models.DateTimeField(auto_now_add=True)

    def points_total(self):
        points_added = PointsAdded.objects.filter(user=self)
        points_total = 0
        for point in points_added:
            points_total += point.points()
        return points_total

class PointsAdded(models.Model):
    user = models.ForeignKey('SiteUser', on_delete=models.CASCADE)
    action = models.ForeignKey('Action', on_delete=models.CASCADE)
    date_added = models.DateTimeField(auto_now_add=True)

    def points(self):
        return self.action.points

class Action(models.Model):
    points = models.IntegerField()
    action = models.CharField(max_length=36)
However it's rapidly becoming clear to me that it's actually quite complex (in Django query terms at least) to figure out the user's ranking and return the leaderboard of users. At least, I'm finding it tough. Is there a more elegant way to do something like this?
This question seems to suggest that I shouldn't even have a separate points table - what do people think? It feels more robust to have separate tables, but I don't have much experience of database design.
This is old, but I'm not sure exactly why you have two separate tables (PointsAdded and Action). It's late, so maybe my mind isn't ticking, but it seems like you just split one table into two for no clear benefit. It's not like there's a one-to-many relationship in it, right?
So first of all, I would combine those two tables. Secondly, you are probably better off storing points_total as a value in your SiteUser table. This is what I think Demitry is trying to allude to, but didn't say explicitly. That way, instead of running that whole additional query (pulling everything a user has ever done on the site is expensive) and looping over the actions (going through them is even more expensive), you can just pull it as one field. It's denormalizing the data for the greater good.
Just be sure to update the value every time you add something that has points. You can use Django's post_save signal to do that, as in the sketch below.
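A minimal sketch of that signal, assuming SiteUser gains a denormalized points_total field (an IntegerField defaulting to 0, which is an addition to the models above):

from django.db.models import F
from django.db.models.signals import post_save
from django.dispatch import receiver

@receiver(post_save, sender=PointsAdded)
def add_points_to_total(sender, instance, created, **kwargs):
    if created:
        # F() makes the increment atomic in the database,
        # avoiding a read-modify-write race between requests.
        SiteUser.objects.filter(pk=instance.user_id).update(
            points_total=F('points_total') + instance.action.points)

With the counter in place, the leaderboard and a user's rank become simple queries: SiteUser.objects.order_by('-points_total') for the leaderboard, and SiteUser.objects.filter(points_total__gt=user.points_total).count() + 1 for the rank.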
It's a bit more difficult to keep the points total saved on the same table, but it's totally worth it. You can do very simple ordering/filtering operations if you have the computed points total on the user model. And you count totals only when something changes (not every time you want to show them). Just put some validation logic into post_save signals, make sure to cover this logic with tests, and you're good.
P.S. See denormalization on Wikipedia.