How to retrieve values from Django ForeignKey -> ManyToMany fields? - django

I have a model (Realtor) with a ForeignKey field (BillingTier), which has a ManyToManyField (BillingPlan). For each logged in realtor, I want to check if they have a billing plan that offers automatic feedback on their listings. Here's what the models look like, briefly:
class Realtor(models.Model):
user = models.OneToOneField(User)
billing_tier = models.ForeignKey(BillingTier, blank=True, null=True, default=None)
class BillingTier(models.Model):
plans = models.ManyToManyField(BillingPlan)
class BillingPlan(models.Model):
automatic_feedback = models.BooleanField(default=False)
I have a permissions helper that checks the user permissions on each page load, and denies access to certain pages. I want to deny the feedback page if they don't have the automatic feedback feature in their billing plan. However, I'm not really sure the best way to get this information. Here's what I've researched and found so far, but it seems inefficient to be querying on each page load:
def isPermitted(user, url):
premium = [t[0] for t in user.realtor.billing_tier.plans.values_list('automatic_feedback') if t[0]]
I saw some solutions which involved using filter (ManyToMany field values from queryset), but I'm equally unsure of using the query for each page load. I would have to get the billing tier id from the realtor: bt_id = user.realtor.billing_tier.id and then query the model like so:
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct()
I think the second option reads nicer, but I think the first would perform better because I wouldn't have to import and query the BillingTier model.
Is there a better option, or are these two the best I can hope for? Also, which would be more efficient for every page load?

As per the OP's invitation, here's an answer.
The core question is how to define an efficient permission check based on a highly relational data model.
The first variant involves building a Python list from evaluating a Django query set. The suspicion must certainly be that it imposes unnecessary computations on the Python interpreter. Although it's not clear whether that's tolerable if at the same time it allows for a less complex database query (a tradeoff which is hard to assess), the underlying DB query is not exactly simple.
The second approach involves fetching additional 1:1 data through relational lookups and then checking if there is any record fulfilling access criteria in a different, 1:n relation.
Let's have a look at them.
bt_id = user.realtor.billing_tier.id: This is required to get the hook for the following 1:n query. It is indeed highly inefficient in itself. It can be optimized in two ways.
As per Django: Access Foreign Keys Directly, it can be written as bt_id = user.realtor.billing_tier_id because the id is of course present in billing_tier and needs not be found via a relational operation.
Assuming that the page in itself would only load a user object, Django can be told to fetch and cache relational data along with that via select_related. So if the page does not only fetch the user object but the required billing_tier_id as well, we have saved one additional DB hit.
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct() can be optimized using Django's exists because that will redurce efforts both in the database and regarding data traffic between the database and Python.
Maybe even Django's prefetch_related can be used to combine the 1:1 and 1:n queries into a single query, but it's much more difficult to judge whether that pays. Could be worth a try.
In any case, it's worth installing a gem called Django Debug Toolbar which will allow you to analyze how much time your implementation spends on database queries.

Related

Django, is filtering by string faster than SQL relationships?

Is it a major flaw if I'm querying my user's information by their user_id (string) rather than creating a Profile model and linking them to other models using SQL relationships?
Example 1: (user_id is stored in django sessions.)
class Information(models.Model):
user_id = models.CharField(...)
...
# also applies for .filter() operations.
information = Information.objects.get(user_id=request.getUser['user_id'])
note: I am storing the user's profile informations on Auth0.
Example 2: (user_id is stored in Profile.)
class Profile(models.Model):
user_id = models.CharField(...)
class Information(models.Model):
profile = models.ForeginKey(Profile, ...)
...
information = Information.objects.get(profile=request.getProfile)
note: With this method Profile will only have one field, user_id.
On Django, will using a string instead of a query object affect performances to retrieve items?
Performance is not an issue here as noted by Dirk; as soon as a column is indexed, the performance difference between data types should be negligible when compared to other factors. Here's a related SO question for more perspective.
What you should take care of is to prevent the duplication of data whose integrity you then would have to take care of on your own instead of relying on well-tested integrity checks in the database.
Another aspect is that if you do have relations between your data, you absolutely should make sure that they are accurately represented in your models using Django's relationships. Otherwise there's really not much point in using Django's ORM at all. Good luck!

Should I use JSONField over ForeignKey to store data?

I'm facing a dilemma, I'm creating a new product and I would not like to mess up the way I organise the informations in my database.
I have these two choices for my models, the first one would be to use foreign keys to link my them together.
Class Page(models.Model):
data = JsonField()
Class Image(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
Class Video(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
etc...
The second is to keep everything in Page's JSONField:
Class Page(models.Model):
data = JsonField() # videos and pictures, etc... are stored here
Is one better than the other and why? This would be a huge help on the way I would organize my databases in the futur.
I thought maybe the second option could be slower since everytime something changes all the json would be overridden, but does it make a huge difference or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.

Why is the database used by Django subqueries not sticky?

I have a concern with django subqueries using the django ORM. When we fetch a queryset or perform a DB operation, I have the option of bypassing all assumptions that django might make for the database that needs to be used by forcing usage of the specific database that I want.
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
The above disregards any database routers I might have set and goes straight to 'some_db'.
But if my models approximately look like so :-
class Author(models.Model):
author_name=models.CharField(max_length=255)
author_address=models.CharField(max_length=255)
class Book(models.Model):
book_name=models.CharField(max_length=255)
author=models.ForeignKey(Author, null = True)
And I fetch a QuerySet representing all books that are called Mark like so:-
b_det = Book.objects.using('some_db').filter(book_name = 'Mark')
Then later if somewhere in the code I trigger a subquery by doing something like:-
if b_det:
auth_address = b_det[0].author.author_address
Then this does not make use of the original database 'some_db' that I had specified early on for the main query. This again goes through the routers and picks up (possibly) the incorrect database.
Why does django do this. IMHO , if I had selected forced usage of database for the original query then even for the subquery the same database needs to be used. Why must the database routers come into picture for this at all?
This is not a subquery in the strict SQL sense of the word. What you are actually doing here is to execute one query and use the result of that to find related items.
You can chain filters and do lots of other operations on a queryset but it will not be executed until you take a slice on it or call .values() but here you are actually taking a slice
auth_address = b_det[0].#rest of code
So you have a materialized query and you are now trying to find the address of the related author and that requires another query but you are not using with so django is free to choose which database to use. You cacn overcome this by using select_related

Feed Algorithm + Database: Either too many rows or too slow retrieval

Say I have a general website that allows someone to download their feed in a small amount of time. A user can be subscribed to many different pages, and the user's feed must be returned to the user from the server with only N of the most recent posts between all of the pages subscribed to. Originally when a user queried the server for a feed, the algorithm was as follows:
look at all of the pages a user subscribed to
getting the N most recent posts from each page
sorting all of the posts
return the N most recent posts to the user as their feed
As it turns out, doing this EVERY TIME a user tried to refresh a feed was really slow. Thus, I changed the database to have a table of feedposts, which simply has a foreign key to a user and a foreign key to the post. Every time a page makes a new post, it creates a feed post for each of its subscribing followers. That way, when a user wants their feed, it is already created and does not have to be created upon retrieval.
The way I am doing this is creating far too many rows and simply does not seem scalable. For instance, if a single page makes 1 post & has 1,000,000 followers, we just created 1,000,000 new rows in our feedpost table.
Please help!
How do companies such as facebook handle this problem? Do they generate the feed upon request? Are my database relationships terrible?
It's not that the original schema itself would be inherently wrong, at least not based on the high-level description you have provided. The slowness stems from the fact that you're not accessing the database in a way relational databases should be accessed.
In general, when querying a relational database, you should use JOINs and in-database ordering where possible, instead of fetching a bunch of data, and then trying to connect related objects and sort them in your code. If you let the database do all this for you, it will be much faster, because it can take advantage of indices, and only access those objects that are actually needed.
As a rule of thumb, if you need to sort the results of a QuerySet in your Python code, or loop through multiple querysets and combine them somehow, you're most likely doing something wrong and you should figure out how to let the database do it for you. Of course, it's not true every single time, but certainly often enough.
Let me try to illustrate with a simple piece of code. Assume you have the following models:
class Page(models.Model):
name = models.CharField(max_length=47)
followers = models.ManyToManyField('auth.User', related_name='followed_pages')
class Post(models.Model):
title = models.CharField(max_length=147)
page = models.ForeignKey(Page, related_name='posts')
content = models.TextField()
time_published = models.DateTimeField(auto_now_add=True)
You could, for example, get the list of the last 20 posts posted to pages followed by the currently logged in user with the following single line of code:
latest_posts = Post.objects.filter(page__followers=request.user).order_by('-time_published')[:20]
This runs a single SQL query against your database, which only returns the (up to) 20 results that match, and nothing else. And since you're joining on primary keys of all tables involved, it will conveniently use indices for all joins, making it really fast. In fact, this is exactly the kind of operation relational databases were designed to perform efficiently.
Caching will be the solution here.
You will have to reduce the database reads, which are much slower as compared to cache reads.
You can use something like Redis to cache the post.
Here is an amazing answer for better understanding
Is Redis just a cache
Each page can be assigned a key, and you can pull all of the posts for that page under that key.
you need not to cache everything , just cache resent M posts, where M>>N and safe enough to reduce the database calls.Now if in case user requests for posts beyond the latesd M ones, then they can be directly fetched from the DB.
Now when you have to generate the feed you can make a DB call to get all of the subscribed pages(or you can put in the cache as well) and then just get the required number of post's from the cache.
The problem here would be keeping the cache up-to date.
For that you can use something like django-signals. Whenever a new post is added, add it to the cache as well using the signal.
So for each DB write you will have to write to cache as well.
But then you will not have to read from DB and as Redis is a in memory datastore it is pretty fast as compared to standard relational databases.
Edit:
These are a few more articles which can help for better understanding
Does Stack Exchange use caching and if so, how
How Twitter Uses Redis to Scale - 105TB RAM, 39MM QPS, 10,000+ Instances

Seeking Good Approach to Persist Data Submitted Through Dynamic Django Forms

Summary:
Looking for a good way to save data to Django models for which the associated forms are generated dynamically.
Detail:
I've been puzzling over the best approach for creating dynamic Django forms backed by models. For example, I'd like to create an interface where a user can create an HTML form, customize the types of fields in that form dynamically (Number, String, Dropdown Box, Date, etc.), and then display that form to other users so those users can submit data which is saved to a database. I'm not sure how to make an efficient approach to persist the data.
www.formsite.com and www.mailchimp.com have some form-building tools that are nice examples of what I am trying to do. Jacob Kaplan-Moss has an excellent tutorial on how to create the forms dynamically, but the tutorial doesn't get into how to persist the data.
As a dummy example, one (perhaps bad?) approach might be to create some models like below, where there is a database table for the SurveyQuestions (storing the customizable names and datatypes of each field) and one for the SurveyQuestionResponses (each record storing an individual response for a SurveyQuestion on a particular Survey).
However, it seems like this approach might result in really complex and slow queries. For example, if a Survey has 10 questions and you would like to display 10 user responses to that survey, there would be queries to select all 10 SurveyQuestions and then for each survey responder, there would be a query to select each of the SurveyQuestionResponses. It seems like the number of queries needed could add up really fast!
class Survey(models.Model):
# some fields here.
pass
class SurveyQuestion(models.Model):
""" Defines the headings and field
types for a given Survey.
"""
survey = models.ForeignKey(Survey)
field_name = models.CharField(
max_length=255,
help_text='Enter the name for this field heading')
field_type = models.IntegerField(
choices=choices.FIELD_TYPES,
help_text='Enter the data type for this field')
display_order = models.IntegerField(default=0)
class SurveyQuestionResponse(models.Model):
survey_field = models.ForeignKey(SurveyQuestion)
response value = models.TextArea(blank=True, null=True)
Is there a better approach to persisting data based on dynamic forms? Should I be somehow converting a form respondent's response to some sort of pickled format and store it to a TextField (Instead of having 10 SurveyQuestionResponse records there would be one record with all the response values pickled together)? I'm not too familiar with NoSQL options, but would a NoSQL approach work best for this type of thing? Is there some sort of rendering or caching that would make sense to do?
I keep encountering situations where saving data from dynamic forms like this would be very useful. I am wondering what other people's approaches are. Any advice is much appreciated. Thanks for reading this admittedly long question.
Joe
For a relational database an Entity-attribute-value model(EAV) could be used to achieve a dynamic, or open schema. Relational databases are not really suited for this type of schema, and this generally results in very slow queries over time. NoSQL has its own set of issues but I think that it would be best suited to your requirements. If you decide to take this route you can take a look at MongoDB. I have not used it extensively, but it seems most similar to relational database than the other NoSQL database out there, and its python interface seems pretty similar to django's ORM. By the was I remember finding a nice EAV example for Django. Though I don't remember where at the moment.