Performance: Store likes in PostgreSQL ArrayField (Django example) - django

I have 2 models: Post and Comment, each can be liked by User.
For sure, total likes should be rendered somewhere near each Post or Comment
But also each User should have a page with all liked content.
So, the most obvious way is just do it with m2m field, which seems will lead to lots of problems in some future.
And what about this?
Post and Comment models should have some
users_liked_ids = ArrayField(models.IntegerField())
User model should also have such fields:
posts_liked_ids = ArrayField(models.IntegerField())
comments_liked_ids = ArrayField(models.IntegerField())
And each time User likes something, two actions are performed:
User's id adds to Post's/Comment's users_liked_ids field
Post's/Comment's id adds to User's posts_liked_ids/comments_liked_ids field
The questions are:
Is it a good plan?
Will it be efficient to make lookups in such approach to get "Is that Post/Comment` was liked but current user
Will it be better to store likes in some separate table, rather then in liked model, but also in ArrayField
Probably better stay with obvious m2m?

1) No.
2) Definitely not.
3) Absolutely, incredibly not. Don't split your data up even further.
4) Yes.
Here are some of the problems:
no referential integrity, since you can't create foreign keys on array elements, meaning you could easily have garbage values in an ID array
data duplication with posts having user ids and users having post ids means it's possible for information to get out of sync (what happens when a user or post is deleted?)
inefficient lookups in match arrays (your #2)
Don't, under any circumstances, do this. You may want to combine your "post" and "comment" models to simplify the relationship, but this is what junction tables are for. Arrays are good for use cases that don't involve foreign keys or the potential for extreme length.

Related

How to censor specific fields based on condition using django QuerySet API

Using Django we have situation where we have a model Case which can be set to being a medical case or not (through a BooleanField).
Now, we also have a system to check if a certain employee (User subclass) is authorized to see sensitive data when a case is labeled as being medical (containing medical sensitive data).
I am able to annotate a new field to each instance, a BooleanField letting us know if the requesting employee is authorized to see medical data on the specific Case instance or not.
Ideally, I would like to have the database sensor out specific fields (field customer for example), when the requesting employee is not authorized to see medical data for that case. I imagine this can be done with an annotate method, and a combination of from django.db.models.Case and from django.db.models.When.
But, what we would also like is that the resulting QuerySet keeps the same field names on the different model instances. We don't want to change the field name of customer to another name.
We have actually come up with a solution, using .values first, and then the .annotate for each field we want to potentially censor out (see code below). This isn't ideal though, for multiple reasons. For one, we don't get back model instances, but dictionaries. Also, but this is another question, one of the fields that needs to be censored is a ManyToManyField, and using .values now returns a unique row for each instance referred to through the ManyToManyField (any solution for that?)
Also, ideally, this queryset would be the base queryset for all situations in which an employee tries to request Cases in our app. We want all our colleagues to use this base queryset so that we don't have to implement the same solution in multiple places, and prevent sensitive data from leaking.
So, I am wondering, can anyone recommend a solution for this situation?
Thanks in advance!
PS. We would like to have this done by the database since the amount of cases being fetched is potentially very high, and doing this in Python would probably require a lot of CPU power and thus kill performance.
from django.db.models import Case, When, BooleanField, IntegerField, F, Value, Q
OurModel.objects.annotate(
employee_medical_authorized=Case(
When(..., then=Value(True)),
default=Value(False),
output_field=BooleanField()
)).values(...).annotate(
customer=Case(
When(Q(employee_medical_authorized=Value(False)) & Q(medical=Value(True)),
then=Value(None)),
default=F('customer'),
output_field=IntegerField()
)
)

Should I use JSONField over ForeignKey to store data?

I'm facing a dilemma, I'm creating a new product and I would not like to mess up the way I organise the informations in my database.
I have these two choices for my models, the first one would be to use foreign keys to link my them together.
Class Page(models.Model):
data = JsonField()
Class Image(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
Class Video(models.Model):
page = models.ForeignKey(Page)
data = JsonField()
etc...
The second is to keep everything in Page's JSONField:
Class Page(models.Model):
data = JsonField() # videos and pictures, etc... are stored here
Is one better than the other and why? This would be a huge help on the way I would organize my databases in the futur.
I thought maybe the second option could be slower since everytime something changes all the json would be overridden, but does it make a huge difference or is what I am saying false?
A JSONField obfuscates the underlying data, making it difficult to write readable code and fully use Django's built-in ORM, validations and other niceties (ModelForms for example). While it gives flexibility to save anything you want to the db (e.g. no need to migrate the db when adding new fields), it takes away the clarity of explicit fields and makes it easy to introduce errors later on.
For example, if you start saving a new key in your data and then try to access that key in your code, older objects won't have it and you might find your app crashing depending on which object you're accessing. That can't happen if you use a separate field.
I would always try to avoid it unless there's no other way.
Typically I use a JSONField in two cases:
To save a response from 3rd party APIs (e.g. as an audit trail)
To save references to archived objects (e.g. when the live products in my db change but I still have orders referencing the product).
If you use PostgreSQL, as a relational database, it's optimised to be super-performant on JOINs so using ForeignKeys is actually a good thing. Use select_related and prefetch_related in your code to optimise the number of queries made, but the queries themselves will scale well even for millions of entries.

How to retrieve values from Django ForeignKey -> ManyToMany fields?

I have a model (Realtor) with a ForeignKey field (BillingTier), which has a ManyToManyField (BillingPlan). For each logged in realtor, I want to check if they have a billing plan that offers automatic feedback on their listings. Here's what the models look like, briefly:
class Realtor(models.Model):
user = models.OneToOneField(User)
billing_tier = models.ForeignKey(BillingTier, blank=True, null=True, default=None)
class BillingTier(models.Model):
plans = models.ManyToManyField(BillingPlan)
class BillingPlan(models.Model):
automatic_feedback = models.BooleanField(default=False)
I have a permissions helper that checks the user permissions on each page load, and denies access to certain pages. I want to deny the feedback page if they don't have the automatic feedback feature in their billing plan. However, I'm not really sure the best way to get this information. Here's what I've researched and found so far, but it seems inefficient to be querying on each page load:
def isPermitted(user, url):
premium = [t[0] for t in user.realtor.billing_tier.plans.values_list('automatic_feedback') if t[0]]
I saw some solutions which involved using filter (ManyToMany field values from queryset), but I'm equally unsure of using the query for each page load. I would have to get the billing tier id from the realtor: bt_id = user.realtor.billing_tier.id and then query the model like so:
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct()
I think the second option reads nicer, but I think the first would perform better because I wouldn't have to import and query the BillingTier model.
Is there a better option, or are these two the best I can hope for? Also, which would be more efficient for every page load?
As per the OP's invitation, here's an answer.
The core question is how to define an efficient permission check based on a highly relational data model.
The first variant involves building a Python list from evaluating a Django query set. The suspicion must certainly be that it imposes unnecessary computations on the Python interpreter. Although it's not clear whether that's tolerable if at the same time it allows for a less complex database query (a tradeoff which is hard to assess), the underlying DB query is not exactly simple.
The second approach involves fetching additional 1:1 data through relational lookups and then checking if there is any record fulfilling access criteria in a different, 1:n relation.
Let's have a look at them.
bt_id = user.realtor.billing_tier.id: This is required to get the hook for the following 1:n query. It is indeed highly inefficient in itself. It can be optimized in two ways.
As per Django: Access Foreign Keys Directly, it can be written as bt_id = user.realtor.billing_tier_id because the id is of course present in billing_tier and needs not be found via a relational operation.
Assuming that the page in itself would only load a user object, Django can be told to fetch and cache relational data along with that via select_related. So if the page does not only fetch the user object but the required billing_tier_id as well, we have saved one additional DB hit.
BillingTier.objects.filter(id = bt_id).filter(plans__automatic_feedback=True).distinct() can be optimized using Django's exists because that will redurce efforts both in the database and regarding data traffic between the database and Python.
Maybe even Django's prefetch_related can be used to combine the 1:1 and 1:n queries into a single query, but it's much more difficult to judge whether that pays. Could be worth a try.
In any case, it's worth installing a gem called Django Debug Toolbar which will allow you to analyze how much time your implementation spends on database queries.

Django .order_by() with .distinct() using postgres

I have a Read model that is related to an Article model. What I would like to do is make a queryset where articles are unique and ordered by date_added. Since I'm using postgres, I'd prefer to use the .distinct() method and specify the article field. Like so:
articles = Read.objects.order_by('article', 'date_added').distinct('article')
However this doesn't give the desired effect and orders the queryset by the order they were created. I am aware of the note about .distinct() and .order_by() in Django's documentation, but I don't see that it applies here since the side effect it mentions is there will be duplicates and I'm not seeing that.
# To actually sort by date added I end up doing this
articles = sorted(articles, key=lambda x: x.date_added, reverse=True)
This executes the entire query before I actually need it and could potentially get very slow if there are lots of records. I've already optimized using select_related().
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
UPDATE
The output would ideally be a queryset of Read instances where their related article is unique in the queryset and only using the Django orm (i.e. sorting in python).
Is there a better, more efficient, way to create a query with uniqueness of a related model and order_by date?
Possibily. It's hard to say without the full picture, but my assumption is that you are using Read to track which articles have and have not been read, and probably tying this to User instance to determine if a particular user has read an article or not. If that's the case, your approach is flawed. Instead, you should do something like:
class Article(models.Model):
...
read_by = models.ManyToManyField(User, related_name='read_articles')
Then, to get a particular user's read articles, you can just do:
user_instance.read_articles.order_by('date_added')
That takes the need to use distinct out of the equation, since there will not be any duplicates now.
UPDATE
To get all articles that are read by at least one user:
Article.objects.filter(read_by__isnull=False)
Or, if you want to set a threshold for popularity, you can use annotations:
from django.db.models import Count
Article.objects.annotate(read_count=Count('read_by')).filter(read_count__gte=10)
Which would give you only articles that have been read by at least 10 users.

Filter on a list of tags

I'm trying to select all the songs in my Django database whose tag is any of those in a given list. There is a Song model, a Tag model, and a SongTag model (for the many to many relationship).
This is my attempt:
taglist = ["cool", "great"]
tags = Tag.objects.filter(name__in=taglist).values_list('id', flat=True)
song_tags = SongTag.objects.filter(tag__in=list(tags))
At this point I'm getting an error:
DatabaseError: MultiQuery does not support keys_only.
What am I getting wrong? If you can suggest a completely different approach to the problem, it would be more than welcome too!
EDIT: I should have mentioned I'm using Django on Google AppEngine with django-nonrel
You shouldn't use m2m relationship with AppEngine. NoSQL databases (and BigTable is one of them) generally don't support JOINs, and programmer is supposed to denormalize the data structure. This is a deliberate design desicion: while your database will contain redundant data, your read queries will be much simpler (no need to combine data from 3 tables), which in turn makes the design of DB server much simpler as well (of course this is made for the sake of optimization and scaling)
In your case you should probably get rid of Tag and SongTag models, and just store the tag in the Song model as a string. I of course assume that Tag model only contains id and name, if Tag in fact contains more data, you should still have Tag model. Song model in that case should contain both tag_id and tag_name. The idea, as I explained above, is to introduce redundancy for the sake of simpler queries
Please, please let the ORM build the query for you:
song_tags = SongTag.objects.filter(tag__name__in = taglist)
You should try to use only one query, so that Django also generates only one query using a join.
Something like this should work:
Song.objects.filter(tags__name__in=taglist)
You may need to change some names from this example (most likely the tags in tags__name__in), see https://docs.djangoproject.com/en/1.3/ref/models/relations/.