DRF: how to avoid objects.all() in UniqueValidator - Django

I have a serializer class that represents a user.
class UserSerializer(BaseSerializer):
    uid = serializers.IntegerField(required=True)

    class Meta:
        model = User
        fields = "__all__"

    def validate(self, data):
        super().validate(data)
        validate_user_data(data=self.initial_data, user=self.instance)
        return data
Users should be unique on uid, so when handling a POST request what I really want is to change the uid field to:

uid = serializers.IntegerField(required=True, validators=[validators.UniqueValidator(queryset=User.objects.all())])

and this will probably work; the problem is that this will trigger an SQL query that will select all the users.
This could have a very high impact on the system, since there could be tens of thousands of them.
What I would really want is to change the query to User.objects.get(uid=uid), which will not select every user from the DB.
However, since I'm in the serializer definition of uid, I can't use uid=uid, because uid is just being defined.

(…) and this will probably work; the problem is that this will trigger an SQL query that will select all the users.
This is incorrect. Django will filter the queryset, but the filtering itself happens on the database side.
This will not query for all the items in the User table: the queryset is never evaluated. It acts as a "root queryset" against which queries will be constructed.
We can look up the source code on GitHub:
class UniqueValidator(object):
    # ...
    def __call__(self, value):
        queryset = self.queryset
        queryset = self.filter_queryset(value, queryset)
        queryset = self.exclude_current_instance(queryset)
        if qs_exists(queryset):
            raise ValidationError(self.message, code='unique')
Here the queryset is filtered. The filtering is not done at the Python level; it merely constructs a filtered variant of the queryset. Indeed, if we look at the filter_queryset function, we see:
def filter_queryset(self, value, queryset):
    """
    Filter the queryset to all instances matching the given attribute.
    """
    filter_kwargs = {'%s__%s' % (self.field_name, self.lookup): value}
    return qs_filter(queryset, **filter_kwargs)
where qs_filter is defined as:
def qs_filter(queryset, **kwargs):
    try:
        return queryset.filter(**kwargs)
    except (TypeError, ValueError, DataError):
        return queryset.none()
As you can see, it will generate a query like User.objects.filter(uid=the_uid).exclude(pk=item_that_is_updated).
It thus checks whether there is a User object in the database with the same uid as the one you set, excluding the one you are updating (where applicable). The query will look like:
SELECT user.*
FROM user
WHERE uid = the_uid
AND id <> item_that_is_updated
It filters at the database level, and is therefore efficient.
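If you want to convince yourself that the root queryset stays lazy, a quick Django-shell check (a minimal sketch, assuming your User model with its uid field) shows that the filtered queryset compiles to a single WHERE-constrained statement and never fetches every row:

from myapp.models import User  # assumption: your User model with a `uid` field

qs = User.objects.all()         # lazy: no query has run yet
filtered = qs.filter(uid=1234)  # still lazy: only builds a new queryset
print(filtered.query)           # SELECT ... FROM ... WHERE uid = 1234
filtered.exists()               # runs a single cheap `SELECT ... LIMIT 1`

This mirrors what UniqueValidator does internally via qs_filter and qs_exists.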

Related

Optimize the filtering of Queryset results in Django

I'm overriding Django Admin's list_filter (to customize the filter that shows on the right side of the Django admin UI for a list view). The following code works, but isn't optimized: it increases the number of SQL queries by the number of product categories.
(The parts to focus on in the following code sample are qs.values_list('product_category', flat=True), which only returns an id (int), so I have to use ProductCategory.objects.get(id=i).)
Wondering if this can be simplified?
(E.g. data: Suppose the product categories are "baked" "fried" "raw" etc., and the Items are "bread" "fish fry" "cake". So when the Item list is displayed in Django Admin, all product categories will show on the 'Filter By' column on the right side of the UI.)
from django.utils.translation import ugettext_lazy as _
from django.contrib.admin import SimpleListFilter
from product_category.model import ProductCategory

class ProductCategoryFilter(SimpleListFilter):
    title = _('ProductCategory')
    parameter_name = 'product_category'

    def lookups(self, request, model_admin):
        qs = model_admin.get_queryset(request)
        ordered_filter_obj_list = []
        # TODO: Works, but increases SQL queries by "number of product categories"
        for i in (
            qs.values_list("product_category", flat=True)
            .distinct()
            .order_by("product_category")
        ):
            cat = ProductCategory.objects.get(id=i)
            ordered_filter_obj_list.append((i, cat))
        return ordered_filter_obj_list

    def queryset(self, request, queryset):
        if self.value():
            return queryset.filter(product_category__exact=self.value())

P.S. The above filter is used in another class like so:

class ItemAdmin(admin.ModelAdmin):
    list_filter = (ProductCategoryFilter,)
Probably you are looking for select_related. I do not know your exact model structure, but you may use it as follows:
cats = set()
for p in Product.objects.all().select_related('category'):
    # Without select_related(), this would make a database query for each
    # loop iteration in order to fetch the related categories for each product.
    cats.add(p.category)
I am assuming there is some relation between your Product and ProductCategory models. Hope this helps.
Hah, phrasing the question makes it clear in your own head! Found an answer minutes after posting this:
(Instead of doing an objects.get() inside the for loop, we can do objects.all() (which is a single SQL query) and fill up a temporary dictionary. Then use this temp dict to find the associated string value.)
def lookups(self, request, model_admin):
    qs = model_admin.get_queryset(request)
    category_list = {}
    for x in ProductCategory.objects.all():
        category_list[x.id] = str(x)
    ordered_filter_obj_list = []
    for i in (
        qs.values_list("product_category", flat=True)
        .distinct()
        .order_by("product_category")
    ):
        ordered_filter_obj_list.append((i, category_list[i]))
    return ordered_filter_obj_list
The first element of each tuple is the lookup value, and the second is just the name for display. This can be done in a single SQL query through the Django ORM:
def lookups(self, request, model_admin):
    qs = model_admin.get_queryset(request).select_related('product_category')
    values = qs.values('product_category_id', 'product_category__name')  # assuming ProductCategory has an attribute 'name'
    # Note: distinct() with field names is only supported on PostgreSQL.
    unique_categories = values.distinct('product_category_id', 'product_category__name')
    categories = []
    for c in unique_categories:
        categories.append((c['product_category_id'], c['product_category__name']))
    return categories
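If you are not on PostgreSQL, in_bulk() gives a portable two-query variant: one query for the distinct ids, and one that fetches only the matching categories keyed by id. A sketch reusing the names from the question:

def lookups(self, request, model_admin):
    qs = model_admin.get_queryset(request)
    ids = list(
        qs.values_list("product_category", flat=True)
        .distinct()
        .order_by("product_category")
    )
    # One query fetching only the referenced categories, as {id: obj}.
    categories = ProductCategory.objects.in_bulk(ids)
    return [(i, str(categories[i])) for i in ids]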

How to sort after dehydrate creates all data in Tastypie

I want BookResource to perform a join (the book table with the author table) in the dehydrate(...) function. The final result should be sorted by the Author table.
dehydrate(...) is called for each item in the Book table.
class Author(Model):
    author_name = models.CharField(max_length=64)

class Book(Model):
    author = models.ForeignKey('Author')
    book_name = models.CharField(max_length=64)

class BookResource(ModelResource):
    class Meta(object):
        # The point here is that the Book table can be sorted. But the final
        # result should be sorted by author_name.
        queryset = Book.objects.all().order_by('book_name')
        resource_name = 'api_test'
        serializer = Serializer(formats=['xml', 'json'])
        allowed_methods = ('get',)
        always_return_data = True

    def dehydrate(self, bundle):
        author_id = bundle.obj.author.id  # author is a foreign key of book
        author_obj = Author.objects.get(id=author_id)
        # Construct the bundle with author_name. Same as joining the 2 tables.
        # But I want to sort by author.
        bundle.data['author_name'] = author_obj.author_name
        return bundle

    # This is called before dehydrate(...) is called. Not sure how to use it.
    def apply_sorting(self, obj_list, options=None):
        return obj_list
Questions:
1) How do I sort the result by author using the above code?
2) I could not figure out how to do the join. Can you provide an alternative?
Thank you.
Very late answer, but I ran into this lately. If you look at the Tastypie resource source around the dehydrate method, there is a series of methods being called. In short, this is the order of calls:
build_bundle
obj_get_list
apply_sorting
paginators
full_dehydrate (field by field)
alter_list_data_to_serialize
return self.create_response
You want to focus on the second-to-last method, self.alter_list_data_to_serialize(request, to_be_serialized); the base method just returns the second parameter:
def alter_list_data_to_serialize(self, request, data):
    """
    A hook to alter list data just before it gets serialized & sent to the user.
    Useful for restructuring/renaming aspects of the what's going to be
    sent.
    Should accommodate for a list of objects, generally also including
    meta data.
    """
    return data
Whatever you want to do, you can do it in here. Note that the to-be-serialized data will be a dict with 2 fields, meta and objects, and the objects field holds a list of bundle objects.
An additional sorting layer could be:
def alter_list_data_to_serialize(self, request, data):
    try:
        # sidx/sord request params (as used by e.g. jqGrid): `sidx` names
        # the sort field, `sord` the direction; descending means reverse.
        # (`default_value` is a placeholder for a fallback sort field.)
        reverse = request.GET.get("sord") == "desc"
        data["objects"].sort(
            key=lambda x: x.data[request.GET.get("sidx", default_value)],
            reverse=reverse)
    except (KeyError, TypeError):
        pass
    return data
You can clean it up, as it is a bit dirty, but that's the main idea.
Again, it's a workaround; all the sorting should really belong in apply_sorting.
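For completeness, the sorting can also live where it belongs, in apply_sorting, by pushing the ordering into the database instead of sorting bundles in Python. A minimal sketch against the models from the question (the double-underscore lookup follows the ForeignKey, so this becomes a SQL JOIN with an ORDER BY on the author's name):

class BookResource(ModelResource):
    class Meta(object):
        queryset = Book.objects.all()
        resource_name = 'api_test'

    def apply_sorting(self, obj_list, options=None):
        # Order by the related Author's name at the database level;
        # dehydrate() then only adds author_name to already-sorted rows.
        return obj_list.order_by('author__author_name', 'book_name')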

Is it possible to write a QuerySet method that modifies the dataset but delays evaluation (similar to prefetch_related)?

I'm working on a QuerySet class that does something similar to prefetch_related but allows the query to link data that's in an unconnected database (basically, linking records from the django app's database to records in a legacy system, using a shared unique key), something along the lines of:
class UserFoo(models.Model):
    ''' Uses the django database & can link to User model '''
    user = models.OneToOneField(User, related_name='userfoo')
    foo_record = models.CharField(
        max_length=32,
        db_column="foo",
        unique=True
    )  # uuid pointing to legacy db table

    @property
    def foo(self):
        if not hasattr(self, '_foo'):
            self._foo = Foo.objects.get(uuid=self.foo_record)
        return self._foo

    @foo.setter
    def foo(self, foo_obj):
        self._foo = foo_obj
and then
class Foo(models.Model):
    '''Uses legacy database'''
    id = models.AutoField(primary_key=True)
    uuid = models.CharField(max_length=32)  # uuid for Foo legacy db table
    …

    @property
    def user(self):
        if not hasattr(self, '_user'):
            self._user = User.objects.get(userfoo__foo_record=self.uuid)
        return self._user

    @user.setter
    def user(self, user_obj):
        self._user = user_obj
Run normally, a query that matches 100 foos (each with, say, 1 user record) will end up requiring 101 queries: one to get the foos, and a hundred for the user records (one lookup per foo, triggered by calling the user property on each foo).
To get around this, I am making something similar to prefetch_related which pulls all of the matching records for a query by the key, which means I just need one additional query to get the remaining records.
My code looks something like this:
class FooWithUserQuerySet(models.query.QuerySet):
    def with_foo(self):
        qs = self._clone()
        foo_idx = {}
        for record in self.all():
            foo_idx.setdefault(record.uuid, []).append(record)
        users = User.objects.filter(
            userfoo__foo_record__in=foo_idx.keys()
        ).select_related('django', 'relations', 'here')
        user_idx = {}
        for user in users:
            user_idx[user.userfoo.foo_record] = user
        for fid, frecords in foo_idx.items():
            user = user_idx.get(fid)
            for frecord in frecords:
                if user:
                    setattr(frecord, 'user', user)
        return qs
This works, but any extra data saved to a foo is lost if the query is later modified — that is, if the queryset is re-ordered or filtered in any way.
I would like a way to create a method that does exactly what I am doing now, but delays the work until the moment the query is evaluated, and re-runs it whenever the query is adjusted, so that foo records always have a User record.
Some notes:
the example has been highly simplified. There are actually a lot of tables that link up to the legacy data, so for example, although there is a one-to-one relationship between Foo and User, there will be some cases where a queryset will have multiple Foo records with the same key.
the legacy database is on a different server and server platform, so I can't link the two tables using a database server itself
ideally I'd like the User data to be cached, so that even if the records are sorted or sliced I don't have to re-run the foo query a second time.
Basically, I don't know enough about the internals of how the lazy evaluation of querysets works in order to do the necessary coding. I have jumped back and forth on the source code for django.db.models.query but it really is a fairly dense read and I'm hoping someone out there who's worked with this already can offer some pointers.
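Since the question asks for pointers on the lazy-evaluation internals: the hook prefetch_related itself uses is QuerySet._fetch_all, which runs once when the result cache is first populated. A sketch along those lines follows; note that _fetch_all and _clone are private internals whose signatures shift between Django versions, and the model names are the ones from the question, so treat this as a starting point rather than a supported API:

from django.db import models
# assumes the User/UserFoo/Foo models from the question are importable here

class FooWithUserQuerySet(models.query.QuerySet):
    """Sketch: defer the user-linking step until the queryset is evaluated."""

    def with_users(self):
        clone = self._clone()
        clone._attach_users = True
        return clone

    def _clone(self, **kwargs):
        # Propagate the flag so re-ordering/filtering keeps the behaviour.
        clone = super(FooWithUserQuerySet, self)._clone(**kwargs)
        clone._attach_users = getattr(self, '_attach_users', False)
        return clone

    def _fetch_all(self):
        # Mirror prefetch_related: run the extra query exactly once,
        # right after the main result cache is filled.
        already_filled = self._result_cache is not None
        super(FooWithUserQuerySet, self)._fetch_all()
        if getattr(self, '_attach_users', False) and not already_filled:
            self._attach_user_records()

    def _attach_user_records(self):
        foo_idx = {}
        for record in self._result_cache:
            foo_idx.setdefault(record.uuid, []).append(record)
        users = User.objects.filter(
            userfoo__foo_record__in=list(foo_idx.keys())
        ).select_related('userfoo')
        for user in users:
            for record in foo_idx.get(user.userfoo.foo_record, ()):
                record.user = user  # goes through the `user` property setter

With this, Foo.objects.with_users().filter(...).order_by(...) carries the flag through each clone, and the linking query re-runs whenever the modified queryset is actually evaluated.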

DRF - How to get WritableField to not load entire database into memory?

I have a very large database (6 GB) that I would like to use Django REST Framework with. In particular, I have a model that has a ForeignKey relationship to the django.contrib.auth.models.User table (not so big) and a ForeignKey to a BIG table (let's call it Product). The model can be seen below:
class ShoppingBag(models.Model):
    user = models.ForeignKey('auth.User', related_name='+')
    product = models.ForeignKey('myapp.Product', related_name='+')
    quantity = models.SmallIntegerField(default=1)
Again, there are 6GB of Products.
The serializer is as follows:
class ShoppingBagSerializer(serializers.ModelSerializer):
    product = serializers.RelatedField(many=False)
    user = serializers.RelatedField(many=False)

    class Meta:
        model = ShoppingBag
        fields = ('product', 'user', 'quantity')
So far this is great: I can do a GET on the list and on individual shopping bags, and everything is fine. For reference, the queries (using a query logger) look something like this:
SELECT * FROM myapp_product WHERE product_id=1254
SELECT * FROM auth_user WHERE user_id=12
SELECT * FROM myapp_product WHERE product_id=1404
SELECT * FROM auth_user WHERE user_id=12
...
This repeats for as many shopping bags as are returned.
I would also like to be able to POST to create new shopping bags, but serializers.RelatedField is read-only. Let's make it read-write:
class ShoppingBagSerializer(serializers.ModelSerializer):
    product = serializers.PrimaryKeyRelatedField(many=False)
    user = serializers.PrimaryKeyRelatedField(many=False)
    ...
Now things get bad... GET requests to the list action take > 5 minutes and I noticed that my server's memory jumps up to ~6GB; why?! Well, back to the SQL queries and now I see:
SELECT * FROM myapp_products;
SELECT * FROM auth_user;
Ok, so that's not good. Clearly we're doing "prefetch related" or "select_related" or something like that in order to get access to all the products; but this table is HUGE.
Further inspection reveals where this happens on Line 68 of relations.py in DRF:
def initialize(self, parent, field_name):
    super(RelatedField, self).initialize(parent, field_name)
    if self.queryset is None and not self.read_only:
        manager = getattr(self.parent.opts.model, self.source or field_name)
        if hasattr(manager, 'related'):  # Forward
            self.queryset = manager.related.model._default_manager.all()
        else:  # Reverse
            self.queryset = manager.field.rel.to._default_manager.all()
If not readonly, self.queryset = ALL!!
So I'm pretty sure that this is where my problem is, and I need a way to say "don't select everything here", but I'm not 100% sure whether this is the issue or where to deal with it. It seems like everything should be memory-safe with pagination, but this is simply not the case. I'd appreciate any advice.
In the end, we had to simply create our own PrimaryKeyRelatedField class to override the default behavior in Django-REST-Framework. Basically we ensured that the queryset was empty until we wanted to look up the object, then we performed the lookup. This was extremely annoying, and I hope the Django-REST-Framework guys take note of this!
Our final solution:
# Assumed imports for this snippet (DRF 2.x era):
from django.core.exceptions import ValidationError
from django.utils.encoding import smart_text

class ProductField(serializers.PrimaryKeyRelatedField):
    many = False

    def __init__(self, *args, **kwargs):
        # Hack to ensure ALL products are not loaded
        kwargs['queryset'] = Product.objects.none()
        super(ProductField, self).__init__(*args, **kwargs)

    def field_to_native(self, obj, field_name):
        return unicode(obj)

    def from_native(self, data):
        """
        Perform the query lookup here.
        """
        try:
            return Product.objects.get(pk=data)
        except Product.DoesNotExist:
            msg = self.error_messages['does_not_exist'] % smart_text(data)
            raise ValidationError(msg)
        except (TypeError, ValueError):
            msg = self.error_messages['incorrect_type'] % type(data)
            raise ValidationError(msg)
And then our serializer is as follows:
class ShoppingBagSerializer(serializers.ModelSerializer):
    product = ProductField()
    ...
This hack ensures the entire database isn't loaded into memory; instead it performs one-off selects based on the data. It's not as efficient computationally, but it also doesn't blast our server by loading the results of multi-second database queries into memory!
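As a postscript for readers on DRF 3.x: this field-level hack shouldn't be needed any more. A writable PrimaryKeyRelatedField now requires an explicit queryset, and, as the UniqueValidator discussion above shows for validators, that queryset acts as a lazy root; validation performs a single .get(pk=...) against it rather than evaluating it. A sketch, not verified against this exact schema:

class ShoppingBagSerializer(serializers.ModelSerializer):
    # The queryset is not evaluated up front; to_internal_value() runs
    # one .get(pk=...) against it when validating the posted id.
    product = serializers.PrimaryKeyRelatedField(queryset=Product.objects.all())
    user = serializers.PrimaryKeyRelatedField(queryset=User.objects.all())

    class Meta:
        model = ShoppingBag
        fields = ('product', 'user', 'quantity')

The browsable API can still try to render every product as a dropdown choice; the field's html_cutoff argument limits how many choices get rendered.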

'private' models, default query sets and chaining methods

I have a private boolean flag on my model, and a custom manager that overrides the get_query_set method with a filter removing private=True:
class myManager(models.Manager):
    def get_query_set(self):
        qs = super(myManager, self).get_query_set()
        qs = qs.filter(private=False)
        return qs

class myModel(models.Model):
    private = models.BooleanField(default=False)
    owner = models.ForeignKey('Profile', related_name="owned")
    # ...etc...

    objects = myManager()
I want the default queryset to exclude the private models by default as a security measure, preventing accidental use of the model from exposing private models.
Sometimes, however, I will want to show the private models, so I have the following on the manager:
def for_user(self, user):
    if user and not user.is_authenticated():
        return self.get_query_set()
    qs = super(myManager, self).get_query_set()
    qs = qs.filter(Q(owner=user, private=True) | Q(private=False))
    return qs
This works excellently, with the limitation that I can't chain the filter. This becomes a problem when I have an fk pointing to myModel and use otherModel.mymodel_set: otherModel.mymodel_set.for_user(user) won't work, because mymodel_set returns a QuerySet object rather than the manager.
Now the real problem starts, as I can't see a way to make the for_user() method work on a QuerySet subclass: I can't access the full, unfiltered queryset from the QuerySet subclass (get_query_set has essentially overwritten it) the way I can in the manager (where super() gets me the base queryset).
What is the best way to work around this?
I'm not tied to any particular interface, but I would like it to be as djangoy/DRY as it can be. Obviously I could drop the security and just call a method to filter out private tasks on each call, but I really don't want to have to do that.
Update
manji's answer below is very close, but it fails when the queryset I want isn't a subset of the default queryset. I guess the real question here is: how can I remove a particular filter from a chained query?
Define a custom QuerySet (containing your custom filter methods):
class MyQuerySet(models.query.QuerySet):
    def public(self):
        return self.filter(private=False)

    def for_user(self, user):
        if user and not user.is_authenticated():
            return self.public()
        return self.filter(Q(owner=user, private=True) | Q(private=False))
Define a custom manager that will use MyQuerySet (MyQuerySet's custom filters will be accessible as if they were defined on the manager, by overriding __getattr__):
# A custom manager accepting a custom QuerySet class
class MyManager(models.Manager):
    use_for_related_fields = True

    def __init__(self, qs_class=models.query.QuerySet):
        self.queryset_class = qs_class
        super(MyManager, self).__init__()

    def get_query_set(self):
        return self.queryset_class(self.model).public()

    def __getattr__(self, attr, *args):
        try:
            return getattr(self.__class__, attr, *args)
        except AttributeError:
            return getattr(self.get_query_set(), attr, *args)
Then in the model:
class MyModel(models.Model):
    private = models.BooleanField(default=False)
    owner = models.ForeignKey('Profile', related_name="owned")
    # ...etc...

    objects = MyManager(MyQuerySet)
Now you can:
- access only public models by default:
MyModel.objects.filter(..
- access for_user models:
MyModel.objects.for_user(user1).filter(..
Because of use_for_related_fields = True, this same manager will be used for related managers. So you can also:
- access only public models from related managers by default:
otherModel.mymodel_set.filter(..
- access for_user from related managers:
otherModel.mymodel_set.for_user(user).filter(..
More information: Subclassing Django QuerySets & Custom managers with chainable filters (django snippet)
To use the chain, you should override get_query_set in your manager and place for_user in your custom QuerySet.
I don't like this solution, but it works.
class CustomQuerySet(models.query.QuerySet):
    def for_user(self, *args, **kwargs):
        return super(CustomQuerySet, self).filter(*args, **kwargs).filter(private=False)

class CustomManager(models.Manager):
    def get_query_set(self):
        return CustomQuerySet(self.model, using=self._db)
If you need to "reset" the QuerySet, you can access the model of the queryset and call the original manager again (to fully reset). However, that's probably not very useful for you unless you were keeping track of the previous filter/exclude etc. statements and can replay them on the reset queryset. With a bit of planning that actually wouldn't be too hard to do, but it may be a bit brute force.
Overall manji's answer is definitely the right way to go.
So, amending manji's answer, you need to replace the existing "model"."private" = False with ("model"."owner_id" = 2 AND "model"."private" = True) OR "model"."private" = False. To do that, you will need to walk through the where object on the queryset's query object to find the relevant bit to remove. The query object has a WhereNode that represents the tree of the WHERE clause, with each node having multiple children. You'd have to call as_sql on a node to figure out whether it's the one you are after:
from django.db import connection

qn = connection.ops.quote_name
q = myModel.objects.all()
print q.query.where.children[0].as_sql(qn, connection)
Which should give you something like:
('"model"."private" = ?', [False])
However trying to do that is probably way more effort than it's worth and it's delving into bits of Django that are probably not API-stable.
My recommendation would be to use two managers. One that can access everything (an escape hatch of sort), the other with the default filtering applied. The default manager is the first one, so you need to play around with the ordering depending on what you need to do. Then restructure your code to know which one to use - so you don't have the problem of having the extra private=False clause in there already.
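A minimal sketch of that two-manager layout, reusing manji's names (the split itself is the recommendation above; the exact names are illustrative):

class MyModel(models.Model):
    private = models.BooleanField(default=False)
    owner = models.ForeignKey('Profile', related_name="owned")

    # The first manager declared is the default, so the filtered one stays
    # the safe default; all_objects is the explicit, unfiltered escape hatch.
    objects = MyManager(MyQuerySet)
    all_objects = models.Manager()

Code that genuinely needs private rows then opts in with MyModel.all_objects, so the private=False clause never has to be stripped back out of a chained queryset.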