I'm working on a django project with the following models.
class User(models.Model):
pass
class Item(models.Model):
user = models.ForeignKey(User)
item_id = models.IntegerField()
There are about 10 million items and 100 thousand users.
My goal is to override the default admin search that takes forever and
return all the matching users that own "all" of the specified item ids within a reasonable timeframe.
These are a couple of the tests I use to better illustrate my criteria.
class TestSearch(TestCase):
def search(self, searchterm):
"""A tuple is returned with the first element as the queryset"""
return do_admin_search(User.objects.all())
def test_return_matching_users(self):
user = User.objects.create()
Item.objects.create(item_id=12345, user=user)
Item.objects.create(item_id=67890, user=user)
result = self.search('12345 67890')
assert_equal(1, result[0].count())
assert_equal(user, result[0][0])
def test_exclude_users_that_do_not_match_1(self):
user = User.objects.create()
Item.objects.create(item_id=12345, user=user)
result = self.search('12345 67890')
assert_false(result[0].exists())
def test_exclude_users_that_do_not_match_2(self):
user = User.objects.create()
result = self.search('12345 67890')
assert_false(result[0].exists())
The following snippet is my best attempt using annotate that takes over 50 seconds.
def search_by_item_ids(queryset, item_ids):
params = {}
for i in item_ids:
cond = Case(When(item__item_id=i, then=True), output_field=BooleanField())
params['has_' + str(i)] = cond
queryset = queryset.annotate(**params)
params = {}
for i in item_ids:
params['has_' + str(i)] = True
queryset = queryset.filter(**params)
return queryset
Is there anything I can do to speed it up?
Here's some quick suggestions that should improve performance drastically.
Use prefetch_related` on the initial queryset to get related items
queryset = User.objects.filter(...).prefetch_related('user_set')
Filter with the __in operator instead of looping through a list of IDs
def search_by_item_ids(queryset, item_ids):
return queryset.filter(item__item_id__in=item_ids)
Don't annotate if it's already a condition of the query
Since you know that this queryset only consists of records with ids in the item_ids list, no need to write that per object.
Putting it all together
You can speed up what you are doing drastically just by calling -
queryset = User.objects.filter(
item__item_id__in=item_ids
).prefetch_related('user_set')
with only 2 db hits for the full query.
Related
I'm using Django filters (django-filter) in my project. I have the models below, where a composition (Work) has a many-to-many instrumentations field with a through model. Each instrumentation has several instruments within it.
models.py:
class Work(models.Model):
instrumentations = models.ManyToManyField(Instrument,
through='Instrumentation',
blank=True)
class Instrument(models.Model):
name = models.CharField(max_length=100)
class Instrumentation(models.Model):
players = models.IntegerField(validators=[MinValueValidator(1)])
work = models.ForeignKey(Work, on_delete=models.CASCADE)
instrument = models.ForeignKey(Instrument, on_delete=models.CASCADE)
views.py:
import django_filters
class WorkFilter(django_filters.FilterSet):
instrument = django_filters.ModelMultipleChoiceFilter(
field_name="instrumentation__instrument",
queryset=Instrument.objects.all())
My filter works fine: it grabs all the pieces where there is the instrument selected by the user in the filter form.
However, I'd like to add the possibility of filtering the compositions with those exact instruments. For instance, if a piece contains violin, horn and cello and nothing else, I'd like to get that, but not a piece written for violin, horn, cello, and percussion. Is it possible to achieve that?
I'd also like the user to choose, from the interface, whether to perform an exact search or not, but that's a secondary issue for now, I suppose.
Update: type_of_search using ChoiceFilter
I made some progress; with the code below, I can give the user a choice between the two kinds of search. Now, I need to find which query would grab only the compositions with that exact set of instruments.
class WorkFilter(django_filters.FilterSet):
# ...
CHOICES = {
('exact', 'exact'), ('not_exact', 'not_exact')
}
type_of_search = django_filters.ChoiceFilter(label="Exact match?", choices=CHOICES, method="filter_instruments")
def filter_instruments(self, queryset, name, value):
if value == 'exact':
return queryset.??
elif value == 'not_exact':
return queryset.??
I know that the query I want is something like:
Work.objects.filter(instrumentations__name='violin').filter(instrumentations__name='viola').filter(instrumentations__name='horn')
I just don't know how to 'translate' it into the django_filters language.
Update 2: 'exact' query using QuerySet.annotate
Thanks to this question, I think this is the query I'm looking for:
from django.db.models import Count
instrument_list = ['...'] # How do I grab them from the form?
instruments_query = Work.objects.annotate(count=Count('instrumentations__name')).filter(count=len(instrument_list))
for instrument in instrument_list:
instruments_query = instruments_query.filter(instrumentations__name=instrument_list)
I feel I'm close, I just don't know how to integrate this with django_filters.
Update 3: WorkFilter that returns empty if the search is exact
class WorkFilter(django_filters.FilterSet):
genre = django_filters.ModelChoiceFilter(
queryset=Genre.objects.all(),
label="Filter by genre")
instrument = django_filters.ModelMultipleChoiceFilter(
field_name="instrumentation__instrument",
queryset=Instrument.objects.all(),
label="Filter by instrument")
CHOICES = {
('exact', 'exact'), ('not_exact', 'not_exact')
}
type_of_search = django_filters.ChoiceFilter(label="Exact match?", choices=CHOICES, method="filter_instruments")
def filter_instruments(self, queryset, name, value):
instrument_list = self.data.getlist('instrumentation__instrument')
if value == 'exact':
queryset = queryset.annotate(count=Count('instrumentations__name')).filter(count=len(instrument_list))
for instrument in instrument_list:
queryset = queryset.filter(instrumentations__name=instrument)
elif value == 'not_exact':
pass # queryset = ...
return queryset
class Meta:
model = Work
fields = ['genre', 'title', 'instrument', 'instrumentation']
You can grab instrument_list with self.data.getlist('instrument').
This is how you would use instrument_list for the 'exact' query:
type_of_search = django_filters.ChoiceFilter(label="Exact match?", choices=CHOICES, method=lambda queryset, name, value: queryset)
instrument = django_filters.ModelMultipleChoiceFilter(
field_name="instrumentation__instrument",
queryset=Instrument.objects.all(),
label="Filter by instrument",
method="filter_instruments")
def filter_instruments(self, queryset, name, value):
if not value:
return queryset
instrument_list = self.data.getlist('instrument') # [v.pk for v in value]
type_of_search = self.data.get('type_of_search')
if type_of_search == 'exact':
queryset = queryset.annotate(count=Count('instrumentations')).filter(count=len(instrument_list))
for instrument in instrument_list:
queryset = queryset.filter(instrumentations__pk=instrument)
else:
queryset = queryset.filter(instrumentations__pk__in=instrument_list).distinct()
return queryset
So i got my serializer called like this:
result_serializer = TaskInfoSerializer(tasks, many=True)
And the serializer:
class TaskInfoSerializer(serializers.ModelSerializer):
done_jobs_count = serializers.SerializerMethodField()
total_jobs_count = serializers.SerializerMethodField()
task_status = serializers.SerializerMethodField()
class Meta:
model = Task
fields = ('task_id', 'task_name', 'done_jobs_count', 'total_jobs_count', 'task_status')
def get_done_jobs_count(self, obj):
qs = Job.objects.filter(task__task_id=obj.task_id, done_flag=1)
condition = False
# Some complicate logic to determine condition that I can't reveal due to business
result = qs.count() if condition else 0
# this function take around 3 seconds
return result
def get_total_jobs_count(self, obj):
qs = Job.objects.filter(task__task_id=obj.task_id)
# this query take around 3-5 seconds
return qs.count()
def get_task_status(self, obj):
done_count = self.get_done_jobs_count(obj)
total_count = self.get_total_jobs_count(obj)
if done_count >= total_count:
return 'done'
else:
return 'not yet'
When the get_task_status function is called, it call other 2 function and make those 2 costly query again.
Is there any best way to prevent that? And I dont really know the order of those functions to be called, is it based on the order declare in Meta's fields? Or above that?
Edit:
The logic in get_done_jobs_count is a bit complicate and I cannot make it into a single query when get task
Edit 2:
I just bring all those count function into model and use cached_property
https://docs.djangoproject.com/en/2.1/ref/utils/#module-django.utils.functional
But it raise another question: Is that number reliable? I don't understand much about django cache, is that cached_property is only exist for this instance (just until the API get list of tasks return a response) or will it exist for sometime?
I just try cached_property and it did resolve the problem.
Model:
from django.utils.functional import cached_property
from django.db import models
class Task(models.Model):
task_id = models.AutoField(primary_key=True)
task_name = models.CharField(default='')
#cached_property
def done_jobs_count(self):
qs = self.jobs.filter(done_flag=1)
condition = False
# Some complicate logic to determine condition that I can't reveal due to business
result = qs.count() if condition else 0
# this function take around 3 seconds
return result
#cached_property
def total_jobs_count(self):
qs = Job.objects.filter(task__task_id=obj.task_id)
# this query take around 3-5 seconds
return qs.count()
#property
def task_status(self):
done_count = self.done_jobs_count
total_count = self.total_jobs_count
if done_count >= total_count:
return 'done'
else:
return 'not yet'
Serializer:
class TaskInfoSerializer(serializers.ModelSerializer):
class Meta:
model = Task
fields = ('task_id', 'task_name', 'done_jobs_count', 'total_jobs_count', 'task_status')
You could annotate those values to avoid making extra queries. So the queryset passed to the serializer would look something like this (it might change depending on the Django version you're using and the related query name for jobs):
tasks = tasks.annotate(
done_jobs=Count('jobs', filter=Q(done_flag=1)),
total_jobs=Count('jobs'),
)
result_serializer = TaskInfoSerializer(tasks, many=True)
Then the serializer method would look like:
def get_task_status(self, obj):
if obj.done_jobs >= obj.total_jobs:
return 'done'
else:
return 'not yet'
Edit: a cached_property won't help you if you have to call the method for each task instance (which seems the case). The problem is not so much the calculation but whether you have to hit the database for each separate task. You have to focus on getting all the information needed for the calculation in a single query. If that's impossible or too complicated, maybe think about changing the data structure (models) in order to facilitate that.
Using iterator() and counting Iterator might solve your problem.
job_iter = Job.objects.filter(task__task_id=obj.task_id).iterator()
count = len(list(job_iter))
return count
You can use select_related() and prefetch_related() for Retrieve everything at once if you will need them.
Note: if you use iterator() to run the query, prefetch_related() calls will be ignored
You might want to go through documentation for optimisation
I want to write all types of complex queries,
for example :
If someone wants information "Fruit" is "Guava" in "Pune District" then they will get data for guava in pune district.
htt//api/?fruit=Guava&?district=Pune
If someone wants information "Fruit" is "Guava" in "Girnare Taluka" then they will get data for guava in girnare taluka.
htt://api/?fruit=Guava&?taluka=Girnare
If someone wants information for "Fruit" is "Guava" and "Banana" then they will get all data only for this two fruits, like wise
htt://api/?fruit=Guava&?Banana
But, when I run server then I cant get correct output
If i use http://api/?fruit=Banana then I get all data for fruit which is banana, pomegranate, guava instead of get data for fruit is only banana. So I am confuse what happen here.
can you please check my code, where I made mistake?
*Here is my all files
models.py
class Wbcis(models.Model):
Fruit = models.CharField(max_length=50)
District = models.CharField(max_length=50)
Taluka = models.CharField(max_length=50)
Revenue_circle = models.CharField(max_length=50)
Sum_Insured = models.FloatField()
Area = models.FloatField()
Farmer = models.IntegerField()
def get_wbcis(fruit=None, district=None, talkua=None, revenue_circle=None, sum_insured=None, area=None,min_farmer=None, max_farmer=None, limit=100):
query = Wbcis.objects.all()
if fuit is not None:
query = query.filter(Fruit=fruit)
if district is not None:
query = query.filter(District=district)
if taluka is not None:
query = query.filter(Taluka=taluka)
if revenue_circle is not None:
query = query.filter(Revenue_circle= revenue_circle)
if sum_insured is not None:
query = query.filter(Sum_Insured=sum_Insured)
if area is not None:
query = query.filter(Area=area)
if min_farmer is not None:
query = query.filter(Farmer__gte=min_farmer)
if max_farmer is not None:
query = query.filter(Farmer__lt=max_farmer)
return query[:limit]
Views.py
class WbcisViewSet(ModelViewSet):
queryset = Wbcis.objects.all()
serializer_class = WbcisSerializer
def wbcis_view(request):
fruit = request.GET.get("fruit")
district = request.GET.get("district")
taluka = request.GET.get("taluka")
revenue_circle = request.GET.get("revenue_circle")
sum_insured = request.GET.get("sum_insured")
area = request.GET.get("area")
min_farmer = request.GET.get("min_farmer")
max_farmer = request.GET.get("max_farmer")
wbcis = get_wbcis(fruit, district, taluka,revenue_circle,sum_insured,area, min_farmer, max_farmer)
#convert them to JSON:
dicts = []
for wbci in wbcis:
dicts.append(model_to_dict(wbci))
return JsonResponse(dicts)
Serializers.py
from rest_framework.serializers import ModelSerializer
from WBCIS.models import Wbcis
class WbcisSerializer(ModelSerializer):
class Meta:
model = Wbcis
fields=('id','Fruit','District','Sum_Insured','Area','Farmer','Taluka','Revenue_circle',)
whats need changes in this code for call these queries to get exact output?
I don't think that you're actually calling that view, judging by your usage I presume you're calling the viewset itself and then ignoring the query params.
You should follow the drf docs for filtering but essentially, provide the get queryset method to your viewset and include the code you currently have in your view in that
class WbcisViewSet(ModelViewSet):
queryset = Wbcis.objects.all() # Shouldn't need this anymore
serializer_class = WbcisSerializer
def get_queryset(self):
fruit = self.request.query_params.get("fruit")
....
return get_wbscis(...)
I am working with django-rest-framework and I have an API that returns me the info with a filter like this:
http://example.com/api/products?category=clothing&in_stock=True
--this returns me 10 items
But it also returns the whole Model data if I dont put the filters, this is the default way.
http://example.com/api/products/
--this returns me more than 100 (all the Model Table)
How can I disable this default operation, I mean, how can I make a filter to be necesary to make this api works? or even better! how can I make the last URL to return an empty json response?
UPDATE
Here is some code:
serializers.py
class OEntradaDetalleSerializer(serializers.HyperlinkedModelSerializer):
item = serializers.RelatedField(source='producto.item')
descripcion = serializers.RelatedField(source='producto.descripcion')
unidad = serializers.RelatedField(source='producto.unidad')
class Meta:
model = OEntradaDetalle
fields = ('url','item','descripcion','unidad','cantidad_ordenada','cantidad_recibida','epc')
views.py
class OEntradaDetalleViewSet(BulkUpdateModelMixin,viewsets.ModelViewSet):
filter_backends = (filters.DjangoFilterBackend,)
filter_fields = ('cantidad_ordenada','cantidad_recibida','oentrada__codigo_proveedor','oentrada__folio')
queryset = OEntradaDetalle.objects.all()
serializer_class = OEntradaDetalleSerializer
urls.py
router2 = BulkUpdateRouter()
router2.register(r'oentradadetalle', OEntradaDetalleViewSet)
urlpatterns = patterns('',
url(r'^api/',include(router2.urls)),
)
URL EXAMPLE
http://localhost:8000/api/oentradadetalle/?oentrada__folio=E01
THIS RETURNS ONLY SOME FILTERED VALUES
http://localhost:8000/api/oentradadetalle/
THIS RETURNS EVERYTHING IN THE MODEL (I need to remove this or make it return some empty data)
I would highly recommend using pagination, to prevent anyone from being able to return all of the results (which likely takes a while).
If you can spare the extra queries being made, you can always check if the filtered and unfiltered querysets match, and just return an empty queryset if that is the case. This would be done in the filter_queryset method on your view.
def filter_queryset(self, queryset):
filtered_queryset = super(ViewSet, self).filter_queryset(queryset)
if queryset.count() === len(filtered_queryset):
return queryset.model.objects.none()
return filtered_queryset
This will make one additional query for the count of the original queryset, and if it is the same as the filtered queryset, an empty queryset will be returned. If the queryset was actually filtered, it will be returned and the results will be what you are expecting.
EDIT:
It turns out the real question is - how do I get select_related to follow the m2m relationships I have defined? Those are the ones that are taxing my system. Any ideas?
I have two classes for my django app. The first (Item class) describes an item along with some functions that return information about the item. The second class (Itemlist class) takes a list of these items and then does some processing on them to return different values. The problem I'm having is that returning a list of items from Itemlist is taking a ton of queries, and I'm not sure where they're coming from.
class Item(models.Model):
# for archiving purposes
archive_id = models.IntegerField()
users = models.ManyToManyField(User, through='User_item_rel',
related_name='users_set')
# for many to one relationship (tags)
tag = models.ForeignKey(Tag)
sub_tag = models.CharField(default='',max_length=40)
name = models.CharField(max_length=40)
purch_date = models.DateField(default=datetime.datetime.now())
date_edited = models.DateTimeField(auto_now_add=True)
price = models.DecimalField(max_digits=6, decimal_places=2)
buyer = models.ManyToManyField(User, through='Buyer_item_rel',
related_name='buyers_set')
comments = models.CharField(default='',max_length=400)
house_id = models.IntegerField()
class Meta:
ordering = ['-purch_date']
def shortDisplayBuyers(self):
if len(self.buyer_item_rel_set.all()) != 1:
return "multiple buyers"
else:
return self.buyer_item_rel_set.all()[0].buyer.name
def listBuyers(self):
return self.buyer_item_rel_set.all()
def listUsers(self):
return self.user_item_rel_set.all()
def tag_name(self):
return self.tag
def sub_tag_name(self):
return self.sub_tag
def __unicode__(self):
return self.name
and the second class:
class Item_list:
def __init__(self, list = None, house_id = None, user_id = None,
archive_id = None, houseMode = 0):
self.list = list
self.house_id = house_id
self.uid = int(user_id)
self.archive_id = archive_id
self.gen_balancing_transactions()
self.houseMode = houseMode
def ret_list(self):
return self.list
So after I construct Itemlist with a large list of items, Itemlist.ret_list() takes up to 800 queries for 25 items. What can I do to fix this?
Try using select_related
As per a question I asked here
Dan is right in telling you to use select_related.
select_related can be read about here.
What it does is return in the same query data for the main object in your queryset and the model or fields specified in the select_related clause.
So, instead of a query like:
select * from item
followed by several queries like this every time you access one of the item_list objects:
select * from item_list where item_id = <one of the items for the query above>
the ORM will generate a query like:
select item.*, item_list.*
from item a join item_list b
where item a.id = b.item_id
In other words: it will hit the database once for all the data.
You probably want to use prefetch_related
Works similarly to select_related, but can deal with relations selected_related cannot. The join happens in python, but I've found it to be more efficient for this kind of work than the large # of queries.
Related reading on the subject