How to optimize a batch query to make it faster - django

When changing the status of a document, I would like to put all the products in the document into stock. This involves changing several fields. With a low number of product instances the query runs quickly, under 50 ms. However, some documents have products in quantities of, for example, 3000, and such a query takes more than a second. I know there will be cases with a much higher number of instances, and I would like the query to be faster.
A simple loop takes about 1 second:
class PZ_U_Serializer(serializers.ModelSerializer):
    class Meta:
        model = PZ
        fields = '__all__'

    def update(self, instance, validated_data):
        items = instance.pzitem_set.all()
        for item in items:
            item.quantity = 100
            item.quantity_available_for_sale = 100
            item.price = 100
            item.available_for_sale = True
            item.save()  # one UPDATE per row
        return super().update(instance, validated_data)
bulk_update takes about 400 ms per field; in this case 1600 ms, which is slower than the loop:
class PZ_U_Serializer(serializers.ModelSerializer):
    class Meta:
        model = PZ
        fields = '__all__'

    def update(self, instance, validated_data):
        items = instance.pzitem_set.all()
        bulk_update_list = []
        for item in items:
            item.quantity = 100
            item.quantity_available_for_sale = 100
            item.price = 100
            item.available_for_sale = True
            bulk_update_list.append(item)
        items.bulk_update(
            bulk_update_list,
            ['quantity', 'price', 'quantity_available_for_sale', 'available_for_sale'],
            batch_size=1000,
        )
        return super().update(instance, validated_data)
In the Postgres console I see, for 3000 objects, about 11,000 read hits.
I tried wrapping the loop in transaction.atomic, but it didn't change anything (if I did it right).

Did you try the queryset's update() method?
It issues a single SQL UPDATE, which should save you lots of per-row update requests and loop iterations:
def update(self, instance, validated_data):
    instance.pzitem_set.all().update(
        quantity=100,
        quantity_available_for_sale=100,
        price=100,
        available_for_sale=True,
    )
    return super().update(instance, validated_data)
More info in the official docs here.
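Note that update() only works when every row gets the same values, and it bypasses each instance's save() method and the pre_save/post_save signals. When each row needs different values, bulk_update inside a transaction remains the batched option; a minimal sketch, assuming the item model is named PZItem (guessed from the pzitem_set accessor):
from django.db import transaction

def update(self, instance, validated_data):
    items = list(instance.pzitem_set.all())
    for item in items:
        item.available_for_sale = True  # per-row values would go here
    with transaction.atomic():
        # One UPDATE statement per batch_size rows instead of one per row.
        # PZItem is an assumed model name.
        PZItem.objects.bulk_update(items, ['available_for_sale'], batch_size=1000)
    return super().update(instance, validated_data)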

Related

Django: How to load related model objects when instantiating a model

I have two related models (1-n). From the parent model, I am doing a lot of operations on the child model. For each operation I am calling:
ItensOrder.objects.filter(order=self.pk)
Inside the Order class, which is the parent, I am using the child objects several times, like this:
def total(self):
    itens = ItensOrder.objects.filter(order=self.pk)
    valor = sum(Counter(item.price * item.quantity for item in itens))
    return str(valor)

def details(self):
    itens = ItensOrder.objects.filter(order=self.pk)
    return format_html_join('\n', "{} ({} x {} = {})<br/>",
        ((item.item.name, str(item.quantity), item.price, str(item.price * item.quantity)) for item in itens))
What is the best way to load the related objects ONLY ONCE, so I can avoid hitting the database every time I need them?
I've been trying this on the parent model:
def __init__(self, *args, **kwargs):
    if self.pk is not None:
        self.itens = ItensOrder.objects.filter(order=self.pk)
    else:
        self.itens = None
But this is wrong...
Can anybody help, please?
You can access related child objects by using the related_name of a ForeignKey field:
order = Order.objects.get(id=1)
itens = order.itensorder_set.all()
This reverse relationship attribute will by default be the model name in lowercase followed by "_set"; you can change this by setting related_name on the foreign key.
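For illustration, a hypothetical foreign key (the on_delete and related_name choices here are mine, not from the question):
from django.db import models

class ItensOrder(models.Model):
    # With related_name set, the reverse accessor becomes order.itens
    # instead of the default order.itensorder_set.
    order = models.ForeignKey(Order, on_delete=models.CASCADE, related_name='itens')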
You can pre-populate this property with a cache of all the related objects by using prefetch_related
order = Order.objects.prefetch_related('itensorder_set').get(id=1)
order.itensorder_set.all() # This can be called multiple times but will not hit the database
In your case
class Order(models.Model):
    def total(self):
        valor = sum(Counter(item.price * item.quantity for item in self.itensorder_set.all()))
        return str(valor)

    def details(self):
        return format_html_join('\n', "{} ({} x {} = {})<br/>",
            ((item.item.name, str(item.quantity), item.price, str(item.price * item.quantity)) for item in self.itensorder_set.all()))
and in your model admin, override get_queryset:
def get_queryset(self, request):
    return super().get_queryset(request).prefetch_related('itensorder_set')
You can use select_related inside your functions:
def total(self):
    itens = ItensOrder.objects.select_related('order').filter(order=self)
    valor = sum(Counter(item.price * item.quantity for item in itens))
    return str(valor)

def details(self):
    itens = ItensOrder.objects.select_related('order').filter(order=self)
    return format_html_join('\n', "{} ({} x {} = {})<br/>",
        ((item.item.name, str(item.quantity), item.price, str(item.price * item.quantity)) for item in itens))

Optimising number of queries within a DRF ModelSerializer

Within Django Rest Framework's serialiser it is possible to add more data to the serialised object than in the original Model.
This is useful when calculating statistical information server-side and adding this extra information when responding to an API call.
As I understand, adding extra data is done using a SerializerMethodField, where each field is implemented by a get_... function.
However, if you have a number of these SerializerMethodFields, each one can be querying the Model/database separately, for what might be essentially the same data.
Is it possible to query the database once, store the list/result as a data member of the ModelSerializer object, and use the result of the queryset in many functions?
Here's a very simple example, just for illustration:
############## Model
class Employee(Model):
    SALARY_TYPE_CHOICES = (('HR', 'Hourly Rate'), ('YR', 'Annual Salary'))
    salary_type = CharField(max_length=2, choices=SALARY_TYPE_CHOICES, blank=False)
    salary = PositiveIntegerField(blank=True, null=True, default=0)
    company = ForeignKey(Company, related_name='employees')

class Company(Model):
    name = CharField(verbose_name='company name', max_length=100)

############## View
class CompanyView(RetrieveAPIView):
    queryset = Company.objects.all()
    lookup_field = 'id'
    serializer_class = CompanySerialiser

class CompanyListView(ListAPIView):
    queryset = Company.objects.all()
    serializer_class = CompanySerialiser

############## Serializer
class CompanySerialiser(ModelSerializer):
    number_employees = SerializerMethodField()
    total_salaries_estimate = SerializerMethodField()

    class Meta:
        model = Company
        fields = ['id', 'name',
                  'number_employees',
                  'total_salaries_estimate',
                  ]

    def get_number_employees(self, obj):
        return obj.employees.count()

    def get_total_salaries_estimate(self, obj):
        employee_list = obj.employees.all()
        salaries_estimate = 0
        HOURS_PER_YEAR = 8 * 200  # 8 hrs/day, 200 days/year
        for empl in employee_list:
            if empl.salary_type == 'YR':
                salaries_estimate += empl.salary
            elif empl.salary_type == 'HR':
                salaries_estimate += empl.salary * HOURS_PER_YEAR
        return salaries_estimate
The Serialiser can be optimised to:
use an object data member to store the result from the query set,
only retrieve the queryset once,
re-use the result of the queryset for all extra information provided in SerializerMethodFields.
Example:
class CompanySerialiser(ModelSerializer):
    def __init__(self, *args, **kwargs):
        super(CompanySerialiser, self).__init__(*args, **kwargs)
        self.employee_list = None

    number_employees = SerializerMethodField()
    total_salaries_estimate = SerializerMethodField()

    class Meta:
        model = Company
        fields = ['id', 'name',
                  'number_employees',
                  'total_salaries_estimate',
                  ]

    def _populate_employee_list(self, obj):
        if not self.employee_list:  # Query the database only once.
            self.employee_list = obj.employees.all()

    def get_number_employees(self, obj):
        self._populate_employee_list(obj)
        return len(self.employee_list)

    def get_total_salaries_estimate(self, obj):
        self._populate_employee_list(obj)
        salaries_estimate = 0
        HOURS_PER_YEAR = 8 * 200  # 8 hrs/day, 200 days/year
        for empl in self.employee_list:
            if empl.salary_type == 'YR':
                salaries_estimate += empl.salary
            elif empl.salary_type == 'HR':
                salaries_estimate += empl.salary * HOURS_PER_YEAR
        return salaries_estimate
This works for the single-retrieve CompanyView and, in fact, saves one query/context-switch/round-trip to the database; I've eliminated the "count" query.
However, it does not work for the list view CompanyListView, because the serialiser object is created once and reused for each Company. So only the first Company's list of employees is stored in the object's self.employee_list data member, and thus all other companies are erroneously given the data from the first company.
Is there a best-practice solution to this type of problem? Or am I just wrong to use the ListAPIView, and if so, is there an alternative?
I think this issue can be solved if you pass the queryset to the CompanySerialiser with the data already fetched.
You can make the following changes:
class CompanyListView(ListAPIView):
    queryset = Company.objects.all().prefetch_related('employees')  # 'employees' is the related_name on Employee.company
    serializer_class = CompanySerialiser
And instead of count(), use the len() function, because count() runs another query:
class CompanySerialiser(ModelSerializer):
    number_employees = SerializerMethodField()
    total_salaries_estimate = SerializerMethodField()

    class Meta:
        model = Company
        fields = ['id', 'name',
                  'number_employees',
                  'total_salaries_estimate',
                  ]

    def get_number_employees(self, obj):
        return len(obj.employees.all())

    def get_total_salaries_estimate(self, obj):
        employee_list = obj.employees.all()
        salaries_estimate = 0
        HOURS_PER_YEAR = 8 * 200  # 8 hrs/day, 200 days/year
        for empl in employee_list:
            if empl.salary_type == 'YR':
                salaries_estimate += empl.salary
            elif empl.salary_type == 'HR':
                salaries_estimate += empl.salary * HOURS_PER_YEAR
        return salaries_estimate
Since the data is prefetched, the serializer will not run any additional query per company. But make sure you are not applying any further filter to the related manager, because that executes another query, as the sketch below shows.
As mentioned by @Ritesh Agrawal, you simply need to prefetch the data. However, I advise doing the aggregations directly inside the database instead of in Python:
class CompanySerializer(ModelSerializer):
    number_employees = IntegerField()
    total_salaries_estimate = FloatField()

    class Meta:
        model = Company
        fields = ['id', 'name',
                  'number_employees',
                  'total_salaries_estimate', ...
                  ]

class CompanyListView(ListAPIView):
    queryset = Company.objects.annotate(
        number_employees=Count('employees'),
        total_salaries_estimate=Sum(
            Case(
                When(employees__salary_type=Value('HR'),
                     then=F('employees__salary') * Value(8 * 200)),
                default=F('employees__salary'),
                output_field=IntegerField(),  # optional a priori, because you only manipulate integers
            )
        )
    )
    serializer_class = CompanySerializer
Notes:
I haven't tested this code, but I'm using the same kind of syntax in my own projects. If you encounter errors (like 'cannot determine type of output' or similar), try wrapping F('employees__salary') * Value(8 * 200) inside an ExpressionWrapper(..., output_field=IntegerField()).
Using aggregation, you can apply filters on the queryset afterwards. However, if you're prefetching your related Employees, then you cannot filter the related objects any more (as mentioned in the previous answer). BUT, if you already know you'll need the list of employees with an hourly rate, you can do .prefetch_related(Prefetch('employees', queryset=Employee.objects.filter(salary_type='HR'), to_attr="hourly_rate_employees")), as sketched below.
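A sketch of that Prefetch/to_attr pattern (hypothetical usage, not from the answer itself):
from django.db.models import Prefetch

companies = Company.objects.prefetch_related(
    Prefetch('employees',
             queryset=Employee.objects.filter(salary_type='HR'),
             to_attr='hourly_rate_employees')
)
for company in companies:
    # The filtered rows arrive as a plain list on the to_attr attribute,
    # with no extra query per company.
    print(len(company.hourly_rate_employees))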
Relevant documentation:
Query optimization
Aggregation
Hope this will help you ;)

AND search with reverse relations

I'm working on a django project with the following models.
class User(models.Model):
    pass

class Item(models.Model):
    user = models.ForeignKey(User)
    item_id = models.IntegerField()
There are about 10 million items and 100 thousand users.
My goal is to override the default admin search, which takes forever, and return all the matching users that own all of the specified item ids within a reasonable timeframe.
These are a couple of the tests I use to better illustrate my criteria.
class TestSearch(TestCase):
    def search(self, searchterm):
        """A tuple is returned with the first element as the queryset"""
        return do_admin_search(User.objects.all())

    def test_return_matching_users(self):
        user = User.objects.create()
        Item.objects.create(item_id=12345, user=user)
        Item.objects.create(item_id=67890, user=user)
        result = self.search('12345 67890')
        assert_equal(1, result[0].count())
        assert_equal(user, result[0][0])

    def test_exclude_users_that_do_not_match_1(self):
        user = User.objects.create()
        Item.objects.create(item_id=12345, user=user)
        result = self.search('12345 67890')
        assert_false(result[0].exists())

    def test_exclude_users_that_do_not_match_2(self):
        user = User.objects.create()
        result = self.search('12345 67890')
        assert_false(result[0].exists())
The following snippet is my best attempt using annotate; it takes over 50 seconds.
def search_by_item_ids(queryset, item_ids):
    params = {}
    for i in item_ids:
        cond = Case(When(item__item_id=i, then=True), output_field=BooleanField())
        params['has_' + str(i)] = cond
    queryset = queryset.annotate(**params)
    params = {}
    for i in item_ids:
        params['has_' + str(i)] = True
    queryset = queryset.filter(**params)
    return queryset
Is there anything I can do to speed it up?
Here are some quick suggestions that should improve performance drastically.
Use prefetch_related on the initial queryset to get related items:
queryset = User.objects.filter(...).prefetch_related('item_set')
Filter with the __in operator instead of looping through a list of IDs:
def search_by_item_ids(queryset, item_ids):
    return queryset.filter(item__item_id__in=item_ids)
Don't annotate if it's already a condition of the query
Since you know that this queryset only consists of records with ids in the item_ids list, there's no need to restate that per object.
Putting it all together
You can speed up what you are doing drastically just by calling:
queryset = User.objects.filter(
    item__item_id__in=item_ids
).prefetch_related('item_set')
with only 2 db hits for the full query.
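Note that the __in filter alone matches users owning any of the given ids, not all of them. If the strict "owns all of them" semantics from the tests are required, a common pattern is to count distinct matches; a minimal sketch under that assumption, untested against this schema:
from django.db.models import Count

def search_by_item_ids(queryset, item_ids):
    # Keep users matching at least one id, then require that the number
    # of *distinct* matched ids equals the number of ids requested.
    return (queryset
            .filter(item__item_id__in=item_ids)
            .annotate(matched=Count('item__item_id', distinct=True))
            .filter(matched=len(item_ids)))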

Django instances updates and recall values from history

I can track my model instances' update history. What I do not know is how to recall instance values for a specific date/time.
The Room class is used to create new Rooms.
The RoomLog class is used to track changes: if I update a Room instance, a related RoomLog instance is created. The effect is a RoomLog instance for every Room instance update.
So I can change a specific instance a few times within, say, a minute or an hour. I can do the same to other instances at the same time, or, for example, not change any instance at all, or change some of the instances and not others.
Now, I would like to recall the whole table, with the latest values of all instances, at a specific hour/day or within a specific period, e.g.:
If I choose the time 1:00am, the table is rendered with all instances as of that time. Let's call it Table v0.
Now updates start, and some instances are changed between 1:00am and 2:00am. I would like to recall a table that displays all instances as of 2:00am, bearing in mind that some instances were changed and some were not. So some of them have a history of changes and some do not.
And the same for the other hours/days. I hope you get the sense of what I want to achieve.
How do I do this?
These are the two models I use to operate on Room and related RoomLog instances.
class Room(models.Model):
    room_name = models.CharField(max_length=10)
    room_value = models.PositiveSmallIntegerField(default=0)
    flat = models.ForeignKey(Flat)
    created_date = models.DateField(auto_now_add=True)
    created_time = models.TimeField(auto_now_add=True)

    def __init__(self, *args, **kwargs):
        super(Room, self).__init__(*args, **kwargs)
        self.value_original = self.room_value

    def save(self, **kwargs):
        with transaction.atomic():
            response = super(Room, self).save(**kwargs)
            if self.value_original != self.room_value:
                room_log = RoomLog()
                room_log.room = self
                room_log.room_value = self.value_original
                room_log.save()
        return response

    class Meta:
        ordering = ('room_name',)

    def __unicode__(self):
        return self.room_name

class RoomLog(models.Model):
    room = models.ForeignKey(Room)
    room_value = models.PositiveSmallIntegerField(default=0)
    update_date = models.DateField(auto_now_add=True)
    update_time = models.TimeField(auto_now_add=True)

    def __str__(self):
        return '%s | %s | %s' % (self.room, self.update_date, self.update_time)
EDIT
The answer below from djq about gt and lte pointed me towards a solution. This is how I solved it in my views:
class AllRoomsView(ListView):
    template_name = 'prostats/roomsdetail.html'
    queryset = Room.objects.all()

    def get_context_data(self, **kwargs):
        context = super(AllRoomsView, self).get_context_data(**kwargs)
        timenow = tz.now()
        timeperiod = timedelta(hours=1)
        deltastart = timenow - timeperiod
        context['rooms'] = Room.objects.all()
        context['rlog'] = RoomLog.objects.all()
        context['rfiltered'] = RoomLog.objects.filter(update_time__gt=deltastart)
        context['rfilteredcount'] = RoomLog.objects.filter(update_time__gt=deltastart).count()
        print(timenow)
        choosestart = '22:04:30.223113'
        choosend = '22:54:30.223113'
        context['roomfiltertest'] = RoomLog.objects.filter(update_time__gt=choosestart, update_time__lte=choosend)
        return context
You should be able to filter your model by time ranges. This might be easier with a DateTimeField. Here's an example of how to filter by a time range (assuming a DateTimeField, and a slightly different model structure with the fields booking_started and booking_ended):
from datetime import timedelta
from django.utils import timezone as tz

start = tz.now()
time_range = timedelta(hours=2)
end = start - time_range
Room.objects.filter(booking_started__lte=start, booking_ended__gt=end)
lte and gt are shorthand for "less than or equal to" and "greater than".
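To apply the DateTimeField suggestion to the question's RoomLog, a minimal sketch (the single updated_at field is an assumed replacement for the separate update_date/update_time fields):
class RoomLog(models.Model):
    room = models.ForeignKey(Room)
    room_value = models.PositiveSmallIntegerField(default=0)
    # One DateTimeField instead of separate DateField/TimeField,
    # so range filters work across midnight and across days.
    updated_at = models.DateTimeField(auto_now_add=True)

# Recalling the changes between two moments then becomes:
RoomLog.objects.filter(updated_at__gte=start, updated_at__lt=end)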

Reducing queries for manytomany models in django

EDIT:
It turns out the real question is: how do I get select_related to follow the m2m relationships I have defined? Those are the ones that are taxing my system. Any ideas?
I have two classes in my django app. The first (the Item class) describes an item, along with some functions that return information about the item. The second (the Item_list class) takes a list of these items and does some processing on them to return different values. The problem I'm having is that returning a list of items from Item_list takes a ton of queries, and I'm not sure where they're coming from.
class Item(models.Model):
    # for archiving purposes
    archive_id = models.IntegerField()
    users = models.ManyToManyField(User, through='User_item_rel',
                                   related_name='users_set')
    # for many to one relationship (tags)
    tag = models.ForeignKey(Tag)
    sub_tag = models.CharField(default='', max_length=40)
    name = models.CharField(max_length=40)
    purch_date = models.DateField(default=datetime.datetime.now())
    date_edited = models.DateTimeField(auto_now_add=True)
    price = models.DecimalField(max_digits=6, decimal_places=2)
    buyer = models.ManyToManyField(User, through='Buyer_item_rel',
                                   related_name='buyers_set')
    comments = models.CharField(default='', max_length=400)
    house_id = models.IntegerField()

    class Meta:
        ordering = ['-purch_date']

    def shortDisplayBuyers(self):
        if len(self.buyer_item_rel_set.all()) != 1:
            return "multiple buyers"
        else:
            return self.buyer_item_rel_set.all()[0].buyer.name

    def listBuyers(self):
        return self.buyer_item_rel_set.all()

    def listUsers(self):
        return self.user_item_rel_set.all()

    def tag_name(self):
        return self.tag

    def sub_tag_name(self):
        return self.sub_tag

    def __unicode__(self):
        return self.name
and the second class:
class Item_list:
    def __init__(self, list=None, house_id=None, user_id=None,
                 archive_id=None, houseMode=0):
        self.list = list
        self.house_id = house_id
        self.uid = int(user_id)
        self.archive_id = archive_id
        self.gen_balancing_transactions()
        self.houseMode = houseMode

    def ret_list(self):
        return self.list
So after I construct Item_list with a large list of items, Item_list.ret_list() takes up to 800 queries for 25 items. What can I do to fix this?
Try using select_related
As per a question I asked here
Dan is right to tell you to use select_related.
select_related can be read about here.
What it does is return, in the same query, the data for the main object in your queryset and for the model or fields specified in the select_related clause.
So, instead of a query like:
select * from item
followed by several queries like this every time you access one of the item_list objects:
select * from item_list where item_id = <one of the items for the query above>
the ORM will generate a query like:
select a.*, b.*
from item a
join item_list b on a.id = b.item_id
In other words: it will hit the database once for all the data.
You probably want to use prefetch_related
It works similarly to select_related, but it can deal with relations select_related cannot (many-to-many and reverse foreign keys). The "join" happens in Python, but I've found it to be more efficient for this kind of work than the large number of queries; see the sketch below.
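For the Item model above, that could look roughly like this (the relation names are taken from the question's through models, but unverified):
items = (Item.objects
         .select_related('tag')  # FK: fetched by a join in the same query
         .prefetch_related('buyer_item_rel_set__buyer',  # reverse FK to the through model, then its buyer FK
                           'user_item_rel_set'))         # reverse FK for the users relation
for item in items:
    item.shortDisplayBuyers()  # served from the prefetch cache, no extra queries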
Related reading on the subject