I'm using Django 1.11 with MySQL. Upgrading to Django 2 isn't feasible in the short term, so it isn't an acceptable solution to my immediate problem, but answers referring to Django 2 may help others, so feel free to post them.
I need to perform a data migration on all rows in a table. There are fewer than 40000 rows, but they are quite big: two of the columns are ~15KB of JSON which get parsed when the model is loaded. (These are the columns I need in the data migration, so I cannot defer them.)
So as not to load all the objects into memory simultaneously, I thought I'd use queryset.iterator(), which only parses rows 100 at a time. This works fine if all I do is read the results, but if I perform another query (e.g. to save one of the objects) then, once I reach the end of the current chunk of 100 results, the next chunk of 100 results is not fetched and the iterator finishes.
It's as if the result set that fetchmany fetches the rows from has been lost.
To illustrate the scenario using ./manage.py shell:
(Assume there exist 40000 MyModel rows with sequential ids.)
iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
    print(obj.id)
The above prints the ids 1 to 40000 as expected.
iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
    print(obj.id)
    obj.save()
The above only prints the ids 1 to 100.
iterator = app.models.MyModel.objects.iterator()
for obj in iterator:
    print(obj.id)
    if obj.id == 101:
        obj.save()
The above only prints the ids 1 to 200.
Replacing obj.save with anything else that makes a query to the DB (e.g. app.models.OtherModel.objects.first()) has the same result.
Is it simply not possible to make another query while using queryset iterator? Is there another way to achieve the same thing?
Thanks
As suggested by @dirkgroten, Paginator is an alternative to iterator that's potentially a better solution in terms of memory usage, as it uses slicing on the queryset, which adds OFFSET and LIMIT clauses to retrieve only part of the full result set.
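For illustration, a minimal Paginator sketch (my own, assuming the MyModel from the question and a page size of 100):

from django.core.paginator import Paginator

queryset = app.models.MyModel.objects.order_by('id')  # Paginator needs a stable ordering
paginator = Paginator(queryset, 100)  # each page issues its own LIMIT/OFFSET query
for page_number in paginator.page_range:
    for obj in paginator.page(page_number).object_list:
        obj.save()  # each page is a fresh query, so saving doesn't break iteration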
However, high OFFSET values incur a performance penalty on MySQL: https://www.eversql.com/faster-pagination-in-mysql-why-order-by-with-limit-and-offset-is-slow/
Therefore seeking on an indexed column may be a better option:
chunk_size = 100
seek_id = 0
next_seek_id = -1
while seek_id != next_seek_id:
    seek_id = next_seek_id
    # order_by('id') makes the seek deterministic; the slice keeps each chunk small
    for obj in app.models.MyModel.objects.filter(id__gt=seek_id).order_by('id')[:chunk_size]:
        next_seek_id = obj.id
        # do your thing
Additionally, if your data is such that performing the query isn't expensive but instantiating model instances is, iterator has the potential advantage of doing a single database query. Hopefully other answers will be able to shed light on the use of queryset.iterator with other queries.
Related
Can anyone help? I want to return an ordered list based on a for loop in Django, using a field in the model that contains both an integer and a string in the format MM/1234. The loop should return the values with the least integer (1234) in ascending order in the HTML template.
Ideally you want to change the model to have two fields, one integer and one string, so you can code a queryset with ordering based on the integer one; a minimal sketch of that follows below. You can then define a property of the model to return the self.MM + "/" + str(self.nn) composite value if you often need it. But if it's somebody else's database schema, this may not be an option.
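A sketch of that ideal approach (model and field names are hypothetical):

from django.db import models

class Foo(models.Model):
    mm = models.CharField(max_length=10)  # the "MM" string part
    nn = models.IntegerField()            # the numeric part, sortable in SQL

    @property
    def composite(self):
        # Recreate the original "MM/1234" representation when needed.
        return "{}/{}".format(self.mm, self.nn)

# Ordering now happens in the database:
# ordered = Foo.objects.order_by('nn')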
If you can't change the schema, you'll have to convert your queryset into a list (which reads all the data rows at once) and then sort the list in Python rather than in the database. You can run out of memory or bring your server to its knees if the list contains millions of objects. (count = qs.count() is a DB operation that won't load the rows, so you can check the size first.)
qs = Foo.objects.filter(your_selection_criteria)
# You might want to count this before the next step, and chicken out if too many.
# Note: this simple key function will crash if there's ever no "/" in that_field,
# and int() is needed so "99" sorts before "1234" (a string sort would be wrong).
all_obj = sorted(qs, key=lambda obj: int(obj.that_field.split('/')[1]))
Which one would be better for performance?
We take a slice of products, which makes it impossible to bulk-update them.
products = Product.objects.filter(featured=True).order_by("-modified_on")[3:]

for product in products:
    product.featured = False
    product.save()
or (invalid):

for product in products.iterator():
    product.update(featured=False)
I have also tried the QuerySet in lookup, as follows.
Product.objects.filter(pk__in=products).update(featured=False)
This line works fine on SQLite, but it raises the following exception on MySQL, so I couldn't use it:
DatabaseError: (1235, "This version of MySQL doesn't yet support
'LIMIT & IN/ALL/ANY/SOME subquery'")
Edit: The iterator() method also causes the query to be re-evaluated, so it is bad for performance.
As @Chris Pratt pointed out in comments, the second example is invalid because the objects don't have update methods. Your first example will require a number of queries equal to results + 1, since it has to update each object individually. That might really be costly if you have 1000 products. Ideally you want to reduce this to a more fixed expense if possible.
This is a similar situation to another question:
Django: Cannot update a query once a slice has been taken
That being said, you would have to do it in at least 2 queries, but you have to be a bit sneaky about how you construct the LIMIT...
Using Q objects for complex queries:
# get the IDs we want to exclude
products = Product.objects.filter(featured=True).order_by("-modified_on")[:3]
# flatten them into just a list of ids
ids = products.values_list('id', flat=True)
# Now use the Q object to construct a complex query
from django.db.models import Q
# This builds a list of "AND id NOT EQUAL TO i"
limits = [~Q(id=i) for i in ids]
Product.objects.filter(*limits, featured=True).update(featured=False)
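An alternative sketch (my own, not from the answer): evaluate the LIMITed query into a plain list first, so MySQL never sees a LIMIT inside a subquery, then use exclude():

# Force evaluation so exclude() receives literal ids, sidestepping MySQL's
# "LIMIT & IN/ALL/ANY/SOME subquery" restriction.
keep_ids = list(
    Product.objects.filter(featured=True)
                   .order_by("-modified_on")
                   .values_list('id', flat=True)[:3]
)
Product.objects.filter(featured=True).exclude(id__in=keep_ids).update(featured=False)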
In some cases it's acceptable to cache the QuerySet in a list:
products = list(products)
Product.objects.filter(pk__in=products).update(featured=False)
A small optimization with values_list:

products_id = list(products.values_list('id', flat=True))
Product.objects.filter(pk__in=products_id).update(featured=False)
When I do something like
I. objects = Model.objects.all()
and then
II. objects.filter(field_1=some_condition)
I hit the DB every time on step II with various conditions. Is there any way to get all the data in the first step and then just work with the result?
You actually don't hit the db until you evaluate the qs; queries are lazy.
Read more in the Django docs on when QuerySets are evaluated: https://docs.djangoproject.com/en/stable/ref/models/querysets/#when-querysets-are-evaluated
edit:
After re-reading your question it becomes apparent you were asking how to prevent db hits when filtering for different conditions.
qs = SomeModel.objects.all()
qs1 = qs.filter(some_field='some_value')
qs2 = qs.filter(some_field='some_other_value')
Usually you would want the database to do the filtering for you.
You could force an evaluation of the qs by converting it to a list. This would prevent further db hits; however, it would likely be worse than having your db return the results.
qs_l = list(qs)
qs1_l = [element for element in qs_l if element.some_field == 'some_value']
qs2_l = [element for element in qs_l if element.some_field == 'some_other_value']
Of course you will hit the db every time. filter() is transformed into an SQL statement that is executed by your db; you can't filter without hitting it. So you can retrieve all the objects you need with values() or list(Model.objects.all()) and, as zeekay suggested, use Python expressions (like list comprehensions) for additional filtering.
Why don't you just do objs = Model.objects.filter(field=condition)? That said, once the SQL query is executed you can use Python expressions to do further filtering/processing without incurring additional database hits.
Is it possible to check how many rows were deleted by a query?
queryset = MyModel.objects.filter(foo=bar)
queryset.delete()
deleted = ...
Or should I use transactions for that?
@transaction.commit_on_success
def delete_some_rows():
    queryset = MyModel.objects.filter(foo=bar)
    deleted = queryset.count()
    queryset.delete()
PHP + MySQL example:
mysql_query('DELETE FROM mytable WHERE id < 10');
printf("Records deleted: %d\n", mysql_affected_rows());
There are many situations where you want to know how many rows were deleted, for example if you do something based on how many rows were deleted. Checking it by performing a COUNT creates extra database load and is not atomic.
The queryset.delete() method immediately deletes the objects and returns the number of objects deleted, along with a dictionary with the number of deletions per object type.
Check the docs for more details: https://docs.djangoproject.com/en/stable/topics/db/queries/#deleting-objects
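For illustration, a minimal sketch (assuming the MyModel and bar from the question; the return value exists in Django 1.9+):

deleted_count, deleted_per_type = MyModel.objects.filter(foo=bar).delete()
print(deleted_count)     # total number of rows deleted, including any cascades
print(deleted_per_type)  # e.g. {'app.MyModel': 3}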
You can view the actual number of rows affected with SELECT ROW_COUNT().
First of all, qs.count() and cursor.rowcount are not the same thing!
In MySQL with InnoDB under REPEATABLE READ (the default isolation level), READ queries and WRITE queries see DIFFERENT snapshots of the data!
READ queries read from an older snapshot, while WRITE queries see the latest committed data, as they would under READ COMMITTED.
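If you need the driver-level count, a hedged sketch using a raw query (the table name app_mymodel is an assumption):

from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("DELETE FROM app_mymodel WHERE foo = %s", [bar])
    print(cursor.rowcount)  # rows actually deleted, as reported by the driver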
How would one go about retrieving the last 1,000 values from a database via Objects.filter? What I'm currently doing brings me the first 1,000 values entered into the database (i.e. with 10,000 rows it brings me rows 1-1,000 instead of 9,000-10,000).
Current Code:
limit = 1000
Shop.objects.filter(ID=someArray[ID])[:limit]
Cheers
Solution:
queryset = Shop.objects.filter(id=someArray[id])
limit = 1000
count = queryset.count()
# max() guards against a negative slice start when count < limit
endoflist = queryset.order_by('timestamp')[max(count - limit, 0):]
endoflist is the queryset you want.
Efficiency:
The following is from the Django docs about the reverse() queryset method:

To retrieve the last five items in a queryset, you could do this:

my_queryset.reverse()[:5]

Note that this is not quite the same as slicing from the end of a sequence in Python. The above example will return the last item first, then the penultimate item and so on. If we had a Python sequence and looked at seq[-5:], we would see the fifth-last item first. Django doesn't support that mode of access (slicing from the end), because it's not possible to do it efficiently in SQL.
So I'm not sure if my answer is merely inefficient or extremely inefficient. I moved the order_by to the final query, but I'm not sure if this makes a difference.

reversed(Shop.objects.filter(id=someArray[id]).order_by('timestamp').reverse()[:limit])