Are Django's QuerySets lazy enough to cope with large data sets? - django

I think I've read somewhere that Django's ORM lazily loads objects. Let's say I want to update a large set of objects (say 500,000) in a batch-update operation. Would it be possible to simply iterate over a very large QuerySet, loading, updating and saving objects as I go?
Similarly, if I wanted to allow a paginated view of all of these thousands of objects, could I use the built-in pagination facility, or would I have to manually run a window over the data set with a query each time because of the size of the QuerySet of all objects?

If you evaluate a 500,000-result queryset in one go, all of those objects get loaded and cached in memory. Instead, you can use the iterator() method on your queryset, which returns results as they are requested, without the huge memory consumption.
Also, use update() and F() objects to do simple batch updates in a single query.
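For example, a minimal sketch (Article, published, view_count and title are hypothetical names):

from django.db.models import F
from myapp.models import Article  # hypothetical model

# Simple batch update pushed down to the database as a single UPDATE query
Article.objects.filter(published=True).update(view_count=F("view_count") + 1)

# When each row really needs per-object Python logic, stream the rows with
# iterator() instead of caching the whole result set in memory
for article in Article.objects.filter(published=True).iterator():
    article.title = article.title.strip()
    article.save(update_fields=["title"])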

If the batch update can be expressed as a single SQL query, then I don't think it makes a major difference whether you use raw SQL or the Django ORM. But if the update actually requires loading each object, processing its data and then saving it back, you can either use the ORM or write your own SQL and run an update query for each processed row; the overhead depends entirely on the code logic.
The built-in pagination facility runs a LIMIT/OFFSET query (if you are using it correctly), so I don't think there is major overhead in the pagination either.
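For example, a minimal sketch of the built-in paginator (Article is a hypothetical model):

from django.core.paginator import Paginator
from myapp.models import Article  # hypothetical model

# Paginator slices the queryset lazily; each page is fetched with one
# LIMIT/OFFSET query (plus one COUNT query for the page bookkeeping)
paginator = Paginator(Article.objects.order_by("id"), per_page=100)
page = paginator.page(3)  # roughly SELECT ... LIMIT 100 OFFSET 200
for article in page.object_list:
    print(article.pk)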

I benchmarked this for my current project, with a dataset of 2.5M records in one table.
I was reading information and counting records; for example, I needed to find the IDs of records whose "name" field had been updated more than once in a certain timeframe. The Django benchmark used the ORM to retrieve all records and then iterate through them, saving the data in a list for later processing. There was no debug output, except printing the result at the end.
On the other side, I used MySQLdb to execute the same queries (taken from Django) and build the same structure, using classes to store the data and saving the instances in a list for later processing. Again, no debug output except printing the result at the end.
I found that:
                     without Django   with Django
execution time       x                10x
memory consumption   y                25y
And I was only reading and counting, without performing update/insert queries.
Try to investigate this question for yourself; the benchmark isn't hard to write and execute.
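A minimal harness for such a benchmark could look like this (Record is a hypothetical model; the same timing code can wrap a raw-driver version for comparison):

import time
import tracemalloc

from myapp.models import Record  # hypothetical model

tracemalloc.start()
start = time.perf_counter()

# Swap .all() for .iterator(), raw SQL, etc. to compare approaches
ids = [r.pk for r in Record.objects.all()]

elapsed = time.perf_counter() - start
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"{len(ids)} rows in {elapsed:.2f}s, peak memory {peak / 1e6:.1f} MB")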

Related

Django: Improve page load time by executing complex queries automatically overnight and saving the result in a small lookup table

I am building a dashboard-like webapp in Django and my view takes forever to load due to a relatively large database (a single table with 60,000 rows... and growing), the complexity of the queries, and quite a lot of number crunching and data manipulation in Python. According to django-debug-toolbar the page needs 12 seconds to load.
To speed up the page loading time I thought about the following solution:
Build a view that is called automatically every night, completes all the complex queries, number crunching and data manipulation, and saves the results in a small lookup table in the database
Build a second view that returns the dashboard but retrieves the data from the small lookup table via a very simple query and hence loads much faster
Since the queries from the first view are executed every night, the data is always up-to-date in the lookup table
My questions: Does my idea make sense, and if yes, does anyone have any experience with such an approach? How can I write a view that gets called automatically every night?
I also read about caching, but with caching the first load of the page after a database update would still take a very long time, and the data in the database gets updated on a regular basis.
Yes, it is common practice.
We pre-calculate some things and use Celery to run those tasks around midnight every day. For some of it we have a dedicated new model, but usually we add database columns to the existing model that contain the pre-calculated information.
This approach basically has nothing to do with views - you use them normally, just access data differently.
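A minimal sketch of such a nightly task with Celery beat (DashboardStat, compute_dashboard_numbers and the module paths are hypothetical):

from celery import shared_task

from myapp.models import DashboardStat                  # hypothetical lookup-table model
from myapp.reporting import compute_dashboard_numbers   # hypothetical heavy computation


@shared_task
def refresh_dashboard_stats():
    """Run the heavy queries once and store the results in a small lookup table."""
    results = compute_dashboard_numbers()  # the expensive queries and number crunching
    DashboardStat.objects.all().delete()
    DashboardStat.objects.bulk_create(
        DashboardStat(name=name, value=value) for name, value in results.items()
    )

# Scheduled in the Celery app config, e.g.:
# from celery.schedules import crontab
# app.conf.beat_schedule = {
#     "refresh-dashboard": {
#         "task": "myapp.tasks.refresh_dashboard_stats",
#         "schedule": crontab(hour=0, minute=30),
#     },
# }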

Optimising API queries using JSONField()

I am utilising PostgreSQL JSONFields.
I have the following attribute (field) in my User model:
class User(AbstractUser):
    ...
    benefits = JSONField(default=dict)  # pass the callable, not dict(), so instances don't share one default
    ...
Currently I essentially serialize benefits for each User for the front end with DRF:
benefits = UserBenefit.objects.filter(user=self)
serializer = UserBenefitSerializer(benefits, many=True)
As the underlying benefits change little and slowly, I thought about "caching" the JSON in the database every time there is a change, to improve the performance of the UserBenefit.objects.filter(user=user) QuerySet: the lookup would instead become user.benefits, hopefully lightening the DB load across 100K+ users.
1st Q:
Should I do this?
2nd Q:
Is there an efficient way to write the corresponding serializer.data <class 'rest_framework.utils.serializer_helpers.ReturnList'> to the JSON field?
I am currently using:
data = serializers.serialize("json", UserBenefit.objects.filter(user=self))
For your first question:
If you have to query the database because of some changes or ... and you can't cache the hole request, then the idea of saving a JSON object can be a pretty good idea. This way you only retrieve the data and skip most parts of serializing and also terminate the need to query a pivot table to get the m2m data. But also note that this way, you are adding a whole bunch of extra data to your rows and unless you're going to need them most of the time, and you will get extra data that you don't really need which you can help it using values function on querysets but still it requires more coding. Basically, you're going to use more bandwidth for your first query and more storage to store the data instead of process power. Also, the pagination will be really hard to achieve on your benefits if you need it at some point.
Getting m2m relation data is usually pretty fast, depending on the amount of data in your database, but the ultimate way to get better performance is caching the requests and reducing database hits as much as possible.
And as you probably hear a lot, you should test and benchmark to see which option really works best for you, depending on your requirements and limitations. It's really hard to suggest an optimization method without knowing the whole scope and the current solution.
And for your second question:
I don't think I really get it. If you are storing a JSON object as a field on the User model, then why do you need data = serializers.serialize("json", UserBenefit.objects.filter(user=self))?
You don't need it since the serializer can just return the JSON field data.
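For example, a minimal sketch of the write path (update_user_benefits and the import paths are hypothetical):

from myapp.models import UserBenefit                 # hypothetical
from myapp.serializers import UserBenefitSerializer  # hypothetical


def update_user_benefits(user):
    """Re-serialize the user's benefits once and store the plain data in the JSONField."""
    benefits = UserBenefit.objects.filter(user=user)
    user.benefits = UserBenefitSerializer(benefits, many=True).data  # plain list of dicts
    user.save(update_fields=["benefits"])

# The read path then needs no extra query or serializer pass:
# return Response(request.user.benefits)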

Django: Actions that provide intermediate pages ... with 100k rows

I know how to write Actions that provide intermediate pages, since the docs are great:
https://docs.djangoproject.com/en/2.0/ref/contrib/admin/actions/#actions-that-provide-intermediate-pages
But, if my selection contains 100k rows, the pattern of the docs does not work since the URL gets too long.
How to write Django Admin Actions that provide intermediate pages and can handle +100k rows?
I solved it this way (sketch below):
1. Pickle the QuerySet.
2. Store the pickled QuerySet in the cache under a random ID.
3. Forward the random ID to the next page.
4. The next page uses the random ID to read the QuerySet from the cache.
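A minimal sketch of that pattern (the action, view and URL are hypothetical; note that caching a QuerySet pickles its evaluated results, so for very large selections you may prefer to cache only the pks):

import uuid

from django.core.cache import cache
from django.shortcuts import redirect


def my_admin_action(modeladmin, request, queryset):
    """Stash the selection in the cache and pass only a short random ID in the URL."""
    selection_id = uuid.uuid4().hex
    # Django cache backends pickle values automatically, and QuerySets are picklable
    cache.set(f"admin-selection-{selection_id}", queryset, timeout=30 * 60)
    return redirect(f"/admin/myapp/mymodel/intermediate/?selection={selection_id}")


def intermediate_view(request):
    queryset = cache.get(f"admin-selection-{request.GET['selection']}")
    ...  # render the confirmation page against `queryset`, or 404 if it expired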
When I needed something similar, I used some grouping variables like: all, active, accepted, denied. With this grouping I can run bulk actions on very large amounts of data without creating a Python list with thousands of pks.
Another point to pay attention to is that you need to push that work down to the DB, otherwise you will have an enormous bottleneck in the views/models.

Does searching by id depend on the number of columns in Postgres?

I have the following query: MyModel.objects.filter(id__in=ids).
I noticed that increasing the number of columns in the table decreases the speed of the above query.
Why is that?
Query time in Postgres mostly consists of planning time, execution time and data fetch.
Planning time and execution time should not be affected by the number of columns in the table, but the data fetch phase definitely is, as you are returning more data.
Also, an additional step is the mapping of the returned data into the Django QuerySet, which takes more time if more columns are involved.
To limit the scope of the data returned, where applicable, you can always use values(), defer(), or only().
In some complex data-modeling situations, your models might contain a lot of fields, some of which could contain a lot of data (for example, text fields), or require expensive processing to convert them to Python objects. If you are using the results of a queryset in some situation where you don’t know if you need those particular fields when you initially fetch the data, you can tell Django not to retrieve them from the database.
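For example, a minimal sketch (MyModel and its fields are hypothetical):

from myapp.models import MyModel  # hypothetical

ids = [1, 2, 3]

# Fetch only the columns you actually need:
MyModel.objects.filter(id__in=ids).only("id", "name")

# Or skip model instances entirely and get dicts / flat values:
MyModel.objects.filter(id__in=ids).values("id", "name")
MyModel.objects.filter(id__in=ids).values_list("id", flat=True)

# Or load everything except a known-heavy column lazily:
MyModel.objects.filter(id__in=ids).defer("big_text_field")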

Does Django bulk_create lock the entire table while inserting the rows?

I'm using Django 1.6 and postgres, would a bulk_create on a specific table lock the entire table? (in my case I'm bulk creating 10,000 rows and it takes ~10 seconds) I've tested this while creating objects every half second while the bulk create was happening and none of those individual creates hung but I'd just like to make sure. Thanks!
bulk_create inserts the provided list of objects into the database in an efficient manner (generally only one query, no matter how many objects there are), and it runs as a single atomic statement. On PostgreSQL an INSERT does not lock the entire table, so concurrent inserts and reads are not blocked, which matches what you observed.
usage: bulk_create(obj_list, batch_size=None)
The batch_size parameter controls how many objects are created in a single query. The default is to create all objects in one batch, except for SQLite, where the default is such that at most 999 variables per query are used.
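For example, a minimal sketch (MyModel is a hypothetical model with a name field):

from myapp.models import MyModel  # hypothetical

objs = [MyModel(name=f"row-{i}") for i in range(10000)]

# One INSERT per 1,000 rows instead of a single 10,000-row statement,
# which keeps individual statements smaller
MyModel.objects.bulk_create(objs, batch_size=1000)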
The following article can also give you an idea of how fast bulk_create is relative to other methods.