Django: select_related() and memory usage

I am working on an API and I have a question. I was looking into the use of select_related() in order to save myself some database queries, and it does indeed help reduce the number of database queries performed, at the expense of bigger and more complex queries.
My question is: does using select_related() cause heavier memory usage? Running some experiments I noticed that this is indeed the case, but I'm wondering why. Regardless of whether I use select_related(), the response contains exactly the same data, so why does using select_related() cause more memory to be used?
Is it because of caching? Maybe separate data objects are used to cache the same model instances? I don't know what else to think.

It's a tradeoff. It takes time to send a query to the database, for the database to prepare the results, and then to send those results back. select_related works on the principle that the most expensive part of this process is the request and response cycle, not the actual query, so it lets you combine what would otherwise be distinct queries into just one, meaning there's only one request and response instead of several.
However, if your database server is under-powered (not enough RAM, processing power, etc.), the larger query could actually end up taking longer than the request and response cycle. If that's the case, you probably need to upgrade the server, though, rather than not use select_related.
The rule of thumb is that if you need related data, you use select_related. If it's not actually faster, then that's a sign that you need to optimize your database.
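For concreteness, here is a minimal sketch of the trade-off; the models are hypothetical and not taken from the question. The joined query returns the related columns with every row and Django builds a fully populated related instance for each parent object, which is also the likely source of the extra memory you measured:

    from django.db import models

    # Hypothetical models used purely for illustration.
    class Author(models.Model):
        name = models.CharField(max_length=100)

    class Book(models.Model):
        title = models.CharField(max_length=200)
        author = models.ForeignKey(Author, on_delete=models.CASCADE)

    # Without select_related(): one query for the books plus one query per
    # book.author access (the classic N+1 pattern).
    for book in Book.objects.all():
        print(book.author.name)        # extra query for each book

    # With select_related(): a single JOINed query. Every Book comes back with
    # its Author columns attached and an Author instance already built, so
    # fewer round trips but more objects held in memory at once.
    for book in Book.objects.select_related("author"):
        print(book.author.name)        # no extra queries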
UPDATE (adding more explanation)
Querying a database actually involves multiple steps:
1. The application generates the query (negligible)
2. The query is sent to the database server (milliseconds to seconds)
3. The database processes the query (milliseconds to seconds)
4. The query results are sent back to the application (milliseconds to seconds)
In a well-tuned environment (sufficient server resources, fast connections) the entire process finishes in mere milliseconds. However, steps 2 and 4 still usually take more time overall than step 3. This is why it makes more sense to send fewer, more complex queries than many simpler ones: the bottleneck is usually the transport layer, not the processing.
However, a poorly optimized database on an under-powered machine, with large and complex tables, could take a very long time to run the query, making the query itself the bottleneck. That would negate the time saved by sending one complex query instead of multiple simpler ones: the database would have responded more quickly to the simpler queries, and the whole process would have taken less time overall.
Nevertheless, if that is the case, the proper response is to fix the database side: optimize the database and its configuration, add more server resources, etc., rather than reverting to multiple simple queries.

Related

Django: Improve page load time by executing complex queries automatically over night and saving result in small lookup table

I am building a dashboard-like web app in Django and my view takes forever to load due to a relatively large database (a single table with 60,000 rows... and growing), the complexity of the queries, and quite a lot of number crunching and data manipulation in Python. According to django-debug-toolbar the page needs 12 seconds to load.
To speed up the page loading time I thought about the following solution:
Build a view that is called automatically every night, completes all the complex queries, number crunching and data manipulation, and saves the results in a small lookup table in the database
Build a second view that returns the dashboard but retrieves the data from the small lookup table via a very simple query, and hence loads much faster
Since the queries from the first view are executed every night, the data in the lookup table is always up to date
My questions: Does my idea make sense, and if so, does anyone have experience with such an approach? How can I write a view that gets called automatically every night?
I also read about caching, but with caching the first load of the page after a database update would still take a very long time, and the data in the database gets updated regularly.
Yes, it is common practice.
We pre-calculate some things and use Celery to run those tasks around midnight daily. For some things we have a dedicated model, but usually we just add database columns to the existing model that hold the pre-calculated information.
This approach basically has nothing to do with views - you use them normally, just access data differently.
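As a rough sketch of that setup, assuming Celery with celery beat is configured for the project (the model and task names below are invented for illustration):

    # tasks.py -- assumes a Celery app is already wired up for the Django project.
    from celery import shared_task
    from celery.schedules import crontab
    from django.db.models import Count, Sum

    from .models import Event, DashboardSummary   # hypothetical models


    @shared_task
    def rebuild_dashboard_summary():
        """Run the heavy aggregation once and store the result in a small lookup table."""
        rows = (
            Event.objects
            .values("category")
            .annotate(total=Count("id"), amount=Sum("value"))
        )
        DashboardSummary.objects.all().delete()
        DashboardSummary.objects.bulk_create(
            DashboardSummary(category=r["category"], total=r["total"], amount=r["amount"])
            for r in rows
        )


    # settings.py -- schedule the task for 02:00 every night via celery beat.
    CELERY_BEAT_SCHEDULE = {
        "rebuild-dashboard-summary-nightly": {
            "task": "myapp.tasks.rebuild_dashboard_summary",
            "schedule": crontab(hour=2, minute=0),
        },
    }

The dashboard view then reads from the small summary table with a single cheap query.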

PostgreSQL on RDS suddenly eating all the storage available on the disc

One of my queries is causing my Postgres to freeze, and it also results in some weird behaviour such as increased read/write IOPS and the DB eating all the space on the device. Here are some graphs that demonstrate this.
Before deleting the query
After deleting the query
Any idea why is this happening?
In my experience this happens when:
There are lots of dead tuples in the tables of the DB (see the diagnostic sketch below).
The query execution is using disk space (temporary files are generated during query execution because work_mem is low).
You have lots of orphan files (less common).
If you want to read the official docs:
https://aws.amazon.com/premiumsupport/knowledge-center/diskfull-error-rds-postgresql/
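For the dead-tuple case specifically, a quick way to check is to query pg_stat_user_tables; here is a small sketch using psycopg2, with placeholder connection details:

    import psycopg2

    # Placeholder connection parameters -- substitute your RDS endpoint and credentials.
    conn = psycopg2.connect(host="your-rds-endpoint", dbname="yourdb",
                            user="youruser", password="secret")

    with conn, conn.cursor() as cur:
        # Tables with the most dead tuples; a large n_dead_tup relative to
        # n_live_tup suggests autovacuum is not keeping up.
        cur.execute("""
            SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
            FROM pg_stat_user_tables
            ORDER BY n_dead_tup DESC
            LIMIT 20
        """)
        for row in cur.fetchall():
            print(row)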
There could be many causes:
Firstly, it depends on the size of your database. Could you provide some additional information?
What does your query do?
What is the size of your connection pool?
Do you use Streaming replication?
It seems to me that this could be an indexing issue on your table.
Try checking the indexes on the table(s) affected by the query. The problem could also be a very large database that requires a lot of RAM to process.
Don't forget to check the joins included in the query. Badly formed joins can lead to unwanted cross joins.

Generating efficient fast reports on amounts of data on AWS

I'm really confused about how or what AWS services to use for my case.
I have a web application which stores user interaction events. Currently these events are stored in an RDS table. Each event contains about 6 fields such as timestamp, event type, userID, pageID, etc. I currently have millions of event records in each account schema. When I try to generate reports out of this raw data, the reports are extremely slow, since I run complex aggregation queries over long time periods. A report covering 30 days might take 4 minutes to generate on RDS.
Is there any way to make these reports run MUCH faster? I was thinking about storing the events in DynamoDB, but I cannot run such complex queries on the data there, or do any attribute-based sorting.
Is there a good combination of services to achieve this? Maybe using Redshift, EMR, Kinesis?
I think Redshift is your solution.
I'm working with a dataset that generates about 2,000,000 new rows each day, and I run really complex operations on it. You could take advantage of Redshift sort keys and order your data by date.
Also, if you run complex aggregate functions, I really recommend denormalizing all the information and inserting it into a single table with all the data. Redshift uses very efficient, automatic column compression, so you won't have problems with the size of the dataset.
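As a sketch of that layout (table and column names are invented), the denormalized events table on Redshift might be declared with the date as the sort key; Redshift is wire-compatible with PostgreSQL, so the DDL can be run with psycopg2 or any Postgres client:

    # Hypothetical DDL for the denormalized events table on Redshift.
    CREATE_EVENTS_SQL = """
        CREATE TABLE user_events_denorm (
            event_time  TIMESTAMP,
            event_type  VARCHAR(64),
            user_id     BIGINT,
            page_id     BIGINT,
            account_id  BIGINT
        )
        DISTKEY (account_id)      -- co-locate each account's rows on one slice
        SORTKEY (event_time)      -- date-restricted scans only touch the relevant blocks
    """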
My usual solution to problems like this is to have a set of routines that roll up the data and store the aggregated results, at various levels, in additional RDS tables. The transactional information you are storing isn't likely to change once logged, so if you find yourself running daily/weekly/monthly rollups over various slices of the data, run those queries once and store the results. They don't have to be at the final level you will report on, just at a level that significantly reduces the number of rows feeding the eventual rollups. For example, have a daily table that summarizes event type, user ID and page ID with one row per day instead of one row per event (or one row per hour instead of per day). You'll need to figure out the most logical rollups to make, but you get the idea: the goal is to pre-summarize at levels that reduce the amount of raw data while still giving you plenty of flexibility to serve your reports.
You can always go back to the granular/transactional data as long as you keep it around, but there is not much to be gained by constantly calculating the same results every time you want to use the data.
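As a rough sketch of such a rollup routine (table and column names are invented), a nightly job could pre-aggregate the raw events into a one-row-per-day-per-dimension table on RDS:

    import psycopg2

    # Placeholder connection details for the RDS instance.
    conn = psycopg2.connect(host="your-rds-endpoint", dbname="events",
                            user="reporter", password="secret")

    ROLLUP_SQL = """
        INSERT INTO daily_event_rollup (day, event_type, user_id, page_id, event_count)
        SELECT date_trunc('day', created_at),
               event_type, user_id, page_id,
               count(*)
        FROM user_events
        WHERE created_at >= %s AND created_at < %s
        GROUP BY 1, 2, 3, 4
    """

    def rollup_day(day_start, day_end):
        """Summarize one day of raw events into the small rollup table."""
        with conn, conn.cursor() as cur:
            cur.execute(ROLLUP_SQL, (day_start, day_end))

Reports then aggregate over daily_event_rollup instead of the raw events, touching far fewer rows.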

Design pattern for caching dynamic user content (in django)

On my website I'm going to award points for certain activities, similarly to Stack Overflow. I would like to calculate the value based on many factors, so each computation for each user might take, for instance, 10 SQL queries.
I was thinking about caching it:
in memcache,
in the user's row in the database (so that wherever I fetch the user from the database I can easily show the points)
Storing it in the database seems easy, but on the other hand it's redundant information, so I decided to ask, since maybe there is an easier and prettier solution I have missed.
I'd highly recommend this app for storing the calculated values in the model: https://github.com/initcrash/django-denorm
Memcache is faster than the db... but if you already have to retrieve the record from the db anyway, having the calculated values cached in the rows you're retrieving (as a 'denormalised' field) is even faster, plus it's persistent.
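django-denorm handles the bookkeeping for you, but the underlying idea is simple enough to sketch by hand; the field, relations and weights below are invented for illustration:

    from django.conf import settings
    from django.db import models


    class Profile(models.Model):
        user = models.OneToOneField(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
        # Denormalized: derived from other tables but stored here, so any view
        # that already loads the profile gets the points for free.
        points = models.IntegerField(default=0)


    def recalculate_points(profile):
        """Run the (otherwise multi-query) computation once and persist the result."""
        # Hypothetical related managers and weights.
        points = 10 * profile.user.answers.filter(accepted=True).count()
        points += 2 * profile.user.posts.count()
        Profile.objects.filter(pk=profile.pk).update(points=points)

Call the recalculation from whatever events change the score (signals, or a periodic task), and the rest of the code just reads profile.points.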

Are Django's QuerySets lazy enough to cope with large data sets?

I think I've read somewhere that Django's ORM lazily loads objects. Let's say I want to update a large set of objects (say 500,000) in a batch-update operation. Would it be possible to simply iterate over a very large QuerySet, loading, updating and saving objects as I go?
Similarly, if I wanted to allow a paginated view of all of these thousands of objects, could I use the built-in pagination facility, or would I have to manually run a window over the data set with a query each time, because of the size of the QuerySet of all objects?
If you evaluate a 500,000-result queryset, which is big, it all gets cached in memory. Instead, you can use the iterator() method on your queryset, which returns results as they are requested, without the huge memory consumption.
Also, use update() and F() objects to do simple batch updates in a single query.
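A minimal sketch of both suggestions, using a hypothetical Article model:

    from django.db.models import F

    from myapp.models import Article   # hypothetical model

    # Simple batch update in one SQL UPDATE statement -- no objects are loaded
    # into Python at all.
    Article.objects.filter(published=True).update(view_count=F("view_count") + 1)

    # When each object really needs Python-side processing, iterator() streams
    # rows in chunks instead of caching all 500,000 results in memory.
    for article in Article.objects.iterator(chunk_size=2000):
        article.title = article.title.strip()
        article.save(update_fields=["title"])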
If the batch update is possible with a single SQL query, then I think using raw SQL or the Django ORM will not make a major difference. But if the update actually requires loading each object, processing the data and then updating it, you can use the ORM or write your own SQL query and run update queries on each processed record; the overhead depends entirely on the code logic.
The built-in pagination facility runs a LIMIT/OFFSET query (if you are using it correctly), so I don't think there is major overhead in the pagination either.
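For reference, a small sketch of the built-in paginator over the same hypothetical model; each page is one LIMIT/OFFSET query, plus a single COUNT(*) for the page count:

    from django.core.paginator import Paginator

    from myapp.models import Article   # hypothetical model

    # An ordered queryset matters: LIMIT/OFFSET without ORDER BY can return
    # rows in an unstable order between pages.
    paginator = Paginator(Article.objects.order_by("pk"), per_page=100)

    page = paginator.page(3)            # SELECT ... ORDER BY pk LIMIT 100 OFFSET 200
    for article in page.object_list:
        print(article.title)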
I benchmarked this for my current project with a dataset of 2.5M records in one table.
I was reading information and counting records; for example, I needed to find the IDs of records whose "name" field was updated more than once in a certain timeframe. The Django benchmark used the ORM to retrieve all records and then iterate through them. The data was saved in a list for later processing. No debug output, except printing the result at the end.
On the other hand, I used MySQLdb to execute the same queries (taken from Django) and build the same structure, using classes to store the data and saving instances in a list for later processing. No debug output, except printing the result at the end.
I found that:
                      without Django    with Django
execution time        x                 10x
memory consumption    y                 25y
And I was only reading and counting, without performing update/insert queries.
Try investigating this question for yourself; the benchmark isn't hard to write and run.