Django query optimization - django

I have a model with a lot of entries (more than 12,513,262), and the number is expected to grow exponentially. Because of the large number of entries, querying takes a long time. Is there a way to improve performance, for example with indexes?
My query is like this:
MyModel.objects.all().order_by('-timestamp')[0:50]
and it's taking a lot of time to execute.

Do you have an index on the timestamp field?
If you use South for database migrations (you should definitely look into it if you aren't already), you can just add db_index=True to your field and migrate. Otherwise you can run
./manage.py sqlindexes MyApp
to show the SQL statement that adds the index, which you then need to run manually, e.g. using
./manage.py dbshell
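To see why the index matters for an ORDER BY ... LIMIT query like the one in the question, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in database. The table name, column names, and index name are made up for illustration, and SQLite's planner is not the same as the one in your production database, so treat it as a demonstration of the idea only.

```python
import sqlite3


def query_plan(conn, sql):
    """Return the planner's step descriptions for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical stand-in for the table behind MyModel.
conn.execute(
    "CREATE TABLE mymodel (id INTEGER PRIMARY KEY, timestamp TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO mymodel (timestamp, payload) VALUES (?, ?)",
    [(f"2024-01-01T00:00:{i % 60:02d}", "x") for i in range(1000)],
)

# Roughly what MyModel.objects.all().order_by('-timestamp')[0:50] asks for.
latest_50 = "SELECT * FROM mymodel ORDER BY timestamp DESC LIMIT 50"

plan_before = query_plan(conn, latest_50)
conn.execute("CREATE INDEX mymodel_timestamp_idx ON mymodel (timestamp)")
plan_after = query_plan(conn, latest_50)

print("before:", plan_before)
print("after: ", plan_after)
```

Without the index the plan contains a separate sorting step over the whole table; with it, the rows can be read in index order and the scan can stop after 50 rows.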

Best index for a Django model when filtering on one field and ordering on another field

I use Django 2.2 linked to PostgreSQL and would like to optimise my database queries.
Given the following simplified model:
class Person(models.Model):
    name = models.CharField()
    age = models.IntegerField()
on which I have to do the following query, say,
Person.objects.filter(age__gt=20, age__lt=30).order_by('name')
What would be the best way to define the index in the model Meta field so as to optimise the query?
Which of these four options would be best?
class Meta:
    indexes = [models.Index(fields=['age', 'name']),
               models.Index(fields=['name', 'age']),
               models.Index(fields=['name']),
               models.Index(fields=['age'])]
Is it, for example, possible to prevent sorting when the query is done? Thank you.
This is really a postgres question, as much as a Django question, right?
I think there is a good chance that creating an index on your sort field will help with performance. But there are a lot of caveats, and if it's really important to you, you might want to do some testing focused on Postgres (i.e., just run some queries in psql and see what happens). Some caveats include:
it might depend on which type of index Django creates for you
Postgres does not always use an index when running a query, but it should if you've got the right index and the right query (and if there is enough data in the table to justify loading the index)
it might matter how Django formats your SELECT
I suggest you create your model and specify that you want the index. Then use Django Debug Toolbar to find out what SELECT query is really being run. Then open a database shell with manage.py dbshell (which launches psql) and run EXPLAIN ANALYZE on that same SELECT. Assuming you can interpret the output, you will see for yourself whether your index is coming into play. Paste the EXPLAIN ANALYZE output here, if you like.
According to this Postgres documentation ORDER BY can be assisted by a btree index. The b-tree type of index is what Django will create for you by default.
So, why don't you try this:
class Meta:
    indexes = [models.Index(fields=['age', 'name'])]
Then go run an EXPLAIN ANALYZE in dbshell and see whether it worked.
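The "create the index, then look at the plan" workflow can be tried out without a Postgres server. The sketch below uses Python's sqlite3 module and SQLite's EXPLAIN QUERY PLAN as a stand-in for EXPLAIN ANALYZE; the table and index names are invented for the example. SQLite's planner makes different choices from PostgreSQL's, so this only illustrates how to inspect a plan, not what Postgres will do with your data.

```python
import sqlite3


def explain(conn, sql, params=()):
    """Return the plan steps SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO person (name, age) VALUES (?, ?)",
    [(f"person{i:04d}", i % 90) for i in range(2000)],
)

# Counterpart of models.Index(fields=['age', 'name']).
conn.execute("CREATE INDEX person_age_name_idx ON person (age, name)")

# Counterpart of Person.objects.filter(age__gt=20, age__lt=30).order_by('name').
query = "SELECT name, age FROM person WHERE age > ? AND age < ? ORDER BY name"

plan = explain(conn, query, (20, 30))
for step in plan:
    print(step)

names = [row[0] for row in conn.execute(query, (20, 30))]
```

Reading the printed steps tells you whether the index was picked up for the age range and whether a separate sort step for name is still present. That is the same question EXPLAIN ANALYZE answers in psql.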
# You should apply indexing on age, because you are searching for 'age' column data
indexes = [
models.Index(fields=['age'])
]

Migrate a PositiveIntegerField to a FloatField

I have an existing populated database and would like to convert a PositiveIntegerField into a FloatField. I am considering simply doing a migration:
migrations.AlterField(
    model_name='mymodel',
    name='field_to_convert',
    field=models.FloatField(
        blank=True,
        help_text='my helpful text',
        null=True),
),
Where the field is currently defined as:
field_to_convert = models.PositiveIntegerField(
    null=True,
    blank=True,
    help_text='my helpful text')
Will this require a full rewrite of the database column? How well might this conversion scale for larger databases? How might it scale if the vast majority of values were null? In what circumstances would this conversion fail? This is backed by a Postgres database, if that makes a difference.
Will this require a full rewrite of the database column?
No, it won't. I ran an experiment with PostgreSQL, MySQL, and SQLite, and the conversion from integer to float went well in every case. I also set some values to null to match your situation.
If you have a value 3, it will simply become 3.0.
How might it scale if the vast majority values were null?
Well, since you keep null=True in the configuration of your field all null values will remain null, no problem with that. If you remove null=True you might need to specify a default value.
In what circumstances would this conversion fail?
Taking an int column and converting it to float (real) should not fail. If you do find a bizarre, very special case where it does, that would be a very big finding.
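One way to convince yourself that no values are damaged is to check the arithmetic in plain Python. The sketch below assumes the column holds values up to 2147483647, which is the upper bound Django documents as safe for PositiveIntegerField on all backends. The to_float helper is hypothetical and only mimics what the column cast does to each value.

```python
# Upper bound Django documents as safe for PositiveIntegerField on all backends.
MAX_POSITIVE_INT = 2147483647


def to_float(value):
    """Mimic the column cast: NULL stays NULL, integers become floats."""
    return None if value is None else float(value)


samples = [0, 3, None, MAX_POSITIVE_INT]
converted = [to_float(value) for value in samples]
print(converted)

# A double has a 53-bit significand, so every 32-bit integer converts exactly.
round_trips = all(
    int(as_float) == original
    for original, as_float in zip(samples, converted)
    if original is not None
)
print("all values survive the round trip:", round_trips)
```

Precision would only become a concern for integers beyond 2**53, which is far outside what this field type stores.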
If you have doubts about the migration outcome...
... you can first take a look at the migration's SQL with sqlmigrate, and of course, you could back up your database.
You can use sqlmigrate to check generated sql for your migration.
$ python manage.py sqlmigrate app_label migration_name
Keep in mind, that its output depends on the Django version and the database you have in settings. For the setup I had on hand (Django 1.11, Postgres 9.3) for your migration I got:
BEGIN;
--
-- Alter field field_to_convert on mymodel
--
ALTER TABLE "myapp_mymodel" DROP CONSTRAINT "myapp_mymodel_field_to_convert_check";
ALTER TABLE "myapp_mymodel" ALTER COLUMN "field_to_convert" TYPE double precision USING "field_to_convert"::double precision;
COMMIT;
Which looks good to me both in terms of performance and reliability. I'd say go ahead with the AlterField.
If you want to be extra safe, you can always go: rename field -> create field -> run python -> drop field. This will give you more control over the migration process. Check this answer for details.

How to delete all data for one and only one app in Django

I have a setup (Django 1.11) with several apps including OOK, EEK, and other irrelevant ones. I want to delete all the data for OOK while leaving EEK (and the rest) untouched. Ideally, I want all the primary keys to be reset as well so the first new OOK model will get 1 and so on…
Is this possible?
All I can find is reset and sqlclear, which are both deprecated. flush removes all data from the database and is thus not what I want.
I do realize that this is an odd thing to do, but this is the hand given to me…
I think you can achieve this behaviour by dropping all the tables of that <app> and then migrating only that <app>. That way you'll reset the tables of the <app>. In Django 1.7+ you can do:
$ python manage.py migrate OOK zero  # unapplies all of OOK's migrations
$ python manage.py migrate OOK
https://docs.djangoproject.com/en/2.0/ref/django-admin/#django-admin-migrate
If you are allowed to replace the db, you could export the data you need to a fixture, then do some clever text processing in the json that is in there, say by finding all ID fields and replacing them from 1. Then reimport the result into a clean db?
The ids are autoincremented by postgresql, according to this answer you can reset the index sequence, but not even sure it can go back to 1.
But really what's the point of resetting the indexes?
This might not be possible with django. However, it is doable with raw SQL:
SET FOREIGN_KEY_CHECKS = 0;
TRUNCATE OOK_table1;
TRUNCATE OOK_table2;
[…]
SET FOREIGN_KEY_CHECKS = 1;
⚠ Do take a backup of your database before doing that!
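The statements above use MySQL's FOREIGN_KEY_CHECKS switch. Since PostgreSQL came up in another answer, here is a hedged sketch for that case: PostgreSQL's TRUNCATE accepts several tables in one statement plus a RESTART IDENTITY clause that resets the sequences owned by their columns. The helper name and the table names are hypothetical, and the function only builds the SQL text. You would still run it yourself (for example in dbshell) after taking a backup.

```python
def build_truncate_sql(table_names):
    """Build one PostgreSQL TRUNCATE statement covering all given tables.

    Listing every table of the app in a single statement lets tables that
    reference each other be emptied together. CASCADE is deliberately left
    out so that tables of other apps are never touched; PostgreSQL will
    refuse the statement if an unlisted table has a foreign key to a
    listed one.
    """
    if not table_names:
        raise ValueError("no tables given")
    # Double-quote identifiers, doubling any embedded quote characters.
    quoted = ", ".join('"' + name.replace('"', '""') + '"' for name in table_names)
    return f"TRUNCATE TABLE {quoted} RESTART IDENTITY;"


# Hypothetical table names for the OOK app.
sql = build_truncate_sql(["ook_table1", "ook_table2"])
print(sql)
```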

Django database optimization: Should I filter and count in python or use Django's queryset.count()

I have a few views in Django where I query the database about 50 times. I am wondering if it is faster to query the database or to grab the items once and count myself in python.
For example: On one page, I want to list the following statistics:
- users who live in New York
- users who are admin users
- users who are admin users and have one car
...
...
etc.
Is it faster to do many Django database queries such as the following:
User.objects.filter(state='NY').count()
User.objects.filter(type='Admin').count()
User.objects.filter(type='Admin', car=1).count()
Or would it be faster to grab all the users, loop through them in python just one time, and count each of these things as I go?
Does it depend on the database? Does it matter how many of these count() queries I execute (on some views I have upwards of 30, as my application is very data driven and is just spitting out many statistics)?
Thanks!
Use the Database as much as you can.
Python, Java, or any other language, it doesn't matter. Always use SQL (if it is an RDBMS). Every time you execute an SQL statement, multiple things happen:
Prepare
Execute
Fetch
You want every single step to be minimal. If you do all of this in the DB, the Prepare/Execute steps may take longer (but usually less than all the separate queries combined), while Fetch (which is sometimes the biggest cost) happens once (not n times) and usually returns fewer records than the other approach combined.
Then look at the execution plan to optimize the Prepare and Execute steps.
The following link is for Java developers, but it applies to the Python world too.
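As an illustration of letting the database do the counting, several statistics can be computed in a single query with conditional aggregation, so there is one round trip and one fetched row instead of one per statistic. The sketch below uses Python's sqlite3 module with a made-up users table and sample rows. In Django the same shape is usually expressed with aggregate() and conditional Count expressions, but raw SQL is used here so the example runs without a project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema mirroring the fields used in the question.
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, state TEXT, type TEXT, car INTEGER)"
)
conn.executemany(
    "INSERT INTO users (state, type, car) VALUES (?, ?, ?)",
    [
        ("NY", "Admin", 1),
        ("NY", "User", 0),
        ("CA", "Admin", 1),
        ("CA", "Admin", 2),
        ("TX", "User", 1),
    ],
)

# One statement, one fetched row, three statistics.
ny_users, admins, admins_with_one_car = conn.execute(
    """
    SELECT
        SUM(CASE WHEN state = 'NY' THEN 1 ELSE 0 END),
        SUM(CASE WHEN type = 'Admin' THEN 1 ELSE 0 END),
        SUM(CASE WHEN type = 'Admin' AND car = 1 THEN 1 ELSE 0 END)
    FROM users
    """
).fetchone()

print(ny_users, admins, admins_with_one_car)
```

Whether this beats 30 separate count() queries on your data is something to measure, but it avoids both the repeated round trips and pulling every user row into Python.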

Accelerate bulk insert using Django's ORM?

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using django's ORM.
Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records
Set DEBUG=False so that django won't accumulate all the sql commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimize stack changes)
Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
Set force_insert=True in the save() in hopes it will save django some logic
Explicitly set the id in hopes it will save django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
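With a billion records you would not build one giant list for bulk_create(); the rows still have to be fed in chunks, much like the "commit once every 5000 records" approach in the question. A minimal sketch of such chunking in plain Python follows. The chunks helper and the sample records are illustrative only, and the bulk_create() call is shown as a comment because MyModel is an assumed name and the snippet is meant to run without Django.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield lists of up to `size` items without loading everything at once."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for records parsed lazily from one input file.
records = ({"id": i} for i in range(12))

sizes = []
for batch in chunks(records, 5):
    # In a Django project this is where something like
    # MyModel.objects.bulk_create([MyModel(**row) for row in batch])
    # would go (MyModel is hypothetical here).
    sizes.append(len(batch))

print(sizes)
```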
This is not specific to Django ORM, but recently I had to bulk insert >60 Million rows of 8 columns of data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced the insert time from over 48 hours to ~1 hour:
increase the cache size setting of your DB to use more RAM (the defaults are always very small, I used 3GB); in sqlite, this is done by PRAGMA cache_size = n_of_pages;
do journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but something I consider negligible given that you have the source data on disk already); in sqlite this is done by PRAGMA journal_mode = MEMORY
last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or other constraints that might cause the DB to build an index. Build the index only after you are done inserting.
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something alike, or even an open file reader.
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
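Put together, the SQLite advice above (larger cache, in-memory journal, executemany, index built last) might look like the following sketch. It uses an in-memory database and invented sample data so it runs anywhere. The cache size, table, and index names are placeholders, not recommendations.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real import
conn.execute("PRAGMA cache_size = 10000")     # number of pages; tune to your RAM
conn.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM

# No UNIQUE constraints or indexes yet, so inserts stay cheap.
conn.execute("CREATE TABLE mytable (field1 INTEGER, field2 TEXT, field3 REAL)")


def rows(n):
    """Stand-in for an iterable that parses rows out of a source file."""
    for i in range(n):
        yield (i, f"name{i}", i * 0.5)


# One transaction around the whole batch; executemany consumes the generator.
with conn:
    conn.executemany(
        "INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)",
        rows(10000),
    )

# Build the index only after all rows are in.
conn.execute("CREATE INDEX mytable_field1_idx ON mytable (field1)")

count = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
print(count)
```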
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO

def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    # ExcelImportProcessor is the author's own helper returning a SQLAlchemy engine.
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)

def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    # Serialize the selected columns as tab-separated text in memory.
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    # Stream the buffer to the table with COPY; empty strings are loaded as NULL.
    cur = connection.cursor()
    cur.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cur.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly performance was worse than the single-row version, whether for Pandas versions 0.23, 0.24 or 1.1. Pandas single row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv and cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one insert command multiple value pairs (INSERT INTO x (val1, val2) VALUES (1,2), (3,4) --etc etc). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
The development version of Django now has bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create