How to temporarily disable Django indexes (for SQLite) - django

I'm trying to create a large SQLite database from around 500 smaller databases (each 50-200MB) to put into Django, and would like to speed up this process. I'm doing this via a custom command.
This answer helped me a lot, reducing the processing time to around a minute per smaller database. However, that is still quite a long time.
The one thing from that answer I haven't done is to disable database indexing in Django and re-create the indexes afterwards. I think this matters for me, as my database has a few tables with many rows.
Is there a way to do that in Django when it's running live? If not in Django then perhaps there's some SQLite query to remove all the indexes and re-create them after I insert my records?

I just used raw SQL to remove the indexes and re-create them. This improved the speed of creating a big database from 2 of my small databases from 1:46 to 1:30, so quite significant. It also reduced the size from 341.7MB to 321.1MB.
from django.db import connection

# Delete all indexes for faster database creation
with connection.cursor() as cursor:
    cursor.execute(
        "SELECT name, sql FROM sqlite_master "
        f"WHERE name LIKE '{app_label}_%' AND type = 'index'"
    )
    indexes = cursor.fetchall()
    # Keep the CREATE INDEX statements so the indexes can be rebuilt later
    names, create_sqls = zip(*indexes)
    for name in names:
        cursor.execute(f'DROP INDEX "{name}"')
After inserting the records, re-create the indexes:
# Re-create indexes
with connection.cursor() as cursor:
    for create_sql in create_sqls:
        cursor.execute(create_sql)
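For anyone who wants to try the drop-and-recreate idea outside Django, here is a minimal sketch using only the standard-library sqlite3 module. The table name myapp_sample, the index name, and the helper names drop_indexes and recreate_indexes are made up for illustration. It skips SQLite's automatic indexes (their sql column is NULL and they cannot be dropped), and it returns an empty list when nothing matches instead of failing on the zip.

```python
import sqlite3


def drop_indexes(conn, prefix):
    """Drop explicitly created indexes whose name starts with prefix.

    Returns the CREATE INDEX statements needed to rebuild them.
    Note that "_" is a single-character wildcard in LIKE patterns.
    """
    rows = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'index' AND sql IS NOT NULL AND name LIKE ? || '%'",
        (prefix,),
    ).fetchall()
    for name, _ in rows:
        conn.execute(f'DROP INDEX "{name}"')
    return [sql for _, sql in rows]


def recreate_indexes(conn, create_sqls):
    """Run the saved CREATE INDEX statements again."""
    for sql in create_sqls:
        conn.execute(sql)


# Hypothetical schema standing in for a Django app's table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myapp_sample (id INTEGER PRIMARY KEY, value TEXT)")
conn.execute("CREATE INDEX myapp_sample_value_idx ON myapp_sample (value)")

saved = drop_indexes(conn, "myapp_")
conn.executemany(
    "INSERT INTO myapp_sample (value) VALUES (?)",
    ((f"row {i}",) for i in range(1000)),
)
recreate_indexes(conn, saved)
conn.commit()
```

Whether this pays off depends on how many indexes there are and how large the tables get, so it is worth timing on a couple of the small databases first, as was done above.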

Related

Django model count() with caching

I have a Django application with Prometheus monitoring and a model called Sample.
I want to monitor the Sample.objects.count() metric and cache its value for a fixed time interval, to avoid costly COUNT(*) queries in the database.
From this tutorial, https://github.com/prometheus/client_python#custom-collectors, I read that I need to write a custom collector.
What is the best approach to achieve this?
Is there any way in Django to get a cached Sample.objects.count() value and update it after K seconds?
I also use Redis in my application. Should I store this value there?
Should I use a separate thread to update the cached Sample.objects.count() value?
First thing to note is that you don't really need to cache the result of a count(*) query.
Though different RDBMS handle count operations differently, they are slow across the board for large tables. But one thing they have in common is that there is an alternative to SELECT COUNT(*) provided by the RDBMS which is in fact a cached result. Well sort of.
You haven't mentioned what your RDBMS is, so let's see how it works in the popular ones used with Django.
MySQL
Provided you have a primary key on your table and you are using MyISAM, SELECT COUNT(*) is really fast on MySQL and scales well. But chances are that you are using InnoDB, and that's the right storage engine for various reasons. InnoDB is transaction-aware and can't handle COUNT(*) as well as MyISAM, so the query slows down as the table grows.
The count query on a table with 2M records took 0.2317 seconds. The following query took 0.0015 seconds:
SELECT table_rows FROM information_schema.tables
WHERE table_name='for_count';
It reported a value of 1997289 instead of 2 million, because table_rows is only an estimate for InnoDB tables, but that is close enough!
So you don't need your own caching system.
SQLite
SQLite COUNT(*) queries aren't really slow, but they don't scale either: as the table grows, the count query slows down. Using a table similar to the one used for MySQL, SELECT COUNT(*) FROM for_count required 0.042 seconds to complete.
There isn't a shortcut. The sqlite_master table does not provide row counts, and neither does PRAGMA table_info.
You need your own system to cache the result of SELECT COUNT(*).
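One possible "own system" for SQLite, sketched here as an assumption rather than something from the measurements above, is a small counter table kept up to date by triggers. The row_counts table and the trigger names are hypothetical. The trade-off is that every insert and delete on for_count now does an extra update, so this only makes sense when reads of the count are much more frequent than writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE for_count (id INTEGER PRIMARY KEY, payload TEXT);

    -- Hypothetical side table holding one cached row count per table.
    CREATE TABLE row_counts (table_name TEXT PRIMARY KEY, n INTEGER NOT NULL);
    INSERT INTO row_counts VALUES ('for_count', 0);

    -- SQLite triggers fire once per affected row.
    CREATE TRIGGER for_count_ins AFTER INSERT ON for_count
    BEGIN
        UPDATE row_counts SET n = n + 1 WHERE table_name = 'for_count';
    END;

    CREATE TRIGGER for_count_del AFTER DELETE ON for_count
    BEGIN
        UPDATE row_counts SET n = n - 1 WHERE table_name = 'for_count';
    END;
    """
)

conn.executemany("INSERT INTO for_count (payload) VALUES (?)", [("x",)] * 5)
conn.execute("DELETE FROM for_count WHERE id = 1")
conn.commit()

# Reading the cached count touches a single row instead of scanning the table.
cached = conn.execute(
    "SELECT n FROM row_counts WHERE table_name = 'for_count'"
).fetchone()[0]
```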
PostgreSQL
Despite being the most feature-rich open source RDBMS, PostgreSQL isn't good at handling COUNT(*): it's slow and doesn't scale very well. In other words, no different from the poor relations!
The count query took 0.194 seconds on PostgreSQL. On the other hand, the following query took 0.003 seconds:
SELECT reltuples FROM pg_class WHERE relname = 'for_count'
You don't need your own caching system.
SQL Server
The COUNT query on SQL server took 0.160 seconds on average but it fluctuated rather wildly. For all the databases discussed here the first count(*) query was rather slow but the subsequent queries were faster because the file was cached by the operating system.
I am not an expert on SQL Server, so before answering this question I didn't know how to look up the row count using schema info. I found this Q&A helpful. One of the queries I tried produced the result in 0.004 seconds:
SELECT t.name, s.row_count from sys.tables t
JOIN sys.dm_db_partition_stats s
ON t.object_id = s.object_id
AND t.type_desc = 'USER_TABLE'
AND t.name ='for_count'
AND s.index_id = 1
You don't need your own caching system.
Integrate into Django
As can be seen, all databases considered except SQLite provide a built-in 'cached query count', so there is no need to create one of our own. It's a simple matter of creating a custom manager to make use of this functionality.
from django.db import connection, models


class CustomManager(models.Manager):
    def quick_count(self):
        # Read the engine's own row-count estimate instead of running COUNT(*)
        with connection.cursor() as cursor:
            cursor.execute("""SELECT table_rows FROM information_schema.tables
                              WHERE table_name='for_count'""")
            row = cursor.fetchone()
        return row[0]


class Sample(models.Model):
    ...
    objects = CustomManager()
The above example is for MySQL (it reads information_schema.tables), but the same thing can be done for PostgreSQL or SQL Server by simply changing the query to one of those listed above.
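If the project may run on more than one backend, the query can be picked at runtime. The sketch below is only an illustration: APPROX_COUNT_SQL and approx_count_sql are hypothetical names, the dictionary keys are meant to match the strings Django exposes as connection.vendor, and only the MySQL and PostgreSQL queries from above are included.

```python
# Hypothetical lookup table: vendor string -> estimated row-count query.
# %s is the DB-API placeholder style that Django cursors accept.
APPROX_COUNT_SQL = {
    "mysql": "SELECT table_rows FROM information_schema.tables WHERE table_name = %s",
    "postgresql": "SELECT reltuples FROM pg_class WHERE relname = %s",
}


def approx_count_sql(vendor):
    """Return the estimate query for this vendor, or None if there is none."""
    return APPROX_COUNT_SQL.get(vendor)
```

Inside quick_count() one could call approx_count_sql(connection.vendor), execute the result with [self.model._meta.db_table] as the parameter list, and fall back to self.count() when it returns None (as it would for SQLite).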
Prometheus
How to plug this into django prometheus? I leave that as an exercise.
A custom collector that returns the previous value if it's not too old and fetches otherwise would be the way to go. I'd keep it all in-process.
If you're using MySQL you might want to look at the collectors the mysqld_exporter offers as there's some for table size that should be cheaper.
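The "return the previous value if it's not too old and fetch otherwise" idea can be sketched without any Prometheus or Django code. CachedValue below is a hypothetical helper, and expensive_count stands in for Sample.objects.count; the clock is injectable only so that the example can be followed without waiting.

```python
import time


class CachedValue:
    """Return a cached value until it is older than ttl seconds."""

    def __init__(self, fetch, ttl, clock=time.monotonic):
        self._fetch = fetch
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._fetched_at = None  # None means "never fetched"

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Demonstration with a fake clock and a stand-in for the expensive query.
calls = []
fake_now = [0.0]


def expensive_count():
    calls.append(1)
    return 2_000_000 + len(calls)


cached_count = CachedValue(expensive_count, ttl=30, clock=lambda: fake_now[0])
first = cached_count.get()    # nothing cached yet, so this fetches
fake_now[0] = 10.0
second = cached_count.get()   # 10s old, still fresh, served from the cache
fake_now[0] = 45.0
third = cached_count.get()    # 45s old, stale, fetched again
```

A custom collector's collect() method would then call cached_count.get() when building its gauge. This sketch is not thread-safe; if several threads can scrape at once, guarding get() with a threading.Lock would be one way to avoid duplicate fetches.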

Django table TRUNCATE based on DB type in view

What is the fastest way to truncate a table in the Django ORM based on the database type in a view? I know you can do this for example
Books.objects.all().delete()
but with tables containing millions of rows it is very slow. I know it is also possible to use the cursor and some custom SQL
from django.db import connection
cursor = connection.cursor()
cursor.execute("TRUNCATE TABLE `books`")
However, the TRUNCATE command does not work with SQLite. And if the database moves to another db type, I need to account for that.
Any ideas? Would it be easier to just drop the table and recreate in my view?
Django's .delete() method is indeed very slow, as it loads the IDs of each object being deleted so that a post_delete signal can be emitted.
This means that a simple connection.cursor().execute("DELETE FROM foo") will be significantly faster than Foo.objects.all().delete().
If that's still too slow, a truncate or drop+recreate is definitely the way to go. You can get the SQL used to create a table with: output, references = connection.creation.sql_create_model(model, style), where style = django.core.management.color_style() (this is taken from https://github.com/django/django/blob/master/django/core/management/sql.py#L14).
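To account for the database type, the statement can be chosen from the backend name. This is a rough sketch: clear_table_sql is a hypothetical helper, and the vendor strings are assumed to be the values Django exposes as connection.vendor. The demonstration runs against the standard-library sqlite3 module so it can be tried without a Django project.

```python
import sqlite3


def clear_table_sql(vendor, table):
    """Pick a fast 'empty this table' statement for the given backend.

    `table` must be a trusted identifier (for example Model._meta.db_table),
    because identifiers cannot be passed as query parameters.
    """
    if vendor in ("mysql", "postgresql"):
        return f"TRUNCATE TABLE {table}"
    # SQLite has no TRUNCATE; a DELETE without a WHERE clause is the closest thing.
    return f"DELETE FROM {table}"


# Try the SQLite branch on a throwaway in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO books (title) VALUES (?)", [("a",), ("b",), ("c",)])
conn.execute(clear_table_sql("sqlite", "books"))
conn.commit()
remaining = conn.execute("SELECT COUNT(*) FROM books").fetchone()[0]
```

In a view this would be called as clear_table_sql(connection.vendor, Books._meta.db_table) and passed to a cursor. Keep in mind that PostgreSQL refuses to TRUNCATE a table that other tables reference through foreign keys unless those tables are truncated too, so the simple branch above may need adjusting.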

Django 1.2 PostgreSQL cascading delete for keys with ON DELETE NO ACTION

I have a PostgreSQL database with about 150 tables (it's a Django 1.2 project). Django adds ON DELETE NO ACTION and ON UPDATE NO ACTION to foreign keys at the time of table creation.
Now I need to bulk delete data (about 800,000 records) from a bunch of tables based on a certain condition.
Using Model.objects.filter().delete() is not an option because the data is huge and it takes a lot of time.
The only sane option seems to be a cascading delete, but since Django has added "ON DELETE NO ACTION", that does not seem possible.
So my question: is there any way to change all foreign keys to ON DELETE CASCADE in an easy way (there are many of them), or something similar?
(I am aware that I can manually write the SQL queries for each table, but that would be a monumental and difficult to maintain task.)
https://docs.djangoproject.com/en/dev/ref/models/fields/#django.db.models.ForeignKey.on_delete
As pointed out in the link which comprises Andrew's answer, if you set this to CASCADE in Django, then Django will go and do the deletes "retail". If it is set to NO ACTION you can create a database-level foreign key definition to handle things. That sounds like a reasonable plan to me.
Be sure you have an index defined on the referencing set of columns for every foreign key; otherwise you're going to see very slow performance. Some database products will automatically create such an index when you define a foreign key, but there are situations where that is not advantageous, so PostgreSQL puts the matter in your hands to optimize as you see fit. (Just as one example, it might not be worth the cost of maintaining the index during normal operations, but be worth building it before a purge and dropping it after.)
One note: ON DELETE CASCADE performs miserably on bulk operations. The reason is that this is done as a trigger. Consequently the way it looks from an algorithmic perspective is:
for row in delete_set:
    for dependent_row in (scan for referencing rows):
        delete dependent_row
If you are deleting 800000 rows in a parent table, this translates into 800000 separate delete scans on the dependent tables. Even in the best case, with usable indexes, 800000 separate index scans will be much slower than one sequential scan.
A better way to do this is to use a writeable common table expression in 9.1 or higher, or to just do separate delete statements in the same transaction. Something like:
WITH rows_to_delete (id) AS (
    SELECT id FROM mytable WHERE where_condition
),
deleted_rows (id) AS (
    DELETE FROM referencing_table WHERE mytable_id IN (SELECT id FROM rows_to_delete)
    RETURNING mytable_id
)
DELETE FROM mytable WHERE id IN (SELECT id FROM deleted_rows);
This reduces to something like, algorithmically:
scan for rows to delete as delete_set
for dependent in scan for rows dependent to delete:
    delete dependent
for to_delete in scan for rows referenced by deleted dependents:
    delete to_delete
Getting rid of the forced nested loop scan will greatly speed things up.
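The other option mentioned above, separate delete statements in the same transaction, can be tried out with a small script. The sketch below uses the standard-library sqlite3 module purely so that it runs anywhere; it is not PostgreSQL, and the stale column standing in for where_condition is an assumption made for the example. The point is only the shape: one set-based delete on the referencing table, then one on the parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE mytable (id INTEGER PRIMARY KEY, stale INTEGER NOT NULL);
    CREATE TABLE referencing_table (
        id INTEGER PRIMARY KEY,
        mytable_id INTEGER NOT NULL REFERENCES mytable (id)
    );
    -- Index on the referencing column, as recommended above.
    CREATE INDEX referencing_table_fk ON referencing_table (mytable_id);
    """
)
# Ten parents (odd ids are "stale"), three dependents each.
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(i, i % 2) for i in range(1, 11)])
conn.executemany(
    "INSERT INTO referencing_table (mytable_id) VALUES (?)",
    [(i,) for i in range(1, 11) for _ in range(3)],
)
conn.commit()

with conn:  # one transaction: commits on success, rolls back on error
    # Dependents first, selected by the same condition as the parents.
    conn.execute(
        "DELETE FROM referencing_table "
        "WHERE mytable_id IN (SELECT id FROM mytable WHERE stale = 1)"
    )
    conn.execute("DELETE FROM mytable WHERE stale = 1")

parents_left = conn.execute("SELECT COUNT(*) FROM mytable").fetchone()[0]
children_left = conn.execute("SELECT COUNT(*) FROM referencing_table").fetchone()[0]
```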

Accelerate bulk insert using Django's ORM?

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using django's ORM.
Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records
Set DEBUG=False so that django won't accumulate all the sql commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimize stack changes)
Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
Set force_insert=True in the save() in hopes it will save django some logic
Explicitly set the id in hopes it will save django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
This is not specific to the Django ORM, but recently I had to bulk insert more than 60 million rows of 8 columns of data from over 2000 files into a sqlite3 database. I learned that the following three things reduced the insert time from over 48 hours to ~1 hour:
Increase the cache size setting of your DB to use more RAM (the defaults are always very small; I used 3GB). In SQLite, this is done with PRAGMA cache_size = n_of_pages;
Do journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but I consider that negligible given that you have the source data on disk already). In SQLite, this is done with PRAGMA journal_mode = MEMORY
Last and perhaps most important: do not build indexes while inserting. This also means not declaring UNIQUE or any other constraint that might cause the DB to build an index. Build indexes only after you are done inserting.
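Put together, the three settings might look like the sketch below with the standard-library sqlite3 module. The readings table, its columns, and the cache size of 100000 pages are placeholders chosen for the example, not values from the measurements above.

```python
import os
import sqlite3
import tempfile

# A throwaway database file; a real load would point at the target file.
path = os.path.join(tempfile.mkdtemp(), "bulk.sqlite")
conn = sqlite3.connect(path)

# 1. Larger page cache. A positive value is a number of pages.
conn.execute("PRAGMA cache_size = 100000")

# 2. Keep the rollback journal in RAM. The pragma returns the mode in effect.
journal_mode = conn.execute("PRAGMA journal_mode = MEMORY").fetchone()[0]

# 3. No UNIQUE constraints or indexes while loading.
conn.execute("CREATE TABLE readings (sensor TEXT, ts INTEGER, value REAL)")
with conn:  # a single transaction for the whole batch
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?, ?)",
        ((f"s{i % 10}", i, i * 0.5) for i in range(10_000)),
    )

# Build the index once, after all rows are in.
conn.execute("CREATE INDEX readings_sensor_ts ON readings (sensor, ts)")
conn.commit()
```

Both pragmas apply to the current connection only, so they need to be issued again each time the loader reconnects.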
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something alike, or even an open file reader.
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
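A small runnable illustration of executemany() fed from a generator, using the standard-library sqlite3 module. The parse_lines helper and the tab-separated sample data are made up for the example; the list called source stands in for an open file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (field1 TEXT, field2 INTEGER, field3 REAL)")


def parse_lines(lines):
    """Yield one (field1, field2, field3) tuple per tab-separated line."""
    for line in lines:
        name, count, score = line.rstrip("\n").split("\t")
        yield name, int(count), float(score)


# Stand-in for an open file object; any iterable of lines works.
source = ["a\t1\t0.5\n", "b\t2\t1.5\n", "c\t3\t2.5\n"]

with conn:  # commit once for the whole batch
    conn.executemany(
        "INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)",
        parse_lines(source),
    )

total = conn.execute("SELECT SUM(field2) FROM mytable").fetchone()[0]
```

Because the rows are produced lazily, the whole file never has to be held in memory. Note that the ? placeholder style is sqlite3's; other DB-API drivers, such as psycopg2, use %s instead.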
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
from io import StringIO

# ExcelImportProcessor._get_sqlalchemy_engine() is a helper from my own project (not shown)


def bulk_to_sql(df, columns, model_cls):
    """ Inserting 3000 takes 774ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)


def bulk_via_csv(df, columns, model_cls):
    """ Inserting 3000 takes 118ms avg """
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    # Serialize the frame as tab-separated text in memory
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    # Stream the buffer into the table with COPY
    cur = connection.cursor()
    cur.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cur.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly, performance was worse than the single-row version, whether for Pandas versions 0.23, 0.24 or 1.1. Pandas single-row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv and cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_create was occasionally comparable, but often slower still (up to double the time, Django 2.2).
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one insert command multiple value pairs (INSERT INTO x (val1, val2) VALUES (1,2), (3,4) --etc etc). This should greatly improve performance.
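For reference, this is roughly what building such a multi-row statement looks like. It is not the code from the snippet; multi_row_insert is a hypothetical helper written against the standard-library sqlite3 module, and the batch size of 100 is an arbitrary choice that keeps the number of bound parameters per statement small.

```python
import sqlite3


def multi_row_insert(conn, table, columns, rows, batch_size=100):
    """Insert rows using one multi-row INSERT statement per batch.

    `table` and `columns` must be trusted identifiers; values are bound
    as parameters.
    """
    col_list = ", ".join(columns)
    one_row = "(" + ", ".join("?" * len(columns)) + ")"
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        sql = (
            f"INSERT INTO {table} ({col_list}) VALUES "
            + ", ".join([one_row] * len(batch))
        )
        # Flatten [(1, 2), (3, 4)] into [1, 2, 3, 4] to match the placeholders.
        flat = [value for row in batch for value in row]
        conn.execute(sql, flat)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (val1 INTEGER, val2 INTEGER)")
rows = [(i, i * 2) for i in range(250)]
multi_row_insert(conn, "x", ["val1", "val2"], rows)
conn.commit()
inserted = conn.execute("SELECT COUNT(*) FROM x").fetchone()[0]
```

Databases limit how many parameters a single statement may bind, which is why the rows are sent in batches rather than all at once.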
It also appears to be heavily documented, which is always a plus.
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
Development django got bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create

SQLite - pre allocating database size

Is there a way to pre-allocate my SQLite database to a certain size? Currently I'm adding and deleting a number of records and would like to avoid this overhead at create time.
The fastest way to do this is with the zeroblob function:
Example:
Y:> sqlite3 large.sqlite
SQLite version 3.7.4
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table large (a);
sqlite> insert into large values (zeroblob(1024*1024));
sqlite> drop table large;
sqlite> .q
Y:> dir large.sqlite
Volume in drive Y is Personal
Volume Serial Number is 365D-6110
Directory of Y:\
01/27/2011 12:10 PM 1,054,720 large.sqlite
Note: As Kyle properly indicates in his comment:
There is a limit to how big each blob can be, so you may need to insert multiple blobs if you expect your database to be larger than ~1GB.
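The same trick can be scripted, inserting several blobs to get past the per-blob limit. This is a sketch with the standard-library sqlite3 module; the preallocate function and the prealloc_filler table name are made up, and it assumes auto_vacuum is off (the usual default), since otherwise the freed pages would be given back to the filesystem.

```python
import os
import sqlite3
import tempfile


def preallocate(path, megabytes):
    """Grow a SQLite file by writing 1 MiB zero blobs, then dropping them."""
    conn = sqlite3.connect(path, isolation_level=None)  # autocommit mode
    conn.execute("CREATE TABLE prealloc_filler (a)")
    for _ in range(megabytes):
        conn.execute("INSERT INTO prealloc_filler VALUES (zeroblob(1024 * 1024))")
    # Dropping the table leaves its pages in the file as reusable free space.
    conn.execute("DROP TABLE prealloc_filler")
    conn.close()


path = os.path.join(tempfile.mkdtemp(), "large.sqlite")
preallocate(path, 4)
size = os.path.getsize(path)  # a little over 4 MiB, as in the 1 MiB run above
```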
There is a hack - Insert a bunch of data into the database till the database size is what you want and then delete the data. This works because:
"When an object (table, index, or trigger) is dropped from the database, it leaves behind empty space. This empty space will be reused the next time new information is added to the database. But in the meantime, the database file might be larger than strictly necessary."
Naturally, this isn't the most reliable method. (Also, you will need to make sure that auto_vacuum is disabled for this to work). You can learn more here - http://www.sqlite.org/lang_vacuum.html