Does Django bulk_create lock the entire table while inserting the rows? - django

I'm using Django 1.6 and Postgres. Would a bulk_create on a specific table lock the entire table? (In my case I'm bulk-creating 10,000 rows and it takes ~10 seconds.) I've tested this by creating individual objects every half second while the bulk create was happening, and none of those creates hung, but I'd just like to make sure. Thanks!

bulk_create inserts the provided list of objects into the database in an efficient manner (generally only one query, no matter how many objects there are), so it blocks the table while it performs that single atomic transaction.
usage: bulk_create(obj_list, batch_size=None)
The batch_size parameter controls how many objects are created in a single query. The default is to create all objects in one batch, except for SQLite, where the default is such that at most 999 variables per query are used.
The following article can also give you an idea of how fast bulk_create is relative to other methods.
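For illustration, a minimal sketch with a hypothetical Entry model (the model name and field are assumptions, not from the question):
from myapp.models import Entry  # hypothetical model

# Build the objects in memory first; nothing hits the database yet.
objs = [Entry(headline='Entry %d' % i) for i in range(10000)]

# One INSERT for all rows (the default), executed as a single statement:
Entry.objects.bulk_create(objs)

# Or cap each INSERT at 1000 rows, trading one long statement for several shorter ones:
Entry.objects.bulk_create(objs, batch_size=1000)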

Related

Efficiency using triggers inside attached database with SQLite

Situation
I'm using multiple storage databases as attachments to one central "manager" DB.
The storage tables share one pseudo-AUTOINCREMENT index across all storage databases.
I need to iterate over the shared index frequently.
The final number and names of storage tables are not known on storage DB creation.
On some signal, a then-given range of entries will be deleted.
It is vital that no insertion fails and no entry gets deleted before its signal.
Energy outage is possible, data loss in this case is hardly, if ever, tolerable. Any solutions that may cause this (in-memory databases etc) are not viable.
Database access is currently controlled using strands. This takes care of sequential access.
Due to the high frequency of INSERT transactions, I must trigger WAL checkpoints manually. I've seen journals of up to 2GB in size otherwise.
Current solution
I'm inserting datasets using parameter binding to a precreated statement.
INSERT INTO datatable VALUES (:idx, ...);
Doing that, I remember the start and end index. Next, I bind it to an insert statement into the registry table:
INSERT INTO regtable VALUES (:idx, datatable);
My query determines the datasets to return like this:
SELECT MIN(rowid), MAX(rowid), tablename
FROM (SELECT rowid,tablename FROM entryreg LIMIT 30000)
GROUP BY tablename;
After that, I query
SELECT * FROM datatable WHERE rowid >= :minid AND rowid <= :maxid;
where I use predefined statements for each datatable and bind both variables to the first query's results.
This is too slow. As soon as I create the registry table, my insertions slow down so much I can't meet benchmark speed.
Possible Solutions
There are several other ways I can imagine it can be done:
Create a view of all indices as a UNION or OUTER JOIN of all table indices. This can't be done persistently on attached databases.
Create triggers for INSERT/REMOVE on table creation that fill a registry table (a rough sketch of such triggers follows this list). This can't be done persistently on attached databases.
Create a trigger for CREATE TABLE on database creation that will create the triggers described above. Requires user functions.
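For reference, a rough sketch of the second option (trigger names are made up; table and column names are guesses based on the statements above); whether triggers created this way keep working after ATTACH is exactly the open question below:
import sqlite3

# Sketch only: the storage DB is created on its own connection first,
# so the triggers live in its main schema.
con = sqlite3.connect('storage_001.db')
con.executescript("""
CREATE TABLE IF NOT EXISTS datatable (idx INTEGER PRIMARY KEY, payload BLOB);
CREATE TABLE IF NOT EXISTS regtable  (idx INTEGER, tablename TEXT);

-- Fills the registry once per inserted row (this is the 1+N concern
-- raised in the questions below).
CREATE TRIGGER IF NOT EXISTS datatable_reg_ins AFTER INSERT ON datatable
BEGIN
    INSERT INTO regtable VALUES (NEW.idx, 'datatable');
END;

CREATE TRIGGER IF NOT EXISTS datatable_reg_del AFTER DELETE ON datatable
BEGIN
    DELETE FROM regtable WHERE idx = OLD.idx AND tablename = 'datatable';
END;
""")
con.commit()
con.close()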
Questions
Now, before I go and add user functions (something I've never done before), I'd like some advice on whether this has any chance of solving my performance issues.
Assuming I create the databases using a separate connection before attaching them: can I create views and/or triggers on the database (as the main schema) that will work later when I connect to the database via ATTACH?
From what it looks like, an AFTER INSERT trigger will fire after every single inserted row. If it inserts into another table, does that mean I'm increasing my number of transactions from 2 to 1+N? Or is there a mechanism that speeds up triggered interaction? The first case would slow things down horribly.
Is there any chance that a FULL OUTER JOIN (I know that I need to create it from other JOIN commands) is faster than filling a registry with insertion transactions every time? We're talking roughly ten transactions per second with an average of 1000 elements (insert) vs. one query of 30000 every two seconds (query).
Open the sqlite3 databases in multi-threaded mode and handle the insert/update/query/delete functions in separate threads. I prefer to transfer the query results into an STL container for processing.

Django Postgres ArrayField vs One-to-Many relationship

For a model in my database I need to store around 300 values for a specific field. What would be the drawbacks, in terms of performance and simplicity in query, if I use Postgres-specific ArrayField instead of a separate table with One-to-Many relationship?
If you use an array field
The size of each row in your DB is going to be a bit large, so Postgres is going to be using a lot more TOAST tables (http://www.postgresql.org/docs/9.5/static/storage-toast.html).
Every time you get the row, unless you specifically defer (https://docs.djangoproject.com/en/1.9/ref/models/querysets/#defer) the field or otherwise exclude it from the query via only, values or the like, you're paying the cost of loading all those values every time you iterate over that row. If that's what you need, then so be it.
Filtering based on values in that array, while possible, isn't going to be as nice, and the Django ORM doesn't make it as obvious as it does for M2M tables.
If you use M2M
You can filter more easily on those related values
Those related rows are not loaded by default; you can use prefetch_related if you need them, and then get fancy if you want only a subset of those values loaded
Total storage in the DB is going to be slightly higher with M2M because of the keys and extra id fields
The cost of the joins in this case is completely negligible, because everything is joined on indexed keys.
Personally I'd say go with the M2M tables, but I don't know your specific application. If you're going to be working with a massive amount of data it's likely worth grabbing a representative dataset and testing both methods with it.
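A minimal sketch of the two shapes, mainly to show the filtering difference (model names are hypothetical; the question asks about a plain one-to-many rather than M2M, but the trade-off is the same):
from django.db import models
from django.contrib.postgres.fields import ArrayField

class Measurement(models.Model):
    # ~300 values stored inline in the row (TOAST-able once it grows large).
    values = ArrayField(models.IntegerField(), default=list)

class Sample(models.Model):
    name = models.CharField(max_length=100)

class SampleValue(models.Model):
    # Separate table: one row per value, joined back via an indexed FK.
    sample = models.ForeignKey(Sample, related_name='values', on_delete=models.CASCADE)
    value = models.IntegerField()

# Array lookup (Postgres-specific __contains):
Measurement.objects.filter(values__contains=[42])

# Related-table lookup (a plain ORM join):
Sample.objects.filter(values__value=42).distinct()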

is it right to call News.objects.all()[:x] to fetch fixed number of records

What I am asking: if I need to extract the top 10 newest News records from the db, the only way I found is
News.objects.all().order_by('-pub_date')[:10]
but is it safe? How does this construction work? Will it fetch all News records from the database, then order them, and then take only 10 of them? Or will it optimize the query and take only the 10 newest records from the database? This matters because I have over 1000 News records in the database, and it would take a long time to fetch them all and even longer to sort them.
This is safe, as QuerySets are lazy. At most ten objects will be fetched in your case, as the database query will be optimized to return only ten records from the database.
You can read more about when QuerySets are evaluated and limiting QuerySets (this section deals with slicing QuerySets which is what you are doing).
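If you want to convince yourself, you can inspect the SQL the ORM generates; a quick sketch (the model is the one from the question, the app path is assumed):
from news.models import News  # app path is an assumption

qs = News.objects.order_by('-pub_date')[:10]

# Nothing has hit the database yet; slicing just adds LIMIT 10 to the query.
print(qs.query)   # roughly: SELECT ... ORDER BY "pub_date" DESC LIMIT 10

top_ten = list(qs)  # evaluation happens here; at most 10 rows are fetched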

Accelerate bulk insert using Django's ORM?

I'm planning to upload a billion records taken from ~750 files (each ~250MB) to a db using django's ORM.
Currently each file takes ~20min to process, and I was wondering if there's any way to accelerate this process.
I've taken the following measures:
Use @transaction.commit_manually and commit once every 5000 records
Set DEBUG=False so that django won't accumulate all the sql commands in memory
The loop that runs over records in a single file is completely contained in a single function (minimize stack changes)
Refrained from hitting the db for queries (used a local hash of objects already in the db instead of using get_or_create)
Set force_insert=True in the save() in hopes it will save django some logic
Explicitly set the id in hopes it will save django some logic
General code minimization and optimization
What else can I do to speed things up? Here are some of my thoughts:
Use some kind of Python compiler or version which is quicker (Psyco?)
Override the ORM and use SQL directly
Use some 3rd party code that might be better (1, 2)
Beg the django community to create a bulk_insert function
Any pointers regarding these items or any other idea would be welcome :)
Django 1.4 provides a bulk_create() method on the QuerySet object, see:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create
https://docs.djangoproject.com/en/dev/releases/1.4/
https://code.djangoproject.com/ticket/7596
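A hedged sketch of how that can replace a per-row save() loop, assuming a hypothetical Record model and a hypothetical line format for your files:
from itertools import islice
from myapp.models import Record  # hypothetical model with name/value fields

def parse_line(line):
    """Hypothetical parser for one line of the input file."""
    name, value = line.rstrip('\n').split('\t')
    return {'name': name, 'value': int(value)}

def load_file(path, chunk_size=5000):
    """Insert the rows of one file in chunks instead of one save() per row."""
    with open(path) as fh:
        rows = (parse_line(line) for line in fh)
        while True:
            chunk = list(islice(rows, chunk_size))
            if not chunk:
                break
            Record.objects.bulk_create([Record(**fields) for fields in chunk])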
This is not specific to the Django ORM, but recently I had to bulk insert >60 million rows of 8 columns of data from over 2000 files into a sqlite3 database. And I learned that the following three things reduced the insert time from over 48 hours to ~1 hour:
Increase the cache size setting of your DB to use more RAM (the default is always very small; I used 3GB); in sqlite this is done by PRAGMA cache_size = n_of_pages.
Do journalling in RAM instead of on disk (this does cause a slight problem if the system fails, but something I consider negligible given that you already have the source data on disk); in sqlite this is done by PRAGMA journal_mode = MEMORY.
Last and perhaps most important: do not build the index while inserting. This also means not declaring UNIQUE or other constraints that would cause the DB to build an index. Build the index only after you are done inserting.
As someone mentioned previously, you should also use cursor.executemany() (or just the shortcut conn.executemany()). To use it, do:
cursor.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)', iterable_data)
The iterable_data could be a list or something similar, or even an open file reader.
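Putting those pieces together, a minimal sketch with the standard-library sqlite3 module (table and column names are made up):
import sqlite3

conn = sqlite3.connect('bulk.db')

# More page cache and in-memory journalling, as described above.
conn.execute('PRAGMA cache_size = -3000000')   # negative = size in KiB (~3GB); positive = page count
conn.execute('PRAGMA journal_mode = MEMORY')

# No indexes or UNIQUE constraints yet; build them after the load.
conn.execute('CREATE TABLE IF NOT EXISTS mytable (field1, field2, field3)')

iterable_data = ((i, i * 2, 'row %d' % i) for i in range(1000000))
conn.executemany('INSERT INTO mytable (field1, field2, field3) VALUES (?, ?, ?)',
                 iterable_data)
conn.commit()

# Only now build the index.
conn.execute('CREATE INDEX IF NOT EXISTS idx_field1 ON mytable (field1)')
conn.commit()
conn.close()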
Drop to DB-API and use cursor.executemany(). See PEP 249 for details.
I ran some tests on Django 1.10 / Postgresql 9.4 / Pandas 0.19.0 and got the following timings:
Insert 3000 rows individually and get ids from populated objects using Django ORM: 3200ms
Insert 3000 rows with Pandas DataFrame.to_sql() and don't get IDs: 774ms
Insert 3000 rows with Django manager .bulk_create(Model(**df.to_records())) and don't get IDs: 574ms
Insert 3000 rows with to_csv to StringIO buffer and COPY (cur.copy_from()) and don't get IDs: 118ms
Insert 3000 rows with to_csv and COPY and get IDs via simple SELECT WHERE ID > [max ID before insert] (probably not threadsafe unless COPY holds a lock on the table preventing simultaneous inserts?): 201ms
def bulk_to_sql(df, columns, model_cls):
    """Inserting 3000 rows takes 774ms avg."""
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    df[columns].to_sql(model_cls._meta.db_table, con=engine, if_exists='append', index=False)

def bulk_via_csv(df, columns, model_cls):
    """Inserting 3000 rows takes 118ms avg."""
    engine = ExcelImportProcessor._get_sqlalchemy_engine()
    connection = engine.raw_connection()
    cursor = connection.cursor()
    # Dump the DataFrame to an in-memory tab-separated buffer, then COPY it in.
    output = StringIO()
    df[columns].to_csv(output, sep='\t', header=False, index=False)
    output.seek(0)
    cursor.copy_from(output, model_cls._meta.db_table, null="", columns=columns)
    connection.commit()
    cursor.close()
The performance stats were all obtained on a table already containing 3,000 rows running on OS X (i7 SSD 16GB), average of ten runs using timeit.
I get my inserted primary keys back by assigning an import batch id and sorting by primary key, although I'm not 100% certain primary keys will always be assigned in the order the rows are serialized for the COPY command - would appreciate opinions either way.
Update 2020:
I tested the new to_sql(method="multi") functionality in Pandas >= 0.24, which puts all inserts into a single, multi-row insert statement. Surprisingly performance was worse than the single-row version, whether for Pandas versions 0.23, 0.24 or 1.1. Pandas single row inserts were also faster than a multi-row insert statement issued directly to the database. I am using more complex data in a bigger database this time, but to_csv and cursor.copy_from was still around 38% faster than the fastest alternative, which was a single-row df.to_sql, and bulk_import was occasionally comparable, but often slower still (up to double the time, Django 2.2).
There is also a bulk insert snippet at http://djangosnippets.org/snippets/446/.
This gives one insert command multiple value pairs (INSERT INTO x (val1, val2) VALUES (1,2), (3,4) --etc etc). This should greatly improve performance.
It also appears to be heavily documented, which is always a plus.
Also, if you want something quick and simple, you could try this: http://djangosnippets.org/snippets/2362/. It's a simple manager I used on a project.
The other snippet wasn't as simple and was really focused on bulk inserts for relationships. This is just a plain bulk insert and just uses the same INSERT query.
Development Django now has bulk_create: https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.bulk_create

Are Django's QuerySets lazy enough to cope with large data sets?

I think I've read somewhere that Django's ORM lazily loads objects. Let's say I want to update a large set of objects (say 500,000) in a batch-update operation. Would it be possible to simply iterate over a very large QuerySet, loading, updating and saving objects as I go?
Similarly if I wanted to allow a paginated view of all of these thousands of objects, could I use the built in pagination facility or would I manually have to run a window over the data-set with a query each time because of the size of the QuerySet of all objects?
If you evaluate a 500000-result queryset, which is big, it will get cached in memory. Instead, you can use the iterator() method on your queryset, which will return results as requested, without the huge memory consumption.
Also, use update() and F() objects in order to do simple batch updates in a single query.
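For example (model and field names are hypothetical):
from django.db.models import F
from myapp.models import Item  # hypothetical model with price/active fields

# Simple batch update: one UPDATE statement, no objects loaded into Python at all.
Item.objects.filter(active=True).update(price=F('price') * 2)

# If each row needs Python-side logic, stream the rows instead of caching the whole queryset.
for item in Item.objects.all().iterator():
    item.price = max(item.price, 0)   # stand-in for real per-object logic
    item.save()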
If the batch update is possible using a single SQL query, then I think using raw SQL queries or the Django ORM will not make a major difference. But if the update actually requires loading each object, processing the data and then updating it, you can use the ORM or write your own SQL query and run update queries on each piece of processed data; the overhead completely depends on the code logic.
The built-in pagination facility runs a LIMIT/OFFSET query (if you are doing it correctly), so I don't think there is major overhead in the pagination either.
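For instance, with the built-in Paginator (hypothetical model), each page you ask for issues its own LIMIT/OFFSET query:
from django.core.paginator import Paginator
from myapp.models import Item  # hypothetical model

paginator = Paginator(Item.objects.order_by('pk'), 100)   # 100 objects per page

page = paginator.page(3)   # roughly: SELECT ... ORDER BY pk LIMIT 100 OFFSET 200
                           # (plus one COUNT(*) query for paginator.count)
for item in page.object_list:
    print(item.pk)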
I benchmarked this for my current project with a dataset of 2.5M records in one table.
I was reading information and counting records; for example, I needed to find the IDs of records whose "name" field was updated more than once in a certain timeframe. The Django benchmark used the ORM to retrieve all records and then iterate through them. The data was saved in a list for further processing. No debug output, except printing the result at the end.
On the other side, I used MySQLdb, which executed the same queries (taken from Django) and built the same structure, using classes to store the data and saving the instances in a list for further processing. No debug output, except printing the result at the end.
I found that:
                     without Django    with Django
execution time            x                10x
memory consumption        y                25y
And I was only reading and counting, without performing update/insert queries.
Try to investigate this question for yourself; a benchmark isn't hard to write and run.