Do Django transactions make my non-db operations atomic? - django

I have a function that writes a file to disk. Using a concurrent server, it is possible (likely even) that this function could be called by two threads concurrently. Looking at the source code, it seems that wrapping my function up in django.db.transaction will keep both my db operations and my non-db operations atomic. Is this correct?
UPDATE: What I would really like is not just a yes or no answer but a link to an explanation or a comment on what exactly the thread stuff going on in enter_transaction_management in django.db.transaction.py is doing.

By "Django transaction", I assume that you mean the transactions in django.db.transaction?
And, if that's the case - no. They only pertain to database transactions (i.e., they will only issue a BEGIN and then a COMMIT or ROLLBACK).

NO it won't. Transactions are specific to the database, and are handled much differently than IPC locking.
You should add a process identifier to the name of the file you are writing, to make sure it is unique. Otherwise, lock the file to make sure you are the only one writing to it.
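Both suggestions can be sketched in a few lines of Python. This is only an illustration under assumptions: `unique_path`, `append_line` and `_write_lock` are hypothetical names, and the `threading.Lock` used here only serializes threads inside one process. Separate server processes would need an OS-level file lock instead. Thread identifiers can also be reused once a thread has exited, so the generated names are only distinct for threads that are alive at the same time.

```python
import os
import threading

# Hypothetical module-level lock; it only protects threads in this process.
_write_lock = threading.Lock()


def unique_path(directory, stem):
    """Build a file name that differs per process and per live thread."""
    return os.path.join(
        directory, f"{stem}.{os.getpid()}.{threading.get_ident()}.txt"
    )


def append_line(path, line):
    """Append one line to a shared file, one thread at a time."""
    with _write_lock:
        with open(path, "a", encoding="utf-8") as handle:
            handle.write(line + "\n")
```

Neither helper involves the database at all, which is the point: the file needs its own protection regardless of any transaction around it.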

Related

Multiple db inserts with Django performance is not increased by parallel threads

I'm doing thousands and thousands of inserts to a PostgreSQL database with Python and Django (using the CLI, so no web server at all).
The objects that are inserted are already in memory, and I'm popping them one by one from a FIFO queue (using Python's native https://docs.python.org/2/library/queue.html).
What I'm doing basically is:
args1, args2 = queue.get()
m1, _ = Model1.objects.get_or_create(args1)
Model2.objects.create(m1, args2)
I was thinking a way to do this faster was to spawn a few more threads that can do this in parallel. To my surprise the performance actually decreased slightly. I was expecting an almost linear improvement in relation to the number of threads, so I'm not sure what's going on.
Is there something database specific I'm missing, are there table locks that are blocking the threads when this is running?
Or does it have something to do with that each thread can only access a single database connection atomically during runtime?
I have standard configuration for PostgreSQL (9.3) and Django (1.7.7) installed with apt-get on Debian Jessie.
Also I tried with 4 threads, which is the same number of CPUs I have available on my box.
There are a few things going on here.
Firstly you are using very high level ORM methods (get_or_create, create). Those are generally not a good fit for bulk operations since methods like that tend to have a lot of overhead to provide a nice API and also do additional work to prevent users from shooting themselves in the foot too easily.
Secondly your careful use of a queue is very counterproductive in multiple ways:
Due to Django running in autocommit mode by default, each database operation is carried out in its own transaction. Since a transaction is a relatively expensive operation, this causes unnecessary overhead.
Inserting each object by itself also causes a lot more back and forth communication between the database and Django, which again produces overhead, slowing things down.
Thirdly, the reason using multiple threads is even slower stems from the fact that Python has a GIL (Global Interpreter Lock). This prevents multiple threads from executing Python code at the same time. There is a lot of material on the web about the whys and hows of the GIL and what can be done in which circumstances to mitigate it. There is a nice summary by David Beazley about the GIL that should get you started if you're interested in learning more about it.
Additionally I'd generally recommend against doing large inserts from multiple threads in any language since - depending on your database and data model - this can also cause slowdowns inside the database due to possibly required locking.
Now there are many solutions to your problem but I'd recommend to start with a simple one:
Django actually provides a handy low-level interface to create models in bulk, fittingly enough called bulk_create(). I'd suggest removing all that fancy queue and thread code and using this interface as directly as possible with the data you already have.
In case this isn't sufficient for your case, a possible alternative would be to generate an INSERT INTO statement from the data and execute that directly on the database.
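The difference between per-row commits and one batched transaction can be sketched without Django. The example below is an assumption-laden stand-in: it uses the standard library's sqlite3 module and a made-up `item` table instead of PostgreSQL and the asker's models. In Django the corresponding tools would be `transaction.atomic()` and `Model.objects.bulk_create([...])`.

```python
import sqlite3


def insert_one_by_one(conn, rows):
    """Commit after every row, which mimics autocommit behaviour."""
    for row in rows:
        conn.execute("INSERT INTO item (name, qty) VALUES (?, ?)", row)
        conn.commit()


def insert_in_one_batch(conn, rows):
    """Send all rows in a single transaction with one commit."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany("INSERT INTO item (name, qty) VALUES (?, ?)", rows)


# Hypothetical table and data, only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, qty INTEGER)")
rows = [(f"item-{i}", i) for i in range(1000)]
insert_in_one_batch(conn, rows)
print(conn.execute("SELECT COUNT(*) FROM item").fetchall()[0][0])
```

Timing the two functions against a real on-disk database is a reasonable way to check how much the per-transaction overhead matters in a given setup.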
If all you want to achieve is insertion, could you just use the save() method instead of get_or_create()? get_or_create() queries the database first. If the table is large, that query can be a bottleneck, and that's probably why having multiple parallel threads does not help.
The other possibility is the insertion itself. Postgres by default enables auto-commit on a per insert (transaction) basis. The committing process involves complex mechanisms under the hood. Long story short, you may try disabling auto-commit and see if that helps in your particular case. A relevant article is here.

Application of Shared Read Locks

What is the need for a shared read lock?
I can understand that write locks have to be exclusive only. But what is the need for many clients to access the document simultaneously and still share only read privilege? Practical applications of Shared read locks would be of great help too.
Please move the question to any other forum you'd find it appropriate to be in.
Though this is a question purely related to ABAP programming and theory I'm doing, I'm guessing the applications are generic to all languages.
Thanks!
If you do complex and time-consuming calculations based on multiple datasets (e.g. postings), you have to ensure that none of these datasets is changed while you're working - otherwise the calculations might be wrong. Most of the time, the ACID principles will ensure this, but sometimes that's not enough - for example, if the data source is so large that you have to break it up into parallel subtasks, or if you have to call some function that performs a database commit or rollback internally. In this case, transaction isolation is no longer enough, and you need to lock the entity on a logical level.
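To make the idea concrete outside ABAP, here is a minimal Python sketch of a shared/exclusive lock. `SharedExclusiveLock` is a hypothetical class written for this illustration, not a standard library type, and it makes no fairness promise: a steady stream of readers can keep a writer waiting.

```python
import threading


class SharedExclusiveLock:
    """Many readers may hold the lock together; a writer holds it alone."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writing = False

    def acquire_shared(self):
        with self._cond:
            while self._writing:  # readers only wait for an active writer
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()  # a waiting writer may proceed now

    def acquire_exclusive(self):
        with self._cond:
            while self._writing or self._readers:  # wait for everyone to leave
                self._cond.wait()
            self._writing = True

    def release_exclusive(self):
        with self._cond:
            self._writing = False
            self._cond.notify_all()
```

Several long-running calculations can each take the shared lock and read the same postings at once, while anything that wants to change them has to wait until every reader has released it.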

Testing concurrent data structure

What are some methods for testing concurrent data structures to make sure they behave correctly when accessed from multiple threads?
All of the other answers have focused on actually testing the code by putting it through its paces and actually running it in one form or another or politely saying "don't do it yourself, use an existing library".
This is great and all, but IMO, the most important (practical tests are important too) test is to look at the code line by line and for every line of code ask "what happens if I get interrupted by another thread here?" Imagine another thread, running just about any of the other lines/functions during this interruption. Do things still stay consistent? When competing for resources, does the other thread[s] block or spin?
This is what we did in school when learning about concurrency and it is a surprisingly effective approach. Bottom line, I feel that taking the time to prove to yourself that things are consistent and work as expected in all states is the first technique you should use when dealing with this stuff.
Concurrent systems are probabilistic and errors are often difficult to replicate. Therefore you need to run various input/output cases, each tested over time (hours, days, etc) in order to detect possible errors.
Testing a concurrent data structure involves examining the container's state before and after expected events such as insert and delete.
Use a pre-existing, pre-tested library that meets your needs if possible.
Make sure that the code has appropriate self-consistency checks (preferably fast sanity checks), and run your code on as many different types of hardware as possible to help narrow down interesting timing problems.
Have multiple people peer review the code, preferably without a pre-explanation of how it's supposed to work. That way they have to grok the code which should help catch more bugs.
Set up a bunch of threads that do nothing but random operations on the data structures and check for consistency at some rate.
Start with the assumption that your calls to access/modify data are not thread safe and use locks to ensure only a single thread can access/modify any part of the data at a time. Only after you can prove to yourself that a specific type of access is safe outside of the lock by multiple threads at once should you move that code outside of the lock.
Assume worst case scenarios, e.g. that your code will stop right in the middle of some pointer manipulation or another critical point, and that another thread will encounter that data in mid-transition. If that would have a bad result, leave it within the lock.
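The "bunch of threads doing random operations" idea could look roughly like the sketch below. `LockedCounterMap` is a hypothetical structure under test and `stress` is a made-up helper; the consistency check is simply that the structure's total equals the sum of everything the threads added.

```python
import random
import threading


class LockedCounterMap:
    """Hypothetical structure under test: a dict of counters behind one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}

    def add(self, key, amount):
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + amount

    def total(self):
        with self._lock:
            return sum(self._counts.values())


def stress(structure, n_threads=8, ops_per_thread=5000, seed=0):
    """Run random operations from many threads and return the expected total."""
    expected = [0] * n_threads

    def worker(index):
        rng = random.Random(seed + index)  # per-thread RNG, reproducible
        for _ in range(ops_per_thread):
            amount = rng.randint(-5, 5)
            structure.add(rng.choice("abcd"), amount)
            expected[index] += amount  # each thread writes only its own slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(expected)


counters = LockedCounterMap()
assert stress(counters) == counters.total()
```

A passing run does not prove the structure is correct; it only fails to find a problem, which is why the answers above combine it with code review and reasoning.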
I normally test these kinds of things by interjecting sleep() calls at appropriate places in the distributed threads/processes.
For instance, to test a lock, put sleep(2) in all your threads at the point of contention, and spawn two threads roughly 1 second apart. The first one should obtain the lock, and the second should have to wait for it.
Most race conditions can be tested by extending this method, but if your system has too many components it may be difficult or impossible to know every possible condition that needs to be tested.
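A small-scale version of the sleep technique, written as a sketch with shortened delays (0.2 and 0.1 seconds instead of 2 and 1) so it runs quickly. The names `critical_section` and `run_contention_test` are made up for this example.

```python
import threading
import time

lock = threading.Lock()
events = []  # (thread name, "enter" or "exit") in the order observed


def critical_section(name, hold_seconds):
    with lock:
        events.append((name, "enter"))
        time.sleep(hold_seconds)  # injected delay that widens the race window
        events.append((name, "exit"))


def run_contention_test(hold_seconds=0.2, stagger_seconds=0.1):
    events.clear()
    first = threading.Thread(target=critical_section, args=("first", hold_seconds))
    second = threading.Thread(target=critical_section, args=("second", hold_seconds))
    first.start()
    time.sleep(stagger_seconds)  # start the second thread while the first holds the lock
    second.start()
    first.join()
    second.join()
    return list(events)
```

If the lock works, each thread's "enter" is immediately followed by its own "exit" in the returned list. Removing the `with lock:` line should make the two threads interleave, which is a quick way to confirm the test can actually fail.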
Run your concurrent threads for one or a few days and see what happens. (Sounds strange, but finding race conditions is such a complex topic that simply trying it is the best approach.)

Is checking current thread inside a function ok?

Is it ok to check the current thread inside a function?
For example if some non-thread safe data structure is only altered by one thread, and there is a function which is called by multiple threads, it would be useful to have separate code paths depending on the current thread. If the current thread is the one that alters the data structure, it is ok to alter the data structure directly in the function. However, if the current thread is some other thread, the actual altering would have to be delayed, so that it is performed when it is safe to perform the operation.
Or, would it be better to use some boolean which is given as a parameter to the function to separate the different code paths?
Or do something totally different?
What do you think?
You are not making all too much sense. You said a non-thread safe data structure is only ever altered by one thread, but in the next sentence you talk about delaying any changes made to that data structure by other threads. Make up your mind.
In general, I'd suggest wrapping the access to the data structure up with a critical section, or mutex.
It's possible to use such animals as reader/writer locks to differentiate between readers and writers of data structures, but the performance advantage for typical cases usually won't merit the additional complexity associated with their use.
From the way your question is stated, I'm guessing you're fairly new to multithreaded development. I highly suggest sticking with the simplest and most commonly used approaches for ensuring data integrity (most books/articles you read on the issue will mention the same uses for mutexes/critical sections). Multithreaded development is extremely easy to get wrong and can be difficult to debug. Also, what seems like the "optimal" solution very often doesn't buy you the huge performance benefit you might think. It's usually best to implement the simplest approach that will work, then worry about optimizing it after the fact.
There is a trick that could work in case, as you said, the other threads will make changes only once in a while, although it is still rather hackish:
make sure your "master" thread can't be interrupted by the other ones (higher priority, non fair scheduling)
check your thread
if "master", just change
if other, put off scheduling, if needed by putting off interrupts, make change, reinstall scheduling
really test to see whether there are no issues in your setup.
As you can see, if requirements change a little bit, this could turn out worse than using normal locks.
As mentioned, the simplest solution when two threads need access to the same data is to use some synchronization mechanism (i.e. critical section or mutex).
If you already have synchronization in your design try to reuse it (if possible) instead of adding more. For example, if the main thread receives its work from a synchronized queue you might be able to have thread 2 queue the data structure update. The main thread will pick up the request and can update it without additional synchronization.
The queuing concept can be hidden from the rest of the design through the Active Object pattern. The active object may also be able to publish the data structure changes through the Observer pattern to other interested threads.
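A minimal sketch of that queue-based approach in Python, assuming a hypothetical `OwnedList` class: callers on any thread enqueue a change request, and only the owner thread ever touches the underlying list, so no check of the current thread and no extra lock around the list is needed.

```python
import queue
import threading


class OwnedList:
    """A plain list that is only ever mutated by its owner thread."""

    def __init__(self):
        self._items = []
        self._requests = queue.Queue()  # the one synchronized hand-off point
        self._owner = threading.Thread(target=self._run, daemon=True)
        self._owner.start()

    def _run(self):
        while True:
            request = self._requests.get()
            if request is None:  # sentinel: stop the owner thread
                break
            request(self._items)  # the mutation runs on the owner thread

    def append(self, value):
        """Callable from any thread; the change is applied later by the owner."""
        self._requests.put(lambda items: items.append(value))

    def close(self):
        """Stop the owner thread and return the final contents."""
        self._requests.put(None)
        self._owner.join()
        return list(self._items)
```

The trade-off is that `append` returns before the change is visible, which matches the "delayed" alteration described in the question.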

How to use SQLite in a multi-threaded application?

I'm developing an application with SQLite as the database, and am having a little trouble understanding how to go about using it in multiple threads (none of the other Stack Overflow questions really helped me, unfortunately).
My use case: The database has one table, let's call it "A", which has different groups of rows (based on one of their columns). I have the "main thread" of the application which reads the contents from table A. In addition, I decide, once in a while, to update a certain group of rows. To do this, I want to spawn a new thread, delete all the rows of the group, and re-insert them (that's the only way to do it in the context of my app). This might happen to different groups at the same time, so I might have 2+ threads trying to update the database.
I'm using different transactions from each thread, I.E. at the start of every thread's update cycle, I have a begin. In fact, what each thread actually does is call "BEGIN", delete from the database all the rows it needs to "update", and inserts them again with the new values (this is the way it must be done in the context of my application).
Now, I'm trying to understand how I go about implementing this. I've tried reading around (other answers on Stack Overflow, the SQLite site) but I haven't found all the answers. Here are some things I'm wondering about:
Do I need to call "open" and create a new sqlite structure from each thread?
Do I need to add any special code for all of this, or is it enough to spawn different threads, update the rows, and that's fine (since I'm using different transactions)?
I saw something talking about the different lock types there are, and the fact that I might receive "SQLite busy" from calling certain APIs, but honestly I didn't see any reference that completely explained when I need to take all this into account. Do I need to?
If anyone can answer the questions/point me in the direction of a good resource, I'd be very grateful.
UPDATE 1: From all that I've read so far, it seems like you can't have two threads that write to a database file at the same time anyway.
See: http://www.sqlite.org/lockingv3.html. In section 3.0: A RESERVED lock means that the process is planning on writing to the database file at some point in the future but that it is currently just reading from the file. Only a single RESERVED lock may be active at one time, though multiple SHARED locks can coexist with a single RESERVED lock.
Does this mean that I may as well only spawn off a single thread to update a group of rows each time? I.e., have some kind of poller thread which decides that I need to update some of the rows, and then creates a new thread to do it, but never more than one at a time? Since it looks like any other thread I create will just get SQLITE_BUSY until the first thread finishes, anyway.
Have I understood things correctly?
BTW, thanks for the answers so far, they've helped a lot.
Some steps when starting out with SQLite for multithreaded use:
Make sure sqlite is compiled with the multi threaded flag.
You must call open on your sqlite file to create a connection on each thread, don't share connections between threads.
SQLite has a very conservative threading model. When you do a write operation, which includes opening transactions that are about to do an INSERT/UPDATE/DELETE, other threads will be blocked until this operation completes.
If you don't use a transaction, then transactions are implicit, so if you start an INSERT/DELETE/UPDATE, SQLite will try to acquire an exclusive lock and complete the operation before releasing it.
If you do a BEGIN EXCLUSIVE statement, it will acquire an exclusive lock before doing operations in that transaction. A COMMIT or ROLLBACK will release the lock.
Your sqlite3_step, sqlite3_prepare and some other calls may return SQLITE_BUSY or SQLITE_LOCKED. SQLITE_BUSY usually means that sqlite needs to acquire the lock. The biggest difference between the two return values:
SQLITE_LOCKED: if you get this from a sqlite3_step statement, you MUST call sqlite3_reset on the statement handle. You should only get this on the first call to sqlite3_step, so once reset is called you can actually "retry" your sqlite3_step call. On other operations, it's the same as SQLITE_BUSY
SQLITE_BUSY : There is no need to call sqlite3_reset, just retry your operation after waiting a bit for the lock to be released.
Check out this link. The easiest way is to do the locking yourself, and to avoid sharing the connection between threads. Another good resource can be found here, and it concludes with:
Make sure you're compiling SQLite with -DTHREADSAFE=1.
Make sure that each thread opens the database file and keeps its own sqlite structure.
Make sure you handle the likely possibility that one or more threads collide when they access the db file at the same time: handle SQLITE_BUSY appropriately.
Make sure you enclose within transactions the commands that modify the database file, like INSERT, UPDATE, DELETE, and others.
I realize this is an old thread and the responses are good but I've been looking into this recently and came across an interesting analysis of some different implementations. Mainly it goes over the strengths and weaknesses of connection sharing, message passing, thread-local connections and connection pooling. Take a look at it here: http://dev.yorhel.nl/doc/sqlaccess
Modern versions of SQLite have thread safety enabled by default. The SQLITE_THREADSAFE compilation flag controls whether or not code is included in SQLite to enable it to operate safely in a multithreaded environment. The default value is SQLITE_THREADSAFE=1, which means Serialized mode. The documentation describes it as follows:
In this mode (which is the default when SQLite is compiled with SQLITE_THREADSAFE=1) the SQLite library will itself serialize access to database connections and prepared statements so that the application is free to use the same database connection or the same prepared statement in different threads at the same time.
Use the sqlite3_threadsafe() function to check the SQLITE_THREADSAFE compilation flag of the SQLite library.
Default library thread safety behavior can be changed via sqlite3_config(). Use SQLITE_OPEN_NOMUTEX and SQLITE_OPEN_FULLMUTEX flags at sqlite3_open_v2() to adjust the threading mode of individual database connections.
Check this code from the SQLite wiki.
I have done something similar with C and I uploaded the code here.
I hope it's useful.
Summary
Transactions in SQLite are SERIALIZABLE.
Changes made in one database connection are invisible to all other database connections prior to commit.
A query sees all changes that are completed on the same database connection prior to the start of the query, regardless of whether or not those changes have been committed.
If changes occur on the same database connection after a query starts running but before the query completes, then it is undefined whether or not the query will see those changes.
If changes occur on the same database connection after a query starts running but before the query completes, then the query might return a changed row more than once, or it might return a row that was previously deleted.
For the purposes of the previous four items, two database connections that use the same shared cache and which enable PRAGMA read_uncommitted are considered to be the same database connection, not separate database connections.
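The first three points of the summary can be observed with two connections to the same file. This is a small sketch using Python's sqlite3 module; the `note` table and the `count_notes` helper are made up for the example, and the connections do not use shared cache or read_uncommitted.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "isolation.db")

# isolation_level=None puts the module in autocommit mode, so BEGIN/COMMIT are explicit.
writer = sqlite3.connect(db_path, isolation_level=None)
reader = sqlite3.connect(db_path, isolation_level=None)


def count_notes(conn):
    """Count rows and close the cursor so no read lock lingers."""
    cur = conn.execute("SELECT COUNT(*) FROM note")
    n = cur.fetchone()[0]
    cur.close()
    return n


writer.execute("CREATE TABLE note (body TEXT)")

writer.execute("BEGIN")
writer.execute("INSERT INTO note (body) VALUES ('draft')")

seen_by_writer = count_notes(writer)         # same connection, before commit
seen_by_reader_before = count_notes(reader)  # other connection, before commit

writer.execute("COMMIT")
seen_by_reader_after = count_notes(reader)   # other connection, after commit

print(seen_by_writer, seen_by_reader_before, seen_by_reader_after)
```

The writer's own query counts the uncommitted row, while the second connection only counts it once the COMMIT has run.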
In addition to the above information on multi-threaded access, it might be worth taking a look at this page on isolation, as many things have changed since this original question and the introduction of the write-ahead log (WAL).
It seems a hybrid approach of having several connections open to the database provides adequate concurrency guarantees, trading off the expense of opening a new connection with the benefit of allowing multi-threaded write transactions.
If you use connection pooling, as in a Java EE web application, set the connection pool's maximum size to 1. Access will then be serialized.