I'm porting some code that uses sqlite databases to Python, and I'm trying to get transactions to work, but I'm thoroughly confused. I've used sqlite a lot in other languages, because it's great, but I simply cannot work out what's wrong here.
Here is the schema for my test database (to be fed into the sqlite3 command line tool).
BEGIN TRANSACTION;
CREATE TABLE test (i integer);
INSERT INTO "test" VALUES(99);
COMMIT;
Here is a test program.
import sqlite3
sql = sqlite3.connect("test.db")
with sql:
    c = sql.cursor()
    c.executescript("""
        update test set i = 1;
        fnord;
        update test set i = 0;
        """)
You may notice the deliberate mistake in it. This causes the SQL script to fail on the second line, after the update has been executed.
According to the docs, the with sql statement is supposed to set up an implicit transaction around the contents, which is only committed if the block succeeds. However, when I run it, I get the expected SQL error... but the value of i is set from 99 to 1. I'm expecting it to remain at 99, because that first update should be rolled back.
Here is another test program, which explicitly calls commit() and rollback().
import sqlite3
sql = sqlite3.connect("test.db")
try:
    c = sql.cursor()
    c.executescript("""
        update test set i = 1;
        fnord;
        update test set i = 0;
        """)
    sql.commit()
except sql.Error:
    print("failed!")
    sql.rollback()
This behaves in precisely the same way --- i gets changed from 99 to 1.
Now I'm calling BEGIN and COMMIT explicitly:
import sqlite3
sql = sqlite3.connect("test.db")
try:
    c = sql.cursor()
    c.execute("begin")
    c.executescript("""
        update test set i = 1;
        fnord;
        update test set i = 0;
        """)
    c.execute("commit")
except sql.Error:
    print("failed!")
    c.execute("rollback")
This fails too, but in a different way. I get this:
sqlite3.OperationalError: cannot rollback - no transaction is active
However, if I replace the calls to c.execute() with c.executescript(), then it works (i remains at 99)!
(I should also add that if I put the begin and commit inside the inner call to executescript then it behaves correctly in all cases, but unfortunately I can't use that approach in my application. In addition, changing sql.isolation_level appears to make no difference to the behaviour.)
Can someone explain to me what's happening here? I need to understand this; if I can't trust the transactions in the database, I can't make my application work...
Python 2.7, python-sqlite3 2.6.0, sqlite3 3.7.13, Debian.
For anyone who'd like to work with the sqlite3 lib regardless of its shortcomings, I found that you can keep some control of transactions if you do these two things:
set Connection.isolation_level = None (as per the docs, this means autocommit mode)
avoid using executescript at all, because according to the docs it "issues a COMMIT statement first" - i.e., trouble. Indeed, I found it interferes with any manually started transactions
So then, the following adaptation of your test works for me:
import sqlite3
sql = sqlite3.connect("/tmp/test.db")
sql.isolation_level = None
c = sql.cursor()
c.execute("begin")
try:
    c.execute("update test set i = 1")
    c.execute("fnord")
    c.execute("update test set i = 0")
    c.execute("commit")
except sql.Error:
    print("failed!")
    c.execute("rollback")
Per the docs,
Connection objects can be used as context managers that automatically
commit or rollback transactions. In the event of an exception, the
transaction is rolled back; otherwise, the transaction is committed:
Therefore, if you let Python exit the with-statement when an exception occurs, the transaction will be rolled back.
import sqlite3
filename = '/tmp/test.db'
with sqlite3.connect(filename) as conn:
    cursor = conn.cursor()
    sqls = [
        'DROP TABLE IF EXISTS test',
        'CREATE TABLE test (i integer)',
        'INSERT INTO "test" VALUES(99)',]
    for sql in sqls:
        cursor.execute(sql)

try:
    with sqlite3.connect(filename) as conn:
        cursor = conn.cursor()
        sqls = [
            'update test set i = 1',
            'fnord',  # <-- trigger error
            'update test set i = 0',]
        for sql in sqls:
            cursor.execute(sql)
except sqlite3.OperationalError as err:
    print(err)
    # near "fnord": syntax error

with sqlite3.connect(filename) as conn:
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM test')
    for row in cursor:
        print(row)
        # (99,)
yields
(99,)
as expected.
Python's DB API tries to be smart, and begins and commits transactions automatically.
I would recommend using a DB driver that does not use the Python DB API, like apsw.
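For instance, here is a minimal sketch of the question's failing script using apsw, which leaves transaction control entirely to you (this assumes apsw is installed and test.db is set up as in the question):

import apsw

con = apsw.Connection("test.db")
cur = con.cursor()
try:
    cur.execute("BEGIN")
    cur.execute("update test set i = 1")
    cur.execute("fnord")                  # deliberate error, as in the question
    cur.execute("update test set i = 0")
    cur.execute("COMMIT")
except apsw.Error:
    print("failed!")
    cur.execute("ROLLBACK")               # i stays at 99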
Here's what I think is happening based on my reading of Python's sqlite3 bindings as well as official Sqlite3 docs. The short answer is that if you want a proper transaction, you should stick to this idiom:
with connection:
    connection.execute("BEGIN")
    # do other things, but do NOT use 'executescript'
Contrary to my intuition, with connection does not call BEGIN upon entering the scope. In fact it doesn't do anything at all in __enter__. It only has an effect when you __exit__ the scope, choosing either COMMIT or ROLLBACK depending on whether the scope is exiting normally or with an exception.
Therefore, the right thing to do is to always explicitly mark the beginning of your transactional with connection blocks using BEGIN. This renders isolation_level irrelevant within the block, because thankfully it only has an effect while autocommit mode is enabled, and autocommit mode is always suppressed within transaction blocks.
Another quirk is executescript, which always issues a COMMIT before running your script. This can easily mess up the transactional with connection block, so your choice is to either
use exactly one executescript within the with block and nothing else, or
avoid executescript entirely; you can call execute as many times as you want, subject to the one-statement-per-execute limitation (the second option is sketched below).
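A minimal sketch of that second option, assuming the test table from the question and current Python 3 sqlite3 behaviour; the explicit BEGIN makes the whole block one transaction regardless of isolation_level:

import sqlite3

conn = sqlite3.connect("test.db")
try:
    with conn:
        conn.execute("BEGIN")
        conn.execute("update test set i = 1")
        conn.execute("fnord")                  # raises OperationalError
        conn.execute("update test set i = 0")
except sqlite3.OperationalError:
    print("failed!")                           # __exit__ has already rolled back; i stays at 99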
Normal .execute() calls work as expected with the comfortable default auto-commit mode and with the with conn: ... context manager doing auto-commit OR rollback - except for protected read-modify-write transactions, which are explained at the end of this answer.
The sqlite3 module's non-standard conn_or_cursor.executescript() doesn't take part in the (default) auto-commit mode (and so doesn't work normally with the with conn: ... context manager) but forwards the script more or less raw. Therefore it just commits any potentially pending auto-commit transaction at the start, before "going raw".
This also means that without a "BEGIN" inside the script, executescript() works without a transaction, and thus there is no rollback option upon error or otherwise.
So with executescript() we had better use an explicit BEGIN (just as your initial schema creation script did for the "raw" mode sqlite3 command line tool). This interactive session shows step by step what's going on:
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>> conn.executescript("BEGIN; UPDATE TEST SET i = 1; FNORD; COMMIT")
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
OperationalError: near "FNORD": syntax error
>>> list(conn.execute('SELECT * FROM test'))
[(1,)]
>>> conn.rollback()
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>>
The script didn't reach the "COMMIT", and thus we could view the current intermediate state and decide to roll back (or commit nevertheless).
Thus a working try-except-rollback via executescript() looks like this:
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>> try: conn.executescript("BEGIN; UPDATE TEST SET i = 1; FNORD; COMMIT")
... except Exception as ev:
... print("Error in executescript (%s). Rolling back" % ev)
... conn.executescript('ROLLBACK')
...
Error in executescript (near "FNORD": syntax error). Rolling back
<sqlite3.Cursor object at 0x011F56E0>
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>>
(Note the rollback via script here, because no .execute() took over commit control)
And here is a note on the auto-commit mode in combination with the more difficult issue of a protected read-modify-write transaction - which made @Jeremie say, "Out of all the many, many things written about transactions in sqlite/python, this is the only thing that let me do what I want (have an exclusive read lock on the database)," in a comment on an example which included a c.execute("begin"). (sqlite3 normally does not take a long blocking exclusive read lock except for the duration of the actual write-back; instead it uses a cleverer 5-stage locking scheme to achieve enough protection against overlapping changes.)
The with conn: auto-commit context does not by itself take or trigger a lock strong enough for a protected read-modify-write in sqlite3's 5-stage locking scheme. Such a lock is acquired implicitly only when the first data-modifying command is issued - which is too late.
Only an explicit BEGIN (DEFERRED) (TRANSACTION) triggers the wanted behavior:
The first read operation against a database creates a SHARED lock and
the first write operation creates a RESERVED lock.
So a protected read-modify-write transaction which uses the programming language in a general way (and not a special atomic SQL UPDATE clause) looks like this:
with conn:
    conn.execute('BEGIN TRANSACTION')   # crucial !
    v = conn.execute('SELECT * FROM test').fetchone()[0]
    v = v + 1
    time.sleep(3)   # no read lock in effect, but only one concurrent modify succeeds
    conn.execute('UPDATE test SET i=?', (v,))
Upon failure, such a read-modify-write transaction could be retried a couple of times, as sketched below.
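A hypothetical retry wrapper (the function name is mine) around that read-modify-write block might look like this; a losing concurrent writer typically surfaces as sqlite3.OperationalError ("database is locked"):

import sqlite3
import time

def increment_with_retry(conn, retries=3):
    for attempt in range(retries):
        try:
            with conn:
                conn.execute('BEGIN TRANSACTION')   # crucial, as above
                v = conn.execute('SELECT i FROM test').fetchone()[0]
                conn.execute('UPDATE test SET i = ?', (v + 1,))
            return
        except sqlite3.OperationalError:
            time.sleep(0.1 * (attempt + 1))         # back off, then retry
    raise RuntimeError("update failed after %d attempts" % retries)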
You can use the connection as a context manager. It will then automatically roll back the transaction in the event of an exception, or commit it otherwise.
try:
    with con:
        con.execute("insert into person(firstname) values (?)", ("Joe",))
except sqlite3.IntegrityError:
    print("couldn't add Joe twice")
See https://docs.python.org/3/library/sqlite3.html#using-the-connection-as-a-context-manager
This is a bit of an old thread, but if it helps, I've found that doing a rollback on the connection object does the trick.
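A minimal sketch of what that looks like, assuming the test database from the question and individual execute() calls (executescript would defeat the implicit transaction, as discussed above):

import sqlite3

conn = sqlite3.connect("test.db")
try:
    conn.execute("update test set i = 1")
    conn.execute("fnord")          # fails
    conn.commit()
except sqlite3.Error:
    conn.rollback()                # undoes the partial update; i stays at 99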
My web app needs to do massive loads from CSV files. The files may have reference errors. How can I save each row "softly" and roll back all saved records if an error occurs?
I'm using a Django management command.
You want to use transactions to guarantee atomicity in the database.
That way, your block of code persists to the database only if the whole block completes successfully. If any exception occurs, the transaction is rolled back.
See this example code:
from django.db import transaction
def your_command_func():
    # This code executes in autocommit mode (Django's default).
    do_stuff()

    with transaction.atomic():
        # This block of code executes inside a transaction.
        line = read_from_csv()
        has_error = validate_line(line)
        if has_error:
            raise YourException("something went wrong.")
When you do:
@transaction.atomic
def update_db():
    do_bulk_update()
while the function is running, does it lock the database?
I'm asking regarding django's atomic transaction:
https://docs.djangoproject.com/en/1.10/topics/db/transactions/#autocommit-details
(I'm assuming modern SQL databases in this answer.)
tl;dr
Transactions are not locks, but they hold locks that are acquired automatically during operations. And Django does not add any locking by default, so the answer is no, it does not lock the database.
E.g. if you were to do:
@transaction.atomic
def update_db():
    cursor.execute("UPDATE app_model SET model_name = 'bob' WHERE model_id = 1;")
    # some other stuff...
You will have locked the app_model row with id 1 for the duration of "other stuff". But it is not locked until that query. So if you want to ensure consistency you should probably use locks explicitly.
Transactions
As said, transactions are not locks, because that would be awful for performance. In general they are lighter-weight mechanisms, in the first instance for ensuring that if you make a load of changes that wouldn't make sense one at a time to other users of the database, those changes appear to happen all at once. I.e. they are atomic. Transactions do not block other users from mutating the database, and indeed in general do not block other users from mutating the same rows you may be reading.
See this guide and your databases docs (e.g. postgres) for more details on how transactions are protected.
Django implementation of atomic.
Django itself does the following when you use the atomic decorator (referring to the code); a rough sketch of the outermost case is included after the steps below.
Not already in an atomic block
Disables autocommit. Autocommit is an application level feature which will always commit transactions immediately, so it looks to the application like there is never a transaction outstanding.
This tells the database to start a new transaction.
At this point psycopg2 for postgres sets the isolation level of the transaction to READ COMMITTED, which means that any reads in the transaction will only return committed data, which means if another transaction writes, you won't see that change until it commits it. It does mean though that if that transaction commits during your transaction, you may read again and see that the value has changed during your transaction.
Obviously this means that the database is not locked.
Runs your code. Any queries / mutations you make are not committed.
Commits the transaction.
Re-enables autocommit.
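Roughly, a manual equivalent of this outermost case using Django's low-level transaction API might look like the sketch below (illustrative only; the real atomic() also manages savepoints, errors, and connection state, and do_queries() is a hypothetical placeholder):

from django.db import transaction

transaction.set_autocommit(False)     # leave autocommit mode; the next query starts a transaction
try:
    do_queries()                      # your code runs; nothing is committed yet
    transaction.commit()              # commit the transaction
except Exception:
    transaction.rollback()
    raise
finally:
    transaction.set_autocommit(True)  # re-enable autocommit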
In an earlier atomic block
Basically in this case we try to use savepoints so we can revert back to them if we "rollback" the "transaction", but as far as the database connection is concerned we are in the same transaction.
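For example, a nested atomic() block only creates a savepoint, so rolling it back does not abort the outer transaction (the function and exception names here are hypothetical):

from django.db import transaction

def outer():
    with transaction.atomic():           # starts the real database transaction
        do_first_part()
        try:
            with transaction.atomic():   # nested: only issues a SAVEPOINT
                do_risky_part()
        except SomeError:
            pass                         # only the inner savepoint is rolled back
        do_more()                        # still inside the outer transaction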
Automatic locking
As said, the database may give your transaction some automatic locks, as outlined in this doc. To demonstrate this, consider the following code that operates on a postgres database with one table and one row in it:
my_table
id | age
---+----
1 | 50
And then you run this code:
import psycopg2 as Database
from multiprocessing import Process
from time import sleep
from contextlib import contextmanager

@contextmanager
def connection():
    conn = Database.connect(
        user='daphtdazz', host='localhost', port=5432, database='db_test'
    )
    try:
        yield conn
    finally:
        conn.close()

def connect_and_mutate_after_seconds(seconds, age):
    with connection() as conn:
        curs = conn.cursor()
        print('execute update age to %d...' % (age,))
        curs.execute('update my_table set age = %d where id = 1;' % (age,))
        print('sleep after update age to %d...' % (age,))
        sleep(seconds)
        print('commit update age to %d...' % (age,))
        conn.commit()

def dump_table():
    with connection() as conn:
        curs = conn.cursor()
        curs.execute('select * from my_table;')
        print('table: %s' % (curs.fetchall(),))

if __name__ == '__main__':
    p1 = Process(target=connect_and_mutate_after_seconds, args=(2, 99))
    p1.start()
    sleep(0.6)
    p2 = Process(target=connect_and_mutate_after_seconds, args=(1, 100))
    p2.start()
    p2.join()
    dump_table()
    p1.join()
    dump_table()
You get:
execute update age to 99...
sleep after update age to 99...
execute update age to 100...
commit update age to 99...
sleep after update age to 100...
commit update age to 100...
table: [(1, 100)]
table: [(1, 100)]
The point is that the second process starts before the first one's transaction completes, but after it has issued its update command, so the second process has to wait for the lock. That is why we don't see "sleep after update age to 100" until after the commit for age 99.
If you put the sleep before the exec, you get:
sleep before update age to 99...
sleep before update age to 100...
execute update age to 100...
commit update age to 100...
table: [(24, 3), (100, 2)]
execute update age to 99...
commit update age to 99...
table: [(24, 3), (99, 2)]
This indicates the lock had not been acquired by the time the second process got to its update, which therefore happens first, even though it occurs during the first process's transaction.
As described in @daphtdazz's answer, Django doesn't acquire any locks when you open a transaction, but as you update the data, the database may acquire automatic locks. The type and scope of the locks are database dependent and may also depend on the transaction isolation level. Refer to your database's documentation for the details of these automatic locks.
There are a few options if you want to take locks manually.
The main and simplest one is doing a select_for_update() query. This will acquire an update lock that blocks all other updates to the rows matching the query. This is the same lock that is automatically acquired when you update a row in a transaction, but select_for_update() allows you to acquire the update lock before actually making the update, which can often be useful (a sketch follows after these options).
If row locking isn't suitable for your situation, you can acquire advisory locks in databases that support them (e.g. Postgres). Out of the box, Django doesn't support this, but there are third-party packages that add support for advisory locks to Django, or you can simply issue the appropriate raw SQL query.
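A minimal sketch of the select_for_update() option (the model is hypothetical); note that it must be used inside a transaction:

from django.db import transaction
from myapp.models import Account   # hypothetical model

def add_bonus(account_id):
    with transaction.atomic():
        # Locks the selected row until the transaction ends; concurrent
        # select_for_update() or UPDATE queries on that row will block.
        account = Account.objects.select_for_update().get(pk=account_id)
        account.balance += 100
        account.save()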
While executing simple SQL queries, cx_Oracle (5.1.3) throws some random exceptions intermittently.
def execute_query(name, asize=100, mapper=default_mapper, **kwargs):
    with session_pool.acquire() as conn:
        sql = read_sql(name)
        cursor = conn.cursor()
        cursor.arraysize = asize
        cursor.prepare(sql, name)
        cursor.execute(None, kwargs)
        return mapper(cursor)
read_sql(name) simply reads the query string from a file, while session_pool is a cx_Oracle.SessionPool object with plenty of idle connections. mapper is a row mapper function that returns a list of dicts.
Most calls will return no rows, but occasionally it randomly throws one of:
ORA-03106: fatal two-task communication protocol error
ORA-01002: fetch out of sequence
ORA-01403: no data found
ORA-01013: user requested cancel of current operation
Also, the query sometimes just blocks for over a minute before continuing. It is a static query against a single table with one bind variable; it can't get any simpler than that. Also, there is only a single thread executing this. Has anyone had even a remotely similar experience?
We developed a REST API using Django & MongoDB (the PyMongo driver). The problem is that, on some requests to the API endpoints, the PyMongo cursor returns a partial response which contains fewer documents than it should (but it's a completely valid JSON document).
Let me explain it with an example of one of our views:
def get_data(key):
    return collection.find({'key': key}, limit=24)

def my_view(request):
    key = request.POST.get('key')
    query = get_data(key)
    res = [app for app in query]
    return JsonResponse({'list': res})
We're sure that there are more than 8000 documents matching the query, but in some calls we get fewer than 24 results (even zero). The first problem we investigated was that we had more than one MongoClient definition in our code. After resolving this, the number of occurrences of the problem decreased, but we still had it in a lot of calls.
After all of these investigations, we designed a test in which we made 16 asynchronous requests at the same time to the server. With this approach, we could reproduce the problem: of these 16 requests, 6-8 had partial results. After running this test we reduced uWSGI's number of processes to 6 and restarted the server. All results were good, but after applying another heavy load on the server, the problem began again. At this point, we restarted the uWSGI service and again everything was OK. With this last experiment we have a clue: when the uWSGI service starts running, everything works correctly, but after a period of time and heavy load, the server begins to return partial or empty results again.
Our latest investigation was to run the API using python manage.py with DEBUG=False, and the problem appeared again after a period of time in this situation too.
We can't figure out what the problem is or how to solve it. One reason we can think of is that Django closes PyMongo's connections before completion, since the returned result is still valid JSON.
Our stack is:
nginx (with no cache enabled)
uWsgi
MemCached (disabled during debugging procedure)
Django (v1.8 on python 3)
PyMongo (v3.0.3)
Your help is really appreciated.
Update:
Mongo version:
db version v3.0.7
git version: 6ce7cbe8c6b899552dadd907604559806aa2e9bd
We are running single mongod instance. No sharding/replicating.
We are creating connection using this snippet:
con = MongoClient('localhost', 27017)
Update 2
Subject thread in Pymongo issue tracker.
PyMongo cursors are not thread-safe, so using them the way I did in a multi-threaded environment causes what I've described in the question. On the other hand, Python's list operations are mostly thread-safe, and changing the snippet like this solves the problem:
def get_data(key):
    return list(collection.find({'key': key}, limit=24))

def my_view(request):
    key = request.POST.get('key')
    query = get_data(key)
    res = [app for app in query]
    return JsonResponse({'list': res})
My very speculative guess is that you are reusing a cursor somewhere in your code. Make sure you are initializing your collection within the view stack itself, and not outside of it.
For example, as written, if you are doing something like:
import ...
import con
collection = con.documents
# blah blah code
def my_view(request):
    key = request.POST.get('key')
    query = collection.find({'key': key}, limit=24)
    res = [app for app in query]
    return JsonResponse({'list': res})
You could end up reusing a cursor. Better to do something like:
import ...
import con
# blah blah code
def my_view(request):
    collection = con.documents
    key = request.POST.get('key')
    query = collection.find({'key': key}, limit=24)
    res = [app for app in query]
    return JsonResponse({'list': res})
EDIT at asker's request for clarification:
The reason you need to define the collection within the view stack, and not when the file loads, is that the collection variable has a cursor, which is basically how the database and your application talk to each other. Cursors do things like keep track of where you are in a long list of data, in addition to a bunch of other stuff, but that's the important part.
When you create the collection cursor outside the view method, it will reuse the cursor on each request if it exists. So, if you make one request, and then another really, really fast right after (like what happened when you applied high load), the cursor might only be halfway through talking to the database, and so some of your data goes to the first request and some to the second. The reason you would get NO data in a request is that the cursor had finished fetching data but hadn't been closed yet, so the next request tried to fetch data from the cursor and there was none left to fetch.
By moving the collection definition (and by association, the cursor definition) into the view stack, you will ALWAYS get a new cursor when you process a new request. You won't get any cross-talk between your cursors and different requests, as each request cycle will have its own.
We use an H2 database to execute tests. To isolate each test from the others, the database schema and basic data setup are dropped and re-created before each test.
Is it possible to create a restore point after the first setup of the database, and restore the data to that point before each test?
SCRIPT just creates an SQL file with all tables and data. That's not much different from our own initialization.
The question "database restore to particular state for testing" is the same, just for Oracle and Postgres.
An old question, but I find it is still relevant. AFAIK there is no restore-point support.
Here is a simple, yet fast approach to backup/restore.
Create a backup prior to running the first test:
Connection conn = DriverManager.getConnection("jdbc:h2:mem:myDatabase;DB_CLOSE_DELAY=-1;LOG=0");
Statement stat = conn.createStatement();
stat.execute("SCRIPT TO 'memFS:myDatabase.sql'");
stat.close();
conn.close();
Restore after each test:
Connection conn = DriverManager.getConnection("jdbc:h2:mem:myDatabase;DB_CLOSE_DELAY=-1;LOG=0");
Statement stat = conn.createStatement();
stat.execute("DROP ALL OBJECTS");
stat.close();
conn.close();
conn = DriverManager.getConnection("jdbc:h2:mem:myDatabase;DB_CLOSE_DELAY=-1;INIT=runscript from 'memFS:myDatabase.sql';LOG=0");
conn.close();
Note that the SHUTDOWN command turned out to be faster than DROP ALL OBJECTS, but it caused some issues (the connection pool was unable to re-establish connections).
I would not say the above approach is slow, far from it. But with a large database and thousands of tests there is still room for improvement, as the method above takes some time. I managed to achieve backup/restore that is a few times faster (~15 ms for a DB with ~350 tables) by manually composing a script that performs TRUNCATE TABLE, ALTER SEQUENCE and the INSERT of all initial data (it needs SET REFERENTIAL_INTEGRITY FALSE for the cleanup/restore procedure to be really fast). The code is cumbersome but was worth the effort.