My web app needs massive load from csv files. Files may have reference errors. How can I save "softly" each row and rollback all saved records if an error occurs?
I'm using django command.
You should want to use Transactions to guarantee atomicity on database.
This way you can set your block of code to persist on database only if all block is successfully completed. If any Exception occur, the transaction will be rolled back.
See this example code:
from django.db import transaction
def your_command_func():
# This code executes in autocommit mode (Django's default).
do_stuff()
with transaction.atomic():
# This block of code executes inside a transaction.
line = read_from_csv()
has_error = validate_line(line)
if has_error:
raise YourException("something went wrong.")
I have a Google App Engine Standard Environment application that has been working fine for a year or more, that, quite suddenly, refuses to enqueue new deferred tasks using deferred.defer.
Here's the Python 2.7 code that is making the deferred call:
# Find any inventory items that reference the product, and change them too.
# because this could take some time, we'll do it as a deferred task, and only
# if needed.
if upd:
updater = deferredtasks.InvUpdate()
deferred.defer(updater.run, product_key)
My app.yaml file has the necessary bits to support deferred.defer:
- url: /_ah/queue/deferred
script: google.appengine.ext.deferred.deferred.application
login: admin
builtins:
- deferred: on
And my deferred task has logging in it so I should see it running when it does:
#-------------------------------------------------------------------------------
# DEFERRED routine that updates the inventory items for a particular product. Should be callecd
# when ANY changes are made to the product, because it should trigger a re-download of the
# inventory record for that product to the iPad.
#-------------------------------------------------------------------------------
class InvUpdate(object):
def __init__(self):
self.to_put = []
self.product_key = None
self.updcount = 0
def run(self, product_key, batch_size=100):
updproduct = product_key.get()
if not updproduct:
logging.error("DEFERRED ERROR: Product key passed in does not exist")
return
logging.info(u"DEFERRED BEGIN: beginning inventory update for: {}".format(updproduct.name))
self.product_key = product_key
self._continue(None, batch_size)
...
When I run this in the development environment on my development box, everything works fine. Once I deploy it to the App Engine server, the inventory updates never get done (i.e. the deferred task is not executed), and there are no errors (and no other logging from the deferred task in fact) in the log files on the server. I know that with the sudden move to get everybody on Python 3 as quickly as possible, the deferred.defer library has been marked as not recommended because it only works with the 2.7 Python environment, and I planned on moving to task queues for this, but I wasn't expecting deferred.defer to suddenly stop working in the existing python environment.
Any insight would be greatly appreciated!
I'm pretty sure you cant pass the method of an instance to appengine taskqueue, because that instance will not get exist when your task runs since it will be running in a different process. I actually dont understand how your task ever worked when running remotely in the first place (and running locally is not an accurate representation of how things will run remotely)
Try changing your code to this:
if upd:
deferred.defer(deferredtasks.InvUpdate.run_cls, product_key)
and then InvUpdate is the same but has a new function run_cls:
class InvUpdate(object):
#classmethod
def run_cls(cls, product_key):
cls().run(product_key)
And I'm still on the process of migrating to cloud tasks and my deferred tasks still work
I have running celery with django. I import a stream of objects into my database by using tasks. each task imports one object. concurrency is 2. within the stream objects can be duplicated, but should not be inside my database.
code i'm running:
if qs.exists() and qs.count() == 1:
return qs.get()
elif qs.exists():
logger.exception('Multiple venues for same place')
raise ValueError('Multiple venues for same place')
else:
obj = self.create(**defaults)
problem is that if objects inside the stream are duplicate and very close to each other, the app still imports the same objects twice.
I assume that the db checks are not working properly with this concurrency setup. what architecture du you recommend to resolve this issue?
You have to use locking architecture, so will block the the two tasks from executing the object fetching part at the same time, you can use python-redis-lock to do that.
We developed a REST API using Django & mongoDB (PyMongo driver). The problem is that, on some requests to the API endpoints, PyMongo cursor returns a partial response which contains less documents than it should (but it’s a completely valid JSON document).
Let me explain it with an example of one of our views:
def get_data(key):
return collection.find({'key': key}, limit=24)
def my_view(request):
key = request.POST.get('key')
query = get_data(key)
res = [app for app in query]
return JsonResponse({'list': res})
We're sure that there is more than 8000 documents matching the query, but in
some calls we get less than 24 results (even zero). The first problem we've
investigated was that we had more than one MongoClient definition in our code. By resolving this, the number of occurrences of the problem decreased, but we still had it in a lot of calls.
After all of these investigations, we've designed a test in which we made 16 asynchronous requests at the same time to the server. With this approach, we could reproduce the problem. On each of these 16 requests, 6-8 of them had partial results. After running this test we reduced uWsgi’s number of processes to 6 and restarted the server. All results were good but after applying another heavy load on the server, the problem began again. At this point, we restarted uwsgi service and again everything was OK. With this last experiment we have a clue now that when the uwsgi service starts running, everything is working correctly but after a period of time and heavy load, the server begins to return partial or empty results again.
The latest investigation we had was to run the API using python manage.py with DEBUG=False, and we had the problem again after a period of time in this situation.
We can't figure out what the problem is and how to solve it. One reason that we can think of is that Django closes pymongo’s connections before completion. Because the returned result is a valid JSON.
Our stack is:
nginx (with no cache enabled)
uWsgi
MemCached (disabled during debugging procedure)
Django (v1.8 on python 3)
PyMongo (v3.0.3)
Your help is really appreciated.
Update:
Mongo version:
db version v3.0.7
git version: 6ce7cbe8c6b899552dadd907604559806aa2e9bd
We are running single mongod instance. No sharding/replicating.
We are creating connection using this snippet:
con = MongoClient('localhost', 27017)
Update 2
Subject thread in Pymongo issue tracker.
Pymongo cursors are not thread safe elements. So using them like what I did in a multi-threaded environment will cause what I've described in question. On the other hand Python's list operations are mostly thread safe, and changing snippet like this will solve the problem:
def get_data(key):
return list(collection.find({'key': key}, limit=24))
def my_view(request):
key = request.POST.get('key')
query = get_data(key)
res = [app for app in query]
return JsonResponse({'list': res})
My very speculative guess is that you are reusing a cursor somewhere in your code. Make sure you are initializing your collection within the view stack itself, and not outside of it.
For example, as written, if you are doing something like:
import ...
import con
collection = con.documents
# blah blah code
def my_view(request):
key = request.POST.get('key')
query = collection.find({'key': key}, limit=24)
res = [app for app in query]
return JsonResponse({'list': res})
You could end us reusing a cursor. Better to do something like
import ...
import con
# blah blah code
def my_view(request):
collection = con.documents
key = request.POST.get('key')
query = collection.find({'key': key}, limit=24)
res = [app for app in query]
return JsonResponse({'list': res})
EDIT at asker's request for clarification:
The reason you need to define the collection within the view stack and not when the file loads is that the collection variable has a cursor, which is basically how the database and your application talk to each other. Cursors do things like keep track of where you are in a long list of data, in addition to a bunch of other stuff, but thats the important part.
When you create the collection cursor outside the view method, it will re-use the cursor on each request if it exists. So, if you make one request, and then another really, really fast right after that (like what happened when you applied high load), the cursor might only be half way through talking to the database, and so some of your data goes to the first request, and some to the second. The reason you would get NO data in a request would be if a cursor finished fetching data but hadn't been closed yet, so the next request tried to fetch data from the cursor, and there was none left to fetch in the query.
By moving the collection definition (and by association, the cursor definition) into the view stack, you will ALWAYS get a new cursor when you process a new request. You wont get any cross talking between your cursors and different requests, as each request cycle will have its own.
I'm trying to port some code to Python that uses sqlite databases, and I'm trying to get transactions to work, and I'm getting really confused. I'm really confused by this; I've used sqlite a lot in other languages, because it's great, but I simply cannot work out what's wrong here.
Here is the schema for my test database (to be fed into the sqlite3 command line tool).
BEGIN TRANSACTION;
CREATE TABLE test (i integer);
INSERT INTO "test" VALUES(99);
COMMIT;
Here is a test program.
import sqlite3
sql = sqlite3.connect("test.db")
with sql:
c = sql.cursor()
c.executescript("""
update test set i = 1;
fnord;
update test set i = 0;
""")
You may notice the deliberate mistake in it. This causes the SQL script to fail on the second line, after the update has been executed.
According to the docs, the with sql statement is supposed to set up an implicit transaction around the contents, which is only committed if the block succeeds. However, when I run it, I get the expected SQL error... but the value of i is set from 99 to 1. I'm expecting it to remain at 99, because that first update should be rolled back.
Here is another test program, which explicitly calls commit() and rollback().
import sqlite3
sql = sqlite3.connect("test.db")
try:
c = sql.cursor()
c.executescript("""
update test set i = 1;
fnord;
update test set i = 0;
""")
sql.commit()
except sql.Error:
print("failed!")
sql.rollback()
This behaves in precisely the same way --- i gets changed from 99 to 1.
Now I'm calling BEGIN and COMMIT explicitly:
import sqlite3
sql = sqlite3.connect("test.db")
try:
c = sql.cursor()
c.execute("begin")
c.executescript("""
update test set i = 1;
fnord;
update test set i = 0;
""")
c.execute("commit")
except sql.Error:
print("failed!")
c.execute("rollback")
This fails too, but in a different way. I get this:
sqlite3.OperationalError: cannot rollback - no transaction is active
However, if I replace the calls to c.execute() to c.executescript(), then it works (i remains at 99)!
(I should also add that if I put the begin and commit inside the inner call to executescript then it behaves correctly in all cases, but unfortunately I can't use that approach in my application. In addition, changing sql.isolation_level appears to make no difference to the behaviour.)
Can someone explain to me what's happening here? I need to understand this; if I can't trust the transactions in the database, I can't make my application work...
Python 2.7, python-sqlite3 2.6.0, sqlite3 3.7.13, Debian.
For anyone who'd like to work with the sqlite3 lib regardless of its shortcomings, I found that you can keep some control of transactions if you do these two things:
set Connection.isolation_level = None (as per the docs, this means autocommit mode)
avoid using executescript at all, because according to the docs it "issues a COMMIT statement first" - ie, trouble. Indeed I found it interferes with any manually set transactions
So then, the following adaptation of your test works for me:
import sqlite3
sql = sqlite3.connect("/tmp/test.db")
sql.isolation_level = None
c = sql.cursor()
c.execute("begin")
try:
c.execute("update test set i = 1")
c.execute("fnord")
c.execute("update test set i = 0")
c.execute("commit")
except sql.Error:
print("failed!")
c.execute("rollback")
Per the docs,
Connection objects can be used as context managers that automatically
commit or rollback transactions. In the event of an exception, the
transaction is rolled back; otherwise, the transaction is committed:
Therefore, if you let Python exit the with-statement when an exception occurs, the transaction will be rolled back.
import sqlite3
filename = '/tmp/test.db'
with sqlite3.connect(filename) as conn:
cursor = conn.cursor()
sqls = [
'DROP TABLE IF EXISTS test',
'CREATE TABLE test (i integer)',
'INSERT INTO "test" VALUES(99)',]
for sql in sqls:
cursor.execute(sql)
try:
with sqlite3.connect(filename) as conn:
cursor = conn.cursor()
sqls = [
'update test set i = 1',
'fnord', # <-- trigger error
'update test set i = 0',]
for sql in sqls:
cursor.execute(sql)
except sqlite3.OperationalError as err:
print(err)
# near "fnord": syntax error
with sqlite3.connect(filename) as conn:
cursor = conn.cursor()
cursor.execute('SELECT * FROM test')
for row in cursor:
print(row)
# (99,)
yields
(99,)
as expected.
Python's DB API tries to be smart, and begins and commits transactions automatically.
I would recommend to use a DB driver that does not use the Python DB API, like apsw.
Here's what I think is happening based on my reading of Python's sqlite3 bindings as well as official Sqlite3 docs. The short answer is that if you want a proper transaction, you should stick to this idiom:
with connection:
db.execute("BEGIN")
# do other things, but do NOT use 'executescript'
Contrary to my intuition, with connection does not call BEGIN upon entering the scope. In fact it doesn't do anything at all in __enter__. It only has an effect when you __exit__ the scope, choosing either COMMIT or ROLLBACK depending on whether the scope is exiting normally or with an exception.
Therefore, the right thing to do is to always explicitly mark the beginning of your transactional with connection blocks using BEGIN. This renders isolation_level irrelevant within the block, because thankfully it only has an effect while autocommit mode is enabled, and autocommit mode is always suppressed within transaction blocks.
Another quirk is executescript, which always issues a COMMIT before running your script. This can easily mess up the transactional with connection block, so your choice is to either
use exactly one executescript within the with block and nothing else, or
avoid executescript entirely; you can call execute as many times as you want, subject to the one-statement-per-execute limitation.
Normal .execute()'s work as expected with the comfortable default auto-commit mode and the with conn: ... context manager doing auto-commit OR rollback - except for protected read-modify-write transactions, which are explained at the end of this answer.
sqlite3 module's non-standard conn_or_cursor.executescript() doesn't take part in the (default) auto-commit mode (and so doesn't work normally with the with conn: ... context manager) but forwards the script rather raw. Therefor it just commits a potentially pending auto-commit transactions at start, before "going raw".
This also means that without a "BEGIN" inside the script executescript() works without a transaction, and thus no rollback option upon error or otherwise.
So with executescript() we better use a explicit BEGIN (just as your inital schema creation script did for the "raw" mode sqlite command line tool). And this interaction shows step by step whats going on:
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>> conn.executescript("BEGIN; UPDATE TEST SET i = 1; FNORD; COMMIT""")
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
OperationalError: near "FNORD": syntax error
>>> list(conn.execute('SELECT * FROM test'))
[(1,)]
>>> conn.rollback()
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>>
The script didn't reach the "COMMIT". And thus we could the view the current intermediate state and decide for rollback (or commit nevertheless)
Thus a working try-except-rollback via excecutescript() looks like this:
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>> try: conn.executescript("BEGIN; UPDATE TEST SET i = 1; FNORD; COMMIT""")
... except Exception as ev:
... print("Error in executescript (%s). Rolling back" % ev)
... conn.executescript('ROLLBACK')
...
Error in executescript (near "FNORD": syntax error). Rolling back
<sqlite3.Cursor object at 0x011F56E0>
>>> list(conn.execute('SELECT * FROM test'))
[(99,)]
>>>
(Note the rollback via script here, because no .execute() took over commit control)
And here a note on the auto-commit mode in combination with the more difficult issue of a protected read-modify-write transaction - which made #Jeremie say "Out of all the many, many things written about transactions in sqlite/python, this is the only thing that let me do what I want (have an exclusive read lock on the database)." in a comment on an example which included a c.execute("begin"). Though sqlite3 normally does not make a long blocking exclusive read lock except for the duration of the actual write-back, but more clever 5-stage locks to achieve enough protection against overlapping changes.
The with conn: auto-commit context does not already put or trigger a lock strong enough for protected read-modify-write in the 5-stage locking scheme of sqlite3. Such lock is made implicitely only when the first data-modifying command is issued - thus too late.
Only an explicit BEGIN (DEFERRED) (TRANSACTION) triggers the wanted behavior:
The first read operation against a database creates a SHARED lock and
the first write operation creates a RESERVED lock.
So a protected read-modify-write transaction which uses the programming language in general way (and not a special atomic SQL UPDATE clause) looks like this:
with conn:
conn.execute('BEGIN TRANSACTION') # crucial !
v = conn.execute('SELECT * FROM test').fetchone()[0]
v = v + 1
time.sleep(3) # no read lock in effect, but only one concurrent modify succeeds
conn.execute('UPDATE test SET i=?', (v,))
Upon failure such read-modify-write transaction could be retried a couple of times.
You can use the connection as a context manager. It will then automatically rollback the transactions in the event of an exception or commit them otherwise.
try:
with con:
con.execute("insert into person(firstname) values (?)", ("Joe",))
except sqlite3.IntegrityError:
print("couldn't add Joe twice")
See https://docs.python.org/3/library/sqlite3.html#using-the-connection-as-a-context-manager
This is a bit old thread but if it helps I've found that doing a rollback on the connection object does the trick.