I am using transaction.atomic as a context manager for transactions in Django 1.6. There is a block of code that I want inside a transaction; it makes a couple of network calls and some database writes. I am seeing very weird behaviour. Every once in a while (maybe 1 in 20 times) I notice a partial rollback happening without any exception having been raised, and the view executes without any errors. My application is hosted on Heroku and we use Heroku Postgres v9.2.8. Pseudo-code:
from django.db import transaction

def some_view(request):
    try:
        with transaction.atomic():
            network_call_1()
            db_write_1.save(update_fields=['col4'])
            db_write_2.save(update_fields=['col3'])
            db_write_3.save(update_fields=['col1'])
            network_call_2()
            db_write_4.save(update_fields=['col6'])
            db_write_5.bulk_create([object1, object2])
            db_write_6.bulk_create([object1, object2])
    except Exception as e:
        logger.error(e)
    return HttpResponse()
The behaviour that I have noticed is that, without any exception having been raised, either db writes 1-3 have rolled back and the rest have gone through, or db write 1 has rolled back and the rest have gone through, and so on. I don't understand why this should be happening. First, if there is a rollback, shouldn't it be a complete rollback of the transaction? And if there is a rollback, shouldn't an exception also be raised so that I know a rollback has happened? Every time this has happened, no exception was raised and the code just continued executing and returned a successful HttpResponse.
Relevant settings:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'root',
        'PASSWORD': 'root',
        'HOST': 'localhost',
        'PORT': '5432',
    },
}
CONN_MAX_AGE = None
This bug has had me baffled for days. Any clues will be of great help!
After hours of debugging, we have found the culprit.
When we start our application on gunicorn, it spawns workers. Every request arriving at the same worker uses the same Django DatabaseWrapper instance (PostgreSQL in our case), also referred to as a connection. If, in the middle of a transaction in one request, the worker receives another request, the second request resets the state of the connection, causing the transaction to behave in unexpected ways, as documented in this bug: https://code.djangoproject.com/ticket/21239
Sometimes the transaction doesn't get committed and there is no exception raised to let you know that happened. Sometimes parts of it do get committed while the rest is lost and it looks like a partial rollback.
We thought that a connection is thread-safe, but this bit of gunicorn patching magic makes sure that's not the case: https://github.com/benoitc/gunicorn/blob/18.0/gunicorn/management/commands/run_gunicorn.py#L16
Still open to suggestions on how to sidestep this issue if possible at all.
EDIT: Don't use the run_gunicorn management command to start Django. It does some funky patching which causes DB connections to not be thread-safe. The solution that worked for us is to just use "gunicorn myapp.wsgi:application -c gunicorn.conf". Django's persistent DB connections don't work with the gevent worker type yet, so avoid that worker type unless you want to run out of connections.
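For reference, a gunicorn config file is plain Python. A minimal sketch of the kind of config we pass with -c might look like this (the bind address, worker count, and timeout are illustrative values, not taken from our actual deployment):

```python
# gunicorn.conf -- illustrative values only.
# Start with: gunicorn myapp.wsgi:application -c gunicorn.conf

bind = "0.0.0.0:8000"
workers = 4                # sync workers: each handles one request at a time,
worker_class = "sync"      # so a request never shares a DB connection mid-transaction
timeout = 30               # seconds before a hung worker is killed and restarted
```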
Not a Django expert, but I do know Postgres. I agree with your assessment that this sounds like very atypical behaviour for a transaction: the rollback should be all-or-nothing, and there should be an exception. That being the case, can you be absolutely certain that this is a rollback-type situation? There are lots of other possible causes that could account for different data appearing in the database than you expected, and many of those scenarios would fit your observed occurrences better than a rollback does.
You haven't provided any specifics as to your data, but what I imagine is, you're seeing something like "I set the value of col4 to 'foo', but after the commit, the old value 'bar' is still in the database." Is that correct?
If so, then other possible causes could be:
The code that is supposed to be setting the 'foo' value is, on occasion, actually setting either the existing 'bar' value or a NULL value.
The code is setting the 'foo' value, but there is a data access layer (DAL) with a 'dirty' flag that is not being set (e.g. if the object is in a disconnected state), so when the commit happens, the DAL doesn't see it as a change it is supposed to write.
These are just a few examples to get you started. There are lots of other possible scenarios. The basic philosophy of debugging problems like this is similar to the problem of DDT and pelicans: since the database is at the top of the food chain, you can often see problems there that appear to be database problems but are actually caused somewhere else in your solution.
Good luck and hope that helps!
My 3 cents:
Exceptions
We're certain no exceptions have occurred. But are we? Your pseudo-code "handles" an exception by just logging it. Make sure no exceptions are being "handled" elsewhere with nothing but a log statement or a pass.
The partial rollback
We expect the whole transaction to be rolled back, not just part of it. Since Django 1.6, nested atomic blocks create savepoints, and a rollback only goes back to the most recent savepoint. Make sure there are no nested transactions. Perhaps you have transaction middleware active: check ATOMIC_REQUESTS and MIDDLEWARE_CLASSES. Maybe transactions are started inside those network_call functions.
Reproducing
Since that network_call code may block, try replacing those calls with mocks that time out (maybe not in production). If that results in 100% (partial) rollbacks, it should make locating the cause of the partial rollbacks much easier.
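A self-contained sketch of that idea using unittest.mock; the network_call_2 and view_body functions here are stand-ins for the code in the question, not the real view:

```python
import socket
from unittest import mock

def network_call_2():
    """Stand-in for the real (occasionally flaky) network call."""
    return "ok"

def view_body(network_call=network_call_2):
    """Stand-in for the work inside transaction.atomic()."""
    network_call()
    return "committed"

# Replace the call with a mock that always times out, so the rollback
# path runs deterministically instead of 1 in 20 times.
timing_out = mock.Mock(side_effect=socket.timeout("forced timeout"))
try:
    result = view_body(network_call=timing_out)
except socket.timeout:
    result = "rolled back"
```

With the real view, you would patch the call by its import path and then check the database to see exactly which writes survived.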
Let me just make few remarks first.
It is not necessary for an exception to be raised in this code for a rollback to occur.
Maybe there is some kind of timeout outside this code. Imagine the Python process being killed in the middle of the second network call: that particular exception would never be logged.
I would also recommend adding
raise
at the end of the except block; it will log and re-raise the same exception. Catching all exceptions is rarely a good idea.
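A minimal sketch of that pattern (some_view_body is a stand-in for the transactional work in the question):

```python
import logging

logger = logging.getLogger(__name__)

def some_view_body():
    """Stand-in for the code inside the transaction."""
    raise ValueError("boom")

def guarded():
    try:
        some_view_body()
    except Exception as e:
        logger.error(e)
        raise  # log, then re-raise the same exception instead of swallowing it
```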
Also, there might be a threading issue. Try importing threading and logging the current thread id alongside the exception in your logger. You may find out that you actually have more than one thread, so one has to wait on another.
Generally, it is not a good idea to make external calls in the middle of a transaction.
Do both your calls before you start the atomic block, so the transaction can be as fast as possible.
Hope this helps.
Related
So... in response to an API call I do:
i = CertainObject(paramA=1, paramB=2)
i.save()
now my writer database has a new record.
Processing can take a while, and I do not wish to hold up my response to the API caller, so on the next line I transfer the object ID to an async job using Celery:
run_async_job.delay(i.id)
Right away, or a few seconds later depending on the queue, run_async_job tries to load the record from the database with the ID provided. It's a gamble: sometimes it works, sometimes it doesn't, depending on whether the read replicas have caught up or not.
Is there a pattern to guarantee success without having to "sleep" for a few seconds before reading, or hoping for good luck?
Thanks.
The simplest way seems to be to use retries, as mentioned by Greg and Elrond in their answers. If you're using the @shared_task or @app.task decorators, you can use the following code snippet.
@shared_task(bind=True)
def your_task(self, certain_object_id):
    try:
        certain_obj = CertainObject.objects.get(id=certain_object_id)
        # Do your stuff
    except CertainObject.DoesNotExist as e:
        self.retry(exc=e, countdown=2 ** self.request.retries, max_retries=20)
I used an exponential countdown between retries. You can modify it according to your needs.
You can find the documentation for custom retry delays here.
There is also another document explaining exponential backoff at this link.
When you call retry, it'll send a new message using the same task id, and it'll take care to make sure the message is delivered to the same queue as the originating task. You can read more about this in the documentation here.
As writing and then immediately loading the record is a high priority, why not store it in a memory-based store like Memcached or Redis? Then, after some time, you can write it to the database using a periodic Celery job that runs, let's say, every minute. When the write to the DB is done, the job deletes the keys from Redis/Memcached.
You can keep the data in the memory-based store for a certain time, say one hour, while the data is needed most. You can also create a service method that checks whether the data is in memory or not.
django-redis is a great package for connecting to Redis (if you are using it as the broker in Celery).
Here is an example based on the Django cache:
# service method
from django.core.cache import cache

def get_object(obj_id, model_cls):
    obj_dict = cache.get(obj_id, None)  # checks if the obj id is in the cache, O(1)
    if obj_dict:
        return model_cls(**obj_dict)
    else:
        return model_cls.objects.get(id=obj_id)

# celery job
@app.task
def store_objects():
    logger.info("-" * 25)
    # you can use .bulk_create() to reduce DB hits and get faster DB entries
    for obj_id in cache.keys("foo_*"):
        CertainObject.objects.create(**cache.get(obj_id))
        cache.delete(obj_id)
    logger.info("-" * 25)
The simplest solution would be to catch any DoesNotExist error thrown at the start of the task, then schedule a retry. This can be done by converting run_async_job into a bound task:
@app.task(bind=True)
def run_async_job(self, object_id):
    try:
        instance = CertainObject.objects.get(id=object_id)
    except CertainObject.DoesNotExist:
        # retry() re-delivers the task with its original arguments
        return self.retry()
This article goes pretty deep into how you can handle read-after-write issues with replicated databases: https://medium.com/box-tech-blog/how-we-learned-to-stop-worrying-and-read-from-replicas-58cc43973638.
Like the author, I know of no foolproof catch-all way to handle read-after-write inconsistency.
The main strategy I've used before is to have some kind of expect_and_get(pk, max_attempts=10, delay_seconds=5) method that attempts to fetch the record, and attempts it max_attempts times, delaying delay_seconds seconds in between attempts. The idea is that it "expects" the record to exist, and so it treats a certain number of failures as just transient DB issues. It's a little more reliable than just sleeping for some time since it will pick up records quicker and hopefully delay the job execution much less often.
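A sketch of that expect_and_get idea; the signature comes from the paragraph above, while the injectable fetch callable and the DoesNotExist stand-in are assumptions for illustration:

```python
import time

class DoesNotExist(Exception):
    """Stand-in for Model.DoesNotExist."""

def expect_and_get(pk, fetch, max_attempts=10, delay_seconds=5):
    """Fetch a record we *expect* to exist, treating early misses as
    transient replica lag rather than real failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(pk)
        except DoesNotExist:
            if attempt == max_attempts:
                raise  # the record really is missing; give up
            time.sleep(delay_seconds)
```

In Django, fetch would be something like a lambda wrapping CertainObject.objects.get(id=pk), translating the model's DoesNotExist into the one above.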
Another strategy would be to delay returning from a special save_to_read method until read replicas have the value, either by synchronously pushing the new value to the read replicas somehow or just polling them all until they return the record. This way seems a little hackier IMO.
For a lot of your reads, you probably don't have to worry about read-after-write consistency:
If we’re rendering the name of the enterprise a user is part of, it’s really not that big a deal if in the incredibly rare occasion that an admin changes it, it takes a minute to have the change propagate to the enterprise’s users.
Imagine you have a User model in your web app, and that you need to keep this user in sync with an external service via an API. Thus, when you create a user locally, you need to create it remotely as well.
You have all your operations inside transaction.atomic() and you try to keep all your third-party API calls after the atomic block, which is reasonable.
But, a system being a system, it grows in complexity to the point where you have some really hard-to-remove third-party calls inside an update call.
That said, is there a way to extend Django's transaction mechanism by adding some callback functions, like rollback.add_callback(clean_3rdparty_user(user_id=134))?
That way I could guarantee that all necessary rollback actions are taken and my system stays in sync.
The author of Django's transaction hook code has this to say about why there is on_commit() but not on_rollback():
A rollback hook is even harder to implement robustly than a commit hook, since a variety of things can cause an implicit rollback. For instance, your database connection was dropped because your process was killed without a chance to shutdown gracefully: your rollback hook will never run.
Since rollbacks are typically triggered by an exception, a simple approach is to just catch any exceptions and run your undo code there.
try:
    with transaction.atomic():
        # Do database stuff
        # Do external stuff
except:
    # We know the database stuff has rolled back, so...
    # Undo external stuff
    raise
This is not particularly elegant. I agree with the following from the same source:
The solution is simple: instead of doing something during the atomic block (transaction) and then undoing it if the transaction fails, use on_commit to delay doing it in the first place until after the transaction succeeds. It’s a lot easier to undo something you never did in the first place!
But it sounds like you already agree with that as well.
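To make the contrast concrete, here is a toy illustration of the on_commit idea: side effects are queued during the transaction and only run if it commits, so a rollback has nothing to undo. This is a simplified stand-in, not Django's implementation; in real code you would call django.db.transaction.on_commit:

```python
class ToyTransaction:
    """Minimal sketch of deferring side effects until commit."""
    def __init__(self):
        self.callbacks = []

    def on_commit(self, func):
        self.callbacks.append(func)  # defer instead of doing it now

    def commit(self):
        for func in self.callbacks:
            func()  # side effects happen only after a successful commit

    def rollback(self):
        self.callbacks.clear()  # nothing to undo: we never did it

sent = []
txn = ToyTransaction()
txn.on_commit(lambda: sent.append("email"))
txn.rollback()   # the email is never sent

txn2 = ToyTransaction()
txn2.on_commit(lambda: sent.append("email"))
txn2.commit()    # now it is
```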
I'm unclear on the exact behaviour of Django in the face of database serialization errors in transactions.
The transaction.atomic() docs don't specify this behaviour as far as I can tell.
If the DB hits a consistency error while committing a transaction (e.g. another transaction updated a value that was read in the current transaction), then, reading django/db/transaction.py, it looks like the transaction will roll back and a DatabaseError will be raised to the calling code (e.g. through the transaction.atomic() context manager). Is this correct?
And, more importantly, are there cases when the transaction could be rolled back without the transaction.atomic wrapper receiving an exception?
(Note that I'm not asking about DatabaseErrors that are raised inside the context manager, as the docs clearly explain what happens to them. I'm asking only about database errors which occur during the commit of the transaction, which occurs on exit of the context manager.)
If the DB hits a consistency error while committing a transaction ... it looks like the transaction will rollback, and the DatabaseError will be raised to the calling code (e.g. the transaction.atomic() context manager). Is this correct?
Yes, precisely.
Are there cases when the transaction could be rolled back without the transaction.atomic wrapper receiving an exception?
No. You can verify this from the code inside transaction.py, where the only time a rollback is initiated is when a DatabaseError is thrown. This is also confirmed in the documentation that you link to:
When exiting an atomic block, Django looks at whether it’s exited normally or with an exception to determine whether to commit or roll back.
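Since the DatabaseError does surface, the caller can simply retry the whole atomic block. A generic sketch of that shape; the ConflictError class and the work callable are illustrative stand-ins (in Django you would catch django.db.DatabaseError around `with transaction.atomic():`):

```python
class ConflictError(Exception):
    """Stand-in for a DatabaseError raised at commit time."""

def run_in_transaction(work, attempts=3):
    """Run `work` (the body of an atomic block), retrying on commit conflicts."""
    for attempt in range(1, attempts + 1):
        try:
            return work()
        except ConflictError:
            if attempt == attempts:
                raise  # exhausted our retries; let the error propagate
```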
I'm creating a few simple helper classes and methods for working with libpq, and am wondering: if I receive an error from the database (e.g. an SQL error), how should I handle it?
At the moment, each method returns a bool depending on whether the operation was a success, so it is up to the user to check before continuing with new operations.
However, after reading the libpq docs, if an error occurs the best I can come up with is that I should log the error message/status and otherwise ignore it. For example, if the application is in the middle of a transaction, then I believe it can still continue (PostgreSQL won't cancel the transaction, as far as I know).
Is there something I can do with PostgreSQL/libpq to make the consequences of such errors safe for the database server, or is ignorance the better policy?
You should examine the SQLSTATE in the error and make handling decisions based on that and that alone. Never try to make decisions in code based on the error message text.
An application should simply retry transactions for certain kinds of errors:
Serialization failures
Transaction aborts from deadlock detection
For connection errors, you should reconnect then re-try the transaction.
Of course you want to set a limit on the number of retries, so you don't loop forever if the issue doesn't clear up.
Other kinds of errors aren't going to be resolved by trying again, so the app should report an error to the client. Syntax error? Unique violation? Check constraint violation? Running the statement again won't help.
There is a list of error codes in the documentation. The docs don't explain much about each individual error, but the preamble is quite informative.
On a side note: One trap to avoid falling into is "testing" connections with a trivial query before using them, and assuming that means the real query can't fail. That's a race condition. Don't bother testing connections; simply run the real query and handle any error.
The details of what exactly to do depend on the error and on the application. If there was a single always-right answer, libpq would already do it for you.
My suggestions:
Always keep a record of the transaction until you've got a confirmed commit from the DB, in case you have to re-run. Don't just fire-and-forget SQL statements.
Retry the transaction without a disconnect and reconnect for SQLSTATEs 40001 (serialization_failure) and 40P01 (deadlock_detected), as these are transient conditions generally resolved by re-trying. You should log them, as they're opportunities to improve how the app interacts with the DB and if they happen a lot they're a performance problem.
Disconnect, reconnect, and retry the transaction at least once for error class 08 (connection exceptions).
Handle 53300 (too_many_connections) and 53400 (configuration_limit_exceeded) with specific and informative errors to the user. Same with the other class 53 entries.
Handle class 57 entries with specific and informative errors to the user. Do not retry if you get query_canceled (57014); it'll make sysadmins very angry.
Handle 25006 (read_only_sql_transaction) by reporting a different error, telling the user they tried to write to a read-only database or used a read-only transaction.
Report a different error for 23505 (UNIQUE violation), indicating that there's a conflict in a unique constraint or primary key constraint. There's no point retrying.
Error class 01 should never produce an exception.
Treat other cases as errors and report them to the caller, with details from the problem - most importantly SQLSTATE. Log all the details if you return a simplified error.
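The rules above can be condensed into a small classification function keyed on SQLSTATE alone. Shown in Python for brevity (the question's helpers are C++, but the logic is language-independent); the action names are illustrative:

```python
def classify_sqlstate(sqlstate):
    """Map a 5-character SQLSTATE to a coarse recovery action."""
    if sqlstate in {"40001", "40P01"}:   # serialization failure, deadlock
        return "retry"
    if sqlstate.startswith("08"):        # connection exceptions
        return "reconnect_and_retry"
    if sqlstate.startswith("53"):        # insufficient resources (incl. 53300)
        return "report_resource_error"
    if sqlstate.startswith("57"):        # operator intervention (incl. 57014)
        return "report_no_retry"
    if sqlstate == "25006":              # read_only_sql_transaction
        return "report_read_only"
    if sqlstate == "23505":              # unique_violation
        return "report_conflict"
    if sqlstate.startswith("01"):        # class 01 is warnings, not errors
        return "ignore"
    return "report"                      # everything else: surface to the caller
```

In libpq the SQLSTATE comes from PQresultErrorField(res, PG_DIAG_SQLSTATE); the same dispatch then decides whether to retry, reconnect, or report.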
Hope that's useful.
I have to rollback the changes done to the database depending on some condition, but up to that 'some condition' the changes should be reflected in the database for other users.
@transaction.atomic
def populate_db(input):
    Object = Table.objects.select_for_update().get(attributeX=input)
    Object.attributeY = False
    Object.save()
    # ** some operation here **
The problem I'm facing is that the value of attributeY is not stored in the database until the whole function executes successfully, but what I want is for the changed value of attributeY to be visible in the database until "some operation" fails.
And I cannot know whether "some operation" failed or not, because the failures I'm trying to handle are things like the browser being closed accidentally or a power outage.
Any help is appreciated, thanks!
So what would populate_db see that indicates the transaction did not complete?
For example, the seat has been reserved but not yet paid for (because of a fault). In this case, populate_db should not complete the transaction until it also has a payment authorization code.
Alternately, if you want to mark the seat's status as being_reserved, then there is no transaction, the status gets set to being_reserved and other clients can see it. In this model, populate_db would be responsible for detecting the fault (through exceptions possibly) and returning the seat status to available in another database update.
The error in your thinking is that the database can remain consistent regardless of the failure of any component. That requirement cannot be satisfied. You cannot both allow other clients to see being_reserved and suffer a failure of populate_db.
This trade-off is central to every reservation system ever written. And there are too many ways to regain consistency in the face of arbitrary failure to enumerate here.
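One common shape for the being_reserved model above is to timestamp the intermediate status and let a periodic job reclaim stale reservations, so a crashed populate_db cannot hold a seat forever. A self-contained sketch; the status names and timeout are illustrative, and in a real system the Seat would be a database row updated in short, independent transactions:

```python
import time

AVAILABLE, BEING_RESERVED, RESERVED = "available", "being_reserved", "reserved"

class Seat:
    def __init__(self):
        self.status = AVAILABLE
        self.reserved_at = None

    def start_reservation(self, now=None):
        # Immediately visible to other clients -- no long transaction held open.
        self.status = BEING_RESERVED
        self.reserved_at = now if now is not None else time.time()

    def confirm(self):
        self.status = RESERVED  # e.g. once a payment authorization arrives

    def reclaim_if_stale(self, timeout_seconds=300, now=None):
        """Periodic job: undo reservations whose client died mid-flow."""
        now = now if now is not None else time.time()
        if (self.status == BEING_RESERVED
                and now - self.reserved_at > timeout_seconds):
            self.status = AVAILABLE
            self.reserved_at = None
```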