What are the failure modes of transaction.atomic()? - django

I'm unclear on the exact behaviour of Django in the face of database serialization errors in transactions.
The docs transaction.atomic() docs don't specify this behaviour as far as I can tell.
If the DB hits a consistency error while committing a transaction (e.g. another transaction updated a value that was read in the current transaction), reading django.db.transaction.py, it looks like the transaction will rollback, and the DatabaseError will be raised to the calling code (e.g. the transaction.atomic() context manager). Is this correct?
And, more importantly, are there cases when the transaction could be rolled back without the transaction.atomic wrapper receiving an exception?
(Note that I'm not asking about DatabaseErrors that are raised inside the context manager, as the docs clearly explain what happens to them. I'm asking only about database errors which occur during the commit of the transaction, which occurs on exit of the context manager.)

If the DB hits a consistency error while committing a transaction ... it looks like the transaction will rollback, and the DatabaseError will be raised to the calling code (e.g. the transaction.atomic() context manager). Is this correct?
Yes, precisely.
Are there cases when the transaction could be rolled back without the transaction.atomic wrapper receiving an exception?
No. You can verify this from the code inside transaction.py where the only time a rollback is initiated is if DatabaseError is thrown. This is also confirmed in the documentation that you link to:
When exiting an atomic block, Django looks at whether it’s exited normally or with an exception to determine whether to commit or roll back.

Related

Django Transactions: How to run extra code during rollback?

Imagine you have a User model in your web app, and that you need to keep this user in sync with an external service via an API. Thus, when you create a user locally, you need to create it remotely as well.
You have all your operations under transaction.atomic() and you try to keep all your 3rd-party API calls after the atomic block, which is reasonable.
But, a system being a system, it grows in complexity until the point you have some really hard to remove 3rd-party calls within an update call.
That said, is there a way to extend Django's transaction mechanism, kind of adding some callback functions, like rollback.add_callback(clean_3rdparty_user(user_id=134))?
That way I can guarantee that all necessary rollback actions are taken and my system is in sync?
The author of Django's transaction hook code has this to say about why there is on_commit() but not on_rollback():
A rollback hook is even harder to implement robustly than a commit hook, since a variety of things can cause an implicit rollback. For instance, your database connection was dropped because your process was killed without a chance to shutdown gracefully: your rollback hook will never run.
Since rollbacks are typically triggered by an exception, a simple approach is to just catch any exceptions and run your undo code there.
try:
with transaction.atomic():
# Do database stuff
# Do external stuff
except:
# We know the database stuff has rolled back, so...
# Undo external stuff
raise
This is not particularly elegant. I agree with the following from the same source:
The solution is simple: instead of doing something during the atomic block (transaction) and then undoing it if the transaction fails, use on_commit to delay doing it in the first place until after the transaction succeeds. It’s a lot easier to undo something you never did in the first place!
But it sounds like you already agree with that as well.

Particular NServiceBus Sagas: Concurrent Access to Saga Data Persisted in Azure Table Storage

This question is in regards to concurrent access to saga data, when saga data is persisted in Azure Table Storage. It is also references information found in Particular's documentation: http://docs.particular.net/nservicebus/nservicebus-sagas-and-concurrency
We've noticed that, within a single saga executing handlers concurrently, modifications to saga data appear to be operating in a "last one to post changes to azure table storage wins" scenario. Is this intended behavior when using NSB in conjunction with Azure Table Storage as the Saga data persistence layer?
Example:
Integer property in Saga Data, assume it currently = 5
5 commands are handled by 5 instances of the same handler in this saga
Each command handler decrements the integer property in saga data
Final value of the integer property in saga data could actually be 4 after handling these 5 messages - if each message was handled by a new instance of the saga, potentially on different servers, each having a copy of saga data indicating the integer property is 5, decrement it to 4, and post back up. What I just described was the extremely concurrent example, however it is likely the integer will be greater than 0 if any of the 5 messages were handled concurrently, the only time the saga data integer property reaches 0 is when the 5 commands happen to have executing serially.
Also, as Azure Table Storage supports Optimistic Concurrency, is it possible to enable the use of of this feature for Table Storage just as it is enabled for RavenDB when Raven is used as the persistence tech?
If this is not possible, what is the recommended approach for handling this? Currently we are subscribing to the paradigm that any handler in a saga that could ever potentially be handling multiple messages concurrently is not allowed to modify saga data, meaning our coordination of saga message is being accomplished via means external to the saga rather than using Saga Data as we'd initially intended.
After working with Particular support - the symptoms described above ended up being an defect in NServiceBus.Azure. This issue has been patched by Particular in NServiceBus.Azure 5.3.11 and 6.2+. I can personally confirm that updating to 5.3.11 resolved our issues.
For reference, a tell-tale sign of this issue manifesting itself is the following exception getting thrown and not getting handled.
Failed to process message
Microsoft.WindowsAzure.Storage.StorageException: Unexpected response
code for operation : 0
The details of the exception will indicate that "UpdateConditionNotSatisfied" - referring to the optimistic concurrency check.
Thanks to Yves Goeleven and Sean Feldman from Particular for diagnosing and resolving this issue.
The azure saga storage persister uses optimistic concurency, if multiple messages arrive at the same time, the last one to update should throw an exception, retry and make the data correct again.
So this sounds like a bug, can you share which version you're on?
PS: last year we have resolved an issue that sounds very similar to this https://github.com/Particular/NServiceBus.Azure/issues/124 it has been resolved in NServiceBus.Azure 5.2 and upwards

How should I handle an error in libpq for postgresql

I'm creating a few simple helper classes and methods for working with libpq, and am wondering if I receive an error from the database - (e.g. SQL error), how should I handle it?
At the moment, each method returns a bool depending on whether the operation was a success, and so is up to the user to check before continuing with new operations.
However, after reading the libpq docs, if an error occurs the best I can come up with is that I should log the error message / status and otherwise ignore. For example, if the application is in the middle of a transaction, then I believe it can still continue (Postgresql won't cancel the transaction as far as I know).
Is there something I can do with PostgreSQL / libpq to make the consequences of such errors safe regarding the database server, or is ignorance the better policy?
You should examine the SQLSTATE in the error and make handling decisions based on that and that alone. Never try to make decisions in code based on the error message text.
An application should simply retry transactions for certain kinds of errors:
Serialization failures
Deadlock detection transaction aborts
For connection errors, you should reconnect then re-try the transaction.
Of course you want to set a limit on the number of retries, so you don't loop forever if the issue doesn't clear up.
Other kinds of errors aren't going to be resolved by trying again, so the app should report an error to the client. Syntax error? Unique violation? Check constraint violation? Running the statement again won't help.
There is a list of error codes in the documentation but the docs don't explain much about each error, but the preamble is quite informative.
On a side note: One trap to avoid falling into is "testing" connections with a trivial query before using them, and assuming that means the real query can't fail. That's a race condition. Don't bother testing connections; simply run the real query and handle any error.
The details of what exactly to do depend on the error and on the application. If there was a single always-right answer, libpq would already do it for you.
My suggestions:
Always keep a record of the transaction until you've got a confirmed commit from the DB, in case you have to re-run. Don't just fire-and-forget SQL statements.
Retry the transaction without a disconnect and reconnect for SQLSTATEs 40001 (serialization_failure) and 40P01 (deadlock_detected), as these are transient conditions generally resolved by re-trying. You should log them, as they're opportunities to improve how the app interacts with the DB and if they happen a lot they're a performance problem.
Disconnect, reconnect, and retry the transaction at least once for error class 08 (connection exceptions).
Handle 53300 (too_many_connections) and 53400 (connection limit exceeded) with specific and informative errors to the user. Same with the other 53 class entries.
Handle class 57's entries with specific and informative errors to the user. Do not retry if you get a query_cancelled (57014), it'll make sysadmins very angry.
Handle 25006 (read_only_sql_transaction) by reporting a different error, telling the user you tried to write to a read-only database or using a read-only transaction.
Report a different error for 23505 (UNIQUE violation), indicating that there's a conflict in a unique constraint or primary key constraint. There's no point retrying.
Error class 01 should never produce an exception.
Treat other cases as errors and report them to the caller, with details from the problem - most importantly SQLSTATE. Log all the details if you return a simplified error.
Hope that's useful.

Django : How to rollback changes done to the database depending on some condition?

I have to rollback the changes done to the database depending on some condition, but up to that 'some condition' the changes should be reflected in the database for other users.
#transaction.atomic
def populate_db(input):
Object = Table.objects.select_for_update().get(attributeX=input)
Object.attributeY = False
Object.save()
** some operation here **
Problem I'm facing is, the value of attributeY is not getting stored in the database until the whole function is executed successfully, but what i want is changed value of attributeY should be reflected in database until some operation fails.
And I cannot get to know whether some operation is failed or not, because the failures I'm trying to handle here are closing browser accidentally, power outage kind of things.
Any help is appreciated, thanks !
So what would populate_db see that indicates the transaction did not complete?
For example, the seat has been reserved but not yet paid for (because of fault). In this case, populate_db should not complete the transaction until it also has a payment authorization code.
Alternately, if you want to mark the seat's status as being_reserved, then there is no transaction, the status gets set to being_reserved and other clients can see it. In this model, populate_db would be responsible for detecting the fault (through exceptions possibly) and returning the seat status to available in another database update.
The error in your thinking is that the database can remain consistent regardless of the failure of any component. That requirement cannot be satisfied. You cannot both allow other clients to see being_reserved and suffer a failure of populate_db.
This trade-off is central to every reservation system ever written. And there are too many ways to regain consistency in the face of arbitrary failure to enumerate here.

Weird behaviour for transactions in django 1.6.1

I am using transaction.atomic as a context manager for transactions in django 1.6. There is a block of code which I want to be in a transaction which has a couple of network calls and some database writes. I am seeing very weird behaviour. Every once in while (maybe 1 in 20 times) I have noticed a partial rollback happening without any exception having been raised and the view executing without any errors. My application is hosted on heroku and we use heroku postgres v9.2.8. Pseudo code:
from django.db import transaction
def some_view(request):
try:
with transation.atomic():
network_call_1()
db_write_1.save(update_fields=['col4',])
db_write_2.save(update_fields=['col3',])
db_write_3.save(update_fields=['col1',])
network_call_2()
db_write_4.save(update_fields=['col6',])
db_write_5.bulk_create([object1, object2])
db_write_6.bulk_create([object1, object2])
except Exception, e:
logger.error(e)
return HttpResponse()
The behaviour that I have noticed is that without any exception having been raised, either db write 1-3 have rolled back and the rest gone through or db write 1 has been rolled back and rest have gone through and so on. I don't understand why this should be happening. First, if there is a rollback, shouldn't it be a complete rollback of the transaction? If there is a rollback shouldn't an exception also be raised so that I know a rollback has happened? Everytime this has happened, no exception has been raised and the code just continues executing and returns a successful HttpResponse.
Relevant settings:
DATABASES = {
'default': {
'ENGINE': 'django.db.backends.postgresql_psycopg2',
'NAME': 'mydb',
'USER': 'root',
'PASSWORD': 'root',
'HOST': 'localhost',
'PORT': '5432',
},
}
CONN_MAX_AGE = None
This bug has me baffled since days. Any clues will be of great help!
After hours of debugging, we have found the culprit.
When we start our application on gunicorn, it spawns workers. Every request coming to the same worker uses the same django DatabaseWrapper instance (postgres in our case) also referred to as a connection. If, in the middle of a transaction in one request, the worker were to receive another request, this request resets the state of the connection causing the transaction to behave in unexpected ways as documented in this bug: https://code.djangoproject.com/ticket/21239
Sometimes the transaction doesn't get committed and there is no exception raised to let you know that happened. Sometimes parts of it do get committed while the rest is lost and it looks like a partial rollback.
We thought that a connection is thread safe, but this bit of gunicorn patching magic here makes sure that's not the case: https://github.com/benoitc/gunicorn/blob/18.0/gunicorn/management/commands/run_gunicorn.py#L16
Still open to suggestions on how to sidestep this issue if possible at all.
EDIT: Don't use the run_gunicorn management command to start Django. It does some funky patching which causes DB connections to not be thread safe. The solution that worked for us is to just use "gunicorn myapp.wsgi:application -c gunicorn.conf". Django persistent DB connections don't work with the gevent worker type yet so avoid using that unless you want to run out of connections.
Not a Django expert, but I do know Postgres. I agree with your assessment that this sounds like very atypical behavior for a transaction: the rollback should be all-or-nothing, and there should be an exception. That being the case, can you be absolutely certain that this is a rollback-type situation? There are lots of other possible causes that could account for different data appearing in the database than you expected, and many of those scenarios would fit better with your observed occurrences that does a rollback.
You haven't provided any specifics as to your data, but what I imagine is, you're seeing something like "I set the value of col4 to 'foo', but after the commit, the old value 'bar' is still in the database." Is that correct?
If so, then other possible causes could be:
The code that is supposed to setting the 'foo' value somehow, on occasion, is actually setting either the existing 'bar' value, or a NULL value.
The code is setting the 'foo' value, but the there is a data access layer (aka DAL) with a 'dirty' flag that is not being set (e.g. if the object is in a disconnected state), so when the commit is done, the DAL doesn't see that as being a change it is supposed to write.
These are just a few examples to get you started. There are lots of other possible scenarios. Sometimes, the basic philosophy of debugging problems like this is similar to the problem of the DDT and pelicans: since the database is at the top of the food chain, you can often see problems there that-- while they appear to be database problems-- are actually caused somewhere else in your solution.
Good luck and hope that helps!
My 3 cents:
Exceptions
We're certain no exceptions have occurred. But are we? Your pseudo-code "handles" an exception by just logging. Make sure there are no exceptions "handled" elsewhere by logging or pass.
The partial rollback
We expect the the whole transaction to be rolled back, not just part. Since django 1.6 nested atomic transactions create a savepoint and rollbacks go back to the last savepoint. Make sure there are no nested transactions. Perhaps you have transaction middleware active check ATOMIC_REQUESTS and MIDDLEWARE_CLASSES. Maybe transactions are started in those network_call functions.
Reproducing
Since that network_call code may block. Try to replace them with mock calls, that timeout (maybe not in production). If that results in 100% (partial) rollbacks. It should make locating the problem of partial rollbacks easier.
Let me just make few remarks first.
It is not necessary to have an exception in this code and still have rollback.
Maybe there is some kind of timeout outside this code. Think if you killed python process in the middle of the second network call. This particular exception would not be logged.
I would also recommend adding
raise
at the end of except, it will log and re-raise the same exception. Cacthing all exceptions is rarely good.
Also, there might be a threading issue. Try importing threding and logging current thread id in your logger with the exception. You may find out that you actually have more than one thread, so one has to wait on another.
Generally, it is not a good idea to have some external calls in the middle of transaction.
Do both your calls before you start atomic transaction, so it can be as fast as possible.
Hope this helps.