Django Transactions: How to run extra code during rollback?

Imagine you have a User model in your web app, and that you need to keep this user in sync with an external service via an API. Thus, when you create a user locally, you need to create it remotely as well.
You have all your operations under transaction.atomic() and you try to keep all your 3rd-party API calls after the atomic block, which is reasonable.
But, a system being a system, it grows in complexity to the point where you have some 3rd-party calls inside an update call that are really hard to move out.
That said, is there a way to extend Django's transaction mechanism by adding callback functions, something like rollback.add_callback(clean_3rdparty_user(user_id=134))?
That way I could guarantee that all the necessary rollback actions are taken and my system stays in sync.

The author of Django's transaction hook code has this to say about why there is on_commit() but not on_rollback():
A rollback hook is even harder to implement robustly than a commit hook, since a variety of things can cause an implicit rollback. For instance, your database connection was dropped because your process was killed without a chance to shutdown gracefully: your rollback hook will never run.
Since rollbacks are typically triggered by an exception, a simple approach is to just catch any exceptions and run your undo code there.
from django.db import transaction

try:
    with transaction.atomic():
        ...  # Do database stuff
        ...  # Do external stuff
except:
    # We know the database stuff has rolled back, so...
    # Undo external stuff
    raise
This is not particularly elegant. I agree with the following from the same source:
The solution is simple: instead of doing something during the atomic block (transaction) and then undoing it if the transaction fails, use on_commit to delay doing it in the first place until after the transaction succeeds. It’s a lot easier to undo something you never did in the first place!
But it sounds like you already agree with that as well.
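For completeness, here is a minimal sketch of the on_commit approach; User is the model from the question, and sync_user_remotely is a hypothetical stand-in for the 3rd-party API call:
from django.db import transaction

def create_user_view(request):
    with transaction.atomic():
        user = User.objects.create(username="alice")
        # Runs only if the surrounding transaction commits successfully;
        # if the block rolls back, the callback is simply discarded.
        transaction.on_commit(lambda: sync_user_remotely(user.id))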

Related

django, multi-databases (writer, read-replicas) and a sync issue

So... in response to an API call I do:
i = CertainObject(paramA=1, paramB=2)
i.save()
Now my writer database has a new record.
Processing can take a bit and I do not wish to hold off my response to the API caller, so on the next line I transfer the object ID to an async job using Celery:
run_async_job.delay(i.id)
Right away, or a few seconds later depending on the queue, run_async_job tries to load the record from the database using that ID. It's a gamble: sometimes it works, sometimes it doesn't, depending on whether the read replicas have caught up.
Is there a pattern to guarantee success without having to "sleep" for a few seconds before reading, or hoping for good luck?
Thanks.
The simplest way seems to be to use retries, as mentioned by Greg and Elrond in their answers. If you're using the shared_task or @app.task decorators, you can use the following code snippet.
from celery import shared_task

@shared_task(bind=True)
def your_task(self, certain_object_id):
    try:
        certain_obj = CertainObject.objects.get(id=certain_object_id)
        # Do your stuff
    except CertainObject.DoesNotExist as e:
        self.retry(exc=e, countdown=2 ** self.request.retries, max_retries=20)
I used an exponential countdown between retries. You can modify it according to your needs.
You can find the documentation for custom retry delay here.
There is also another document explaining exponential backoff in this link.
When you call retry it'll send a new message, using the same task-id, and it'll take care to make sure the message is delivered to the same queue as the originating task. You can read more about this in the documentation here.
Since writing and then loading it immediately is a high priority, why not store it in a memory-based DB like Memcached or Redis? After some time you can write it to the database using a periodic Celery job that runs, let's say, every minute or so. When that job has finished writing to the DB, it deletes the keys from Redis/Memcached.
You can keep the data in the memory-based DB for a certain time, let's say one hour, when the data is needed most. You can also create a service method that checks whether the data is in memory or not.
django-redis is a great package for connecting to Redis (if you are using it as the broker in Celery).
Here is an example based on the Django cache:
# service method
from django.core.cache import cache

def get_object(obj_id, model_cls):
    obj_dict = cache.get(obj_id, None)  # checks if obj id is in cache, O(1) complexity
    if obj_dict:
        return model_cls(**obj_dict)
    else:
        return model_cls.objects.get(id=obj_id)

# celery job ("app" is your Celery application instance, "logger" a module-level logger)
@app.task
def store_objects():
    logger.info("-" * 25)
    # you can use .bulk_create() to reduce DB hits and get faster DB entries
    for obj_id in cache.keys("foo_*"):  # cache.keys() is available with django-redis
        CertainObject.objects.create(**cache.get(obj_id))
        cache.delete(obj_id)
    logger.info("-" * 25)
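To actually run store_objects every minute, one way (assuming you use Celery beat; the dotted task path is a placeholder) is a schedule entry like this:
# in your Celery configuration ("app" is the Celery application instance)
app.conf.beat_schedule = {
    "flush-cached-objects-every-minute": {
        "task": "myapp.tasks.store_objects",  # placeholder dotted path to the task above
        "schedule": 60.0,                     # run every 60 seconds
    },
}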
The simplest solution would be to catch any DoesNotExist errors thrown at the start of the task, then schedule a retry. This can be done by converting run_async_job into a Bound Task:
@app.task(bind=True)
def run_async_job(self, object_id):
    try:
        instance = CertainObject.objects.get(id=object_id)
    except CertainObject.DoesNotExist:
        # retry re-sends the task with the same arguments after the task's retry delay
        raise self.retry()
    # ... continue processing with instance ...
This article goes pretty deep into how you can handle read-after-write issues with replicated databases: https://medium.com/box-tech-blog/how-we-learned-to-stop-worrying-and-read-from-replicas-58cc43973638.
Like the author, I know of no foolproof catch-all way to handle read-after-write inconsistency.
The main strategy I've used before is to have some kind of expect_and_get(pk, max_attempts=10, delay_seconds=5) method that attempts to fetch the record, and attempts it max_attempts times, delaying delay_seconds seconds in between attempts. The idea is that it "expects" the record to exist, and so it treats a certain number of failures as just transient DB issues. It's a little more reliable than just sleeping for some time since it will pick up records quicker and hopefully delay the job execution much less often.
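A rough sketch of that idea (the name, defaults and behaviour here are just illustrative, not a library API):
import time

def expect_and_get(model_cls, pk, max_attempts=10, delay_seconds=5):
    # We "expect" the record to exist, so treat a few misses as transient replication lag.
    for attempt in range(1, max_attempts + 1):
        try:
            return model_cls.objects.get(pk=pk)
        except model_cls.DoesNotExist:
            if attempt == max_attempts:
                raise  # give up: this is probably a genuinely missing record
            time.sleep(delay_seconds)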
Another strategy would be to delay returning from a special save_to_read method until read replicas have the value, either by synchronously pushing the new value to the read replicas somehow or just polling them all until they return the record. This way seems a little hackier IMO.
For a lot of your reads, you probably don't have to worry about read-after-write consistency:
If we’re rendering the name of the enterprise a user is part of, it’s really not that big a deal if in the incredibly rare occasion that an admin changes it, it takes a minute to have the change propagate to the enterprise’s users.

How do I add simple delayed tasks in Django?

I am creating a chatbot and need a solution to send messages to the user in the future after a specific delay. I have my system set up with Nginx, Gunicorn and Django. The idea is that if the bot needs to send the user several messages, it can delay each subsequent message by a certain amount of time before it sends it to seem more 'human'.
However, a simple threading.Timer approach won't work because the user might interrupt this process at any moment, prompting future messages to be changed, and the timer threads might not be reachable to stop since they may live in a different worker. So far I have come across two solutions:
Use threading.Timer blindly to check a to-send list in the database; this can create problems with lots of unneeded threads and also makes the database less clean/organized.
Use Celery or some other task queue to execute these future tasks. This seems like overkill and over-engineering for a simple problem, since the tasks will always just be delayed function calls. It is also a hassle dealing with which messages belong to which conversation.
What would be the best solution for this problem?
Also, a more generic question:
Ideally the best solution would be a framework where I can 'simulate' a new bot for each conversation, so it acts as its own entity and holds all the state/message-queue information in memory for itself. This framework would only need to allocate resources to a bot when it has something to do, based on a preset delay or an incoming message. Does anything like this exist?
Personally I would use Celery for this; executing delayed function calls is its job. And I don't know why knowing what messages belong where would be more of a problem there than doing it in a thread.
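As a minimal sketch of that (send_bot_message and schedule_reply are hypothetical names, just to show the shape of the Celery approach):
from celery import shared_task

@shared_task
def send_bot_message(conversation_id, text):
    # look up the conversation and deliver the message to the user here
    ...

def schedule_reply(conversation_id, text, delay_seconds=5):
    # schedule the message for the future; keep the result so it can be revoked
    return send_bot_message.apply_async(args=(conversation_id, text), countdown=delay_seconds)

# if the user interrupts, cancel the pending message:
#     result = schedule_reply(42, "Hello!")
#     result.revoke()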
But you might also want to investigate the new Django-Channels work that Andrew Godwin is doing, since that is intended to support async background tasks.

How to handle concurrency by eventual consistency?

How do I handle concurrency with eventual consistency? Or, put another way, how do I ensure data integrity with eventual consistency?
With CQRS and event sourcing, eventual consistency means that you put your domain events into a queue and set up event handlers which are projections. Those projections update the read cache asynchronously. Now, if you validate using that read cache, you cannot be sure that the information your validation is based on is still valid. There can be unprocessed (or unprojected?) domain events in the queue when you send your command, which can change the outcome of the validation. So this is just another type of concurrency... What do you think, how should these rare concurrency issues be handled? The domain events are already saved in the storage, so you cannot do anything about them; you cannot just remove them from the event storage (because it is supposed to be write-once) and tell the user in an email that sorry, we changed our mind and cancelled your request. Or can you?
Update:
A possible solution for handling concurrency with an event store:
by write model:
    if last-known-aggregate-version < stored-aggregate-version then
        throw error
    else
        execute command on aggregate
        raise domain-event
        store domain-event
        ++stored-aggregate-version (by aggregate-id)

by read model:
    process query
    if result contains aggregate-id then
        attach read-cached-aggregate-version

by projection:
    process domain-event
    read-cached-aggregate-version = domain-event-related-aggregate-version (by aggregate-id)
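As a rough Python sketch of the write-model check (the aggregate and event_store objects and their methods are made up for illustration, not from any particular framework):
class ConcurrencyError(Exception):
    pass

def handle_command(event_store, aggregate, command):
    # Reject the command if the caller's view of the aggregate is stale.
    stored_version = event_store.current_version(aggregate.id)
    if command.last_known_version < stored_version:
        # someone else modified the aggregate since the client last read it
        raise ConcurrencyError(aggregate.id)
    event = aggregate.execute(command)  # raises the domain event
    event_store.append(aggregate.id, event, expected_version=stored_version)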
As long as state changes, you cannot assume anything will ever be 100% consistent. Technically you can ensure that various bits are 100% consistent with what you know.
Your queued domain event scenario is no different from a queue of work on a user's desk that still has to be input into the system.
Any other user performing an action dependent on the system state has no way to know that another user still needs to perform some action that may interfere with their operation.
I guess a lot is based on assuming the data is consistent and developing alternate flows and processes that can deal with these scenarios as they arise.

Weird behaviour for transactions in django 1.6.1

I am using transaction.atomic as a context manager for transactions in Django 1.6. There is a block of code which I want to be in a transaction; it has a couple of network calls and some database writes. I am seeing very weird behaviour. Every once in a while (maybe 1 in 20 times) I have noticed a partial rollback happening without any exception having been raised and the view executing without any errors. My application is hosted on Heroku and we use Heroku Postgres v9.2.8. Pseudo code:
from django.db import transaction

def some_view(request):
    try:
        with transaction.atomic():
            network_call_1()
            db_write_1.save(update_fields=['col4'])
            db_write_2.save(update_fields=['col3'])
            db_write_3.save(update_fields=['col1'])
            network_call_2()
            db_write_4.save(update_fields=['col6'])
            db_write_5.bulk_create([object1, object2])
            db_write_6.bulk_create([object1, object2])
    except Exception as e:
        logger.error(e)
    return HttpResponse()
The behaviour that I have noticed is that, without any exception having been raised, either db writes 1-3 have rolled back and the rest have gone through, or db write 1 has been rolled back and the rest have gone through, and so on. I don't understand why this should be happening. First, if there is a rollback, shouldn't it be a complete rollback of the transaction? And if there is a rollback, shouldn't an exception also be raised so that I know a rollback has happened? Every time this has happened, no exception has been raised and the code just continues executing and returns a successful HttpResponse.
Relevant settings:
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mydb',
        'USER': 'root',
        'PASSWORD': 'root',
        'HOST': 'localhost',
        'PORT': '5432',
    },
}

CONN_MAX_AGE = None
This bug has had me baffled for days. Any clues will be of great help!
After hours of debugging, we have found the culprit.
When we start our application on gunicorn, it spawns workers. Every request coming to the same worker uses the same django DatabaseWrapper instance (postgres in our case) also referred to as a connection. If, in the middle of a transaction in one request, the worker were to receive another request, this request resets the state of the connection causing the transaction to behave in unexpected ways as documented in this bug: https://code.djangoproject.com/ticket/21239
Sometimes the transaction doesn't get committed and there is no exception raised to let you know that happened. Sometimes parts of it do get committed while the rest is lost and it looks like a partial rollback.
We thought that a connection is thread safe, but this bit of gunicorn patching magic here makes sure that's not the case: https://github.com/benoitc/gunicorn/blob/18.0/gunicorn/management/commands/run_gunicorn.py#L16
Still open to suggestions on how to sidestep this issue if possible at all.
EDIT: Don't use the run_gunicorn management command to start Django. It does some funky patching which causes DB connections to not be thread safe. The solution that worked for us is to just use "gunicorn myapp.wsgi:application -c gunicorn.conf". Django persistent DB connections don't work with the gevent worker type yet so avoid using that unless you want to run out of connections.
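For reference, a minimal gunicorn.conf sketch for launching it that way (the bind address and worker count are placeholders, adjust for your setup):
# gunicorn.conf (Python syntax), used as: gunicorn myapp.wsgi:application -c gunicorn.conf
bind = "0.0.0.0:8000"
workers = 3            # a common starting point is 2 * CPU cores + 1
worker_class = "sync"  # avoid gevent if you rely on persistent DB connections
timeout = 30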
Not a Django expert, but I do know Postgres. I agree with your assessment that this sounds like very atypical behaviour for a transaction: the rollback should be all-or-nothing, and there should be an exception. That being the case, can you be absolutely certain that this is a rollback-type situation? There are lots of other possible causes that could account for different data appearing in the database than you expected, and many of those scenarios would fit your observed occurrences better than a rollback does.
You haven't provided any specifics as to your data, but what I imagine is, you're seeing something like "I set the value of col4 to 'foo', but after the commit, the old value 'bar' is still in the database." Is that correct?
If so, then other possible causes could be:
The code that is supposed to set the 'foo' value is, on occasion, actually setting either the existing 'bar' value or a NULL value.
The code is setting the 'foo' value, but there is a data access layer (aka DAL) with a 'dirty' flag that is not being set (e.g. if the object is in a disconnected state), so when the commit is done, the DAL doesn't see that as a change it is supposed to write.
These are just a few examples to get you started. There are lots of other possible scenarios. Sometimes the basic philosophy of debugging problems like this is similar to the problem of DDT and the pelicans: since the database is at the top of the food chain, you can often see problems there that, while they appear to be database problems, are actually caused somewhere else in your solution.
Good luck and hope that helps!
My 3 cents:
Exceptions
We're certain no exceptions have occurred. But are we? Your pseudo-code "handles" an exception by just logging. Make sure there are no exceptions "handled" elsewhere by logging or pass.
The partial rollback
We expect the whole transaction to be rolled back, not just part of it. Since Django 1.6, nested atomic blocks create a savepoint, and a rollback goes back to the last savepoint. Make sure there are no nested transactions. Perhaps you have transaction middleware active; check ATOMIC_REQUESTS and MIDDLEWARE_CLASSES. Maybe transactions are started in those network_call functions.
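A small sketch of how nested atomic blocks behave (reusing the question's placeholder objects, purely to illustrate the savepoint rollback):
from django.db import transaction

with transaction.atomic():          # outer transaction
    db_write_1.save()
    try:
        with transaction.atomic():  # inner block creates a savepoint
            db_write_2.save()
            raise ValueError("boom")
    except ValueError:
        pass
# only db_write_2 is rolled back (to the savepoint); db_write_1 is committed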
Reproducing
Since that network_call code may block, try to replace those calls with mock calls that time out (maybe not in production). If that results in 100% (partial) rollbacks, it should make locating the problem much easier.
Let me just make a few remarks first.
It is not necessary for an exception to be raised in this code for a rollback to happen.
Maybe there is some kind of timeout outside this code. Think about what happens if the Python process is killed in the middle of the second network call; that particular failure would never be logged.
I would also recommend adding raise at the end of the except block, so the same exception is logged and re-raised. Catching all exceptions is rarely a good idea.
Also, there might be a threading issue. Try importing threading and logging the current thread id in your logger along with the exception. You may find out that you actually have more than one thread, so one has to wait on another.
Generally, it is not a good idea to have external calls in the middle of a transaction.
Do both of your calls before you start the atomic transaction, so it can be as fast as possible.
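For example, restructuring the question's pseudo-code along those lines might look like this (same placeholder names as in the question):
def some_view(request):
    # do the slow external work first, outside any transaction
    network_call_1()
    network_call_2()
    try:
        with transaction.atomic():
            db_write_1.save(update_fields=['col4'])
            db_write_2.save(update_fields=['col3'])
            # ... remaining writes ...
    except Exception:
        logger.exception("transaction failed")
        raise
    return HttpResponse()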
Hope this helps.

How do I detect an aborted connection in Django?

I have a Django view that does some pretty heavy processing and takes around 20-30 seconds to return a result.
Sometimes the user will end up closing the browser window (terminating the connection) before the request completes -- in that case, I'd like to be able to detect this and stop working. The work I do is read-only on the database so there isn't any issue with transactions.
In PHP the connection_aborted function does exactly this. Is this functionality available in Django?
Here's example code I'd like to write:
def myview(request):
    while not connection_aborted():
        # do another bit of work...
        if work_complete:
            return HttpResponse('results go here')
Thanks.
I don't think Django provides it because it basically can't. More than Django itself, this depends on the way Django interfaces with your web server. All this depends on your software stack (which you have not specified). I don't think it's even part of the FastCGI and WSGI protocols!
Edit: I'm also pretty sure that Django does not start sending any data to the client until your view finishes execution, so it can't possibly know if the connection is dead. The underlying socket won't trigger an error unless the server tries to send some data back to the user.
That connection_aborted method in PHP doesn't do what you think it does. It will tell you if the client disconnected, but only if the buffer has been flushed, i.e. some sort of response has been sent from the server back to the client. The PHP version wouldn't even work as you've written it above; you'd have to add a call to something like flush within your loop to have the server attempt to send data.
HTTP is a stateless protocol. It's designed so that neither the client nor the server depends on the other. As a result, the state of either side is only known when a connection is created, and that only occurs when there's some data to send one way or the other.
Your best bet is to do as @MattH suggested and do this through a bit of AJAX, and if you'd like you can integrate something like Node.js to make client "check-ins" during processing. How to set that up properly is beyond my area of expertise, though.
So you have an AJAX view, requested in the background of a rendered page, that runs a query taking 20-30 seconds to process, and you're concerned about wasted resources when someone cancels the page load.
I see that you've got options in three broad categories:
Live with it. Improve the situation by caching the results in case the user comes back.
Make it faster. Throw more space at a time/space trade-off. Maintain intermediate tables. Precalculate the entire thing, etc.
Do something clever with the browser fast-polling an "is it ready yet?" query and the server cancelling the query if it doesn't receive a nag within interval * 2 or similar (a rough sketch of this follows below). If you're really clever, you could return progress / ETA to the nags. However, this might not have particularly useful behaviour when the system is under load or your site is being accessed over limited bandwidth.
I don't think you should go for option 3 because it's increasing complexity and resource usage for not much gain.
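That said, if you did want to experiment with the polling idea from option 3, a very rough sketch (the endpoint name and cache keys are made up) could look like this:
from django.core.cache import cache
from django.http import JsonResponse

NAG_INTERVAL = 5  # seconds the browser waits between "is it ready yet?" polls

def poll_view(request, job_id):
    # hypothetical polling endpoint: record that the client is still interested
    cache.set(f"job-nag-{job_id}", True, timeout=NAG_INTERVAL * 2)
    result = cache.get(f"job-result-{job_id}")
    return JsonResponse({"ready": result is not None, "result": result})

def client_still_waiting(job_id):
    # the long-running job calls this periodically and aborts if it returns False
    return cache.get(f"job-nag-{job_id}") is not None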