Django, multiple databases (writer, read replicas) and a sync issue

So... in response to an API call I do:
i = CertainObject(paramA=1, paramB=2)
i.save()
Now my writer database has a new record.
Processing can take a while and I don't want to hold up my response to the API caller, so on the next line I transfer the object ID to an async job using Celery:
run_async_job.delay(i.id)
Right away, or a few seconds later depending on the queue, run_async_job tries to load the record from the database with the ID provided. It's a gamble: sometimes it works, sometimes it doesn't, depending on whether the read replicas have caught up yet.
Is there a pattern that guarantees success without having to "sleep" for a few seconds before reading, or hoping for good luck?
Thanks.

The simplest way seems to be to use retries, as mentioned by Greg and Elrond in their answers. If you're using the @shared_task or @app.task decorators, you can use the following code snippet:
@shared_task(bind=True)
def your_task(self, certain_object_id):
    try:
        certain_obj = CertainObject.objects.get(id=certain_object_id)
        # Do your stuff
    except CertainObject.DoesNotExist as e:
        self.retry(exc=e, countdown=2 ** self.request.retries, max_retries=20)
I used an exponential countdown between retries. You can modify it according to your needs.
You can find the documentation for custom retry delay here.
There is also another document explaining exponential backoff in this link.
When you call retry, it'll send a new message, using the same task-id, and it'll take care to make sure the message is delivered to the same queue as the originating task. You can read more about this in the documentation here.

Since writing and then immediately loading the record is a high priority, why not store it in an in-memory store like Memcached or Redis? Then, after some time, you can write it to the database using a periodic Celery job that runs, say, every minute or so. When it is done writing to the DB, it deletes the keys from Redis/Memcached.
You can keep the data in the in-memory store for a certain time, say 1 hour, while the data is needed most. You can also create a service method that checks whether the data is in memory or not.
django-redis is a great package for connecting to Redis (if you are using it as the broker in Celery).
Here is an example based on the Django cache:
# service method
from django.core.cache import cache

def get_object(obj_id, model_cls):
    obj_dict = cache.get(obj_id, None)  # checks if the object is in the cache, O(1)
    if obj_dict:
        return model_cls(**obj_dict)
    else:
        return model_cls.objects.get(id=obj_id)
# celery job
import logging

logger = logging.getLogger(__name__)

@app.task
def store_objects():
    logger.info("-" * 25)
    # you can use .bulk_create() to reduce DB hits and speed up the inserts
    for obj_id in cache.keys("foo_*"):  # cache.keys() requires django-redis
        CertainObject.objects.create(**cache.get(obj_id))
        cache.delete(obj_id)
    logger.info("-" * 25)
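For completeness, here is one possible sketch of the write side that the snippets above assume (the cache_object name and the "foo_" key prefix are my assumptions, chosen to match the store_objects job; since the row doesn't exist in the database yet, the key is based on a client-generated UUID rather than a primary key):
import uuid
from django.core.cache import cache

def cache_object(field_values, timeout=3600):
    # hold a not-yet-persisted record in the cache for up to an hour
    key = "foo_%s" % uuid.uuid4().hex
    cache.set(key, field_values, timeout=timeout)  # field_values: dict of model field values
    return key
The caller would then hand this cache key (rather than a database id) to the async job, and get_object above would rebuild a model instance from the cached dict.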

The simplest solution would be to catch any DoesNotExist errors thrown at the start of the task, then schedule a retry. This can be done by converting run_async_job into a Bound Task:
@app.task(bind=True)
def run_async_job(self, object_id):
    try:
        instance = CertainObject.objects.get(id=object_id)
    except CertainObject.DoesNotExist as exc:
        # re-enqueue the task with the same arguments
        raise self.retry(exc=exc)
    # ... process instance ...

This article goes pretty deep into how you can handle read-after-write issues with replicated databases: https://medium.com/box-tech-blog/how-we-learned-to-stop-worrying-and-read-from-replicas-58cc43973638.
Like the author, I know of no foolproof catch-all way to handle read-after-write inconsistency.
The main strategy I've used before is to have some kind of expect_and_get(pk, max_attempts=10, delay_seconds=5) method that tries to fetch the record up to max_attempts times, delaying delay_seconds seconds between attempts (sketched below). The idea is that it "expects" the record to exist, and so it treats a certain number of failures as just transient DB issues. It's a little more reliable than just sleeping for some time, since it will pick up records quicker and hopefully delay the job execution much less often.
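A minimal sketch of that helper, using the CertainObject model from the question:
import time

def expect_and_get(pk, max_attempts=10, delay_seconds=5):
    # fetch a record we "expect" to exist, tolerating replica lag by retrying
    for attempt in range(1, max_attempts + 1):
        try:
            return CertainObject.objects.get(pk=pk)
        except CertainObject.DoesNotExist:
            if attempt == max_attempts:
                raise
            time.sleep(delay_seconds)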
Another strategy would be to delay returning from a special save_to_read method until read replicas have the value, either by synchronously pushing the new value to the read replicas somehow or just polling them all until they return the record. This way seems a little hackier IMO.
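A rough sketch of the polling variant, purely as an illustration (the replica alias names below are made up; real code would take them from settings.DATABASES):
import time
from django.db import DEFAULT_DB_ALIAS

READ_REPLICAS = ["replica1", "replica2"]  # hypothetical database aliases

def save_to_read(instance, timeout=10.0, poll_interval=0.5):
    # save on the writer, then block until every replica can see the row
    instance.save(using=DEFAULT_DB_ALIAS)
    deadline = time.monotonic() + timeout
    pending = set(READ_REPLICAS)
    while pending and time.monotonic() < deadline:
        for alias in list(pending):
            if type(instance).objects.using(alias).filter(pk=instance.pk).exists():
                pending.discard(alias)
        if pending:
            time.sleep(poll_interval)
    return not pending  # True if all replicas caught up in time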
For a lot of your reads, you probably don't have to worry about read-after-write consistency:
If we’re rendering the name of the enterprise a user is part of, it’s really not that big a deal if in the incredibly rare occasion that an admin changes it, it takes a minute to have the change propagate to the enterprise’s users.

Related

Django: for loop through parallel process and store values and return after it finishes

I have a for loop in Django. It loops through a list, gets the corresponding data from the database, does some calculation based on the database value, and then appends the result to another list:
def getArrayList(request):
    list_loop = [...]    # set of values to loop through
    store_array = []     # results from the for loop are stored here
    for a in list_loop:
        val_db = SomeModel.objects.filter(somefield=a).first()
        result = calc(val_db)   # perform calculation on val_db
        store_array.append(result)
The list has 10,000 entries. If the user wants this request, they are prepared to wait and will be informed that it will take time.
I have tried joblib with backend=threading; it doesn't save much time compared to the normal loop.
But when I try backend=multiprocessing, it says "Apps aren't loaded yet".
I read that multiprocessing is not possible in module-based files.
So I am looking at Celery now. I am not sure how this can be done in Celery.
Can anyone guide me on how to speed up the for loop calculation using the available multiprocessing techniques?
You're very likely looking for the wrong solution. But then again - this is pseudocode, so we can't be sure.
In either case, your pseudocode is a self-fulfilling prophecy, since you run queries in a for loop. That means network latency, result set fetching, tying up database resources, and so on. This is never a good pattern; at best it's a last resort.
The simple solution is to get all values in one query:
list_values = [...]
results = []
db_values = SomeModel.objects.filter(field__in=list_values)
for value in db_values:
    results.append(calc(value))
If for some reason you need to loop, then to do this in Celery you would mark the function as a task (there are plenty of examples to find). It won't speed anything up - it will just run in the background, so you render a "please wait" message and somehow need to notify the user when the job is done.
I'm saying somehow, because there isn't a really good integration package that I'm aware of that ties in all the components. There's django-notifications-hq, but if this is your only background task, it's a lot of extra baggage just for that - so you may want to change the notification part to "we will send you an email when the job is done", because that's easy to achieve inside your function.
And thirdly, if this is simply creating a report, that doesn't need things like automatic retries on failure, then you can simply opt to use Django Channels and a browser-native websocket to start and report on the job (which also allows you to send email).
You could try concurrent.futures.ProcessPoolExecutor, which is a high-level API for processing CPU-bound tasks:
import concurrent.futures

def perform_calculation(item):
    pass  # CPU-bound work goes here

# specify the number of workers (default: number of processors on your machine)
with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    res = executor.map(perform_calculation, tasks)  # tasks: the iterable of inputs
EDIT
For I/O-bound operations, you could make use of ThreadPoolExecutor to open a few connections in parallel. You can wrap the pool in a context manager that handles the cleanup work for you (closing idle connections). Here is one example, but it handles the connection closing manually.
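A rough sketch of that, reusing SomeModel and calc from this question; the manual connection.close() is there because each worker thread opens its own thread-local database connection:
from concurrent.futures import ThreadPoolExecutor
from django.db import connection

def fetch_and_calc(value):
    try:
        val_db = SomeModel.objects.filter(somefield=value).first()
        return calc(val_db)
    finally:
        # close this thread's connection so idle connections don't pile up
        connection.close()

with ThreadPoolExecutor(max_workers=6) as executor:
    results = list(executor.map(fetch_and_calc, list_loop))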

Django Transactions: How to run extra code during rollback?

Imagine you have a User model in your web app, and that you need to keep this user in sync with an external service via an API. Thus, when you create a user locally, you need to create it remotely as well.
You have all your operations under transaction.atomic() and you try to keep all your 3rd-party API calls after the atomic block, which is reasonable.
But, a system being a system, it grows in complexity until the point where you have some really hard-to-remove 3rd-party calls within an update call.
That said, is there a way to extend Django's transaction mechanism, kind of adding some callback functions, like rollback.add_callback(clean_3rdparty_user(user_id=134))?
That way I can guarantee that all necessary rollback actions are taken and my system is in sync?
The author of Django's transaction hook code has this to say about why there is on_commit() but not on_rollback():
A rollback hook is even harder to implement robustly than a commit hook, since a variety of things can cause an implicit rollback. For instance, your database connection was dropped because your process was killed without a chance to shutdown gracefully: your rollback hook will never run.
Since rollbacks are typically triggered by an exception, a simple approach is to just catch any exceptions and run your undo code there.
from django.db import transaction

try:
    with transaction.atomic():
        ...  # do database stuff
        ...  # do external stuff
except Exception:
    # We know the database stuff has rolled back, so...
    ...  # undo external stuff
    raise
This is not particularly elegant. I agree with the following from the same source:
The solution is simple: instead of doing something during the atomic block (transaction) and then undoing it if the transaction fails, use on_commit to delay doing it in the first place until after the transaction succeeds. It’s a lot easier to undo something you never did in the first place!
But it sounds like you already agree with that as well.
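For reference, a minimal sketch of that on_commit approach (create_remote_user here stands in for the 3rd-party API call from the question):
from django.contrib.auth import get_user_model
from django.db import transaction

User = get_user_model()

def create_user_and_sync(data):
    with transaction.atomic():
        user = User.objects.create(**data)
        # runs only if the surrounding transaction actually commits;
        # on rollback the callback is discarded, so there is nothing to undo
        transaction.on_commit(lambda: create_remote_user(user.id))
    return user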

Celery tasks per Model Object. Cleanest way to track progress

I have distributed hardware sensor nodes that will be interrogated by Celery tasks. Each sensor node has an associated object holding recent readings and config data.
I never want more than one Celery task interrogating a single sensor node, but requests might come in to interrogate the node while it is still being worked on from a previous request.
I didn't see any example of this sort of task tracking in any of the Celery docs, but I assume it's a fairly common requirement.
My first thought was to just mark the model object at the beginning and end of the task with a task_in_progress-like flag.
Is there anything in the task instantiation that I can use to better realize my task tracking?
What you want is to lock a task on a given resource; there is a very nice example of this in the Celery cookbook.
To summarize: the example suggests using a cache key to hold the lock. A task checks the lock key (you can generate an instance-specific cache key like "sensor-%(id)s") before starting and executes only if the cache key is not set.
Example:
from django.core.cache import cache

def check_sensor(sensor_id):
    lock_key = "sensor-%s" % sensor_id
    # cache.add is atomic: it only sets the key if it doesn't exist yet
    if not cache.add(lock_key, "locked", timeout=60):
        return  # another task is already working on this sensor
    try:
        ...  # use the sensor
    finally:
        cache.delete(lock_key)
You probably want to be really sure to do the unlock properly (try/except/finally).
Here's the Celery example: http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time

How do I detect an aborted connection in Django?

I have a Django view that does some pretty heavy processing and takes around 20-30 seconds to return a result.
Sometimes the user will end up closing the browser window (terminating the connection) before the request completes -- in that case, I'd like to be able to detect this and stop working. The work I do is read-only on the database so there isn't any issue with transactions.
In PHP the connection_aborted function does exactly this. Is this functionality available in Django?
Here's example code I'd like to write:
def myview(request):
    while not connection_aborted():
        # do another bit of work...
        if work_complete:
            return HttpResponse('results go here')
Thanks.
I don't think Django provides it because it basically can't. More than Django itself, this depends on the way Django interfaces with your web server. All this depends on your software stack (which you have not specified). I don't think it's even part of the FastCGI and WSGI protocols!
Edit: I'm also pretty sure that Django does not start sending any data to the client until your view finishes execution, so it can't possibly know if the connection is dead. The underlying socket won't trigger an error unless the server tries to send some data back to the user.
That connection_aborted method in PHP doesn't do what you think it does. It will tell you if the client disconnected, but only if the buffer has been flushed, i.e. some sort of response has been sent from the server back to the client. The PHP version wouldn't even work as you've written it above. You'd have to add a call to something like flush within your loop to have the server attempt to send data.
HTTP is a stateless protocol. It's designed so that neither the client nor the server depends on the other. As a result, the state of either side is only known when a connection is created, and that only occurs when there's some data to send one way or another.
Your best bet is to do as @MattH suggested and do this through a bit of AJAX, and if you'd like you can integrate something like Node.js to make client "check-ins" during processing. How to set that up properly is beyond my area of expertise, though.
So you have an AJAX view that runs a query taking 20-30 seconds to process, requested in the background of a rendered page, and you're concerned about wasted resources when someone cancels the page load.
I see that you've got options in three broad categories:
1. Live with it. Improve the situation by caching the results in case the user comes back.
2. Make it faster. Throw more space at a time/space trade-off. Maintain intermediate tables. Precalculate the entire thing, etc.
3. Do something clever with the browser fast-polling an "is it ready yet?" query and the server cancelling the query if it doesn't receive a nag within interval * 2 or similar. If you're really clever, you could return progress / ETA to the nags. However, this might not have particularly useful behaviour when the system is under load or your site is being accessed over limited bandwidth.
I don't think you should go for option 3 because it's increasing complexity and resource usage for not much gain.

Django: start a process in a background thread?

I'm trying to work out how to run a process in a background thread in Django. I'm new to both Django and threads, so please bear with me if I'm using the terminology wrong.
Here's the code I have. Basically I'd like start_processing to begin as soon as the success function is triggered. However start_processing is the kind of function that could easily take a few minutes or fail (it's dependent on an external service over which I have no control), and I don't want the user to have to wait for it to complete successfully before the view is rendered. ('Success' as far as they are concerned isn't dependent on the result of start_processing; I'm the only person who needs to worry if it fails.)
def success(request, filepath):
    start_processing(filepath)
    return render_to_response('success.html', context_instance=RequestContext(request))
From the Googling I've done, most people suggest that background threads aren't used in Django, and instead a cron job is more suitable. But I would quite like start_processing to begin as soon as the user gets to the success function, rather than waiting until the cron job runs. Is there a way to do this?
If you really need a quick hack, simply start a process using subprocess.
But I would not recommend spawning a process (or even a thread), especially if your web site is public: in case of high load (which could be "natural" or the result of a trivial DoS attack), you would be spawning many processes or threads, which would end up using up all your system resources and killing your server.
I would instead recommend using a job server: I use Celery (with Redis as the backend), it's very simple and works just great. You can check out many other job servers, such as RabbitMQ or Gearman. In your case, a job server might be overkill: you could simply run Redis and use it as a light-weight message server. Here is an example of how to do this.
Cheers
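For reference, a minimal sketch of the Celery route suggested above; it assumes a Celery app configured with a Redis broker and reuses start_processing and the view from the question (the import path for start_processing is made up):
# tasks.py
from celery import shared_task
from myapp.processing import start_processing  # hypothetical location of the function

@shared_task
def start_processing_task(filepath):
    start_processing(filepath)  # the slow, failure-prone call

# views.py
from django.shortcuts import render_to_response
from django.template import RequestContext
from myapp.tasks import start_processing_task

def success(request, filepath):
    start_processing_task.delay(filepath)  # enqueue and return immediately
    return render_to_response('success.html', context_instance=RequestContext(request))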
In case someone really wants to run another thread:
from django.http import HttpResponse

def background_process():
    import time
    print("process started")
    time.sleep(100)
    print("process finished")

def index(request):
    import threading
    t = threading.Thread(target=background_process, args=(), kwargs={})
    t.daemon = True  # setDaemon() is deprecated; set the attribute instead
    t.start()
    return HttpResponse("main thread content")
This will return the response first, then print "process finished" to the console, so the user will not face any delay.
Using Celery is definitely a better solution. However, installing Celery could be unnecessary for a very small project with a limited server etc.
You may also need to use threads in a big project, because running Celery on all your servers is not a good idea; then there won't be a way to run a separate process on each server, and you may need threads to handle that case. File system operations might be an example. It's not very likely though, and it is still better to use Celery for long-running processes.
Use wisely.
I'm not sure you need a thread for that. It sounds like you just want to spawn off a process, so look into the subprocess module.
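A rough sketch of that route (process_file.py is a hypothetical standalone script that wraps start_processing):
import subprocess
import sys
from django.shortcuts import render_to_response
from django.template import RequestContext

def success(request, filepath):
    # fire and forget: the child process outlives this request
    subprocess.Popen([sys.executable, "process_file.py", filepath])
    return render_to_response('success.html', context_instance=RequestContext(request))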
IIUC, the problem here is that the web server process might not like extra long-running threads; it might kill/spawn server processes as demand goes up and down, and so on.
You're probably better off communicating with an external service process for this type of processing, instead of embedding it in the webserver's wsgi/fastcgi process.
If the only thing you're sending over is the filepath, it ought to be pretty easy to write that service app.