Django non-blocking save?

Is there a way to call save() on a model in Django without waiting for a response from the database?
You could consider this async, though I need less than that: async calls usually give you a callback, which I don't need here.
So basically I want to call
SomeModel.objects.bulk_create([list of objects]) every, say, 1000 objects,
without this line blocking my code. I will have no use for these rows in my code.
I'm looking for something simple; a package like Celery seems to offer way more than this.

As of 2016, Django is a web framework that (for the moment, ignoring Channels) takes an HTTP request "as argument" and returns an HTTP response as soon as possible.
This architecture means there is no concept of asynchronous operation in the framework. If you want to delay saving and return a response to the user without waiting, you can:
either run another thread/async block (which can be tedious with database transactions...);
use a service like IronWorker that allows you to queue operations to run asynchronously as soon as possible;
use Celery, which may bring too many features for your case but will do a better job than some homemade solution.
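For the first option, a minimal sketch of a background writer thread could look like the following. It assumes a hypothetical save_batch callable (in a Django project this might be something like SomeModel.objects.bulk_create); BackgroundWriter, BATCH_SIZE and the other names are illustrative and not part of Django.

```python
import queue
import threading

BATCH_SIZE = 1000  # illustrative: flush every 1000 objects
_STOP = object()   # sentinel that tells the worker thread to exit


class BackgroundWriter:
    """Collects objects and hands them to save_batch from a worker thread."""

    def __init__(self, save_batch, batch_size=BATCH_SIZE):
        # save_batch is a hypothetical callable, e.g. SomeModel.objects.bulk_create
        self._save_batch = save_batch
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def add(self, obj):
        # Returns immediately; the caller never waits for the database.
        self._queue.put(obj)

    def close(self):
        # Flush whatever is left and wait for the worker to finish.
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._save_batch(batch)
                batch = []
        if batch:
            self._save_batch(batch)
```

Django keeps one database connection per thread, so the worker thread would use its own connection. Anything still sitting in the queue is lost if the process exits before close() is called, which is one reason a real task queue tends to be more robust than this kind of homemade solution.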

rq (Redis Queue) is another option for asynchronous operations (apart from those that Maxime Lorant mentions in his answer). It uses Redis as a broker (the middleman that holds the tasks), so if you are already using Redis, or if you would like to add it to your project, you should consider it. It's a nice and simple solution, much simpler than Celery. There is also django-rq, a simple app that provides Django integration for rq.
Update:
Summarizing comments
django_rq provides a management command (rqworker) that starts a worker process. Any job that is put in the queue will be executed by this process. You can either send one job to the queue for each object (the job would be a function that takes an object as an argument and saves it to the database), or collect a list of objects and send a single job with this list. In the second case you need to temporarily store this list somewhere, which might be tricky.
Using Redis to temporarily store the objects (recommended)
I think that the most robust way to do it is to serialize the objects to JSON and store them in a Redis list. Then regularly check its length, and when it reaches the desired length, send a job to the queue with this list in its arguments.
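The buffering logic described above can be sketched roughly as follows. The function is written against any object r that offers the Redis list commands rpush, llen, lrange and ltrim (as a redis-py client does), and against a hypothetical enqueue callable standing in for something like django_rq's enqueue. BUFFER_KEY, buffer_object and InMemoryList are made-up names, and InMemoryList is only a stand-in so the sketch runs without a Redis server.

```python
import json

BUFFER_KEY = "pending_objects"  # hypothetical Redis key
BATCH_SIZE = 1000


def buffer_object(r, enqueue, data, batch_size=BATCH_SIZE):
    """Append one serialized object; enqueue a batch job once enough piled up."""
    r.rpush(BUFFER_KEY, json.dumps(data))
    if r.llen(BUFFER_KEY) >= batch_size:
        raw = r.lrange(BUFFER_KEY, 0, batch_size - 1)  # Redis ranges are inclusive
        r.ltrim(BUFFER_KEY, batch_size, -1)            # keep only the remainder
        enqueue([json.loads(item) for item in raw])


class InMemoryList:
    """Stand-in for the few Redis list commands used above (demo only)."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, value):
        self._lists.setdefault(key, []).append(value)

    def llen(self, key):
        return len(self._lists.get(key, []))

    def lrange(self, key, start, stop):
        # Only non-negative, inclusive stop values are supported here.
        return self._lists.get(key, [])[start:stop + 1]

    def ltrim(self, key, start, stop):
        # Only the stop == -1 ("to the end") case is supported here.
        self._lists[key] = self._lists.get(key, [])[start:]
```

Note that the length check, read and trim are separate commands here, so with several web processes pushing at once they would need to be made atomic (for example with a Redis transaction) to avoid sending the same objects twice.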
Using the worker's memory to temporarily store the objects
You could also use your worker's RAM as temporary storage. This is possible because the worker process has its own memory. In this case the main process (the runserver) creates a job with an object. The job doesn't save the object; it just adds it to a list. You can keep appending objects to this list. Since the jobs are executed in the worker process, this list exists in the worker's memory. When it reaches the desired length, you can save all the objects.
But imagine the case in which you create more than one worker. In this case each job in the queue will be picked up by whichever worker is currently free. So some objects will be appended to a list in the memory of worker_1, some other objects to the list of worker_2, etc., and you would have to deal with as many lists as workers.

Related

Django: for loop through parallel process and store values and return after it finishes

I have a for loop in Django. It loops through a list, gets the corresponding data from the database, does some calculation based on the database value, and then appends the result to another list:
def getArrayList(request):
    list_loop = [...]  # set of values to loop through
    store_array = []  # values from the for loop are stored here
    for a in list_loop:
        val_db = SomeModel.objects.filter(somefield=a).first()
        result = perform_calculation(val_db)  # perform calculation on val_db
        store_array.append(result)
The list has 10,000 entries. If the user wants this request, he is ready to wait and will be informed that it will take time.
I have tried joblib with backend=threading; it does not save much time compared to a normal loop.
But when I try backend=multiprocessing, it says "Apps aren't loaded yet".
I read that multiprocessing is not possible in module-based files.
So I am looking at Celery now. I am not sure how this can be done in Celery.
Can anyone guide me on how to speed up the for loop calculation using the multiprocessing techniques available?
You're very likely looking for the wrong solution. But then again - this is pseudo code so we can't be sure.
In either case, your pseudo code is a self-fulfilling prophecy, since you run queries in a for loop. That means network latency, result set fetching, tying up database resources etc etc. This is never a good pattern, at best it's a last resort.
The simple solution is to get all values in one query:
list_values = [ ... ]
results = []
db_values = SomeModel.objects.filter(field__in=list_values)
for value in db_values:
    results.append(calc(value))
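One thing to keep in mind: an __in query returns rows in whatever order the database chooses and simply skips values that have no match, whereas the original loop produced exactly one result per input value. A minimal sketch of mapping the rows back to the input order is shown below. It uses the standard-library sqlite3 module as a stand-in for the ORM so that it runs on its own; the table some_model, its columns, and the calc placeholder are made up for illustration.

```python
import sqlite3


def calc(value):
    # Placeholder for the real calculation.
    return value * 2


def results_in_input_order(conn, list_values):
    """Fetch all matching rows in one query, then return one result per input value."""
    if not list_values:
        return []
    placeholders = ",".join("?" for _ in list_values)
    rows = conn.execute(
        "SELECT somefield, val FROM some_model "
        f"WHERE somefield IN ({placeholders}) ORDER BY id",
        list_values,
    ).fetchall()
    by_field = {}
    for somefield, val in rows:
        # Keep the first row (lowest id) per value, similar to .first() in the loop.
        by_field.setdefault(somefield, val)
    # Values without a matching row yield None instead of being dropped.
    return [calc(by_field[v]) if v in by_field else None for v in list_values]
```

Databases cap the number of parameters in a single query, so a very long list may need to be split into chunks, each fetched with its own IN query.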
If for some reason you need to loop, then to do this in Celery you would mark the function as a task (there are plenty of examples to find). It won't speed anything up, but it will run in the background, so you can render a "please wait" message; you then somehow need to notify the user again when the job is done.
I'm saying somehow, because there isn't a really good integration package that I'm aware of that ties in all the components. There's django-notifications-hq, but if this is your only background task, it's a lot of extra baggage just for that, so you may want to change the notification part to "we will send you an email when the job is done", because that's easy to achieve inside your function.
And thirdly, if this is simply creating a report, that doesn't need things like automatic retries on failure, then you can simply opt to use Django Channels and a browser-native websocket to start and report on the job (which also allows you to send email).
You could try concurrent.futures.ProcessPoolExecutor, which is a high-level API for running CPU-bound tasks in parallel:
import concurrent.futures
def perform_calculation(item):
    pass
# tasks is the iterable of items to process
# max_workers defaults to the number of processors on your machine
with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    res = executor.map(perform_calculation, tasks)
EDIT
In case of IO-bound operations, you could make use of ThreadPoolExecutor to open a few connections in parallel. You can wrap the pool in a context manager that handles the cleanup work for you (closing idle connections). Here is one example, but it handles the connection closing manually.
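Independently of that example, a minimal sketch of the ThreadPoolExecutor approach might look like this. fetch_and_calculate is a hypothetical stand-in for the real per-item work (a database lookup plus a calculation), with time.sleep simulating the IO wait; run_in_threads is likewise an illustrative name.

```python
import concurrent.futures
import time


def fetch_and_calculate(item):
    # Stand-in for an IO-bound step such as a database query.
    time.sleep(0.01)
    return item * item


def run_in_threads(items, max_workers=4):
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() yields results in the same order as the inputs.
        return list(executor.map(fetch_and_calculate, items))
```

In a Django project each worker thread would open its own database connection, so the per-item function would probably need to close it when done (for instance via django.db.connection.close()); treat that detail as an assumption to verify against your setup.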

Notifying a task from multiple other tasks without extra work

My application is futures-based with async/await, and has the following structure within one of its components:
a "manager", which is responsible for starting/stopping/restarting "workers", based both on external input and on the current state of "workers";
a dynamic set of "workers", which perform some continuous work, but may fail or be stopped externally.
A worker is just a spawned task which does some I/O work. Internally it is a loop which is intended to be infinite, but it may exit early due to errors or other reasons, and in this case the worker must be restarted from scratch by the manager.
The manager is implemented as a loop which awaits on several channels, including one returned by async_std::stream::interval, which essentially makes the manager into a poller - and indeed, I need this because I do need to poll some Mutex-protected external state. Based on this state, the manager, among everything else, creates or destroys its workers.
Additionally, the manager stores a set of async_std::task::JoinHandles representing live workers, and it uses these handles to check whether any workers have exited, restarting them if so. (BTW, I do this currently using select(handle, future::ready()), which is totally suboptimal because it relies on the select implementation detail, specifically that it polls the left future first. I couldn't find a better way of doing it; something like race() would make more sense, but race() consumes both futures, which won't work for me because I don't want to lose the JoinHandle if it is not ready. This is a matter for another question, though.)
You can see that in this design workers can only be restarted when the next poll "tick" in the manager occurs. However, I don't want to use too small an interval for polling, because in most cases polling just wastes CPU cycles. Large intervals, however, can delay restarting a failed/canceled worker by too much, leading to undesired latencies. Therefore, I thought I'd set up another channel of ()s back from each worker to the manager, which I'd add to the main manager loop, so when a worker stops due to an error or otherwise, it will first send a message to its channel, resulting in the manager being woken up earlier than the next poll in order to restart the worker right away.
Unfortunately, with any kinds of channels this might result in more polls than needed, in case two or more workers stop at approximately the same time (which due to the nature of my application, is somewhat likely to happen). In such case it would make sense to only run the manager loop once, handling all of the stopped workers, but with channels it will necessarily result in the number of polls equal to the number of stopped workers, even if additional polls don't do anything.
Therefore, my question is: how do I notify the manager from its workers that they are finished, without resulting in extra polls in the manager? I've tried the following things:
As explained above, regular unbounded channels just won't work.
I thought that maybe bounded channels could work - if I used a channel with capacity 0, and there was a way to try and send a message into it but just drop the message if the channel is full (like the offer() method on Java's BlockingQueue), this seemingly would solve the problem. Unfortunately, the channels API, while providing such a method (try_send() seems to be like it), also has this property of having capacity larger than or equal to the number of senders, which means it can't really be used for such notifications.
Some kind of atomic or mutex-protected boolean flag also looks as if it could work, but there is no atomic or mutex API which would provide a future to wait on, so it would also require polling.
Restructure the manager implementation to include JoinHandles into the main select somehow. It might do the trick, but it would result in large refactoring which I'm unwilling to make at this point. If there is a way to do what I want without this refactoring, I'd like to use that first.
I guess some kind of combination of atomics and channels might work, something like setting an atomic flag and sending a message, and then skipping any extra notifications in the manager based on the flag (which is flipped back to off after processing one notification), but this also seems like a complex approach, and I wonder if anything simpler is possible.
I recommend using the FuturesUnordered type from the futures crate. This collection allows you to push many futures of the same type into a collection and wait for any one of them to complete at once.
It implements Stream, so if you import StreamExt, you can use unordered.next() to obtain a future that completes once any future in the collection completes.
If you also need to wait for a timeout or mutex etc., you can use select to create a future that completes once either the timeout or one of the join handles completes. The future returned by next() implements Unpin, so it is usable with select without problems.

How do I add simple delayed tasks in Django?

I am creating a chatbot and need a solution to send messages to the user in the future after a specific delay. I have my system set up with Nginx, Gunicorn and Django. The idea is that if the bot needs to send the user several messages, it can delay each subsequent message by a certain amount of time before it sends it to seem more 'human'.
However, a simple threading.Timer approach won't work because the user might interrupt this process at any moment prompting future messages to be changed, but the timer threads might not be available to be stopped as they are on a different worker. So far I have come across two solutions:
Use threading.Timer blindly to check a to-send list in the database. This can create problems with lots of unneeded threads, and it also makes the database less clean/organized.
Use celery or some other system to execute these future tasks. Seems like overkill and over-engineering a simple problem. Tasks will always just be delayed function calls. Also a hassle dealing with which messages belong to which conversation.
What would be the best solution for this problem?
Also, a more generic question:
Ideally the best solution would be a framework where I can 'simulate' a new bot for each conversation so it acts as its own entity and holds all the state/message queue information in memory for itself. It would be necessary for this framework to only allocate resources to a bot when it needs to do something based on a preset delay or incoming message. Is there anything that exists like this?
Personally I would use Celery for this; executing delayed function calls is its job. And I don't know why knowing what messages belong where would be more of a problem there than doing it in a thread.
But you might also want to investigate the new Django-Channels work that Andrew Godwin is doing, since that is intended to support async background tasks.

Doubts regarding to table, producer/consumer locking solutions

Context: Let's say we have a service with 500 connected clients that are constantly active, and I want to log most of that activity by inserting it into a MySQL InnoDB table. The server runs on a single thread.
If I select data from that table from an external program (a website), will that cause the table to be locked?
If it does, I assume the server won't be able to insert or update data until the external program's select has finished.
The first thing that came to my mind was to implement a producer-consumer scheme, where one side pushes data into a queue and the other inserts that data into the database; that way, while an external program is selecting data, only the consumer would stall and not the whole server.
I've seen some consumer/producer examples where the producer is not able to push data while it is being processed. In this case, is it OK to make two containers and simply push to the one that is not being used? If the producer cannot push into the queue while the consumer is processing the data, I doubt the efficiency of that design.
Also, I have been looking into this example:
http://www.codeproject.com/Articles/43510/Lock-Free-Single-Producer-Single-Consumer-Circular
If it works as it describes, is there anything I should be worried about? Is there something I'm missing?
If the select query takes too long and the insert query times out, would it be advisable to increase the timeout value or to retry the query on failure?
Thanks

Celery tasks per Model Object. Cleanest way to track progress

I have distributed hardware sensor nodes that will be interrogated by Celery tasks. Each sensor node has an associated object holding recent readings and config data.
I never want more than one Celery task interrogating a single sensor node. But requests might come in to interrogate the node while it is still being worked on from a previous request.
I didn't see any example of this sort of task tracking in any of the Celery docs, but I assume it's a fairly common requirement.
My first thought was to just mark the model object at the beginning and end of the task with a task_in_progress-like flag.
Is there anything in the task instantiation that I can use to better realize my task tracking?
What you want is to lock a task on a given resource; there is a very nice example in the Celery cookbook.
To summarize, the example suggests using a cache key to hold the lock. A task checks the lock key (you can generate an instance-specific cache key like "sensor-%(id)s") before starting and executes only if the cache key is not set.
Example:
def check_sensor(sensor_id):
    if check_lock_from_cache(sensor_id):
        ...  # handle the lock
    else:
        lock(sensor_id)
        try:
            ...  # use the sensor
        finally:
            unlock(sensor_id)
You want to be really sure that the unlock always happens, hence the try/finally.
Here's the Celery example: http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time
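A separate check followed by a lock leaves a small window in which two tasks can both see the key as unset. As a rough sketch of one way to close that window, the version below relies on an add() operation that sets the key only if it is absent and reports whether it did so, which is how Django's cache.add() behaves. DictCache is a made-up in-memory stand-in so the sketch runs on its own (it is not itself safe across processes), and sensor_lock, check_sensor and interrogate are illustrative names.

```python
import contextlib


class DictCache:
    """In-memory stand-in exposing add()/delete(), loosely like Django's cache API."""

    def __init__(self):
        self._data = {}

    def add(self, key, value):
        # Set the key only if it is absent; report whether we set it.
        if key in self._data:
            return False
        self._data[key] = value
        return True

    def delete(self, key):
        self._data.pop(key, None)


@contextlib.contextmanager
def sensor_lock(cache, sensor_id):
    key = "sensor-%s" % sensor_id
    acquired = cache.add(key, "locked")
    try:
        yield acquired
    finally:
        # Release only a lock that this caller actually holds.
        if acquired:
            cache.delete(key)


def check_sensor(cache, sensor_id, interrogate):
    with sensor_lock(cache, sensor_id) as acquired:
        if not acquired:
            return None  # another task is already working on this sensor
        return interrogate(sensor_id)
```

With a real cache backend you would likely also give the lock key a timeout, so that a worker that crashes while holding the lock does not block the sensor forever.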