Celery tasks per Model Object. Cleanest way to track progress - django

I have distributed hardware sensor nodes that will be interrogated by Celery tasks. Each sensor node has an associated object holding recent readings and config data.
I never want more than one Celery task interrogating a single sensor node. But requests to interrogate the node might come in while it is still being worked on from a previous request.
I didn't see any example of this sort of task tracking in any of the Celery docs, but I assume it's a fairly common requirement.
My first thought was to just mark the model object at the beginning and end of the task with a task_in_progress-style flag.
Is there anything in the task instantiation that I can use to better realize my task tracking?

What you want is to lock a task on a given resource; there is a very nice example of this in the Celery cookbook.
To summarize: the example suggests using a cache key to hold the lock. A task checks the lock key (you can generate an instance-specific cache key like "sensor-%(id)s") before starting, and executes only if the cache key is not set.
For example:

def check_sensor(sensor_id):
    if check_lock_from_cache(sensor_id):
        ...  # handle the lock: skip, wait, or reschedule
    else:
        lock(sensor_id)
        ...  # use the sensor
        unlock(sensor_id)
You probably want to be really sure to do the unlock properly (try/except/finally).
Here's the Celery example: http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time
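A minimal sketch of that cookbook pattern using Django's cache - cache.add() sets the key only if it is absent and reports whether it did (atomic on backends like memcached); the task body is an assumption, not the cookbook's exact code:

from celery import shared_task
from django.core.cache import cache

LOCK_TTL = 60 * 10  # let stale locks expire after 10 minutes

@shared_task
def check_sensor(sensor_id):
    lock_key = "sensor-%s" % sensor_id
    # cache.add() only sets the key if it does not already exist, and
    # returns True if it did, which makes it usable as a simple lock
    if not cache.add(lock_key, "locked", LOCK_TTL):
        return  # another task is already interrogating this node
    try:
        pass  # interrogate the sensor node here
    finally:
        cache.delete(lock_key)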

Related

Django, multi-databases (writer, read-replicas) and a sync issue

So... in response to an API call I do:
i = CertainObject(paramA=1, paramB=2)
i.save()
Now my writer database has a new record.
Processing can take a bit, and I do not wish to hold off my response to the API caller, so on the next line I transfer the object ID to an async job using Celery:
run_async_job.delay(i.id)
Right away, or a few seconds later depending on the queue, run_async_job tries to load up the record from the database with the ID provided. It's a gamble: sometimes it works, sometimes it doesn't, depending on whether the read replicas have updated or not.
Is there a pattern to guarantee success without having to "sleep" for a few seconds before reading, or hoping for good luck?
Thanks.
The simplest way seems to be using retries, as mentioned by Greg and Elrond in their answers. If you're using the shared_task or @app.task decorators, you can use the following code snippet.
from celery import shared_task

@shared_task(bind=True)
def your_task(self, certain_object_id):
    try:
        certain_obj = CertainObject.objects.get(id=certain_object_id)
        # Do your stuff
    except CertainObject.DoesNotExist as e:
        self.retry(exc=e, countdown=2 ** self.request.retries, max_retries=20)
I used an exponential countdown between retries. You can modify it according to your needs.
You can find the documentation for custom retry delays here.
There is also another document explaining exponential backoff at this link.
When you call retry it'll send a new message, using the same task-id, and it'll take care to make sure the message is delivered to the same queue as the originating task. You can read more about this in the documentation here.
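As an aside - and this assumes a reasonably recent Celery (4.2+) - the same exponential backoff can be declared on the decorator with autoretry_for and retry_backoff:

from celery import shared_task

@shared_task(autoretry_for=(CertainObject.DoesNotExist,),
             retry_backoff=2, max_retries=20)
def your_task(certain_object_id):
    # retried automatically with delays of 2, 4, 8, ... seconds
    certain_obj = CertainObject.objects.get(id=certain_object_id)
    # Do your stuff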
Since writing and then immediately loading the record is a high priority, why not store it in a memory-based DB like Memcached or Redis? Then, after some time, you can write it to the database using a periodic Celery job that runs, let's say, every minute. When it is done writing to the DB, it will delete the keys from Redis/Memcached.
You can keep the data in the memory-based DB for a certain time, let's say one hour, when the data is needed most. You can also create a service method that checks whether the data is in memory or not.
django-redis is a great package for connecting to Redis (if you are using it as the broker in Celery).
Here is an example based on Django's cache:
# service method
from django.core.cache import cache

def get_object(obj_id, model_cls):
    obj_dict = cache.get(obj_id, None)  # checks if the obj id is in the cache, O(1) complexity
    if obj_dict:
        return model_cls(**obj_dict)
    else:
        return model_cls.objects.get(id=obj_id)

# celery job
@app.task
def store_objects():
    logger.info("-" * 25)
    # you can use .bulk_create() to reduce DB hits and get faster DB entries
    for obj_id in cache.keys("foo_*"):  # cache.keys() is provided by django-redis
        CertainObject.objects.create(**cache.get(obj_id))
        cache.delete(obj_id)
    logger.info("-" * 25)
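For completeness, a sketch of what the write side could look like - the foo_ key prefix matches the snippet above, but the helper name and the one-hour timeout are assumptions:

import uuid
from django.core.cache import cache

def stash_object(paramA, paramB):
    # park the new object in the cache under a foo_* key instead of
    # saving it; the periodic store_objects job persists it later
    obj_id = "foo_%s" % uuid.uuid4().hex
    cache.set(obj_id, {"paramA": paramA, "paramB": paramB}, timeout=60 * 60)
    return obj_id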
The simplest solution would be to catch any DoesNotExist errors thrown at the start of the task, then schedule a retry. This can be done by converting run_async_job into a Bound Task:
@app.task(bind=True)
def run_async_job(self, object_id):
    try:
        instance = CertainObject.objects.get(id=object_id)
    except CertainObject.DoesNotExist:
        # retry() re-sends the task with its original arguments
        return self.retry()
This article goes pretty deep into how you can handle read-after-write issues with replicated databases: https://medium.com/box-tech-blog/how-we-learned-to-stop-worrying-and-read-from-replicas-58cc43973638.
Like the author, I know of no foolproof catch-all way to handle read-after-write inconsistency.
The main strategy I've used before is to have some kind of expect_and_get(pk, max_attempts=10, delay_seconds=5) method that attempts to fetch the record, and attempts it max_attempts times, delaying delay_seconds seconds in between attempts. The idea is that it "expects" the record to exist, and so it treats a certain number of failures as just transient DB issues. It's a little more reliable than just sleeping for some time since it will pick up records quicker and hopefully delay the job execution much less often.
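A minimal sketch of such a helper, assuming a Django model class - the name and parameters come from above, the retry wiring is mine:

import time

def expect_and_get(model_cls, pk, max_attempts=10, delay_seconds=5):
    # "expect" the record to exist, so treat a miss as transient replica lag
    for attempt in range(max_attempts):
        try:
            return model_cls.objects.get(pk=pk)
        except model_cls.DoesNotExist:
            if attempt == max_attempts - 1:
                raise  # give up: the record really seems to be missing
            time.sleep(delay_seconds)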
Another strategy would be to delay returning from a special save_to_read method until read replicas have the value, either by synchronously pushing the new value to the read replicas somehow or just polling them all until they return the record. This way seems a little hackier IMO.
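A rough sketch of that second idea using Django's .using() to poll the replicas directly - the name save_to_read is from above, while the body and the replica aliases (which would come from your DATABASES setting) are assumptions:

import time

def save_to_read(instance, replica_aliases, timeout_seconds=10):
    instance.save()  # goes to the default (writer) database
    model_cls = type(instance)
    pending = set(replica_aliases)
    deadline = time.monotonic() + timeout_seconds
    while pending and time.monotonic() < deadline:
        for alias in list(pending):
            # ask each replica directly whether it can see the new row yet
            if model_cls.objects.using(alias).filter(pk=instance.pk).exists():
                pending.remove(alias)
        time.sleep(0.1)
    return not pending  # True once every replica has caught up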
For a lot of your reads, you probably don't have to worry about read-after-write consistency:
If we’re rendering the name of the enterprise a user is part of, it’s really not that big a deal if in the incredibly rare occasion that an admin changes it, it takes a minute to have the change propagate to the enterprise’s users.

Django: for loop through parallel process and store values and return after it finishes

I have a for loop in Django. It loops through a list, gets the corresponding data from the database, does some calculation based on the database value, and then appends the result to another list:
def getArrayList(request):
    list_loop = [...]   # set of values to loop through
    store_array = []    # store values here from the for loop
    for a in list_loop:
        val_db = SomeModel.objects.filter(somefield=a).first()
        result = ...    # perform calculation on val_db
        store_array.append(result)
The list has 10,000 entries. If the user wants this request, they are ready to wait and will be informed that it will take time.
I have tried joblib with backend=threading; it's not saving much time compared to the normal loop.
But when I try backend=multiprocessing, it says "Apps aren't loaded yet".
I read that multiprocessing is not possible in module-based files.
So I am looking at Celery now. I am not sure how this can be done in Celery.
Can anyone guide me on how to speed up the for loop calculation using the multiprocessing techniques available?
You're very likely looking for the wrong solution. But then again - this is pseudo code so we can't be sure.
In either case, your pseudo code is a self-fulfilling prophecy, since you run queries in a for loop. That means network latency, result-set fetching, tying up database resources, etc. This is never a good pattern; at best it's a last resort.
The simple solution is to get all values in one query:
list_values = [...]
results = []
db_values = SomeModel.objects.filter(field__in=list_values)
for value in db_values:
    results.append(calc(value))
If for some reason you need to loop, then to do this in Celery you would mark the function as a task (plenty of examples to find; a minimal sketch follows at the end of this answer). But you won't speed anything up - it will just run in the background, so you render a "please wait" message and somehow you need to notify the user again when the job is done.
I'm saying somehow, because there isn't a really good integration package that I'm aware of that ties in all the components. There's django-notifications-hq, but if this is your only background task, it's a lot of extra baggage just for that - so you may want to change the notification part to "we will send you an email when the job is done", cause that's easy to achieve inside your function.
And thirdly, if this is simply creating a report that doesn't need things like automatic retries on failure, then you can opt to use Django Channels and a browser-native WebSocket to start and report on the job (which also allows you to send email).
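For what it's worth, a minimal sketch of such a task (names are illustrative, and the notification part is left out for the reasons above):

from celery import shared_task

@shared_task
def get_array_list_task(list_loop):
    # same single-query logic as above, just run in the background;
    # enqueue it with get_array_list_task.delay(list_loop)
    db_values = SomeModel.objects.filter(field__in=list_loop)
    return [calc(value) for value in db_values]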
You could try concurrent.futures.ProcessPoolExecutor, which is a high-level API for processing CPU-bound tasks:
import concurrent.futures

def perform_calculation(item):
    pass

# specify the number of workers (default: number of processors on your machine)
with concurrent.futures.ProcessPoolExecutor(max_workers=6) as executor:
    res = executor.map(perform_calculation, tasks)  # tasks: your iterable of items
EDIT
In case of IO-bound operations, you could make use of ThreadPoolExecutor to open a few connections in parallel. You can wrap the pool in a context manager which handles the cleanup work for you (closing idle connections), but here is one example that handles the connection closing manually:
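Something along those lines as a sketch, reusing SomeModel and calc from the question; closing the connection after every item is the blunt manual variant:

import concurrent.futures
from django.db import connection

def fetch_and_calc(value):
    try:
        val_db = SomeModel.objects.filter(somefield=value).first()
        return calc(val_db)
    finally:
        # each pool thread opens its own Django DB connection; close it so
        # idle connections don't linger (blunt, since it reopens per call)
        connection.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(fetch_and_calc, list_loop))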

Notifying a task from multiple other tasks without extra work

My application is futures-based with async/await, and has the following structure within one of its components:
a "manager", which is responsible for starting/stopping/restarting "workers", based both on external input and on the current state of "workers";
a dynamic set of "workers", which perform some continuous work, but may fail or be stopped externally.
A worker is just a spawned task which does some I/O work. Internally it is a loop which is intended to be infinite, but it may exit early due to errors or other reasons, and in this case the worker must be restarted from scratch by the manager.
The manager is implemented as a loop which awaits on several channels, including one returned by async_std::stream::interval, which essentially makes the manager into a poller - and indeed, I need this because I do need to poll some Mutex-protected external state. Based on this state, the manager, among everything else, creates or destroys its workers.
Additionally, the manager stores a set of async_std::task::JoinHandles representing live workers, and it uses these handles to check whether any worker has exited, restarting them if so. (BTW, I currently do this using select(handle, future::ready()), which is totally suboptimal because it relies on a select implementation detail, namely that it polls the left future first. I couldn't find a better way of doing it; something like race() would make more sense, but race() consumes both futures, which won't work for me because I don't want to lose the JoinHandle if it is not ready. That is a matter for another question, though.)
You can see that in this design workers can only be restarted when the next poll "tick" occurs in the manager. However, I don't want to use too small an interval for polling, because in most cases polling just wastes CPU cycles. Too large an interval, however, can delay restarting a failed/canceled worker by too much, leading to undesired latencies. Therefore, I thought I'd set up another channel of ()s back from each worker to the manager, which I'd add to the main manager loop, so that when a worker stops due to an error or otherwise, it first sends a message to its channel, waking the manager up earlier than the next poll so it can restart the worker right away.
Unfortunately, with any kinds of channels this might result in more polls than needed, in case two or more workers stop at approximately the same time (which due to the nature of my application, is somewhat likely to happen). In such case it would make sense to only run the manager loop once, handling all of the stopped workers, but with channels it will necessarily result in the number of polls equal to the number of stopped workers, even if additional polls don't do anything.
Therefore, my question is: how do I notify the manager from its workers that they are finished, without resulting in extra polls in the manager? I've tried the following things:
As explained above, regular unbounded channels just won't work.
I thought that maybe bounded channels could work - if I used a channel with capacity 0, and there was a way to try to send a message into it but just drop the message if the channel is full (like the offer() method on Java's BlockingQueue), this would seemingly solve the problem. Unfortunately, the channels API, while providing such a method (try_send() seems to be it), also has the property that the capacity must be larger than or equal to the number of senders, which means it can't really be used for such notifications.
Some kind of atomic or mutex-protected boolean flag also looks as if it could work, but there is no atomic or mutex API that provides a future to wait on, so it would also require polling.
Restructure the manager implementation to include JoinHandles into the main select somehow. It might do the trick, but it would result in large refactoring which I'm unwilling to make at this point. If there is a way to do what I want without this refactoring, I'd like to use that first.
I guess some kind of combination of atomics and channels might work, something like setting an atomic flag and sending a message, and then skipping any extra notifications in the manager based on the flag (which is flipped back to off after processing one notification), but this also seems like a complex approach, and I wonder if anything simpler is possible.
I recommend using the FuturesUnordered type from the futures crate. This collection allows you to push many futures of the same type into a collection and wait for any one of them to complete at once.
It implements Stream, so if you import StreamExt, you can use unordered.next() to obtain a future that completes once any future in the collection completes.
If you also need to wait for a timeout or mutex etc., you can use select to create a future that completes once either the timeout or one of the join handles completes. The future returned by next() implements Unpin, so it is usable with select without problems.
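A rough sketch of how the manager loop could look with FuturesUnordered (the worker body and the four-worker setup are placeholders; interval() needs async-std's "unstable" feature, which the question already uses):

use std::time::Duration;
use futures::select;
use futures::stream::{FuturesUnordered, StreamExt};

// Each worker returns its own id when it exits, so the manager knows
// which one to restart.
async fn worker(id: usize) -> usize {
    // ... the worker's I/O loop goes here ...
    id
}

async fn manager() {
    let mut workers = FuturesUnordered::new();
    for id in 0..4 {
        workers.push(async_std::task::spawn(worker(id)));
    }
    let mut ticks = async_std::stream::interval(Duration::from_secs(5)).fuse();
    loop {
        select! {
            done = workers.next() => {
                if let Some(id) = done {
                    // a worker exited: restart it immediately; if several
                    // exit at once, they are drained one by one from the
                    // same stream without extra wasted polls of the state
                    workers.push(async_std::task::spawn(worker(id)));
                }
            },
            _ = ticks.next() => {
                // periodic poll of the Mutex-protected external state
            },
        }
    }
}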

Django non-blocking save?

Is there a way to call save() on a model in Django without waiting for a response from the DB?
You could consider this async, though I need less than that, as async calls usually give you a callback, which I don't need here.
So basically I want:
SomeModel.objects.bulk_create([list of objects]), every, say, 1000 objects,
without this line blocking my code. I will have no use for these rows in my code.
I'm looking for something simple; a package like Celery seems to offer way more than this...
As of 2016, Django is a web framework that (for the moment, if we are ignoring Channels) takes an HTTP request "as argument" and returns an HTTP response as soon as possible.
This architecture means there is no concept of asynchronous operation in the framework. If you want to delay saving and return the response to the user without waiting, you can:
either run another thread/async block (which can be tedious with database transactions...);
use services like IronWorker that allow you to queue operations to run async a.s.a.p.;
use Celery, which may bring too many features for your case but will do a better job than some homemade solution.
rq (Redis Queue) is another option for asynchronous operations (apart from those that Maxime Lorant mentions in his answer). It uses Redis as a broker (the middle man that holds the tasks), so if you are already using Redis or if you would like to add it to your project, you should consider it. It's a nice and simple solution, much simpler than Celery. There is also django-rq, a simple app that provides Django integration for rq.
Update:
Summarizing comments
django_rq provides a management command (rqworker) that starts a worker process. Any job that is put in the queue will be executed by this process. You can either send one job to the queue for each object (a job would be a function with an object in its arguments that saves the object in the database) or collect a list of objects and send a job with this list. In the second case you need to temporarily store this list somewhere, which might be tricky.
Using redis to temporarily store the objects (Recommended)
I think that the most robust way to do it is to serialize the objects to JSON and store them in a redis list. Then regularly check the list's length, and when it has the desired length, you can send a job to the queue with this list in its arguments.
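A rough sketch of that idea (the key name, batch size, and helper names are made up; assumes django-rq plus django-redis for the raw connection):

import json
import django_rq
from django_redis import get_redis_connection

BATCH_SIZE = 1000

def save_later(obj_dict):
    # called from the request cycle instead of save(); never blocks on the DB
    redis = get_redis_connection("default")
    redis.rpush("pending_objects", json.dumps(obj_dict))
    if redis.llen("pending_objects") >= BATCH_SIZE:
        # hand a full batch to a worker and trim it off the list
        batch = redis.lrange("pending_objects", 0, BATCH_SIZE - 1)
        redis.ltrim("pending_objects", BATCH_SIZE, -1)
        django_rq.enqueue(bulk_save, [json.loads(raw) for raw in batch])

def bulk_save(obj_dicts):
    # executed by the rqworker process, outside the request cycle
    SomeModel.objects.bulk_create([SomeModel(**d) for d in obj_dicts])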
Using the worker's memory to temporarily store the objects
You could also use your worker's RAM as temporary storage. This works because the worker process has its own memory. In this case the main process (the runserver) creates a job with an object. The job doesn't save the object; it just adds it to a list. You can keep appending objects to this list. Since the jobs are executed in the worker process, this list exists in the worker's memory. When it has the desired length, you can save all the objects.
But imagine the case in which you create more than one worker. In this case each job in the queue will be picked up by whichever worker is currently free. So some objects will be appended to a list in the memory of worker_1, some others to the list of worker_2, etc., and you would have to deal with as many lists as workers.

CQRS, multiple write nodes for a single aggregate entry, while maintaining concurrency

Let's say I have a command to edit a single entry of an article, called ArticleEditCommand.
User 1 issues an ArticleEditCommand based on V1 of the article.
User 2 issues an ArticleEditCommand based on V1 of the same article.
If I can ensure that my nodes process the older ArticleEditCommand commands first, I can be sure that the command from User 2 will fail because User 1's command will have changed the version of the article to V2.
However, if I have two nodes processing ArticleEditCommand messages concurrently, even though the commands will be taken off the queue in the correct order, I cannot guarantee that the nodes will actually process the first command before the second command, due to a spike in CPU or something similar. I could use a SQL transaction to update an article where version = expectedVersion and make note of the number of records changed, but my rules are more complex and can't live solely in SQL. I would like my entire command-processing logic to be guaranteed not to run concurrently for ArticleEditCommand messages that alter the same article.
I don't want to lock the queue while I process the command, because the point of having multiple command handlers is to handle commands concurrently for scalability. With that said, I don't mind these commands being processed consecutively, but only for a single instance/id of an article. I don't expect a high volume of ArticleEditCommand messages to be sent for a single article.
With that said, here is the question.
Is there a way to handle commands consecutively across multiple nodes for a single unique object (database record), but handle all other commands (distinct database records) concurrently?
Or, is this a problem I created myself because of a lack of understanding of CQRS and concurrency?
Is this a problem that message brokers typically have solved? Such as Windows Service Bus, MSMQ/NServiceBus, etc?
EDIT: I think I know how to handle this now. When User 2 issues the ArticleEditCommand, an exception should be thrown to the user letting them know that there is a currently pending operation on that article that must be completed before they can queue the ArticleEditCommand. That way, there are never two ArticleEditCommand messages in the queue that affect the same article.
First let me say, if you don't expect a high volume of ArticleEditCommand messages being sent, this sounds like premature optimization.
In other solutions, this problem is usually not solved by message brokers but by optimistic locking enforced by the persistence implementation. I don't understand why a simple version field for optimistic locking, which can be trivially handled by SQL, contradicts complicated business logic/updates; maybe you could elaborate more?
It's actually quite simple, and I did just that. Basically, it looks like this (pseudocode):
// message handler
ModelTools.TryUpdateEntity(() =>
{
    var entity = _repo.Get(myId);
    entity.Do(whateverCommand);
    _repo.Save(entity);
}, 10); // retry 10 times before giving up

// repository
long? _version;

public MyObject Get(Guid id)
{
    // query data and version
    _version = data.version;
    return data.ToMyObject();
}

public void Save(MyObject data)
{
    // update the row in the db where version = _version.Value
    if (rowsUpdated == 0)
    {
        // things have changed since we've retrieved the object
        throw new NewerVersionExistsException();
    }
}
ModelTools.TryUpdateEntity and NewerVersionExistsException are part of my CavemanTools general-purpose library (available on NuGet).
The idea is to try doing things normally, then if the object version (rowversion/timestamp in SQL) has changed, we retry the whole operation again after waiting a couple of milliseconds. And that's exactly what the TryUpdateEntity() method does. You can tweak how long to wait between tries or how many times it should retry the operation.
If you need to notify the user, then forget about retrying; just catch the exception directly and then tell the user to refresh or something.
Partition based solution
Achieve node stickiness by routing the incoming commands based on the object's ID (e.g. articleId modulo your number of nodes) to make sure the commands from User 1 and User 2 end up on the same node, then process the commands consecutively. You can choose to process all commands one by one or, if you want to parallelize the execution, partition the commands on something like ID, odd/even, by country, or similar.
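A tiny illustration of that routing rule (Python-style, with a made-up queue naming scheme; each node would consume exactly one of these queues):

def pick_queue(article_id, num_nodes=4):
    # every command for the same article maps to the same queue, so that
    # queue's single consumer handles them consecutively, while commands
    # for different articles still spread across nodes in parallel
    return "article-commands-%d" % (article_id % num_nodes)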
Grid based solution
Use an in-memory data grid (e.g. Hazelcast or Coherence) and use a distributed executor service (http://docs.hazelcast.org/docs/2.0/manual/html/ch09.html#DistributedExecution) or similar to coordinate the command processing across the cluster.
Regardless - before adding this kind of complexity, you should of course ask yourself if it's really a problem if User 2's command were accepted and User 1 got a concurrency error back. As long as User 1's changes are not lost and can be re-applied after a refresh of the article, it might be perfectly fine.