Django job queue for interfacing with Celery

My Django web app's logic is heavily geared towards background task execution (both periodic and standalone, synchronous and asynchronous). All the research seems to point to Celery as the most recommended approach. I plan to eventually deploy on Heroku, and the fact that it supports Celery + Redis (what I'm using for local development) is a big plus for me.
However, I need more extensive scheduling capabilities than Celery provides out of the box. Some of my periodic tasks need schedules like "run on the last Sunday of the month". So I've implemented my own Django models to store a recurrence rule and the other needed parameters.
Now I'm stumped on how to interface my tables with Celery. Ideally I'd like a Job model that holds the schedule, the task to run when it becomes due, and the parameters for that task (sort of like a function pointer in C++). I would then run a daemon that keeps checking the job queue for jobs that have become due; if a job is periodic it creates the next job instance and pushes it onto the queue, then runs the associated task with its parameters using Celery's delay method or similar.
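To illustrate, a minimal sketch of what such a Job model and dispatch step could look like (every field and function name here is a placeholder of mine, not an existing implementation):

```python
import json

from celery import current_app
from django.db import models


class Job(models.Model):
    # Dotted name of the registered Celery task to run
    # (the Python equivalent of the "function pointer" mentioned above).
    task_name = models.CharField(max_length=255)
    # Arguments for the task, stored as JSON text.
    task_args = models.TextField(default='[]')
    task_kwargs = models.TextField(default='{}')
    # Serialized recurrence rule (e.g. an iCal RRULE string); blank = one-off job.
    recurrence_rule = models.TextField(blank=True)
    run_at = models.DateTimeField(db_index=True)
    done = models.BooleanField(default=False)


def dispatch(job):
    """Send the task associated with a due Job to Celery."""
    task = current_app.tasks[job.task_name]  # look the task up by name in the registry
    task.apply_async(args=json.loads(job.task_args),
                     kwargs=json.loads(job.task_kwargs))
```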
Questions:
Does this approach even make sense?
If not, what other alternative approach(es) can I use?
If yes, how do I go about designing that Job/Event queue...
I'd love to hear a better approach to doing this or if there's an existing implementation of a job queue that might be suitable or a way to use celery's job queue itself...
Thanks heaps..

Periodic tasks in Celery work pretty much like this. There's a dedicated scheduler process (celery beat) which simply sends off tasks when they are due.
You can also create new schedulers to use with beat by subclassing the celery.beat.Scheduler class, and you can create custom schedules too (like the crontab schedule that is already built-in) by subclassing celery.schedules.schedule.
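For example, a schedule like "last Sunday of the month" could be sketched roughly as follows. This assumes a recent Celery (3.1+/4.x), timezone-aware UTC datetimes throughout, and the python-dateutil package for the recurrence rule; apart from schedule and schedstate, none of these names come from Celery itself.

```python
from datetime import datetime, timezone

from celery.schedules import schedule, schedstate
from dateutil.rrule import MONTHLY, SU, rrule


class rrule_schedule(schedule):
    """Schedule driven by a dateutil recurrence rule."""

    def __init__(self, rule, **kwargs):
        self.rule = rule
        super().__init__(**kwargs)

    def is_due(self, last_run_at):
        # beat calls this on every tick; return (run_now?, seconds until next check).
        now = self.now()
        next_run = self.rule.after(last_run_at)
        if next_run <= now:
            following = self.rule.after(now)
            return schedstate(is_due=True, next=(following - now).total_seconds())
        return schedstate(is_due=False, next=(next_run - now).total_seconds())


# Usage: "run at 03:00 on the last Sunday of every month".
last_sunday = rrule_schedule(
    rrule(MONTHLY, byweekday=SU(-1), byhour=3, byminute=0, bysecond=0,
          dtstart=datetime(2013, 1, 1, tzinfo=timezone.utc))
)
```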
There's a database-backed scheduler implementation in the django-celery extension (djcelery.schedulers.DatabaseScheduler), which uses many tricks to avoid too frequent polling of the database and so on (sadly it's not well commented).
Scheduler: https://github.com/celery/celery/tree/master/celery/beat.py
schedules: https://github.com/celery/celery/tree/master/celery/schedules.py
DatabaseScheduler: https://github.com/celery/django-celery/tree/master/djcelery/schedulers.py

Related

Django + Celery with long-term scheduled tasks

I'm developing a Django app which relies heavily on Celery task scheduling, using Redis as backend. Tasks can be set to run a long time into the future, as well as in a few seconds/minutes.
I've read about the Redis visibility timeout and the consequences of scheduling tasks with a timedelta greater than the visibility timeout (I'm also in the process of dealing with it in a previous project), so I'm interested in whether there's anything neater than my solution: have another "helper" task run 5 minutes before the "main" one needs to be executed, schedule the "main" task to run at the required time, store its task id in the DB, and then have the "main" task check whether the stored task id is the one being run. The last part (storing the task id) is required because multiple runs of the "helper" task could spawn a lot of "main" task instances, but with this approach each will have a different task id.
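In code, that pattern looks roughly like this (ScheduledEvent, its fields, and do_the_actual_work are hypothetical names used only to illustrate the idea):

```python
from celery import shared_task

from myapp.models import ScheduledEvent  # hypothetical model with run_at, main_task_id
from myapp.work import do_the_actual_work  # hypothetical: the real work for the event


@shared_task
def helper(event_id):
    """Runs ~5 minutes before the real due time and schedules the main task."""
    event = ScheduledEvent.objects.get(pk=event_id)
    result = main.apply_async(args=[event_id], eta=event.run_at)
    # Remember which task id is the authoritative one for this event.
    ScheduledEvent.objects.filter(pk=event_id).update(main_task_id=result.id)


@shared_task(bind=True)
def main(self, event_id):
    event = ScheduledEvent.objects.get(pk=event_id)
    if event.main_task_id != self.request.id:
        # A duplicate delivery spawned by an earlier helper run; ignore it.
        return
    do_the_actual_work(event)
```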
I really hate how that approach sounds and how it works: if the task is scheduled to run a month from now, the "helper" and "main" tasks are executed up to a hundred times.
I also know that it's an open issue, so I'm more interested in a neat workaround than in a solution itself.
Having tested the available options, in my opinion only using RabbitMQ as the broker solves the whole problem.
Although that's a viable option for me, the lack of some of Redis's configuration parameters (e.g. pool size) makes it unusable for those on hosting services with a limit on open broker connections.

How to record all task information with Django and Celery?

In my Django project I'm using Celery with a RabbitMQ broker for asynchronous tasks. How can I record the information about all of my tasks (e.g. created time (when the task appears in the queue), the time a worker consumes the task, execution time, status, ...) to monitor how Celery is doing?
I know there are solutions like Flower, but that seems too much for what I need. django-celery-results looks like what I want, but it's missing a few pieces of information I need, like the task created time.
Thanks!
It seems like you often find the answer yourself after asking on SO. I settled on using Celery signals to do all the recording I want and store the results in a database table.
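A minimal sketch of that signal-based recording (TaskRecord is a hypothetical model; note that under message protocol 2 before_task_publish exposes the task id in headers, while protocol 1 puts it in the body):

```python
from celery.signals import before_task_publish, task_prerun, task_postrun
from django.utils import timezone

from myapp.models import TaskRecord  # hypothetical: task_id, name, timestamps, state


@before_task_publish.connect
def record_created(sender=None, headers=None, **kwargs):
    # Fired in the publishing process, i.e. when the task enters the queue.
    TaskRecord.objects.create(task_id=headers['id'], name=sender,
                              created_at=timezone.now())


@task_prerun.connect
def record_started(task_id=None, **kwargs):
    # Fired in the worker just before execution starts.
    TaskRecord.objects.filter(task_id=task_id).update(started_at=timezone.now())


@task_postrun.connect
def record_finished(task_id=None, state=None, **kwargs):
    # Fired in the worker after the task returns (or fails).
    TaskRecord.objects.filter(task_id=task_id).update(
        finished_at=timezone.now(), state=state)
```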

Best way to schedule tasks to run in the future dynamically using celery + sqs

So I'm struggling to figure out the optimal way to schedule some events to happen at some point in the future using celery. An example of this is when a new user has registered, we want to send them an email the next day.
We have Celery set up, and some of you may point to the eta parameter of apply_async. However, that won't work for us, as we use SQS, which has a visibility timeout that would conflict, and in general the eta param shouldn't be used for lengthy periods.
One solution we've implemented at this point is to create events and store them in the database with a 'to-process' timestamp (refers to when to process the event). We use the celery beat scheduler with a task that runs literally every second to see if there are any new events that are ready to process. If there are, we carry out the subsequent tasks.
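Roughly, that setup looks like this (Event, its fields, and send_welcome_email are illustrative names, not an existing implementation):

```python
from celery import shared_task
from django.utils import timezone

from myapp.models import Event  # hypothetical model with run_at, processed, user_id
from myapp.tasks import send_welcome_email  # hypothetical follow-up task


@shared_task
def process_due_events():
    """Beat runs this every second; it promotes events whose time has come."""
    due = Event.objects.filter(processed=False, run_at__lte=timezone.now())
    for event in due:
        send_welcome_email.delay(event.user_id)
        Event.objects.filter(pk=event.pk).update(processed=True)


# In the Celery config, something like:
# CELERYBEAT_SCHEDULE = {
#     'process-due-events': {'task': 'myapp.tasks.process_due_events', 'schedule': 1.0},
# }
```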
This solution works, although it doesn't feel great since we're queueing a task every second on SQS. Any thoughts or ideas on this would be great.

Why celery-haystack?

For a Django project I'd like to run index updates via a Celery worker so they don't add to the page response time. I noticed celery-haystack, which can do this, but I'm wondering why it's that complicated. A much simpler solution would be to simply apply an async task from a post_save signal and invoke the signal processor from there, i.e. apply the async part before entering the signal processor rather than from within it.
I guess I'm missing something?
I'm aware that instances may no longer exist in the case of delete signals...
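A rough sketch of that simpler approach, for concreteness (Article and the task/function names are placeholders; the task updates the index via Haystack's get_unified_index() API, which is effectively what the signal processor's save handler does):

```python
from celery import shared_task
from django.apps import apps
from django.db.models.signals import post_save
from django.dispatch import receiver
from haystack import connections

from myapp.models import Article  # placeholder for whatever model is indexed


@shared_task
def update_index(app_label, model_name, pk):
    """Update the Haystack index for one object, outside the request cycle."""
    model = apps.get_model(app_label, model_name)
    try:
        instance = model.objects.get(pk=pk)
    except model.DoesNotExist:
        return  # deleted in the meantime; a separate remove task would handle this
    index = connections['default'].get_unified_index().get_index(model)
    index.update_object(instance)


@receiver(post_save, sender=Article)
def queue_index_update(sender, instance, **kwargs):
    # The async hop happens here, before any indexing work is done.
    update_index.delay(sender._meta.app_label, sender._meta.model_name, instance.pk)
```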
So Celery is only the task distributor, right? And indexing is the job to be done; search is the end result. When your resources are limited, tasks will be queued up and scheduled to run when workers are available. You can pursue your approach just fine, but Celery will optimize by delegating tasks to different workers, which may reside on other machines.
I kind of forgot about the details... (sorry). But to comment: I ended up not using celery-haystack. Instead I use Django signals (not just post_save; I created more specific custom signals) that trigger async Celery tasks (so the work is delegated to other queues/nodes), and these run the index update using the signal processor. I also extended the signal processor to support updating and removing single objects as well as iterables of objects.
Paul

Simulating Google Appengine's Task Queue with Gearman

One of the characteristics I love most about Google's Task Queue is its simplicity. More specifically, I love that it takes a URL and some parameters and then posts to that URL when the task queue is ready to execute the task.
This structure means that tasks always execute the most current version of the code. In contrast, my Gearman workers all run code within my Django project, so when I push a new version live, I have to kill off the old worker and run a new one so that it uses the current version of the code.
My goal is to have the task queue be independent from the code base so that I can push a new live version without restarting any workers. So, I got to thinking: why not make tasks executable by url just like the google app engine task queue?
The process would work like this:
User request comes in and triggers a few tasks that shouldn't be blocking.
Each task has a unique URL, so I enqueue a gearman task to POST to the specified URL.
The Gearman server finds a worker and passes the URL and POST data to it.
The worker simply posts to the URL with the data, thus executing the task.
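In code, the worker side of this would be little more than the following sketch (assuming the python-gearman and requests packages; the task name 'post_url' and the payload format are illustrative):

```python
import json

import gearman
import requests


def post_url(gearman_worker, gearman_job):
    # The job payload carries the target URL, the form data and a signature.
    payload = json.loads(gearman_job.data)
    resp = requests.post(payload['url'], data=payload['data'],
                         headers={'X-Task-Signature': payload['signature']},
                         timeout=10)  # tasks are limited to ~10 seconds
    return str(resp.status_code)


worker = gearman.GearmanWorker(['localhost:4730'])
worker.register_task('post_url', post_url)
worker.work()  # blocks and processes jobs forever
```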
Assume the following:
Each request from a gearman worker is signed somehow so that we know it's coming from a gearman server and not a malicious request.
Tasks are limited to run in less than 10 seconds (there would be no long tasks that could time out).
What are the potential pitfalls of such an approach? Here's one that worries me:
The server could potentially get hammered with many requests at once, all triggered by a previous request. So one user request might entail 10 concurrent HTTP requests. I suppose I could have a single worker with a sleep before every request to rate-limit them.
Any thoughts?
As a user of both Django and Google AppEngine, I can certainly appreciate what you're getting at. At work I'm currently working on the exact same scenario using some pretty cool open source tools.
Take a look at Celery. It's a distributed task queue built with Python that exposes three concepts - a queue, a set of workers, and a result store. It's pluggable with different tools for each part.
The queue should be battle-hardened, and fast. Check out RabbitMQ for a great queue implementation in Erlang, using the AMQP protocol.
The workers ultimately can be Python functions. You can trigger workers using either queue messages or, perhaps more pertinent to what you're describing, using webhooks.
Check out the Celery webhook documentation. Using all these tools you can build a production ready distributed task queue that implements your requirements above.
I should also mention that, regarding your first pitfall, Celery implements rate-limiting of tasks using a token bucket algorithm.
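For instance, a per-task rate limit is just a task option (the task below is purely illustrative):

```python
import requests
from celery import shared_task


@shared_task(rate_limit='10/m')  # token bucket: at most 10 runs per minute, per worker
def post_to_url(url, data):
    requests.post(url, data=data, timeout=10)
```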