Difference between usage of Django celery and Django cron-jobs? - django

I am sorry if this is basic, but I did not find any answers on the Internet comparing these two technologies. How should I decide when to use which, as both can be used to schedule and process periodic tasks?
This is what an article says:
Django-celery :
Jobs are an essential part of any application that does some processing for you in the background. If your job is for a real-time Django application, Celery can be used.
Django-cronjobs :
django-cronjobs can be used to schedule periodic_task which is a
valid job. django-cronjobs is a simple Django app that runs registered
cron jobs via a management command.
Can anyone explain when I should choose which one, and why? I also need to know why Celery is used when the computing is distributed, and why cron jobs are not.

The two things can be used for the same goal (background execution). However, if you are going to choose wisely, you should really understand that they are actually completely different things.
Here's what I wish someone had told me back when I was a noob (instead of the novice level that I have achieved today :)).
cron
The concept of a cron job is that we want a command / process to be executed on some schedule. Furthermore, we want that process to receive x,y,z parameters, run with a,b,c environment variables, and as user id 123.
Some cron systems may facilitate a few extra features, such as:
catching up on missed tasks (e.g. the server was off for a power outage all night and as soon as we turn it on, it runs the 8 instances of the command we normally run hourly).
might help you with the type of locking you normally do using a pid file in order to avoid parallel runs of the same command.
For the most part, cron systems are meant to be dumb: "just run this command at this time, thanks!".
Celery
The concept of Celery is much more sophisticated. It works with tasks, chains & chords of tasks, error handling, and (in most cases) collection of work result. It has a queue (or many queues) of work and a worker (or many). When a task (really just a message describing requested work) enters the queue it waits there until a worker is available to handle it. Much the same way as 1 or more employees at the DMV service a room full of waiting customers.
Furthermore, Celery can facilitate distributed work. That's a bit like (if I may torture the analogy a bit) - the difference between a DMV office where every worker shares the same phone, computer, copier, etc and a DMV where workers have dedicated resources and are never blocked by other workers.
Celery for web apps
In web applications, Celery is often used when a bit of web access results in a thing to be done that should be handled out of band of the conversation with the web browser. For example:
the web user just did something which should result in an email being sent. In order to send an email, your web server will need to contact a mail server. This could take time, the server could be busy, etc. - we can't make the web user just wait, seeing nothing on their browser while we do this. Well, you can, but it won't work reliably. So, we do that email send as a bit of work in the queue. That way, it can happen "whenever" and the web server can get back to communicating with the browser.
the user just submitted a credit card as payment. You're going to need to contact the card processor, but that might take several seconds. You might even have to contact them multiple times (e.g. they are really busy there right now). Again, you don't want your user's web browser to just sit blankly and you don't want a web server process or thread of execution tied up. Instead, you use Celery to create a job, you tell the browser to check back in a few seconds (or use a "web socket"), and your web server moves on and talks to other web users. When the browser checks back later, you look up the task id and find out from Celery whether it is finished and what the outcome was (card declined, etc). A minimal sketch of this dispatch-and-poll pattern follows.
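Here is a minimal sketch of that dispatch-and-poll pattern. The charge_card task, the view names, and the JSON shape are all invented for illustration, and it assumes a Celery app plus a result backend are already configured:

# tasks.py
from celery import shared_task

@shared_task
def charge_card(payment_id):
    # Contact the card processor here; return something JSON-serializable.
    return {"payment_id": payment_id, "status": "approved"}

# views.py
from celery.result import AsyncResult
from django.http import JsonResponse

def start_payment(request):
    result = charge_card.delay(request.POST["payment_id"])
    return JsonResponse({"task_id": result.id})   # the browser polls with this id

def payment_status(request, task_id):
    result = AsyncResult(task_id)
    if result.ready():
        return JsonResponse({"done": True, "outcome": result.result})
    return JsonResponse({"done": False})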
Using Celery as cron
When you use Celery as a "cron system" all you are really doing is saying: "hey, can someone please generate work of X type on Y schedule". A process is created that runs continuously which sleeps most of the time and wakes up occasionally to inject a bit of work into the queue on the schedule you requested.
Usually the "hey, someone" you ask to do that for you is celery beat, and beat gets the schedule you want from the database or from your settings file.
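For example, a beat schedule can be declared on the Celery app configuration. This is only a sketch; the task name, broker URL, and timing are invented:

# celery.py (wherever the Celery app is created)
from celery import Celery
from celery.schedules import crontab

app = Celery("proj", broker="redis://localhost:6379/0")   # example broker URL

app.conf.beat_schedule = {
    "send-hourly-report": {
        "task": "proj.tasks.send_report",   # hypothetical task
        "schedule": crontab(minute=0),      # top of every hour
    },
}

celery beat only drops a message on the queue at each tick; a separate worker process actually runs proj.tasks.send_report.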

I searched for celery vs cron and found a few results that might be helpful to you.
https://www.reddit.com/r/Python/comments/m2dg8/explain_like_im_five_why_or_why_not_would_celery/
Why would running scheduled tasks with Celery be preferable over crontab?
Distributed task queues (Ex. Celery) vs crontab scripts

Related

Handling long requests

I'm working on a long request to a Django app (nginx reverse proxy, MySQL db, Celery-RabbitMQ-Redis set) and have some doubts about the solution I should apply:
Functioning: one feature of the app allows users to migrate thousands of objects from one system to another. Each migration is logged in a db, and users are given the possibility to download the history of the migration in CSV format: which objects have been migrated and with which status (success, errors, ...).
To get the history, a GET request is sent to a Django view, which, after serialization and rendering into CSV, returns the download response.
Problem: the serialization and rendering processes for a large set of objects (e.g. 160,000) take quite long, and the request times out.
Some solutions I was thinking about/found thanks to previous searches are:
Increasing the amount of time before timeout: easy, but I saw everywhere that this is a global nginx setting and would affect every request on the server.
Using an asynchronous task handled by Celery: the concept would be to make an initial request to the server, which would launch the serializing and rendering task with Celery and give a special HTTP response to the client. Then the client would regularly ask the server if the job is done, and the server would deliver the history at the end of processing. I like this one, but I'm not sure how to technically implement it.
Creating and temporarily storing the CSV file on the server, and giving the user a way to access it and download it. I'm not a big fan of that one.
So my question is: has anyone already faced a similar question? Do you have advice on the technical implementation of solution #2, or a better solution to propose?
Thanks!
Clearly you should use Celery + RabbitMQ/Redis. If you look at the docs it's not that hard to set up.
The first question is whether to use RabbitMQ or Redis. There are many SO questions about this with good information about pros/cons.
The implementation in Django is really simple. You can just wrap Django functions as Celery tasks (with the @task decorator) and they'll become async, so this is the easy part.
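Here is a rough sketch of how that can look for the CSV history, assuming a Celery app and a result backend such as Redis are already configured; the names and the URL wiring are made up:

# tasks.py
import csv
import io

from celery import shared_task

@shared_task
def generate_history_csv(migration_id):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["object_id", "status"])
    # Replace this placeholder loop with the real history query/serialization.
    for object_id, status in []:
        writer.writerow([object_id, status])
    return buf.getvalue()

# views.py
from celery.result import AsyncResult
from django.http import HttpResponse, JsonResponse

def start_export(request, migration_id):
    result = generate_history_csv.delay(migration_id)
    return JsonResponse({"task_id": result.id}, status=202)

def export_status(request, task_id):
    result = AsyncResult(task_id)
    if not result.ready():
        return JsonResponse({"ready": False})
    response = HttpResponse(result.result, content_type="text/csv")
    response["Content-Disposition"] = "attachment; filename=history.csv"
    return response

Keeping a 160,000-row CSV in the result backend only works if the backend can comfortably hold it; otherwise, writing to a temporary file or object storage and returning a URL (the option you were less fond of) is the usual compromise.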
The problem I see in your project is that the server that is handling HTTP traffic is the same server running the long process. That can affect performance and user experience even if Celery is running in the background. Of course, that depends on how much traffic you are expecting on that machine and how many migrations can run at the same time.
One of the things you setup on Celery is the number of workers (concurrent processing units) available. So the number of cores in your machine will matter.
If you need to handle HTTP calls quickly, I would suggest delegating the migration process to another machine. Celery/Redis can be configured that way. Let's say you've got two servers. One would handle only normal Django calls (no Celery) and trigger Celery tasks on the other server (the one that actually runs the migration process). Both servers can connect to the same database.
But this is just an infrastructure optimization and you may not need it.
I hope this answers your question. If you have specific Celery issues it would be better to create another question.

Not sure if I should use celery

I have never used celery before and I'm also a django newbie so I'm not sure if I should use celery in my project.
Brief description of my project:
There is an API for sending (via SSH) jobs to scientific computation clusters. The API is an abstraction over the different scientific job queue vendors out there. http://saga-project.github.io/saga-python/
My project is basically about doing a web GUI for this API with django.
So, my concern is that, if I use celery, I would have a queue in the local web server and another one in each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of the features aren't fully finished. There is a function for checking the state of the remote job execution (running, finished, etc.) but the callback support for state changes is not ready. Here is where I think celery might be appropriate. I would have one or several periodic task(s) monitoring the job states.
Any advice on how to proceed please? No celery at all? celery for everything? celery just for the job states?
I use celery for similar purpose and it works well. Basically I have one node running celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, process the results for reporting or generating dependent tasks.
Each cluster node is running a very small Python server which takes a db id of its assigned job. It then calls into the main (HTTP) server to request the data it needs and finally posts the data back when complete. In my case, the individual nodes don't need to message each other, and the run time of each task is very long (hours). This makes the delays introduced by central management and polling insignificant.
It would be possible to run a celery worker on each node taking tasks directly from the message queue. That approach is appealing. However, I have complex dependencies that are easier to work out from a centralized control. Also, I sometimes need to segment the cluster and centralized control makes this possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for executing tasks which are too expensive to execute in the handler of an HTTP request (i.e. a Django view). Consider making an HTTP request from a Django view to some remote web server and think about latencies, possible timeouts, time for data transfer, etc. It also makes sense to queue computation-intensive, long-running tasks for background execution with Celery.
We can only guess what the web GUI for the API should do. However, Celery fits very well for queuing requests to scientific computation clusters. It also allows you to track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, Celery broker (implementing queues for tasks) and worker processes (consuming queues and executing Celery tasks) all on the same server.
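As a sketch of the periodic state-monitoring idea: the Job model, its fields, and check_remote_state below are placeholders for your own code and for whatever saga-python exposes, so treat this only as an outline:

# tasks.py
from celery import shared_task

def check_remote_state(remote_id):
    # Placeholder: call into the SAGA API here and return its state string.
    return "Running"

@shared_task
def poll_job_states():
    from myapp.models import Job                      # hypothetical model
    for job in Job.objects.exclude(state__in=["Done", "Failed"]):
        state = check_remote_state(job.remote_id)
        if state != job.state:
            job.state = state
            job.save()

# A beat entry (in the Celery app config) could run it, e.g., every minute:
# app.conf.beat_schedule = {
#     "poll-job-states": {"task": "myapp.tasks.poll_job_states", "schedule": 60.0},
# }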

Long running tasks with Django

My goal is to create an application that will be able to do long-running, mainly system tasks, such as:
checking out code from the repositories,
copying directories between various locations,
etc.
The problem is that I need to make it work independently of the web browser. I mean that, for example, after starting the checkout/copy action, closing the web browser should not interrupt the action. So after going back to the site I can see that the copying is still going on, or that another action started while the browser was closed...
I was searching through various tools, like RabbitMQ + Celery, Twisted, Pyro, XML-RPC but I don't know if any of these will be suitable for me. Has anyone encountered similar needs when creating Django app? Please let me know if there are any methods/packages that I should know. Code samples also will be more than welcome!
Thank you in advance for your suggestions!
(And sorry for my bad English. I'm working on it.)
Basically you need to have a process that runs outside of the request. The absolute simplest way to do this (on a Unix-like operating system, at least) is to fork():
import os
import sys

if os.fork() == 0:
    # Child process: do the long-running work, then exit.
    do_long_thing()
    sys.exit(0)
# Parent process: continue with the request as normal.
This has some downsides, though (e.g., if the server crashes, the “long thing” will be lost)… which is where, e.g., Celery can come in handy. It will keep track of the jobs that need to be done, the results of jobs (success/failure/whatever) and make it easy to run the jobs on other machines.
Using Celery with a Redis backend (see Kombu's Redis transport) is very simple, so I would recommend looking there first.
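A minimal sketch of that, reusing the do_long_thing name from the fork example above (the broker URL and project name are just examples):

# tasks.py
from celery import Celery

app = Celery("myproject", broker="redis://localhost:6379/0")

@app.task
def do_long_thing():
    # The long-running work (checkout, copy, ...) goes here.
    pass

# In the request handler, instead of fork():
#     do_long_thing.delay()
# A worker started with the `celery worker` command then picks the job up.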
You might need to have a process outside the request / response cycle. If that is the case, Celery with a Redis backend is what I would suggest looking into, as that integrates nicely with Django (as David Wolever suggested).
Another option is to create Django management commands, and then use cron to execute them at scheduled intervals.
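For instance, a minimal sketch of the management-command route (the app and command names are made up):

# myapp/management/commands/run_maintenance.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Run the periodic background work outside the request/response cycle."

    def handle(self, *args, **options):
        # The long-running work goes here; it has full access to the Django ORM.
        self.stdout.write("maintenance finished")

# Example crontab entry (paths are illustrative), running every 15 minutes:
# */15 * * * * /path/to/venv/bin/python /path/to/project/manage.py run_maintenance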

Need a server-side timer (independent of browser)

I'm putting together a website that will track user-defined events with time limits. Every user would be free to create events, and when the time limit expired, the server would need to take some action based on the outcome of the event. The specific component I'm struggling with is the time-keeping: think like eBay's auction clock -- it's set to expire at a certain time, clearly runs server-side, and takes some action when the time runs out. Searches for a "server side timer," unfortunately, just bring back results for a timer that gets the time from the server instead of the client. :(
The most obvious solution is to run a script on the server, some program that would watch all the clocks and take action when any of them expired. Tragically, I'll be using free web hosting, and sincerely doubt that I'll be able to find someone who'll let me run arbitrary stuff on their servers.
The solutions that I've looked into:
Major concept option 1: persuade each user's browser to run the necessary timers (trivial javascript), and when the timers expire, take necessary action. The problem with this approach is obvious: there could be hundreds, if not thousands, of simultaneous expiring timers (they'll tend to expire in clusters), and the worst case is that every possible user could be viewing their timer expire. That's a server overload waiting to happen at the worst possible instant.
Major concept option 2: have one really trusted browser, say, a user logged in to the website as "cron" which could run all of the timers at once. The action would all happen in that browser's javascript, and would work great, as long as that browser never crashed, that machine never failed, and that internet connection never went down.
As you can see, I feel like I'm barking up the wrong forest on this problem. Some other ideas that have presented themselves:
AJAX: I'm not seeing anything here that will do quite what I need. It's all browser-run stuff, nothing like a server-side process that could run independent of the user's browser.
PHP: Runs neatly on the server, but only in response to client requests. I'm not seeing any clean way to make PHP fork off a process and run a timer independent of the user's browser.
JS: same problems as PHP, but easier to read. ;)
Ruby: There may be some multi-threading with Ruby, but it isn't readily apparent to me. Would it be possible to have each user's browser check to see if a timer process was running for their event, and spawn a new server-side ruby process if it wasn't?
I'm wide open for ideas -- I've started playing with concepts in JS and PHP, but I'm not tied to any language, particularly. The only constraint, really, is that I won't own the server that I'm running the site on, so I can't just run a neat little local process that does what I need it to do. :(
Any thoughts? Thanks in advance,
Dan
ASP.NET has multi-threading. You can have a static variable to collect the event data, and use a thread to do whatever is needed when the time comes. Afterwards you can empty the static variable so it's ready for future use.
http://leedale.wordpress.com/2007/07/22/multithreading-with-aspnet-20/
You might want to take a look at the Quartz scheduler for Java which also has a .NET version. With a friendly open source license (Apache 2.0) this is probably a very good starting point.
If you can control cron jobs, which at least I could on HostPapa's shared hosting, you could run a PHP file every minute (cron's smallest granularity) which checks the timers and takes action based on them.
I would suggest AJAX anyway. What we did on a game server was emulate "the server connects to the client" via an AJAX request to the server without any time-out (an asynchronous, hanging connection). Basically you create one extra connection for each client that hangs on the server and waits for the server to take a self-invoked action. After the action is done you start a new hanging connection immediately, so you have one hanging all the time (so the server can talk to your client any time it wants). You can send JavaScript code from the server that decides what happens next. You can also require clients to keep these hanging connections open in order to count as valid on the server side, and of course run your timers on the server.

System architecture: simple approach for setting up background tasks behind a web application -- will it work?

I have a Django web application and I have some tasks that should operate (or actually: be initiated) in the background.
The application is deployed as follows:
apache2-mpm-worker;
mod_wsgi in daemon mode (1 process, 15 threads).
The background tasks have the following characteristics:
they need to operate in a regular interval (every 5 minutes or so);
they require the application context (i.e. the application packages need to be available in memory);
they do not need any input other than database access, in order to perform some not-so-heavy tasks such as sending out e-mail and updating the state of the database.
Now I was thinking that the most simple approach to this problem would be simply to piggyback on the existing application process (as spawned by mod_wsgi). By implementing the task as part of the application and providing an HTTP interface for it, I would prevent the overhead of another process that is holding all of the application in memory. A simple cronjob can be set up that sends a request to this HTTP interface every 5 minutes and that would be it. Since the application process provides 15 threads and the tasks are quite lightweight and only running every 5 minutes, I figure they would not hinder the performance of the web application's user-facing operations.
Yet... I have done some online research and I have seen nobody advocating this approach. Many articles suggest a significantly more complex approach based on a full-blown messaging component (such as Celery, which uses RabbitMQ). Although that's sexy, it sounds like overkill to me. Some articles suggest setting up a cronjob that executes a script which performs the tasks. But that doesn't feel very attractive either, as it results in creating a new process that loads the entire application into memory, performs some tiny task, and destroys the process again. And this is repeated every 5 minutes. Does not sound like an elegant solution.
So, I'm looking for some feedback on my suggested approach as described in the paragraph before the preceding paragraph. Is my reasoning correct? Am I overlooking (potential) problems? What about my assumption that the application's performance will not be impeded?
All are reasonable approaches depending on your specific requirements.
Another is to fire up a background thread within the process when the WSGI script is loaded. This background thread could simply sleep and wake up occasionally to perform required work and then go back to sleep.
This method does require, though, that you have at most one Django process in which the background thread runs, to avoid different processes doing the same work on the database, etc.
Using daemon mode with a single process as you are would satisfy that criteria. There are potentially other ways you could achieve that though even in a multiprocess configuration.
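A bare-bones sketch of that background-thread idea, with do_periodic_work standing in for the e-mail/database work:

import threading
import time

def do_periodic_work():
    # Placeholder: send the e-mails, update the database state, etc.
    pass

def _background_loop():
    while True:
        time.sleep(300)          # wake up every 5 minutes
        do_periodic_work()

_worker = threading.Thread(target=_background_loop)
_worker.daemon = True            # don't block process shutdown
_worker.start()

Because a copy of this thread starts in every process that loads the WSGI script, it only behaves as a single scheduler under the single-process daemon mode described above.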
Note that celery works without RabbitMQ as well. It can use a ghetto queue (SQLite, MySQL, Postgres, etc, and Redis, MongoDB), which is useful in testing or for simple setups where RabbitMQ seems overkill.
See http://ask.github.com/celery/tutorials/otherqueues.html
(Using Celery with Redis/Database as the messaging queue.)
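For example, with the common Django integration (app.config_from_object('django.conf:settings', namespace='CELERY')), pointing Celery at Redis is just a settings change; exact setting names vary between Celery versions, and these URLs are only illustrative:

# settings.py
CELERY_BROKER_URL = "redis://localhost:6379/0"       # Redis instead of RabbitMQ
CELERY_RESULT_BACKEND = "redis://localhost:6379/1"   # optional: keep task results
# A database-backed broker is also possible via kombu's database transports,
# as described in the "other queues" tutorial linked above.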