My goal is to create an application that will be able to do long-lasting mainly system tasks, such as:
checking out code from the repositories,
copying directories between various locations,
etc.
The problem is I need to prepare it somehow independently from the web browser. I mean that for example after starting the checkout/copy action, closing the web browser will not interrupt the action. So after going back to that site I can see that the copying goes on or another action started when the browser was closed...
I was searching through various tools, like RabbitMQ + Celery, Twisted, Pyro, XML-RPC, but I don't know if any of these will be suitable for me. Has anyone encountered similar needs when creating a Django app? Please let me know if there are any methods/packages that I should know about. Code samples will also be more than welcome!
Thank you in advance for your suggestions!
(And sorry for my bad English. I'm working on it.)
Basically you need to have a process that runs outside of the request. The absolute simplest way to do this (on a Unix-like operating system, at least) is to fork():
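import os

if os.fork() == 0:
    # Child process: do the work, then exit without running the parent's cleanup handlers.
    do_long_thing()
    os._exit(0)
# Parent process: … continue with request …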
This has some downsides, though (e.g., if the server crashes, the "long thing" will be lost), which is where something like Celery can come in handy. It will keep track of the jobs that need to be done and the results of jobs (success/failure/whatever), and make it easy to run the jobs on other machines.
Using Celery with a Redis backend (see Kombu's Redis transport) is very simple, so I would recommend looking there first.
You might need to have a process outside the request / response cycle. If that is the case, Celery with a Redis backend is what I would suggest looking into, as that integrates nicely with Django (as David Wolever suggested).
Another option is to create Django management commands, and then use cron to execute them at scheduled intervals.
Related
I am sorry if this is basic, but I did not find any answers on the Internet comparing these two technologies. How should I decide when to use which, given that both can be used to schedule and process periodic tasks?
This is what an article says:
Django-celery :
Jobs are essential part of any application that does
some processing for you in the background. If your job is real time
Django application celery can be used.
Django-cronjobs :
django-cronjobs can be used to schedule periodic_task which is a
valid job. django-cronjobs is a simple Django app that runs registered
cron jobs via a management command.
Can anyone explain the difference, when I should choose which one, and why? I also need to know why Celery is used when the computing is distributed, and why cron jobs are not.
The two things can be used for the same goal (background execution). However, if you are going to choose wisely, you should really understand that they are actually completely different things.
Here's what I wish someone had told me back when I was a noob (instead of the novice level that I have achieved today :)).
cron
The concept of a cron job is that we want a command / process to be executed on some schedule. Furthermore, we want that process to receive x,y,z parameters, run with a,b,c environment variables, and as user id 123.
Some cron systems may facilitate a few extra features, such as:
catching up on missed tasks (e.g. the server was off for a power outage all night and as soon as we turn it on, it runs the 8 instances of the command we normally run hourly).
might help you with the type of locking you normally do using a pid file in order to avoid parallel runs of the same command.
For the most part, cron systems are meant to be dumb: "just run this command at this time, thanks!".
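The pid-file style locking mentioned above can be sketched roughly as follows. This is a minimal, Unix-only illustration using an advisory file lock, not something cron provides for you; the lock path and the function names are hypothetical.

```python
import fcntl
import os
import tempfile

# Hypothetical lock path; any location writable by the cron user would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "my_cron_job.lock")


def try_acquire_lock(path=LOCK_PATH):
    """Return an open file holding an exclusive lock, or None if another run holds it."""
    lock_file = open(path, "a+")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the lock.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        lock_file.close()
        return None
    # Record our pid for anyone inspecting the file; the lock itself is what matters.
    lock_file.seek(0)
    lock_file.truncate()
    lock_file.write(str(os.getpid()))
    lock_file.flush()
    return lock_file


def run_job():
    """Run the periodic work unless a previous run is still in progress."""
    lock = try_acquire_lock()
    if lock is None:
        return "skipped"
    try:
        # ... the actual periodic work would go here ...
        return "ran"
    finally:
        lock.close()  # closing the file releases the lock
```

Because the lock is tied to the open file, it is released automatically if the process dies, which avoids the stale pid file problem.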
Celery
The concept of Celery is much more sophisticated. It works with tasks, chains & chords of tasks, error handling, and (in most cases) collection of work result. It has a queue (or many queues) of work and a worker (or many). When a task (really just a message describing requested work) enters the queue it waits there until a worker is available to handle it. Much the same way as 1 or more employees at the DMV service a room full of waiting customers.
Furthermore, Celery can facilitate distributed work. That's a bit like (if I may torture the analogy a bit) - the difference between a DMV office where every worker shares the same phone, computer, copier, etc and a DMV where workers have dedicated resources and are never blocked by other workers.
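To make the queue-and-workers picture concrete, here is a small standard-library sketch. It is not Celery's API: threads stand in for worker processes, an in-memory queue stands in for the broker, and the handler names are made up for illustration.

```python
import queue
import threading

# Hypothetical task handlers, looked up by name like Celery looks up tasks.
HANDLERS = {
    "add": lambda a, b: a + b,
    "square": lambda x: x * x,
}


def worker(task_queue, results, lock):
    """Take messages off the queue one at a time until a None sentinel arrives."""
    while True:
        message = task_queue.get()
        if message is None:
            break
        name, args = message
        outcome = HANDLERS[name](*args)
        with lock:
            results.append((name, args, outcome))


def process(messages, worker_count=2):
    """Queue up the messages and let worker_count workers drain the queue."""
    task_queue = queue.Queue()
    results = []
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(task_queue, results, lock))
        for _ in range(worker_count)
    ]
    for thread in threads:
        thread.start()
    for message in messages:
        task_queue.put(message)  # waits in the queue until a worker is free
    for _ in threads:
        task_queue.put(None)     # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Each message is only a task name plus arguments, and whichever worker is free next picks it up, so the order of the results is not guaranteed.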
Celery for web apps
In web applications, Celery is often used when a bit of web access results in a thing to be done that should be handled out of band of the conversation with the web browser. For example:
the web user just did something which should result in an email being sent. In order to send an email, your web server will need to contact a mail server. This could take time, the server could be busy, etc. We can't make the web user just wait, seeing nothing in their browser, while we do this. Well, you can, but it won't work reliably. So, we do that email send as a bit of work in the queue. That way, it can happen "whenever" and the web server can get back to communicating with the browser.
the user just submitted a credit card as payment. You're going to need to contact the card processor, but that might take several seconds. You might even have to contact them multiple times (e.g. they are really busy there right now). Again, you don't want your user's web browser to just sit blankly and you don't want a web server process or thread of execution tied up. Instead, you use Celery to create a job, you tell the browser to check back in a few seconds (or use a "web socket"), and your web server moves on and talks to other web users. When the browser checks back later, you look up the task id and find out from Celery whether it is finished and what the outcome was (card declined, etc).
Using Celery as cron
When you use Celery as a "cron system" all you are really doing is saying: "hey, can someone please generate work of X type on Y schedule". A process is created that runs continuously which sleeps most of the time and wakes up occasionally to inject a bit of work into the queue on the schedule you requested.
Usually the "hey someone" that you ask to do that for you is celery beat, and beat gets the schedule you want either from the database or from your settings file.
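As a rough idea of what the settings-file variant looks like, here is a sketch of a schedule entry. It assumes a recent Celery with lowercase setting names and a plain configuration module; the module name, entry label, and task name are hypothetical.

```python
# celeryconfig.py (hypothetical module name), loaded by the Celery app with
# app.config_from_object("celeryconfig").
from datetime import timedelta

beat_schedule = {
    # The key is only a label for this schedule entry.
    "send-reminders-every-5-minutes": {
        "task": "app.tasks.send_reminders",  # hypothetical task name
        "schedule": timedelta(minutes=5),    # how often beat enqueues the task
        "args": (),
    },
}
```

Beat itself does none of the work. It only puts a message for app.tasks.send_reminders into the queue every five minutes, and an ordinary worker picks it up.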
I searched for celery vs cron and found a few results that might be helpful to you.
https://www.reddit.com/r/Python/comments/m2dg8/explain_like_im_five_why_or_why_not_would_celery/
Why would running scheduled tasks with Celery be preferable over crontab?
Distributed task queues (Ex. Celery) vs crontab scripts
I'm working on a long request to a Django app (nginx reverse proxy, MySQL db, Celery-RabbitMQ-Redis set) and have some doubts about the solution I should apply:
How it works: One functionality of the app allows users to migrate thousands of objects from one system to another. Each migration is logged in a db, and users can download the history of the migration in CSV format: which objects have been migrated, with which status (success, errors, ...).
To get the history, a get request is sent to a django view, which returns, after serialization and rendering into csv, the download response.
Problem : the serialisation and rendering processes, for a large set of objects (e.g. 160 000) are quite long and the request times out.
Some solutions I was thinking about/found thanks to previous searches are:
Increasing the amount of time before timeout: easy, but I saw everywhere that this is a global nginx setting and would affect every request on the server.
Using an asynchronous task handled by celery : the concept would be to make an initial request to the server, which would launch the serializing and rendering task with celery, and give a special httpresponse to the client. Then the client would regularly ask the server if the job is done, and the server would deliver the history at the end of processing. I like this one but I'm not sure about how to technically implement that.
Creating and temporarily storing the csv file on the server, and give the user a way to access it & to download it. I'm not a big fan of that one.
So my question is: has anyone already faced a similar problem? Do you have advice on the technical implementation of solution #2, or a better solution to propose?
Thanks!
Clearly you should use Celery + RabbitMQ/Redis. If you look at the docs, it's not that hard to set up.
The first question is whether to use RabbitMQ or Redis. There are many SO questions about this with good information about pros/cons.
The implementation in Django is really simple. You can just wrap Django functions as Celery tasks (with the @task decorator), and calls made through .delay() or .apply_async() will then run asynchronously in a worker, so this is the easy part.
The problem I see in your project is that the server that handles HTTP traffic is the same server that runs the long process. That can affect performance and user experience even if Celery is running in the background. Of course, that depends on how much traffic you are expecting on that machine and how many migrations can run at the same time.
One of the things you set up in Celery is the number of workers (concurrent processing units) available, so the number of cores in your machine will matter.
If you need to handle HTTP calls quickly, I would suggest delegating the migration process to another machine. Celery/Redis can be configured that way. Let's say you've got 2 servers. One would handle only normal Django calls (no Celery worker) and trigger Celery tasks on the other server (the one that actually runs the migration process). Both servers can connect to the same database.
But this is just an infrastructure optimization and you may not need it.
I hope this answers your question. If you have specific Celery issues it would be better to create another question.
I managed to get Django, RabbitMQ and Celery working on a single machine. I have followed instructions from here. Now I want to make them work together when they are on different servers. I do not want Django to know anything about Celery, nor Celery about Django.
So, basically, I just want Django to send some message to a RabbitMQ queue (probably an id, the type of task, maybe some other info), and then I want RabbitMQ to publish that message (when it's possible) to Celery on another server. Celery and Django should not know about each other; basically I want an architecture where it is easy to replace either of them.
Right now I have in my Django several calls like
create_project.apply_async(args, countdown=10)
I want to replace that with similar calls directly to RabbitMQ (as I said Django should not depend on Celery). Then, RabbitMQ should notify Celery (when it is possible) and Celery will do its job (probably interact with Django but through REST interface).
Also, I need to have Celery workers on two or more servers, and I want RabbitMQ to notify only one of them depending on some field in the message. If this is too complicated, I could just check in every task (on different machines) whether this is something it should do (e.g. by checking an IP address field in the message) and, if it's not, just stop executing the task.
How can I achieve this? If possible, I would prefer code + configuration examples, not just a theoretical explanation.
Edit:
I think that for my use case celery is total overhead. Simple RabbitMQ
routing with custom clients will do the job. I already tried simple
use case (one server) and it works perfectly. It should be easy to
make communication multi-server ready. I do not like celery. It is
"magical", hides too many details and it is not easy to configure. But I will leave this question alive, because I am interested in others opinions.
The short of it
How can I achieve this?
Celery only sends the task name and a serialized set of parameters as the message body. That is, your scenario is absolutely in line with how Celery operates.
if possible I would prefer code + configuration examples not just theoretical explanation.
For the client app, i.e. your Django app, define stub tasks, like so:
@task
def foo():
    pass  # stub: only the task name and signature matter on the client side
For the Celery processing, on your remote server, define the actual tasks to be executed.
@task
def foo():
    pass  # the real implementation goes here
It is important that the tasks live in the same Python package on both sides (i.e. app/tasks.py), because the task name is derived from the module path; otherwise Celery won't be able to match the message to the actual task.
Note that this also means your Django app becomes untestable if you have set CELERY_ALWAYS_EAGER=True, unless you make the Celery app's tasks.py available locally to the Django app.
Even Simpler Alternative
An alternative to the above stub tasks is to send tasks by name:
>>> app.send_task('tasks.add', args=[2, 2], kwargs={})
<AsyncResult: 373550e8-b9a0-4666-bc61-ace01fa4f91d>
On Message Patterns
Also, I have need to have Celery workers on two or more servers and I want RabbitMQ to notify only one of them depending on some field in message.
RabbitMQ offers several messaging patterns, their tutorials are quite well written and to the point. What you want (one message processed by one worker) is trivially achieved with a simple queue/exchange setup, which (with Celery at least) is the default if you don't do anything else. If you need specific workers to attend to specific tasks/respond to specific messages, use Celery's task routing which works hand-in-hand with RabbitMQ's concept of queues and exchanges.
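A routing setup along those lines might look roughly like this. It is a sketch assuming a recent Celery with lowercase setting names; the queue names, the archive_project task name, the host-to-queue table, and the helper are all made up for illustration.

```python
# celeryconfig.py (hypothetical), shared by the sending side and the workers.

# Static routing: every call to a given task goes to a fixed queue.
task_routes = {
    "app.tasks.create_project": {"queue": "server_a"},
    "app.tasks.archive_project": {"queue": "server_b"},
}

# Dynamic routing: choose the queue from a field of the request instead.
QUEUE_BY_HOST = {
    "10.0.0.1": "server_a",
    "10.0.0.2": "server_b",
}


def queue_for_host(target_host):
    """Map a target host to a queue name; "celery" is Celery's default queue."""
    return QUEUE_BY_HOST.get(target_host, "celery")


# On the sending side, the queue can then be picked per call, e.g.:
#   create_project.apply_async(args, queue=queue_for_host(host))
```

Each server then runs a worker that consumes only its own queue, for example by starting it with -Q server_a, so a message is only ever delivered to the machine it was meant for.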
Trade-Offs
I think that for my use case celery is total overhead. Simple RabbitMQ routing with custom clients will do the job. I already tried simple use case (one server) and it works perfectly.
Of course, you may use RabbitMQ out of the box, at the cost of having to deal with the lower-level API that RabbitMQ provides. Celery adds a task abstraction that makes it very straight forward to build any producer/consumer scenario, essentially using just plain Python functions or methods. Note that this is not a better/worse judgement of either RabbitMQ or Celery -- as always with engineering decisions, there is trade-off involved:
If you use Celery, you probably lose some of the flexibility of the RabbitMQ API, but you gain development speed and lower deployment complexity -- it basically just works.
If you use RabbitMQ directly, you gain flexibility, but with this comes deployment complexity that you need to manage yourself.
Depending on the requirements of your project, either approach may be valid - your call, really.
Any sufficiently advanced technology is indistinguishable from magic ;-)
I do not like celery. It is "magical", hides too many details and it is not easy to configure.
I would choose to disagree. It may be "magical" in Arthur C. Clarke's sense, but it certainly is rather easy to configure if you compare it to a plain RabbitMQ setup. Of course if you're also the guy who does the RabbitMQ setup, it may just add a layer of abstraction that you don't really gain anything from. Maybe your developers will?
I'm trying to do a Django application with an asynchronous part: Websockets. Just as a little challenge, I want to mount everything in the same process. Tried Socket.IO but couldn't manage to actually use sockets, instead of longpolling (which killed my browser several times, until I gave up).
What I then tried was a not-so-maintained library based on gevent-websocket. However, had many errors and was not easy to debug.
Now I am trying a Tornado approach but AFAIK (please correct me if I'm wrong) integrating async with a regular django app wrapped by WSGIContainer (websockets would go through Tornado, regular connections through Django) will be a true server killer if a resource is heavy or, somehow, the Django ORM goes slow into heavy operations.
I was thinking of moving to Twisted/Cyclone. Before I move from one architecture with such an issue to ANOTHER architecture with such an issue, I'd like to ask:
Does Tornado (and/or Twisted) have an architecture for scheduling tasks in the same way gevent does? (This means: when certain greenlets "block", they schedule themselves to other threads, at least until the operation finishes.) I'm asking this because (please correct me if I'm wrong) a regular Django view will not be suitable for stuff like @inlineCallbacks, and will cause the whole server to be blocked (incl. the websockets).
I'm new to async programming in Python, so there's a huge chance I have misinformation about more than one concept. Please help me clarify this before I switch.
Neither Tornado nor Twisted have anything like gevent's magic to run (some) blocking code with the performance characteristics of asynchronous code. Idiomatic use of either Tornado or Twisted will be visible throughout your app in the form of callbacks and/or Futures/Deferreds.
In general, since you'll need to run multiple python processes anyway due to the GIL, it's usually best to dedicate some processes to websockets with Tornado/Twisted and other processes to Django with the WSGI container of your choice (and then put nginx or haproxy in front so it looks like a single service to the outside world).
If you still want to combine django and an asynchronous service in the same process, the next best solution is to use threads. If you want the two to share one listening port, the listener must be a websocket-aware HTTP server that can spawn other threads for WSGI requests. Tornado does not yet have a solution for this, although one is planned for version 4.1 (https://github.com/tornadoweb/tornado/pull/1075). I believe Twisted's WSGI container does support running the WSGI workers in threads, but I don't have any experience with it myself. If you need them in the same process but do not need to share the same port, then you can simply run the IOLoop or Reactor in one thread and the WSGI container of your choice in another (with its associated worker threads).
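The last option (event loop in one thread, WSGI container in another, separate ports) can be sketched with the standard library. Note the substitutions: asyncio stands in for Tornado's IOLoop or Twisted's reactor, wsgiref stands in for a production WSGI container, and the tiny WSGI app stands in for Django. All function names here are illustrative.

```python
import asyncio
import http.client
import threading
from wsgiref.simple_server import WSGIRequestHandler, make_server


def wsgi_app(environ, start_response):
    """Stand-in for the Django WSGI application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from WSGI"]


class QuietHandler(WSGIRequestHandler):
    def log_message(self, format, *args):
        pass  # silence per-request logging in this sketch


def start_event_loop():
    """Run an asyncio event loop forever in its own thread."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def start_wsgi_server(app, host="127.0.0.1", port=0):
    """Serve the WSGI app from a second thread; port 0 picks a free port."""
    server = make_server(host, port, app, handler_class=QuietHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


async def handle_message(message):
    """Stand-in for asynchronous websocket-side work."""
    await asyncio.sleep(0)
    return message.upper()


def fetch(port, path="/"):
    """Issue a plain HTTP GET against the local WSGI server."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", path)
        return conn.getresponse().read()
    finally:
        conn.close()
```

Code running in the WSGI thread can hand work to the loop with asyncio.run_coroutine_threadsafe(handle_message("ping"), loop). A slow WSGI request then blocks only its own thread rather than the event loop, although the GIL still limits how much CPU-bound work the two can do in parallel.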
I have a Django web application and I have some tasks that should operate (or actually: be initiated) on the background.
The application is deployed as follows:
apache2-mpm-worker;
mod_wsgi in daemon mode (1 process, 15 threads).
The background tasks have the following characteristics:
they need to operate in a regular interval (every 5 minutes or so);
they require the application context (i.e. the application packages need to be available in memory);
they do not need any input other than database access, in order to perform some not-so-heavy tasks such as sending out e-mail and updating the state of the database.
Now I was thinking that the most simple approach to this problem would be simply to piggyback on the existing application process (as spawned by mod_wsgi). By implementing the task as part of the application and providing an HTTP interface for it, I would prevent the overhead of another process that is holding all of the application into memory. A simple cronjob can be setup that sends a request to this HTTP interface every 5 minutes and that would be it. Since the application process provides 15 threads and the tasks are quite lightweight and only running every 5 minutes, I figure they would not be hindering the performance of the web application's user-facing operations.
Yet... I have done some online research and I have seen nobody advocating this approach. Many articles suggest a significantly more complex approach based on a full-blown messaging component (such as Celery, which uses RabbitMQ). Although that's sexy, it sounds like overkill to me. Some articles suggest setting up a cronjob that executes a script which performs the tasks. But that doesn't feel very attractive either, as it results in creating a new process that loads the entire application into memory, performs some tiny task, and destroys the process again. And this is repeated every 5 minutes. Does not sound like an elegant solution.
So, I'm looking for some feedback on my suggested approach, as described two paragraphs above. Is my reasoning correct? Am I overlooking (potential) problems? What about my assumption that the application's performance will not be impeded?
All are reasonable approaches depending on your specific requirements.
Another is to fire up a background thread within the process when the WSGI script is loaded. This background thread could simply sleep and wake up occasionally to perform required work and then go back to sleep.
This method requires, though, that you have at most one Django process in which the background thread runs, to avoid different processes doing the same work on the database etc.
Using daemon mode with a single process, as you are, would satisfy that criterion. There are potentially other ways you could achieve that, though, even in a multiprocess configuration.
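A background thread of that kind could look roughly like the sketch below. The class name and interval are illustrative assumptions; in the setup described it would be started once, from the WSGI script file, with something like a 300-second interval.

```python
import threading


class PeriodicWorker:
    """Run a job every interval_seconds in a background thread until stopped."""

    def __init__(self, interval_seconds, job):
        self._interval = interval_seconds
        self._job = job
        self._stop = threading.Event()
        # daemon=True so the thread never keeps the server process alive on shutdown.
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout and True once stop() has been called,
        # so the thread sleeps between runs but still reacts promptly to stop().
        while not self._stop.wait(self._interval):
            self._job()
```

An exception raised by the job would end the loop in this sketch, so a real job should catch and log its own errors.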
Note that celery works without RabbitMQ as well. It can use a ghetto queue (SQLite, MySQL, Postgres, etc, and Redis, MongoDB), which is useful in testing or for simple setups where RabbitMQ seems overkill.
See http://ask.github.com/celery/tutorials/otherqueues.html
(Using Celery with Redis/Database as the messaging queue.)