Django and Celery Confusion

After reading a lot of blog posts, I decided to switch from crontab to Celery for my mid-scale Django project. There are a few things I don't understand:
1- I'm planning to start a micro EC2 instance dedicated to RabbitMQ. Would this be sufficient for a small-to-medium task load (such as dispatching periodic e-mails via Amazon SES)?
2- Where does the computation of tasks happen: on the Django server or on the RabbitMQ server (assuming RabbitMQ is on a separate server)?
3- When I need to grow my system and have two or more application servers behind a load balancer, do these Celery machines need to connect to the same RabbitMQ vhost? Assume the application servers are carbon copies of each other, the tasks are the same, and everything is in sync at the database level.

I can't say whether a micro instance will be sufficient, but you can definitely configure the worker to suit (e.g. use -c 1 for a single-process worker to avoid using much memory, or the eventlet/gevent pools); see also the --autoscale option. The choice of broker transport also matters here: the transports that don't poll are more CPU-efficient (rabbitmq/redis/beanstalk).
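For example (a sketch only; the app name proj is a placeholder), a memory-frugal or I/O-oriented worker could be started with:
celery -A proj worker --concurrency=1
celery -A proj worker --pool=eventlet --concurrency=100
celery -A proj worker --autoscale=10,1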
Computation happens on the workers; the broker is only responsible for accepting, routing and delivering messages (and persisting them to disk when necessary).
To add additional workers, these should indeed connect to the same virtual host. You would only use separate virtual hosts if you wanted applications to have separate message buses.
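Concretely (host, credentials and vhost below are placeholders), every application server would point its Celery settings at the same broker URL:
BROKER_URL = "amqp://myuser:mypassword@rabbitmq-host:5672/myvhost"  # spelled broker_url in newer Celery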

How to create one docker container per web service request?

I have a quite heavy batch process (a Python script called "run_simulation.py") over which I have very little control. It can be launched by a single user through a web API, but it reads and writes from disk, so it wouldn't handle parallel requests.
Now I'd like to have one Docker container instantiated per request so that all requests can be handled in parallel. What would be the way to do this? Is this even doable with Docker? What would be the module responsible for instantiating the container and passing the HTTP request to it?
Generally you don’t do this. There are two good reasons for that: if you unconditionally launch a container per request it becomes very easy to swamp your system with these background jobs to the point where none can progress; and the setup that would allow you to launch more Docker containers would also give you unlimited root-level access to the host, which you don’t want in a process that accepts network requests.
A better approach is to set up a job queue system. RabbitMQ is popular and open-source, but by no means the only option. When you receive a request that needs background work, you add a job to the queue and return immediately. Meanwhile, you have some number of worker processes which accept jobs from the queue and do the work.
This gives you a couple of benefits. You control how much work can be done in parallel (by controlling the number of worker containers). If you need to do more work by setting up a second server (or even more), they can all connect back to the same queue server, without requiring a complex multi-host container setup. If your workers crash (or get OOM-killed) their jobs will be returned to the queue and can be picked up and retried by other workers. If you decide Docker doesn’t work for you, or that you need a different orchestrator (Nomad, Kubernetes) you can run this exact same setup without making any code changes, just changing the deployment configuration.
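A minimal sketch of that pattern with Celery (the broker URL, task name and worker command below are assumptions, not taken from your setup):
# tasks.py -- consumed by the worker containers
import subprocess
from celery import Celery

app = Celery("simulations", broker="amqp://guest:guest@rabbitmq:5672//")

@app.task
def run_simulation(params):
    # one job at a time per worker; scale by adding more worker containers
    subprocess.run(["python", "run_simulation.py", params], check=True)

# web side: run_simulation.delay("request-params")  # enqueue and return immediately
# worker side: celery -A tasks worker --concurrency=1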

Should I use a task queue (Celery), asyncio or neither for an API that polls other APIs?

I have written an API with Django whose purpose is to operate as a bridge between a website back-end and the external services we use, so that the website doesn't have to handle many requests to external APIs (CRM, calendar events, email providers, etc.).
The API mainly polls other services, parses the results and forwards them to the website backend.
I initially went for a Celery-based task queue, as it seemed to me like the right tool to offload that processing to another instance, but I'm starting to think it doesn't really fit the purpose.
As the website expects synchronous responses, my code contains a lot of:
results = my_task.delay().get()
or
results = chain(fetch_results.s(), parse_results.s()).delay().get()
This doesn't feel like the proper way to use Celery tasks.
It is efficient when pulling dozens of requests and processing the results in parallel - a periodic refresh task for example - but adds a lot of overhead for simple requests (fetch - parse - forward), which represent most of the traffic.
Should I go fully synchronous for those "simple requests" and keep Celery tasks for specific scenarios? Is there an alternative design (maybe involving asyncio) that would better suit the purpose of my API?
Using Django, Celery (w/ Amazon SQS) on an EBS EC2 instance.
You could consider using gevent with your Django web server so it can handle the "simple requests" you've mentioned efficiently without blocking. If you go this route, be sure to pool database connections with PgBouncer, Pgpool-II or a Python library, since each greenlet will make its own connection.
Once you've implemented that, you can also use gevent instead of Celery for the asynchronous processing, by joining on multiple greenlets that each make an external API request, rather than incurring the overhead of passing messages to an external Celery worker.
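A rough sketch of that gevent fan-out (the URLs and the requests dependency are assumptions; under a gevent-based server the monkey patching is normally done once at startup):
from gevent import monkey
monkey.patch_all()  # patch sockets before importing requests

import gevent
import requests

def fetch(url):
    return requests.get(url, timeout=10).json()

urls = ["https://crm.example.com/leads", "https://calendar.example.com/events"]
greenlets = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(greenlets, timeout=15)
results = [g.value for g in greenlets]  # value is None for a greenlet that failed or timed out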
Your implementation is similar to what we've done at Kloudless, which provides a single API to access multiple other APIs, including CRM, calendar, storage, etc.

Not sure if I should use celery

I have never used Celery before and I'm also a Django newbie, so I'm not sure whether I should use Celery in my project.
Brief description of my project:
There is an API for sending jobs (via SSH) to scientific computation clusters. The API is an abstraction over the different scientific job-queue vendors out there: http://saga-project.github.io/saga-python/
My project is basically a web GUI for this API, built with Django.
So my concern is that, if I use Celery, I would have a queue on the local web server and another one on each of the remote clusters. I'm afraid this might complicate the implementation needlessly.
The API is still in development and some of its features aren't fully finished. There is a function for checking the state of a remote job execution (running, finished, etc.), but the callback support for state changes is not ready yet. This is where I think Celery might be appropriate: I would have one or several periodic tasks monitoring the job states.
Any advice on how to proceed? No Celery at all? Celery for everything? Celery just for the job states?
I use Celery for a similar purpose and it works well. Basically I have one node running Celery workers that manage the entire cluster. These workers generate input data for the cluster nodes, assign tasks, and process the results for reporting or for generating dependent tasks.
Each cluster node runs a very small Python server which takes the database id of its assigned job. It then calls into the main (HTTP) server to request the data it needs, and finally posts the data back when complete. In my case the individual nodes don't need to message each other, and the run time of each task is very long (hours), which makes the delays introduced by central management and polling insignificant.
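The details of that setup aren't shown here, but the node-side script is essentially a small HTTP client; roughly (the endpoints, field names and do_the_work below are hypothetical):
import requests

API = "http://main-server.example.com/api"  # placeholder

def do_the_work(data):
    # placeholder for the long-running computation on the node
    return {"status": "done"}

def run_assigned_job(job_id):
    job = requests.get(f"{API}/jobs/{job_id}/").json()   # fetch the input data
    output = do_the_work(job["input"])                    # hours of computation in practice
    requests.post(f"{API}/jobs/{job_id}/result/", json={"output": output})  # post results back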
It would be possible to run a Celery worker on each node, taking tasks directly from the message queue, and that approach is appealing. However, I have complex dependencies that are easier to work out with centralized control. Also, I sometimes need to segment the cluster, and centralized control makes that possible to do on the fly.
Celery isn't good at managing priorities or recovering lost tasks (more reasons for central control).
Thanks for calling my attention to SAGA. I'm looking at it now to see if it's useful to me.
Celery is useful for executing tasks which are too expensive to run in the handler of an HTTP request (i.e. a Django view). Consider making an HTTP request from a Django view to some remote web server, and think about the latencies, possible timeouts, time for data transfer, and so on. It also makes sense to queue computation-intensive, long-running tasks for background execution with Celery.
We can only guess what the web GUI for the API should do. However, Celery fits very well for queuing requests to scientific computation clusters. It also lets you track the state of background tasks and their results.
I do not understand your concern about having many queues on different servers. You can have Django, the Celery broker (implementing the queues for tasks) and the worker processes (consuming the queues and executing Celery tasks) all on the same server.
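For the periodic job-state monitoring you describe, a beat-scheduled task is the usual shape. As a sketch (the task name, schedule and broker URL are assumptions, and the SAGA calls are left out):
# tasks.py
from celery import Celery

app = Celery("jobs", broker="amqp://guest:guest@localhost:5672//")

app.conf.beat_schedule = {
    "poll-job-states": {"task": "tasks.check_remote_job_states", "schedule": 60.0},
}

@app.task
def check_remote_job_states():
    # ask the remote clusters (e.g. via the SAGA API) for job states and
    # update the corresponding rows in the Django database
    ...

# run with: celery -A tasks worker --beat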

Django Celery in production

I have everything I want to do with django-celery working locally on my development machine. I run Django, djcelery, celery and the broker (Amazon SQS). It sends tasks and just works.
I can set this all up like I have done locally (i.e. all on one machine), but what happens when I want to distribute tasks to another machine / share tasks, etc.? Is it just a copy of the current machine (with Django, djcelery and celery), all connecting to the same SQS? How does this work? If they all connect to the same broker, do they just 'know'? Or does it not work like this?
Is it ok to start off with everything on one machine like I did in development (I will daemonize celery in production)?
Amazon SQS is the Simple Queue Service: jobs go in, wait to be run, and are removed from the queue once complete. Celery simply reads off of this queue.
Celery can scale both horizontally and vertically. Need Celery to process more jobs faster? Give your machine more resources and up the worker count (that's vertical scaling), or boot more, smaller machines (that's horizontal scaling). Either way, your Celery workers all consume the same queue on SQS. How the rest of your infrastructure is affected depends on what your Celery jobs are doing: if they write to a DB, then the more workers you have, the higher the load on the DB, so you would need to look at scaling that too.
It is OK to start off with the all-on-one-machine approach. As the demand on your app grows, you can move Celery workers off to more machines or give your all-in-one server more resources.
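Configuration-wise, every machine (web or worker) simply points at the same SQS broker; a sketch with placeholder values (newer Celery spells these broker_url / broker_transport_options):
BROKER_URL = "sqs://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@"
BROKER_TRANSPORT_OPTIONS = {
    "region": "us-east-1",
    "queue_name_prefix": "myapp-",
}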
Does this help? :)

Django-celery on multiple computers

I got everything I wanted to do with django-celery working on my development machine. More specifically, the app accepts photo URLs, which are turned into tasks that the same machine then downloads.
Now what I want to do is put the django code on heroku and the celery tasks on a dedicated computer that will be kept in the office.
I don't know what the next step is, though. How do I tell the Django app to connect to the office computer? What is the process for setting up the office computer to accept tasks from the Django app? How do I give the office computer the login credentials it needs to connect to the database and update the models?
Ideally, I am looking to put something like this in my settings.py file:
remote_worker = '123.2.4.23:1234'
and on the office computer
tasks = 'photos/tasks.py'
remote_app = 'herokuapp123.com/myapp'
username = 'me'
password = 'pw'
I know there are a lot of questions. Any help or pointers would be appreciated!
This largely depends on which AMQP backend you are using for Celery. If you are using the default (RabbitMQ) you will need to do one of the following:
Install RabbitMQ on the Heroku side, expose its port to your business IP through the firewall, and configure your office computer to connect to it
Install RabbitMQ locally on your business computer and configure celery on Heroku to connect to it
Install RabbitMQ on both sides and bridge them.
Alternatively you can integrate the heroku server in your own business network using a VPN solution and have them directly talk to each other (because after all you probably don't want to transmit AMQP packets bare over the interwebz).
Scenario 1 is probably the easiest to set up, as Heroku already provides the plugin infrastructure to do so. Scenario 2 is probably not what you want, as you will have to punch a hole in your business firewall for that. Both scenarios 1 and 2 will have latency and reliability issues, as routing AMQP traffic over the internet is not going to be fast or reliable. You will have dropped messages, and Celery will keep retrying until it succeeds or reaches the maximum number of failures. However, AMQP was designed to handle network issues; they just may inadvertently affect your performance if that is critical. But then again, in that case you should reconsider putting the Celery workers on a business desktop.
Scenario 3 is probably the best in terms of reliability, but it is also more difficult to set up. Choose based on your needs.
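As a rough illustration of scenario 1, a hosted RabbitMQ add-on (CloudAMQP, for example) gives the Heroku dynos and the office worker one shared broker URL; the values below are placeholders:
# settings.py -- shared by the Heroku app and the office worker
import os

BROKER_URL = os.environ.get(
    "CLOUDAMQP_URL",  # set by the add-on on Heroku; export it manually on the office machine
    "amqp://user:password@your-instance.cloudamqp.com/vhost",
)

# On the office computer, run "celery -A myapp worker" with the same Django settings,
# including DATABASES pointing at the production database, so tasks can update the models.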