Handling long requests - django

I'm working on a long request in a Django app (nginx reverse proxy, MySQL DB, Celery/RabbitMQ/Redis setup) and have some doubts about the solution I should apply:
Functioning: one feature of the app allows users to migrate thousands of objects from one system to another. Each migration is logged in a DB, and users are given the possibility to download the history of the migration in CSV format: which objects have been migrated, with which status (success, errors, ...).
To get the history, a GET request is sent to a Django view, which returns the download response after serialization and rendering into CSV.
Problem: the serialization and rendering processes, for a large set of objects (e.g. 160,000), are quite long and the request times out.
Some solutions I was thinking about/found thanks to previous searches are:
Increasing the amount of time before timeout: easy, but I read everywhere that this is a global nginx setting and would affect every request on the server.
Using an asynchronous task handled by Celery: the idea would be to make an initial request to the server, which would launch the serializing and rendering task with Celery and return a special HttpResponse to the client. The client would then regularly ask the server whether the job is done, and the server would deliver the history once processing has finished. I like this one but I'm not sure how to implement it technically.
Creating and temporarily storing the CSV file on the server, and giving the user a way to access and download it. I'm not a big fan of that one.
So my question is: has anyone already faced a similar problem? Do you have any advice on the technical implementation of solution #2, or a better solution to propose?
Thanks!

Clearly you should use Celery + RabbitMQ/Redis. If you look at the docs it's not that hard to set up.
The first question is whether to use RabbitMQ or Redis. There are many SO questions about this with good information about pros/cons.
The implementation in Django is really simple. You can just wrap Django functions with Celery tasks (with the @task decorator) and they'll become async, so this is the easy part.
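For illustration, a minimal sketch of what that wrapping could look like for the CSV export (MigrationLog, render_history_csv and the task name are hypothetical placeholders; assumes a recent Celery with shared_task, older versions use the plain @task decorator instead):

from celery import shared_task
from myapp.models import MigrationLog          # hypothetical model
from myapp.exports import render_history_csv   # hypothetical CSV helper

@shared_task
def export_migration_history(migration_id):
    # Serialize the logged objects and render the CSV outside the
    # request/response cycle; the result can be stored or cached.
    rows = MigrationLog.objects.filter(migration_id=migration_id)
    return render_history_csv(rows)

The view then calls export_migration_history.delay(migration_id) and returns immediately instead of doing the work inline.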
The problem I see in your project is that the server handling HTTP traffic is the same server running the long process. That can affect performance and user experience even if Celery is running in the background. Of course, that depends on how much traffic you are expecting on that machine and how many migrations can run at the same time.
One of the things you configure in Celery is the number of workers (concurrent processing units) available, so the number of cores in your machine will matter.
If you need to handle HTTP calls quickly, I would suggest delegating the migration process to another machine. Celery/Redis can be configured that way. Let's say you've got 2 servers. One would handle only normal Django calls (no Celery) and trigger Celery tasks on the other server (the one that actually runs the migration process). Both servers can connect to the same database.
But this is just an infrastructure optimization and you may not need it.
I hope this answers your question. If you have specific Celery issues it would be better to create another question.


Difference between usage of Django celery and Django cron-jobs?

I am sorry if this is basic, but I did not find any answers on the Internet comparing these two technologies. How should I decide when to use which, as both can be used to schedule and process periodic tasks?
This is what an article says:
Django-celery:
Jobs are an essential part of any application that does some processing
for you in the background. If your job is real-time, Celery can be used
in a Django application.
Django-cronjobs:
django-cronjobs can be used to schedule periodic_task which is a
valid job. django-cronjobs is a simple Django app that runs registered
cron jobs via a management command.
Can anyone explain to me when I should choose which one, and why? Also, I need to know why Celery is used when the computing is distributed, and why cron jobs are not.
The two things can be used for the same goal (background execution). However, if you are going to choose wisely, you should really understand that they are actually completely different things.
Here's what I wish someone had told me back when I was a noob (instead of the novice level that I have achieved today :)).
cron
The concept of a cron job is that we want a command / process to be executed on some schedule. Furthermore, we want that process to receive x,y,z parameters, run with a,b,c environment variables, and as user id 123.
Some cron systems may facilitate a few extra features, such as:
catching up on missed tasks (e.g. the server was off for a power outage all night and as soon as we turn it on, it runs the 8 instances of the command we normally run hourly).
might help you with the type of locking you normally do using a pid file in order to avoid parallel runs of the same command.
For the most part, cron systems are meant to be dumb: "just run this command at this time, thanks!".
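As an aside, the pid/lock-file style of locking mentioned above often boils down to something like this sketch (the lock path is made up; it uses the standard library's fcntl, so it is Unix-only):

import fcntl
import sys

lock_file = open("/tmp/my_job.lock", "w")   # hypothetical lock path
try:
    # Fail immediately if another run of the command already holds the lock.
    fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    sys.exit("another instance is already running")

# ... do the actual work of the cron job here ...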
Celery
The concept of Celery is much more sophisticated. It works with tasks, chains & chords of tasks, error handling, and (in most cases) collection of work result. It has a queue (or many queues) of work and a worker (or many). When a task (really just a message describing requested work) enters the queue it waits there until a worker is available to handle it. Much the same way as 1 or more employees at the DMV service a room full of waiting customers.
Furthermore, Celery can facilitate distributed work. That's a bit like (if I may torture the analogy a bit) - the difference between a DMV office where every worker shares the same phone, computer, copier, etc and a DMV where workers have dedicated resources and are never blocked by other workers.
Celery for web apps
In web applications, Celery is often used when a bit of web access results in a thing to be done that should be handled out of band of the conversation with the web browser. For example:
the web user just did something which should result in an email being sent. In order to send an email, your web server will need to contact a mail server. This could take time, the server could be busy, etc. We can't make the web user just wait, seeing nothing in their browser, while we do this. Well, you can, but it won't work reliably. So, we do that email send as a bit of work in the queue. That way, it can happen "whenever" and the web server can get back to communicating with the browser.
the user just submitted a credit card as payment. You're going to need to contact the card processor, but that might take several seconds. You might even have to contact them multiple times (e.g. they are really busy there right now). Again, you don't want your user's web browser to just sit blankly and you don't want a web server process or thread of execution tied up. Instead, you use Celery to create a job, you tell the browser to check back in a few seconds (or use a "web socket"), and your web server moves on and talks to other web users. When the browser checks back later, you look up the task id and find out from Celery whether it is finished and what the outcome was (card declined, etc).
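A minimal sketch of that check-back pattern on the Django side might look like this (process_payment and the view names are hypothetical; it uses Celery's AsyncResult API, and plain HttpResponse + json to stay version-agnostic):

import json
from celery.result import AsyncResult
from django.http import HttpResponse
from myapp.tasks import process_payment   # hypothetical task

def start_payment(request):
    # Enqueue the card charge and hand the task id back to the browser.
    result = process_payment.delay(request.POST["card_token"])
    return HttpResponse(json.dumps({"task_id": result.id}),
                        content_type="application/json")

def payment_status(request, task_id):
    # The browser polls this endpoint until the task reports a result.
    result = AsyncResult(task_id)
    payload = {"ready": result.ready(),
               "outcome": result.result if result.ready() else None}
    return HttpResponse(json.dumps(payload), content_type="application/json")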
Using Celery as cron
When you use Celery as a "cron system" all you are really doing is saying: "hey, can someone please generate work of X type on Y schedule". A process is created that runs continuously which sleeps most of the time and wakes up occasionally to inject a bit of work into the queue on the schedule you requested.
Usually the "hey, someone" that you ask to do that for you is celery beat, and beat gets the schedule you want from somewhere in the database or from your settings file.
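For illustration, a beat schedule entry can look like this in the Celery settings (the task name and timing are made up; assumes the CELERYBEAT_SCHEDULE setting and celery.schedules.crontab):

from celery.schedules import crontab

CELERYBEAT_SCHEDULE = {
    "nightly-cleanup": {
        "task": "myapp.tasks.cleanup",          # hypothetical task
        "schedule": crontab(hour=3, minute=0),  # run daily at 03:00
    },
}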
I searched for celery vs cron and found a few results that might be helpful to you.
https://www.reddit.com/r/Python/comments/m2dg8/explain_like_im_five_why_or_why_not_would_celery/
Why would running scheduled tasks with Celery be preferable over crontab?
Distributed task queues (Ex. Celery) vs crontab scripts

Django + RabbitMQ + Celery all on different machines (Servers)

I managed to get Django, RabbitMQ and Celery to work on a single machine. I have followed the instructions from here. Now I want to make them work together, but in a situation where they are on different servers. I do not want Django to know anything about Celery, nor Celery about Django.
So, basically I just want Django to send some message to a RabbitMQ queue (probably an id, the type of task, maybe some other info), and then I want RabbitMQ to publish that message (when it's possible) to Celery on another server. Celery/Django should not know about each other; basically I want an architecture where it is easy to replace any one of them.
Right now I have several calls in my Django app like
create_project.apply_async(args, countdown=10)
I want to replace that with similar calls directly to RabbitMQ (as I said, Django should not depend on Celery). Then RabbitMQ should notify Celery (when it is possible) and Celery will do its job (probably interacting with Django, but through a REST interface).
Also, I need to have Celery workers on two or more servers, and I want RabbitMQ to notify only one of them depending on some field in the message. If this is too complicated, I could just check in every task (on the different machines) something like: is this something you should do (e.g. by checking an IP address field in the message), and if it's not, just stop executing the task.
How can I achieve this? If possible I would prefer code + configuration examples, not just a theoretical explanation.
Edit:
I think that for my use case Celery is total overhead. Simple RabbitMQ
routing with custom clients will do the job. I already tried a simple
use case (one server) and it works perfectly. It should be easy to
make the communication multi-server ready. I do not like Celery. It is
"magical", hides too many details and it is not easy to configure. But I will leave this question alive, because I am interested in others' opinions.
The short of it
How can I achieve this?
Celery only sends the task name and a serialized set of parameters as the message body. That is, your scenario is absolutely in line with how Celery operates.
if possible I would prefer code + configuration examples not just theoretical explanation.
For the client app, i.e. your Django app, define stub tasks, like so:
@task
def foo():
    pass
For the Celery processing, on your remote server, define the actual tasks to be executed.
@task
def foo():
    pass
It is important that the tasks live in the same Python package on both sides (i.e. app.tasks.py), otherwise Celery won't be able to match the message to the actual task.
Note that this also means your Django app becomes untestable if you have set CELERY_ALWAYS_EAGER=True, unless you make the Celery app's tasks.py available locally to the Django app.
Even Simpler Alternative
An alternative to the above stub tasks is to send tasks by name:
>>> app.send_task('tasks.add', args=[2, 2], kwargs={})
<AsyncResult: 373550e8-b9a0-4666-bc61-ace01fa4f91d>
On Message Patterns
Also, I need to have Celery workers on two or more servers and I want RabbitMQ to notify only one of them depending on some field in the message.
RabbitMQ offers several messaging patterns, their tutorials are quite well written and to the point. What you want (one message processed by one worker) is trivially achieved with a simple queue/exchange setup, which (with Celery at least) is the default if you don't do anything else. If you need specific workers to attend to specific tasks/respond to specific messages, use Celery's task routing which works hand-in-hand with RabbitMQ's concept of queues and exchanges.
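As a rough illustration, routing specific tasks to a dedicated queue can be configured along these lines (the queue and task names are hypothetical; assumes Celery's CELERY_ROUTES setting and the worker's -Q option):

CELERY_ROUTES = {
    "myapp.tasks.migrate_objects": {"queue": "migrations"},   # hypothetical task
}

A worker on the dedicated server is then started with celery -A myproject worker -Q migrations, so only that worker consumes messages from the migrations queue.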
Trade-Offs
I think that for my use case Celery is total overhead. Simple RabbitMQ routing with custom clients will do the job. I already tried a simple use case (one server) and it works perfectly.
Of course, you may use RabbitMQ out of the box, at the cost of having to deal with the lower-level API that RabbitMQ provides. Celery adds a task abstraction that makes it very straight forward to build any producer/consumer scenario, essentially using just plain Python functions or methods. Note that this is not a better/worse judgement of either RabbitMQ or Celery -- as always with engineering decisions, there is trade-off involved:
If you use Celery, you probably lose some of the flexibility of the RabbitMQ API, but you gain ease and speed of development and lower deployment complexity -- it basically just works.
If you use RabbitMQ directly, you gain flexibility, but with this comes deployment complexity that you need to manage yourself.
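For reference, publishing directly to RabbitMQ without Celery might look roughly like this with the pika client (the queue name and message payload are made up):

import json
import pika

# Connect to the broker and declare a durable queue
# (assumption: RabbitMQ running on localhost with default credentials).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="project_tasks", durable=True)

# Publish a plain JSON message describing the requested work.
channel.basic_publish(
    exchange="",
    routing_key="project_tasks",
    body=json.dumps({"type": "create_project", "id": 42}),
)
connection.close()

The consumer on the other server would then read such messages and decide what to do with them, which is exactly the plumbing Celery otherwise handles for you.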
Depending on the requirements of your project, either approach may be valid - your call, really.
Any sufficiently advanced technology is indistinguishable from magic ;-)
I do not like Celery. It is "magical", hides too many details and is not easy to configure.
I would choose to disagree. It may be "magical" in Arthur C. Clarke's sense, but it certainly is rather easy to configure if you compare it to a plain RabbitMQ setup. Of course if you're also the guy who does the RabbitMQ setup, it may just add a layer of abstraction that you don't really gain anything from. Maybe your developers will?

Tornado DB notifications from Django

I have a Django server that serves HTTP requests and updates a Postgres DB. There is also a Tornado TCP server that should push new data from the database.
The problem is: what is the best way to notify Tornado about DB changes? That's it.
What options did I think of:
Celery. I've read a lot of docs and it seems that this is just a queue for tasks, not for communications. Moreover, those communications should be established between different processes.
Redis. It seems like a good solution that also has the brukva asynchronous client for Tornado. I'll need to introduce an additional dependency to the project though.
Socket. It seems the most lightweight solution, besides the Tornado TCP server serves lots of client sockets anyways, so adding one more shouldn't be a big deal. I don't want to go this low though.
DB notifications. This was the solution I came up with at first, but disliked later. Postgres has native notification support through the NOTIFY and LISTEN mechanisms, which are also asynchronous and can easily be added to the Tornado IOLoop (io_loop.add_handler with the connection file descriptor). This is low level too, but more importantly, this solution violates the data independence principle.
Now I need to mention a subtle detail that can make a big difference with what I've described:
99% of changes in the DB occur not in the Django process, but in separate scripts (and hence separate processes) that are launched by a scheduler and use the Django ORM.
Now this description more or less reflects the actual state of project.
So is Celery really not suited to solving this kind of problem? Is Redis the best solution? Are there any others that I don't know about? Any thoughts and suggestions are appreciated.
After evaluating all these options, I decided to go with the Redis + Tornado approach. It turned out to be quite robust and stable (this was 5 years ago, just for reference). I laid out the whole project with a working example here: https://www.databrawl.com/2017/04/27/real-time-python-via-tcp/
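For the record, the approach boils down to publish/subscribe along these lines (the channel name and payload are made up; the publishing side uses redis-py, while the Tornado side would hook an asynchronous client such as brukva into the IOLoop instead of the blocking loop shown here):

import json
import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)

# Django side (or the scheduled scripts): publish after committing a change.
r.publish("db_updates", json.dumps({"table": "orders", "id": 123}))

# Consumer side, shown as a blocking sketch for clarity:
p = r.pubsub()
p.subscribe("db_updates")
for message in p.listen():
    if message["type"] == "message":
        print(message["data"])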

Getting "idle in transaction" for postgresql with django

We are using Django 1.3.1 and Postgres 9.1
I have a view which just fires multiple selects to get data from the database.
In the Django documentation it is mentioned that when a request is completed, a ROLLBACK is issued if only SELECT statements were fired during the call to the view. But I am seeing a lot of "idle in transaction" entries in the log, especially when I have more than 200 requests. I don't see any COMMIT or ROLLBACK statements in the Postgres log.
What could be the problem? How should I handle this issue?
First, I would check out the related post What does it mean when a PostgreSQL process is “idle in transaction”? which covers some related ground.
One cause of "Idle in transaction" can be developers or sysadmins who
have entered "BEGIN;" in psql and forgot to "commit" or "rollback".
I've been there. :)
However, you mentioned your problem is related to having a lot of
concurrent connections. It sounds like investigating the "locks" tip
from the post above may be helpful to you.
A couple more suggestions: this problem may be secondary. The primary
problem might be that 200 connections is more than your hardware and
tuning can comfortably handle, so everything gets slow, and when things
get slow, more things are waiting for other things to finish.
If you don't have a reverse proxy like Nginx in front of your web app,
consider adding one. It can run on the same host without additional
hardware. The reverse proxy will serve to regulate the number of
connections to the backend Django web server, and thus the number of
database connections-- I've been here before with having too many
database connections and this is how I solved it!
With Apache's prefork model, there is a 1:1 correspondence between the
number of Apache workers and the number of database connections,
assuming something like Apache::DBI is in use. Imagine someone connects
to the web server over a slow connection. The web and database server
take care of the request relatively quickly, but then the request is
held open on the web server unnecessarily long as the content is
dribbled back to the client. Meanwhile, the database connection slot is
tied up.
By adding a reverse proxy, the backend server can quickly deliver a
reply back to the reverse proxy and then free the backend worker and
database slot. The reverse proxy is then responsible for getting the
content back to the client, possibly holding open its own connection
for longer. You may have 200 connections to the reverse proxy up front,
but you'll need far fewer workers and db slots on the backend.
If you graph the db slots with MRTG or similar, you'll see how many
slots you are actually using, and can tune down max_connections in
PostgreSQL, freeing those resources for other things.
You might also look at pg_top to
help monitor what your database is up to.
I understand this is an older question, but this article may describe the problem of idle transactions in django.
Essentially, Django's TransactionMiddleware will not explicitly COMMIT a transaction if it is not marked dirty (usually triggered by writing data). Yet, it still BEGINs a transaction for all queries even if they're read only. So, pg is left waiting to see if any more commands are coming and you get idle transactions.
The linked article shows a small modification to the transaction middleware to always commit (basically removing the condition that checks whether the transaction is_dirty). I'll be trying this fix in a production environment shortly.
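For reference, the always-commit variant amounts to something like this sketch (based on the pre-Django-1.6 transaction management API that TransactionMiddleware used; treat it as an illustration, not a drop-in fix):

from django.db import transaction

class AlwaysCommitTransactionMiddleware(object):
    # Like the stock TransactionMiddleware, but commits even when the
    # transaction was never marked dirty, so read-only requests do not
    # linger as "idle in transaction".

    def process_request(self, request):
        transaction.enter_transaction_management()
        transaction.managed(True)

    def process_exception(self, request, exception):
        if transaction.is_dirty():
            transaction.rollback()
        transaction.leave_transaction_management()

    def process_response(self, request, response):
        if transaction.is_managed():
            transaction.commit()   # unconditional, unlike the stock middleware
            transaction.leave_transaction_management()
        return response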

Long running tasks with Django

My goal is to create an application that will be able to perform long-lasting, mainly system-level tasks, such as:
checking out code from the repositories,
copying directories between various locations,
etc.
The problem is that I need to make it somehow independent of the web browser. I mean that, for example, after starting the checkout/copy action, closing the web browser will not interrupt the action. So after coming back to the site I can see that the copying is still going on, or that another action started while the browser was closed...
I was searching through various tools, like RabbitMQ + Celery, Twisted, Pyro, XML-RPC, but I don't know if any of these will be suitable for me. Has anyone encountered similar needs when creating a Django app? Please let me know if there are any methods/packages that I should know about. Code samples will also be more than welcome!
Thank you in advance for your suggestions!
(And sorry for my bad English. I'm working on it.)
Basically you need to have a process that runs outside of the request. The absolute simplest way to do this (on a Unix-like operating system, at least) is to fork():
import os
import sys

if os.fork() == 0:
    do_long_thing()   # child process: run the long job
    sys.exit(0)       # exit without falling back into the request code
# ... parent process continues with the request ...
This has some downsides, though (e.g., if the server crashes, the "long thing" will be lost), which is where, for example, Celery can come in handy. It will keep track of the jobs that need to be done and the results of jobs (success/failure/whatever), and make it easy to run the jobs on other machines.
Using Celery with a Redis backend (see Kombu's Redis transport) is very simple, so I would recommend looking there first.
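A minimal broker/result configuration for that setup might look like this (Celery 3.x-style setting names; the URLs assume a local Redis on the default port):

BROKER_URL = "redis://localhost:6379/0"
CELERY_RESULT_BACKEND = "redis://localhost:6379/0"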
You might need to have a process outside the request / response cycle. If that is the case, Celery with a Redis backend is what I would suggest looking into, as that integrates nicely with Django (as David Wolever suggested).
Another option is to create Django management commands, and then use cron to execute them at scheduled intervals.
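A skeleton for that approach could look like this (the command name and task body are hypothetical):

# myapp/management/commands/run_maintenance.py
from django.core.management.base import BaseCommand

class Command(BaseCommand):
    help = "Runs the long system task outside the request/response cycle"

    def handle(self, *args, **options):
        # checkout code, copy directories, etc.
        self.stdout.write("task finished")

It would then be scheduled from cron by calling python manage.py run_maintenance at the desired interval.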