Asynchronous message queues and processing like Amazon Simple Queue Service in Django

There are many activities in an application that need things like:
send an email, post to Twitter
thumbnail an image into several sizes
call a cron job to find connected relationships
A good way to handle these tasks is to write them into an asynchronous queue on which the operations are performed.
What Django application can be used to implement such functionality locally, like what Amazon Simple Queue Service offers?
I have come across celery. Is it the right thing? Does anything else like this exist?

Beanstalkd can also do what you want, and I've used it (though not from Python) to do some similar things (resizing images, and running other background tasks). There are a couple of Python client libraries to interface with it.
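For illustration, a minimal sketch of that pattern with beanstalkc, one of those Python client libraries (this assumes a beanstalkd server on localhost; the job payload format is made up):

    import beanstalkc

    # Producer side, e.g. from a Django view: enqueue a job description.
    queue = beanstalkc.Connection(host='localhost', port=11300)
    queue.put('thumbnail:42')  # made-up payload: task name and an image id

    # Worker side, in a separate process: block until a job arrives.
    job = queue.reserve()
    task, image_id = job.body.split(':')
    # ... resize the image, send the email, post to Twitter, etc. ...
    job.delete()  # remove the job once it has been handled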

Related

Django - Update a Model every couple of minutes

I am building an app where I need to fetch some data from an API and update all the models with that data every few minutes.
What would be a clean way to accomplish something like this?
Well, it's quite an open question.
You'll need to create a task that runs every few minutes; you can do this with Celery. Celery has a task scheduler, http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html, which will launch a given function at a configured time, similar to a crontab.
The task would then fetch the data; http://docs.python-requests.org/en/master/ is a very good library for making HTTP requests.
Last but not least, you would need to deserialize the fetched data and save it to your models. Django REST framework's serialization capabilities are a great starting point, but if the data structure is simple enough you can just use Python's json library (json.loads(data)) and write a function that translates the fields of the API into the fields of the model.
By the way, I'm assuming a REST API.
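Putting those three pieces together, a minimal sketch might look like the following (the broker URL, API endpoint and Item model are assumptions for illustration; in a real project you would also wire Celery into Django as described in the Celery documentation so the ORM is configured):

    # tasks.py -- minimal sketch; broker, endpoint and model are made up
    from datetime import timedelta

    import requests
    from celery import Celery

    app = Celery('tasks', broker='redis://localhost:6379/0')

    # The task scheduler: run refresh_items every 5 minutes.
    app.conf.beat_schedule = {
        'refresh-every-5-minutes': {
            'task': 'tasks.refresh_items',
            'schedule': timedelta(minutes=5),
        },
    }

    @app.task
    def refresh_items():
        response = requests.get('https://api.example.com/items/')  # assumed endpoint
        response.raise_for_status()
        from myapp.models import Item  # hypothetical Django model
        for record in response.json():
            # Translate the API's fields to the model's fields and upsert.
            Item.objects.update_or_create(
                external_id=record['id'],
                defaults={'name': record['name']},
            )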
You can use a task management tool that has the feature of running periodic tasks in the intervals you specify, like Periodic Tasks in Celery.
Also, if you run your code on a Unix-like system, you can stick with core Django functionality: write your functionality as a Django management command and set a cron job to run it at your preferred interval.
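A rough sketch of that approach (the command name, file path and logic are illustrative):

    # myapp/management/commands/refresh_data.py
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        help = 'Fetch fresh data from the API and update the models'

        def handle(self, *args, **options):
            # ... fetch from the API and update your models here ...
            self.stdout.write('Data refreshed')

    # Scheduled with a crontab entry such as:
    # */5 * * * * /path/to/project/manage.py refresh_data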

Django background tasks with Twisted

I'm making a web app using Django.
I'd like events to trigger 'background' tasks that run parallel to the Django application. (By parallel I just mean that they don't impact the speed of the user's experience.)
Types of tasks I'm talking about:
a user logs in, and an event is triggered to start populating that user's cache in anticipation of future requests based on their usage habits.
a user posts some data to the database, but that post triggers an API call to another website, where the returned data will be parsed, aggregated and used to supplement that user's post.
rolling updates of data used in the app through API calls to other websites.
aggregating data and running general maintenance tasks.
After a few days of research I'm thinking that I should use Twisted to accomplish this, which has led me to my question:
Is twisted overkill for what I'm trying to accomplish?
Many of these tasks are far more I/O bound than CPU bound, so I'm thinking asynchronous is best.
Any advice would be appreciated.
Thank you
Yes, I think it's overkill.
Rather than folding in a full async framework such as Twisted, with all the technical overhead that brings, you might be better off using a task queue to run what you want as background processes.
When your app needs to do a background task (anything that would otherwise block the request/response cycle), put the task in the queue and let a separate worker process pick things off the queue and deal with them as fast as it can. (You can always add more workers).
Two of the most popular queue libraries for Python/Django are celery and rq. They're especially good with Redis as a backend, but there are other backend options, too.
Personally, I much prefer rq over celery, in terms of its API and its clean setup, but both are used by a lot of people.
And both are definitely easier to get your head around than something like Twisted, IMO.
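To make that concrete, here is a minimal rq sketch (it assumes Redis running on localhost and an "rq worker" process started separately; the function and URL are made up):

    # tasks.py -- any importable plain function can be enqueued
    import requests

    def fetch_and_store(url):
        return requests.get(url).text

    # views.py -- enqueue during the request/response cycle and return at once
    from redis import Redis
    from rq import Queue

    from tasks import fetch_and_store

    queue = Queue(connection=Redis())

    def my_view(request):
        queue.enqueue(fetch_and_store, 'https://example.com/data')  # made-up URL
        # ... return a response immediately; a worker picks the job up ...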

AWS celery and database

I am writing a web app that runs on AWS. My app requires users to upload their PDF files. I will convert them into images using the "convert" utility on Linux.
Here is my setup on Ubuntu 12.04:
Django
Celery
Django Celery
Boto
I am using apache as my webserver.
The workflow is as follows:
There are three asynchronous tasks and two queues for handling all the processing, and S3 for storing the input and output files.
A user uploads a pdf then:
accept_file_task is called: this task takes the user-uploaded PDF, stores it in my S3 storage, and then inserts a message into the input_queue (SQS).
check_queue_and_launch_instance_task: a periodic task that keeps monitoring the number of messages in the input_queue and launches instances whenever the queue has more messages than the number of EC2 instances.
The instances have a bootstrap script which is a while True: loop (sketched after this list). Any of the instances can pick a message from the input_queue, run subprocess.Popen("convert " + input + " " + output), write the processed state to the output_queue, and also upload the generated image to the S3 output bucket and make it available as a download link.
output_process_task: another periodic task that keeps polling the output_queue; whenever a message is available, it updates the status in the table mentioned below.
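Roughly, the bootstrap loop looks like this (a simplified sketch; the bucket names are placeholders and boto3 is used here for brevity):

    # bootstrap.py -- simplified sketch of the instance worker loop
    import subprocess

    import boto3

    sqs = boto3.client('sqs')
    s3 = boto3.client('s3')
    in_url = sqs.get_queue_url(QueueName='input_queue')['QueueUrl']
    out_url = sqs.get_queue_url(QueueName='output_queue')['QueueUrl']

    while True:
        resp = sqs.receive_message(QueueUrl=in_url, WaitTimeSeconds=20)
        for msg in resp.get('Messages', []):
            key = msg['Body']  # S3 key of the uploaded pdf
            s3.download_file('input-bucket', key, '/tmp/input.pdf')  # placeholder bucket
            subprocess.check_call(['convert', '/tmp/input.pdf', '/tmp/output.png'])
            s3.upload_file('/tmp/output.png', 'output-bucket', key + '.png')
            sqs.send_message(QueueUrl=out_url, MessageBody=key)
            sqs.delete_message(QueueUrl=in_url, ReceiptHandle=msg['ReceiptHandle'])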
I am using a model called Document to store all the status information. I also have users registering, and hence a table to store all user information. Celery also created a lot of tables to store all its task information. Right now I am using a single instance and the sqlite3 database (that comes with Python) on that instance.
I am unsure about the following things
How do I scale up the database? Should I go for RDS, SimpleDB or AmazonDB? If not for Celery, I could have easily used SimpleDB. I am really stuck on this one.
How do I get rid of the two periodic tasks, check_queue_and_launch_instance_task and output_process_task? My idea is that Auto Scaling must be used in some way, so that if needed at a later stage an Elastic Load Balancer can be used.
If any of you have designed something similar, please advise me on how to go about it.
How do I scale up the database? Should I go for RDS, SimpleDB or AmazonDB? If not for Celery, I could have easily used SimpleDB. I am really stuck on this one.
Keep in mind that premature optimization is the root of all evil. The question of RDS (which is really just MySQL, Oracle, or MS SQL) vs. SimpleDB is more of an application design decision than one based on scalability. SimpleDB is just a simple key-value store. RDS, on the other hand, will give you full ACID functionality. If your data is relational, then you should be using a relational database. If you just need a place to store simple strings or integers, then something like SimpleDB would make more sense.
Right now I am using a single instance and the sqlite3 database (that comes with Python) on that instance.
Make sure that you understand the consequences of a) creating a single point-of-failure in your design and b) SQLite's limitations compared to using a standalone RDBMS in this application. (You can use it, but it's really intended for single-user applications).
How do I get rid of the two periodic tasks, check_queue_and_launch_instance_task and output_process_task? My idea is that Auto Scaling must be used in some way, so that if needed at a later stage an Elastic Load Balancer can be used.
If you're willing to replace Celery with SQS, you can tie together SQS + SNS + Cloudwatch to simplify this portion of your app. Though what you're doing doesn't sound like a bad choice, especially if it's working well already. Your time is probably better spent working on the problems in front of you rather than those that might occur down the road.
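For example, the periodic output_process_task could become a worker that long-polls SQS instead of waking on a schedule. A hedged sketch with boto3 (the queue name and the handler are placeholders):

    import boto3

    sqs = boto3.client('sqs')
    queue_url = sqs.get_queue_url(QueueName='output_queue')['QueueUrl']

    while True:
        # Long polling: the call blocks for up to 20 seconds waiting for
        # messages, so no periodic polling task needs to be scheduled.
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,
        )
        for msg in resp.get('Messages', []):
            update_document_status(msg['Body'])  # placeholder: update the Document row
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])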

Running continuous tasks alongside the Django app

I'm building a Django app which lists the hot (according to a specific algorithm) Twitter trending topics.
I'd like to run some processes indefinitely to make Twitter API calls and update the database (Postgres) with the new information. This way the hot trending topic list gets updated asynchronously.
At first it seemed to me that celery + RabbitMQ were the solution to my problem, but from what I understand they are used within Django to launch scheduled or user-triggered tasks, not indefinitely running ones.
The solution that comes to my mind is to write a .py file that continually puts trending topics into a queue, and independent, continually running .py files that pull from the queue and save the data into the database used by Django, with raw SQL or SQLAlchemy. I think this could work, but I'm pretty sure there is a much better way to do it.
If you just need to keep some processes running continually, supervisor is a nice solution.
You can combine it with any queuing technology you like to push things into your queues.
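As a rough sketch, the continually running process could be a standalone script that sets up Django and loops forever, with supervisor configured (a [program:...] stanza with autorestart=true) to keep it alive. The model and the fetch function here are made up:

    # trend_worker.py -- standalone, continually running worker; supervisord
    # restarts it if it dies. TrendingTopic and fetch_trends() are made up.
    import os
    import time

    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'myproject.settings')
    import django
    django.setup()

    from trends.models import TrendingTopic  # hypothetical model
    from trends.twitter import fetch_trends  # hypothetical Twitter API wrapper

    while True:
        for name, score in fetch_trends():
            TrendingTopic.objects.update_or_create(
                name=name, defaults={'score': score},
            )
        time.sleep(60)  # stay within the Twitter API rate limits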

How to best launch an asynchronous job request in Django view?

One of my view functions is a very long processing job and clearly needs to be handled differently.
Instead of making the user wait a long time, it would be best if I were able to launch the processing job, which would email the results, and without waiting for completion notify the user that their request is being processed and let them browse on.
I know I can use os.fork, but I was wondering if there is a 'right way' to do it in Django. Perhaps I can return the HTTP response and then carry on with the job somehow?
There are a couple of solutions to this problem, and the best one depends a bit on how heavy your workload will be.
If you have a light workload you can use the approach used by django-mailer, which is to define a "jobs" model, save new jobs into the database, then have cron run a stand-alone script every so often to process the jobs stored in the database (deleting them when done). You can use something like django-chronograph to make managing the job scheduling easier.
If you need help understanding how to write a script to process the job see James Bennett's article Standalone Django Scripts for help.
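The jobs-table pattern is only a model plus a polling loop; a rough sketch (field names and the handler are illustrative):

    # models.py -- the "jobs" table that the cron script polls
    from django.db import models

    class Job(models.Model):
        payload = models.TextField()
        created = models.DateTimeField(auto_now_add=True)

    # process_jobs.py -- run from cron every few minutes as a standalone script
    def process_pending_jobs():
        for job in Job.objects.order_by('created'):
            handle(job.payload)  # placeholder for the actual work
            job.delete()         # finished jobs are removed from the table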
If you have a very high workload, meaning you'll need more than a single server to process the jobs, then you want a real distributed task queue. There is a lot of competition here, so I can't really detail all the options, but a good one to use with Django apps is celery.
Why not simply start a thread to do the processing and then go on to send the response?
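For what it's worth, a minimal sketch of that idea (with the usual caveats: an exception in the thread fails silently, and the job is lost if the process restarts):

    # views.py -- thread-per-request sketch; long_job is a placeholder
    import threading

    from django.http import HttpResponse

    def long_job(data):
        pass  # heavy processing here, then email the results

    def my_view(request):
        worker = threading.Thread(target=long_job, args=(request.POST.dict(),))
        worker.daemon = True  # don't block interpreter shutdown
        worker.start()
        return HttpResponse('Your request is being processed; results will be emailed.')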
Before you select a solution, you need to determine how the process will be run. That is, is it the same process for every single user, with the same data, that can be scheduled regularly? Or does each user request something, with slightly different results?
As an example, if the data will be the same for every single user and can be run on a schedule, you could use cron.
See: http://www.b-list.org/weblog/2007/sep/22/standalone-django-scripts/
or
http://docs.djangoproject.com/en/dev/howto/custom-management-commands/
However, if the requests will be ad hoc and you need something scalable that can handle high load asynchronously, then what you are actually looking for is a message queuing system. Your view will add a request to the queue, which will then get acted upon.
There are a few options to implement this in Django:
Django Queue Service is purely Django & Python and simple, though the last commit was in April and it seems the project has been abandoned.
http://code.google.com/p/django-queue-service/
The second option, if you need something that scales, is distributed and makes use of open-source message queuing servers: celery is what you need.
http://ask.github.com/celery/introduction.html
http://github.com/ask/celery/tree