Using celery for data migration. Is it a good idea? - django

I am doing data migration which deals with images/videos and such being downloaded and then sent to dropbox by using its api.
I'm using python-django for the entire web app but I imagine this will take a lot of bandwidth and there might be lot of issues where a failure of one image being saved shouldn't stop the entire migration.
Thus, is celery a good idea? Or Twisted?
I'm a bit confused about how this would help me. What I've in mind is to spawn a server/thread for the process of dealing with a single image or a small set of images and thus being able to do it on multiple threads.

The short answer to your question "is Celery a good idea?" is "Yes". I've used Celery to achieve a similar process whereby user submission of a form initiates, amongst other things, asynchronous calls to the Twitter API which then write back to saved objects in my database. I've found Celery outstanding for this task (no pun intended).
Celery would allow you to initiate pre-defined tasks (which, in part, can be thought of as "normal" Python functions with a @task decorator added to them) each time a user indicates they'd like to download an image or images. Celery gives you granular, per-task control over errors and retries, and tasks can be submitted singly or as chains, chords, or groups, all of which means you can definitely achieve your requirement of the migration continuing even when a single image fails to download.
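To illustrate the failure isolation described above without needing a broker, here is a minimal sketch that uses only the standard library's concurrent.futures rather than Celery itself. The names migrate_image, with_retries, and migrate_all are hypothetical, and the download/upload step is replaced by a stand-in that always fails for one file. With Celery, the same idea would be expressed as a task with its own retry settings.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def migrate_image(name):
    """Stand-in for 'download the file, then upload it to Dropbox'."""
    if name == "broken.jpg":  # simulate one item that always fails
        raise IOError("could not fetch " + name)
    return name + " migrated"


def with_retries(func, arg, attempts=3):
    """Call func(arg), retrying up to `attempts` times before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return func(arg)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: let the caller record the failure


def migrate_all(names, workers=4):
    """Migrate every item; collect failures instead of aborting the run."""
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(with_retries, migrate_image, n): n for n in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                done.append(future.result())
            except Exception as exc:
                failed[name] = str(exc)
    return sorted(done), failed


if __name__ == "__main__":
    done, failed = migrate_all(["a.jpg", "broken.jpg", "b.png"])
    print(done)
    print(failed)
```

The failed dictionary gives you a list of items to re-run later, which is the property the question asks for: one bad image does not stop the rest of the migration.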
I would recommend spending some time with the Celery tutorial here and the Celery-Django tutorial here, which will give you an introduction to the basic work flow with Celery and Django.
I can't speak to the merits of Twisted, but if you are looking for opinions on the relative strengths and weaknesses of each, these look like a good start:
Twisted or Celery? Which is right for my application with lots of SOAP calls?
sync spawing of processes: design question - Celery or Twisted

Related

Is there a way to compute this amount of data and still serve a responsive website?

Currently I am developing a django + react website, that will (I hope) serve a decent number of users. The project demo is mostly complete, and I am starting to think about the scale required to put this thing into production
The website essentially does three things:
Grab data from external APIs (e.g. Twitter) for 50,000 unique keywords (the keywords don't change). This process happens every 30 minutes
Run computation on all of the data, and save that computation to the database. Assume that the algorithm is as optimized as possible
When a user visits the website it should serve a pretty graph/chart of all of the computational data per keyword
The issue is that this is far too intense a task to be done by the same application that serves the website; users would be waiting ages to see their data. My current plan is to build a separate API that supplies the website with the data, which the website can then store in its database. This separate API would process the data without fear of affecting users, and it should be able to finish its current computation in under 30 minutes, in time for the next round of data.
Can anyone help me understand how I can better equip my project to handle the scale? I'd love some ideas.
As a 4th year CS student I figured it's time to put a real project out into the world, and I am very excited about it and the progress I've made so far. My main worry is that the end users will be negatively affected if I don't figure out some kind of pipeline to make this process happen.
To re-iterate my idea:
Django + React - This is the forward facing website
External API - Grabs the data off the internet and processes it, and waits for a GET request from the website
Is there a better way to do this? Or, on the other hand, am I severely overestimating how computationally heavy this is?
Edit: Including current research
Handling computationally intensive tasks in a Django webapp
Separation of business logic and data access in django
What you want is to have the computation executed by a different process in the "background".
The most straightforward and popular solution is to use Celery, see here.
The Celery worker(s), which perform the background task, can either run on the same machine as the web application or, when scale becomes an issue, be configured to run on an entirely different machine.

How to skip waiting for response from db part when saving data in Django?

The challenge is that I need to recompute all the data I have in the db after saving a new instance. The computation takes no more than 2 minutes, which is fine for my problem. I have a custom save method, and all I need to do is go through all items and call item.save(), but since that takes more than 30 seconds I run into 'request timeout' errors (using Heroku, btw). Any ideas on how to deal with this?
@Uladzislau Malinouski, you can use asynchronous tools like Celery.
The way it works is that the task that takes a significant amount of time is wrapped as an asynchronous task, so the computation is done in the background. You can set up Celery on Heroku by following this guide - https://devcenter.heroku.com/articles/celery-heroku.
If you are using a free account and have not provided your card details on Heroku, you will not be able to add the add-on for a broker used along with Celery, such as Redis or RabbitMQ.
In such cases, you may follow this guide to schedule an asynchronous task on Heroku - https://devcenter.heroku.com/articles/clock-processes-python

Django background tasks with Twisted

I'm making a web app using Django.
I'd like events to trigger 'background' tasks that run parallel to the Django application. (By parallel I just mean that they don't impact the speed of the user's experience.)
Types of tasks I'm talking about
a user logs in and an event is triggered to start populating that user's cache in anticipation of future requests based on their usage habits.
a user posts some data to the database, but that post triggers an API call to another website where the returned data will be parsed, aggregated and used to supplement that user's post.
rolling updates of data used in the app through api calls to other websites
aggregating data and running general maintenance tasks.
After a few days of research I'm thinking that I should use Twisted to accomplish this, which has led me to my question:
Is twisted overkill for what I'm trying to accomplish?
Many of these tasks are far more I/O bound than CPU bound, so I'm thinking asynchronous is best.
Any advice would be appreciated.
Thank you
Yes, I think it's overkill.
Rather than fold in a full async framework such as Twisted, with all the technical overhead that brings, you might be better off using a task queue to be able to do what you want as a background process.
When your app needs to do a background task (anything that would otherwise block the request/response cycle), put the task in the queue and let a separate worker process pick things off the queue and deal with them as fast as it can. (You can always add more workers).
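The queue-and-worker pattern described above can be sketched with nothing but the standard library. This is an in-process illustration only, not how celery or rq are implemented: those libraries put the queue in an external broker such as Redis so that workers can live in separate processes or on other machines. The helper names worker, start_workers, and stop_workers are hypothetical.

```python
import queue
import threading


def worker(tasks, results):
    """Pull jobs off the queue until a None sentinel arrives."""
    while True:
        job = tasks.get()
        try:
            if job is None:  # sentinel: shut this worker down
                return
            func, arg = job
            try:
                results.append(func(arg))
            except Exception as exc:
                # One failing job must not kill the worker.
                results.append("failed: %s" % exc)
        finally:
            tasks.task_done()


def start_workers(tasks, results, count=2):
    """Start `count` worker threads; more workers drain the queue faster."""
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(count)
    ]
    for thread in threads:
        thread.start()
    return threads


def stop_workers(tasks, threads):
    """Send one sentinel per worker, then wait for all of them to finish."""
    for _ in threads:
        tasks.put(None)
    for thread in threads:
        thread.join()


if __name__ == "__main__":
    tasks, results = queue.Queue(), []
    threads = start_workers(tasks, results)
    tasks.put((len, "hello"))          # a job that succeeds
    tasks.put((int, "not a number"))   # a job that raises
    stop_workers(tasks, threads)
    print(results)
```

The request handler only ever calls tasks.put(...), which returns immediately; all the slow work happens in the workers.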
Two of the most popular queue libraries for Python/Django are celery and rq. They're especially good with Redis as a backend, but there are other backend options, too.
Personally, I much prefer rq over celery, in terms of its API and its clean setup, but both are used by a lot of people.
And both are definitely easier to get your head around than something like Twisted, IMO.

Running continually tasks alongside the django app

I'm building a django app which lists the hot(according to a specific algorithm) twitter trending topics.
I'd like to run some processes indefinitely to make Twitter API calls and update the database (PostgreSQL) with the new information. This way the hot trending topic list gets updated asynchronously.
At first it seemed to me that celery+rabbitmq were the solution to my problem, but from what I understand they are used within django to launch scheduled or user triggered tasks, not indefinitely running tasks.
The solution that comes to my mind is write a .py file to continually put trending topics in a queue and write independent .py files continually running, making get queue requests and saving the data in the db used by django with raw SQL or SQLAlchemy. I think that this could work, but I'm pretty sure there is a much better way to do it.
If you just need to keep some processes running continually, supervisor is a nice solution.
You can combine it with any queuing technology you like to push things into your queues.

How to best launch an asynchronous job request in Django view?

One of my view functions is a very long processing job and clearly needs to be handled differently.
Instead of making the user wait for a long time, it would be best if I were able to launch the processing job, which would email the results, and, without waiting for completion, notify the user that their request is being processed and let them browse on.
I know I can use os.fork, but I was wondering if there is a 'right way' in terms of Django. Perhaps I can return the HTTP response, and then go on with this job somehow?
There are a couple of solutions to this problem, and the best one depends a bit on how heavy your workload will be.
If you have a light workload you can use the approach used by django-mailer, which is to define a "jobs" model, save new jobs into the database, then have cron run a stand-alone script every so often to process the jobs stored in the database (deleting them when done). You can use something like django-chronograph to make managing the job scheduling easier.
If you need help understanding how to write a script to process the job see James Bennett's article Standalone Django Scripts for help.
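As a rough illustration of the "jobs table plus cron" approach, here is a minimal sketch that uses the standard library's sqlite3 module in place of a Django model, so that it runs on its own. The table layout and the function names create_schema, add_job, and process_jobs are assumptions made for this example; django-mailer's actual schema is different.

```python
import sqlite3


def create_schema(conn):
    """Create the (hypothetical) jobs table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
    )


def add_job(conn, payload):
    """What the view would do: record the job and return right away."""
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()


def process_jobs(conn, handler):
    """What the cron-run script would do: handle each job, then delete it."""
    processed = 0
    rows = conn.execute("SELECT id, payload FROM jobs ORDER BY id").fetchall()
    for job_id, payload in rows:
        try:
            handler(payload)
        except Exception:
            continue  # leave the job in the table for the next run
        conn.execute("DELETE FROM jobs WHERE id = ?", (job_id,))
        conn.commit()
        processed += 1
    return processed


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    add_job(conn, "email report to user 1")
    add_job(conn, "email report to user 2")
    print(process_jobs(conn, print))
```

Because a job is deleted only after its handler succeeds, a job that fails simply stays in the table and is picked up again on the next cron run.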
If you have a very high workload, meaning you'll need more than a single server to process the jobs, then you want to use a real distributed task queue. There is a lot of competition here so I can't really detail all the options, but a good one to use with Django apps is celery.
Why not simply start a thread to do the processing and then go on to send the response?
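The thread suggestion can be sketched as follows, using only the standard library. The names start_in_background, long_job, and handle_request are hypothetical, and handle_request is a plain function standing in for a Django view. One caveat to keep in mind: the thread lives inside the web server process, so the work is lost if that process is restarted before the job finishes.

```python
import threading


def start_in_background(func, *args):
    """Run func(*args) in a separate thread and return without waiting."""
    thread = threading.Thread(target=func, args=args, daemon=True)
    thread.start()
    return thread


def long_job(results, n):
    """Stand-in for the slow processing; stores its answer in `results`."""
    results.append(sum(range(n)))


def handle_request(results):
    """Stand-in for a view: kick off the job, then respond immediately."""
    thread = start_in_background(long_job, results, 1000)
    return "Your request is being processed", thread


if __name__ == "__main__":
    results = []
    message, thread = handle_request(results)
    print(message)   # available before the job has necessarily finished
    thread.join()    # only so this demo can show the result
    print(results)
```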
Before you select a solution, you need to determine how the process will be run. I.e., is it the same process for every single user, where the data is the same and can be scheduled regularly? Or does each user request something, with slightly different results each time?
As an example, if the data will be the same for every single user and can be run on a schedule you could use cron.
See: http://www.b-list.org/weblog/2007/sep/22/standalone-django-scripts/
or
http://docs.djangoproject.com/en/dev/howto/custom-management-commands/
However, if the requests will be ad hoc and you need something scalable that can handle high load and is asynchronous, what you are actually looking for is a message queuing system. Your view will add a request to the queue, which will then get acted upon.
There are a few options to implement this in Django:
Django Queue service is purely django & python and simple, though the last commit was in April and it seems the project has been abandoned.
http://code.google.com/p/django-queue-service/
The second option, if you need something that scales, is distributed and makes use of open source message queuing servers: celery is what you need
http://ask.github.com/celery/introduction.html
http://github.com/ask/celery/tree