What are the advantages of celerybeat over cron? - django

I see many people preferring celerybeat over cron jobs for periodic tasks. I see the documentation for celerybeat and I can see information on how to use it, but not why (or when) I should prefer it over cronjobs.
http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html#introduction

I have used both and have come to the conclusion that beat is better at control than cron.
You can wire it up so that your control is via django admin instead of sshing in and changing the crontab. Also, there is an implicit portability when using beat - meaning you can move it from machine to machine by way of configuration instead of a login.
Of course, there are disadvantages as well but they are few. We used to use pid files to control the singleton aspect of a job but now we use a generic database semaphore table (other people have used memcache, but I just don't feel comfortable with that).

Related

What is the best way to transfer large files through Django app hosted on HEROKU?

HEROKU gives me H12 error on transferring the file to an API from my Django application (Understood it's a long running process and there is some memory/worker tradeoff I guess so). I am on one single hobby Dyno right now.
The function just runs smoothly for around 50MB file. The file itself is coming from a different source ( requests python package )
The idea is to build a file transfer utility using Django app on HEROKU. The file will not gets stored in my app side. Its just getting from point A and sending to point B.
Went through multiple discussions along with the standard HEROKU documentations, however I am struggling in between in some concepts:
Will this problem be solved by background tasks really? (If YES, I am finding explanation of the process than the direct way to do it such that I can optimize my flow)
As mentioned in standard docs, they recommend background tasks using RQ package for python, I am using Postgre SQL at moment. Will I need to install and manage Redis Database as well for this. Is this even related to Database?
Some recommend using extra Worker other than the WEB worker we have by default. How does this relate to my problem?
Some say to add multiple workers, not sure how this solve it. Let's say today it starts working for large files using background tasks, what if the load of users at same time increases. How this will impact my solution and how should I plan the mitigation plan around the risks.
If someone here has a strong understanding with respect to the architecture, I am here to listen your experiences and thoughts. Also, let me know if there is other way than HEROKU from a solution standpoint which will make this more easy for me.
Have you looked at using celery to run this as a background task?
This is a very standard way of dealing with requests which take a large time to complete.
Will this problem be solved by background tasks really? ( If YES, I am finding explanation of the process than the direct way to do it such that I can optimise my flow )
Yes it can be solved by background tasks. If you are using something like Celery which has direct support for django, you will be running another instance of your Django application but with a different startup command for Celery. It then keeps reading for new tasks to execute and reads the task name from the redis queue (or rabbitmq - whichever you use as the broker) and then executes that task and updates the status back to redis (or the broker you use).
You can also use flower along with celery so that you have a dashboard to see how many tasks are being executed and what are their statuses etc.
As mentioned in standard docs, they recommend background tasks using RQ package for python, I am using Postgre SQL at moment. Will I need to install and manage Redis Database as well for this. Is this even related to Database?
To use background task with Celery you will need to set up some sort of message broker like Redis or RabbitMQ
Some recommend using extra Worker other than the WEB worker we have by default. How does this relate to my problem?
I dont think that would help for your use case
Some say to add multiple workers, not sure how this solve it. Let's say today it starts working for large files using background tasks, what if the load of users at same time increases. How this will impact my solution and how should I plan the mitigation plan around the risks.
When you use celery, you will have to start few workers for that Celery instance, these workers are the ones who execute your background tasks. Celery documentation will help you with exact count calculation of these workers based on your instance CPU and memory etc.
If someone here has a strong understanding with respect to the architecture, I am here to listen your experiences and thoughts. Also, let me know if there is other way than HEROKU from a solution standpoint which will make this more easy for me.
I have worked on few projects where we used Celery with background tasks to upload large files. It has worked well for our use cases.
Here is my final take on this after full evaluation, trials and earlier recommendations made here, thanks #arun.
HEROKU needs a web worker to deliver the website runtime which hold 512MB of memory, operations your perform if are below this limits should be fine.
Beyond that let's say you have scenarios like mentioned above where a large file is coming from one source api and going into another target api with Django app, you will have to:
First, you will have to run the file upload function as a background process since it will take time more than 30 seconds to respond which HEROKU expects to return. If not H12 Error is waiting for you. Solution to this is implementing Django Background tasks, Celery worked in my case. So here Celery is your same Django app functionality running as a background handler which needs its own app Dyno ( The Worker ) This can be scaled as needed in future.
To make your Django WSGI ( Frontend App ) talk to the Celery ( Background App), you need a message broker in between which can be HEROKU Redis, RabbitMQ, etc.
Second, the problems doesn't gets solved here even though you have a new Worker dedicated for the Celery app, the memory limits will still apply as its also a Dyno with its own memory.
To overcome this, your Python requests module should download the file in stream instead of direct downloading complete file in a single memory buffer. Iterate and load the stream data in chunks and send the file chunks to target endpoint.
Even chunk size plays here an important role. I will not put exact number here since it depends on various factors:
Should not be too small, else it will take more time to transfer.
Should not be too big to be handled by either of the source/target endpoint servers.

Dynamic periodic tasks - alternatives to Celery beat

If one wants to set up dynamic and standard periodic tasks scheduling and in a Django project, are there any stable alternatives to celery and celery beat? (when I say dynamic I mean such thing as described here)
Would Dramatiq or any other schedulers, for instance, allow for such customization as dynamic user-launched schedulling of periodic tasks?
Or are there any other strategies for creating some kind of dynamic schedule with periodic tasks for Django in general ?
There is a way to configure jobs in Django.
There is a good and helpful extension called django-crontab GitHub Repo. This could be something that will allow you to do what you need. This will tie in with Django specific as requested. Hope this is some help you on your alternative to Celery beats.
Have a good day.
Yes I had the same problem with django-celery-beat the problem is you cannot manage the periodic tasks dynamically (changing the schedule or adding the task itself on a running celery worker), to overcome this issue you can use this library djang-redbeat which does exactly what you want. The only difference is the CELERY_BEAT_SCHEDULER this library uses Redis database to store the tasks and their results.
https://pypi.org/project/django-redbeat/

Django & Celery: How do I schedule a job to run only once using Celery(similar to "at" command in linux)?

I looked at django-celery tutorial and I think it will really help me running the background tasks without letting the users to wait. However, I have a specific requirement in the program such that when user enters a date, django should be able to do the scheduling and defer the execution to a later time. I have used at program before but it gives a lot of permission issues. But when I read the documentation for Celery, I can only see that Celery supports cron like tasks called #periodic_task. I'm sure that it also provides at like mechanism, but I couldn't find any documentation. Can anybody point me to some resources or simply tell me how to achieve that? Thanks.
The docs state that you can schedule tasks to execute at a specific time, using the eta argument.
You can supply the countdown or ETA argument to the apply_async() function. By doing so, you can define the earliest time that the task is gonna be executed, but not the exact one (it depends on your queue). For more details see here.

Updating global variable in django from external program

Application running in django, apache, python and mysql:
I want to create global variable which I should be able to update from external python script may be using a cron job.
The the solution I think of... is having a webpage which updates that variable and then call this through a python script.
Any cleaner solutions?
You can use a huge variety of inter-process communication ("IPC") mechanisms, but I don't believe any of them will be easier, safer or "cleaner" than going through HTTP (which is basically what "a webpage which updates that variable" amounts to!-) -- all of your settings are already prepared to handle HTTP ("web";-) interactions, after all! Alternatives such as having the cron job update MySQL directly are less "clean" and more complicated to integrate with the rest of your application (your app would have to be "polling" the DB regularly about that one variable -- eek!-).

How do you deploy cron jobs to production?

How do people deploy/version control cronjobs to production? I'm more curious about conventions/standards people use than any particular solution, but I happen to be using git for revision control, and the cronjob is running a python/django script.
If you are using Fabric for deploment you could add a function that edits your crontab.
def add_cronjob():
run('crontab -l > /tmp/crondump')
run('echo "#daily /path/to/dostuff.sh 2> /dev/null" >> /tmp/crondump')
run('crontab /tmp/crondump')
This would append a job to your crontab (disclaimer: totally untested and not very idempotent).
Save the crontab to a tempfile.
Append a line to the tmpfile.
Write the crontab back.
This is propably not exactly what you want to do but along those lines you could think about checking the crontab into git and overwrite it on the server with every deploy. (if there's a dedicated user for your project.)
Using Fabric, I prefer to keep a pristine version of my crontab locally, that way I know exactly what is on production and can easily edit entries in addition to adding them.
The fabric script I use looks something like this (some code redacted e.g. taking care of backups):
def deploy_crontab():
put('crontab', '/tmp/crontab')
sudo('crontab < /tmp/crontab')
You can also take a look at:
http://django-fab-deploy.readthedocs.org/en/0.7.5/_modules/fab_deploy/crontab.html#crontab_update
django-fab-deploy module has a number of convenient scripts including crontab_set and crontab_update
You can probably use something like CFEngine/Chef for deployment (it can deploy everything - including cron jobs)
However, if you ask this question - it could be that you have many production servers each running large number of scheduled jobs.
If this is the case, you probably want a tool that can not only deploy jobs, but also track success failure, allow you to easily look at logs from the last run, run statistics, allow you to easily change the schedule for many jobs and servers at once (due to planned maintenance...) etc.
I use a commercial tool called "UC4". I don't really recommend it, so I hope you can find a better program that can solve the same problem. I'm just saying that administration of jobs doesn't end when you deploy them.
There are really 3 options of manually deploying a crontab if you cannot connect your system up to a configuration management system like cfengine/puppet.
You could simply use crontab -u user -e but you run the risk of someone having an error in their copy/paste.
You could also copy the file into the cron directory but there is no syntax checking for the file and in linux you must run touch /var/spool/cron in order for crond to pickup the changes.
Note Everyone will forget the touch command at some point.
In my experience this method is my favorite manual way of deploying a crontab.
diff /var/spool/cron/<user> /var/tmp/<user>.new
crontab -u <user> /var/tmp/<user>.new
I think the method I mentioned above is the best because you don't run the risk of copy/paste errors which helps you maintain consistency with your version controlled file. It performs syntax checking of the cron tasks inside of the file, and you won't need to perform the touch command as you would if you were to simply copy the file.
Having your project under version control, including your crontab.txt, is what I prefer. Then, with Fabric, it is as simple as this:
#task
def crontab():
run('crontab deployment/crontab.txt')
This will install the contents of deployment/crontab.txt to the crontab of the user you connect to the server. If you dont have your complete project on the server, you'd want to put the crontab file first.
If you're using Django, take a look at the jobs system from django-command-extensions.
The benefits are that you can keep your jobs inside your project structure, with version control, write everything in Python and configure crontab only once.
I use Buildout to manage my Django projects. With Buildout, I use z3c.recipe.usercrontab to install cron jobs in deploy or update.
You said:
I'm more curious about conventions/standards people use than any particular solution
But, to be fair, the particular solution will depend in your environment and there is no universal elegant silver bullet. Given that you happen to be using Python/Django, I recommend Celery. It is an asynchronous task queue for Python, which integrates nicely with Django. And, on top of the features that it gives as an asynchronous task queue, it also has specific features for periodic tasks.
I have personally used the django-celery-beat integration and it integrates perfectly with Django settings and behaves correctly in distributed environments. If your periodic tasks are related to Django stuff, I strongly recommend to take a look at Celery I started using it only for certain asynchronous mailing and ended up using it for a lot of asynchronous tasks + periodic sanity checks and other web application maintenance stuff.