How to run multiple APScheduler tasks in flask in parallel? - flask

I am trying to build a Flask app to automate the process of scraping certain websites. I am using APScheduler to schedule the scraping functions.
The app works fine with just one scheduler, but when I add a second one the tasks won't run. There's no error messages, nothing.
Does anyone know what could be causing this? I have also read about about task queues... how would one go about setting up a task queue while using APScheduler?

Related

Difference between usage of Django celery and Django cron-jobs?

I am sorry if its basics but I did not find any answers on the Internet comparing these two technologies. How should I decide when to use which as both can be used to schedule and process periodic tasks.
This is what an article says:
Django-celery :
Jobs are essential part of any application that does
some processing for you in the background. If your job is real time
Django application celery can be used.
Django-cronjobs :
django-cronjobs can be used to schedule periodic_task which is a
valid job. django-cronjobs is a simple Django app that runs registered
cron jobs via a management command.
can anyone explain me the difference between when should I choose which one and Why? Also I need to know why celery is used when the computing is distributed and why not cron jobs
The two things can be used for the same goal (background execution). However, if you are going to choose wisely, you should really understand that they are actually completely different things.
Here's what I wish someone had told me back when I was a noob (instead of the novice level that I have achieved today :)).
cron
The concept of a cron job is that we want a command / process to be executed on some schedule. Furthermore, we want that process to receive x,y,z parameters, run with a,b,c environment variables, and as user id 123.
Some cron systems may facilitate a few extra features, such as:
catching up on missed tasks (e.g. the server was off for a power outage all night and as soon as we turn it on, it runs the 8 instances of the command we normally run hourly).
might help you with the type of locking you normally do using a pid file in order to avoid parallel runs of the same command.
For the most part, cron systems are meant to be dumb: "just run this command at this time, thanks!".
Celery
The concept of Celery is much more sophisticated. It works with tasks, chains & chords of tasks, error handling, and (in most cases) collection of work result. It has a queue (or many queues) of work and a worker (or many). When a task (really just a message describing requested work) enters the queue it waits there until a worker is available to handle it. Much the same way as 1 or more employees at the DMV service a room full of waiting customers.
Furthermore, Celery can facilitate distributed work. That's a bit like (if I may torture the analogy a bit) - the difference between a DMV office where every worker shares the same phone, computer, copier, etc and a DMV where workers have dedicated resources and are never blocked by other workers.
Celery for web apps
In web applications, Celery is often used when a bit of web access results in a thing to be done that should be handled out of band of the conversation with the web browser. For example:
the web user just did something which should result in an email being sent. In order to send an email, your web server will need to contact a mail server. This could take time, the server could be busy, etc - we cant make the web user just wait, seeing nothing on their browser while we do this. Well, you can but it won't work reliably. So, we do that email send as a bit of work in the queue. That way, it can happen "whenever" and the web server can get back to communicating with the browser.
the user just submitted a credit card as payment. You're going to need to contact the card processor, but that might take several seconds. You might even have to contact them multiple times (e.g. they are really busy there right now). Again, you don't want your user's web browser to just sit blankly and you don't want a web server process or thread of execution tied up. Instead, you use Celery to create a job, you tell the browser to check back in a few seconds (or use a "web socket"), and your web server moves on and talks to other web users. When the browser checks back later, you lookup the task id and find out from celery whether it is finished and what the outcome was (card declined, etc).
Using Celery as cron
When you use Celery as a "cron system" all you are really doing is saying: "hey, can someone please generate work of X type on Y schedule". A process is created that runs continuously which sleeps most of the time and wakes up occasionally to inject a bit of work into the queue on the schedule you requested.
Usually the "hey someone" that you ask to do that for you is: celery beat and beat gets the schedule you want from somewhere in the database or from your settings file.
I searched for celery vs cron and found a few results that might be helpful to you.
https://www.reddit.com/r/Python/comments/m2dg8/explain_like_im_five_why_or_why_not_would_celery/
Why would running scheduled tasks with Celery be preferable over crontab?
Distributed task queues (Ex. Celery) vs crontab scripts

Running into problems deploying a multithreated application in aws

So I have been working on a Django application and everything seems to work fine in my developing environment but not on a cloud instance.
The application's server must do two things in parallel. I initially used a subprocess to execute some task while the main process keeps collecting data.
Everything works as expected offline but when I deploy the application onto AWS it just does one thing at a time.
I made sure I deployed it on an instance with multiple cores. I even went back to change the subprocess task and spawned a new thread instead but that did not seem to work either.
Also not sure if this could be the issue but running the subtask involves running a shell command that starts up a docker container.
Has anyone ran into a similar issue or got any suggestions about what could be causing this behavior?
Thanks.

Remote Django application sending messages to RabbitMQ

I'm starting to get familiar with the RabbitMQ lingo so I'll try my best to explain. I'll be going into a public beta test in a few weeks and this is the set up I am hoping to achieve. I would like Django to be the producer; producing messages to a remote RabbitMQ box and another Celery box listening on the RabbitMQ queue for tasks. So in total there would be three boxes. Django, RabbitMQ & Celery. So far, from the Celery docs, I have successfully been able to run Django and Celery together and Rabbit MQ on another machine. Django simply calls the task in the view:
add.delay(3, 3)
And the message is sent over to RabbitMQ. RabbitMQ sends it back to the same machine that the task was sent from (since Django and celery share the same box) and celery processes the task.
This is great for development purposes. However, having Django and Celery running on the same box isn't a great idea since both will have to compete for memory and CPU. The whole goal here is to get clients in and out of the HTTP Request cycle and have celery workers process the tasks. But the machine will slow down considerably if it is accepting HTTP requests and also processing tasks.
So I was wondering is there was a way to make this all separate from one another. Have Django send the tasks, RabbitMQ forward them, and Celery process them (Producer, Broker, Consumer).
How can I go about doing this? Really simple examples would help!
What you need is to deploy the code of your application on the third machine and execute there only the command that handles the tasks. You need to have the code on that machine also.

Celery and Twitter streaming api with Django

I'm having a really hard time conceptualising how I can connect to the the twitter streaming api and process tweets via an admin interface provided by Django.
The main problem is starting a daemon from Django and having the ability to stop/start it, plus making sure there is provision for monitoring. I don't really want to use upstart for this purpose because I want to try and keep the project as self contained as possible.
I'm currently attempting the following and am unsure if it's perhaps the wrong way to go about things
Start a celery task from Django which establishes a persistent connection to the streaming API
The above task creates subtasks which will process tweets and store them
Because celeryd runs as a daemon it will automatically run the first task again if the connection breaks and the task fails - does this mean I don't need any additional monitoring?
Does the above make sense or have I misunderstood how celery works?

Getting the Twitter stream on Heroku with Django App

I need to constantly monitor a Twitter stream using Heroku. Basically what I want to do is start the monitoring process up and never have it stop. I was looking into celery, but from my understanding of it, it looks like a user initiated or short term process adds tasks to a queue that are then processed by another queue. This is different model than having a background process constantly monitoring a Twitter stream. What would be the best way to monitor a Twitter stream for a Django app on Heroku?
I'm not aware of anything in Django that can run in the background like that. It's certainly one of the limitations of living in the web-app sandbox.
If you have access to your server in Heroku (?) you could write your own script/application along the lines of this tutorial and daemonize using Supervisord.
if not:
Celery has a nice periodic scheduler. If you're okay with polling instead of streaming (API), I might just use the twitter REST API and the scheduler in Celery to periodically poll and update. It's helpful to match the scheduling with the rate limits as well.