I have a Flask application. This application mimics the routing of vehicles in a city; when a vehicle reaches a designated point, I have to wait 30-180 seconds before starting it again. I am trying to use APScheduler for this.
When the vehicle arrives I start an apscheduler job (with the 'date' trigger for X seconds). When the job fires, I do my processing.
This works well on my dev machine when I am running the flask app standalone. But when I try to run it on my production server (where the app is running in uwsgi mode) the job never fires. I have already set --enable-threads=true for the app, so that doesn't seem to be the problem.
My relevant code is as follows.
At my app initialization:
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
scheduler.start()
Whenever the trigger happens:
scheduler.add_job(func=myfunc, trigger='date', run_date=datetime.datetime.now() + datetime.timedelta(seconds=value))
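Put together, a minimal runnable sketch of the pattern described above might look like this (the route name, the myfunc body, and the delay value are placeholders, not part of the original code):

import datetime

from flask import Flask
from apscheduler.schedulers.background import BackgroundScheduler

app = Flask(__name__)
scheduler = BackgroundScheduler()
scheduler.start()

def myfunc():
    # restart the vehicle here
    print("vehicle restarted")

@app.route('/arrived/<int:value>')
def vehicle_arrived(value):
    # schedule a one-off job to fire 'value' seconds from now
    scheduler.add_job(func=myfunc, trigger='date',
                      run_date=datetime.datetime.now() + datetime.timedelta(seconds=value))
    return 'scheduled'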
Is there anything I am missing when using APScheduler under uWSGI? Or is there another option in Flask to achieve what I want?
Related
We have a web app made with Pyramid and served through gunicorn+nginx. It works with 8 worker threads/processes.
We needed to run scheduled jobs, so we chose APScheduler. Here is how we launch it:
from apscheduler.events import EVENT_JOB_EXECUTED, EVENT_JOB_ERROR
from apscheduler.scheduler import Scheduler
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,
                               seconds=JOB_INTERVAL)
The issue is that all of the Gunicorn worker processes pick the scheduler up. We tried implementing a file lock, but it does not seem like a good enough solution. What would be the best way to make sure that at any given time only one of the worker processes picks the scheduled event up, and that no other process picks it up until the next JOB_INTERVAL?
The solution needs to work with mod_wsgi as well, in case we decide to switch to Apache2+mod_wsgi later. It also needs to work with the single-process development server, which is Waitress.
Update from the bounty sponsor
I'm facing the same issue described by the OP, just with a Django app. I'm mostly sure adding this detail won't change much about the original question. For this reason, and to gain a bit more visibility, I also tagged this question with django.
Because Gunicorn starts with 8 workers (in your example), it forks the app 8 times into 8 processes. These 8 processes are forked from the master process, which monitors each of their statuses and has the ability to add/remove workers.
Each process gets a copy of your APScheduler object, which initially is an exact copy of the master process's APScheduler. The result is that each job gets executed "n" times in total, once by each of the "n" workers.
A hack around this is to run gunicorn with the following options:
env/bin/gunicorn module_containing_app:app -b 0.0.0.0:8080 --workers 3 --preload
The --preload flag tells Gunicorn to "load the app before forking the worker processes". By doing so, each worker is "given a copy of the app, already instantiated by the Master, rather than instantiating the app itself". This means the following code only executes once in the Master process:
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,
                               seconds=JOB_INTERVAL)
Additionally, we need to set the jobstore to be anything other than :memory:. This way, although each worker is its own independent process unable to communicate with the other 7, using a local database (rather than memory) guarantees a single point of truth for CRUD operations on the jobstore.
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

rerun_monitor = Scheduler(
    jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,
                               seconds=JOB_INTERVAL)
Lastly, we want to use the BackgroundScheduler because of its implementation of start(). When we call start() on the BackgroundScheduler, a new thread is spun up in the background, which is responsible for scheduling/executing jobs. This is significant because, remember, due to our --preload flag we only execute the start() function once, in the master Gunicorn process. By definition, forked processes do not inherit the threads of their parent, so none of the workers runs the BackgroundScheduler thread.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

rerun_monitor = BackgroundScheduler(
    jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
rerun_monitor.add_job(job_to_be_run, 'interval',
                      seconds=JOB_INTERVAL)
As a result of all of this, every Gunicorn worker has an APScheduler that has been tricked into a "STARTED" state but isn't actually running, because it drops the threads of its parent! Each instance is still able to update the jobstore database; it just doesn't execute any jobs.
Check out flask-APScheduler for a quick way to run APScheduler in a web-server (like Gunicorn), and enable CRUD operations for each job.
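For illustration, here is a rough sketch of what Flask-APScheduler usage can look like, based on its documented pattern (the job id and interval here are made up):

from flask import Flask
from flask_apscheduler import APScheduler

app = Flask(__name__)
scheduler = APScheduler()
scheduler.init_app(app)
scheduler.start()

@scheduler.task('interval', id='my_job', seconds=30)
def my_job():
    # placeholder job body
    print('running my_job')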
I found a fix that worked for a Django project with a very similar issue. I simply bind a TCP socket the first time the scheduler starts and check against it subsequently. I think the following code can work for you as well with minor tweaks.
import socket

try:
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", 47200))  # an arbitrary local port used as a cross-process lock
except socket.error:
    print("!!!scheduler already started, DO NOTHING")
else:
    from apscheduler.schedulers.background import BackgroundScheduler
    scheduler = BackgroundScheduler()
    scheduler.start()
    print("scheduler started")
Short answer: You can't do it properly without consequences.
I'm using Gunicorn as an example, but it is essentially the same for uWSGI. There are various hacks when running multiple processes, to name a few:
use --preload option
use on_starting hook to start the APScheduler background scheduler
use when_ready hook to start the APScheduler background scheduler
They work to some extent, but you may run into the following problems:
workers timing out frequently
the scheduler hanging when there are no jobs (https://github.com/agronholm/apscheduler/issues/305)
APScheduler is designed to run in a single process where it has complete control over the process of adding jobs to job stores. It uses threading.Event's wait() and set() methods to coordinate. If they are run by different processes, the coordination wouldn't work.
It is possible to run it with Gunicorn in a single process:
use only one worker process
use the post_worker_init hook to start the scheduler; this makes sure the scheduler runs only in the worker process, not in the master process (see the sketch below)
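A minimal sketch of that single-worker setup, assuming a gunicorn.conf.py config file (the job module and interval are placeholders):

# gunicorn.conf.py
workers = 1

def post_worker_init(worker):
    # runs inside the (single) worker process after it has been initialized,
    # so the scheduler thread lives in the worker, not in the master
    from apscheduler.schedulers.background import BackgroundScheduler
    from myapp.jobs import my_job  # placeholder import; adjust to your project

    scheduler = BackgroundScheduler()
    scheduler.add_job(my_job, 'interval', seconds=30)
    scheduler.start()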
The author also pointed out that sharing the job store among multiple processes isn't possible: https://apscheduler.readthedocs.io/en/stable/faq.html#how-do-i-share-a-single-job-store-among-one-or-more-worker-processes He also provided a solution using RPyC.
While it's entirely doable to wrap APScheduler with a REST interface, you might want to consider serving it as a standalone app with one worker. In other words, if you have other endpoints, put them in another app where you can use multiple workers.
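A rough sketch of that standalone-app idea, using Flask purely for illustration (the endpoint, payload fields, and job function are assumptions, and the app would be served with exactly one worker):

# scheduler_app.py
from flask import Flask, request, jsonify
from apscheduler.schedulers.background import BackgroundScheduler

app = Flask(__name__)
scheduler = BackgroundScheduler()
scheduler.start()

def do_work(payload):
    # placeholder for the real job logic
    print("processing", payload)

@app.route('/jobs', methods=['POST'])
def add_job():
    data = request.get_json()
    job = scheduler.add_job(do_work, 'interval',
                            seconds=data.get('seconds', 60), args=[data])
    return jsonify({'job_id': job.id})

# run with a single worker, e.g.: gunicorn scheduler_app:app --workers 1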
Everything is working except task autodiscovery under Django. (Celery 5.2.6, Django 3.0.6)
If I run celery worker, the worker triggers autodiscovery, finds all my tasks, and displays them in a list as part of its startup process. However, if I run my Django app or a Django shell, this never happens.
Additionally, even though the docs promise that accessing the task registry will trigger autodiscovery, it does not. I print app.tasks, I call app.tasks.keys(), and still no autodiscovery - it only shows the tasks that are built-in or were registered when the module containing them happened to be imported for other reasons.
What do I need to do to trigger task autodiscovery?
PS - If I try adding force=True to app.autodiscover_tasks(), it fails because the Django app registry hasn't finished loading at that time.
I dug into the code and tried a bunch of things. Here's what I learned.
How Autodiscovery is Performed
Autodiscovery is actually performed by app._autodiscover_tasks() (src), which is invoked by the celery.signals.import_modules signal (src), which is sent by app.loader.import_default_modules() (src). The signal is connected to app._autodiscover_tasks() in app.autodiscover_tasks (src).
In other words, the only way to get autodiscovery to trigger is to:
call app.autodiscover_tasks() to register app._autodiscover_tasks as a receiver for signal celery.signals.import_modules
call app.loader.import_default_modules() to send the signal
The docs instruct me to do the first, but not the second. Thus, in my Django app, autodiscovery does not happen.
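To illustrate, the bare minimum that actually triggers discovery is both calls together (assuming app is your Celery instance):

app.autodiscover_tasks()             # only connects the receiver to the import_modules signal
app.loader.import_default_modules()  # sends the signal, which runs the discovery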
How Autodiscovery is Performed by the celery Command
If you run celery with the commands beat, report, shell, or worker, the resulting code path eventually calls app.loader.import_default_modules() directly. Thus, autodiscovery happens under those conditions - but not when you run your Django app, or open a Django shell, or any other context but those four commands, unless you explicitly call app.loader.import_default_modules().
On Finalize
The docs promise that accessing app.tasks will cause the app to finalize, which I assumed triggered autodiscovery, but it has nothing to do with that. All that finalize does is send an app.on_after_finalize signal, which does nothing.
My Solution
After much frustration and experimentation, I settled on this snippet in my celery.py:
app = Celery("myproj")
app.config_from_object("django.conf:settings", namespace="CELERY")
app.autodiscover_tasks()
#app.on_after_configure.connect()
def trigger_autodiscovery(sender, **kwargs):
app.loader.import_default_modules()
I am using Celery with Django. Redis is my broker. I am serving my Django app via Apache and mod_wsgi, and I am running Celery under supervisor. I start a Celery task named run_forever from the wsgi.py file of my Django project. My intention was to start a Celery task when Django starts up and have it run forever in the background (I don't know if this is the right way to achieve that; I searched but couldn't find an appropriate implementation, so if you have any better idea, kindly share). It works as expected. Now, due to a certain issue, I have added the maximum-requests=250 parameter to the Apache virtual host, so that the WSGI process is restarted after every 250 requests.
So every time it restarts, a new run_forever task is created and run in the background. Eventually, once the server has received 1000 requests, the WSGI process will have restarted 4 times and I end up with 4 copies of the run_forever task. I only want one copy of the task running at any point in time, so I would like to kill all currently running run_forever tasks every time Django starts.
I have tried
from project.celery import app
from project.tasks import run_forever
app.control.purge()
run_forever.delay()
in wsgi.py to kill all the running tasks before starting run_forever, but it didn't work.
I have to agree with Dave Smith here: why do you have a task that runs forever? That said, to the extent that you want to safeguard a task from running twice, there are multiple strategies you can use. The easiest to implement is a database entry (databases can be transactional, and if you're using Django, presumably you are using at least one database). N.b., in the code snippet below, I did not put my model in the right place to be picked up by a migration; I just put it in the same snippet for ease of discussion.
import time

from django.db import models

from myapp.celery import app


class CeleryGuard(models.Model):
    task_name = models.CharField(max_length=32)
    task_id = models.CharField(max_length=32)


@app.task(bind=True)
def run_forever(self):
    # get_or_create returns (object, created); only continue if this call created the row
    guard, created = CeleryGuard.objects.get_or_create(
        task_name='run_forever', defaults={
            'task_id': self.request.id
        })
    if not created:
        return
    # do whatever you want to here
    while True:
        print('I am doing nothing')
        time.sleep(1440)
    # make sure to clean up after you are done
    CeleryGuard.objects.filter(task_name='run_forever').delete()
I have a web page where I need to run a long sql process (up to 20 mins or so) when the user clicks on a certain button. The script runs, but the user is then unable to continue browsing the rest of the website.
I would like to have it so that when the button is clicked, it goes into a queue that runs in the background.
I have looked into django-background-tasks, but the problem is that it does not seem to be possible to start the queued tasks without running python manage.py process_tasks.
I have heard of Celery, but I am using a Windows system and it does not seem to be suitable.
I am new to Django and website infrastructure, and am not sure if this is feasible. I have also seen in an older response that the threading package can work for this, but I am unsure whether it is outdated.
You can use create_task, provided by asyncio, to run a background task without blocking the view for clients.
Python 3.7+
Asyncio create_task
Disclaimer: I'm not so sure myfunc() needs to be async unless you are performing a task that genuinely benefits from being awaited.
You could also have a while loop in myfunc() for periodic repeated operations (a sketch of that variant follows the snippet below).
import asyncio

async def myfunc():
    await asyncio.sleep(5)
    print("Hi, after 5 seconds.")

task = asyncio.create_task(myfunc())
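As mentioned above, the periodic variant is just a loop inside the coroutine; here is a rough sketch (the interval and the printed message are placeholders, and it assumes an event loop is already running when create_task is called):

import asyncio

async def periodic_func():
    # repeats until the task is cancelled
    while True:
        print("doing the periodic work")
        await asyncio.sleep(5)

task = asyncio.create_task(periodic_func())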
When I run celery -A proj inspect active_queues I see two servers showing the queues they are listening to, and both point to the same default queue named celery. Still, a task issued by the Django app gets executed twice, once by each Celery server.
I can see the transport type is also direct - the default one.
On my local setup the task gets executed only once, so I am sure the task is called only once by my Django app.
What could I be missing here?
OK, I looked up the docs. I think you need to set celerybeat-scheduler in your settings.py, which makes sure tasks are scheduled by a single scheduler.
http://celery.readthedocs.org/en/latest/configuration.html#celerybeat-scheduler
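For illustration, a sketch of what that could look like in settings.py, assuming the classic uppercase setting names (the value shown is Celery's default scheduler class; django-celery-beat's DatabaseScheduler is another common choice):

# settings.py
CELERYBEAT_SCHEDULER = "celery.beat:PersistentScheduler"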
On Redis you can set which database each application uses; giving each app its own database keeps their data separate.
If you are using Django, the configuration is
CELERY_BROKER_VHOST = {number of the database}
If you are not using Django, I believe the configuration is CELERY_REDIS_DB or redis_db, depending on your Celery version.
For instance, your first application could use CELERY_BROKER_VHOST = 1,
your second application could use CELERY_BROKER_VHOST = 2,
and your local development environment could use CELERY_BROKER_VHOST = 99.
http://docs.celeryproject.org/en/latest/userguide/configuration.html#id8
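Another common way to select the Redis database, shown here only as an assumed alternative, is the trailing number in the broker URL:

# settings.py of the first application (local Redis assumed)
CELERY_BROKER_URL = "redis://localhost:6379/1"

# settings.py of the second application
CELERY_BROKER_URL = "redis://localhost:6379/2"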