Execute task once when starting uwsgi-emperor app - django

I'm using uwsgi-emperor on Debian 8 in my production systems. For a specific Django project, I need to run a computing intensive setup task only once at the launch of the vassal. The vassal can have multiple workers/threads, but the task must be executed only one time, no matter how many workers/threads are spawned.
Currently, I'm executing this setup task every time a new worker is launched, but this is clearly suboptimal. The setup task is an invocation of a method from the same Django project, but I think that doesn't change the problem.
Is there any way of doing this from uWSGI?

You could try to use a singletone approach, this code at settings.py would call the startup_only_once() function only once:
from tendo.singleton import SingleInstance
def startup_only_once():
print("One time only")
try:
FIRST_START = SingleInstance()
startup_only_once()
except:
pass

Related

Scheduler duplicate email 8 times [duplicate]

We have a web app made with pyramid and served through gunicorn+nginx. It works with 8 worker threads/processes
We needed to jobs, we have chosen apscheduler. here is how we launch it
from apscheduler.events import EVENT_JOB_EXECUTED, EVENT_JOB_ERROR
from apscheduler.scheduler import Scheduler
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,\
seconds=JOB_INTERVAL)
The issue is that all the worker processes of gunicorn pick the scheduler up. We tried implementing a file lock but it does not seem like a good enough solution. What would be the best way to make sure at any given time only one of the worker process picks the scheduled event up and no other thread picks it up till next JOB_INTERVAL?
The solution needs to work even with mod_wsgi in case we decide to switch to apache2+modwsgi later. It needs to work with single process development server which is waitress.
Update from the bounty sponsor
I'm facing the same issue described by the OP, just with a Django app. I'm mostly sure adding this detail won't change much if the original question. For this reason, and to gain a bit more of visibility, I also tagged this question with django.
Because Gunicorn is starting with 8 workers (in your example), this forks the app 8 times into 8 processes. These 8 processes are forked from the Master process, which monitors each of their status & has the ability to add/remove workers.
Each process gets a copy of your APScheduler object, which initially is an exact copy of your Master processes' APScheduler. This results in each "nth" worker (process) executing each job a total of "n" times.
A hack around this is to run gunicorn with the following options:
env/bin/gunicorn module_containing_app:app -b 0.0.0.0:8080 --workers 3 --preload
The --preload flag tells Gunicorn to "load the app before forking the worker processes". By doing so, each worker is "given a copy of the app, already instantiated by the Master, rather than instantiating the app itself". This means the following code only executes once in the Master process:
rerun_monitor = Scheduler()
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,\
seconds=JOB_INTERVAL)
Additionally, we need to set the jobstore to be anything other than :memory:.This way, although each worker is its own independent process unable of communicating with the other 7, by using a local database (rather then memory) we guarantee a single-point-of-truth for CRUD operations on the jobstore.
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
rerun_monitor = Scheduler(
jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,\
seconds=JOB_INTERVAL)
Lastly, we want to use the BackgroundScheduler because of its implementation of start(). When we call start() in the BackgroundScheduler, a new thread is spun up in the background, which is responsible for scheduling/executing jobs. This is significant because remember in step (1), due to our --preload flag we only execute the start() function once, in the Master Gunicorn process. By definition, forked processes do not inherit the threads of their Parent, so each worker doesn't run the BackgroundScheduler thread.
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore
rerun_monitor = BackgroundScheduler(
jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')})
rerun_monitor.start()
rerun_monitor.add_interval_job(job_to_be_run,\
seconds=JOB_INTERVAL)
As a result of all of this, every Gunicorn worker has an APScheduler that has been tricked into a "STARTED" state, but actually isn't running because it drops the threads of it's parent! Each instance is also capable of updating the jobstore database, just not executing any jobs!
Check out flask-APScheduler for a quick way to run APScheduler in a web-server (like Gunicorn), and enable CRUD operations for each job.
I found a fix that worked with a Django project having a very similar issue. I simply bind a TCP socket the first time the scheduler starts and check against it subsequently. I think the following code can work for you as well with minor tweaks.
import sys, socket
try:
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("127.0.0.1", 47200))
except socket.error:
print "!!!scheduler already started, DO NOTHING"
else:
from apscheduler.schedulers.background import BackgroundScheduler
scheduler = BackgroundScheduler()
scheduler.start()
print "scheduler started"
Short answer: You can't do it properly without consequences.
I'm using Gunicorn as an example, but it is essentially the same for uWSGI. There are various hacks when running multiple processes, to name a few:
use --preload option
use on_starting hook to start the APScheduler background scheduler
use when_ready hook to start the APScheduler background scheduler
They work to some extent but may get the following errors:
worker timing out frequently
scheduler hanging when there are no jobs https://github.com/agronholm/apscheduler/issues/305
APScheduler is designed to run in a single process where it has complete control over the process of adding jobs to job stores. It uses threading.Event's wait() and set() methods to coordinate. If they are run by different processes, the coordination wouldn't work.
It is possible to run it in Gunicorn in a single process.
use only one worker process
use the post_worker_init hook to start the scheduler, this will make sure the scheduler is run only in the worker process but not the master process
The author also pointed out sharing the job store amount multiple processes isn't possible. https://apscheduler.readthedocs.io/en/stable/faq.html#how-do-i-share-a-single-job-store-among-one-or-more-worker-processes He also provided a solution using RPyC.
While it's entirely doable to wrap APScheduler with a REST interface. You might want to consider serving it as a standalone app with one worker. In another word, if you have others endpoints, put them in another app where you can use multiple workers.

Is celery-beats can only trigger a celery task or normal task (Django)?

I am workign on a django project with celery and celery-beats. My main use case is use celery-beats to set up a periodical task as a background task, instead of using a front-end request to trigger. I would save the results and put it inside model, then pull the model to front-end view as a view to user.
My current problem is, not matter how I change the way I am calling my task, it always throwing the task is not registered in the task list inside celery.
I am trying to trigger a non-celery task(inside, it will call a celery taskthe , using celery beats module,
Below is the pesudo-code.
tasks.py:
#app.shared_task
def longrunningtask(a):
res = APIcall(a)
return res
caller.py:
from .task import longrunningtask
def dosomething(input_list):
for ele in input_list:
res.append(longrunningtask.delay(ele))
return res
Periodical Task :
schedule, created = CrontabSchedule.objects.get_or_create(hour = 1, minute = 34)
task = PeriodicTask.objects.create(crontab=schedule, name="XXX_task_", task='app.caller.dosomething'))
return HttpResponse("Done")
Nothing special about the periodical task, but This never works for me. It errored that not detected tasks or not registered tasks if I do not make the dosomething() as celery task.
Problem is I do not want to make the caller function a celery task, the reason being, that
Inside for loop, I would make parameter passing into the task(), I would like to see multiple celery long runing task is running with the for loop passing it and kick it. so I would create mutliple sub-task instead of as one giant running task.
Not necessary since longrunningtask is the task I need it to be run as celery task, no need its parent to be inside celery task.
Can someone please help me out of this dilemma? It's super frustrating and has been blocking me for a while.
Any suggestion or idea of this use case is also superhelpful!

Celery what happen to running tasks when using app.control.purge()?

Currently i have a celery batch running with django like so:
Celery.py:
from __future__ import absolute_import, unicode_literals
import os
import celery
from celery import Celery
from celery.schedules import crontab
import django
load_dotenv(os.path.join(os.path.dirname(os.path.dirname(__file__)), '.env'))
os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'base.settings')
django.setup()
app = Celery('base')
app.config_from_object('django.conf:settings', namespace='CELERY')
app.autodiscover_tasks()
#app.on_after_configure.connect
def setup_periodic_tasks(sender, **kwargs):
app.control.purge()
sender.add_periodic_task(30.0, check_loop.s())
recursion_function.delay() #need to use recursive because it need to wait for loop to finish(time can't be predict)
print("setup_periodic_tasks")
#app.task()
def check_loop():
.....
start = database start number
end = database end number
callling apis in a list from id=start to id=end
create objects
update database(start number = end, end number = end + 3)
....
#app.task()
def recursion_function(default_retry_delay=10):
.....
do some looping
....
#when finished, call itself again
recursion_function.apply_async(countdown=30)
My aim is whenever the celery file get edited then it would restart all the task -remove queued task that not yet execute(i'm doing this because recursion_function will run itself again if it finished it's job of checking each record of a table in my database so i'm not worry about it stopping mid way).
The check_loop function will call to an api that has paging to return a list of objects and i will compare it to by record in a table , if match then create a new custom record of another model
My question is when i purge all messages, will the current running task get stop midway or it gonna keep running ? because if the check_loop function stop midway looping through the api list then it will run the loop again and i will create new duplicate record which i don't want
EXAMPLE:
during ruuning task of check_loop() it created object midway (on api list from element id=2 to id=5), server restart -> run again, now check_loop() run from beginning(on api list from element id=2 to id=5) and created object from that list again(which 100% i don't want)
Is this how it run ? i just need a confirmation
EDIT:
https://docs.celeryproject.org/en/4.4.1/faq.html#how-do-i-purge-all-waiting-tasks
I added app.control.purge() because when i restart then recursion_function get called again in setup_periodic_tasks while previous recursion_function from recursion_function.apply_async(countdown=30) execute too so it multiplied itself
Yes, worker will continue execution of currently running task unless worker is also restarted.
Also, The Celery Way is to always expect tasks to be run in concurrent environment with following considerations:
there are many tasks running concurrently
there are many celery workers executing tasks
same task may run again
multiple instances of the same task may run at the same moment
any task may be terminated any time
even if you are sure that in your environment there is only one worker started / stopped manually and these do not apply - tasks should be created in such way to allow everything of this to happen.
Some useful techniques:
use database transactions
use locking
split long-running tasks into faster ones
if task has intermediate values to be saved or they are important (i.e. non-reproducible like some api calls) and their processing in next step takes time - consider splitting into several chained tasks
If you need to run only one instance of a task at a time - use some sort of locking - create / update lock-record in the database or in the cache so others (same tasks) can check and know this task is running and just return or wait for previous one to complete.
I.e. recursion_function can also be Periodic Task. Being periodic task will make sure it is run every interval, even if previous one fails for any reason (and thus fails to queue itself again as in regular non-periodic task). With locking you can make sure only one is running at a time.
check_loop():
First, it is recommended to save results in one transaction in the database to make sure all or nothing is saved / modified in the database.
You can also save some marker that indicates how many / status of saved objects, so future tasks can just check this marker, not each object.
Or somehow perform check for each element before creating it that it already exists in the database.
I am not going to write an essay like Oleg's excellent post above. The answer is simply - all running tasks will continue running. purge is all about the tasks that are in the queue(s), waiting to be picked by Celery workers.

How to achieve below objective.?

I am using celery with Django. Redis is my broker. I am serving my Django app via Apache and WSGI. I am running celery in supervisor mode. I am starting up a celery task named run_forever from wsgi.py file of my Django project. My intention was to start a celery task when Django starts up and run it forever in the background (I don't know if it is the right way to achieve the same. I searched it but couldn't find appropriate implementation. If you have any better idea, kindly share). It is working as expected. Now due to certain issue, I have added maximum-requests-250 parameter in the virtual host of apache. By doing so when it gets 250 requests it restarts the WSGI process.
So when every time it restarts a celery task 'run_forever' is created and run in the background. Eventually, when the server gets 1000 requests WSGI process would have restarted 4 times and I end in having 4 copies of 'run_forever' task. I only want to have one copy of the task to run at any point in time. So I would like to kill all the currently running 'run_forever' task every time the Django starts.
I have tried
from project.celery import app
from project.tasks import run_forever
app.control.purge()
run_forever.delay()
in wsgi.py to kill all the running tasks before starting `run_forever'. But didn't work
I have to agree with Dave Smith here--why do you have a task that runs forever? That said, to the extent that you want to safeguard a task from running twice, there are multiple strategies you can use. The easiest for implementation is using a database entry (since databases can be transactional and if you re using django, presumably you are using at least one database). n.b., in the code snippet below, I did not put my model in the right place to be picked up by a migration--I just put it in the same snippet for ease of discussion.
import time
from myapp.celery import app
from django.db import models
class CeleryGuard(models.Model):
task_name = models.CharField(max_length=32)
task_id = models.CharField(max_length=32)
#app.task(bind=True)
def run_forever(self):
created, x = CeleryGuard.objects.get_or_create(
task_name='run_forever', defaults={
'task_id': self.request.id
})
if not created:
return
# do whatever you want to here
while True:
print('I am doing nothing')
time.sleep(1440)
# make sure to cleanup after you are done
CeleryGuard.objects.filter(task_name='run_forever').delete()

How to setup my crontab in order to execute process_tasks in Django?

I decided to use a library for my Django project called django-background-tasks (link to the documentation: https://django-background-tasks.readthedocs.io/en/latest/). I want to deploy my Django application to a Linux server (Ubuntu 19.0.4). How should I write the crontab in order to call the command "process_tasks" every five seconds?
Here Running a cron every 30 seconds is a workaround to achieve the seconds part, but since I am new to this part of the job (deploying and automation of process), how should I create my crontab file in order to achieve my desired purpose?
I'll be using process_tasks for a lot of different functionalities like: do some analysis at night and send results in the morning, expire some codes, etc. So basically I'll need to be running it almost constantly.
Thank you in advance for any suggestion, if you need something more I would be happy to provide it to you.
Since you know cron only allows for a minimum of one minute but you want to run it every 5 seconds.
how about writing a shell script (running as a service via supervisor) that runs your task in an infinite loop. This script will sleeps after every run for 5 seconds.
The only difference between this service and a cron is:
A cron will fork a process every time it runs, a process runs your job and periodically check if the job is finished in order to clean it up, and this script (as service) will not work like cron but it'll do the job I believe.
#!/bin/bash
while true; do
# your code here,
# call your python file which initializes
# django variables (or whatever you want) and do the needful.
sleep 5;
done
You can configure a file process_django_tasks.sh(this file contains above code) with supervisor as a service so it runs on boot time and you have a control to start and stop with quick commands.
To test quickly you can easily run sh process_django_tasks.sh
If you could tell us why exactly you want your script to run every 5 seconds. Maybe we all suggest a better way than running a script every 5 sec.
I'd recommend using Supervisor (http://supervisord.org) to run python manage.py process_tasks, which will then monitor for the tasks you scheduled in your code and run them based on their repeat syntax:
To run your task every 5 seconds:
function_to_call(var, repeat=5, repeat_until=None)
I'd also recommend Supervisor to run your entire Django project.
Looking at the documentation the process_tasks command by default runs every 5 seconds checking for new tasks to run. Not sure if you just wanted to check every 5 seconds or actually run a specific task every 5 seconds.
Not knowing any details, this may be a completely inapplicable alternative. However, if there is a guarantee that there will be nothing to be done until (an) instance(s) of (a) particular model(s) is saved, and provided that this save doesn't happen far too frequently, then you might look at Django post_save signals instead. Whenever that save happens, either execute the task, or execute the task if it wasn't run already in the last five seconds.