I have recently created a version control page for my application to manage the deployment process.
(Yeah, I know, GitHub + hooks are better than rewriting from zero. But we are in Iran and our beloved government has blocked all SSH connections to outside the country. :(( )
There is a merge + reload action on the page. The merge works like the other parts, but the reload part fails without any message. I have added a sudoers entry for the kill command, and the user of the worker process has enough permissions. I even executed the code from the Django shell and it reloaded the process.
Is there any restriction on receiving signals, such as workers not being able to reload their master?
Here's the related code:
from subprocess import Popen, PIPE

def command(x):
    return str(Popen(x.split(' '), stdout=PIPE).communicate()[0])

pid = open(PATH + "/logs/gunicorn.pid").readline().strip()
cmd = "sudo kill -HUP %s" % pid
content += command(cmd)
My guess off the top of my head is that the restart is not working because the process calling the reload is itself being killed. Maybe try to daemonize a subprocess that exits after calling the reload? Take a look at this post:
spawning process from python
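For what it's worth, a rough sketch of that idea, reusing the pid read from the same gunicorn.pid file as above (the helper name and the one-second delay are my own assumptions):

import os
from subprocess import Popen

def detached_reload(pid):
    # os.setsid puts the child in its own session, so the HUP delivered to the
    # gunicorn master does not take this helper down together with the worker.
    Popen(
        ["/bin/sh", "-c", "sleep 1 && sudo kill -HUP %s" % pid],  # delay is arbitrary
        preexec_fn=os.setsid,
        close_fds=True,
    )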
We have a Flask script get_logs.py that uses APScheduler and contains the following job:
scheduler.add_job(id="create_recommendation_entries", trigger='interval', seconds=60*10, func=create_entries)
Someone ran the script, and now the logs show that it is still running at a 10-minute interval even after it was terminated.
The process id is not listed, nor does it show up using grep, and we don't know whether it was executed using nohup or gunicorn.
How do I kill this job based on id="create_recommendation_entries", given that I don't know any of its stats (port, pid, etc.)?
Rerunning the script creates a different thread and stops after Ctrl+C, but the previous one is still running.
I'm using Qt 5.6. I have code which will restart the application, but I also want to limit the number of instances.
The code that limits the instances works and so does the code which restarts the application, but with the limiting code enabled, the application will not restart. It closes down, but I'm guessing the restart is being blocked because, at the time it tries to launch the new instance, the PID of the original hasn't cleared.
The question is: how do I close the application whilst limiting the total number of instances to 1?
If this hasn't been solved by tomorrow I will post the code for restarting and limiting instances; I don't have it with me at the moment.
Code to restart the application:
qApp->quit();
QProcess::startDetached(qApp->arguments()[0], qApp->arguments());
These are just hints for the watchdog script:
1- You need to use QProcess::startDetached to run your script before quitting your app. This will allow the script process to live on after your app exits.
QProcess::startDetached( "bash", QStringList() << "-c" << terminalCommand );
2- You need to pass the current app PID to your watchdog script via terminalCommand.
To get the current app PID in Qt, use
qApp->applicationPid();
3- In your watchdog script, have an infinite loop that checks for the PID by doing
ps aux | grep -v 'grep' | grep $PID
Once the PID is dead, start your app again from the watchdog script.
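Putting the hints together, a minimal watchdog might look like this (written in Python for illustration; the argument order and the restart command are assumptions, and the same loop can be done in bash):

#!/usr/bin/env python
# watchdog.py <pid> <path-to-app> : restarts the app once the given PID is gone
import os
import subprocess
import sys
import time

def pid_alive(pid):
    try:
        os.kill(pid, 0)   # signal 0 only checks that the process exists
        return True
    except OSError:
        return False

pid = int(sys.argv[1])
app_path = sys.argv[2]

while pid_alive(pid):
    time.sleep(0.5)

subprocess.Popen([app_path])   # the old instance is gone, start a new one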
In my Django project, I use Celery and Rabbitmq to run tasks in background.
I am using celery beat scheduler to run periodic tasks.
How can I check if celery beat is up and running, programmatically?
Make a task that sends an HTTP request to a ping URL at regular intervals. When the URL is not pinged on time, the URL monitor will send you an alert.
import requests
from yourapp.celery_config import app

@app.task
def ping():
    print('[healthcheck] pinging alive status...')
    # healthchecks.io works for me:
    requests.post("https://hchk.io/6466681c-7708-4423-adf0-XXXXXXXXX")
This celery periodic task is scheduled to run every minute; if it doesn't hit the ping, your beat service is down* and the monitor will send you an email (or hit a webhook, so you can wire it to Zapier and get mobile push notifications as well).
celery -A yourapp.celery_config beat -S djcelery.schedulers.DatabaseScheduler
*Or overwhelmed. You should also track task saturation; this is a nightmare with Celery and should be detected and addressed properly. It happens frequently when the workers are busy with blocking tasks that need optimization.
Are you using upstart or supervisor or something else to run the celery workers + celery beat as background tasks? In production you should use one of them to run the celery workers + celery beat in the background.
The simplest way to check that celery beat is running: ps aux | grep -i '[c]elerybeat'. If you get a text string with a pid, it's running. You can also make the output of this command prettier: ps aux | grep -i '[c]elerybeat' | awk '{print $2}'. If you get a number, it's working; if you get nothing, it's not.
You can also check the celery workers' status: celery -A projectname status.
If you are interested in advanced celery monitoring, you can read the monitoring guide in the official documentation.
If you have daemonized celery following the tutorial in the celery docs, you can check whether it's running through
sudo /etc/init.d/celeryd status
sudo /etc/init.d/celerybeat status
You can use the return code of such commands in a Python module.
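For example, a minimal sketch (assuming the init scripts above are installed and sudo does not prompt for a password):

import subprocess

def celery_beat_running():
    # the init script exits with status 0 when the daemon is running
    return subprocess.call(["sudo", "/etc/init.d/celerybeat", "status"]) == 0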
You can probably look up supervisor.
It provides a celerybeat conf which logs everything related to beat in /var/log/celery/beat.log.
Another way of going about this is to use Flower. You can set it up for your server (make sure it's password protected); it becomes somewhat easier to notice in the GUI which tasks are being queued and when, thus verifying that your beat is running fine.
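For example, something along these lines starts it with basic auth enabled (app name and credentials are placeholders):
celery -A yourapp flower --basic_auth=user:password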
I have recently used a solution similar to what @panchicore suggested, for the same problem.
The problem in my workplace was an important system that works with celery beat. Once in a while, either due to a RabbitMQ outage or some connectivity issue between our servers and the RabbitMQ server, celery beat simply stopped triggering crons until it was restarted.
As we didn't have any tool handy to monitor keep-alive calls sent over HTTP, we used statsd for the same purpose. A counter is incremented on the statsd server every minute (by a celery task), and we set up email & Slack channel alerts on the Grafana metrics (no updates for 10 minutes == outage).
I understand it's not a purely programmatic approach, but no production-level monitoring/alerting is complete without a separate monitoring entity.
The programming part is as simple as it can be. A tiny celery task running every minute.
@periodic_task(run_every=timedelta(minutes=1))
def update_keep_alive():
    logger.info("running keep alive task")
    statsd.incr(statsd_tags.CELERY_BEAT_ALIVE)
A problem that I have faced with this approach is statsd packet loss over UDP, so use a TCP connection to statsd for this purpose if possible.
You can check whether the scheduler is running with the following command:
python manage.py celery worker --beat
While working on a project recently, I used this:
HEALTHCHECK CMD stat celerybeat.pid || exit 1
Essentially, the beat process writes a pid file under some location (usually the home location); all you have to do is stat it to check that the file is there.
Note: this worked while launching a standalone celery beat process in a Docker container.
The goal of liveness for celery beat/scheduler is to check whether celery beat/scheduler is able to send jobs to the message broker so that they can be picked up by the respective consumer. [Is it still working, or in a hung state?] The celery worker and the celery scheduler/beat may or may not be running in the same pod or instance.
To handle such scenarios, we can create a method update_scheduler_liveness decorated with @after_task_publish.connect, which will be called every time the scheduler successfully publishes a message/task to the message broker.
The method update_scheduler_liveness will write the current timestamp to a file every time a task is published successfully.
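A minimal sketch of such a handler, assuming the stat file name used in the probe below and a writable working directory:

import time
from celery.signals import after_task_publish

LIVENESS_FILE = "celery_beat_schedule_liveness.stat"  # path/name is an assumption

@after_task_publish.connect
def update_scheduler_liveness(sender=None, headers=None, body=None, **kwargs):
    # record the time of the last successful publish to the broker
    with open(LIVENESS_FILE, "w") as f:
        f.write(str(int(time.time())))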
In the liveness probe, we need to check the last-updated timestamp of the file, either using:
the stat --printf="%Y" celery_beat_schedule_liveness.stat command,
or we can explicitly read the file, extract the timestamp, and compare whether it is recent enough based on the liveness probe criteria.
In this approach, the finer the liveness criteria you need, the more frequently a job must be triggered by celery beat. So, for cases where the gap between jobs is quite large, a custom/dedicated liveness heartbeat job can be scheduled every 2-5 minutes and the consumer can simply process it. The @after_task_publish.connect decorator provides multiple arguments that can also be used to filter for the liveness-specific job.
If we don't want to go for the file-based approach, we can rely on a data source like Redis, with an instance-specific Redis key, implemented along the same lines.
I'm trying to set up a daily task for my Django application on Elastic Beanstalk. There doesn't appear to be an accepted way to set this up, as celery beat is the go-to solution for periodic tasks in Django, but isn't great for load-balanced environments.
I've seen some solutions doing things like setting up celery beat with leader_only=True, to only run one instance, but that leaves a single point of failure. I've seen other solutions that allow many instances of celery beat and use locks to make sure only one task goes through, but wouldn't this still eventually fail completely unless the failed instances were restarted? Another suggestion I've seen is to have a separate instance for running celery beat, but this would still be a problem unless it had some way of restarting itself if it failed.
Are there any decent solutions to this problem? I would much rather not have to babysit a scheduler, as it would be pretty easy to not notice that my task was not being run until a while later.
If you're using redis as your broker, look into installing RedBeat as the celery beat scheduler: https://github.com/sibson/redbeat
This scheduler uses locking in redis to make sure only a single beat instance is running. With this you can enable beat on each node's worker process and remove the use of leader_only=True.
celery worker -B -S redbeat.RedBeatScheduler
Let's say you have Worker A with beat lock and Worker B. If Worker A dies, Worker B will attempt to acquire the beat lock after a configurable amount of time.
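A rough configuration sketch, assuming app is your Celery application and redis is already the broker (URL and timeout values are placeholders; see the RedBeat docs for the exact settings):

# pip install celery-redbeat
app.conf.redbeat_redis_url = "redis://localhost:6379/1"  # where RedBeat keeps its schedule and lock
app.conf.redbeat_lock_timeout = 30                       # seconds before another beat instance may take over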
I would suggest making a management command that runs with cron.
Using this method, you have your full Django ORM, all methods, etc. to work with. Wrapping your script in a try/except, you have the option to log failures in any way that you wish - email notifications, external logging systems like Sentry, straight to the DB, etc.
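A minimal sketch of such a command (the command name and do_daily_work are placeholders):

# myapp/management/commands/daily_task.py
import logging
from django.core.management.base import BaseCommand

logger = logging.getLogger(__name__)

class Command(BaseCommand):
    help = "Daily batch job, run from cron"

    def handle(self, *args, **options):
        try:
            do_daily_work()   # placeholder for the actual job
        except Exception:
            logger.exception("daily_task failed")   # or notify Sentry, send email, etc.
            raise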
I use supervisord to run cron and it works well. It relies on time-tested tools that won't let you down.
Finally, using a database singleton to keep track of whether a batch job has been run or is currently running, in an environment where you have multiple load-balanced instances of Django, isn't bad practice, even if you feel a little icky about it. The DB is a very reliable means of telling you whether the job is being processed.
The one annoying thing about cron is that it doesn't import the environment variables you may need for Django. I solved this with a simple Python script.
It writes the crontab on startup, with the needed environment variables etc. included. This example is for Ubuntu on EBS but should be relevant elsewhere.
#!/usr/bin/env python
# run-cron.py
# sets environment variable crontab fragments and runs cron
import os
from subprocess import call
from master.settings import IS_AWS

# read django's needed environment variables and set them in the appropriate crontab fragment
eRDS_HOSTNAME = os.environ["RDS_HOSTNAME"]
eRDS_DB_NAME = os.environ["RDS_DB_NAME"]
eRDS_PASSWORD = os.environ["RDS_PASSWORD"]
eRDS_USERNAME = os.environ["RDS_USERNAME"]
try:
    eAWS_STAGING = os.environ["AWS_STAGING"]
except KeyError:
    eAWS_STAGING = None
try:
    eAWS_PRODUCTION = os.environ["AWS_PRODUCTION"]
except KeyError:
    eAWS_PRODUCTION = None
eRDS_PORT = os.environ["RDS_PORT"]

if IS_AWS:
    fto = '/etc/cron.d/stortrac-cron'
else:
    fto = 'test_cron_file'

with open(fto, 'w+') as file:
    file.write('# Auto-generated cron tab that imports needed variables and runs a python script')
    file.write('\nRDS_HOSTNAME=')
    file.write(eRDS_HOSTNAME)
    file.write('\nRDS_DB_NAME=')
    file.write(eRDS_DB_NAME)
    file.write('\nRDS_PASSWORD=')
    file.write(eRDS_PASSWORD)
    file.write('\nRDS_USERNAME=')
    file.write(eRDS_USERNAME)
    file.write('\nRDS_PORT=')
    file.write(eRDS_PORT)
    if eAWS_STAGING is not None:
        file.write('\nAWS_STAGING=')
        file.write(eAWS_STAGING)
    if eAWS_PRODUCTION is not None:
        file.write('\nAWS_PRODUCTION=')
        file.write(eAWS_PRODUCTION)
    file.write('\n')
    # Process queue of gobs
    file.write('\n*/8 * * * * root python /code/app/manage.py queue --process-queue')
    # Every 5 minutes, double-check thing is done
    file.write('\n*/5 * * * * root python /code/app/manage.py thing --done')
    # Every 4 hours, do this
    file.write('\n8 */4 * * * root python /code/app/manage.py process_this')
    # etc.
    file.write('\n3 */4 * * * root python /code/app/manage.py etc --silent')
    file.write('\n\n')

if IS_AWS:
    args = ["cron", "-f"]
    call(args)
And in supervisord.conf:
[program:cron]
command = python /my/directory/runcron.py
autostart = true
autorestart = false
We use Celery with our Django webapp to manage offline tasks; some of these tasks can run up to 120 seconds.
Whenever we make any code modifications, we need to restart Celery to have it reload the new Python code. Our current solution is to send a SIGTERM to the main Celery process (kill -s 15 `cat /var/run/celeryd.pid`), then to wait for it to die and restart it (python manage.py celeryd --pidfile=/var/run/celeryd.pid [...]).
Because of the long-running tasks, this usually means the shutdown will take a minute or two, during which no new tasks are processed, causing a noticeable delay to users currently on the site. I'm looking for a way to tell Celery to shutdown, but then immediately launch a new Celery instance to start running new tasks.
Things that didn't work:
Sending SIGHUP to the main process: this caused Celery to attempt to "restart," by doing a warm shutdown and then relaunching itself. Not only does this take a long time, it doesn't even work, because apparently the new process launches before the old one dies, so the new one complains ERROR: Pidfile (/var/run/celeryd.pid) already exists. Seems we're already running? (PID: 13214) and dies immediately. (This looks like a bug in Celery itself; I've let them know about it.)
Sending SIGTERM to the main process and then immediately launching a new instance: same issue with the Pidfile.
Disabling the Pidfile entirely: without it, we have no way of telling which of the 30 Celery processes is the main process that needs to be sent a SIGTERM when we want it to do a warm shutdown. We also have no reliable way to check if the main process is still alive.
celeryd has an --autoreload option. If it is enabled, the celery worker (main process) will detect changes in celery modules and restart all worker processes. In contrast to the SIGHUP signal, autoreload restarts each process independently when its currently executing task finishes. This means that while one worker process is restarting, the remaining processes can keep executing tasks.
http://celery.readthedocs.org/en/latest/userguide/workers.html#autoreloading
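For example, with Celery 3.x this would presumably look something like (project name is a placeholder):
celery -A yourproject worker --autoreload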
I've recently fixed the bug with SIGHUP: https://github.com/celery/celery/pull/662
rm *.pyc
This causes the updated tasks to be reloaded. I discovered this trick recently; I just hope there are no nasty side effects.
Well, you are using SIGHUP (1) for the warm shutdown of celery. I am not sure it actually causes a warm shutdown, but SIGINT (2) would. Try SIGINT in place of SIGHUP and then start celery manually in your script (I guess).
Can you launch it with a custom pid file name, possibly timestamped, and key off of that to know which PID to kill?
CELERYD_PID_FILE="/var/run/celery/%n_{timestamp}.pid"
I don't know the timestamp syntax, but maybe you do, or you can find it?
Then use the current system time to kill off any old pids and launch a new one?
A little late, but that can be fixed by deleting the file called celerybeat.pid.
Worked for me.
I think you can try this:
kill -s HUP `cat /var/run/celeryd.pid`
python manage.py celeryd --pidfile=/var/run/celeryd.pid
HUP may recycle every free worker while leaving the workers that are executing tasks running; those old workers will exit by themselves once their tasks finish. Then you can safely start a new celery main process and workers.
I've used this approach in our production environment and it seems safe so far. Hope this helps!