Celery with Django and MongoDB (mongoengine)

1) I am trying to build an application using Celery (with RabbitMQ as the broker) and Django, using MongoDB (mongoengine) as the database for the model. The requests received by the web server will be transformed into tasks and queued with the help of Celery to be executed by the workers.
I followed the following tutorials:
http://docs.celeryproject.org/en/master/django/first-steps-with-django.html#configuring-your-django-project-to-use-celery
and
https://mongoengine-odm.readthedocs.org/en/latest/django.html
but I still get the following error:
ImproperlyConfigured: settings.DATABASES is improperly configured. Please supply the ENGINE value.
As mentioned in both tutorials, settings.DATABASES should be commented out and replaced with only
mongoengine.connect('myDB')
and yet the error is exactly about not having DATABASES configured.
(Apart from that, I have not configured any result backend for Celery.)
Can anybody advise me on what I have to set and where?
2) And another question: projects involving only Celery always have a Celery instance, but in the tutorials about building web applications with Django and Celery I haven't seen any mention of one. Do I have to explicitly instantiate Celery, or is this done somewhere else by default?

1) In case anybody is interested in the answer: I finally managed to get it working, but I am not sure I understood correctly what happened.
Apparently the problem was that I hadn't set the result backend for Celery. I got rid of the error as soon as I put the following line in settings.py:
CELERY_RESULT_BACKEND = "amqp"
2) My project (I am using djcelery) is working without me explicitly instantiating Celery. I assume this is done behind the scenes by the framework.

What happens is that Celery attempts to use Django's default database (the one defined in settings.DATABASES) as its result database, but to use mongoengine as your primary Django database you have to bypass settings.DATABASES.
So just make sure you define both BROKER_URL and CELERY_RESULT_BACKEND properly so that Celery doesn't try to consult settings.DATABASES. I guess you want them to point at the same broker, but you could choose to keep them separate.
BROKER_URL = "amqp://guest:guest#localhost:5672//"
CELERY_RESULT_BACKEND = "amqp"
For other backends, consult this.
Part 2 of your question.
Do you have CELERY_ALWAYS_EAGER = True in your settings.py? With that setting, tasks are executed locally and synchronously, so a separate Celery worker process doesn't need to be launched. But do not use this in production. See this question.
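For reference, here is a minimal settings.py sketch of how the pieces above can fit together (assumptions on my part: RabbitMQ running locally with the default guest account, a MongoDB database named myDB, and djcelery installed as in the linked tutorials):
import djcelery
import mongoengine

# Connect mongoengine directly; settings.DATABASES stays commented out.
mongoengine.connect('myDB')

# Let django-celery (djcelery) configure the Celery app from these settings.
djcelery.setup_loader()

BROKER_URL = "amqp://guest:guest@localhost:5672//"
CELERY_RESULT_BACKEND = "amqp"  # keeps Celery away from settings.DATABASES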

Related

Shutting down a plotly-dash server

This is a follow-up to this question: How to stop flask application without using ctrl-c. The problem is that I didn't understand some of the terminology in the accepted answer, since I'm totally new to this.
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash()

app.layout = html.Div(children=[
    html.H1(children='Dash Tutorials'),
    dcc.Graph()
])

if __name__ == '__main__':
    app.run_server(debug=True)
How do I shut this down? My end goal is to run a plotly dashboard on a remote machine, but I'm testing it out on my local machine first.
I guess I'm supposed to "expose an endpoint" (have no idea what that means) via:
from flask import request

def shutdown_server():
    func = request.environ.get('werkzeug.server.shutdown')
    if func is None:
        raise RuntimeError('Not running with the Werkzeug Server')
    func()

@app.route('/shutdown', methods=['POST'])
def shutdown():
    shutdown_server()
    return 'Server shutting down...'
Where do I include the above code? Is it supposed to be included in the first block of code that I showed (i.e. the code that contains app.run_server command)? Is it supposed to be separate? And then what are the exact steps I need to take to shut down the server when I want?
Finally, are the steps to shut down the server the same whether I run the server on a local or remote machine?
Would really appreciate help!
The method in the linked answer, werkzeug.server.shutdown, only works with the development server. Creating a view function with an assigned URL ("exposing an endpoint") to trigger that shutdown function is just a convenience, and it won't work once the app is deployed behind a WSGI server like gunicorn.
Maybe that creates more questions than it answers:
I suggest familiarising yourself with Flask's wsgi-standalone deployment docs.
And then probably the gunicorn deployment guide. The monitoring section has a number of examples of service monitors which you can use with gunicorn, allowing you to run the app in the background, start it on reboot, etc.
Ultimately, starting and stopping the WSGI server is the responsibility of the service monitor and logic to do this probably shouldn't be coded into your app.
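As a rough illustration of that deployment model, here is a sketch of what the app module could look like so a WSGI server can import it (assuming the Dash code from the question lives in a file called app.py; the names are illustrative, not prescriptive):
import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash()
app.layout = html.Div(children=[
    html.H1(children='Dash Tutorials'),
    dcc.Graph()
])

# Expose the underlying Flask instance; a WSGI server serves this object,
# e.g. `gunicorn app:server`, and the service monitor starts/stops it.
server = app.server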
What works in both cases of
app.run_server(debug=True)
and
app.run_server(debug=False)
anywhere in the code is:
os.kill(os.getpid(), signal.SIGTERM)
(don't forget to import os and signal)
SIGTERM should cause a clean exit of the application.
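Putting the two answers together, a sketch of a shutdown endpoint that relies on SIGTERM instead of werkzeug.server.shutdown might look like this (assuming the Dash app from the question; the route name and port are just the defaults, and the HTTP response may not arrive before the process exits):
import os
import signal

import dash
import dash_core_components as dcc
import dash_html_components as html

app = dash.Dash()
app.layout = html.Div(children=[
    html.H1(children='Dash Tutorials'),
    dcc.Graph()
])

# Register the route on the Flask server underlying the Dash app.
@app.server.route('/shutdown', methods=['POST'])
def shutdown():
    # Works with debug=True and debug=False alike.
    os.kill(os.getpid(), signal.SIGTERM)
    return 'Server shutting down...'

if __name__ == '__main__':
    app.run_server(debug=True)
From another terminal the server could then be stopped with, for example, curl -X POST http://127.0.0.1:8050/shutdown.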

on heroku, celery beat database scheduler doesn’t run periodic tasks

I have an issue where django_celery_beat's DatabaseScheduler doesn't run periodic tasks. Or I should say: celery beat doesn't find any tasks when the scheduler is DatabaseScheduler. If I use the standard scheduler instead, the tasks are executed regularly.
I set up celery on heroku using one dyno for the worker and one for beat (and one for web, obviously).
I know that beat and worker are connected to redis and to postgres for task results.
Every periodic task I run from django admin by selecting a task and "run selected task" gets executed.
However, I have spent about two days trying to figure out why beat/worker never notices that I scheduled a task to execute every 10 seconds, or on a cron schedule (even restarting the beat and worker dynos doesn't change anything).
I’m kind of desperate, and my next move would be to give redbeat a try.
Any help on how to troubleshoot this particular problem would be greatly appreciated. I suspect the problem is in the is_due method. I am using UTC (in celery and django), and all crons are UTC based. All I see in the beat log is "writing entries.." every now and then.
I've tried changing the celery version from 4.3 to 4.4 and django-celery-beat from 1.4.0 to 1.5.0 to 1.6.0.
Any help would be greatly appreciated.
In case it helps someone who's having (or will have) a similar problem to ours: to recreate this issue, it is possible to create a task as simple as
@app.task(bind=True)
def test(self, arg):
    print(arg)
then, in django admin, open the periodic task for editing and put something in the keyword arguments field. Or, vice versa, the task could be
@app.task(bind=True)
def test(self, **kwargs):
    print(kwargs.get("notification_id"))
And try to pass a positional argument. While this breaks locally, on Heroku's beat and worker dynos it somehow slips by unnoticed, and django_celery_beat stops processing any task whatsoever from then on. The scheduler is completely broken by a "wrong" task.
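As a purely illustrative aside (not part of the original answer), a more defensive signature for the same hypothetical test task avoids the TypeError in the first place, so a mistyped argument in the admin cannot wedge the scheduler:
@app.task(bind=True)
def test(self, *args, **kwargs):
    # Accept any combination of positional and keyword arguments entered in
    # django admin; a mismatch can no longer raise at call time.
    print(args, kwargs.get("notification_id"))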

How to achieve the objective below?

I am using celery with Django. Redis is my broker. I am serving my Django app via Apache and WSGI, and I am running celery under supervisor. I am starting a celery task named run_forever from the wsgi.py file of my Django project. My intention was to start a celery task when Django starts up and have it run forever in the background (I don't know if this is the right way to achieve that; I searched but couldn't find an appropriate implementation, so if you have a better idea, kindly share). It is working as expected. Now, due to a certain issue, I have added the maximum-requests=250 parameter to the Apache virtual host, so after every 250 requests the WSGI process is restarted.
Every time it restarts, a new run_forever task is created and run in the background. Eventually, by the time the server has received 1000 requests, the WSGI process will have restarted 4 times and I end up with 4 copies of the run_forever task. I only want one copy of the task running at any point in time, so I would like to kill all currently running run_forever tasks every time Django starts.
I have tried
from project.celery import app
from project.tasks import run_forever
app.control.purge()
run_forever.delay()
in wsgi.py to kill all the running tasks before starting run_forever, but it didn't work.
I have to agree with Dave Smith here--why do you have a task that runs forever? That said, to the extent that you want to safeguard the task from running twice, there are multiple strategies you can use. The easiest to implement is a database entry (databases can be transactional, and if you're using django, presumably you are using at least one database). n.b., in the code snippet below, I did not put the model in the right place to be picked up by a migration--I just put everything in the same snippet for ease of discussion.
import time

from django.db import models

from myapp.celery import app


class CeleryGuard(models.Model):
    task_name = models.CharField(max_length=32)
    task_id = models.CharField(max_length=36)  # celery task ids are 36-char uuids


@app.task(bind=True)
def run_forever(self):
    obj, created = CeleryGuard.objects.get_or_create(
        task_name='run_forever', defaults={
            'task_id': self.request.id
        })
    if not created:
        # another run_forever has already registered itself, so bail out
        return
    try:
        # do whatever you want to here
        while True:
            print('I am doing nothing')
            time.sleep(1440)
    finally:
        # make sure to clean up after you are done
        CeleryGuard.objects.filter(task_name='run_forever').delete()
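Since Redis is already in the stack as the broker, an alternative sketch (not from the answer above) is to guard the task with a Redis lock instead of a Django model; this assumes redis-py is installed and Redis is reachable on localhost:6379/0, so adjust the connection to match your broker settings:
import time

import redis

from project.celery import app

r = redis.Redis(host='localhost', port=6379, db=0)

@app.task(bind=True)
def run_forever(self):
    # SET with nx=True succeeds only if no other copy holds the lock.
    if not r.set('run_forever_lock', self.request.id, nx=True, ex=3600):
        return  # another copy of the task is already running
    try:
        while True:
            print('I am doing nothing')
            time.sleep(1440)
            r.expire('run_forever_lock', 3600)  # keep the lock from expiring
    finally:
        r.delete('run_forever_lock')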

Multiple celery server but same redis broker executing task twice

When I run celery -A proj inspect active_queues I see two servers showing the queues they are listening to, and both point to the same default queue named celery. Still, a task issued by the django app gets executed twice, once by each celery server.
I can see the transport type is also direct - the default one.
On my local setup the task gets executed once, so I am sure that the task is called only once by my django app.
What can I be missing here?
OK, I looked up the docs; I think you need to set celerybeat-scheduler in your settings.py, which makes sure tasks are scheduled by a single scheduler.
http://celery.readthedocs.org/en/latest/configuration.html#celerybeat-scheduler
On Redis you can set which database each application uses; giving each app its own database keeps their data separate.
If you are using Django the configuration is
CELERY_BROKER_VHOST = {number of the database}
If you are not using Django, I believe the configuration is CELERY_REDIS_DB or redis_db, depending on your celery version.
For instance for your first application could be CELERY_BROKER_VHOST = 1
For the second application could be CELERY_BROKER_VHOST = 2
and for your local development could be CELERY_BROKER_VHOST = 99
http://docs.celeryproject.org/en/latest/userguide/configuration.html#id8
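For reference, a sketch of how this can look in a Django settings.py (assuming Celery 4.x reading its configuration from Django settings; the trailing number in the URL selects the Redis database):
# settings.py
CELERY_BROKER_URL = "redis://localhost:6379/1"    # first application
# CELERY_BROKER_URL = "redis://localhost:6379/2"  # second application
# CELERY_BROKER_URL = "redis://localhost:6379/99" # local development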

Using phantomjs for dynamic content with scrapy and selenium possible race condition

First off, this is a follow up question from here: Change number of running spiders scrapyd
I used phantomjs and selenium to create a downloader middleware for my scrapy project. It works well and hasn't really slowed things down when I run my spiders one at a time locally.
But just recently I put a scrapyd server up on AWS. I noticed a possible race condition that seems to be causing errors and performance issues when more than one spider is running at once. I feel like the problem stems from two separate issues.
1) Spiders trying to use phantomjs executable at the same time.
2) Spiders trying to log to phantomjs's ghostdriver log file at the same time.
Guessing here, but the performance issue may be the spiders waiting until the resources become available (it could also be related to the fact that I had a race condition for an sqlite database as well).
Here are the errors I get:
exceptions.IOError: [Errno 13] Permission denied: 'ghostdriver.log' (log file race condition?)
selenium.common.exceptions.WebDriverException: Message: 'Can not connect to GhostDriver' (executable race condition?)
My questions are:
Does my analysis of what the problem(s) are seem correct?
Are there any known solutions to this problem other than limiting the number of spiders that can be run at a time?
Is there some other way I should be handling javascript? (if you think I should create an entirely new question to discuss the best way to handle javascript with scrapy let me know and I will)
Here is my downloader middleware:
from sys import platform as _platform

from scrapy.http import HtmlResponse
from selenium import webdriver

# `settings` (holding PHANTOM_JS_PATH) and the check_spider_middleware
# decorator are defined elsewhere in this project.

class JsDownload(object):

    @check_spider_middleware
    def process_request(self, request, spider):
        if _platform == "linux" or _platform == "linux2":
            driver = webdriver.PhantomJS(service_log_path='/var/log/scrapyd/ghost.log')
        else:
            driver = webdriver.PhantomJS(executable_path=settings.PHANTOM_JS_PATH)
        driver.get(request.url)
        return HtmlResponse(request.url, encoding='utf-8', body=driver.page_source.encode('utf-8'))
note: the _platform code is a temporary workaround until I get this source code deployed into a static environment.
I found solutions on SO for the javascript problem, but they were spider-based. That bothered me because it meant every request had to be made twice: once in the downloader handler and again in the spider. That is why I decided to implement mine as a downloader middleware.
Try using webdriver to interface with phantomjs:
https://github.com/brandicted/scrapy-webdriver
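As a side note on the log-file collision described in the question, one hedged workaround (an assumption on my part, not something from the linked project) is to give each spider process its own ghostdriver log path, for example keyed by PID:
import os

from selenium import webdriver

# Each scrapyd process writes to its own log file, so concurrent spiders
# never fight over 'ghostdriver.log'. The directory is only an example.
log_path = '/var/log/scrapyd/ghostdriver-%d.log' % os.getpid()
driver = webdriver.PhantomJS(service_log_path=log_path)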