I feel a bit stupid for asking, but it doesn't appear to be in the documentation for RQ. I have a 'failed' queue with thousands of items in it and I want to clear it using the Django admin interface. The admin interface lists them and allows me to delete and re-queue them individually but I can't believe that I have to dive into the django shell to do it in bulk.
What have I missed?
The Queue class has an empty() method that can be accessed like:
import django_rq
q = django_rq.get_failed_queue()
q.empty()
However, in my tests, that only cleared the failed list key in Redis, not the job keys themselves. So your thousands of jobs would still occupy Redis memory. To prevent that from happening, you must remove the jobs individually:
import django_rq
q = django_rq.get_failed_queue()
while True:
    job = q.dequeue()
    if not job:
        break
    job.delete()  # Will delete the job key from Redis
As for having a button in the admin interface, you'd have to change the django-rq/templates/django-rq/jobs.html template, which extends admin/base_site.html and doesn't seem to leave any room for customization.
The redis-cli allows FLUSHDB, which is great for my local environment as I generate a bazillion jobs.
Once I have a working Django integration I will update this. Just adding my $0.02.
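For completeness, here is a rough Python equivalent from a Django shell, assuming the default django_rq connection. Note that this wipes the whole Redis database, not just RQ keys, so it is only sensible for throwaway local environments:
import django_rq

# WARNING: flushdb() deletes every key in the current Redis database,
# not just RQ data. Only use this on a local/dev Redis instance.
con = django_rq.get_connection('default')
con.flushdb()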
You can empty any queue by name using the following code sample:
import django_rq
queue = "default"
q = django_rq.get_queue(queue)
q.empty()
or even create a Django management command for that:
import django_rq
from django.core.management.base import BaseCommand


class Command(BaseCommand):
    def add_arguments(self, parser):
        parser.add_argument("-q", "--queue", type=str)

    def handle(self, *args, **options):
        q = django_rq.get_queue(options.get("queue"))
        q.empty()
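Assuming you save that as, for example, yourapp/management/commands/rq_empty.py (the command name is just the file name, so pick whatever you like), you can then empty a queue with:
python manage.py rq_empty --queue default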
As @augusto-men's method seems not to work anymore, here is another solution:
You can use the raw connection to delete the failed jobs. Just iterate over the rq:job keys and check each job's status.
from django_rq import get_connection
from rq.job import Job

# delete failed jobs
con = get_connection('default')
for key in con.keys('rq:job:*'):
    job_id = key.decode().replace('rq:job:', '')
    job = Job.fetch(job_id, connection=con)
    if job.get_status() == 'failed':
        con.delete(key)

con.delete('rq:failed:default')  # reset failed jobs registry
The other answers are outdated now that RQ has implemented registries.
You now need to do something like the following to loop through and delete failed jobs; this works for any other registry as well.
import django_rq
from rq.exceptions import NoSuchJobError
from rq.registry import FailedJobRegistry

failed_registry = FailedJobRegistry('default', connection=django_rq.get_connection())

for job_id in failed_registry.get_job_ids():
    try:
        failed_registry.remove(job_id, delete_job=True)
    except NoSuchJobError:
        # Failed jobs expire in the queue, so there's a chance
        # the job is already gone and remove() raises NoSuchJobError.
        pass
Source
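On recent RQ versions (roughly 1.0 and later) you can also reach the failed registry through the queue itself rather than building it by name; a small sketch under that assumption:
import django_rq

q = django_rq.get_queue('default')
registry = q.failed_job_registry  # same FailedJobRegistry as above, bound to this queue
for job_id in registry.get_job_ids():
    registry.remove(job_id, delete_job=True)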
You can empty a queue from the command line with:
rq empty [queue-name]
Running rq info will list all the queues.
Related
I'm scraping a page successfully and it returns a single unique item. I don't want to save the scraped item in the database or to a file. I need to get it inside a Django view.
My view is as follows:
def start_crawl(process_number, court):
    """
    Starts the crawler.

    Args:
        process_number (str): Process number to be found.
        court (str): Court of the process.
    """
    runner = CrawlerRunner(get_project_settings())
    results = list()

    def crawler_results(sender, parse_result, **kwargs):
        results.append(parse_result)

    dispatcher.connect(crawler_results, signal=signals.item_passed)
    process_info = runner.crawl(MySpider, process_number=process_number, court=court)
    return results
I followed this solution but the results list is always empty.
I read something about creating a custom middleware and getting the results in its process_spider_output method.
How can I get the desired result?
Thanks!
I managed to implement something like that in one of my projects. It is a mini-project and I was looking for a quick solution. You might need to modify it or add support for multi-threading etc. if you put it in a production environment.
Overview
I created an ItemPipeline that just adds the items to an InMemoryItemStore helper. Then, in my __main__ code, I wait for the crawler to finish and pop all the items out of the InMemoryItemStore. After that I can manipulate the items as I wish.
Code
items_store.py
Hacky in-memory store. It is not very elegant, but it got the job done for me. Modify and improve it if you wish. I've implemented it as a simple class with class-level state so I can import it anywhere in the project and use it without passing an instance around.
class InMemoryItemStore(object):

    __ITEM_STORE = None

    @classmethod
    def pop_items(cls):
        items = cls.__ITEM_STORE or []
        cls.__ITEM_STORE = None
        return items

    @classmethod
    def add_item(cls, item):
        if not cls.__ITEM_STORE:
            cls.__ITEM_STORE = []
        cls.__ITEM_STORE.append(item)
pipelines.py
This pipeline will store the objects in the in-memory store from the snippet above. All items are simply returned to keep the regular pipeline flow intact. If you don't want to pass some items down to the other pipelines, simply change process_item to not return those items.
from <your-project>.items_store import InMemoryItemStore


class StoreInMemoryPipeline(object):
    """Add items to the in-memory item store."""

    def process_item(self, item, spider):
        InMemoryItemStore.add_item(item)
        return item
settings.py
Now add StoreInMemoryPipeline to the scraper settings. If you change the process_item method above, make sure you set the proper priority here (by adjusting the 100 below).
ITEM_PIPELINES = {
    ...
    '<your-project-name>.pipelines.StoreInMemoryPipeline': 100,
    ...
}
main.py
This is where I tie all these things together. I clean the in-memory store, run the crawler, and fetch all the items.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from <your-project>.items_store import InMemoryItemStore
from <your-project>.spiders.your_spider import YourSpider


def get_crawler_items(**kwargs):
    InMemoryItemStore.pop_items()

    process = CrawlerProcess(get_project_settings())
    process.crawl(YourSpider, **kwargs)
    process.start()  # the script will block here until the crawling is finished
    process.stop()

    return InMemoryItemStore.pop_items()


if __name__ == "__main__":
    items = get_crawler_items()
If you really want to collect all the data in a "special" object, store the data in a separate pipeline, like the duplicates filter at https://doc.scrapy.org/en/latest/topics/item-pipeline.html#duplicates-filter, and in close_spider (https://doc.scrapy.org/en/latest/topics/item-pipeline.html?highlight=close_spider#close_spider) you open your Django object.
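A minimal sketch of that idea, with made-up names (ItemCollectorPipeline and the handle_scraped_items callback are placeholders, not part of Scrapy or the question's project):
class ItemCollectorPipeline(object):
    """Collect every scraped item in memory and hand the list over on close."""

    def __init__(self):
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes; pass the collected items
        # to whatever Django-side code needs them.
        handle_scraped_items(self.items)  # hypothetical callback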
I am currently developing a Django application based on django-tenants-schema. You don't need to look into the actual code of the module, but the idea is that it has a global setting for the current database connection defining which schema to use for the application tenant, e.g.
tenant = tenants_schema.get_tenant()
and for setting it:
tenants_schema.set_tenant(xxx)
For some of the tasks, I would like them to remember the global tenant that was selected when they were instantiated, e.g. in theory:
class AbstractTask(Task):

    def before_submit(self):
        '''
        Run this method before returning the task future.
        '''
        self.run_args['tenant'] = tenants_schema.get_tenant()

    def before_run(self):
        '''
        This method is run before the related .run() task method.
        '''
        tenants_schema.set_tenant(self.run_args['tenant'])
Is there an elegant way of doing it in celery?
Celery (as of 3.1) has signals you can hook into to do this. You can alter the kwargs that were passed in, and on the other side, undo your alterations before they're given to the actual task:
from celery import shared_task
from celery.signals import before_task_publish, task_prerun, task_postrun
from threading import local

current_tenant = local()


@before_task_publish.connect
def add_tenant_to_task(body=None, **unused):
    body['kwargs']['tenant_middleware.tenant'] = getattr(current_tenant, 'id', None)
    print('sending tenant: {t}'.format(t=current_tenant.id))


@task_prerun.connect
def extract_tenant_from_task(kwargs=None, **unused):
    tenant_id = kwargs.pop('tenant_middleware.tenant', None)
    current_tenant.id = tenant_id
    print('current_tenant.id set to {t}'.format(t=tenant_id))


@task_postrun.connect
def cleanup_tenant(**kwargs):
    current_tenant.id = None
    print('cleaned current_tenant.id')


@shared_task
def get_current_tenant():
    # Here is where you would do work that relied on current_tenant.id being set.
    import time
    time.sleep(1)
    return current_tenant.id
And if you run the task (not showing logging from the worker):
In [1]: current_tenant.id = 1234; ct = get_current_tenant.delay(); current_tenant.id = 5678; ct.get()
sending tenant: 1234
Out[1]: 1234
In [2]: current_tenant.id
Out[2]: 5678
The signals are not called if no message is sent (when you call the task function directly, without delay() or apply_async()). If you want to filter on the task name, it is available as body['task'] in the before_task_publish signal handler, and the task object itself is available in the task_prerun and task_postrun handlers.
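For example, to restrict this to a single task you could check the name in the publish handler; a small sketch reusing the names above ('myapp.tasks.get_current_tenant' is just a placeholder for your task's fully qualified name):
@before_task_publish.connect
def add_tenant_to_selected_task(body=None, **unused):
    # body['task'] holds the fully qualified task name
    if body.get('task') != 'myapp.tasks.get_current_tenant':
        return
    body['kwargs']['tenant_middleware.tenant'] = getattr(current_tenant, 'id', None)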
I am a Celery newbie, so I can't really tell if this is the "blessed" way of doing "middleware"-type stuff in Celery, but I think it will work for me.
I'm not sure what you mean here; is before_submit executed before the task is called by the client?
In that case I would rather use a with statement here:
from contextlib import contextmanager


@contextmanager
def set_tenant_db(tenant):
    prev_tenant = tenants_schema.get_tenant()
    try:
        tenants_schema.set_tenant(tenant)
        yield
    finally:
        tenants_schema.set_tenant(prev_tenant)


@app.task
def tenant_task(tenant=None):
    with set_tenant_db(tenant):
        do_actions_here()


tenant_task.delay(tenant=tenants_schema.get_tenant())
You can of course create a base task that does this automatically (you could apply the context in Task.__call__, for example), but I'm not sure that saves you much over just using the with statement explicitly.
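If you do go the base-task route, a rough, untested sketch of applying the context in Task.__call__ might look like this; it reuses the set_tenant_db context manager above and assumes the same app object:
class TenantTask(app.Task):
    abstract = True

    def __call__(self, *args, **kwargs):
        # Pop the tenant that was attached at submit time and restore it
        # around the actual task body.
        tenant = kwargs.pop('tenant', None)
        with set_tenant_db(tenant):
            return super(TenantTask, self).__call__(*args, **kwargs)


@app.task(base=TenantTask)
def tenant_task(**kwargs):
    do_actions_here()


tenant_task.delay(tenant=tenants_schema.get_tenant())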
I'm using Django and Celery and I'm trying to setup routing to multiple queues. When I specify a task's routing_key and exchange (either in the task decorator or using apply_async()), the task isn't added to the broker (which is Kombu connecting to my MySQL database).
If I specify the queue name in the task decorator (which will mean the routing key is ignored), the task works fine. It appears to be a problem with the routing/exchange setup.
Any idea what the problem could be?
Here's the setup:
settings.py
INSTALLED_APPS = (
    ...
    'kombu.transport.django',
    'djcelery',
)

BROKER_BACKEND = 'django'

CELERY_DEFAULT_QUEUE = 'default'
CELERY_DEFAULT_EXCHANGE = "tasks"
CELERY_DEFAULT_EXCHANGE_TYPE = "topic"
CELERY_DEFAULT_ROUTING_KEY = "task.default"

CELERY_QUEUES = {
    'default': {
        'binding_key': 'task.#',
    },
    'i_tasks': {
        'binding_key': 'important_task.#',
    },
}
tasks.py
from celery.task import task


@task(routing_key='important_task.update')
def my_important_task():
    try:
        ...
    except Exception as exc:
        my_important_task.retry(exc=exc)
Initiate task:
from tasks import my_important_task
my_important_task.delay()
You are using the Django ORM as a broker, which means declarations are only stored in memory
(see the, inarguably hard to find, transport comparison table at http://readthedocs.org/docs/kombu/en/latest/introduction.html#transport-comparison).
So when you apply this task with routing_key important_task.update, it will not be able
to route it, because the queue hasn't been declared yet.
It will work if you do this:
#task(queue="i_tasks", routing_key="important_tasks.update")
def important_task():
print("IMPORTANT")
But it would be much simpler for you to use the automatic routing feature,
since there's nothing here that shows you need a 'topic' exchange.
To use automatic routing, simply remove these settings:
CELERY_DEFAULT_QUEUE,
CELERY_DEFAULT_EXCHANGE,
CELERY_DEFAULT_EXCHANGE_TYPE,
CELERY_DEFAULT_ROUTING_KEY,
CELERY_QUEUES
And declare your task like this:
#task(queue="important")
def important_task():
return "IMPORTANT"
and then to start a worker consuming from that queue:
$ python manage.py celeryd -l info -Q important
or to consume from both the default (celery) queue and the important queue:
$ python manage.py celeryd -l info -Q celery,important
Another good practice is to not hardcode the queue names into the
task and use CELERY_ROUTES instead:
@task
def important_task():
    return "DEFAULT"
then in your settings:
CELERY_ROUTES = {"myapp.tasks.important_task": {"queue": "important"}}
If you still insist on using topic exchanges then you could
add this router to automatically declare all queues the first time
a task is sent:
class PredeclareRouter(object):
    setup = False

    def route_for_task(self, *args, **kwargs):
        if self.setup:
            return
        self.setup = True
        from celery import current_app, VERSION as celery_version
        # will not connect anywhere when using the Django transport
        # because declarations happen in memory.
        with current_app.broker_connection() as conn:
            queues = current_app.amqp.queues
            channel = conn.default_channel
            if celery_version >= (2, 6):
                for queue in queues.itervalues():
                    queue(channel).declare()
            else:
                from kombu.common import entry_to_queue
                for name, opts in queues.iteritems():
                    entry_to_queue(name, **opts)(channel).declare()


CELERY_ROUTES = (PredeclareRouter(), )
I have to run tasks on approximately 150k Django objects. What is the best way to do this? I am using the Django ORM as the broker. The database backend is MySQL and it chokes and dies during the task.delay() of all the tasks. Relatedly, I also wanted to kick this off from the submission of a form, but the resulting request produced a very long response time that timed out.
I would also consider using something other than the database as the "broker"; it really isn't suitable for this kind of work.
Still, you can move some of this overhead out of the request/response cycle by launching a task to create the other tasks:
from celery.task import TaskSet, task

from myapp.models import MyModel


@task
def process_object(pk):
    obj = MyModel.objects.get(pk=pk)
    # do something with obj


@task
def process_lots_of_items(ids_to_process):
    return TaskSet(process_object.subtask((id, ))
                       for id in ids_to_process).apply_async()
Also, since you probably don't have 150,000 processors to process all of these objects
in parallel, you could split them into chunks of, say, 100 or 1000:
from itertools import islice

from celery.task import TaskSet, task

from myapp.models import MyModel


def chunks(it, n):
    for first in it:
        yield [first] + list(islice(it, n - 1))


@task
def process_chunk(pks):
    objs = MyModel.objects.filter(pk__in=pks)
    for obj in objs:
        # do something with obj
        pass


@task
def process_lots_of_items(ids_to_process):
    return TaskSet(process_chunk.subtask((chunk, ))
                       for chunk in chunks(iter(ids_to_process),
                                           1000)).apply_async()
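As a side note, newer Celery versions (3.x and later, where TaskSet is deprecated in favour of groups) ship a built-in chunks helper that does roughly the same splitting; a sketch, assuming the process_object task above:
# Each element of the iterable is an argument tuple for process_object;
# Celery groups them into chunks of 1000 and queues one task per chunk.
process_object.chunks(((pk,) for pk in ids_to_process), 1000)()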
Try using RabbitMQ instead.
RabbitMQ is used by a lot of bigger companies and people really rely on it, since it's such a great broker.
Here is a great tutorial on how to get started with it.
I use beanstalkd (http://kr.github.com/beanstalkd/) as the engine. Adding a worker and a task is pretty straightforward for Django if you use django-beanstalkd: https://github.com/jonasvp/django-beanstalkd/
It's very reliable for my usage.
Example of a worker:
import os
import time

from django_beanstalkd import beanstalk_job


@beanstalk_job
def background_counting(arg):
    """
    Do some incredibly useful counting to the value of arg
    """
    value = int(arg)
    pid = os.getpid()
    print("[%s] Counting from 1 to %d." % (pid, value))
    for i in range(1, value + 1):
        print('[%s] %d' % (pid, i))
        time.sleep(1)
To launch a job/worker/task:
from django_beanstalkd import BeanstalkClient
client = BeanstalkClient()
client.call('beanstalk_example.background_counting', '5')
(source extracted from the example app of django-beanstalkd)
Enjoy!
I'm running multiple simulations as tasks through celery (version 2.3.2) from django. The simulations get set up by another task:
In views.py:
result = setup_simulations.delay(parameters)
request.session['sim'] = result.task_id # Store main task id
In tasks.py:
@task(priority=1)
def setup_simulations(parameters):
    task_ids = []
    for i in range(number_of_simulations):
        result = run_simulation.delay(other_parameters)
        task_ids.append(result.task_id)
    return task_ids
After the initial task (setup_simulations) has finished, I try to revoke the simulation tasks as follows:
main_task_id = request.session['sim']
main_result = AsyncResult(main_task_id)

# Revoke sub tasks
from celery.task.control import revoke

for sub_task_id in main_result.get():
    sub_result = AsyncResult(sub_task_id)
    sub_result.revoke()    # Does not work
    # revoke(sub_task_id)  # Does not work either
When I look at the output from "python manage.py celeryd -l info", the tasks get executed as if nothing had happened. Does anybody have an idea of what could have gone wrong?
As you mention in the comment, revoke is a remote control command so it's only currently supported by the amqp and redis transports.
You can accomplish this yourself by storing a revoked flag in your database, e.g.:
from celery import states
from celery import task
from celery.exceptions import Ignore

from myapp.models import RevokedTasks


@task
def foo():
    if RevokedTasks.objects.filter(task_id=foo.request.id).count():
        if not foo.ignore_result:
            foo.update_state(state=states.REVOKED)
        raise Ignore()
If your task is working on some model you could even store a flag in that.
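For example, with a hypothetical Simulation model that has a revoked BooleanField, the simulation task from the question could simply bail out early:
from myapp.models import Simulation  # hypothetical model with a `revoked` flag


@task
def run_simulation(simulation_id, other_parameters):
    simulation = Simulation.objects.get(pk=simulation_id)
    if simulation.revoked:
        return  # skip the work entirely, effectively "revoking" the task
    # ... run the actual simulation ...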