Getting celery group results - python-2.7

So this is what I want to do. I have a scheduled task that runs every X minutes. Inside that task I create a group of tasks that I want to run in parallel, and after they all finish I want to log whether the group finished successfully or not. This is my code:
@shared_task(base=HandlersImplTenantTask, acks_late=True)
def my_scheduled_task():
    try:
        needed_ids = MyModel.objects.filter(some_field=False)\
            .filter(some_other_field=True)\
            .values_list("id", flat=True)\
            .order_by("id")
        if needed_ids:
            tasks = [my_single_task.s(needed_id=id) for id in needed_ids]
            job = group(tasks)
            result = job.apply_async()
            returned_values = result.get()
            if result.ready():
                if result.successful():
                    logger.info("SUCCESSFULLY FINISHED ALL THE SUBTASKS")
                else:
                    returned_values = result.get()
                    logger.info("UNSUCCESSFULLY FINISHED ALL THE SUBTASKS WITH THE RESULTS %s" % returned_values)
        else:
            logger.info("no needed ids found")
    except:
        logger.exception("got an unexpected exception while running task")
This is my_single_task code:
@shared_task(base=HandlersImplTenantTask)
def my_single_task(needed_id):
    logger.info("starting task for tenant: [%s]. got id [%s]", connection.tenant, needed_id)
    return
This is how I run celery:
manage.py celery worker -c 2 --broker=[my rabbitmq broker url]
When I get to the result.get() line, it hangs. I see a single log entry from the single task with the first id, but I don't see the others. When I kill my celery process and restart it, it re-runs the scheduled task and I see the second log entry with the second id (from the first time the task ran). Any ideas on how to fix this?
EDIT - to try and overcome this, I created a different queue called 'new_queue' and started a different celery worker to listen to it. I want the other worker to take these tasks and work on them, which I think could solve the deadlock.
I have changed my code to look like this:
job = group(tasks)
job_result = job.apply_async(queue='new_queue')
results = job_result.get()
but I still get a deadlock, and if I remove the results = job_result.get() line I can see that the tasks are worked on by the main worker and nothing is published to the new_queue queue. Any thoughts?
This is my celery configuration:
tenant_celery_app.conf.update(
    CELERY_RESULT_BACKEND='djcelery.backends.database.DatabaseBackend',
    CELERY_RESULT_DB_TABLENAMES={
        'task': 'tenantapp_taskmeta',
        'group': 'tenantapp_groupmeta',
    })
This is how I run the workers:
celery worker -c 1 -Q new_queue --broker=[amqp_broker_url]/[vhost]
celery worker -c 1 --broker=[amqp_broker_url]/[vhost]

So the solution I was looking for was indeed to create a new queue and start a new worker that processes it. The only issue I had was sending the group's tasks to the new queue. This is the code that worked for me.
tasks = [my_single_task.s(needed_id=id).set(queue='new_queue') for id in needed_ids]
job = group(tasks)
job_result = job.apply_async()
results = job_result.get()  # this will block until the tasks finish, but it won't deadlock
And these are my celery workers:
celery worker -c 1 -Q new_queue --broker=[amqp_broker_url]/[vhost]
celery worker -c 1 --broker=[amqp_broker_url]/[vhost]

You seem to be deadlocking your queue. Think about it: if you have a task that waits on other tasks and the queue fills up, the first task will hang forever.
You need to refactor your code to avoid calling result.get() inside a task (you probably already have warnings about this in your logs).
I would recommend this:
@shared_task(base=HandlersImplTenantTask, acks_late=True)
def my_scheduled_task():
    needed_ids = MyModel.objects.filter(some_field=False)\
        .filter(some_other_field=True)\
        .values_list("id", flat=True)\
        .order_by("id")
    if needed_ids:
        tasks = [my_single_task.s(needed_id=id) for id in needed_ids]
        job = group(tasks)
        result = job.apply_async()
That's all you need.
Use logging to track if tasks fail.
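If you still want a single log entry once the whole group has finished, a non-blocking alternative (a sketch, not part of the original recommendation) is a chord, where a callback task does the logging instead of the scheduling task blocking on result.get(). my_single_task and logger are assumed from the question, and a chord needs a result backend (you already have one configured):
from celery import chord, shared_task

@shared_task
def log_group_outcome(results):
    # `results` is the list of return values from all subtasks in the header.
    logger.info("SUCCESSFULLY FINISHED ALL THE SUBTASKS WITH THE RESULTS %s" % results)

def schedule_with_callback(needed_ids):
    header = [my_single_task.s(needed_id=id) for id in needed_ids]
    # The callback runs only after every header task has completed.
    chord(header)(log_group_outcome.s())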
If code elsewhere in your application needs to track whether the jobs fail or not, you can use celery's inspect API.
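For example, a minimal sketch of the inspect API (assuming tenant_celery_app is the Celery app instance from your configuration):
insp = tenant_celery_app.control.inspect()
print insp.active()     # tasks currently being executed, keyed by worker name
print insp.scheduled()  # tasks with an eta/countdown that have not started yet
print insp.reserved()   # tasks prefetched by workers but not yet running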

Related

Initializing Different Celery Workers with Different Values

I am using celery to run long running tasks on Hadoop. Each task executes a Pig script on Hadoop which runs for about 30 mins - 2 hours.
My current Hadoop setup has 4 queues a,b,c, and default. All tasks are currently being executed by a single worker which submits the job to a single queue.
I want to add 3 more workers which submit jobs to other queues, one worker per queue.
The problem is the queue is currently hard-coded and I wish to make this variable per worker.
I searched a lot but I am unable to find a way to pass each celery worker a different queue value and access it in my task.
I start my celery worker like so.
celery -A app.celery worker
I wish to pass some additional arguments in the command-line itself and access it in my task but celery complains that it doesn't understand my custom argument.
I plan to run all the workers on the same host by setting the --concurrency=3 parameter. Is there any solution to this problem?
Thanks!
EDIT
The current scenario is this: every time I execute the task print_something by calling tasks.print_something.delay(), it only prints "C".
@celery.task()
def print_something():
    print "C"
I need the workers to print a variable letter based on the value I pass to them when starting them.
@celery.task()
def print_something():
    print "<Variable Value Per Worker Here>"
Hope this helps someone.
Multiple problems needed solving here.
The first step involved adding support in celery for the custom parameter. If this is not done, celery will complain that it doesn't understand the parameter.
Since I am running celery with Flask, I initialize celery like so.
from celery import Celery

def configure_celery():
    app.config.update(
        CELERY_BROKER_URL='amqp://:@localhost:5672',
        RESULT_BACKEND='db+mysql://root:@localhost:3306/<database_name>'
    )
    celery = Celery(app.import_name, backend=app.config['RESULT_BACKEND'],
                    broker=app.config['CELERY_BROKER_URL'])
    celery.conf.update(app.config)
    TaskBase = celery.Task

    class ContextTask(TaskBase):
        abstract = True

        def __call__(self, *args, **kwargs):
            # Run every task inside the Flask application context.
            with app.app_context():
                return TaskBase.__call__(self, *args, **kwargs)

    celery.Task = ContextTask
    return celery
I call this function to initialize celery and store it in a variable called celery.
celery = configure_celery()
To add the custom parameter you need to do the following.
def add_hadoop_queue_argument_to_worker(parser):
    parser.add_argument(
        '--hadoop-queue', help='Hadoop queue to be used by the worker'
    )
The celery instance used below is the one obtained from the steps above.
celery.user_options['worker'].add(add_hadoop_queue_argument_to_worker)
The next step is to make this argument accessible in the worker. To do that, follow these steps.
from celery import bootsteps

class HadoopCustomWorkerStep(bootsteps.StartStopStep):
    def __init__(self, worker, **kwargs):
        # Stash the command-line value on the app so tasks can read it later.
        worker.app.hadoop_queue = kwargs['hadoop_queue']
Inform celery to use this class for creating the workers.
celery.steps['worker'].add(HadoopCustomWorkerStep)
The tasks should now be able to access the variables.
@app.task(bind=True)
def print_hadoop_queue_from_config(self):
    print self.app.hadoop_queue
Verify it by running the worker on the command-line.
celery -A app.celery worker --concurrency=1 --hadoop-queue=A -n aworker@%h
celery -A app.celery worker --concurrency=1 --hadoop-queue=B -n bworker@%h
celery -A app.celery worker --concurrency=1 --hadoop-queue=C -n cworker@%h
celery -A app.celery worker --concurrency=1 --hadoop-queue=default -n defaultworker@%h
What I usually do is start the workers (with no tasks executing yet), and then, in another script (say manage.py), add commands with parameters to start specific tasks or tasks with different arguments.
In manage.py:
import click
from tasks import some_task

@click.command()
@click.argument('params', nargs=-1)  # assumed: expose the task arguments on the CLI
def run_task(params):
    some_task.apply_async(args=params)
And this will start the tasks as needed.
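A hypothetical extension of the same idea (the option name and task are assumptions, not from the original answer) that publishes the task to a specific Celery queue per invocation, so a worker started with -Q picks it up:
import click
from tasks import some_task

@click.command()
@click.option('--queue', default='default', help='Celery queue to publish the task to')
def run_task_on_queue(queue):
    # A worker started with `celery worker -Q <queue>` will consume this.
    some_task.apply_async(queue=queue)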

Checking the next run time for scheduled periodic tasks in Celery (with Django)

*Using celery 3.1.25 because django-celery-beat 1.0.1 has an issue with scheduling periodic tasks.
Recently I encountered an issue with celerybeat whereby periodic tasks with an interval of a day or longer appear to be 'forgotten' by the scheduler. If I change the interval to every 5 seconds, the task executes normally (every 5 seconds) and the last_run_at attribute gets updated. This means celerybeat is responding to the scheduler to a certain degree, but if I reset last_run_at, i.e. PeriodicTask.objects.update(last_run_at=None), none of the tasks with a daily interval run anymore.
Celerybeat crashed at one point and that may have corrupted something so I created a new virtualenv and database to see if the problem persists. I'd like to know if there is a way to retrieve the next run time so that I don't have to wait a day to know whether or not my periodic task has been executed.
I have also tried using inspect <active/scheduled/reserved> but all returned empty. Is this normal for periodic tasks using djcelery's database scheduler?
Here's the function that schedules the tasks:
def schedule_data_collection(request, project):
    if request.method == 'POST':
        interval = request.POST.get('interval')
        target_project = Project.objects.get(url_path=project)
        interval_schedule = dict(every=json.loads(interval), period='days')
        schedule, created = IntervalSchedule.objects.get_or_create(
            every=interval_schedule['every'],
            period=interval_schedule['period'],
        )
        task_name = '{} data collection'.format(target_project.name)
        try:
            task = PeriodicTask.objects.get(name=task_name)
        except PeriodicTask.DoesNotExist:
            task = PeriodicTask.objects.create(
                interval=schedule,
                name=task_name,
                task='myapp.tasks.collect_tool_data',
                args=json.dumps([target_project.url_path])
            )
        else:
            if task.interval != schedule:
                task.interval = schedule
            if task.enabled is False:
                task.enabled = True
            task.save()
        return HttpResponse(task.interval)
    else:
        return HttpResponseForbidden()
You can see your schedule by going into a shell and looking at app.conf.CELERYBEAT_SCHEDULE.
celery -A myApp shell
print(app.conf.CELERYBEAT_SCHEDULE)
This should show you all your periodic tasks.
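If you want an estimate of the next run time rather than just the schedule, here is a rough sketch (assuming djcelery's database scheduler, as in the question; the task name is hypothetical):
from django.utils import timezone
from djcelery.models import PeriodicTask

task = PeriodicTask.objects.get(name='my project data collection')
last_run = task.last_run_at or timezone.now()  # last_run_at may be None before the first run
is_due, next_check = task.schedule.is_due(last_run)
print "due now: %s, next check in %s seconds" % (is_due, next_check)
print "time remaining: %s" % task.schedule.remaining_estimate(last_run)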

Running celery task when celery beat starts

How do I schedule a task to run when I start celery beat, and then again every hour after that?
Currently I have this schedule in settings.py:
CELERYBEAT_SCHEDULE = {
    'update_database': {
        'task': 'myapp.tasks.update_database',
        'schedule': timedelta(seconds=60),
    },
}
I saw a post from a year ago here on Stack Overflow asking the same question:
How to run celery schedule instantly?
However, this does not work for me, because my celery worker gets 3-4 requests for the same task when I run the django server.
I'm starting my worker and beat like this:
celery -A dashboard_web worker -B --loglevel=INFO --concurrency=10
Crontab schedule
You could try a crontab schedule instead, which will run every hour, starting one minute after the scheduler is initialized. Warning: you might want to offset it by a couple of minutes in case startup takes longer, otherwise you might have to wait the full hour.
from celery.schedules import crontab
from datetime import datetime

CELERYBEAT_SCHEDULE = {
    'update_database': {
        'task': 'myapp.tasks.update_database',
        'schedule': crontab(minute=(datetime.now().minute + 1) % 60),
    },
}
Reference: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html#crontab-schedules
Ready method of MyAppConfig
To ensure that your task runs right away, you could use the same method as before to create the periodic task, but without adding 1 to the minute. Then call your task in the ready method of MyAppConfig, which is invoked whenever your app is ready.
# myapp/apps.py
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = "myapp"

    def ready(self):
        from .tasks import update_database
        update_database.delay()
Please note that you could also create the periodic task directly in the ready method if you were using django_celery_beat.
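For instance, a minimal sketch of that (django_celery_beat assumed; the task name is an assumption) would be:
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = "myapp"

    def ready(self):
        from django_celery_beat.models import IntervalSchedule, PeriodicTask
        schedule, _ = IntervalSchedule.objects.get_or_create(
            every=1, period=IntervalSchedule.HOURS)
        PeriodicTask.objects.get_or_create(
            name='update_database hourly',
            task='myapp.tasks.update_database',
            defaults={'interval': schedule})
        # Fire the first run immediately, as above.
        from .tasks import update_database
        update_database.delay()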
Edit: Didn't see that the second method was already covered in the link you mentioned. I'll leave it here in case it is useful for someone else arriving here.
Try setting the configuration parameter CELERY_ALWAYS_EAGER = True
Something like this
app.conf.CELERY_ALWAYS_EAGER = True

How to flush tasks with countdown timers from celery queue

My Celery queue has hundreds of tasks with countdowns that will make them trigger over the next few hours. Is there a way to have these tasks run immediately such that the queue is effectively flushed?
I'm currently planning an upgrade to our server and I want to make sure that there are no background tasks running while the upgrade completes. If I have to wait for these countdowns, that's OK, but I'd rather force the tasks to run instead.
Another option could be to pause processing of the queue until the upgrade is complete, but flushing seems like a better option.
EDIT: I've figured out how to find a list of tasks that are scheduled:
from celery.task.control import inspect
i = inspect()
tasks = i.scheduled()
Now I just need to sort out how to force their execution.
OK, I'm fairly certain I've sorted out roughly how to do this. I'm making this answer a wiki and putting down my notes, in case anybody wants to tune up the general process here.
The general idea is this:
1. Stop adding new items to the queue.
2. Determine any tasks that are queued.
3. Revoke all those tasks using result.revoke().
4. Re-start those tasks using some saved state.
Note that this doesn't support adding an eta to the items once you re-queue them, as that's probably implementation-specific.
So, to figure out what tasks are queued, you do:
from celery.task.control import inspect
i = inspect()
scheduled_tasks = i.scheduled()
Which returns a dict, like so:
{u'w1.courtlistener.com': [{u'eta': 1414435210.198864,
                            u'priority': 6,
                            u'request': {u'acknowledged': False,
                                         u'args': u'(2745724,)',
                                         u'delivery_info': {u'exchange': u'celery',
                                                            u'priority': None,
                                                            u'routing_key': u'celery'},
                                         u'hostname': u'w1.courtlistener.com',
                                         u'id': u'99bc8650-3be1-4d24-81d6-a882d77a8b25',
                                         u'kwargs': u'{}',
                                         u'name': u'citations.tasks.update_document_by_id',
                                         u'time_start': None,
                                         u'worker_pid': None}}]}
The next step is to revoke all those tasks, with something like:
from celery.task.control import revoke

with open('revoked_tasks.csv', 'w') as f:
    for worker, tasks in scheduled_tasks.iteritems():
        print "Now processing worker: %s" % worker
        for task in tasks:
            print "Now revoking task: %s. %s with args: %s and kwargs: %s" % \
                (task['request']['id'], task['request']['name'], task['request']['args'], task['request']['kwargs'])
            f.write('%s|%s|%s|%s|%s\n' % (worker, task['request']['name'], task['request']['id'], task['request']['args'], task['request']['kwargs']))
            revoke(task['request']['id'], terminate=True)
Then, finally, re-run the tasks as you would normally, loading them from your CSV file:
with open('revoked_tasks.csv', 'r') as f:
    for line in f:
        worker, command, id, args, kwargs = line.strip().split("|")
        # Import the task's module, something like...
        package, module = command.rsplit('.', 1)
        mod = __import__(package, globals(), locals(), [module])
        # Run the task, something like... (args/kwargs were saved as strings,
        # so in practice they need to be parsed back, e.g. with ast.literal_eval)
        getattr(mod, module).delay(*args, **kwargs)
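As an aside, for the "pause processing of the queue" alternative mentioned in the question, a rough sketch using Celery's remote control commands (the app instance and the default queue name 'celery' are assumptions) would be:
# Tell all workers to stop consuming from the queue before the upgrade...
app.control.cancel_consumer('celery', reply=True)
# ...and start consuming again once the upgrade is done.
app.control.add_consumer('celery', reply=True)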

celery - Tasks that need to run in priority

On my website, users can update their profile manually whenever they want, or it is updated automatically once a day.
This task is distributed with celery now.
But I have a "problem":
Every day the automatic update puts ALL users (roughly 6k) on the queue:
from celery import group
from tasks import *
import datetime
from lastActivityDate.models import UserActivity
today = datetime.datetime.today()
one_day = datetime.timedelta(days=5)
today -= one_day
print datetime.datetime.today()
user_list = UserActivity.objects.filter(last_activity_date__gte=today)
g = group(update_user_profile.s(i.user.auth.username) for i in user_list)
print datetime.datetime.today()
print g(user_list.count()).get()
If someone tries to do a manual update, their task enters the same queue and takes forever to be executed.
Is there a way to make these manual tasks run with priority?
Or to make a dedicated queue for each kind: manual and automatic?
Celery does not support task priority. (v3.0)
http://docs.celeryproject.org/en/master/faq.html#does-celery-support-task-priorities
You may solve this problem by routing tasks.
http://docs.celeryproject.org/en/latest/userguide/routing.html
Prepare default and priority_high queues.
from kombu import Queue

CELERY_DEFAULT_QUEUE = 'default'
CELERY_QUEUES = (
    Queue('default'),
    Queue('priority_high'),
)
Run two daemons.
user@x:/$ celery worker -Q priority_high
user@y:/$ celery worker -Q default,priority_high
And route the task:
your_task.apply_async(args=['...'], queue='priority_high')
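For this question specifically, a sketch of how the split might look (update_user_profile, user_list and group are taken from the question's code):
# A manual update triggered by a single user goes to the dedicated queue:
update_user_profile.apply_async(args=[username], queue='priority_high')

# The daily bulk job for ~6k users stays on the default queue:
g = group(update_user_profile.s(i.user.auth.username).set(queue='default')
          for i in user_list)
g.apply_async()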
If you use RabbitMQ transport then configure your queues the following way:
settings.py
from kombu import Queue
...
CELERY_TASK_QUEUES = (
    Queue('default', routing_key='task_default.#', max_priority=10),
    ...)
Then run your tasks:
my_low_prio_task.apply_async(args=(...), priority=1)
my_high_prio_task.apply_async(args=(...), priority=10)
Presently this code works for kombu==4.6.11, celery==4.4.6.