I have stepA and stepB, which need to be applied to ~100 sets of data. Since these steps are CPU intensive, I want them to run sequentially.
I am using Celery to build a chain of ~100 sub-chains, each consisting of (stepA and stepB).
The Celery execution works fine. Since each stepA and stepB may take an hour to complete, I want to show the status of each sub-chain in a Django page. The problem is that AsyncResult sees all the children of the parent chain as a flat list of tasks.
The following is a sample code snippet.
test_celery.py
from __future__ import absolute_import, division, print_function
from celery import Celery, chord, chain, group
from datetime import datetime
import time

app = Celery('tasks', backend='redis', broker='redis://')

@app.task
def ident(x):
    print("Guru: inside ident: {}".format(datetime.now()))
    print(x)
    return x

@app.task
def tsum(numbers):
    print("Guru: inside tsum: {}".format(datetime.now()))
    print(numbers)
    time.sleep(5)
    return sum(numbers)
test.py
import test_celery
from celery import chain
from celery.result import AsyncResult
from celery.utils.graph import DependencyGraph

def method():
    async_chain = chain(chain(test_celery.tsum.si([1, 2]), test_celery.ident.si(2)),
                        chain(test_celery.tsum.si([2, 3]), test_celery.ident.si(3)))
    chain_task_names = [task.task for task in async_chain.tasks]
    # run the chain
    chain_results_tasks = async_chain.apply_async()
    print("async_chain=", dir(chain_results_tasks))
    print("result.status={}".format(chain_results_tasks.status))
    # create a list of names and tasks in the chain
    # (get_chain_nodes() and get_parent_node() are helpers defined elsewhere in my project)
    chain_tasks = zip(chain_task_names, reversed(list(get_chain_nodes(chain_results_tasks))))
    xx = list(get_chain_nodes(chain_results_tasks))
    print(dir(chain_results_tasks))
    for task in async_chain.tasks:
        print("dir task={}".format(dir(task)))
        print("task_name={} task_id={}".format(task.task, task.parent_id))
    for i in xx:
        res = AsyncResult(i)
        # print("res={}".format(dir(res)))
        parent = get_parent_node(i)
        print(parent.build_graph(intermediate=True))
        print("parent_task={}".format(dir(parent)))
    print(xx[-1].build_graph(intermediate=True))

method()
Any help is appreciated.
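One detail worth noting (a minimal sketch, not part of the original code above): after apply_async() on a chain, the returned AsyncResult points at the last task and each earlier task is reachable via .parent, so the individual task ids can be collected like this, assuming a result backend such as Redis is configured; the helper name is my own:
def collect_chain_task_ids(last_result):
    # walk AsyncResult.parent links from the last task back to the first
    ids = []
    node = last_result
    while node is not None:
        ids.append(node.id)
        node = node.parent
    return list(reversed(ids))  # ids in execution order

# e.g. task_ids = collect_chain_task_ids(chain_results_tasks)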
Related
I am building an interval based scheduler using apscheduler. Here's the code:
from flask import Flask
import time
from apscheduler.schedulers.background import BackgroundScheduler

def job1():
    print('performed job1')

def job2():
    print('performed job2')

sched = BackgroundScheduler(daemon=True)
sched.add_job(lambda: sched.print_jobs(), 'interval', minutes=1)
sched.add_job(job1, 'interval', minutes=1)
sched.add_job(job2, 'interval', minutes=2)

try:
    sched.start()
except (KeyboardInterrupt, SystemExit):
    pass

app = Flask(__name__)

if __name__ == "__main__":
    app.run()
Whenever two jobs are triggered simultaneously, apscheduler performs only the second one. I just want to know on what basis apscheduler decides which job to perform out of the two clashing jobs, and whether it is possible to change that criterion, since I want to perform the job that has the higher priority. I'm defining priorities explicitly.
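A minimal sketch, assuming APScheduler 3.x, of one related knob: the size of the executor's thread pool, which determines whether jobs firing at the same time can run side by side at all (this is not a priority scheme, and the pool size of 20 is an arbitrary example):
from apscheduler.executors.pool import ThreadPoolExecutor
from apscheduler.schedulers.background import BackgroundScheduler

# Assumption: a larger default thread pool lets clashing interval jobs
# run concurrently instead of one waiting on the other.
executors = {'default': ThreadPoolExecutor(max_workers=20)}
sched = BackgroundScheduler(executors=executors, daemon=True)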
In my views.py I am using celery to run a shared task present in tasks.py.
Here is how I call from views.py
task = task_addnums.delay()
task_id = task.id
tasks.py looks like this:
from celery import shared_task
from celery.result import AsyncResult

@shared_task
def task_addnums():
    # print self.request.id
    # do something
    return True
Now, as we can see, we already have the task id from task.id in views.py. But let's say I want to fetch the task id from inside the shared_task itself: how can I? The goal is to get the task id from within task_addnums so I can pass it to some other function.
I tried using self.request.id, assuming the first param is self, but it didn't work.
Solved.
This answer is a gem: Getting task_id inside a Celery task.
You can do function_name.request.id to get the task id.
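For example, a minimal sketch of the bind=True variant, which is what makes self available as the first argument (reusing the task name from the question):
from celery import shared_task

@shared_task(bind=True)
def task_addnums(self):
    # with bind=True the task instance is passed in as self,
    # so the current request (and its id) is available here
    current_id = self.request.id
    return current_id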
current_task from celery will get the current task. Code like this:
from celery import shared_task, current_task

@shared_task
def task_addnums():
    print(current_task.request)
    # do something
    return True
I have to run tasks on approximately 150k Django objects. What is the best way to do this? I am using the Django ORM as the broker. The database backend is MySQL, and it chokes and dies during the task.delay() of all the tasks. Relatedly, I also wanted to kick this off from the submission of a form, but the resulting request produced a very long response time and timed out.
I would also consider using something other than the database as the "broker"; it really isn't suitable for this kind of work.
Though, you can move some of this overhead out of the request/response cycle by launching a task to create the other tasks:
from celery.task import TaskSet, task
from myapp.models import MyModel

@task
def process_object(pk):
    obj = MyModel.objects.get(pk=pk)
    # do something with obj

@task
def process_lots_of_items(ids_to_process):
    return TaskSet(process_object.subtask((id, ))
                   for id in ids_to_process).apply_async()
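A short usage sketch of how this might be kicked off from the form's view, so only a single task is queued inside the request/response cycle (list_of_ids stands for whatever id list the view already has):
# queue one task; it fans out to the per-object tasks in the background
process_lots_of_items.delay(list_of_ids)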
Also, since you probably don't have 15000 processors to process all of these objects in parallel, you could split the objects into chunks of, say, 100 or 1000:
from itertools import islice
from celery.task import TaskSet, task
from myapp.models import MyModel

def chunks(it, n):
    for first in it:
        yield [first] + list(islice(it, n - 1))

@task
def process_chunk(pks):
    objs = MyModel.objects.filter(pk__in=pks)
    for obj in objs:
        pass  # do something with obj

@task
def process_lots_of_items(ids_to_process):
    return TaskSet(process_chunk.subtask((chunk, ))
                   for chunk in chunks(iter(ids_to_process),
                                       1000)).apply_async()
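TaskSet belongs to the old Celery API in use at the time; on recent Celery versions the same chunked fan-out would roughly be expressed with a group of signatures, as in this sketch:
from celery import group

def process_lots_of_items(ids_to_process):
    # one signature per chunk, dispatched together as a group
    job = group(process_chunk.s(chunk)
                for chunk in chunks(iter(ids_to_process), 1000))
    return job.apply_async()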
Try using RabbitMQ instead.
RabbitMQ is used in a lot of bigger companies and people really rely on it, since it's such a great broker.
Here is a great tutorial on how to get started with it.
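A minimal configuration sketch for pointing an older (pre-4.x) Celery/django-celery setup like this one at a local RabbitMQ instance; the credentials and vhost below are just RabbitMQ's defaults:
# settings.py: use RabbitMQ as the broker instead of the Django database
BROKER_URL = 'amqp://guest:guest@localhost:5672//'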
I use beanstalkd ( http://kr.github.com/beanstalkd/ ) as the engine. Adding a worker and a task is pretty straightforward for Django if you use django-beanstalkd : https://github.com/jonasvp/django-beanstalkd/
It’s very reliable for my usage.
Example of a worker:
import os
import time
from django_beanstalkd import beanstalk_job

@beanstalk_job
def background_counting(arg):
    """
    Do some incredibly useful counting to the value of arg
    """
    value = int(arg)
    pid = os.getpid()
    print("[%s] Counting from 1 to %d." % (pid, value))
    for i in range(1, value + 1):
        print('[%s] %d' % (pid, i))
        time.sleep(1)
To launch a job/worker/task:
from django_beanstalkd import BeanstalkClient
client = BeanstalkClient()
client.call('beanstalk_example.background_counting', '5')
(source extracted from example app of django-beanstalkd)
Enjoy!
I'm running multiple simulations as tasks through celery (version 2.3.2) from django. The simulations get set up by another task:
In views.py:
result = setup_simulations.delay(parameters)
request.session['sim'] = result.task_id # Store main task id
In tasks.py:
@task(priority=1)
def setup_simulations(parameters):
    task_ids = []
    for i in range(number_of_simulations):
        result = run_simulation.delay(other_parameters)
        task_ids.append(result.task_id)
    return task_ids
After the initial task (setup_simulations) has finished, I try to revoke the simulation tasks as follows:
main_task_id = request.session['sim']
main_result = AsyncResult(main_task_id)
# Revoke sub tasks
from celery.task.control import revoke
for sub_task_id in main_result.get():
    sub_result = AsyncResult(sub_task_id)
    sub_result.revoke()  # Does not work
    # revoke(sub_task_id)  # Does not work either
When I look at the output of "python manage.py celeryd -l info", the tasks get executed as if nothing had happened. Does anybody have an idea what could have gone wrong?
As you mention in the comment, revoke is a remote control command so it's only currently supported by the amqp and redis transports.
You can accomplish this yourself by storing a revoked flag in your database, e.g.:
from celery import states
from celery import task
from celery.exceptions import Ignore
from myapp.models import RevokedTasks

@task
def foo():
    if RevokedTasks.objects.filter(task_id=foo.request.id).count():
        if not foo.ignore_result:
            foo.update_state(state=states.REVOKED)
        raise Ignore()
If your task is working on some model you could even store a flag in that.
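A sketch of the other half of that approach, i.e. marking the sub tasks as revoked from the view; RevokedTasks is the hypothetical model from the snippet above, assumed to have a task_id field:
from myapp.models import RevokedTasks

def revoke_sub_tasks(main_result):
    # record each sub task id; the task body above checks this table and raises Ignore()
    for sub_task_id in main_result.get():
        RevokedTasks.objects.get_or_create(task_id=sub_task_id)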
I've followed the guidelines in http://celeryq.org/docs/django-celery/getting-started/first-steps-with-django.html and created a view that calls my test method in tasks.py:
import time
from celery.decorators import task

@task()
def add(x, y):
    time.sleep(10)
    return x + y
But if my add method takes a long time to respond, how can I store the result object I get when calling add.delay(1, 2) and use it later to check the progress/success/result with get?
You only need the task-id:
result = add.delay(2, 2)
result.task_id
With this you can poll the status of the task (using e.g. AJAX).
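A minimal sketch of re-creating the result later from a stored task id (for instance in the view the AJAX call hits), assuming a result backend is configured:
from celery.result import AsyncResult

result = AsyncResult(stored_task_id)   # the id saved earlier, e.g. in the session
print(result.status)                   # PENDING / STARTED / SUCCESS / FAILURE
if result.ready():
    print(result.get())                # the return value of add(x, y)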
Django-celery comes with a view that returns results and status in JSON:
http://celeryq.org/docs/django-celery/reference/djcelery.views.html