I am using celery to run long-running tasks on Hadoop. Each task executes a Pig script on Hadoop which runs for about 30 minutes to 2 hours.
My current Hadoop setup has four queues: a, b, c, and default. All tasks are currently being executed by a single worker which submits the job to a single queue.
I want to add 3 more workers which submit jobs to other queues, one worker per queue.
The problem is the queue is currently hard-coded and I wish to make this variable per worker.
I searched a lot but I am unable to find a way to pass each celery worker a different queue value and access it in my task.
I start my celery worker like so.
celery -A app.celery worker
I wish to pass some additional arguments in the command-line itself and access it in my task but celery complains that it doesn't understand my custom argument.
I plan to run all the workers on the same host by setting the --concurrency=3 parameter. Is there any solution to this problem?
Thanks!
EDIT
The current scenario is like this. Every time I try to execute the task print_something by calling tasks.print_something.delay(), it only prints the letter C.
@celery.task()
def print_something():
    print "C"
I need to have the workers print a variable letter based on what value I pass to them while starting them.
@celery.task()
def print_something():
    print "<Variable Value Per Worker Here>"
Hope this helps someone.
Several problems needed to be solved to get this working.
The first step involved adding support in celery for the custom parameter. If this is not done, celery will complain that it doesn't understand the parameter.
Since I am running celery with Flask, I initialize celery like so.
from celery import Celery

def configure_celery():
    # app is the Flask application defined elsewhere in the project.
    app.config.update(
        CELERY_BROKER_URL='amqp://:@localhost:5672',
        RESULT_BACKEND='db+mysql://root:@localhost:3306/<database_name>'
    )
    celery = Celery(app.import_name, backend=app.config['RESULT_BACKEND'],
                    broker=app.config['CELERY_BROKER_URL'])
    celery.conf.update(app.config)
    TaskBase = celery.Task

    class ContextTask(TaskBase):
        abstract = True

        def __call__(self, *args, **kwargs):
            with app.app_context():
                return TaskBase.__call__(self, *args, **kwargs)

    celery.Task = ContextTask
    return celery
I call this function to initialize celery and store it in a variable called celery.
celery = configure_celery()
To add the custom parameter you need to do the following.
def add_hadoop_queue_argument_to_worker(parser):
    parser.add_argument(
        '--hadoop-queue', help='Hadoop queue to be used by the worker'
    )
The celery instance used below is the one obtained from the steps above.
celery.user_options['worker'].add(add_hadoop_queue_argument_to_worker)
The next step would be to make this argument accessible in the worker. To do that follow these steps.
from celery import bootsteps

class HadoopCustomWorkerStep(bootsteps.StartStopStep):

    def __init__(self, worker, **kwargs):
        worker.app.hadoop_queue = kwargs['hadoop_queue']
Inform celery to use this class for creating the workers.
celery.steps['worker'].add(HadoopCustomWorkerStep)
The tasks should now be able to access the variables.
@app.task(bind=True)
def print_hadoop_queue_from_config(self):
    print self.app.hadoop_queue
Verify it by running the workers on the command-line:
celery -A app.celery worker --concurrency=1 --hadoop-queue=A -n aworker@%h
celery -A app.celery worker --concurrency=1 --hadoop-queue=B -n bworker@%h
celery -A app.celery worker --concurrency=1 --hadoop-queue=C -n cworker@%h
celery -A app.celery worker --concurrency=1 --hadoop-queue=default -n defaultworker@%h
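As an illustration of how the per-worker value might then be used to submit the Pig script, here is a minimal sketch that reuses app/self.app from the snippets above. The script-path handling and the mapreduce.job.queuename property are assumptions (older clusters use mapred.job.queue.name), so adapt it to your Hadoop/Pig setup:
import subprocess

@app.task(bind=True)
def run_pig_script(self, script_path):
    # Pick up the queue this worker was started with (--hadoop-queue).
    hadoop_queue = self.app.hadoop_queue
    # NOTE: the property name is an assumption and depends on the Hadoop version.
    subprocess.check_call([
        'pig',
        '-Dmapreduce.job.queuename=%s' % hadoop_queue,
        '-f', script_path,
    ])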
What I usually do is, after starting the workers (at which point no tasks are executed yet), add commands with parameters in another script (say manage.py) to start specific tasks or tasks with different arguments.
In manage.py:
import click
from tasks import some_task

@click.command()
@click.argument('params', nargs=-1)
def run_task(params):
    some_task.apply_async(args=params)
And this will start the tasks as needed.
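For completeness, a minimal sketch of how such a script could be wired up and invoked; the __main__ guard and the example arguments are assumptions, not part of the original answer:
# at the bottom of manage.py
if __name__ == '__main__':
    run_task()

# then, from the shell:
#   python manage.py some_arg another_arg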
I use docker to run celery + redis + flask, and I want to know how many tasks are waiting to be executed by celery. I tried to find the information in redis with the command
keys *
and I get result:
127.0.0.1:6379> keys *
1) "unacked_mutex"
2) "_kombu.binding.celeryev"
3) "unacked_index"
4) "_kombu.binding.celery.pidbox"
5) "_kombu.binding.celery"
6) "unacked"
None of these items seem to contain the celery queue information. How can I read the celery queue size?
This is the celery code:
from celery import Celery
import time
app = Celery('tasks', broker='redis://redis:6379')
@app.task
def sleeptest():
    time.sleep(100)
This is how I submit the celery job:
import tasks
import time
tasks.sleeptest.delay()
time.sleep(1)
tasks.sleeptest.delay()
time.sleep(1)
tasks.sleeptest.delay()
time.sleep(1)
tasks.sleeptest.delay()
time.sleep(1)
tasks.sleeptest.delay()
time.sleep(1)
When I post 100 tasks, a celery queue key appears in Redis. But when I post only 5 tasks, the celery queue key does not show up, even though I set concurrency to 1 and 4 tasks are actually waiting.
The information is buried deep in the Celery documentation, under monitoring Redis queues. If you see nothing, that means the tasks are either finished, running, or already reserved (look for details about worker_prefetch_multiplier).
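Concretely, with the Redis broker each Celery queue is a plain Redis list named after the queue ('celery' by default), so its length is the number of messages still waiting. A minimal sketch, assuming the connection details from the code above:
import redis

r = redis.Redis(host='redis', port=6379, db=0)
# LLEN on the queue name gives the backlog size.
print(r.llen('celery'))
Tasks that a worker has already prefetched (reserved) are no longer in this list, which is why a handful of submitted tasks can "disappear" from Redis even though they have not run yet.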
In my Django project with Celery, I have a task function that needs to receive all incoming tasks but execute them one at a time, like a singleton.
I can do this like:
import os
import time

from celery import shared_task
from django.conf import settings

@shared_task(bind=True)
def make_some_task(self, event_id):
    lock_name = os.path.join(settings.BASE_DIR, 'create_lock')
    # Re-check the lock file on every iteration, otherwise the loop never ends.
    while os.path.exists(lock_name):
        time.sleep(10)
    with open(lock_name, 'w') as lock_file:
        lock_file.write('locked')
    # ..... do some stuff .....
    os.remove(lock_name)
but I don't think this is the correct way to do it inside Celery; there must be a better way to implement this.
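For context, the approach usually suggested for this (for example in the Celery documentation's recipe on ensuring a task is only executed one at a time) is a cache-based lock rather than a file on disk. A minimal sketch, assuming Django's cache is backed by something shared such as Redis or Memcached, with the lock key and expiry chosen arbitrarily:
from celery import shared_task
from django.core.cache import cache

LOCK_EXPIRE = 60 * 10  # seconds; should exceed the longest expected run time

@shared_task(bind=True)
def make_some_task(self, event_id):
    # cache.add is atomic: it only succeeds if the key does not exist yet.
    if not cache.add('make_some_task_lock', 'locked', LOCK_EXPIRE):
        # Another run holds the lock; retry later instead of blocking the worker.
        raise self.retry(countdown=10)
    try:
        pass  # ..... do some stuff .....
    finally:
        cache.delete('make_some_task_lock')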
So this is what I want to do. I have a scheduled task that runs every X minutes. In the task I create a group of tasks that I want to run parallel to each other. After they all finish, I want to log whether the group finished successfully or not. This is my code:
@shared_task(base=HandlersImplTenantTask, acks_late=True)
def my_scheduled_task():
    try:
        needed_ids = MyModel.objects.filter(some_field=False)\
            .filter(some_other_field=True)\
            .values_list("id", flat=True) \
            .order_by("id")
        if needed_ids:
            tasks = [my_single_task.s(needed_id=id) for id in needed_ids]
            job = group(tasks)
            result = job.apply_async()
            returned_values = result.get()
            if result.ready():
                if result.successful():
                    logger.info("SUCCESSFULLY FINISHED ALL THE SUBTASKS")
                else:
                    returned_values = result.get()
                    logger.info("UNSUCCESSFULLY FINISHED ALL THE SUBTASKS WITH THE RESULTS %s" % returned_values)
        else:
            logger.info("no needed ids found")
    except:
        logger.exception("got an unexpected exception while running task")
This is my_single_task code:
@shared_task(base=HandlersImplTenantTask)
def my_single_task(needed_id):
    logger.info("starting task for tenant: [%s]. got id [%s]", connection.tenant, needed_id)
    return
This is how I run my Celery worker:
manage.py celery worker -c 2 --broker=[my rabbitmq broker url]
When I get to the line result.get() it hangs. I see a single log entry from the single task with the first id, but I don't see the others. When I kill my celery process and restart it, it reruns the scheduled task and I see the second log entry with the second id (from the first time the task ran). Any ideas on how to fix this?
EDIT - To try to overcome this, I created a different queue called 'new_queue' and started a different celery worker to listen to it. I want the other worker to take the tasks and work on them. I think this could solve the problem of the deadlock.
I have changed my code to look like this:
job = group(tasks)
job_result = job.apply_async(queue='new_queue')
results = job_result.get()
but I still get a deadlock, and if I remove the results = job_result.get() line, I can see that the tasks are picked up by the main worker and nothing is published to the new_queue queue. Any thoughts?
This is my celery configuration:
tenant_celery_app.conf.update(
    CELERY_RESULT_BACKEND='djcelery.backends.database.DatabaseBackend',
    CELERY_RESULT_DB_TABLENAMES={
        'task': 'tenantapp_taskmeta',
        'group': 'tenantapp_groupmeta',
    }
)
This is how I run the workers:
celery worker -c 1 -Q new_queue --broker=[amqp_broker_url]/[vhost]
celery worker -c 1 --broker=[amqp_broker_url]/[vhost]
So the solution I was looking for was indeed along the lines of creating a new queue and starting a new worker that processes it. The only issue I had was sending the group tasks to the new queue. This is the code that worked for me.
tasks = [my_single_task.s(needed_id=id).set(queue='new_queue') for id in needed_ids]
job = group(tasks)
job_result = job.apply_async()
results = job_result.get()  # this will block until the tasks finish, but it won't deadlock
And these are my celery workers:
celery worker -c 1 -Q new_queue --broker=[amqp_broker_url]/[vhost]
celery worker -c 1 --broker=[amqp_broker_url]/[vhost]
You seem to be deadlocking your queue. Think about it: if you have a task that waits on other tasks and the queue fills up, then the first task will hang forever.
You need to refactor your code to avoid calling result.get() inside a task (you probably already have warnings about this in your logs).
I would recommend this:
@shared_task(base=HandlersImplTenantTask, acks_late=True)
def my_scheduled_task():
    needed_ids = MyModel.objects.filter(some_field=False)\
        .filter(some_other_field=True)\
        .values_list("id", flat=True) \
        .order_by("id")
    if needed_ids:
        tasks = [my_single_task.s(needed_id=id) for id in needed_ids]
        job = group(tasks)
        result = job.apply_async()
That's all you need.
Use logging to track if tasks fail.
If code elsewhere in your application needs to track whether the jobs failed or not, you can use Celery's inspect API.
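A rough sketch of what that could look like; the import path for the Celery app is an assumption based on the configuration snippet above:
def dump_worker_state():
    # tenant_celery_app is the Celery application configured earlier;
    # import it from wherever it lives in your project.
    inspector = tenant_celery_app.control.inspect()
    # Each call returns a dict mapping worker name -> list of task descriptions,
    # or None if no workers replied.
    print(inspector.active())     # tasks currently executing
    print(inspector.reserved())   # tasks prefetched by workers but not yet started
    print(inspector.scheduled())  # tasks with an ETA/countdown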
How do I schedule a task to run when I start celery beat, and then again every hour after that?
Currently I have this schedule in settings.py:
CELERYBEAT_SCHEDULE = {
    'update_database': {
        'task': 'myapp.tasks.update_database',
        'schedule': timedelta(seconds=60),
    },
}
I saw a post from a year ago here on Stack Overflow asking the same question:
How to run celery schedule instantly?
However, this does not work for me, because my celery worker gets 3-4 requests for the same task when I run the Django server.
I'm starting my worker and beat like this:
celery -A dashboard_web worker -B --loglevel=INFO --concurrency=10
Crontab schedule
You could try to use a crontab schedule instead, which will run every hour and start 1 minute after the scheduler is initialized. Warning: you might want to set it a couple of minutes later in case startup takes longer, otherwise you might have to wait the full hour.
from celery.schedules import crontab
from datetime import datetime

CELERYBEAT_SCHEDULE = {
    'update_database': {
        'task': 'myapp.tasks.update_database',
        'schedule': crontab(minute=(datetime.now().minute + 1) % 60),
    },
}
Reference: http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html#crontab-schedules
Ready method of MyAppConfig
In order to ensure that your task runs right away, you could use the same method as before to create the periodic task, without adding 1 to the minute. Then, call your task in the ready method of MyAppConfig, which is called whenever your app is ready.
# myapp/apps.py
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = "myapp"

    def ready(self):
        from .tasks import update_database
        update_database.delay()
Please note that you could also create the periodic task directly in the ready method if you were to use django_celery_beat.
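As a rough illustration of that django_celery_beat variant, here is a sketch; the interval, task path, and entry name mirror the settings above, and it assumes django_celery_beat is installed and migrated:
# myapp/apps.py
from django.apps import AppConfig

class MyAppConfig(AppConfig):
    name = "myapp"

    def ready(self):
        from django_celery_beat.models import IntervalSchedule, PeriodicTask
        schedule, _ = IntervalSchedule.objects.get_or_create(
            every=1, period=IntervalSchedule.HOURS,
        )
        PeriodicTask.objects.get_or_create(
            name='update_database',
            task='myapp.tasks.update_database',
            interval=schedule,
        )
        # Kick off the first run immediately; beat handles the hourly repeats.
        from .tasks import update_database
        update_database.delay()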
Edit: Didn't see that the second method was already covered in the link you mentioned. I'll leave it here in case it is useful for someone else arriving here.
Try setting the configuration parameter CELERY_ALWAYS_EAGER = True
Something like this
app.conf.CELERY_ALWAYS_EAGER = True
On my website, users can UPDATE their profile manually whenever they want, or automatically once a day.
This task is being distributed with celery now.
But I have a "problem":
Every day, in the automatic update, a job puts ALL users (~6k users) on the queue:
from celery import group
from tasks import *
import datetime
from lastActivityDate.models import UserActivity
today = datetime.datetime.today()
one_day = datetime.timedelta(days=5)
today -= one_day
print datetime.datetime.today()
user_list = UserActivity.objects.filter(last_activity_date__gte=today)
g = group(update_user_profile.s(i.user.auth.username) for i in user_list)
print datetime.datetime.today()
print g(user_list.count()).get()
If someone tries to do a manual update, it will enter the queue and take forever to be executed.
Is there a way to make this manual task run with priority?
Or to make a dedicated queue for each: manual and automatic?
Celery does not support task priority. (v3.0)
http://docs.celeryproject.org/en/master/faq.html#does-celery-support-task-priorities
You may solve this problem by routing tasks.
http://docs.celeryproject.org/en/latest/userguide/routing.html
Prepare default and priority_high queues.
from kombu import Queue

CELERY_DEFAULT_QUEUE = 'default'
CELERY_QUEUES = (
    Queue('default'),
    Queue('priority_high'),
)
Run two daemons.
user@x:/$ celery worker -Q priority_high
user@y:/$ celery worker -Q default,priority_high
And route the task:
your_task.apply_async(args=['...'], queue='priority_high')
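Applied to the question, the manual profile update could be sent to the dedicated queue while the daily bulk job keeps using the default queue; update_user_profile and the username argument are taken from the code in the question:
# manual update triggered by the user
update_user_profile.apply_async(args=[username], queue='priority_high')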
If you use RabbitMQ transport then configure your queues the following way:
settings.py
from kombu import Queue
...
CELERY_TASK_QUEUES = (
    Queue('default', routing_key='task_default.#', max_priority=10),
    ...)
Then run your tasks:
my_low_prio_task.apply_async(args=(...), priority=1)
my_high_prio_task.apply_async(args=(...), priority=10)
Presently this code works for kombu==4.6.11, celery==4.4.6.