Stopping a function in Python using a timeout - django

I'm writing a healthcheck endpoint for my web service.
The endpoint calls a series of functions, each of which returns True if its component is working correctly.
The system is considered to be working if all the components are working:
def is_health():
    healthy = all(r for r in (database(), cache(), worker(), storage()))
    return healthy
When things aren't working, the functions may take a long time to return. For example if the database is bogged down with slow queries, database() could take more than 30 seconds to return.
The healthcheck endpoint runs in the context of a Django view, running inside a uWSGI container. If the request / response cycle takes longer than 30 seconds, the request is harakiri-ed!
This is a huge bummer, because I lose all contextual information that I could have logged about which component took a long time.
What I'd really like is for the component functions to run within a timeout or a deadline:
with timeout(seconds=30):
    database_result = database()
    cache_result = cache()
    worker_result = worker()
    storage_result = storage()
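Something like this signal-based sketch is the closest thing I can picture (a minimal sketch assuming Unix and that the checks run in the main thread; the exception name is made up, and as the update below notes, SIGALRM is probably a poor fit under uWSGI):

import signal
from contextlib import contextmanager

class HealthcheckTimeout(Exception):
    pass

@contextmanager
def timeout(seconds):
    def _handle_alarm(signum, frame):
        raise HealthcheckTimeout("timed out after %d seconds" % seconds)

    old_handler = signal.signal(signal.SIGALRM, _handle_alarm)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, old_handler)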
In my imagination, as the deadline / harakiri timeout approaches, I can abort the remaining health checks and just report the work I've completed.
What's the right way to handle this sort of thing?
I've looked at threading.Thread and Queue.Queue - the idea being that I create a work queue and a result queue, and then use a thread to consume the work queue while placing the results in the result queue. Then I could use the thread's join() with a timeout to stop waiting on the rest of the components.
The one challenge there is that I'm not sure how to hard-exit the thread - I wouldn't want it hanging around forever if it didn't complete its run.
Here is the code I've got so far. Am I on the right track?
import Queue
import threading
import time


class WorkThread(threading.Thread):
    def __init__(self, work_queue, result_queue):
        super(WorkThread, self).__init__()
        self.work_queue = work_queue
        self.result_queue = result_queue
        self._timeout = threading.Event()

    def timeout(self):
        self._timeout.set()

    def timed_out(self):
        return self._timeout.is_set()

    def run(self):
        while not self.timed_out():
            try:
                # get_nowait: a blocking get() would never raise Queue.Empty.
                work_fn, work_arg = self.work_queue.get_nowait()
                retval = work_fn(work_arg)
                self.result_queue.put(retval)
            except Queue.Empty:
                break


def work(retval, timeout=1):
    time.sleep(timeout)
    return retval


def main():
    # Two work items that will take at least two seconds to complete.
    work_queue = Queue.Queue()
    work_queue.put_nowait([work, 1])
    work_queue.put_nowait([work, 2])
    result_queue = Queue.Queue()

    # Run the `WorkThread`. It should complete one item from the work queue
    # before it times out.
    t = WorkThread(work_queue=work_queue, result_queue=result_queue)
    t.start()
    t.join(timeout=1.1)
    t.timeout()

    results = []
    while True:
        try:
            result = result_queue.get_nowait()
            results.append(result)
        except Queue.Empty:
            break

    print results


if __name__ == "__main__":
    main()
Update
It seems like in Python you've got a few options for timeouts of this nature:
Use SIGALRM, which works great if you have full control of the signals used by the process, but is probably a mistake when you're running inside a container like uWSGI.
Threads, which give you limited timeout control. Depending on your container environment (like uWSGI) you might need to set options to enable them.
Subprocesses, which give you full timeout control (see the sketch after this list), but you need to be conscious of how they might change how your service consumes resources.
Use existing network timeouts. For example, if part of your healthcheck is to use Celery workers, you could rely on AsyncResult's timeout parameter to bound execution.
Do nothing! Log at regular intervals. Analyze later.
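To make the subprocess option concrete, here's a minimal sketch of the idea (the helper names are mine, not from the repo below): run each check in a child process and hard-kill it if it blows the deadline, something a thread can't offer.

import multiprocessing

def _call_and_put(fn, result_queue):
    result_queue.put(fn())

def run_with_timeout(fn, seconds):
    # Run fn in a child process and terminate it if it exceeds the deadline.
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_call_and_put, args=(fn, result_queue))
    proc.start()
    proc.join(seconds)
    if proc.is_alive():
        proc.terminate()  # unlike a thread, a child process can be killed
        proc.join()
        return None  # treat a timeout as "component unhealthy"
    return result_queue.get() if not result_queue.empty() else None

Usage would look like database_result = run_with_timeout(database, 5), at the cost of a fork per check.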
I'm exploring the benefits of these different options more.
Update #2
I put together a GitHub repo with quite a bit more information on the topic:
https://github.com/johnboxall/pytimeout
I'll type it up into an answer one day, but the TL;DR is here:
https://github.com/johnboxall/pytimeout#recommendations

Related

Django Celery shared_task singleton pattern

I have a Django based site that has several background processes that are executed in Celery workers. I have one particular task that can run for a few seconds with several read/writes to the database that are subject to a race condition if a second task tries to access the same rows.
I'm trying to prevent this by ensuring the task only ever runs on a single worker at a time, but I'm running into issues getting this to work correctly. I've used this Celery Task Cookbook Recipe as inspiration, trying to make my own version that ensures this specific task only runs on one worker at a time, but it still seems possible to encounter situations where it's executed across more than one worker.
So far, in tasks.py I have:
class LockedTaskInProgress(Exception):
    """The locked task is already in progress"""
    silent_variable_failure = True


@shared_task(autoretry_for=[LockedTaskInProgress], default_retry_delay=30)
def create_or_update_payments(things=None):
    """
    This takes a list of `things` that we want to process payments on. It will
    read the thing's status, then apply calculations to make one or more payments
    for various users that are owed money for the thing.

    `things` - The list of things we need to process payments on.
    """
    lock = cache.get('create_or_update_payments')  # Using Redis as our cache backend

    if not lock:
        logger.debug('Starting create/update payments processing. Locking task.')
        cache.set('create_or_update_payments', 'LOCKED')
        real_create_or_update_payments(things)  # Long-running function w/ lots of DB reads/writes
        cache.delete('create_or_update_payments')
        logger.debug('Completed create/update payments processing. Lock removed.')
    else:
        logger.debug('Unable to process create/update payments at this time. Lock detected.')
        raise LockedTaskInProgress
The above almost seems to work, but there still appears to be a possible race condition between the cache.get and cache.set calls, which has shown up in my testing.
I'd love to get suggestions on how to improve this to make it more robust.
Think I've found a way of doing this, inspired by an older version of the Celery Task Cookbook Recipe I was using earlier.
Here's my implementation:
class LockedTaskInProgress(Exception):
    """The locked task is already in progress"""
    silent_variable_failure = True


@shared_task(autoretry_for=[LockedTaskInProgress], default_retry_delay=30)
def create_or_update_payments(things=None):
    """
    This takes a list of `things` that we want to process payments on. It will
    read the thing's status, then apply calculations to make one or more payments
    for various users that are owed money for the thing.

    `things` - The list of things we need to process payments on.
    """
    LOCK_EXPIRE = 60 * 5  # 5 mins

    lock_id = 'create_or_update_payments'

    acquire_lock = lambda: cache.add(lock_id, 'LOCKED', LOCK_EXPIRE)
    release_lock = lambda: cache.delete(lock_id)

    if acquire_lock():
        try:
            logger.debug('Starting create/update payments processing. Locking task.')
            real_create_or_update_payments(things)  # Long-running function w/ lots of DB reads/writes
        finally:
            release_lock()
            logger.debug('Completed create/update payments processing. Lock removed.')
    else:
        logger.debug('Unable to process create/update payments at this time. Lock detected.')
        raise LockedTaskInProgress
It's very possible that there's a better way of doing this but this seems to work in my tests.
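For example, if the cache backend is django-redis, its cache.lock() helper (which wraps redis-py's distributed lock) can manage acquisition and expiry for you - a sketch under that assumption, with parameter names following redis-py:

from django.core.cache import cache
from redis.exceptions import LockError

@shared_task(autoretry_for=[LockedTaskInProgress], default_retry_delay=30)
def create_or_update_payments(things=None):
    try:
        # blocking_timeout=0: give up immediately if another worker holds the lock.
        with cache.lock('create_or_update_payments', timeout=60 * 5, blocking_timeout=0):
            real_create_or_update_payments(things)
    except LockError:
        raise LockedTaskInProgress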

Is there any way to define task quota in celery?

I have these requirements:
I have a few heavy, resource-consuming tasks - exporting different reports that require big, complex queries and subqueries.
There are a lot of users.
I have built the project in Django and queue the tasks using Celery.
I want to restrict users so that each can run only 10 reports per minute. The idea is that they can queue hundreds of requests in ten minutes, but I want Celery to execute only 10 tasks per minute for each user, so that every user gets their turn.
Is there any way Celery can do this?
Thanks
Celery has a setting to control the RATE_LIMIT (http://celery.readthedocs.org/en/latest/userguide/tasks.html#Task.rate_limit); it is the number of tasks that can run in a given time frame.
You could set this to '100/m' (one hundred per minute), meaning your system allows 100 tasks per minute. It's important to notice that this setting is not per user or per task instance - it's per time frame.
Have you thought about this approach instead of limiting per user?
In order to have a 'rate_limit' per task and user pair, you will have to build it yourself. I think (I'm not sure) you could use a TaskRouter or a signal, depending on your needs.
TaskRouters (http://celery.readthedocs.org/en/latest/userguide/routing.html#routers) allow you to route tasks to a specific queue by applying some logic.
Signals (http://celery.readthedocs.org/en/latest/userguide/signals.html) allow you to execute code at a few well-defined points of the task's scheduling cycle.
An example of a router's logic could be:
if task == 'A':
    user_id = args[0]  # in this task the user_id is the first arg
    qty = get_task_qty('A', user_id)
    if qty > LIMIT_FOR_A:
        return
elif task == 'B':
    user_id = args[2]  # in this task the user_id is the third arg
    qty = get_task_qty('B', user_id)
    if qty > LIMIT_FOR_B:
        return
return {'queue': 'default'}
With the approach above, every time a task starts you would increment a counter for the user_id/task_type pair by one in some shared place (for example Redis), and every time a task finishes you would decrement that value in the same place.
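A minimal sketch of that counter using Django's cache as the shared store (the key scheme and helper names here are my own, chosen to match the get_task_qty call used in the router example above):

from django.core.cache import cache

def _key(task_type, user_id):
    return 'running_tasks:%s:%s' % (task_type, user_id)

def get_task_qty(task_type, user_id):
    return cache.get(_key(task_type, user_id), 0)

def incr_task_qty(task_type, user_id):
    key = _key(task_type, user_id)
    cache.add(key, 0)  # create the counter if it does not exist yet
    return cache.incr(key)

def decr_task_qty(task_type, user_id):
    try:
        return cache.decr(_key(task_type, user_id))
    except ValueError:  # counter missing or expired; nothing to decrement
        return 0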
It seems kind of complex, hard to maintain, and it has a few failure points, in my opinion.
Another approach, which I think could fit, is to implement some kind of 'distributed semaphore' (similar to a distributed lock) per user and task, so that each task which needs to limit the number of running instances can use it.
The idea is: every time a task that should have 'concurrency control' starts, it has to check whether a resource is available; if not, it just returns.
You could imagine this idea as below:
@shared_task
def my_task_A(user_id, arg1, arg2):
    resource_key = 'my_task_A_{}'.format(user_id)

    available = SemaphoreManager.is_available_resource(resource_key)
    if not available:
        # no resources, so abort
        return

    # the resource could have been acquired just before us by another task
    if SemaphoreManager.acquire(resource_key):
        try:
            pass  # execute your code
        finally:
            SemaphoreManager.release(resource_key)
It's hard to say which approach you SHOULD take because that depends on your application.
Hope it helps you!
Good luck!

Prioritize some workflow executions over others

I've been using the Flow framework for Amazon SWF and I want to be able to run priority workflow executions and normal workflow executions. If there are priority tasks, then activities should pick up the priority tasks ahead of normal-priority tasks. What is the best way to accomplish this?
I'm thinking that the following might work but I wonder if there's a better/recommended approach.
I'll define two activity workers and two activity task lists for the activity: one priority list and one normal list. Each worker will use the same activity class.
Both workers will be run on the same host (ec2 instance).
On the workflow, I'll define two methods: startNormalWorkflow and startHighWorkflow. In the startHighWorkflow method, I can use ActivitySchedulingOptions to put the task on the high priority list.
The problem with this approach is that there is no guarantee that the high priority tasks are scheduled before normal tasks.
It's a good question, it had me scratching my head for a while.
Of course, there is more than one way to skin this cat and there are a number of valid solutions. I focused here on the simplest one I could conceive of, namely executing tasks in order of priority within a single workflow.
The scenario goes as follows: I define one activity worker serving two task lists, default_tasks and urgent_tasks, with trivial logic:
If there are pending tasks on the urgent_tasks list, then pick one from there,
Otherwise, pick a task from default_tasks
Execute any task selected.
The question is how to check if any high priority tasks are pending? CountPendingActivityTasks API comes to the rescue!
I know you use Flow for development. My example is written using boto.swf.layer2 as Python is so much easier for prototyping - but the idea remains the same and can be extended to a more complex scenario with high and low priority workflow executions.
So, to accomplish the above using boto.swf follow these steps:
Export credentials to the environment
$ export AWS_ACCESS_KEY_ID=your access key
$ export AWS_SECRET_ACCESS_KEY= your secret key
Get the code snippets
For convenience, you can fork it from github:
$ git clone git@github.com:oozie/stackoverflow.git
$ cd stackoverflow/amazon-swf/priority_tasks/
To bootstrap the domain and the workflow:
# domain_setup.py
import boto.swf.layer2 as swf
DOMAIN = 'stackoverflow'
VERSION = '1.0'
swf.Domain(name=DOMAIN).register()
swf.ActivityType(domain=DOMAIN, name='SomeActivity', version=VERSION, task_list='default_tasks').register()
swf.WorkflowType(domain=DOMAIN, name='MyWorkflow', version=VERSION, task_list='default_tasks').register()
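Run the setup script once to register the domain, activity type and workflow type (assuming the credentials exported above are the ones boto picks up):
$ python domain_setup.py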
Decider implementation:
# decider.py
import boto.swf.layer2 as swf

DOMAIN = 'stackoverflow'
ACTIVITY = 'SomeActivity'
VERSION = '1.0'


class MyWorkflowDecider(swf.Decider):
    domain = DOMAIN
    task_list = 'default_tasks'
    version = VERSION

    def run(self):
        history = self.poll()
        print history
        if 'events' in history:
            # Get a list of non-decision events to see what event came in last.
            workflow_events = [e for e in history['events']
                               if not e['eventType'].startswith('Decision')]

            decisions = swf.Layer1Decisions()

            last_event = workflow_events[-1]
            last_event_type = last_event['eventType']

            if last_event_type == 'WorkflowExecutionStarted':
                # At the start, get the worker to fetch the first assignment.
                decisions.schedule_activity_task(ACTIVITY+'1', ACTIVITY, VERSION, task_list='default_tasks')
                decisions.schedule_activity_task(ACTIVITY+'2', ACTIVITY, VERSION, task_list='urgent_tasks')
                decisions.schedule_activity_task(ACTIVITY+'3', ACTIVITY, VERSION, task_list='default_tasks')
                decisions.schedule_activity_task(ACTIVITY+'4', ACTIVITY, VERSION, task_list='urgent_tasks')
                decisions.schedule_activity_task(ACTIVITY+'5', ACTIVITY, VERSION, task_list='default_tasks')
            elif last_event_type == 'ActivityTaskCompleted':
                # Complete workflow execution after 5 completed activities.
                closed_activity_count = sum(1 for wf_event in workflow_events
                                            if wf_event.get('eventType') == 'ActivityTaskCompleted')
                if closed_activity_count == 5:
                    decisions.complete_workflow_execution()

            self.complete(decisions=decisions)
            return True
Prioritizing worker implementation:
# worker.py
import boto.swf.layer2 as swf

DOMAIN = 'stackoverflow'
VERSION = '1.0'


class PrioritizingWorker(swf.ActivityWorker):
    domain = DOMAIN
    version = VERSION

    def run(self):
        urgent_task_count = swf.Domain(name=DOMAIN).count_pending_activity_tasks('urgent_tasks').get('count', 0)
        if urgent_task_count > 0:
            self.task_list = 'urgent_tasks'
        else:
            self.task_list = 'default_tasks'

        activity_task = self.poll()
        if 'activityId' in activity_task:
            print urgent_task_count, 'urgent tasks in the queue. Executing ' + activity_task.get('activityId')
            self.complete()
        return True
Run the workflow from three instances of an interactive Python shell
Run the decider:
$ python -i decider.py
>>> while MyWorkflowDecider().run(): pass
...
Start an execution:
$ python -i decider.py
>>> swf.WorkflowType(domain='stackoverflow', name='MyWorkflow', version='1.0', task_list='default_tasks').start()
Finally, kick off the worker and watch the tasks as they're getting executed:
$ python -i worker.py
>>> while PrioritizingWorker().run(): pass
...
2 urgent tasks in the queue. Executing SomeActivity2
1 urgent tasks in the queue. Executing SomeActivity4
0 urgent tasks in the queue. Executing SomeActivity5
0 urgent tasks in the queue. Executing SomeActivity1
0 urgent tasks in the queue. Executing SomeActivity3
It turns out that using a separate task list that you have to check first doesn't work well.
There are a couple of problems.
First, the count API doesn't update reliably. So you may get 0 tasks even when there are urgent tasks in the queue.
Second, the call that polls for tasks hangs if there are no tasks available. So when you poll for the non-urgent tasks, that will "stick" for either 2 minutes, or until you have a non-urgent task to do.
So this can cause all kinds of problems in your workflow.
For this to work, SWF would have to implement a polling API that could return the first task from a list of task lists. Then it would be much easier.

How should I implement callback for taskset in celery

Question
I use celery to launch task sets that look like this:
1. I perform a batch of tasks that can be run in parallel; the number of tasks in this batch varies from tens to a couple of thousands.
2. I aggregate the results of these tasks into a single answer, then do something with this answer - like store it in the database, save it to a special result file, and so on. Basically, after the tasks are done executing, I have to call a function that has the following signature:
def callback(result_file_name, task_result_list):
    pass  # store in file

def callback(entity_key, task_result_list):
    pass  # store in db
For now, step 1 is done in the Celery queue and step 2 is done outside Celery:
tasks = []
# add tasks to the tasks list
task_group = group()
task_group.tasks = tasks
result = task_group.apply_async()
res = result.join()
# Aggregate results
# Save results to file, database whatever
This approach is cumbersome since I have to block a single thread until all the tasks are performed (which can take a couple of hours).
I would like to somehow move step 2 into Celery as well --- essentially I would need to add a callback to the entire taskset (as far as I know this is unsupported in Celery) or submit a task that is executed after all these subtasks.
Does anyone have an idea how to do it? I use it in the Django environment, so I can store some state in the database.
To sum up my recent findings
Chords won't do
I can't use chords straightforwardly, because chords only let me create callbacks that look this way:
def callback(task_result_list):
    pass  # store in file
There is no obvious way to pass additional parameters to the callback (especially because these callbacks can't be local functions).
Using the database won't do either
I can store results using TaskSetMeta, but this entity has no status field --- so even if I added a signal to TaskSetMeta, I'd have to poll task results, which could have significant overhead.
Well, the answer was really straightforward, and I can indeed use chords --- additional parameters (like the report file name and so on) must be passed as kwargs.
Here is the chord task:
@task
def print_and_sum(to_sum, file_name):
    print file_name
    print sum(to_sum)
    return file_name, sum(to_sum)
Here is how to instantiate it:
subtasks = [...]
result = chord(subtasks)(print_and_sum.subtask(kwargs={'file_name' : 'report_file.csv'}))
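Equivalently, assuming a Celery version with the .s() signature shorthand, the callback can be written as a partial signature; the chord passes the list of subtask results as the first positional argument, so file_name is supplied as a keyword:

result = chord(subtasks)(print_and_sum.s(file_name='report_file.csv'))
# result.get() eventually returns ('report_file.csv', <sum of all subtask results>)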

Django Celery: Execute only one instance of a long-running process

I have a long-running process that must run every five minutes, but more than one instance of the process should never run at the same time. The process should not normally run past five minutes, but I want to be sure that a second instance does not start up if the first runs over.
Per a previous recommendation, I'm using Django Celery to schedule this long-running task.
I don't think a periodic task will work, because if I have a five minute period, I don't want a second task to execute if another instance of the task is running already.
My current experiment is as follows: at 8:55, an instance of the task starts to run. When the task is finishing up, it will trigger another instance of itself to run at the next five min mark. So if the first task finished at 8:57, the second task would run at 9:00. If the first task happens to run long and finish at 9:01, it would schedule the next instance to run at 9:05.
I've been struggling with a variety of cryptic errors when doing anything more than the simple example below and I haven't found any other examples of people scheduling tasks from a previous instance of itself. I'm wondering if there is maybe a better approach to doing what I am trying to do. I know there's a way to name one's tasks; perhaps there's a way to search for running or scheduled instances with the same name? Does anyone have any advice to offer regarding running a task every five min, but ensuring that only one task runs at a time?
Thank you,
Joe
In mymodule/tasks.py:
import datetime
from celery.decorators import task

@task
def test(run_periodically, frequency):
    run_long_process()
    now = datetime.datetime.now()
    # Run this task every x minutes, where x is an integer specified by frequency
    eta = (now - datetime.timedelta(minutes=now.minute % frequency,
                                    seconds=now.second,
                                    microseconds=now.microsecond)
           ) + datetime.timedelta(minutes=frequency)
    task = test.apply_async(args=[run_periodically, frequency], eta=eta)
From a ./manage.py shell:
from mymodule import tasks
result = tasks.test.apply_async(args=[True, 5])
You can use periodic tasks paired with a special lock which ensures the tasks are executed one at a time. Here is a sample implementation from Celery documentation:
http://ask.github.com/celery/cookbook/tasks.html#ensuring-a-task-is-only-executed-one-at-a-time
The method you describe, scheduling the next run from the previous execution, can stop the chain of tasks entirely if one of them fails.
I personally solve this issue by caching a flag under a key like task.name + args:
from functools import wraps
from django.core.cache import cache

def perevent_run_duplicate(func):
    """
    This decorator sets a flag in the cache for a task with specific args
    and waits for completion. If another call with the same cache key is
    received while this task is running, it is ignored to avoid conflicts.
    After the task finishes, the key is deleted from the cache.
    - cache keys have a one-day timeout
    """
    @wraps(func)
    def outer(self, *args, **kwargs):
        if cache.get(f"running_task_{self.name}_{args}", False):
            return
        else:
            cache.set(f"running_task_{self.name}_{args}", True, 24 * 60 * 60)
            try:
                return func(self, *args, **kwargs)
            finally:
                cache.delete(f"running_task_{self.name}_{args}")

    return outer
This decorator manages task calls to prevent duplicate calls for a task with the same args.
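For the wrapper to receive self as written, the decorator would be applied to a bound task - a hypothetical usage sketch, assuming shared_task(bind=True):

@shared_task(bind=True)
@perevent_run_duplicate
def my_long_running_task(self, some_id):
    run_long_process(some_id)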
We use Celery Once and it has resolved similar issues for us. Github link - https://github.com/cameronmaske/celery-once
It has a very intuitive interface and is easier to incorporate than the one recommended in the Celery documentation.