Celery: clean way of revoking the entire chain from within a task - django

My question is probably pretty basic, but I still can't find a solution in the official docs. I have defined a Celery chain inside my Django application, performing a set of tasks that depend on each other:
chain(tasks.apply_fetching_decision.s(x, y),
      tasks.retrieve_public_info.s(z, x, y),
      tasks.public_adapter.s())()
Obviously the second and the third tasks need the output of the parent, that's why I used a chain.
Now the question: I need to programmatically revoke the 2nd and the 3rd tasks if a test condition in the 1st task fails. How can I do it in a clean way? I know I can revoke the tasks of a chain from within the method where I have defined the chain (see this question and this doc), but inside the first task I have no visibility of subsequent tasks nor of the chain itself.
Temporary solution
My current solution is to skip the computation inside the subsequent tasks based on result of the previous task:
@shared_task
def retrieve_public_info(result, x, y):
    if not result:
        return []
    ...

@shared_task
def public_adapter(result, z, x, y):
    for r in result:
        ...
But this "workaround" has some flaw:
Adds unnecessary logic to each task (based on predecessor's result), compromising reuse
Still executes the subsequent tasks, with all the resulting overhead
I haven't played too much with passing references of the chain to tasks for fear of messing things up. I also admit I haven't tried the exception-throwing approach, because I think that choosing not to proceed through the chain can be a functional (thus non-exceptional) scenario...
Thanks for helping!

I think I found the answer to this issue: this seems the right way to proceed, indeed. I wonder why such a common scenario is not documented anywhere, though.
For completeness I post the basic code snippet:
@app.task(bind=True)  # Note that we need bind=True for self to work
def task1(self, other_args):
    # do_stuff
    if end_chain:
        self.request.callbacks[:] = []
    ...
Update
I implemented a more elegant way to cope with the issue and I want to share it with you. I am using a decorator called revoke_chain_authority, so that it can revoke the chain automatically, without rewriting the code I previously described.
from functools import wraps

class RevokeChainRequested(Exception):
    def __init__(self, return_value):
        Exception.__init__(self, "")
        # Now for your custom code...
        self.return_value = return_value

def revoke_chain_authority(a_shared_task):
    """
    @see: https://gist.github.com/bloudermilk/2173940
    @param a_shared_task: a @shared_task(bind=True) celery function.
    @return:
    """
    @wraps(a_shared_task)
    def inner(self, *args, **kwargs):
        try:
            return a_shared_task(self, *args, **kwargs)
        except RevokeChainRequested as e:
            # Drop subsequent tasks in chain (if not EAGER mode)
            if self.request.callbacks:
                self.request.callbacks[:] = []
            return e.return_value
    return inner
This decorator can be used on a shared task as follows:
@shared_task(bind=True)
@revoke_chain_authority
def apply_fetching_decision(self, latitude, longitude):
    # ...
    if condition:
        raise RevokeChainRequested(False)
Please note the use of @wraps. It is necessary to preserve the signature of the original function; otherwise the signature is lost and Celery will make a mess of calling the right wrapped task (e.g. it will always call the first registered function instead of the correct one).

As of Celery 4.0, what I found to be working is to remove the remaining tasks from the current task instance's request using the statement:
self.request.chain = None
Let's say you have a chain of tasks a.s() | b.s() | c.s(). You can only access the self variable inside a task if you bind the task by passing bind=True as argument to the tasks' decorator.
@app.task(name='main.a', bind=True)
def a(self):
    if something_happened:
        self.request.chain = None
If something_happened is truthy, b and c wouldn't be executed.
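For completeness, a minimal sketch of composing and launching such a chain (assuming b and c are defined with bind=True like a above):

from celery import chain

workflow = chain(a.s(), b.s(), c.s())
# If a() sets self.request.chain = None, b and c are never executed.
workflow.apply_async()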

Related

Is @transaction.atomic cheap?

This is mostly curiosity, but is the DB penalty for wrapping an entire view with @transaction.atomic a negligible one?
I'm thinking of views where the GET of a form or its re-display after a validation fail involves processing querysets. (ModelChoiceFields, for example, or fetching an object that the template displays.)
It seems to me to be far more natural to use with transaction.atomic() around the block of code which actually alters a bunch of related DB objects only after the user's inputs have validated.
Am I missing something?
From the source code:
def atomic(using=None, savepoint=True, durable=False):
    # Bare decorator: @atomic -- although the first argument is called
    # `using`, it's actually the function being decorated.
    if callable(using):
        return Atomic(DEFAULT_DB_ALIAS, savepoint, durable)(using)
    # Decorator: @atomic(...) or context manager: with atomic(...): ...
    else:
        return Atomic(using, savepoint, durable)
It's the same. In both cases the function is returning an Atomic object which handles whether the transaction should commit or not.
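Both spellings end up constructing the same Atomic object; the only difference is whether it wraps the whole view or just a block. A quick illustration of the two equivalent usages (view bodies are placeholders):

from django.db import transaction

@transaction.atomic
def my_view(request):
    # the whole view runs in one transaction
    ...

def my_other_view(request):
    # form validation and queryset processing happen outside the transaction
    with transaction.atomic():
        # only this block is transactional
        ...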

How to call a function after finishing recursive asynchronous jobs in Python?

I use scrapy for scraping this site.
I want to save all the sub-categories in an array, then get the corresponding pages (pagination)
As a first step I have:
def start_requests(self):
    yield Request(start_urls[i], callback=self.get_sous_cat)
get_sous_cat is a function which gets all the sub-categories of a site, then asynchronously starts jobs to explore the sub-sub-categories recursively.
def get_sous_cat(self, response):
    # Put all the categories in an array
    catList = response.css('div.categoryRefinementsSection')
    if catList:
        for category in catList.css('a::attr(href)').extract():
            category = 'https://www.amazon.fr' + category
            print(category)
            self.arrayCategories.append(category)
            yield Request(category, callback=self.get_sous_cat)
When all the respective requests have been sent, I need to call this termination function:
def pagination(self, response):
    for i in range(len(self.arrayCategories)):
        # Do something with each sub-category
I tried this
def start_requests(self):
    yield Request(start_urls[i], callback=self.get_sous_cat)
    for subCat in range(0, len(self.arrayCategories)):
        yield Request(self.arrayCategories[subCat], callback=self.pagination)
Well done, this is a good question! Two small things:
a) use a set instead of an array. This way you won't have duplicates
b) site structure will change once a month/year. You will likely crawl more frequently. Break the spider into two; 1. The one that creates the list of category urls and runs monthly and 2. The one that gets as start_urls the file generated by the first
Now, if you really want to do it the way you do it now, hook the spider_idle signal (see here: Scrapy: How to manually insert a request from a spider_idle event callback?). This gets called when there are no further URLs to crawl and allows you to inject more. Set a flag or reset your list at that point so that the second time the spider is idle (after it has crawled everything), it doesn't re-inject the same category URLs forever.
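A rough sketch of hooking spider_idle, under the question's assumptions (arrayCategories and pagination come from the code above; the pagination_done flag is hypothetical, and the two-argument engine.crawl call reflects older Scrapy versions):

import scrapy
from scrapy import signals
from scrapy.http import Request
from scrapy.exceptions import DontCloseSpider

class CategorySpider(scrapy.Spider):
    name = 'categories'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(CategorySpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def on_idle(self, spider):
        # Inject the pagination requests only once, then let the spider finish.
        if not getattr(self, 'pagination_done', False):
            self.pagination_done = True
            for url in self.arrayCategories:
                self.crawler.engine.crawl(Request(url, callback=self.pagination), spider)
            raise DontCloseSpider  # keep the spider alive for the injected requests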
If, as it seems in your case, you don't want to do any fancy processing on the URLs but just crawl categories before other URLs, this is what the Request priority property is for (http://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-subclasses). Just set it to e.g. 1 for your category URLs and then the spider will follow those links before it processes any non-category links. This is more efficient, since it won't load those category pages twice as your current implementation would.
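Applied to the question's code, that is a one-keyword change (priority defaults to 0, and higher values are dequeued first):

yield Request(category, callback=self.get_sous_cat, priority=1)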
This is not "recursion", it's asynchronous jobs. What you need is a global counter (protected by a Lock) and if 0, do your completion :
from threading import Lock

class JobCounter(object):
    def __init__(self, completion_callback, *args, **kwargs):
        self.c = 0
        self.l = Lock()
        self.completion = (completion_callback, args, kwargs)

    def __iadd__(self, n):
        done = False
        with self.l:
            self.c += n
            if self.c <= 0:
                done = True
        if done:
            f, args, kwargs = self.completion
            f(*args, **kwargs)
        return self  # augmented assignment must return the object

    def __isub__(self, n):
        return self.__iadd__(-n)
each time you launch a job, do counter += 1
each time a job finishes, do counter -= 1
Note: this runs the completion in the thread of the last finishing job. If you want to run it in a particular thread, use a Condition instead of a Lock, and call notify() instead of making the call directly.
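A hypothetical usage sketch (run_job and the callback are placeholders, not part of the original answer):

counter = JobCounter(lambda: print('all jobs finished'))

def run_job(job):
    global counter
    counter += 1        # register the job before it starts
    try:
        job()
    finally:
        counter -= 1    # the completion fires when the count drops back to zero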

Send a success signal when the group of tasks in celery is finished

So I have a basic configuration: Django 1.6 + Celery 3.1. Say I have an example task:
import time

@app.task
def add(x, y):
    time.sleep(6)
    return {'result': x + y}
And a function that groups the tasks and returns the job id:
from celery import group

def nested_add(x, y):
    grouped_task = group(add.s(x, y) for i in range(0, 2))
    job = grouped_task.apply_async()
    job.save()
    return job.id
Now I want to perform some action when that group of tasks is finished, but if I add the app.task decorator to nested_add and try to catch task_success, it doesn't work properly. Any tips on what I should use?
There are actually several options. The simplest is to use a chord. A chord will wait until all sub-tasks are finished with some result and then return the overall result back. More can be found at http://ask.github.io/celery/userguide/tasksets.html. Another simple approach is to leverage the AsyncResult API's collect() method. More can be found here: http://celery.readthedocs.org/en/latest/reference/celery.result.html.
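A minimal chord sketch reusing the add task from the question (the on_complete callback name is just an example):

from celery import chord

@app.task
def on_complete(results):
    # results is the list of {'result': ...} dicts from the add sub-tasks
    return results

def nested_add(x, y):
    job = chord(add.s(x, y) for i in range(0, 2))(on_complete.s())
    return job.id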
Don't forget to configure your result backend; more can be found at http://celery.readthedocs.org/en/latest/getting-started/first-steps-with-celery.html#keeping-results. If you are using RabbitMQ as a broker, then configure it as a result backend too.
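For Celery 3.1 that can be a single setting (3.x-style setting name; adjust to your setup):

# settings.py
CELERY_RESULT_BACKEND = 'amqp'  # reuse RabbitMQ as the result backend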

"Lazy load" of data from a context processor

In each view of my application I need to have navigation menu prepared. So right now in every view I execute complicated query and store the menu in a dictionary which is passed to a template. In templates the variable in which I have the data is surrounded with "cache", so even though the queries are quite costly, it doesn't bother me.
But I don't want to repeat myself in every view. I guessed that the best place to prepare the menu is in my own context processor. And so I did write one, but I noticed that even when I don't use the data from the context processor, the queries used to prepare the menu are executed. Is there a way to "lazy load" such data from CP or do I have to use "low level" cache in CP? Or maybe there's a better solution to my problem?
Django has a SimpleLazyObject. In Django 1.3, this is used by the auth context processor (source code). This makes user available in the template context for every request, but the user is only accessed when the template contains {{ user }}.
You should be able to do something similar in your context processor.
from django.utils.functional import SimpleLazyObject

def my_context_processor(request):
    def complicated_query():
        do_stuff()
        return result
    return {
        'result': SimpleLazyObject(complicated_query),
    }
If you pass a callable object into the template context, Django will evaluate it when it is used in the template. This provides one simple way to do laziness: just pass in callables.
def my_context_processor(request):
    def complicated_query():
        do_stuff()
        return result
    return {'my_info': complicated_query}
The problem with this is that it does not memoize the call: if you use it multiple times in a template, complicated_query gets called multiple times.
The fix is to use something like SimpleLazyObject as in the other answer, or to use something like functools.lru_cache:
from functools import lru_cache

def my_context_processor(request):
    @lru_cache()
    def complicated_query():
        result = do_stuff()
        return result
    return {'my_info': complicated_query}
You can now use my_info in your template, and it will be evaluated lazily, just once.
Or, if the function already exists, you would do it like this:
from somewhere import complicated_query

def my_context_processor(request):
    return {'my_info': lru_cache()(complicated_query)}
I would prefer this method over SimpleLazyObject because the latter can produce some strange bugs sometimes.
(I was the one who originally implemented LazyObject and SimpleLazyObject, and discovered for myself that there is curse on any code artefact labelled simple.)

Any thoughts on A/B testing in Django based project? [closed]

We just started doing A/B testing for our Django-based project. Can I get some information on best practices or useful insights about A/B testing?
Ideally each new testing page will be differentiated with a single parameter (just like Gmail). mysite.com/?ui=2 should give a different page. So for every view I need to write a decorator to load different templates based on the 'ui' parameter value. And I don't want to hard-code any template names in decorators. So what should the URL pattern in urls.py look like?
It's useful to take a step back and abstract what A/B testing is trying to do before diving into the code. What exactly will we need to conduct a test?
A Goal that has a Condition
At least two distinct Paths to meet the Goal's Condition
A system for sending viewers down one of the Paths
A system for recording the Results of the test
With this in mind let's think about implementation.
The Goal
When we think about a Goal on the web usually we mean that a user reaches a certain page or that they complete a specific action, for example successfully registering as a user or getting to the checkout page.
In Django we could model that in a couple of ways - perhaps naively inside a view, calling a function whenever a Goal has been reached:
def checkout(request):
    a_b_goal_complete(request)
    ...
But that doesn't help because we'll have to add that code everywhere we need it - plus if we're using any pluggable apps we'd prefer not to edit their code to add our A/B test.
How can we introduce A/B Goals without directly editing view code? What about a Middleware?
class ABMiddleware:
    def process_request(self, request):
        if a_b_goal_conditions_met(request):
            a_b_goal_complete(request)
That would allow us to track A/B Goals anywhere on the site.
How do we know that a Goal's conditions have been met? For ease of implementation I'll suggest that we know a Goal has had its conditions met when a user reaches a specific URL path. As a bonus we can measure this without getting our hands dirty inside a view. To go back to our example of registering a user, we could say that this goal has been met when the user reaches the URL path:
/registration/complete
So we define a_b_goal_conditions_met:
def a_b_goal_conditions_met(request):
    return request.path == "/registration/complete"
Paths
When thinking about Paths in Django it's natural to jump to the idea of using different templates. Whether there is another way remains to be explored. In A/B testing you make small differences between two pages and measure the results. Therefore it should be a best practice to define a single base Path template from which all Paths to the Goal should extend.
How should we render these templates? A decorator is probably a good start. It's a best practice in Django to include a template_name parameter in your views, and a decorator could alter this parameter at runtime.
@a_b
def registration(request, extra_context=None, template_name="reg/reg.html"):
    ...
You could see this decorator either introspecting the wrapped function and modifying the template_name argument or looking up the correct templates from somewhere (like a Model). If we didn't want to add the decorator to every function we could implement this as part of our ABMiddleware:
class ABMiddleware:
    ...
    def process_view(self, request, view_func, view_args, view_kwargs):
        if should_do_a_b_test(...) and "template_name" in view_kwargs:
            # Modify the template name to one of our Path templates
            view_kwargs["template_name"] = get_a_b_path_for_view(view_func)
            response = view_func(request, *view_args, **view_kwargs)
            return response
We'd also need to add some way to keep track of which views have A/B tests running, etc.
A system for sending viewers down a Path
In theory this is easy, but there are a lot of different implementations, so it's not clear which one is best. We know a good system should divide users evenly down the Paths. Some hash method must be used; maybe you could use the modulus of a memcache counter divided by the number of Paths, or maybe there is a better way, as in the sketch below.
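One possible sketch: hash a stable per-visitor key and take it modulo the number of Paths, so the same visitor always lands on the same variant (using the session key here is an assumption):

import hashlib

def assign_path(request, num_paths):
    key = request.session.session_key or ''
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % num_paths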
A system for recording the Results of the Test
We need to record how many users went down which Path. We'll also need access to this information when the user reaches the Goal (we need to be able to say which Path they came down to meet the Condition of the Goal). We'll use some kind of Model(s) to record the data, and either Django sessions or cookies to persist the Path information until the user meets the Goal condition.
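A minimal sketch of such Models (field names are illustrative, not a finished design):

from django.db import models

class Experiment(models.Model):
    name = models.CharField(max_length=100)
    goal_path = models.CharField(max_length=200)    # e.g. "/registration/complete"

class PathAssignment(models.Model):
    experiment = models.ForeignKey(Experiment, on_delete=models.CASCADE)
    path_index = models.PositiveIntegerField()      # which Path the visitor was sent down
    session_key = models.CharField(max_length=40)
    converted = models.BooleanField(default=False)  # set when the Goal condition is met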
Closing Thoughts
I've given a lot of pseudo code for implementing A/B testing in Django - the above is by no means a complete solution but a good start towards creating a reusable framework for A/B testing in Django.
For reference you may want to look at Paul Mar's Seven Minute A/Bs on GitHub - it's the ROR version of the above!
http://github.com/paulmars/seven_minute_abs/tree/master
Update
On further reflection and investigation of Google Website Optimizer it's apparent that there are gaping holes in the above logic. By using different templates to represent Paths you break all caching on the view (or if the view is cached it will always serve the same Path!). Instead of using Paths, I would steal GWO terminology and use the idea of Combinations: one specific part of a template changing, for instance changing the <h1> tag of a site.
The solution would involve template tags which would render down to JavaScript. When the page is loaded in the browser the JavaScript makes a request to your server which fetches one of the possible Combinations.
This way you can test multiple combinations per page while preserving caching!
Update
There is still room for template switching. Say, for example, you introduce an entirely new homepage and want to test its performance against the old homepage: you'd still want to use the template-switching technique. The thing to keep in mind is that you're going to have to figure out some way to switch between X number of cached versions of the page. To do this you'd need to override the standard cache middleware to see if there is an A/B test running on the requested URL. Then it could choose the correct cached version to show!
Update
Using the ideas described above I've implemented a pluggable app for basic A/B testing Django. You can get it off Github:
http://github.com/johnboxall/django-ab/tree/master
If you use GET parameters like you suggested (?ui=2), then you shouldn't have to touch urls.py at all. Your decorator can inspect request.GET['ui'] and find what it needs.
To avoid hardcoding template names, maybe you could wrap the return value from the view function? Instead of returning the output of render_to_response, you could return a tuple of (template_name, context) and let the decorator mangle the template name. How about something like this? WARNING: I haven't tested this code
def ab_test(view):
    def wrapped_view(request, *args, **kwargs):
        template_name, context = view(request, *args, **kwargs)
        if 'ui' in request.GET:
            template_name = '%s_%s' % (template_name, request.GET['ui'])
            # i.e., 'folder/template.html' becomes 'folder/template.html_2'
        return render_to_response(template_name, context)
    return wrapped_view
This is a really basic example, but I hope it gets the idea across. You could modify several other things about the response, such as adding information to the template context. You could use those context variables to integrate with your site analytics, like Google Analytics, for example.
As a bonus, you could refactor this decorator in the future if you decide to stop using GET parameters and move to something based on cookies, etc.
Update
If you already have a lot of views written, and you don't want to modify them all, you could write your own version of render_to_response.
def render_to_response(template_name, dictionary=None, context_instance=None, mimetype=None):
    return (template_name, dictionary, context_instance, mimetype)

def ab_test(view):
    from django.shortcuts import render_to_response as old_render_to_response
    def wrapped_view(request, *args, **kwargs):
        template_name, context, context_instance, mimetype = view(request, *args, **kwargs)
        if 'ui' in request.GET:
            template_name = '%s_%s' % (template_name, request.GET['ui'])
            # i.e., 'folder/template.html' becomes 'folder/template.html_2'
        return old_render_to_response(template_name, context, context_instance=context_instance, mimetype=mimetype)
    return wrapped_view

@ab_test
def my_legacy_view(request, param):
    return render_to_response('mytemplate.html', {'param': param})
Justin's response is right... I recommend you vote for that one, as he was first. His approach is particularly useful if you have multiple views that need this A/B adjustment.
Note, however, that you don't need a decorator, or alterations to urls.py, if you have just a handful of views. If you left your urls.py file as is...
(r'^foo/', my.view.here),
... you can use request.GET to determine the view variant requested:
def here(request):
    variant = request.GET.get('ui', some_default)
If you want to avoid hardcoding template names for the individual A/B/C/etc views, just make them a convention in your template naming scheme (as Justin's approach also recommends):
def here(request):
    variant = request.GET.get('ui', some_default)
    template_name = 'heretemplates/page%s.html' % variant
    try:
        return render_to_response(template_name)
    except TemplateDoesNotExist:
        return render_to_response('oops.html')
Code based on the one by Justin Voss:
def ab_test(force=None):
    def _ab_test(view):
        def wrapped_view(request, *args, **kwargs):
            request, template_name, cont = view(request, *args, **kwargs)
            if 'ui' in request.GET:
                request.session['ui'] = request.GET['ui']
            if 'ui' in request.session:
                cont['ui'] = request.session['ui']
            else:
                if force is None:
                    cont['ui'] = '0'
                else:
                    return redirect_to(request, force)
            return direct_to_template(request, template_name, extra_context=cont)
        return wrapped_view
    return _ab_test
Example functions using the code:
@ab_test()
def index1(request):
    return (request, 'website/index.html', locals())

@ab_test('?ui=33')
def index2(request):
    return (request, 'website/index.html', locals())
What happens here:
1. The passed UI parameter is stored in the session variable
2. The same template loads every time, but a context variable {{ui}} stores the UI id (you can use it to modify the template)
3. If user enters the page without ?ui=xx then in case of index2 he's redirected to '?ui=33', in case of index1 the UI variable is set to 0.
I use 3 to redirect from the main page to Google Website Optimizer which in turn redirects back to the main page with a proper ?ui parameter.
You can also A/B test using Google Optimize. To do so you'll have to add Google Analytics to your site and then when you create a Google Optimize experiment each user will get a cookie with a different experiment variant (according to the weight for each variant). You can then extract the variant from the cookie and display various versions of your application. You can use the following snippet to extract the variant:
experiment_variations = {}
ga_exp = self.request.COOKIES.get("_gaexp")
if ga_exp:  # the cookie is absent until the user is enrolled in an experiment
    parts = ga_exp.split(".")
    experiments_part = ".".join(parts[2:])
    experiments = experiments_part.split("!")
    for experiment_str in experiments:
        experiment_parts = experiment_str.split(".")
        experiment_id = experiment_parts[0]
        variation_id = int(experiment_parts[2])
        experiment_variations[experiment_id] = variation_id
However there is a django package that integrates well with Google Optimize: https://github.com/adinhodovic/django-google-optimize/.
And here is a blog post on how to use the package and how Google Optimize works: https://hodovi.cc/blog/django-b-testing-google-optimize/.