Celery Storing unrecoverable task failures for later resubmission

Celery Storing unrecoverable task failures for later resubmission - django

I'm using the djkombu transport for my local development, but I will probably be using amqp (rabbit) in production.
I'd like to be able to iterate over failures of a particular type and resubmit. This would be in the case of something failing on a server or some edge case bug triggered by some new variation in data.
So I could be resubmitting jobs up to 12 hours later after some bug is fixed or a third party site is back up.
My question is: Is there a way to access old failed jobs via the result backend and simply resubmit them with the same params etc?

You can probably access old jobs using:
CELERY_RESULT_BACKEND = "database"
and in your code:
from djcelery.models import TaskMeta
task = TaskMeta.objects.filter(task_id='af3185c9-4174-4bca-0101-860ce6621234')[0]
but I'm not sure you can find the arguments that the task is being started with ... Maybe something with TaskState...
I've never used it this way. But you might want to consider the task.retry feature?
An example from celery docs:
#task()
def task(*args):
try:
some_work()
except SomeException, exc:
# Retry in 24 hours.
raise task.retry(*args, countdown=60 * 60 * 24, exc=exc)

From IRC
<asksol> dpn`: task args and kwargs are not stored with the result
<asksol> dpn`: but you can create your own model and store it there
(for example using the task_sent signal)
<asksol> we don't store anything when the task is sent, only send a
message. but it's very easy to do yourself
This was what I was expecting, but hoped to avoid.
At least I have an answer now :)

Related

Using requests library in on_failure or on_sucesss hook causes the task to retry indefinitely

This is what I have:
import youtube_dl # in case this matters
class ErrorCatchingTask(Task):
# Request = CustomRequest
def on_failure(self, exc, task_id, args, kwargs, einfo):
# If I comment this out, all is well
r = requests.post(server + "/error_status/")
....
#app.task(base=ErrorCatchingTask, bind=True, ignore_result=True, max_retires=1)
def process(self, param_1, param_2, param_3):
...
raise IndexError
...
The worker will throw exception and then seemingly spawn a new task with a different task id Received task: process[{task_id}
Here are a couple of things I've tried:
Importing from celery.worker.request import Request and overriding on_failure and on_success functions there instead.
app.conf.broker_transport_options = {'visibility_timeout': 99999999999}
#app.task(base=ErrorCatchingTask, bind=True, ignore_result=True, max_retires=1)
Turn off DEBUG mode
Set logging to info
Set CELERY_IGNORE_RESULT to false (Can I use Python requests with celery?)
import requests as apicall to rule out namespace conflict
Money patch requests Celery + Eventlet + non blocking requests
Move ErrorCatchingTask into a separate file
If I don't use any of the hook functions, the worker will just throw the exception and stay idle until the next task is scheduled, which what I expect even when I use the hooks. Is this a bug? I searched through and through on github issues, but couldn't find the same problem. How do you debug a problem like this?
Django 1.11.16
celery 4.2.1

My problem was resolved after I used grequests
In my case, celery worker would reschedule as soon as conn.urlopen() was being called in requests/adapters.py. Another behavior I observed was if I had another worker from another project open in the same machine, sometimes infinite rescheduling would stop. This probably was some locking mechanism that was originally intended for other purpose kicking in.
So this led me to suspect that this is indeed threading issue and after researching whether requests library was thread safe, I found some people suggesting different things.. In theory, monkey patching should have a similar effect as using grequests, but it is not the same, so just use grequests or erequests library instead.
Celery Debugging instruction is here

Should I have concern about datastoreRpcErrors?

When I run dataflow jobs that writes to google cloud datastore, sometime I see the metrics show that I had one or two datastoreRpcErrors:
Since these datastore writes usually contain a batch of keys, I am wondering in the situation of RpcError, if some retry will happen automatically. If not, what would be a good way to handle these cases?

tl;dr: By default datastoreRpcErrors will use 5 retries automatically.
I dig into the code of datastoreio in beam python sdk. It looks like the final entity mutations are flushed in batch via DatastoreWriteFn().
# Flush the current batch of mutations to Cloud Datastore.
_, latency_ms = helper.write_mutations(
self._datastore, self._project, self._mutations,
self._throttler, self._update_rpc_stats,
throttle_delay=_Mutate._WRITE_BATCH_TARGET_LATENCY_MS/1000)
The RPCError is caught by this block of code in write_mutations in the helper; and there is a decorator #retry.with_exponential_backoff for commit method; and the default number of retry is set to 5; retry_on_rpc_error defines the concrete RPCError and SocketError reasons to trigger retry.
for mutation in mutations:
commit_request.mutations.add().CopyFrom(mutation)
#retry.with_exponential_backoff(num_retries=5,
retry_filter=retry_on_rpc_error)
def commit(request):
# Client-side throttling.
while throttler.throttle_request(time.time()*1000):
try:
response = datastore.commit(request)
...
except (RPCError, SocketError):
if rpc_stats_callback:
rpc_stats_callback(errors=1)
raise
...

I think you should first of all determine which kind of error occurred in order to see what are your options.
However, in the official Datastore documentation, there is a list of all the possible errors and their error codes . Fortunately, they come with recommended actions for each.
My advice is that your implement their recommendations and see for alternatives if they are not effective for you

Celery: dynamic queues at object level

I'm writing a django app to make polls which uses celery to put under control the voting system. Right now, I have two queues, default and polls, the first one with concurrency set to 8 and the second one set to 1.
$ celery multi start -A myproject.celery default polls -Q:default default -Q:polls polls -c:default 8 -c:polls 1
Celery routes:
CELERY_ROUTES = {
'polls.tasks.option_add_vote': {
'queue': 'polls',
},
'polls.tasks.option_subtract_vote': {
'queue': 'polls',
}
}
Task:
#app.task
def option_add_vote(pk):
"""
Updates given option id and its poll increasing vote number by 1.
"""
option = Option.objects.get(pk=pk)
try:
with transaction.atomic():
option.vote_quantity += 1
option.save()
option.poll.total_votes += 1
option.poll.save()
except IntegrityError as exc:
raise self.retry(exc=exc)
The option_add_vote method (task) updates the poll-object vote-number value adding 1 to the previous value. So, to avoid concurrency problems, I set the poll queue concurrency to 1. This allow the system to handle thousand of vote requests to be completed successfully.
The problem will be, as I can imagine, a bottle-neck when the system grows up.
So, I was thinking about some kind of dynamic queues where all vote requests to any options of a certain poll where routered to a custom queue. I think this will make the system more reliable and fast.
What do you think? How can I make it?
EDIT1:
I got a new idea thanks to Paul and Plahcinski. I'm storing the votes as objects in their own model (a user-options relationship). When someone votes an option it creates an object from this model, allowing me to count how many votes an option has. This free the system from the voting-concurrency problem, so it could be executed in parallel.
I'm thinking about using CELERYBEAT_SCHEDULE to cron a task that updates poll options based on the result of Vote.objects.get(pk=pk).count(). Maybe I could execute it every hour or do partial updates for those options that are getting new votes...
But, how do I give to the clients updated options in real time?
As Plahcinski says, I can have a cached value for my options in Redis (or any other mem-cached system?) and use it to temporally store this values, giving to any new request the cached value.
How can I mix this with my standar values in django models? Anyone could give me some code references or hints?
Am I in the good way or did I make mistakes?

What I would do is remove your incrementation for the database and move to redis and use the database model as your cached value. Have a celery beat that updates recently incremented redis keys to your database
http://redis.io/commands/INCR

What about just having a simple model that stores vote -1/+1 integers then a celery task that reconciles those with the FK object for atomic transactions and updates?

PyMongo find query returns empty/partial cursor when running in a Django+uWsgi project

We developed a REST API using Django & mongoDB (PyMongo driver). The problem is that, on some requests to the API endpoints, PyMongo cursor returns a partial response which contains less documents than it should (but it’s a completely valid JSON document).
Let me explain it with an example of one of our views:
def get_data(key):
return collection.find({'key': key}, limit=24)
def my_view(request):
key = request.POST.get('key')
query = get_data(key)
res = [app for app in query]
return JsonResponse({'list': res})
We're sure that there is more than 8000 documents matching the query, but in
some calls we get less than 24 results (even zero). The first problem we've
investigated was that we had more than one MongoClient definition in our code. By resolving this, the number of occurrences of the problem decreased, but we still had it in a lot of calls.
After all of these investigations, we've designed a test in which we made 16 asynchronous requests at the same time to the server. With this approach, we could reproduce the problem. On each of these 16 requests, 6-8 of them had partial results. After running this test we reduced uWsgi’s number of processes to 6 and restarted the server. All results were good but after applying another heavy load on the server, the problem began again. At this point, we restarted uwsgi service and again everything was OK. With this last experiment we have a clue now that when the uwsgi service starts running, everything is working correctly but after a period of time and heavy load, the server begins to return partial or empty results again.
The latest investigation we had was to run the API using python manage.py with DEBUG=False, and we had the problem again after a period of time in this situation.
We can't figure out what the problem is and how to solve it. One reason that we can think of is that Django closes pymongo’s connections before completion. Because the returned result is a valid JSON.
Our stack is:
nginx (with no cache enabled)
uWsgi
MemCached (disabled during debugging procedure)
Django (v1.8 on python 3)
PyMongo (v3.0.3)
Your help is really appreciated.
Update:
Mongo version:
db version v3.0.7
git version: 6ce7cbe8c6b899552dadd907604559806aa2e9bd
We are running single mongod instance. No sharding/replicating.
We are creating connection using this snippet:
con = MongoClient('localhost', 27017)
Update 2
Subject thread in Pymongo issue tracker.

Pymongo cursors are not thread safe elements. So using them like what I did in a multi-threaded environment will cause what I've described in question. On the other hand Python's list operations are mostly thread safe, and changing snippet like this will solve the problem:
def get_data(key):
return list(collection.find({'key': key}, limit=24))
def my_view(request):
key = request.POST.get('key')
query = get_data(key)
res = [app for app in query]
return JsonResponse({'list': res})

My very speculative guess is that you are reusing a cursor somewhere in your code. Make sure you are initializing your collection within the view stack itself, and not outside of it.
For example, as written, if you are doing something like:
import ...
import con
collection = con.documents
# blah blah code
def my_view(request):
key = request.POST.get('key')
query = collection.find({'key': key}, limit=24)
res = [app for app in query]
return JsonResponse({'list': res})
You could end us reusing a cursor. Better to do something like
import ...
import con
# blah blah code
def my_view(request):
collection = con.documents
key = request.POST.get('key')
query = collection.find({'key': key}, limit=24)
res = [app for app in query]
return JsonResponse({'list': res})
EDIT at asker's request for clarification:
The reason you need to define the collection within the view stack and not when the file loads is that the collection variable has a cursor, which is basically how the database and your application talk to each other. Cursors do things like keep track of where you are in a long list of data, in addition to a bunch of other stuff, but thats the important part.
When you create the collection cursor outside the view method, it will re-use the cursor on each request if it exists. So, if you make one request, and then another really, really fast right after that (like what happened when you applied high load), the cursor might only be half way through talking to the database, and so some of your data goes to the first request, and some to the second. The reason you would get NO data in a request would be if a cursor finished fetching data but hadn't been closed yet, so the next request tried to fetch data from the cursor, and there was none left to fetch in the query.
By moving the collection definition (and by association, the cursor definition) into the view stack, you will ALWAYS get a new cursor when you process a new request. You wont get any cross talking between your cursors and different requests, as each request cycle will have its own.

Occasional "ConnectionError: Cannot connect to the database" to mongo

We are currently testing a django based project that uses MongoEngine as the persistence layer. MongoEngine is based on pymongo and we're using version 1.6 and we are running a single instance setup of mongo.
What we have noticed is that occasionally, and for about 5 minutes, connections cannot be established to the mongo instance. Has anyone come across such behavior? any tips on how to improve reliability?

We had an issue with AutoReconnect which sounds similar to what you are describing. I ended up monkeypatching pymongo in my <project>/__init__.py file:
from pymongo.cursor import Cursor
from pymongo.errors import AutoReconnect
from time import sleep
import sys
AUTO_RECONNECT_ATTEMPTS = 10
AUTO_RECONNECT_DELAY = 0.1
def auto_reconnect(func):
"""
Function wrapper to automatically reconnect if AutoReconnect is raised.
If still failing after AUTO_RECONNECT_ATTEMPTS, raise the exception after
all. Technically this should be handled everytime a mongo query is
executed so you can gracefully handle the failure appropriately, but this
intermediary should handle 99% of cases and avoid having to put
reconnection code all over the place.
"""
def retry_function(*args, **kwargs):
attempts = 0
while True:
try:
return func(*args, **kwargs)
except AutoReconnect, e:
attempts += 1
if attempts > AUTO_RECONNECT_ATTEMPTS:
raise
sys.stderr.write(
'%s raised [%s] -- AutoReconnecting (#%d)...\n' % (
func.__name__, e, attempts))
sleep(AUTO_RECONNECT_DELAY)
return retry_function
# monkeypatch: wrap Cursor.__send_message (name-mangled)
Cursor._Cursor__send_message = auto_reconnect(Cursor._Cursor__send_message)
# (may need to wrap some other methods also, we'll see...)
This resolved the issue for us, but you might be describing something different?

Here is another solution that uses subclassing instead of monkey patching, and handles errors that might be raised while establishing the initial connection or while accessing a database. I just subclassed Connection/ReplicasetConnection, and handled raised AutoReconnect errors during instantiation and any method invocations. You can specify the number of retries and sleep time between retries in the constructor.
You can see the gist here: https://gist.github.com/2777345

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js