AWS Lambda retry after certain time - amazon-web-services

I have a Lambda that, among other things, performs a GET request against some service every day at 5am, triggered by CloudWatch Events.
This service may or may not have the data I need at the time it is queried.
Therefore, if the data is not there, I need to re-invoke the Lambda at, say, 6am. If it's still not there, again at 7am, and so on.
How can I accomplish that using AWS infrastructure?

This seems like a very good use case for Step Functions.
Step Functions let you create a workflow out of AWS services, including Lambda, with decision branches and wait loops.
For example, you could create a workflow that is started daily at 5am and invokes the Lambda. The Lambda returns whether it could process the data or needs to wait longer. The state machine inspects the result and either ends the workflow, because the data was processed, or enters a wait state for an hour and then retries the function.
Check out this article that includes code samples for a workflow that is similar to yours.
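A minimal sketch of such a workflow, written as a Python dict in Amazon States Language and serialized to JSON. The function ARN, the state names, and the dataReady field returned by the Lambda are assumptions made for illustration; they are not taken from the question.

```python
import json

# Hypothetical ARN; replace it with the ARN of your own function.
FETCH_LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:fetch-data"

definition = {
    "Comment": "Invoke the Lambda, then wait an hour and retry until data is ready",
    "StartAt": "FetchData",
    "States": {
        # Assumes the Lambda returns a JSON object such as {"dataReady": true}.
        "FetchData": {
            "Type": "Task",
            "Resource": FETCH_LAMBDA_ARN,
            "Next": "IsDataReady",
        },
        # Branch on the Lambda's result.
        "IsDataReady": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.dataReady", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "WaitOneHour",
        },
        # Sleep for an hour, then loop back to the Lambda.
        "WaitOneHour": {"Type": "Wait", "Seconds": 3600, "Next": "FetchData"},
        "Done": {"Type": "Succeed"},
    },
}

# JSON text that can be used as the state machine definition.
definition_json = json.dumps(definition, indent=2)
```

The daily 5am schedule would then start the state machine instead of the Lambda. As written, the loop has no upper bound, so you may want to add a retry counter or an execution timeout.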

I had a similar situation: my Lambda needed to change its schedule on weekends, and this is how I solved it.
import datetime

import boto3

REGULAR_SCHEDULE = 'rate(20 minutes)'
WEEKEND_SCHEDULE = 'rate(1 hour)'
RULE_NAME = 'My Rule'


def lambda_handler(event, context):
    reschedule_event()
    keep_working()  # the Lambda's actual work (not shown)


def is_weekend():
    # Not part of the original snippet; assumed to mean Saturday or Sunday (UTC).
    return datetime.datetime.now(datetime.timezone.utc).weekday() >= 5


def reschedule_event():
    """
    Change the Lambda's schedule so that it rests on weekends :D
    """
    sched = boto3.client('events')
    current = sched.describe_rule(Name=RULE_NAME)
    if is_weekend() and 'minutes' in current['ScheduleExpression']:
        sched.put_rule(
            Name=RULE_NAME,
            ScheduleExpression=WEEKEND_SCHEDULE,
        )
    if not is_weekend() and 'hour' in current['ScheduleExpression']:
        sched.put_rule(
            Name=RULE_NAME,
            ScheduleExpression=REGULAR_SCHEDULE,
        )
I agree there must be a more proper way to do this, but time was short and that Lambda needed to go into production. You could do something similar: reschedule your rule when there's no data to retrieve, then go back to the original schedule once the data has arrived.

Related

Processing AWS Lambda messages in Batches

I am wondering something, and I really can't find information about it. Maybe it is not the way to go but, I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume batch messages. In my Lambda function I iterate each message, and if one fails, Lambda exits. And the cycle starts again.
I am wondering about a slightly different approach.
Let's assume I have three messages: A, B and C. I also take them in batches. Now if the message B fails (e.g. API call failed), I return message B to SQS and keep processing the message C.
Is it possible? If it is, is it a good approach? Because I see that I need to implement some extra complexity in Lambda and what not.
Thanks
There's an excellent article here. The relevant parts for you are...
- Using a batchSize of 1, so that messages succeed or fail on their own.
- Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
- Handling errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
- Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
import logging
import os

import boto3

logger = logging.getLogger(__name__)
sqs = boto3.client('sqs')
# The queue URL was a placeholder in the original; reading it from an
# environment variable named QUEUE_URL is an assumption.
QUEUE_URL = os.environ['QUEUE_URL']


def handler(event, context):
    failed = False
    for msg in event['Records']:
        try:
            # Do something with the message.
            handle_message(msg)
        except Exception:
            # Ok it failed, but allow the loop to finish.
            logger.exception('Failed to handle message')
            failed = True
        else:
            # The message was handled successfully. We can delete it now.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg['receiptHandle'],
            )
    # It doesn't matter what the error is. You just want to raise here
    # to ensure the trigger doesn't delete any of the failed messages.
    if failed:
        raise RuntimeError('Failed to process one or more messages')


def handle_message(msg):
    ...
For Node.js, check out https://www.npmjs.com/package/@middy/sqs-partial-batch-failure.
const middy = require('@middy/core')
const sqsBatch = require('@middy/sqs-partial-batch-failure')

const originalHandler = (event, context, cb) => {
  const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
  return Promise.allSettled(recordPromises)
}

const handler = middy(originalHandler)
  .use(sqsBatch())
Check out https://medium.com/@brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.
As of Nov 2019, AWS has introduced the concept of Bisect On Function Error, along with Maximum retries. If your function is idempotent, this can be used. Note that these settings are offered for stream event sources (Kinesis and DynamoDB Streams).
In this approach you should throw an error from the function even if only one item in the batch fails. AWS will split the batch into two and retry. Now one half of the batch should pass successfully. For the other half, the process continues until the bad record is isolated.
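The bisecting behaviour can be illustrated with a small local simulation. This is only a sketch of the idea and not AWS code; process_with_bisect and the handle callback are hypothetical names.

```python
def process_with_bisect(batch, handle):
    """Process a batch; on failure, split it in two and retry each half.

    Returns the records that still fail once they have been isolated.
    """
    try:
        for record in batch:
            handle(record)
        return []
    except Exception:
        if len(batch) == 1:
            # The bad record is isolated and cannot be split any further.
            return list(batch)
        mid = len(batch) // 2
        return (process_with_bisect(batch[:mid], handle)
                + process_with_bisect(batch[mid:], handle))
```

In this simulation the good records that precede the bad one are handled more than once, which is why the function has to be idempotent.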
Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.
If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach is to iterate through your list of events (e.g. [eventA, eventB, eventC]) and append each event that fails to a list of failed events. At the end, check whether that list has anything in it and, if it does, manually send those messages back to SQS (using SQS sendMessageBatch).
However, you should note that this puts the events at the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
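A rough sketch of that approach in Python, assuming boto3's SQS client is passed in as sqs_client. The helper names split_failures and requeue_failed are made up for this example.

```python
def split_failures(records, handle_message):
    """Run handle_message on every record and return the ones that raised."""
    failed = []
    for record in records:
        try:
            handle_message(record)
        except Exception:
            failed.append(record)
    return failed


def requeue_failed(sqs_client, queue_url, failed_records):
    """Send failed record bodies back to the queue, at most 10 per API call.

    Returns the number of messages SQS reported as successfully sent.
    """
    sent = 0
    for start in range(0, len(failed_records), 10):
        chunk = failed_records[start:start + 10]
        entries = [
            {"Id": str(index), "MessageBody": record["body"]}
            for index, record in enumerate(chunk)
        ]
        response = sqs_client.send_message_batch(QueueUrl=queue_url, Entries=entries)
        sent += len(response.get("Successful", []))
    return sent
```

If the handler then returns normally, Lambda deletes the whole original batch, so the re-sent copies are the only ones left in the queue. A production version should also check the Failed list in each response.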
The SQS/Lambda integration supports reporting partial batch failures. Within the loop over the batch, you catch all exceptions and, if a message fails, add its messageId to an SQSBatchResponse. At the end, when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
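In Python, a handler along these lines might look as follows. process_record stands in for your own logic and is hypothetical. The response shape, batchItemFailures with an itemIdentifier per failed message, is the one described in the linked docs.

```python
def process_record(record):
    """Hypothetical per-message logic; raises when a message cannot be handled."""
    if record["body"] == "bad":
        raise ValueError("cannot process message")


def handler(event, context):
    failures = []
    for record in event["Records"]:
        try:
            process_record(record)
        except Exception:
            # Report only this message as failed; the rest of the batch continues.
            failures.append({"itemIdentifier": record["messageId"]})
    # An empty list tells Lambda that the whole batch succeeded.
    return {"batchItemFailures": failures}
```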
To add to the answer by David:
I implemented this, but a batch of A, B and C, with B failing, would still mark all three as complete. It turns out you need to explicitly configure the Lambda event source mapping to expect a batch failure response. This is done by adding the key FunctionResponseTypes with a list containing ReportBatchItemFailures as its value. Here are the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My SAM template looks like this after adding it:
Type: SQS
Properties:
  Queue: my-queue-arn
  BatchSize: 10
  Enabled: true
  FunctionResponseTypes:
    - ReportBatchItemFailures

Should I have concern about datastoreRpcErrors?

When I run Dataflow jobs that write to Google Cloud Datastore, I sometimes see in the metrics that I had one or two datastoreRpcErrors.
Since these Datastore writes usually contain a batch of keys, I am wondering whether, in the case of an RpcError, a retry happens automatically. If not, what would be a good way to handle these cases?
tl;dr: By default, Datastore writes that fail with an RPC error are retried automatically, up to 5 times.
I dug into the code of datastoreio in the Beam Python SDK. It looks like the final entity mutations are flushed in batches via DatastoreWriteFn().
# Flush the current batch of mutations to Cloud Datastore.
_, latency_ms = helper.write_mutations(
    self._datastore, self._project, self._mutations,
    self._throttler, self._update_rpc_stats,
    throttle_delay=_Mutate._WRITE_BATCH_TARGET_LATENCY_MS/1000)
The RPCError is caught by the following block of code in write_mutations in the helper. The commit method is decorated with @retry.with_exponential_backoff, and the default number of retries is set to 5. retry_on_rpc_error defines the concrete RPCError and SocketError reasons that trigger a retry.
for mutation in mutations:
    commit_request.mutations.add().CopyFrom(mutation)

@retry.with_exponential_backoff(num_retries=5,
                                retry_filter=retry_on_rpc_error)
def commit(request):
    # Client-side throttling.
    while throttler.throttle_request(time.time()*1000):
        ...
    try:
        response = datastore.commit(request)
        ...
    except (RPCError, SocketError):
        if rpc_stats_callback:
            rpc_stats_callback(errors=1)
        raise
...
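To make the retry behaviour concrete, here is a simplified stand-in for such a decorator. It is not Beam's implementation: the parameter names initial_delay and sleep are assumptions, and the real decorator has more options.

```python
import functools
import time


def with_exponential_backoff(num_retries=5, initial_delay=1.0,
                             retry_filter=lambda exc: True, sleep=time.sleep):
    """Retry the wrapped function, doubling the delay after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            retries_left = num_retries
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up when out of retries or the error is not retryable.
                    if retries_left <= 0 or not retry_filter(exc):
                        raise
                    sleep(delay)
                    delay *= 2
                    retries_left -= 1
        return wrapper
    return decorator
```

With num_retries=5 the function is called at most six times: the first attempt plus five retries. Once the retries are used up, the last exception propagates to the caller.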
I think you should first of all determine which kind of error occurred in order to see what your options are.
The official Datastore documentation has a list of all the possible errors and their error codes. Fortunately, they come with recommended actions for each.
My advice is that you implement their recommendations and look for alternatives if they are not effective for you.

Is there any way to define task quota in celery?

I have requirements:
I have a few resource-heavy tasks: exporting different reports that require big, complex queries and subqueries.
There are a lot of users.
I have built the project in Django, and I queue tasks using Celery.
I want to restrict users so that each can request 10 reports per minute. The idea is that they can submit hundreds of requests in 10 minutes, but I want Celery to execute only 10 tasks per user at a time, so that every user gets their turn.
Is there any way for Celery to do this?
Thanks
Celery has a rate_limit setting (http://celery.readthedocs.org/en/latest/userguide/tasks.html#Task.rate_limit) that controls how many tasks of a given type may run within a time frame.
You could set this to '100/m' (one hundred per minute), meaning that at most 100 of those tasks are started per minute. It's important to note that this limit is not per user: it applies to the task type as a whole, and Celery enforces it per worker instance.
Have you thought about this approach instead of limiting per user?
To have a 'rate_limit' per task and user pair, you will have to implement it yourself. I think (not sure) you could use a TaskRouter or a signal, depending on your needs.
TaskRouters (http://celery.readthedocs.org/en/latest/userguide/routing.html#routers) allow you to route tasks to a specific queue by applying some logic.
Signals (http://celery.readthedocs.org/en/latest/userguide/signals.html) allow you to execute code at a few well-defined points of the task's scheduling cycle.
An example of Router's logic could be:
if task == 'A':
    user_id = args[0]  # in this task the user_id is the first arg
    qty = get_task_qty('A', user_id)
    if qty > LIMIT_FOR_A:
        return
elif task == 'B':
    user_id = args[2]  # in this task the user_id is the third arg
    qty = get_task_qty('B', user_id)
    if qty > LIMIT_FOR_B:
        return
return {'queue': 'default'}
With the approach above, every time a task starts you should increment a counter for the user_id/task_type pair somewhere (for example in Redis), and every time a task finishes you should decrement that counter in the same place.
To me it seems kind of complex, hard to maintain, and prone to a few failure points.
Another approach, which I think could fit, is to implement some kind of 'distributed semaphore' (similar to a distributed lock) per user and task, and use it in each task that needs to limit the number of running instances.
The idea is that every time a task with 'concurrency control' starts, it has to check whether a resource is available and, if not, just return.
You could imagine this idea as below:
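from celery import shared_task


@shared_task
def my_task_A(user_id, arg1, arg2):
    resource_key = 'my_task_A_{}'.format(user_id)
    # SemaphoreManager is a hypothetical helper, not part of Celery.
    if not SemaphoreManager.is_available_resource(resource_key):
        # No resources available, so abort.
        return
    # Another task could have taken the resource just before us,
    # so the result of acquire() still has to be checked.
    if not SemaphoreManager.acquire(resource_key):
        return
    try:
        ...  # execute your code
    finally:
        # Release only what we actually acquired.
        SemaphoreManager.release(resource_key)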
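SemaphoreManager is not a real library class. A minimal in-memory sketch of what it could look like is shown below, assuming a fixed limit of 10 per key. It only works within a single process; since Celery workers usually run in several processes or on several machines, a real version would keep the counters in a shared store such as Redis.

```python
import threading


class SemaphoreManager:
    """In-memory counting semaphore keyed by resource name (illustration only)."""

    LIMIT = 10  # assumed maximum number of concurrent holders per key
    _lock = threading.Lock()
    _in_use = {}

    @classmethod
    def is_available_resource(cls, key):
        with cls._lock:
            return cls._in_use.get(key, 0) < cls.LIMIT

    @classmethod
    def acquire(cls, key):
        """Take one slot for key; return False if the limit is reached."""
        with cls._lock:
            count = cls._in_use.get(key, 0)
            if count >= cls.LIMIT:
                return False
            cls._in_use[key] = count + 1
            return True

    @classmethod
    def release(cls, key):
        """Give back one slot for key."""
        with cls._lock:
            count = cls._in_use.get(key, 0)
            if count <= 1:
                cls._in_use.pop(key, None)
            else:
                cls._in_use[key] = count - 1
```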
It's hard to say which approach you SHOULD take because that depends on your application.
Hope it helps you!
Good luck!

copying rather than modifying a job (APScheduler)

I'm writing a database-driven application with APScheduler (v3.0.0). Especially during development, I find myself frequently wanting to command a scheduled job to start running now without affecting its subsequent schedule.
It's possible to do this at job creation time, of course:
def dummy_job(arg):
    pass
sched.add_job(dummy_job, trigger='interval', hours=3, args=(None,))
sched.add_job(dummy_job, trigger=None, args=(None,))
However, if I already have a job scheduled with an interval or date trigger...
>>> sched.print_jobs()
Jobstore default:
job1 (trigger: interval[3:00:00], next run at: 2014-08-19 18:56:48 PDT)
... there doesn't seem to be a good way to tell the scheduler "make a copy of this job which will start right now." I've tried sched.reschedule_job(trigger=None), which schedules the job to start right now, but removes its existing trigger.
There's also no obvious, simple way to duplicate a job object while preserving its args and any other stateful properties. The interface I'm imagining is something like this:
sched.dup_job(id='job1', new_id='job2')
sched.reschedule_job('job2', trigger=None)
Clearly, APScheduler already contains an internal mechanism to copy job objects since repeated calls to get_job don't return the same object (that is, (sched.get_job(id) is sched.get_job(id))==False).
Has anyone else come up with a solution here? I'm thinking of posting a suggestion on the developers' site if not.
As you've probably figured out by now, that phenomenon is caused by the job stores instantiating jobs on the fly based on data loaded from the back end. To run a copy of a job immediately, this should do the trick:
job = sched.get_job(id)
sched.add_job(job.func, args=job.args, kwargs=job.kwargs)

get number of pending tasks for a specific user

In one of my applications I want to limit users to a specific number of document conversions each calendar month, and I want to notify them of the conversions they've made and the number of conversions they can still make in that calendar month.
So I do something like the following.
class CustomUser(models.Model):
    # user fields here

    def get_converted_docs(self):
        return self.document_set.filter(date__range=[start, end]).count()

    def remaining_docs(self):
        converted = self.get_converted_docs()
        return LIMIT - converted
Now, document conversion is done in the background using Celery. So there may be a situation where a conversion task is still pending, and in that case the above methods would let a user make an extra conversion, because the pending task is not included in the count.
How can I get the number of tasks pending for a specific CustomUser object here?
Update
OK, so I tried the following:
from celery.task.control import inspect


def get_scheduled_tasks():
    tasks = []
    scheduled = inspect().scheduled()
    for task in scheduled.values():
        tasks.extend(task)
    return tasks
This gives me a list of scheduled tasks, but all the values are now unicode strings. For the task mentioned above, the args look like this:
u'args': u'(<Document: test_document.doc>, <CustomUser: Test User>)'
Is there a way to decode these back into the original Django objects so that I can filter them?
Store the state of your documents somewhere else; don't inspect your queue.
Either create a separate model for that or, e.g., keep a state field on your document model, so that the state is tracked independently of your queue. This has several advantages:
- Inspecting the queue might be expensive, depending on the backend. And as you have seen, it can also turn out to be difficult.
- Your queue might not be persistent. If, e.g., your server crashes and you use something like Redis, you would lose this information, so it's good to have a log somewhere else from which you can reconstruct the queue.
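As a rough illustration of the idea, here is a plain-Python stand-in for a document model with a state field. In the real application this would be a Django model; the State values, the Document fields, and the LIMIT of 5 are assumptions, and the calendar-month date filter is left out for brevity.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    PENDING = "pending"      # queued or currently being converted
    CONVERTED = "converted"  # finished successfully
    FAILED = "failed"        # finished with an error, does not count


@dataclass
class Document:
    owner_id: int
    state: State = State.PENDING


LIMIT = 5  # assumed monthly limit


def remaining_docs(documents, owner_id, limit=LIMIT):
    """Count pending and converted documents against the user's limit."""
    used = sum(
        1 for doc in documents
        if doc.owner_id == owner_id
        and doc.state in (State.PENDING, State.CONVERTED)
    )
    return max(limit - used, 0)
```

The record is created in the PENDING state when the conversion is requested, and the Celery task updates it when it finishes, so the count never depends on what is in the queue.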