Processing AWS Lambda messages in Batches - amazon-web-services

I am wondering something, and I really can't find information about it. Maybe it is not the way to go but, I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume batch messages. In my Lambda function I iterate each message, and if one fails, Lambda exits. And the cycle starts again.
I am wondering about a slightly different approach.
Let's assume I have three messages: A, B and C. I also take them in batches. Now if the message B fails (e.g. API call failed), I return message B to SQS and keep processing the message C.
Is it possible? If it is, is it a good approach? Because I see that I need to implement some extra complexity in Lambda and what not.
Thanks

There's an excellent article here. The relevant parts for you are:
- Using a batchSize of 1, so that messages succeed or fail on their own.
- Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
- Handling errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
- Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
import logging

import boto3

logger = logging.getLogger(__name__)
sqs = boto3.client('sqs')
QUEUE_URL = '<queue_url>'  # replace with your queue's URL

def handler(event, context):
    failed = False
    for msg in event['Records']:
        try:
            # Do something with the message.
            handle_message(msg)
        except Exception:
            # Ok it failed, but allow the loop to finish.
            logger.exception('Failed to handle message')
            failed = True
        else:
            # The message was handled successfully. We can delete it now.
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=msg['receiptHandle'],
            )
    # It doesn't matter what the error is. You just want to raise here
    # to ensure the trigger doesn't delete any of the failed messages.
    if failed:
        raise RuntimeError('Failed to process one or more messages')

def handle_message(msg):
    ...

For Node.js, check out https://www.npmjs.com/package/@middy/sqs-partial-batch-failure.
const middy = require('@middy/core')
const sqsBatch = require('@middy/sqs-partial-batch-failure')
const originalHandler = (event, context, cb) => {
  const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
  return Promise.allSettled(recordPromises)
}
const handler = middy(originalHandler)
  .use(sqsBatch())
Check out https://medium.com/@brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.

As of Nov 2019, AWS has introduced the concept of Bisect On Function Error, along with Maximum retries. If your function is idempotent, this can be used.
In this approach you should throw an error from the function even if only one item in the batch fails. AWS will split the batch into two and retry. One half of the batch should then pass successfully. For the other half, the process continues until the bad record is isolated.
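To make the bisecting idea concrete, here is a small local simulation in Python. It is not what Lambda runs internally, only a sketch of the split-and-retry logic; `isolate_failures` and `process_batch` are made-up names for illustration. Note also that, as far as the AWS docs describe it, the `BisectBatchOnFunctionError` setting is listed for stream sources (Kinesis and DynamoDB Streams), so check whether it applies to your trigger.

```python
from typing import Callable, List, Sequence


def isolate_failures(batch: Sequence[str],
                     process_batch: Callable[[Sequence[str]], None]) -> List[str]:
    """Retry ever smaller halves of a failing batch until bad records are isolated."""
    try:
        process_batch(batch)  # the whole (sub-)batch succeeded: nothing to isolate
        return []
    except Exception:
        if len(batch) == 1:   # a single failing record has been isolated
            return list(batch)
        mid = len(batch) // 2
        return (isolate_failures(batch[:mid], process_batch)
                + isolate_failures(batch[mid:], process_batch))


def process_batch(records: Sequence[str]) -> None:
    # Hypothetical all-or-nothing processor: raises if any record is "bad".
    for record in records:
        if record == "B":
            raise ValueError(f"cannot process {record}")


print(isolate_failures(["A", "B", "C"], process_batch))  # ['B']
```

Records in a failing half that come before the bad one are processed again on each retry, which is why this approach only makes sense for idempotent functions.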

Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.

If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach is to iterate through the list of your events (e.g. [eventA, eventB, eventC]) and, whenever an event fails, append it to a list of failed events. At the end, check whether the list of failed events has anything in it, and if it does, manually send those messages back to SQS (using SQS sendMessageBatch).
However, you should note that this puts the events at the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
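A minimal Python sketch of that approach could look like the following. The function name `requeue_failed` and the `process` callback are hypothetical; `sqs_client` is assumed to be something like `boto3.client("sqs")`, and `send_message_batch` is the boto3 name for SendMessageBatch.

```python
from typing import Any, Callable, Dict, List


def requeue_failed(records: List[Dict[str, Any]],
                   process: Callable[[Dict[str, Any]], None],
                   sqs_client: Any,
                   queue_url: str) -> List[str]:
    """Process every record, then send the failed ones back to the queue."""
    failed = []
    for record in records:
        try:
            process(record)
        except Exception:
            failed.append(record)  # remember the failure, keep going

    # SendMessageBatch accepts at most 10 entries per call, so send in chunks.
    for start in range(0, len(failed), 10):
        chunk = failed[start:start + 10]
        sqs_client.send_message_batch(
            QueueUrl=queue_url,
            Entries=[{"Id": str(i), "MessageBody": r["body"]}
                     for i, r in enumerate(chunk)],
        )
        # A fuller version would also inspect the "Failed" list in the response.
    return [r["messageId"] for r in failed]
```

One thing to keep in mind with this design: the re-sent message is a new message, so its receive count starts over and a redrive policy based on maxReceiveCount will not kick in for it on its own.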

SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
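As a rough Python sketch of what such a handler might look like (the `handle_message` body is a placeholder that fails for one message, purely for illustration):

```python
import logging
from typing import Any, Dict

logger = logging.getLogger(__name__)


def handle_message(record: Dict[str, Any]) -> None:
    # Hypothetical processing step: pretend the API call fails for body "B".
    if record["body"] == "B":
        raise ValueError("API call failed")


def handler(event: Dict[str, Any], context: object) -> Dict[str, Any]:
    failures = []
    for record in event["Records"]:
        try:
            handle_message(record)
        except Exception:
            # Catch everything so one bad message does not fail the whole batch.
            logger.exception("Failed to handle message %s", record["messageId"])
            failures.append({"itemIdentifier": record["messageId"]})
    # Only the messages listed here are returned to the queue for a retry.
    return {"batchItemFailures": failures}
```

For a batch of A, B and C where B fails, this returns `{"batchItemFailures": [{"itemIdentifier": "<B's messageId>"}]}`. The response is only honoured when the event source mapping has `ReportBatchItemFailures` in its function response types.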

To add to the answer by David:
I implemented this, but a batch of A, B and C, with B failing, would still mark all three as complete. It turns out you need to explicitly configure the Lambda event source mapping to expect a batch failure response. This is done by adding the key FunctionResponseTypes with a list containing ReportBatchItemFailures as its value. Here are the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My SAM template looks like this after adding it:
Type: SQS
Properties:
  Queue: my-queue-arn
  BatchSize: 10
  Enabled: true
  FunctionResponseTypes:
    - ReportBatchItemFailures

Related

AWS Lambda - ensure complete processing - data persistence, event publishing etc

Say I have some code like this in a lambda function:
public void handleRequest(final S3Event s3Event) {
    //stuff removed for brevity
    fooRepository.add(foo);
    eventPublisher.publish(someEvent1);
    eventPublisher.publish(someEvent2);
}
Here the eventPublisher is a wrapper for an sqsClient which sends messages to an SQS queue.
Is there a pattern/approach that can be used to ensure that if one command fails, the whole lot gets aborted/reverted (i.e. everything fails or succeeds as one unit)?
E.g. in the above, if publishing someEvent1 fails, an exception might get thrown, causing the event to be reprocessed by the lambda. This could then result in duplicate data being added to the DB via the repeated call to fooRepository.add(foo)...

how can I prevent dynamoDB Stream handler from infinitely processing a record when I use batchItemFailures

I have a dynamoDB stream which is triggering a lambda handler that looks like this:
let failedRequestId: string
await asyncForEachSerial(event.Records, async (record) => {
  try {
    await handle(record.dynamodb.OldImage, record.dynamodb.NewImage, record, context)
    return true
  } catch (e) {
    failedRequestId = record.dynamodb.SequenceNumber
  }
  return false //break;
})
return {
  batchItemFailures:[ { itemIdentifier: failedRequestId } ]
}
I have my lambda set up with a DestinationConfig.onFailure pointing to a DLQ I configured in SQS. The idea behind the handler is to process a batch of events and interrupt at the first failure. Then it reports the most recent failure in 'batchItemFailures' which tells the stream to continue at that record next try. (I pulled the idea from this article)
My current issue is that if there is a genuine failure of my handle() function on one of those records, then my exit code will trigger that record as my checkpoint for the next handler call. However the dlq condition doesn't ever trigger and I end up processing that record over and over again. I should also note that I am trying to avoid reprocessing records multiple times since handle() is not idempotent.
How can I elegantly handle errors while maintaining batching, but without triggering my handle() function more than once for well-behaved stream records?
I'm not sure if you have found the answer you were looking for. I'll respond in case someone else comes across this issue.
There are 2 other parameters you'd want to use to avoid that issue. Quoting documentation (https://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html):
Retry attempts – The maximum number of times that Lambda retries when the function returns an error. This doesn't apply to service errors or throttles where the batch didn't reach the function.
Maximum age of record – The maximum age of a record that Lambda sends to your function.
Basically, you'll have to specify how many times failures should be retried and how far back in the events Lambda should look.
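If you manage the event source mapping through the API rather than a template, a sketch of setting those two parameters with boto3 might look like this. The function name, the default values and the ARN argument are assumptions for illustration; `lambda_client` is assumed to be `boto3.client("lambda")`.

```python
from typing import Any, Dict


def limit_stream_retries(lambda_client: Any,
                         mapping_uuid: str,
                         dlq_arn: str,
                         max_retries: int = 3,
                         max_age_seconds: int = 3600) -> Dict[str, Any]:
    """Cap retries and record age so a poison record reaches the failure destination."""
    params = {
        "UUID": mapping_uuid,
        # Stop retrying a failing batch after this many attempts.
        "MaximumRetryAttempts": max_retries,
        # Discard (and send to the destination) records older than this.
        "MaximumRecordAgeInSeconds": max_age_seconds,
        "FunctionResponseTypes": ["ReportBatchItemFailures"],
        "DestinationConfig": {"OnFailure": {"Destination": dlq_arn}},
    }
    lambda_client.update_event_source_mapping(**params)
    return params
```

Without such limits, a stream source by default keeps retrying a failing record until it expires from the stream, which would explain why the on-failure destination never seems to trigger.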

MismatchingMessageCorrelationException : Cannot correlate message ‘onEventReceiver’: No process definition or execution matches the parameters

We are facing a MismatchingMessageCorrelationException for the receive task in some cases (less than 5%).
The callback that notifies the receive task is done by:
protected void respondToCallWorker(
    @NonNull final String correlationId,
    final CallWorkerResultKeys result,
    @Nullable final Map<String, Object> variables
) {
    try {
        runtimeService.createMessageCorrelation("callWorkerConsumer")
            .processInstanceId(correlationId)
            .setVariables(variables)
            .setVariable("callStatus", result.toString())
            .correlateWithResult();
    } catch(Exception e) {
        e.printStackTrace();
    }
}
When I check the logs, I find that the executed query is this one:
select distinct RES.* from ACT_RU_EXECUTION RES
inner join ACT_RE_PROCDEF P on RES.PROC_DEF_ID_ = P.ID_
WHERE RES.PROC_INST_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0' and RES.SUSPENSION_STATE_ = '1'
and exists (select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = RES.ID_ and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer' )
Sometimes, when I look for the process instance in the database, I find it waiting in the receive task:
SELECT DISTINCT * FROM ACT_RU_EXECUTION RES
WHERE id_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
However, when I check the event subscription, it has not yet been created in the database:
select ID_ from ACT_RU_EVENT_SUBSCR EVT
where EVT.EXECUTION_ID_ = 'b2362197-3bea-11eb-a150-9e4bf0efd6d0'
and EVT.EVENT_TYPE_ = 'message'
and EVT.EVENT_NAME_ = 'callWorkerConsumer'
I think the solution is to save the "receive task" before getting the response for respondToCallWorker, but sadly I can't figure out how.
I tried "async before" on callWorker and on "Message consumer", but it did not work.
I also tried camunda.bpm.database.jdbc-batch-processing=false and got the same results.
I also tried parallel branches, but then I get an OptimisticLockingException as well as the MismatchingMessageCorrelationException.
Maybe I am doing it wrong.
Thanks for your help
This is an interesting problem. As you already found out, the error happens when you try to correlate the result from the "worker" before the main process has ended its transaction, so no message subscription is registered at the time you correlate.
This problem in process orchestration is described and analyzed in this blog post, which is definitely worth reading.
Taken from that post, here is a design that should solve the issue:
You make message send and receive parallel and put an async before the send task.
By doing so, the async continuation job for the send event and the message subscription are written in the same transaction, so when the async message send executes, you already have the subscription waiting.
Although this should work and solve the issue on the BPMN model level, it might be worth considering options that do not require remodeling the process.
First, instead of calling the worker directly from your delegate, you could (assuming you are on Spring Boot) publish a "CallWorkerCommand" (a simple POJO) and use a TransactionalEventListener on a Spring bean to execute the actual call. By doing so, the BPMN process first finishes its transaction, including the message subscription, and only afterwards does Spring execute your worker call.
Second, you could use a retry mechanism like resilience4j around your correlate message call, so in the rare cases where the result comes too quickly, you fail and retry a second later.
Another solution I could think of, since you seem to be using an "external worker" pattern here, is to use an external-task-service task directly, so the send/receive synchronization gets solved by the Camunda external worker API.
So many options to choose from. I would possibly prefer the external task, followed by the transactionalEventListener, but that is a matter of personal preference.
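To illustrate the retry option in a language-neutral way, here is a small Python sketch of a fixed-delay retry wrapper. In the actual Java project a library such as resilience4j would play this role; the `retry` helper below is a made-up stand-in, and the call being retried would be the message correlation.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(call: Callable[[], T],
          attempts: int = 3,
          delay_seconds: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on any exception with a fixed delay between attempts."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(delay_seconds)  # give the subscription time to be committed
    raise RuntimeError("unreachable")
```

The delay only has to be long enough for the process engine to commit the transaction that creates the message subscription, so a small number of attempts with a delay of about a second is a reasonable starting point.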

Called a lambda function once, it's executed twice

This is more of a concern than a question, but still, has anyone experienced this before? Does anyone know how to prevent it?
I have a lambda function (L1) which calls a second lambda function (L2), both written in Node.js (runtime: Node.js 8.10, and aws-sdk should be v2.488.0, but I'm just pasting that from the documentation). The short story is that L1 is supposed to call L2, and when it does, L2 is executed twice! I discovered this by writing logs to CloudWatch, where I could see one L1 log and two L2 logs.
Here's a simplified version of L1 and L2.
L1:
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();
module.exports = {
  handler: async (event, context, callback) => {
    const payload = { rnd: Math.random() };
    const lambdaParams = {
      FunctionName: 'L2',
      Qualifier: `dev`,
      Payload: JSON.stringify(payload),
    };
    console.log(`L1 calling: ${JSON.stringify(payload)}`);
    return await lambda.invoke(lambdaParams).promise();
  },
};
L2:
module.exports = {
  handler: async (event, context, callback) => {
    console.log(`L2 called: ${JSON.stringify(event)}`);
  },
};
In CloudWatch I can see one L1 calling {"rnd": 0.012072353149807702} and two L2 called: {"rnd": 0.012072353149807702}!
BTW, this does not happen all the time. This is part of a step function process which was going to call L1 10k times. My code is written in a way that if L2 is executed twice (per one call), it will fail the whole process (because L2 inserts a record to DB only if it does not exist and fails if it does). So far, I managed to log this behaviour three times. All of them processing the same 10k items, facing the issue at a different iteration each time.
Does anyone have the same experience? Or even better, knows how to make sure one call leads to exactly one execution?
Your lambda function must be idempotent, because it can be called twice in different situations.
https://aws.amazon.com/premiumsupport/knowledge-center/lambda-function-idempotent/
https://cloudonaut.io/your-lambda-function-might-execute-twice-deal-with-it/
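A minimal sketch of what "idempotent" could mean for a function like L2, written in Python for illustration. The `InMemoryStore` class stands in for a real database with an atomic insert-if-absent operation, and the `requestId` field is a hypothetical key the caller would have to supply.

```python
from typing import Any, Dict, Set


class InMemoryStore:
    """Stand-in for a database with an atomic 'insert if absent' operation."""

    def __init__(self) -> None:
        self._keys: Set[str] = set()

    def put_if_absent(self, key: str) -> bool:
        if key in self._keys:
            return False
        self._keys.add(key)
        return True


store = InMemoryStore()


def handler(event: Dict[str, Any], context: object) -> Dict[str, str]:
    # The caller supplies a stable key for each logical request.
    key = str(event["requestId"])
    if not store.put_if_absent(key):
        # Already handled: report success instead of failing the whole process.
        return {"status": "duplicate"}
    # ... perform the real work exactly once here ...
    return {"status": "processed"}
```

The point is that a second delivery of the same request is treated as a success with no side effects, rather than as an error that fails the surrounding workflow.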
With 10K lambda invokes it must be experiencing a failure and doing a retry.
From the documentation:
Asynchronous Invocation – Lambda retries function errors twice. If the
function doesn't have enough capacity to handle all incoming requests,
events may wait in the queue for hours or days to be sent to the
function. You can configure a dead-letter queue on the function to
capture events that were not successfully processed. For more
information, see Asynchronous Invocation.
If this is what is happening and you set up the dead-letter queue, you'll be able to isolate the failed event.
You can also use CloudWatch Logs Insights to quickly search for error messages from the lambda. Once you select the log group, this query should help you get started. Just change the time window.
fields @timestamp, @message
| filter @message like /(?i)(Exception|error|fail|5\d\d)/
| sort @timestamp desc
| limit 20
One case that may cause this is that in your L2 lambda you didn't return anything, which will lead the L1 lambda (the caller) to think there is an error with L2 and so the Retry mechanism is triggered. Try to return something in L2, even simply an "OK".
In my case, the call to my second lambda contained a try block that caught an exception and used traceback, and this triggered the lambda to retry the call. Normally, without traceback, this does not happen, and after commenting out the module it stopped happening.
The try block also contained a condition that was bound to fail in some cases, since it had to query for a resource with boto3. If the resource existed there was no problem, but otherwise it forced a general failure instead of being captured as an exception.

Aws Lambda call to external api after callback

I have a lambda function that sends an HTTP call to an API (let's say 'A'). After getting the response from 'A', it should immediately return the result to the caller, i.e. callback(null, success), within 10 seconds, and then save the data fetched from API 'A' to my external API (let's say 'B').
I tried it as shown below, but Lambda waits until the event loop is empty (it is waiting for the response from the second HTTP call).
I don't want to set callbackWaitsForEmptyEventLoop to false, since that freezes the event loop and the pending work only executes the next time the function is invoked.
request.get({url: endpointUrlA},
  function (errorA, responseA, bodyA) {
    callback(null, "success");
    request.post({url: endpointUrlB,
      body: bodyA,
      json: true}, function(errorB, responseB, bodyB){
        //Doesn't want to wait for this response
    });
    /* Also tried the callback(null, "success"); here too */
  });
Does anybody have any thoughts on how I can implement this? Thanks!
PS: I read the previous similar questions, but they didn't make this clear to me.
This seems like a good candidate for breaking up this lambda into two lambdas with some support code.
The first lambda handles the request to 'A' and places a message onto SQS. It then returns the success status to the caller.
A separate process monitors the SQS queue and invokes a second Lambda on it when a message becomes available.
This has several benefits.
Firstly, you no longer have a long-running lambda waiting for a second system that may be down to return.
Secondly, you're doing things asynchronously in the background.
Take a look at this blog post for an overview of how this could work in practice.
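A rough Python sketch of the two-lambda split, under stated assumptions: `fetch_from_a` and `post_to_b` are hypothetical callables wrapping the HTTP calls to 'A' and 'B', `sqs_client` is assumed to be `boto3.client("sqs")`, and the second function is assumed to be wired to the queue via an SQS trigger.

```python
import json
from typing import Any, Callable, Dict


def make_front_handler(fetch_from_a: Callable[[], Dict[str, Any]],
                       sqs_client: Any,
                       queue_url: str) -> Callable[[Any, Any], str]:
    """First lambda: call 'A', enqueue the result, return to the caller right away."""
    def handler(event: Any, context: Any) -> str:
        body_a = fetch_from_a()
        sqs_client.send_message(QueueUrl=queue_url, MessageBody=json.dumps(body_a))
        return "success"
    return handler


def make_worker_handler(post_to_b: Callable[[Dict[str, Any]], None]) -> Callable[[Any, Any], None]:
    """Second lambda: triggered by the queue, saves each payload to 'B'."""
    def handler(event: Dict[str, Any], context: Any) -> None:
        for record in event["Records"]:
            post_to_b(json.loads(record["body"]))
    return handler
```

With this split, the caller never waits on 'B', and if 'B' is down the message simply stays in the queue (or ends up in a dead letter queue) instead of holding the first lambda open.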