I'm using AWS CDK. I have a few lambdas running in a parallel state of a state machine, and if one of them fails I am currently:
Adding a custom object to the step function event in the lambda wrapper.
Having a failure lambda that then reads that custom object to see what the error was and alerts about it.
The problem is that in the logs I can see that the step function event that moves between the steps does get populated with the custom object during shutdown, but the failure lambda's catch does not have it in its input. It does receive the shortened error in the resultPath prop I provide, but unfortunately that's not enough.
const parallel = new sfn.Parallel(this, 'WithdrawEventParallelState')
.branch(happyPath)
.addCatch(failure.next(rollback), { resultPath: '$.Error' });
Am I missing some prop that would carry the latest step function event on into the error catch lambda?
I tried removing the resultPath, but then I get no error information at all.
I am writing an application to consume messages from an SQS queue. I am able to successfully bind the queue and receive messages. However, when I want to requeue a message, I use the following:
message.getHeaders().get(AwsHeaders.ACKNOWLEDGMENT, QueueMessageAcknowledgment.class)
.acknowledge();
I also use the following to requeue:
StaticMessageHeaderAccessor.getAcknowledgmentCallback(message).acknowledge(AcknowledgmentCallback.Status.REQUEUE);
But it is not successful.
I also tried PollableMessage, but I'm unclear on how to implement it.
https://docs.spring.io/spring-cloud-stream/docs/3.1.0/reference/html/spring-cloud-stream.html#_overview_2
I have a Consumer like this:
public class DefaultChannel implements Channel, Consumer<Message<String>> {

    @Override
    public void accept(Message<String> message) {
        if ("success".equals(message.getPayload())) {
            // Acknowledge (delete) the message on success.
            message.getHeaders().get(AwsHeaders.ACKNOWLEDGMENT, QueueMessageAcknowledgment.class)
                .acknowledge();
        } else {
            // Requeue the message on failure.
            StaticMessageHeaderAccessor.getAcknowledgmentCallback(message)
                .acknowledge(AcknowledgmentCallback.Status.REQUEUE);
        }
    }
}
I was able to requeue successfully by using the messageDeletionPolicy: ON_SUCCESS property and throwing an exception from the code.
I have a DynamoDB stream triggering a lambda handler that looks like this:
let failedRequestId: string | undefined
// asyncForEachSerial stops iterating as soon as the callback returns false
await asyncForEachSerial(event.Records, async (record) => {
  try {
    await handle(record.dynamodb.OldImage, record.dynamodb.NewImage, record, context)
    return true
  } catch (e) {
    failedRequestId = record.dynamodb.SequenceNumber
    return false // break: stop processing at the first failure
  }
})
return {
  // Report the failed record (if any) so the stream retries from there.
  batchItemFailures: failedRequestId ? [{ itemIdentifier: failedRequestId }] : [],
}
I have my lambda set up with a DestinationConfig.onFailure pointing to a DLQ I configured in SQS. The idea behind the handler is to process a batch of events and stop at the first failure, then report the most recent failure in batchItemFailures, which tells the stream to continue at that record on the next try. (I pulled the idea from this article.)
My current issue is that if there is a genuine failure of my handle() function on one of those records, then my exit code makes that record the checkpoint for the next handler call. However, the DLQ condition never triggers, and I end up processing that record over and over again. I should also note that I am trying to avoid reprocessing records multiple times, since handle() is not idempotent.
How can I elegantly handle errors while maintaining batching, but without triggering my handle() function more than once for well-behaved stream records?
I'm not sure if you have found the answer you were looking for; I'll respond in case someone else comes across this issue.
There are two other parameters you'd want to use to avoid that issue. Quoting the documentation (https://docs.aws.amazon.com/lambda/latest/dg/with-ddb.html):
Retry attempts – The maximum number of times that Lambda retries when the function returns an error. This doesn't apply to service errors or throttles where the batch didn't reach the function.
Maximum age of record – The maximum age of a record that Lambda sends to your function.
Basically, you'll have to specify how many times the failures should be retried and how far back in the stream Lambda should look.
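With CDK, for example, both can be set on the stream event source. A minimal sketch, assuming aws-cdk-lib, with handlerFn, table, and deadLetterQueue as hypothetical constructs defined elsewhere in the stack:
import { Duration } from 'aws-cdk-lib';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { DynamoEventSource, SqsDlq } from 'aws-cdk-lib/aws-lambda-event-sources';

// Bound the retries so a poison record eventually lands in the DLQ
// instead of being reprocessed forever.
handlerFn.addEventSource(new DynamoEventSource(table, {
  startingPosition: lambda.StartingPosition.TRIM_HORIZON,
  batchSize: 10,
  retryAttempts: 2,                // "Retry attempts"
  maxRecordAge: Duration.hours(1), // "Maximum age of record"
  onFailure: new SqsDlq(deadLetterQueue),
  reportBatchItemFailures: true,   // honor batchItemFailures in the response
}));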
I have the following source queue definition.
lazy val (processMessageSource, processMessageQueueFuture) =
  peekMatValue(
    Source
      .queue[(ProcessMessageInputData, Promise[ProcessMessageOutputData])](5, OverflowStrategy.dropNew))

def peekMatValue[T, M](src: Source[T, M]): (Source[T, M], Future[M]) = {
  val p = Promise[M]
  val s = src.mapMaterializedValue { m =>
    p.trySuccess(m)
    m
  }
  (s, p.future)
}
The ProcessMessageInputData class is essentially an artifact created when a caller calls a web server endpoint that is hooked up to this stream (i.e. the service endpoint's business logic puts messages into this queue). The Promise[ProcessMessageOutputData] is completed downstream in the sink of the application, and the web server has an onComplete callback on this future to return the response.
There are also other sources of ingress into this stream.
Now the buffer may back up, since the other source may overload the system, triggering stream backpressure. The existing code just drops the new message, but I still want to complete the process message output promise with an exception stating something like "Throttled".
Is there a mechanism to write a custom overflow strategy, or a post processing on the overflowed element that allows me to do this?
According to https://github.com/akka/akka/blob/master/akka-stream/src/main/scala/akka/stream/impl/QueueSource.scala#L83, dropNew would work just fine. On the client's end it would look like this:
processMessageQueue.offer((in, pr)).foreach {
  case QueueOfferResult.Enqueued =>
    // Successfully enqueued; the promise will be completed downstream.
  case QueueOfferResult.Dropped =>
    // The buffer was overflowing, so the element was dropped; fail the
    // caller's promise so the endpoint can respond with "Throttled".
    pr.failure(new RuntimeException("Throttled"))
}
I am wondering something, and I really can't find information about it. Maybe it is not the way to go, but I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume messages in batches. In my Lambda function I iterate over each message, and if one fails, Lambda exits and the cycle starts again.
I am wondering about a slightly different approach.
Let's assume I have three messages: A, B, and C. I also take them in batches. Now if message B fails (e.g. an API call failed), I return message B to SQS and keep processing message C.
Is it possible? If it is, is it a good approach? Because I see that I need to implement some extra complexity in Lambda and what not.
Thanks
There's an excellent article here. The relevant parts for you are...
Using a batchSize of 1, so that messages succeed or fail on their own.
Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
Handle errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
import logging

import boto3

logger = logging.getLogger(__name__)
sqs = boto3.client('sqs')

def handler(event, context):
    failed = False
    for msg in event['Records']:
        try:
            # Do something with the message.
            handle_message(msg)
        except Exception:
            # Ok, it failed, but allow the loop to finish.
            logger.exception('Failed to handle message')
            failed = True
        else:
            # The message was handled successfully. We can delete it now.
            sqs.delete_message(
                QueueUrl=<queue_url>,
                ReceiptHandle=msg['receiptHandle'],
            )

    # It doesn't matter what the error is. You just want to raise here
    # to ensure the trigger doesn't delete any of the failed messages.
    if failed:
        raise RuntimeError('Failed to process one or more messages')

def handle_message(msg):
    ...
For Node.js, check out https://www.npmjs.com/package/@middy/sqs-partial-batch-failure.
const middy = require('@middy/core')
const sqsBatch = require('@middy/sqs-partial-batch-failure')
const originalHandler = (event, context, cb) => {
const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
return Promise.allSettled(recordPromises)
}
const handler = middy(originalHandler)
.use(sqsBatch())
Check out https://medium.com/@brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.
As of Nov 2019, AWS has introduced the concept of Bisect On Function Error, along with maximum retries. If your function is idempotent, this can be used.
In this approach you should throw an error from the function even if only one item in the batch is failing. AWS will split the batch into two and retry. Now one half of the batch should pass successfully; for the other half the process continues until the bad record is isolated.
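For reference, a minimal CDK sketch of enabling this; note that bisect-on-error is a setting on stream event source mappings (Kinesis and DynamoDB streams), and handlerFn and table are hypothetical constructs assumed to exist:
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { DynamoEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

// On error, the batch is split in two and each half retried,
// repeating until the bad record is isolated.
handlerFn.addEventSource(new DynamoEventSource(table, {
  startingPosition: lambda.StartingPosition.TRIM_HORIZON,
  batchSize: 10,
  bisectBatchOnError: true,
  retryAttempts: 4, // the maximum retries mentioned above
}));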
Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.
If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach to achieve this is iterating through a list of your events (e.g. [eventA, eventB, eventC]) and, for each execution, appending to a list of failed events if the event failed. Then have an end case that checks whether the list of failed events has anything in it and, if it does, manually send the messages back to SQS (using SQS sendMessageBatch).
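A minimal sketch of that approach, assuming the AWS SDK v3 SQS client, a QUEUE_URL environment variable, and a hypothetical processEvent() per-message function:
import { SQSClient, SendMessageBatchCommand } from '@aws-sdk/client-sqs';
import type { SQSEvent, SQSRecord } from 'aws-lambda';

declare function processEvent(record: SQSRecord): Promise<void>; // hypothetical

const sqs = new SQSClient({});

export const handler = async (event: SQSEvent): Promise<void> => {
  const failedEvents: SQSRecord[] = [];

  for (const record of event.Records) {
    try {
      await processEvent(record);
    } catch (err) {
      failedEvents.push(record); // remember the failure, keep processing
    }
  }

  if (failedEvents.length > 0) {
    // Manually re-send the failed messages in one batch call.
    await sqs.send(new SendMessageBatchCommand({
      QueueUrl: process.env.QUEUE_URL!,
      Entries: failedEvents.map((r) => ({
        Id: r.messageId, // must be unique within the batch
        MessageBody: r.body,
      })),
    }));
  }
};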
However, you should note that this puts the events to the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
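A minimal sketch of such a handler, assuming the @types/aws-lambda typings and a hypothetical processMessage() per-record function:
import type { SQSEvent, SQSRecord, SQSBatchResponse, SQSBatchItemFailure } from 'aws-lambda';

declare function processMessage(record: SQSRecord): Promise<void>; // hypothetical

export const handler = async (event: SQSEvent): Promise<SQSBatchResponse> => {
  const batchItemFailures: SQSBatchItemFailure[] = [];

  for (const record of event.Records) {
    try {
      await processMessage(record);
    } catch (err) {
      // Catch everything: an uncaught exception fails the whole batch.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }

  // Only the listed messageIds are retried; the rest are deleted.
  return { batchItemFailures };
};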
To add to the answer by David:
SQS/Lambda supports reporting batch failures. How it works is within each batch iteration, you catch all exceptions, and if that iteration fails add that messageId to an SQSBatchResponse. At the end when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
I implemented this, but a batch of A, B, and C, with B failing, would still mark all three as complete. It turns out you need to explicitly configure the lambda event source mapping to expect a batch failure response. This can be done by adding the FunctionResponseTypes key with a value of a list containing ReportBatchItemFailures. Here are the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My SAM template event definition looks like this after adding it:
Type: SQS
Properties:
  Queue: my-queue-arn
  BatchSize: 10
  Enabled: true
  FunctionResponseTypes:
    - ReportBatchItemFailures
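For comparison, the CDK equivalent is a single prop on the event source (a sketch, assuming aws-cdk-lib, with handlerFn and queue as hypothetical constructs defined elsewhere):
import { SqsEventSource } from 'aws-cdk-lib/aws-lambda-event-sources';

// reportBatchItemFailures sets FunctionResponseTypes to
// ["ReportBatchItemFailures"] on the event source mapping.
handlerFn.addEventSource(new SqsEventSource(queue, {
  batchSize: 10,
  reportBatchItemFailures: true,
}));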