Is there a way to see in the log that the retry is happening? I need to know that this is working in our test environment before moving it into production.
There are rare instances when we get the following error, due to a portion of the key being a timestamp and data coming into the table from various sources. We need to have the writer retry when we get a DB2 SQL Error: SQLCODE=-803, SQLSTATE=23505.
<chunk>
...
<retryable-exception-classes>
<include class="com.ibm.db2.jcc.am.SqlIntegrityConstraintViolationException"></include>
</retryable-exception-classes>
</chunk>
JBeret does not log these events, but you can implement the listeners defined by the batch spec to act on them yourself. For example, RetryReadListener, RetryWriteListener, or RetryProcessListener.
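If you only need to see the retries in the log, a minimal sketch of a RetryWriteListener that just logs each retry might look like this (the class name, CDI name, and logger choice are illustrative, not anything JBeret prescribes); register it as a listener on the step that owns the chunk:
import java.util.List;
import java.util.logging.Logger;

import javax.batch.api.chunk.listener.RetryWriteListener;
import javax.inject.Named;

// Registered in the job XML with <listeners><listener ref="retryLoggingListener"/></listeners>
// inside the step that contains the chunk.
@Named("retryLoggingListener")
public class RetryLoggingListener implements RetryWriteListener {

    private static final Logger LOG = Logger.getLogger(RetryLoggingListener.class.getName());

    @Override
    public void onRetryWriteException(List<Object> items, Exception ex) throws Exception {
        // Invoked when the writer throws a retryable exception; the chunk will then be retried.
        LOG.warning("Retrying write of " + items.size() + " item(s) after: " + ex);
    }
}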
I am wondering about something, and I really can't find information about it. Maybe it is not the way to go, but I would just like to know.
It is about Lambda working in batches. I know I can set up Lambda to consume messages in batches. In my Lambda function I iterate over each message, and if one fails, Lambda exits and the cycle starts again.
I am wondering about a slightly different approach.
Let's assume I have three messages: A, B and C. I also take them in batches. Now if message B fails (e.g. an API call failed), I return message B to SQS and keep processing message C.
Is it possible? If it is, is it a good approach? I can see that I would need to implement some extra complexity in the Lambda, and so on.
Thanks
There's an excellent article here. The relevant parts for you are...
Using a batchSize of 1, so that messages succeed or fail on their own.
Making sure your processing is idempotent, so reprocessing a message isn't harmful, outside of the extra processing cost.
Handling errors within your function code, perhaps by catching them and sending the message to a dead letter queue for further processing.
Calling the DeleteMessage API manually within your function after successfully processing a message.
The last bullet point is how I've managed to deal with the same problem. Instead of returning errors immediately, store them or note that an error has occurred, but then continue to handle the rest of the messages in the batch. At the end of processing, return or raise an error so that the SQS -> lambda trigger knows not to delete the failed messages. All successful messages will have already been deleted by your lambda handler.
import logging

import boto3

logger = logging.getLogger(__name__)
sqs = boto3.client('sqs')


def handler(event, context):
    failed = False
    for msg in event['Records']:
        try:
            # Do something with the message.
            handle_msg(msg)
        except Exception:
            # Ok it failed, but allow the loop to finish.
            logger.exception('Failed to handle message')
            failed = True
        else:
            # The message was handled successfully. We can delete it now.
            sqs.delete_message(
                QueueUrl=<queue_url>,
                ReceiptHandle=msg['receiptHandle'],
            )
    # It doesn't matter what the error is. You just want to raise here
    # to ensure the trigger doesn't delete any of the failed messages.
    if failed:
        raise RuntimeError('Failed to process one or more messages')


def handle_msg(msg):
    ...
For Node.js, check out https://www.npmjs.com/package/@middy/sqs-partial-batch-failure.
const middy = require('@middy/core')
const sqsBatch = require('@middy/sqs-partial-batch-failure')
const originalHandler = (event, context, cb) => {
  const recordPromises = event.Records.map(async (record, index) => { /* Custom message processing logic */ })
  return Promise.allSettled(recordPromises)
}
const handler = middy(originalHandler)
  .use(sqsBatch())
Check out https://medium.com/@brettandrews/handling-sqs-partial-batch-failures-in-aws-lambda-d9d6940a17aa for more details.
As of Nov 2019, AWS has introduced the concept of Bisect On Function Error, along with a maximum number of retries. If your function is idempotent, this can be used.
In this approach you should throw an error from the function even if only one item in the batch is failing. AWS will split the batch into two and retry. Now one half of the batch should pass successfully. For the other half, the process continues until the bad record is isolated.
Like all architecture decisions, it depends on your goal and what you are willing to trade for more complexity. Using SQS will allow you to process messages out of order so that retries don't block other messages. Whether or not that is worth the complexity depends on why you are worried about messages getting blocked.
I suggest reading about Lambda retry behavior and Dead Letter Queues.
If you want to retry only the failed messages out of a batch of messages it is totally doable, but does add slight complexity.
A possible approach to achieve this is iterating through the list of your events (e.g. [eventA, eventB, eventC]) and, for each execution, appending the event to a list of failed events if it failed. Then have a final step that checks whether the list of failed events has anything in it, and if it does, manually send those messages back to SQS (using SQS SendMessageBatch; see the sketch after this answer).
However, you should note that this puts the events at the end of the queue, since you are manually inserting them back.
Anything can be a "good approach" if it solves a problem you are having without much complexity, and in this case, the issue of having to re-execute successful events is definitely a problem that you can solve in this manner.
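A rough sketch of that final step using the AWS SDK for Java v2 (the queue URL and the list of failed message bodies are assumed to come from your handler; this is an illustration of the approach, not a complete handler):
import java.util.List;
import java.util.UUID;
import java.util.stream.Collectors;

import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageBatchRequest;
import software.amazon.awssdk.services.sqs.model.SendMessageBatchRequestEntry;

public class FailedMessageResender {

    private final SqsClient sqs = SqsClient.create();

    // Called once at the end of the handler with the bodies of the events that failed.
    public void resend(String queueUrl, List<String> failedBodies) {
        if (failedBodies.isEmpty()) {
            return;
        }
        List<SendMessageBatchRequestEntry> entries = failedBodies.stream()
                .map(body -> SendMessageBatchRequestEntry.builder()
                        .id(UUID.randomUUID().toString()) // id only needs to be unique within this batch
                        .messageBody(body)
                        .build())
                .collect(Collectors.toList());
        // SendMessageBatch accepts at most 10 entries per call.
        sqs.sendMessageBatch(SendMessageBatchRequest.builder()
                .queueUrl(queueUrl)
                .entries(entries)
                .build());
    }
}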
SQS/Lambda supports reporting batch failures. How it works: within each batch iteration, you catch all exceptions, and if that iteration fails, you add its messageId to an SQSBatchResponse. At the end, when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
To use this feature, your function must gracefully handle errors. Have your function logic catch all exceptions and report the messages that result in failure in batchItemFailures in your function response. If your function throws an exception, the entire batch is considered a complete failure.
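As a hedged Java illustration, the documented response shape can be returned as a plain map (the handler class and process method here are hypothetical; the event source mapping must also have ReportBatchItemFailures enabled, as the next answer explains):
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.SQSEvent;

public class PartialBatchHandler implements RequestHandler<SQSEvent, Map<String, Object>> {

    @Override
    public Map<String, Object> handleRequest(SQSEvent event, Context context) {
        List<Map<String, String>> failures = new ArrayList<>();
        for (SQSEvent.SQSMessage msg : event.getRecords()) {
            try {
                process(msg);
            } catch (Exception e) {
                // Only this message is reported as failed; the rest of the batch is treated as successful.
                failures.add(Map.of("itemIdentifier", msg.getMessageId()));
            }
        }
        // Serialized by Lambda as {"batchItemFailures": [{"itemIdentifier": "..."}]}.
        return Map.of("batchItemFailures", failures);
    }

    private void process(SQSEvent.SQSMessage msg) {
        // Placeholder for the real message handling.
    }
}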
To add to the answer by David:
SQS/Lambda supports reporting batch failures. How it works: within each batch iteration, you catch all exceptions, and if that iteration fails, you add its messageId to an SQSBatchResponse. At the end, when all SQS messages have been processed, you return the batch response.
Here is the relevant docs section: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
I implemented this, but a batch of A, B and C, with B failing, would still mark all three as complete. It turns out you need to explicitly configure the Lambda event source mapping to expect a batch failure to be returned. This can be done by adding the key FunctionResponseTypes with a value of a list containing ReportBatchItemFailures. Here are the relevant docs: https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html#services-sqs-batchfailurereporting
My SAM template event source looks like this after adding it:
Type: SQS
Properties:
  Queue: my-queue-arn
  BatchSize: 10
  Enabled: true
  FunctionResponseTypes:
    - ReportBatchItemFailures
When I run Dataflow jobs that write to Google Cloud Datastore, I sometimes see the metrics show that I had one or two datastoreRpcErrors.
Since these Datastore writes usually contain a batch of keys, I am wondering whether, in the situation of an RpcError, some retry will happen automatically. If not, what would be a good way to handle these cases?
tl;dr: By default, datastoreRpcErrors will be retried automatically, up to 5 times.
I dug into the code of datastoreio in the Beam Python SDK. It looks like the final entity mutations are flushed in batches via DatastoreWriteFn().
# Flush the current batch of mutations to Cloud Datastore.
_, latency_ms = helper.write_mutations(
    self._datastore, self._project, self._mutations,
    self._throttler, self._update_rpc_stats,
    throttle_delay=_Mutate._WRITE_BATCH_TARGET_LATENCY_MS/1000)
The RPCError is caught by this block of code in write_mutations in the helper; there is a @retry.with_exponential_backoff decorator on the commit method, the default number of retries is set to 5, and retry_on_rpc_error defines the concrete RPCError and SocketError reasons that trigger a retry.
for mutation in mutations:
  commit_request.mutations.add().CopyFrom(mutation)

@retry.with_exponential_backoff(num_retries=5,
                                retry_filter=retry_on_rpc_error)
def commit(request):
  # Client-side throttling.
  while throttler.throttle_request(time.time()*1000):
    ...

  try:
    response = datastore.commit(request)
    ...
  except (RPCError, SocketError):
    if rpc_stats_callback:
      rpc_stats_callback(errors=1)
    raise

  ...
I think you should first of all determine which kind of error occurred in order to see what your options are.
In the official Datastore documentation there is a list of all the possible errors and their error codes. Fortunately, they come with recommended actions for each.
My advice is that you implement their recommendations and look for alternatives if they are not effective for you.
I've got a C/C++/Objective-C project that sends ASL logging messages.
The default configuration in asl.conf routes all log messages at level notice and above to the system log (see the rule below), and I'd like to cancel this rule for my specific facility only.
This means that all log messages under my facility would be routed to my log file only, and not to system.log.
Here's the configuration, where my facility is defined as com.bla.bla:
asl.conf
? [<= Level notice] file system.log
my_asl.conf
? [<= Level notice] [=Facility com.bla.bla] skip / ignore
I've tried both skip and ignore, but neither made any change. The only thing that works is to erase the rule from asl.conf, but I don't want to change the behavior of other processes/facilities or modify the default rules.
Is there any rule I can add to keep only my messages out of system.log?
Thanks
After re-reading the asl.conf man page over and over again, I found out that I can use the 'claim' action to keep the base /etc/asl.conf configuration from processing my messages:
claim    Messages that match the query associated with a 'claim' action are not processed by the main ASL configuration file /etc/asl.conf. While claimed messages are not processed by /etc/asl.conf, they are not completely private. Other modules may also claim messages, and in some cases two or more modules may have claim actions that match the same messages. This action only blocks processing by /etc/asl.conf.
The 'claim' action may be followed by the keyword 'only'. In this case, only those messages that match the 'claim only' query will be processed by subsequent rules in the module.
I followed the description of the 'claim' action and added the following configuration to my config file:
? [= Facility com.bla.bla] file /var/log/my-log
? [= Facility com.bla.bla] claim
I have a unit test that creates an error condition. Normally, the class under test writes this error to the logs (using log4j in this case, but I don't think that matters). I can change the log level temporarily, using
Logger targetLogger = Logger.getLogger(ClassUnderTest.class);
Level oldLvl = targetLogger.getLevel();
targetLogger.setLevel(Level.FATAL);
theTestObject.doABadThing();
assertTrue(theTestObject.hadAnError());
targetLogger.setLevel(oldLvl);
but that also means that if an unrelated / unintended error occurs during testing, I won't see that information in the logs either.
Is there a best practice or common pattern I'm supposed to use here? I don't like prodding the log levels if I can help it, but I also don't like having a bunch of ERROR noise in the test output, which could scare future developers.
If your logging layer permits, it is good practice to make an assertion on the error message. You can do it by implementing your own logger that just asserts on the message (without output), or by using a memory-buffer logger and then checking the contents of the log buffer; see the sketch after the list below.
Under no circumstances should the error message end up in the unit-test execution log. This will cause people to get used to errors in the log and mask other errors. In short, your options are:
Most preferred: Catch the message in the harness and assert on it.
Somewhat OK: Raise the level and ignore the message.
Not OK: Don't do anything and let the log message reach stderr/syslog.
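For log4j 1.x, as used in the question, a minimal sketch of the memory-buffer option: attach a capturing appender for the duration of the test, turn off additivity so nothing reaches the normal output, and assert on what was captured (the class name is illustrative):
import java.util.ArrayList;
import java.util.List;

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

// Collects log events in memory instead of writing them anywhere.
public class CapturingAppender extends AppenderSkeleton {

    private final List<LoggingEvent> events = new ArrayList<>();

    @Override
    protected void append(LoggingEvent event) {
        events.add(event);
    }

    public List<LoggingEvent> getEvents() {
        return events;
    }

    @Override
    public void close() { }

    @Override
    public boolean requiresLayout() {
        return false;
    }
}
The test would then look roughly like:
Logger targetLogger = Logger.getLogger(ClassUnderTest.class);
CapturingAppender capture = new CapturingAppender();
targetLogger.addAppender(capture);
targetLogger.setAdditivity(false); // keep the expected error out of the test output
try {
    theTestObject.doABadThing();
    assertTrue(theTestObject.hadAnError());
    assertEquals(1, capture.getEvents().size()); // the error was logged exactly once
} finally {
    targetLogger.removeAppender(capture);
    targetLogger.setAdditivity(true);
}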
The way I approach this, assuming an xUnit style of unit testing (JUnit, PyUnit, etc.):
@Test(expected = MyException.class)
public void foo_1() throws Exception
{
    theTestObject.doABadThing(); // MyException here
}
The issue with relying on the logs is that someone needs to go and actually parse the log file, which is time-consuming and error-prone. However, the test above will pass if MyException is thrown and fail if it isn't. This in turn allows you to fail the build automatically instead of hoping the tester reads the logs correctly.