Fallback for DynamoDB with SQS - amazon-web-services

We have a synchronous REST endpoint that does other processing in addition to saving an item to a DynamoDB table for later use.
The requirement is that the endpoint must not error out if the database save fails with any type of exception.
How do we handle the case where DynamoDB is down in the entire region (rare, but possible)? Is it the right pattern to publish to SQS and have a separate process consume the messages and save them to DynamoDB once it detects the service is reachable again (e.g. via ListTables or a similar ping)?
Should we fall back to another region or publish to SQS? Is it worth using the resilience4j circuit breaker pattern?

It is a common pattern to have the API simply enqueue a request to SQS. This has many benefits, such as higher throughput, decoupling of the producer and consumer, and better fault tolerance.
This would be a fine design, but your REST API will no longer be synchronous and the caller won't quite know whether the operation was successfully processed, so you may need to add another endpoint to get the status of the request.
I am not super familiar with resilience4j circuit breakers, but if retries are the main benefit you are seeking, they may not be necessary: the AWS SDKs already have built-in retries.
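If you do want an application-level fallback rather than a circuit breaker, a minimal sketch with the AWS SDK for Java v2 could look like the following; the table name, queue URL, and item shape are assumptions for illustration, not part of the original question:

```java
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

import java.util.Map;

public class SaveWithFallback {

    private final DynamoDbClient dynamoDb = DynamoDbClient.create();
    private final SqsClient sqs = SqsClient.create();

    // Hypothetical names used only for this sketch.
    private static final String TABLE_NAME = "items";
    private static final String FALLBACK_QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/items-fallback";

    public void save(String id, String payloadJson) {
        try {
            // Primary path: write the item directly to DynamoDB.
            dynamoDb.putItem(PutItemRequest.builder()
                    .tableName(TABLE_NAME)
                    .item(Map.of(
                            "id", AttributeValue.builder().s(id).build(),
                            "payload", AttributeValue.builder().s(payloadJson).build()))
                    .build());
        } catch (Exception e) {
            // Fallback path: enqueue the payload so a separate consumer can
            // retry the DynamoDB write later. The REST call still succeeds.
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(FALLBACK_QUEUE_URL)
                    .messageBody(payloadJson)
                    .build());
        }
    }
}
```

The separate consumer would then replay the queued payloads into DynamoDB once the table is reachable again, as described in the question.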

Related

How should I handle asynchronous processes that occur after API calls in AWS?

I'm designing the backend for a website that uses API Gateway and Lambda to handle API requests, many of which target a MySQL DB on RDS. Some processes need to happen asynchronously, but I'm debating which approach is best practice or cleaner.
In the given scenario, every time a user creates a new row in a certain table, let's say an email also needs to be sent asynchronously. There are many other similar scenarios, but this one will set the precedent.
Option 1: In the Lambda that handles the API request, first write to the MySQL instance to add the new row. When the response from MySQL comes back successful, write to something like SQS, which will later be read by another Lambda that sends an email. When SQS confirms that the record was added to the queue, send a 201 response saying the REST API call was successful.
Option 2: In the Lambda that handles the API request, write to the MySQL instance to add the new row. When the response from MySQL comes back successful, send a 201 response saying the REST API call was successful. Then set up a DMS (Database Migration Service) task that runs indefinitely to send database modification binlogs to a Kinesis stream, which triggers a Lambda that handles all DB changes, recognizes the change as a new row in a certain table, and sends an email.
Option 1:
less infrastructure
more direct tracking of logic from an API call
one extra HTTP call (to SQS), delaying response times for an API backing a web page
Option 2:
more infrastructure (DMS task, replication instance)
scaling out shards may mean loss of ordering when processing binlog events, if ordering is a requirement (it is)
side question: are you able to choose the hash key for Kinesis for DMS tasks from MySQL?
a single codebase for reacting to all modifications in the DB may actually make following logic in code simpler
Is this the tradeoff or am I missing something? What is best practice in this scenario?
Option 1 in my view seems most logical, but I would replace SQS and the second Lambda with SNS. So, a modified Option 1 could be:
Option 1: In the Lambda that handles the API request, first write to the MySQL instance to add the new row. When the response from MySQL comes back successful, publish a confirmation message to an SNS topic that sends the email. When the SNS publish succeeds, send a 201 response saying the REST API call was successful.
This should be faster, cheaper, and easier to implement than using SQS and a second Lambda to send the email.
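As a rough illustration of that publish step, a minimal sketch with the AWS SDK for Java v2 might be the following; the topic ARN is an assumption, with an email subscription attached to the topic:

```java
import software.amazon.awssdk.services.sns.SnsClient;
import software.amazon.awssdk.services.sns.model.PublishRequest;

public class RowCreatedNotifier {

    private final SnsClient sns = SnsClient.create();

    // Hypothetical topic ARN; an email subscription is assumed to exist on it.
    private static final String TOPIC_ARN =
            "arn:aws:sns:us-east-1:123456789012:row-created";

    /** Called after the MySQL insert has succeeded, before returning 201. */
    public void notifyRowCreated(String subject, String messageBody) {
        sns.publish(PublishRequest.builder()
                .topicArn(TOPIC_ARN)
                .subject(subject)       // used as the email subject line
                .message(messageBody)   // email body for email subscriptions
                .build());
    }
}
```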

Kinesis Producer callback functions - guaranteed delivery?

We are streaming billions of messages a day to Kinesis.
We're looking for an implementation that would allow us to deliver messages to Kinesis with an exactly-once guarantee.
Our producer framework requires a streaming sink to be idempotent for an exactly-once delivery guarantee, which Kinesis is not. So we're currently getting at-least-once delivery (duplicates are possible, and we do see them when a streaming micro-batch has to restart for whatever reason on the producer side).
We started looking at Kinesis Producer Library (KPL) callback functions. Basically, we would track in DynamoDB which messages were delivered and which were not, based on a key that's present in each message. If we know that a message was already sent, we skip the delivery re-attempt. Then it seems exactly-once is possible, with two concerns:
1)
How likely is it that we would lose an invocation of the callback function (e.g. a network glitch), or that the callback function itself fails (e.g. we hit a DynamoDB limit or outage)? Is this documented somewhere? I know the chances are not high, but we want to design a system that is resilient to expected failures like these.
2)
Timing. Say that for whatever reason Kinesis invokes a callback function with a delay (5-15 milliseconds would be enough to break some assumptions in the callback functions that persist delivery state in DynamoDB), and while we haven't yet received a delivery confirmation, our streaming producer framework attempts redelivery of a message it thinks wasn't delivered. Any workarounds for this potential issue?
PS: we know that one workaround is to deduplicate on the application side (the receiver of that Kinesis stream), but that's outside of our project and we have a hard requirement to get exactly-once into that Kinesis stream.
For #1, any path you go down you'll find yourself in edge cases that could lead you to loss of data, or duplicate calls. Even using a two phased commit protocol doesn't work here if the consumer isn't participating in that protocol.
For #2, Kinesis is ordered, so if you do get duplicates you should be able to reliably assume they will be on the same shard, and thus not processed while another reader is still processing (assuming one reader per shard). Just make sure you are using a strongly consistent read when calling DynamoDB.
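As a rough sketch of the approach being discussed (not a complete solution, and the stream and table names are assumptions), a KPL callback plus a strongly consistent DynamoDB check could look like this; note it still leaves the timing window from #2 open:

```java
import com.amazonaws.services.kinesis.producer.KinesisProducer;
import com.amazonaws.services.kinesis.producer.UserRecordResult;
import com.google.common.util.concurrent.FutureCallback;
import com.google.common.util.concurrent.Futures;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.MoreExecutors;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

import java.nio.ByteBuffer;
import java.util.Map;

public class DedupingKinesisSender {

    private final KinesisProducer producer = new KinesisProducer();
    private final DynamoDbClient dynamoDb = DynamoDbClient.create();

    // Hypothetical names used only for this sketch.
    private static final String STREAM = "events";
    private static final String STATE_TABLE = "delivered-messages";

    public void sendOnce(String messageKey, ByteBuffer data) {
        // Strongly consistent read so a just-written delivery record is visible.
        boolean alreadyDelivered = dynamoDb.getItem(GetItemRequest.builder()
                .tableName(STATE_TABLE)
                .key(Map.of("messageKey", AttributeValue.builder().s(messageKey).build()))
                .consistentRead(true)
                .build()).hasItem();
        if (alreadyDelivered) {
            return; // skip the delivery re-attempt
        }

        ListenableFuture<UserRecordResult> pending =
                producer.addUserRecord(STREAM, messageKey, data);

        Futures.addCallback(pending, new FutureCallback<UserRecordResult>() {
            @Override
            public void onSuccess(UserRecordResult result) {
                // Persist delivery state only after Kinesis acknowledged the record.
                dynamoDb.putItem(PutItemRequest.builder()
                        .tableName(STATE_TABLE)
                        .item(Map.of(
                                "messageKey", AttributeValue.builder().s(messageKey).build(),
                                "sequenceNumber",
                                AttributeValue.builder().s(result.getSequenceNumber()).build()))
                        .build());
            }

            @Override
            public void onFailure(Throwable t) {
                // No state written, so the message remains eligible for redelivery.
            }
        }, MoreExecutors.directExecutor());
    }
}
```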

AWS SQS BackUp Solution Design

Problem Statement
Informal State
We have some scenarios where the integration layer (a combination of AWS SNS/SQS components, etc.) is also responsible for distributing data to target systems. Those are mostly async flows. In this case, we send a confirmation to the caller that we have received the data and will take responsibility for its delivery. Although the data does not originate in the integration layer, we are still holding it and need to make sure it is not lost, for example if the consumers are down, or if messages are sent on error to the DLQs and hence automatically deleted after the retention period.
Solution Design
Currently my idea is to back up the SQS/DLQ queues based on CloudWatch alerts configured on the ApproximateAgeOfOldestMessage metric with some applied threshold (something like the below):
Msg Expiration Event if ApproximateAgeOfOldestMessage / Message retention > Threshold
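For concreteness, a minimal sketch of such an alarm with the AWS SDK for Java v2 (the queue name, threshold, and alarm action are assumptions) might look like:

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.ComparisonOperator;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.PutMetricAlarmRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

public class MessageExpirationAlarm {

    public static void main(String[] args) {
        // Assumed values: a 4-day retention queue alerted at ~80% of retention.
        String queueName = "orders-dlq";                // hypothetical queue
        double thresholdSeconds = 4 * 24 * 3600 * 0.8;  // 80% of retention, in seconds

        try (CloudWatchClient cloudWatch = CloudWatchClient.create()) {
            cloudWatch.putMetricAlarm(PutMetricAlarmRequest.builder()
                    .alarmName(queueName + "-message-expiration")
                    .namespace("AWS/SQS")
                    .metricName("ApproximateAgeOfOldestMessage")
                    .dimensions(Dimension.builder().name("QueueName").value(queueName).build())
                    .statistic(Statistic.MAXIMUM)
                    .period(300)            // evaluate 5-minute datapoints
                    .evaluationPeriods(1)
                    .threshold(thresholdSeconds)
                    .comparisonOperator(ComparisonOperator.GREATER_THAN_THRESHOLD)
                    // Hypothetical SNS topic that would trigger the dump procedure.
                    .alarmActions("arn:aws:sns:us-east-1:123456789012:dump-queue")
                    .build());
        }
    }
}
```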
Now, the more I go forward with this idea, the more I doubt that it is actually the right approach…
In particular, I would like to build something unobtrusive that can be "attached" to our SQS queues and dump the messages that are about to expire into some repository, for example AWS S3, and then have a procedure to recover the messages from S3 back into the original queue.
The above procedure involves many challenges: message identification and consumption (ReceiveMessage is not designed to "query" for specific messages), dumping messages into the repository with a reference to the source queue, etc., which suggests to me that the approach might be complex overkill.
That being said, I'm aware of other "alternatives" (such as this), but I would appreciate it if you could address the specific technical details described above, rather than challenging the "need" itself.
Similar to Mark B's suggestion, you can use the SQS extended client (https://github.com/awslabs/amazon-sqs-java-extended-client-lib) to send all your messages through S3 (which is a configuration knob: https://github.com/awslabs/amazon-sqs-java-extended-client-lib/blob/master/src/main/java/com/amazon/sqs/javamessaging/ExtendedClientConfiguration.java#L189).
The extended client is a drop-in replacement for the AmazonSQS interface so it minimizes the intrusion on business logic - usually it's a matter of just changing your dependency injection.
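For reference, a minimal sketch of wiring the extended client up in that always-through-S3 mode could be the following; the bucket and queue names are assumptions, and this uses the 1.x API of the library:

```java
import com.amazon.sqs.javamessaging.AmazonSQSExtendedClient;
import com.amazon.sqs.javamessaging.ExtendedClientConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.SendMessageRequest;

public class ExtendedClientExample {

    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Route every message body through S3, not just the ones over 256 KB.
        ExtendedClientConfiguration config = new ExtendedClientConfiguration()
                .withLargePayloadSupportEnabled(s3, "my-sqs-payload-bucket") // hypothetical bucket
                .withAlwaysThroughS3(true);

        // Drop-in replacement for the regular AmazonSQS client.
        AmazonSQS sqs = new AmazonSQSExtendedClient(
                AmazonSQSClientBuilder.defaultClient(), config);

        sqs.sendMessage(new SendMessageRequest()
                .withQueueUrl("https://sqs.us-east-1.amazonaws.com/123456789012/orders") // hypothetical queue
                .withMessageBody("{\"orderId\": 42}"));
    }
}
```

Because the message bodies then live in S3, the payloads survive in the bucket even after the SQS retention period deletes the queue entries.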

Workaround aws apigateway timeout with lambda - asynchronous processing

I have a serverless backend running on Lambda. The runtime usually varies between 40-250s, which is over the API Gateway maximum allowed timeout (29s). As such, I think my only option is to resort to asynchronous processing. I get the idea behind it, but help online seems sparse, and I'd like to know if there are any best practices out there. Or what would be the simplest way to get around this timeout problem: asynchronous processing or something else?
It really depends on your use case, but an asynchronous approach is probably the best fit for this scenario, given that it's not usually a good idea for the caller of your API to wait 250 seconds for a reply (that's probably why the 29s limit exists on API Gateway).
Asynchronous simply means that you reply back from Lambda saying you received the request and are going to work on it, but the result will only be available later.
Then you change the logic on the client side, too, to check back after some time or poll in a loop until the requested resource is ready.
Depending on what work needs to be done, you could create an S3 bucket on the fly and reply back to the client with an S3 presigned URL. Then your worker uploads its results to the S3 bucket, and the client polls that URL until the results are present.
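A minimal sketch of generating such a presigned GET URL with the AWS SDK for Java v2 (the bucket and key are assumptions) might look like:

```java
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.GetObjectPresignRequest;

import java.net.URL;
import java.time.Duration;
import java.util.UUID;

public class ResultUrlGenerator {

    /** Returns a URL the client can poll until the worker has uploaded the result. */
    public static URL presignResultUrl() {
        String bucket = "async-results";                      // hypothetical bucket
        String key = "results/" + UUID.randomUUID() + ".json"; // hypothetical key per request

        try (S3Presigner presigner = S3Presigner.create()) {
            GetObjectPresignRequest request = GetObjectPresignRequest.builder()
                    .signatureDuration(Duration.ofMinutes(30)) // how long the URL stays valid
                    .getObjectRequest(GetObjectRequest.builder()
                            .bucket(bucket)
                            .key(key)
                            .build())
                    .build();
            // The URL returns an error (403/404) until the worker writes the object to this key.
            return presigner.presignGetObject(request).url();
        }
    }
}
```

The worker writing the result would use its own credentials (or a separate presigned PUT) to upload the object to the same key.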

does MSMQ have "lock until expire" functionality similar to Amazon SQS?

I've been using AWS SQS, which has a nice feature: when a message is claimed from the queue, it is locked for a period of time. If the message is processed successfully during this lock, it is marked as completed. If processing fails (and no response is received from the message processor), the lock eventually expires and the message becomes available for another processor to pick up.
Now I have a requirement to use queues outside of SQS (mostly for latency reasons, but potentially for cost reasons too). I'm really looking for a queue provider that has the same characteristic. MSMQ would be the obvious choice for me, since it's already installed and we use it elsewhere, but I can't find any functionality that handles failed messages in the same way.
Does MSMQ allow for this, or is there an easy way to replicate it?
Alternatively, is there another lightweight, open-source messaging service that does?
MSMQ does this already. If you read a message within a transaction and the transaction aborts then the message will reappear in the queue.