How to make an AWS SAM app highly available and fault tolerant

I have developed a SAM app that takes incoming email submissions, listens to specific SQS queues for each incoming data type, parses the payload with the matching Python Lambda worker function, and submits the data downstream to another SQS queue for further action. I deployed this stack from PyCharm using a SAM YAML template.
I need to improve the app with better disaster recovery options for fault tolerance. I don't believe I can improve the inbound email receiving function through SES, so that will be a single point of failure (if SES fails in us-east-1, my app fails). However, I'd like for the SQS messages to be delivered to the Parser Worker functions in multiple regions.
Scenario:
email1 comes into SES and is stored in the correct S3 key
S3 triggers event notification to Parser1 SQS queue
The Parser1 SQS message is picked up, in a load-balanced manner, by one of several Parser1 Worker functions deployed in multiple Regions (us-east-1, us-east-2, etc.), thus providing disaster recovery and fault tolerance
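As background for this scenario: when S3 sends an event notification to an SQS queue, each SQS message body contains the S3 event JSON, so a Parser worker has to unwrap two layers before it knows which stored email to parse. A minimal sketch of that unwrapping, assuming the standard S3 event notification format; `extract_s3_objects` is a hypothetical helper name and the parsing step itself is left as a placeholder:

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_event):
    """Return (bucket, key) pairs from an SQS-triggered Lambda event whose
    message bodies are S3 event notifications."""
    objects = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])
        # S3 sends a one-off "s3:TestEvent" message (no "Records") when the
        # notification is first configured; it is skipped here.
        for s3_record in body.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded (spaces become "+").
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def lambda_handler(event, context):
    for bucket, key in extract_s3_objects(event):
        # Hypothetical placeholder: fetch the stored email from S3 and parse it.
        print(f"would parse s3://{bucket}/{key}")
```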
Is there a way to distribute the SQS message in a load-balanced type of scenario to multiple Lambdas deployed in multiple regions so that I don't have the single point of failure in one stack in one region?
I have the app running in production in its single-stack deployment. I am concerned about redundancy and single points of failure. I am not sure how to improve this situation, so have not tried anything yet.

Related

How to deliver a message from Amazon SQS to all running instances of a service

I have two services: a producer (Service A) and a consumer (Service B). Service A produces a message, which is published to Amazon SQS and then delivered to Service B, which has subscribed to the queue. This works fine as long as I have a single instance of Service B.
But when I start a second instance of Service B, so that two instances subscribe to the same queue (since it is the same service), I observe that messages from SQS are delivered in round-robin fashion: each message published by Service A is received by only one instance of Service B. I want every message published to this queue to be received by all instances of Service B.
How can we do this? I have developed these services as Spring Boot applications with Spring Cloud dependencies.
Please see the diagram below for reference.
If you want to build functionality like this, use SNS, not SQS. We have a Spring Boot example that shows how to build a web app that lets users sign up for email subscriptions; when a message is published, every subscribed address receives it.
The purpose of this example is to get you up and running building a Spring Boot app with Amazon Simple Notification Service, using Spring Boot and the official AWS SDK for Java V2:
Creating a Publish/Subscription Spring Boot Application
While your messages may appear to be read in a round-robin fashion, they are not actually consumed round-robin. SQS makes every message available to any consumer that has the appropriate IAM permissions, and as soon as one consumer receives a message, SQS hides it from the other consumers for a configurable period (the visibility timeout), effectively "locking" that message. The fact that your consumers seem to take turns is most likely coincidental.
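The locking behaviour can be illustrated with a toy, in-memory model. This is not how SQS is implemented (it ignores ordering, retention and receive counts); `ToyQueue` and its methods are hypothetical names used only to show which consumer can see a message, and when:

```python
from dataclasses import dataclass, field


@dataclass
class ToyQueue:
    """A toy model of SQS visibility timeout, not the real service."""
    visibility_timeout: float
    messages: dict = field(default_factory=dict)         # message id -> body
    invisible_until: dict = field(default_factory=dict)  # message id -> time

    def send(self, msg_id, body):
        self.messages[msg_id] = body

    def receive(self, now):
        # Any consumer may receive any currently visible message...
        for msg_id, body in self.messages.items():
            if self.invisible_until.get(msg_id, 0) <= now:
                # ...and receiving it hides it from everyone else for a while.
                self.invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        # Only an explicit delete removes the message for good.
        self.messages.pop(msg_id, None)
        self.invisible_until.pop(msg_id, None)
```

With a 30-second timeout, a second consumer polling one second after the first gets nothing, and the message reappears only if the first consumer never deletes it. With `visibility_timeout=0`, every receive sees the same message.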
As others have mentioned, you could use SNS instead of SQS to fan out messages to multiple consumers at once, but that setup is not as simple as it may sound. If your service B is load balanced, the HTTP endpoint subscriber will point to the load balancer's DNS name, so only one instance will get the message. Assuming your instances have public IPs, you could modify your app so that it registers itself as an HTTP subscriber to the topic when it starts up. The downsides are that you bypass your load balancer, and you also lose the durability guarantees that come with SQS: an SNS topic retries delivery a limited number of times and then drops the message (unless a dead-letter queue is configured for the subscription).
An alternative would be to set the queue's visibility timeout to 0, so the message is never locked and every consumer can read it. You would then need to modify your application to a) avoid processing the same message twice, since an instance will likely read the same message more than once before it has finished processing it, and b) handle failures gracefully when one instance deletes the message from the queue and other instances then try to delete it too.
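A minimal sketch of points a) and b), assuming a boto3 SQS client is passed in as `sqs` (it uses the real `receive_message` and `delete_message` operations); `process_once`, `seen_ids` and `handle` are hypothetical names, and `seen_ids` is a per-instance set of message IDs this instance has already handled:

```python
def process_once(sqs, queue_url, seen_ids, handle, wait_seconds=20):
    """Receive a batch and handle each message at most once on this instance."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,
    )
    for msg in resp.get("Messages", []):
        # a) With a visibility timeout of 0 the same message keeps coming back,
        #    so skip anything this instance has already handled.
        if msg["MessageId"] not in seen_ids:
            handle(msg["Body"])
            seen_ids.add(msg["MessageId"])
        # b) Another instance may already have deleted it; log and move on.
        try:
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        except Exception as exc:  # boto3 raises botocore ClientError here
            print(f"delete failed for {msg['MessageId']}: {exc}")
```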
Alternatively, you could use a service mesh or a service-discovery mechanism so that instances can communicate with each other peer to peer: one instance pulls the message from the SQS queue and propagates it to the other instances of the service.
You could also use a distributed store like Redis or DynamoDB to persist the messages and their current status so that every instance can read them, but only one instance will ever insert a new row.
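With DynamoDB, the "only one instance inserts" rule can be enforced by a conditional write. A minimal sketch, assuming a boto3 DynamoDB client and a hypothetical table whose partition key is `message_id`; `claim_message` and the `status` attribute are also hypothetical:

```python
def claim_message(dynamodb, table_name, message_id, body):
    """Record a message; return True only for the first instance to do so."""
    try:
        dynamodb.put_item(
            TableName=table_name,
            Item={
                "message_id": {"S": message_id},
                "body": {"S": body},
                "status": {"S": "NEW"},
            },
            # The write fails if an item with this key already exists.
            ConditionExpression="attribute_not_exists(message_id)",
        )
        return True
    except dynamodb.exceptions.ConditionalCheckFailedException:
        # Another instance inserted it first; it can still read the row.
        return False
```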
Ultimately there are a few solutions for this, but without understanding the use case properly it's hard to make a firm recommendation.
Implement message fanout using Amazon Simple Notification Service (SNS) and Amazon Simple Queue Service (SQS). There is a hands-on Getting Started example of this.
Here's how it works: in the fanout model, service A publishes a message to an SNS topic. Each instance of service B has an associated SQS queue which is subscribed to that SNS topic. The published message is delivered to each subscribed queue and hence to each instance of service B.
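A sketch of the per-instance setup with boto3, assuming each instance of service B already has its own queue (`queue_url`) and that `sns` and `sqs` are boto3 clients; `subscribe_queue_to_topic` is a hypothetical helper name. The queue needs a policy that lets the topic send to it, and `RawMessageDelivery` is an optional choice that delivers the published body without the SNS JSON envelope:

```python
import json


def subscribe_queue_to_topic(sns, sqs, topic_arn, queue_url):
    """Subscribe one instance's SQS queue to the shared SNS topic."""
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Allow only this topic to send messages to the queue.
    # Note: this overwrites any existing queue policy.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arn}},
        }],
    }
    sqs.set_queue_attributes(
        QueueUrl=queue_url, Attributes={"Policy": json.dumps(policy)}
    )

    return sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={"RawMessageDelivery": "true"},
        ReturnSubscriptionArn=True,
    )["SubscriptionArn"]
```

Service A then only calls `sns.publish(TopicArn=..., Message=...)`, and each subscribed queue receives its own copy.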

How to target an HTTP API Gateway or ALB from EventBridge

I need to send data from one ECS container to another. How can I do that? AWS EventBridge lets me send data from an ECS container to EventBridge, but I could not figure out how to send that data on to the other ECS container from EventBridge.
P.S. I have Node applications running in ECS containers. I am using HTTP API Gateway and an Application Load Balancer (ALB).
Answers to questions asked in comments
What kind of data? Text Data
How big is one msg? Small. Just simple objects
Does it have to be real-time or not? No
I need to send data from 1 ecs container to another. How can I do that?
Usually, when you want your microservices to communicate with each other, SQS is a preferred choice. Using SQS lets you fully decouple the producer and the consumer of the messages.
In your case, one container would publish messages to the queue, while the second container would poll the queue for messages (for example, on a fixed schedule). For this to work, both containers need permission to access SQS in their ECS task role (not the task execution role, which ECS itself uses to pull images and write logs), and both use the AWS SDK to send and receive the messages.
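A minimal sketch of both sides, written in Python with boto3 for illustration (the question's containers run Node, and the AWS SDK for JavaScript has equivalent SendMessage, ReceiveMessage and DeleteMessage operations). `publish` and `poll_once` are hypothetical names, and the SQS client is passed in:

```python
import json


def publish(sqs, queue_url, payload):
    """Producer container: send a small JSON object to the queue."""
    resp = sqs.send_message(QueueUrl=queue_url, MessageBody=json.dumps(payload))
    return resp["MessageId"]


def poll_once(sqs, queue_url, handle):
    """Consumer container: long-poll once, handle each message, then delete it."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling: wait up to 20 s for messages
    )
    for msg in resp.get("Messages", []):
        handle(json.loads(msg["Body"]))
        # Delete only after successful handling; otherwise the message
        # becomes visible again after the visibility timeout.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```

The consumer would typically call `poll_once` in a `while True:` loop; because of long polling, an empty queue costs at most one request every 20 seconds.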
There are other choices as well, such as SNS and EventBridge as you noted. However, due to its simplicity, SQS is often the first choice to consider.

AWS Reduce webhooks ec2 impact with queue

I have a PHP web application running on an EC2 instance. The app is integrated with another service, which involves subscribing to a number of webhooks.
The number of requests the server receives from these webhooks has become unmanageable, and I'm looking for a more efficient way to handle the data coming from them.
My initial thought was to use API Gateway to put these requests into an SQS queue and read from this queue in batches.
However, I would like these batches to be read by the EC2 instance, because the code that processes the webhooks is reused throughout my application.
Is this possible, or am I forced to use a lambda function with SQS? Is there a better way?
The approach you suggested (API Gateway + SQS) will work just fine. There is no need to use AWS Lambda. You'll want to use the AWS SDK for PHP when writing the application code that receives messages from your SQS queue.
I've used this pattern before and it's a great solution.
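A minimal sketch of the EC2-side batch reader, written in Python with boto3 for illustration; the AWS SDK for PHP exposes the same operations (receiveMessage, deleteMessageBatch). `drain_batch` is a hypothetical name and `handle_webhook` stands in for your existing webhook code:

```python
def drain_batch(sqs, queue_url, handle_webhook):
    """Receive up to 10 webhook messages, process them, delete the successes.

    Returns the number of messages processed successfully.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    done = []
    for msg in resp.get("Messages", []):
        try:
            handle_webhook(msg["Body"])  # hypothetical: your existing code
        except Exception as exc:
            # Not deleted, so it reappears after the visibility timeout.
            print(f"failed {msg['MessageId']}: {exc}")
            continue
        done.append({"Id": msg["MessageId"], "ReceiptHandle": msg["ReceiptHandle"]})
    if done:
        # One request deletes the whole batch; real code should also
        # inspect the response's "Failed" list.
        sqs.delete_message_batch(QueueUrl=queue_url, Entries=done)
    return len(done)
```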
. . . am I forced to use a lambda function with SQS?
SQS plus Lambda is basically free at this scale. At the time of writing, the free tier includes 1M (million) Lambda requests and 1M (million) SQS requests per month. Remember that each SQS request can carry up to 10 messages, so that is potentially 10M messages, all inside the free tier. Your EC2 instance is likely always on; your Lambda function is not. Even if you only use Lambda to push the SQS data into a data store such as an RDBMS for your EC2 instance to poll periodically, the setup would be bullet-proof and very inexpensive. Once SQS is in place, you could also move the shared EC2 code into Lambda functions, which can now run for up to 15 minutes.
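A minimal sketch of that Lambda side, assuming the function is triggered by the queue and writes each webhook payload to a table that the EC2 application polls. `sqlite3` stands in for the RDBMS purely so the example runs locally; the `webhooks` table and `make_handler` are hypothetical. The returned `batchItemFailures` is Lambda's partial batch response format and only takes effect when `ReportBatchItemFailures` is enabled on the event source mapping:

```python
import json
import sqlite3


def make_handler(conn):
    """Build an SQS-triggered Lambda handler that stores each message body."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS webhooks (message_id TEXT PRIMARY KEY, payload TEXT)"
    )

    def handler(event, context):
        failures = []
        for record in event["Records"]:
            try:
                json.loads(record["body"])  # reject malformed payloads
                with conn:  # commit each message on its own
                    conn.execute(
                        "INSERT OR IGNORE INTO webhooks VALUES (?, ?)",
                        (record["messageId"], record["body"]),
                    )
            except (ValueError, sqlite3.Error):
                failures.append({"itemIdentifier": record["messageId"]})
        # Only the listed messages are returned to the queue for a retry.
        return {"batchItemFailures": failures}

    return handler
```

`INSERT OR IGNORE` keyed on the SQS message ID keeps a redelivered message from creating a duplicate row.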
To cite my sources:
SQS pricing for reference: https://aws.amazon.com/sqs/pricing/
Lambda pricing for reference: https://aws.amazon.com/lambda/pricing/

AWS SQS Asynchronous Queuing Pattern (Request/Response)

I'm looking for help with an architectural design decision I'm making with a product.
We've got multiple producers (Lambda functions invoked through API Gateway) that put messages on an SQS queue (the request queue). There can be multiple simultaneous calls, so multiple Lambda instances would be running in parallel.
Then we have consumers (let's say twenty EC2 instances) that long-poll the SQS queue for messages and process them. Each message takes about 30-45 seconds to process.
I would then ideally like to send the response back to the producer that issued the request, and this is the part I'm struggling with in SQS. In theory I would have a separate response queue that the original Lambda producers consume, but there doesn't seem to be a way to pick out the specific correlated response: each Lambda function might pick up another function's response. I'm looking for something similar to this design pattern: http://soapatterns.org/design_patterns/asynchronous_queuing
The only option I can see is to create a new SQS response queue for each Lambda API call and pass its ARN in the message for the consumers to put the response on, but I can't imagine that's very efficient, especially with potentially hundreds of messages a minute. Am I missing something obvious?
I suppose the only other alternative would be setting up a bigger message broker environment (e.g. RabbitMQ/ActiveMQ), but I'd like to avoid that if possible.
Create a (Temporary) Response Queue For Every Request
Too late for the party, but I was hoping to find some help with what I want to achieve (@MattHouser, @Zaheer Ally), or to give an idea to someone working on a related issue.
I am facing a similar challenge. I have an API that, upon a client request, needs to call multiple external APIs and collect their (delayed) results.
Since my PHP API is synchronous, it can only perform these requests sequentially. So I was thinking of using a request queue to which the producer (the API) would send messages. Multiple workers would then consume these messages, each performing one of the external API calls.
To get the results back, the producer would create a temporary response queue and embed its identifier in each message sent to the workers. Each worker would then publish its results to this temporary queue.
Meanwhile, the producer would keep polling the temporary queue until it had received the expected number of messages. Finally, it would delete the queue and send the collected results back to the client.
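The producer side of this temporary-queue idea can be sketched with boto3 as follows. This is a minimal sketch, not a tested design: `scatter_gather` and the `ReplyTo` message attribute name are hypothetical, and workers would need to request that attribute (`MessageAttributeNames=["ReplyTo"]`) when they receive a message:

```python
import json
import time
import uuid


def scatter_gather(sqs, request_queue_url, tasks, timeout=60):
    """Send one request per task and collect replies on a temporary queue."""
    reply_url = sqs.create_queue(QueueName=f"reply-{uuid.uuid4()}")["QueueUrl"]
    try:
        for task in tasks:
            sqs.send_message(
                QueueUrl=request_queue_url,
                MessageBody=json.dumps(task),
                # Workers read this attribute to know where to send the result.
                MessageAttributes={
                    "ReplyTo": {"DataType": "String", "StringValue": reply_url}
                },
            )
        results = []
        deadline = time.monotonic() + timeout
        while len(results) < len(tasks) and time.monotonic() < deadline:
            resp = sqs.receive_message(
                QueueUrl=reply_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
            )
            for msg in resp.get("Messages", []):
                results.append(json.loads(msg["Body"]))
                sqs.delete_message(QueueUrl=reply_url, ReceiptHandle=msg["ReceiptHandle"])
        return results
    finally:
        # Remove the temporary queue even on timeout or error.
        sqs.delete_queue(QueueUrl=reply_url)
```

Creating and deleting a queue per request adds two extra API calls and some latency, which is the efficiency concern raised in the question.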
Yes, you could use RabbitMQ for a more "rpc" queue pattern.
But if you want to stay within AWS, try using something other than SQS for the response.
Instead, you could use S3 for the response. When your producer puts the item into SQS, include in the message an S3 destination for the response. When your consumer completes the tasks, put the response in the desired S3 location.
Then you can check S3 for the response.
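A minimal sketch of the S3 variant with boto3 clients passed in; `send_request`, `write_response`, `wait_for_response` and the `responses/` key prefix are hypothetical. Note that S3 reports a missing key as `NoSuchKey` only if the caller has `s3:ListBucket` permission; otherwise it returns an access-denied error instead:

```python
import json
import time
import uuid


def send_request(sqs, queue_url, bucket, payload):
    """Producer: enqueue a request that names the S3 key for its response."""
    response_key = f"responses/{uuid.uuid4()}.json"
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(
            {"payload": payload, "response_bucket": bucket, "response_key": response_key}
        ),
    )
    return response_key


def write_response(s3, request_body, result):
    """Consumer: store the result where the producer asked for it."""
    req = json.loads(request_body)
    s3.put_object(
        Bucket=req["response_bucket"], Key=req["response_key"], Body=json.dumps(result)
    )


def wait_for_response(s3, bucket, key, timeout=60, interval=2):
    """Producer: poll S3 until the response object exists or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            obj = s3.get_object(Bucket=bucket, Key=key)
            return json.loads(obj["Body"].read())
        except s3.exceptions.NoSuchKey:
            time.sleep(interval)
    return None
```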
Update
You may be able to accomplish an RPC-like message queue using Redis.
https://github.com/ServiceStack/ServiceStack/wiki/Messaging-and-redis
Then, you can use AWS ElastiCache for your Redis cluster. This would completely replace the use of SQS.
Another option would be to use Redis's pub/sub mechanism to notify your Lambda function asynchronously that the backend work is done. You can use Amazon ElastiCache for Redis for an all-AWS-managed solution. Your Lambda function would generate a UUID for each request, subscribe to a channel named after it, pass the UUID along in the SQS message, and the backend workers would then publish a notification to that channel when the work is done.
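A minimal sketch of both sides, assuming a redis-py client (`pubsub()`, `subscribe`, `get_message`, `publish`) and a boto3 SQS client are passed in; `request_and_wait`, `notify_done` and the `channel` field in the message are hypothetical names, and the default timeout is arbitrary. The subscription is opened before the SQS message is sent because Redis pub/sub does not store messages, so a notification published earlier would be lost:

```python
import json
import time
import uuid


def request_and_wait(redis_client, sqs, queue_url, payload, timeout=25):
    """Lambda side: subscribe to a per-request channel, enqueue the job, wait."""
    channel = str(uuid.uuid4())
    pubsub = redis_client.pubsub()
    pubsub.subscribe(channel)  # subscribe BEFORE sending the job
    try:
        sqs.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"channel": channel, "payload": payload}),
        )
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            msg = pubsub.get_message(ignore_subscribe_messages=True, timeout=1.0)
            if msg and msg["type"] == "message":
                return json.loads(msg["data"])
        return None  # timed out
    finally:
        pubsub.unsubscribe(channel)
        pubsub.close()


def notify_done(redis_client, request_body, result):
    """Worker side: publish the result on the channel named in the request."""
    channel = json.loads(request_body)["channel"]
    redis_client.publish(channel, json.dumps(result))
```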
I was facing this same problem so I tried it out, and it does work. Whether it's worth the effort over just polling S3 is another question. You have to configure the lambda functions to run inside your VPC, so they can access your Redis. I was going to have to do this anyway since I'd want the workers, in my case also lambda functions, to be able to access my Elasticsearch and RDS. But there are some considerations: most importantly, you need to use a private subnet with a NAT Gateway (or your own NAT Instance), so it can get out to the Internet and AWS managed services (including SQS).
One other thing I just stumbled across: at the time of writing, requests through API Gateway could not take longer than 29 seconds, and AWS would not increase this limit. You mentioned your jobs take 30 or more seconds, so this could be a showstopper for using API Gateway and Lambda in this way anyway.
AWS now provides a Java client that supports temporary queues. This is useful for request/response patterns. I can't see a non-Java version.

Build a firebase / fanout.io like service on amazon web services aws

I am using firebase to notify web browsers (javascript clients) about changes on specific topics. I am very happy with it. However I would really like to (only) use aws web services.
Unfortunately I am not able to determine whether it is possible to build such a service on aws. I am not talking about having EC2 instances running some firebase / fanout.io alternatives. I am talking about utilizing services such as lambda, dynamodb streams, SNS & SQS.
Are there any socket notification services available or is it possible to achieve something similar by using the provided services?
I looked into this recently with the same idea, but eventually settled on just using fanout. At the time of writing, AWS does not provide server-push HTTP notification services out of the box.
Lambda functions are billed for execution time (in 100 ms increments when this was written, per 1 ms since late 2020), so any long polling against Lambda will be billed for the entire time the client is connected.
SNS does not provide long polling to browsers; the available clients are geared towards mobile, email, HTTP/S, and other Amazon products like Lambda and SQS.
SQS would require a dedicated queue per client as it does not support broadcast.
Now, if the lambda pricing doesn't bother you, you could possibly do this:
Write a lambda function that is called via the API service that opens up a connection to SQS and waits for a message. The key is to start the lambda call from HTTP, but within the function wait on the queue (using Boto, for example, if you are writing this in Python). This code would need to create a queue dedicated to servicing one particular client, uniquely keyed by something like a GUID that is passed in by the client.
Link to the lambda function using the Amazon API service.
Call the lambda function via the API from the browser and wait for it to either receive a message on the dedicated SQS queue or time out, probably using long polling on both the API connection and the SQS connection. Fully draining the queue (or at least taking as many messages as possible in a batch, up to some limit) would also be advisable, to reduce the number of calls to the API.
Publish your event to the dedicated SQS queue associated with the client. This will require the publisher to know the client's unique key.
Return the event read from SQS as the result of the lambda call.
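The Lambda function described in these steps could look roughly like this. It is a minimal sketch, assuming a boto3 SQS client is passed in and that each per-client queue is named `client-<GUID>` (a hypothetical naming scheme); `poll_client_queue` is also a hypothetical name. `create_queue` returns the existing queue's URL when called again with the same (here: default) attributes, so it doubles as a lookup:

```python
def poll_client_queue(sqs, client_id, wait_seconds=20, max_messages=10):
    """Wait for events for one client; return their bodies (may be empty)."""
    queue_url = sqs.create_queue(QueueName=f"client-{client_id}")["QueueUrl"]
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,  # drain up to 10 at once
        WaitTimeSeconds=wait_seconds,      # SQS long polling, at most 20 s
    )
    messages = resp.get("Messages", [])
    if messages:
        # Delete what we are about to return so it is not delivered again.
        sqs.delete_message_batch(
            QueueUrl=queue_url,
            Entries=[
                {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                for i, m in enumerate(messages)
            ],
        )
    return [m["Body"] for m in messages]
```

The Lambda handler behind the API endpoint would read the client's GUID from the request, call this with `boto3.client("sqs")`, and return the list as the response body. Deleting before the HTTP response is delivered means an event can be lost if that response never reaches the browser.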
Some problems with this approach:
Lambda pricing - not terribly expensive, but something like fanout is basically free
You would need a dedicated SQS queue per client; cleanup might become a problem
SQS bills on number of calls, which includes checking for a message. Long-polling SQS will alleviate some of this
You would need to write the JavaScript client to call the lambda API endpoint repeatedly in a long-polling fashion
Lambda limits the number of concurrently running function instances (100 at the time of writing, but you can contact support to raise that)
Some benefits with this approach:
SQS queues are persistent, so unless a message is processed successfully it will go back on the queue after the visibility timeout
You can set up CloudWatch to monitor all of the API, Lambda, and SQS events
Other Notes
You could call the SQS APIs directly from the browser by using Lambda to issue temporary security credentials via STS. Receiving a message in JavaScript is documented here: http://docs.aws.amazon.com/AWSJavaScriptSDK/guide/browser-examples.html#Receiving_a_message I do not, however, know off the top of my head if you would run into cross-domain issues.
Your only other option, if it must be all AWS, is to use load-balanced EC2 instances running something like fanout as you mentioned.
Using fanout is very little work: it's both extremely affordable and already built and tested.