I am looking at using SNS-SQS to deliver updates to machines running the same service. Since the plan is not for the machines to communicate with each other, I was planning on creating an SQS queue for each machine (created at startup).
I am, however, not sure how to use a Dead-Letter Queue (DLQ) in this case. Should each queue have its own DLQ, or can I have a common one shared across all my queues in the region? My concern with the former approach is that too many queues would be created (2x the number of machines), and with the latter, potential multiple copies of the same message in the queue.
What is the best practice and recommended approach when using multiple SQS queues?
I wouldn't be concerned with the number of queues - they don't cost anything - so it really depends on how you plan on using the items in the dead-letter queue. I'll make the assumption that you will have some sort of process to review items in the DLQ to figure out why they were not processed before expiring.
Without knowing the details of what you plan to do, I would think a single DLQ would be better, and if you need to periodically process DLQ records, the processing app/system only needs to monitor that single queue.
Can't see the advantage of multiple DLQs in this case, at least based on your question.
As you are planning a fanout, having multiple queues does no harm as long as they are used for asynchronous processing; otherwise a single queue is preferred. A fanout is generally used when you want to process tasks concurrently by dividing them among several queues and working on them separately.
The purpose of a Dead-Letter Queue (DLQ) is to store messages that could not be processed successfully. Unless your process has a major fault, the number of messages that end up in a DLQ should be very small, so it is fine to use one DLQ for all your other queues.
Having multiple DLQs brings overhead: several processes have to poll them for failed messages. A single DLQ reduces this overhead.
It is recommended to use multiple DLQs only if you want to store different categories of failed messages separately.
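To illustrate the shared-DLQ setup, here is a minimal boto3 sketch (queue names are made up): each per-machine queue points its RedrivePolicy at the same DLQ.

```python
import json
import boto3

sqs = boto3.client("sqs")

# One shared dead-letter queue for the whole fleet.
dlq_url = sqs.create_queue(QueueName="service-updates-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# Each machine creates its own queue at startup, all redriving to the same DLQ.
redrive_policy = json.dumps({"deadLetterTargetArn": dlq_arn, "maxReceiveCount": 5})
for machine_id in ["machine-1", "machine-2"]:
    sqs.create_queue(
        QueueName=f"service-updates-{machine_id}",
        Attributes={"RedrivePolicy": redrive_policy},
    )
```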
Related
I have a problem where we may have different SLOs (service level objectives) based on the request. Some requests we want to process within 5 minutes, and some can take longer (2 hours, say).
I was going to use Amazon SQS to queue up the messages that need to be processed, and then use auto-scaling to add resources in order to process within the allotted SLO. For example, if 1 machine can process 1 request every 10 seconds, then within 5 minutes it can process 30 messages. If I detect that the number of messages in the queue is > 30, I should spawn another machine to meet the 5-minute SLO.
Similarly, if I have a 2-hour SLO, I can have a backlog as large as 720 before I need to scale up.
Based on this, I can't really place these different SLOs into the same queue, because then they will interfere with each other.
Possible approaches I was considering:
1. Have an SQS queue for each SLO, and auto-scale accordingly.
2. Have multiple message groups (one for each SLO), and auto-scale based on message group.
Is (2) possible? I couldn't find documentation on it. If both are possible, what are the pros and cons of each?
If you have messages to process with different priorities, the normal method is:
Create 2 Amazon SQS queues: One for high-priority messages, another for 'normal' messages
Have the workers pull from the high-priority queue first. If it is empty, pull from the other queue.
However, this means that 'normal' messages might never get processed if there are always messages in the high-priority queue, so you could instead have a certain number of workers pulling 'high then normal', and other workers just pulling from 'normal'.
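A minimal sketch of that polling order (the queue URLs and the handler are placeholders, not from the question):

```python
import boto3

sqs = boto3.client("sqs")

# Hypothetical queue URLs for the two priority levels.
HIGH = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs-high"
NORMAL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs-normal"

def process(body: str) -> None:
    print("processing", body)  # placeholder for real work

while True:
    for url in (HIGH, NORMAL):
        resp = sqs.receive_message(
            QueueUrl=url, MaxNumberOfMessages=1, WaitTimeSeconds=1
        )
        if resp.get("Messages"):
            msg = resp["Messages"][0]
            process(msg["Body"])
            sqs.delete_message(QueueUrl=url, ReceiptHandle=msg["ReceiptHandle"])
            break  # start over at the high-priority queue
```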
An even better way would be to process the messages with AWS Lambda functions. The default concurrency limit of 1000 can be increased on request. AWS Lambda would take care of all scaling and capacity issues and would likely be the cheaper option, since there is no cost for idle time.
To me this seemed like a simple use case when I started, but it turned out a lot harder than I had anticipated.
Problem
I have an AWS SQS queue acting as a job queue that triggers a worker AWS Lambda. However, since the worker Lambdas share non-scalable resources, it is important to limit the number of concurrently running Lambdas to (for the sake of example) no more than 5.
Simple enough, according to Managing Concurrency for a Lambda Function
Reserved concurrency also limits the maximum concurrency for the function, and applies to the function as a whole
However, setting the Reserved concurrency property to 5 seems to be completely ignored by SQS, with the queue's Messages in Flight property in my case showing closer to 20-30 concurrent executions, depending on the number of messages put into the queue.
Question
The closest I have come to a solution is to use an SQS FIFO queue and set the MessageGroupId to a value randomly selected from (or alternating between) 1-5. However, due to uneven workload this is not optimal, as it would be better to have the concurrency distributed by actual workload rather than by chance.
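For reference, a sketch of that workaround, with a hypothetical queue URL. Lambda keeps messages of the same group from being processed concurrently, so five group IDs cap the concurrency at roughly five:

```python
import random
import uuid
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs.fifo"  # hypothetical

def enqueue(body: str) -> None:
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=body,
        # Five possible group IDs -> at most ~5 concurrent Lambda batches.
        MessageGroupId=str(random.randint(1, 5)),
        MessageDeduplicationId=str(uuid.uuid4()),
    )
```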
I have also tried AWS Step Functions, as the Map state has a MaxConcurrency parameter, which seemed to work well on small job queues; but because each state has an input/output limit of 32 KB, this was not feasible in my use case.
Has anyone found a better or alternative solution? Are there any other ways Reserved concurrency is supposed to be used?
Similar
Here are some similar questions I have found, but I think my question is different because I am not interested in limiting the total number of invocations, and (although I have not tried it myself) I can't see why triggers from S3 or a Kinesis Stream would behave differently from SQS.
According to the AWS docs, SQS doesn't take reserved concurrency into account. If the number of batches to be processed is greater than the reserved concurrency, your messages might end up in a dead-letter queue:
If your function returns an error, or can't be invoked because it's at maximum concurrency, processing might succeed with additional attempts. To give messages a better chance to be processed before sending them to the dead-letter queue, set the maxReceiveCount on the source queue's redrive policy to at least 5.
https://docs.aws.amazon.com/lambda/latest/dg/with-sqs.html
You can check this article for details: https://zaccharles.medium.com/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
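A sketch of applying that advice with boto3 (the queue URL and DLQ ARN are placeholders):

```python
import json
import boto3

sqs = boto3.client("sqs")

# Give messages at least 5 delivery attempts before they move to the DLQ.
sqs.set_queue_attributes(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/source-queue",
    Attributes={
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:source-dlq",
            "maxReceiveCount": 5,
        })
    },
)
```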
This issue is resolved as of Jan 2023: you can use maximum concurrency, as suggested in the blog post below. I was using FIFO with a group ID because my backend was non-scalable and I wanted to avoid throttling, since piling up messages in the DLQ does not help.
https://aws.amazon.com/blogs/compute/introducing-maximum-concurrency-of-aws-lambda-functions-when-using-amazon-sqs-as-an-event-source/
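A sketch of the setting from that post, applied when creating the event source mapping (function and queue names are placeholders). Unlike reserved concurrency, this caps how many Lambdas the SQS trigger will run at once instead of throttling invocations:

```python
import boto3

lambda_client = boto3.client("lambda")

lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:sqs:us-east-1:123456789012:job-queue",
    FunctionName="worker-function",
    BatchSize=10,
    # Caps concurrent invocations from this queue (allowed range is 2-1000).
    ScalingConfig={"MaximumConcurrency": 5},
)
```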
I have multiple sources pushing raw data to S3, and I have configured an SQS event notification on my S3 bucket.
The problem is the lag and limitations.
I anticipate more sources in the near future, and since we can get only 10 messages in a single poll from SQS, I think the queue will fill up with thousands of messages that I won't be able to process fast enough.
I am thinking of fanning out by spreading the messages from my master SQS queue across more SQS queues (e.g., 5), so that my processing layer can poll multiple queues and process more messages.
What would be the right approach?
"... since we can get only 10 Messages in a single poll from SQS...I am thinking to fan-out sqs like spreading the message to more SQS queues from my master SQS queue, so that my processing layer can poll multiple queues eg : 5 queues and process more messages."
Short Answer: Don't do this.
Here's why:
Yes, a single poll can retrieve up to 10 messages. However, you can have multiple threads and multiple hosts all polling a single queue. Getting your consumers to run in parallel is the key here, as processing queue entries will be your bottleneck - not retrieving entries from the queue. A single SQS queue can handle tons of polling threads.
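For instance, a sketch of several threads long-polling one queue (the queue URL and handler are placeholders):

```python
import threading
import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"  # hypothetical

def handle(body: str) -> None:
    print("processing", body)  # placeholder for real work

def worker() -> None:
    sqs = boto3.client("sqs")  # one client per thread
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,  # the per-poll cap the question mentions
            WaitTimeSeconds=20,      # long polling
        )
        for msg in resp.get("Messages", []):
            handle(msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

# Ten pollers against the same queue; scale the thread count to your workload.
threads = [threading.Thread(target=worker, daemon=True) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```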
A multi-queue fanout as you proposed would have a number of drawbacks:
More complicated to code & operate
Slower - items will have to go through the overhead of transfer from your main queue (or SNS if you use that) to the consumption queues
More expensive - SQS charges per message. SNS charges per message.
You'll have to deal with duplication on your own - with a single queue, SQS built-in visibility timeout will mostly prevent other consumers from working on the same items. With multiple queues, you'll have to come up with a deduplication strategy of your own
Just use a single queue. You'll thank me later.
The typical way to fan out messages to multiple SQS queues is to use SNS.
The S3 event notifications would go to SNS instead of SQS, and SNS would be responsible for fanning those messages out to as many queues as you want.
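A sketch of that wiring (ARNs are placeholders). Note that each queue also needs an access policy allowing the topic to send to it, which is omitted here:

```python
import boto3

sns = boto3.client("sns")

# S3 event notifications target this topic; the topic fans out to the queues.
topic_arn = sns.create_topic(Name="s3-object-created")["TopicArn"]

for queue_arn in [
    "arn:aws:sqs:us-east-1:123456789012:consumer-1",
    "arn:aws:sqs:us-east-1:123456789012:consumer-2",
]:
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={"RawMessageDelivery": "true"},  # deliver without the SNS envelope
    )
```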
That said, I am not sure I understand why you think you will be able to process messages faster if you use multiple message queues.
A single SQS queue polled by multiple processing clients, or by a single client using multi-threading, is probably a better way to improve processing speed than simply introducing more queues.
I have 2 FIFO SQS queues which receive JSON messages that are to be indexed into Elasticsearch. One queue constantly receives delta changes to the database. The second queue is used for database re-indexing, i.e. the entire 50 TB of data is to be indexed every couple of months (where everything is added to the queue). I have a Lambda function that consumes the messages from the queues and writes them into the appropriate index (either the active index or the one being rebuilt).
How should I trigger the Lambda function to best process the backlog of messages in SQS so that it processes both queues as quickly as possible?
A constraint I have is that the queue items need to be processed in order. If the Lambda function could run indefinitely without the 5-minute limit, I could keep one function running that constantly processes messages.
Instead of pushing your messages directly into SQS, you could publish them to an SNS topic with 2 subscribers registered:
Subscriber: SQS
Subscriber: Lambda Function
This has the benefit that your Lambda is invoked at the same time as the message is stored in SQS.
The standard way to do this is to use CloudWatch Events rules that run periodically. This lets you pull data from the queue on a regular schedule.
Because you have to poll SQS, this may not lead to the fastest processing of messages. Also, be careful if you constantly have messages to process: Lambda will end up being far more expensive than a small EC2 instance handling the messages.
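A sketch of such a schedule (names and ARN are placeholders; the permission that lets CloudWatch Events invoke the function is omitted):

```python
import boto3

events = boto3.client("events")

# Invoke the worker Lambda every minute to drain the queues.
events.put_rule(Name="drain-index-queues", ScheduleExpression="rate(1 minute)")
events.put_targets(
    Rule="drain-index-queues",
    Targets=[{
        "Id": "index-worker",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:index-worker",
    }],
)
```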
Not sure I fully understand your problem, but here are my 2 cents:
If you have a constant, real-time stream of data, consider using Kinesis Streams with 1 shard in order to preserve FIFO ordering. You may consume the data in batches of n items using Lambda; it's up to you to decide the batch size n and the memory size of the Lambda.
With this solution you pay a low constant price for Kinesis Streams and a variable price for the Lambdas.
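A sketch of that setup with made-up names; n here is the BatchSize on the event source mapping:

```python
import boto3

kinesis = boto3.client("kinesis")
lambda_client = boto3.client("lambda")

# A single shard preserves ordering across the whole stream.
kinesis.create_stream(StreamName="index-events", ShardCount=1)

# Lambda consumes the shard in batches of n = 100 records.
lambda_client.create_event_source_mapping(
    EventSourceArn="arn:aws:kinesis:us-east-1:123456789012:stream/index-events",
    FunctionName="index-worker",
    StartingPosition="TRIM_HORIZON",
    BatchSize=100,
)
```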
Should you really be in love with SQS and real-time does not matter, you may consume items with Lambda, EC2, or Batch: either trigger many Lambdas with CloudWatch Events, keep an EC2 instance alive, or trigger an AWS Batch job on a regular basis.
There is an economic equation to explore; each solution is the best for one use case and the worst for another. Make your choice ;)
I prefer SQS + Lambdas when there are few items to consume and SQS + Batch when there are a lot of items to consume.
You may probably also consider using SNS + SQS + Lambdas like #maikay says in his answer, but I wouldn't choose that solution.
Hope it helps. Feel free to ask for clarifications. Good luck!
We plan to use the AWS SQS service to queue events created from a web service and then use several workers to process those events. An event can only be processed once. According to the AWS SQS documentation, an SQS standard queue can "occasionally" produce duplicate messages, but with unlimited throughput, while an SQS FIFO queue will not produce duplicate messages but is limited to 300 API calls per second (with batchSize=10, equivalent to 3000 messages per second). Our current peak-hour traffic is only 80 messages per second, so both are fine in terms of throughput. But when I started to use the SQS FIFO queue, I found that I need to do extra work, like providing the extra parameters "MessageGroupId" and "MessageDeduplicationId" or enabling the "ContentBasedDeduplication" setting. So I am not sure which is the better solution. We just need the messages not to be duplicated; we don't need them to be FIFO.
Solution #1:
Use an AWS SQS FIFO queue. For each message, generate a UUID for the "MessageGroupId" and "MessageDeduplicationId" parameters.
Solution #2:
Use an AWS SQS FIFO queue with "ContentBasedDeduplication" enabled. For each message, generate a UUID for the "MessageGroupId" parameter.
Solution #3:
Use an AWS SQS standard queue with AWS ElastiCache (either Redis or Memcached). For each message, the "MessageId" field will be saved in the cache server and checked for duplication later on; existence means the message has already been processed. (By the way, how long should the "MessageId" exist in the cache server? The AWS SQS documentation does not mention how far apart duplicates of a message can arrive.)
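A sketch of Solution #3 (the cache endpoint, queue URL, and the 24-hour TTL are all assumptions, since AWS does not document the duplication window):

```python
import boto3
import redis

sqs = boto3.client("sqs")
cache = redis.Redis(host="my-cache.example.com", port=6379)  # hypothetical ElastiCache endpoint
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/events"

def process(body: str) -> None:
    print("processing", body)  # placeholder for real work

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        # SET with NX succeeds only for the first writer, so duplicates are skipped.
        first_delivery = cache.set(f"seen:{msg['MessageId']}", 1, nx=True, ex=86400)
        if first_delivery:
            process(msg["Body"])
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```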
You are making your systems complicated with SQS.
We have moved to Kinesis Streams and it works flawlessly. Here are the benefits we have seen:
Order of Events
Trigger an Event when data appears in stream
Deliver in Batches
Leave the responsibility to handle errors to the receiver
Go Back with time in case of issues
Buggier Implementation of the process
Higher performance than SQS
Hope it helps.
My first question would be: why is it even so important that you don't get duplicate messages? An ideal solution would be to use a standard queue and design your workers to be idempotent. For example, if the messages contain something like a task ID and you store the completed task's result in a database, ignore messages whose task ID already exists in the DB.
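A sketch of that idempotency check, using sqlite3 as a stand-in for a real database:

```python
import sqlite3  # stand-in for your real database

db = sqlite3.connect("results.db")
db.execute("CREATE TABLE IF NOT EXISTS results (task_id TEXT PRIMARY KEY, result TEXT)")

def run_task(payload: str) -> str:
    return payload.upper()  # placeholder for the actual work

def handle(task_id: str, payload: str) -> None:
    """Idempotent worker: a duplicate delivery of the same task ID is a no-op."""
    already = db.execute(
        "SELECT 1 FROM results WHERE task_id = ?", (task_id,)
    ).fetchone()
    if already:
        return  # duplicate delivery; skip
    # The PRIMARY KEY constraint backstops any race between check and insert.
    db.execute("INSERT INTO results (task_id, result) VALUES (?, ?)",
               (task_id, run_task(payload)))
    db.commit()
```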
Don't use receipt-handles for handling application-side deduplication, because those change every time a message is received. In other words, SQS doesn't guarantee same receipt-handle for duplicate messages.
If you insist on deduplication, then you have to use a FIFO queue.