Managing SQS Queue Manually vs Lambda Trigger

I'm not sure if this would be better served on ServerFault or Software Engineering, willing to move this post if appropriate.
We have somewhat recently started to move some of our data processing pipeline to use queues to manage individual bits of data, whereas previously we had timed lambdas that would pull all data since last change.
While making this change, we noticed that queues didn't work quite as we had anticipated. First of all, we thought Lambda would just pull items off the queue as the Lambdas had availability. Instead, it seems the AWS-managed Lambda trigger grabs a chunk of messages (up to ten) and throws it at the Lambda service. If Lambda doesn't have availability, the message gets throttled, then replayed after a backoff time, up to our configured replay ("error") limit (five). After that, it's thrown into our dead letter queue.
We see a handful of messages per day end up in the dead letter queue as a result of throttling. We then throw these back into the main queue (we have a process that does so every few hours). However, we weren't 100% sure throttling was the reason they were being pushed over, since nothing indicates why the messages were moved - we just assumed as much because we weren't getting any error logs for those messages. We contacted Amazon support to ask about this, and they were able to confirm the messages were in fact "errored" as a result of throttling.
We asked further about their recommendations for this - this must be a common problem, right? They first suggested upping our replay limit, which seemed an obvious no-go. Replays occur for any failure, so that would just hammer our Lambdas with bad requests when they came through. We also asked if there's any way to differentiate the errors, because throttling doesn't bother us - we'd happily let those retry a dozen times if needed - but no. The other suggestion they had was to manage the queue ourselves from our Lambdas: build our own code within our Lambdas to pull messages and then delete them after processing. This seems really counter-intuitive, though - why would every AWS consumer build their own infrastructure?
So I guess my question is, is this what others are doing? Are you using the built-in Lambda triggers? Are you creating your own code for managing queue consumption? Do you see this sort of throttling, or is there anything we could do differently? Are there any differences with other services for managing this?

Best practice is to handle errors in your code and manually delete messages that have succeeded. That allows you to handle poison messages without reprocessing the good messages again. Throttles shouldn't be ending up in a DLQ that often. This video from re:Invent 2020 has a good explanation of how this works: Scalable serverless event-driven architectures with SNS, SQS & Lambda. Start at about the 20-minute mark to get into SQS error handling.
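As a rough illustration of that pattern, here's a minimal Python sketch for an SQS-triggered Lambda. It assumes the source queue URL is passed in via an environment variable; the variable name and the process() helper are made up for the example:

```python
import os
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = os.environ["QUEUE_URL"]  # hypothetical: source queue URL passed in via env

def process(body):
    ...  # your real business logic; raise on genuine failure

def handler(event, context):
    failures = 0
    for record in event["Records"]:
        try:
            process(record["body"])
        except Exception:
            failures += 1  # leave this (possibly poison) message in the queue
            continue
        # delete successes ourselves so a retried batch doesn't reprocess them
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=record["receiptHandle"])
    if failures:
        # raising sends the batch back to the queue; only the still-undeleted
        # (failed) messages will come around again
        raise RuntimeError(f"{failures} message(s) failed; batch will be retried")
```

Because the good messages have already been deleted, raising at the end only sends the failed ones back for another attempt (or eventually to the DLQ).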

Related

Is it possible to selectively read from AWS SQS?

I have a use-case. I want to read from SQS always, except when another event happens.
For instance, I have football news going into SQS as messages. I want to retrieve them always, except when live matches are happening.
Is there any way to read unless another event is taking place?
I scrolled the docs and Stack Overflow, but I don't see a solution.
COMMENT: I have a small and weak service, and because of technical limitations I cannot scale it up (memory/CPU, etc.), but I still want the two "conflicting" flows to live in the same service. They are both supposed to communicate with the same API, and I don't want them to send conflicting requests.
Is there a way to do it, or will I have to write a custom communicator with SQS?
You can't select which messages you want to read from SQS and which you'd rather not - there is no filtering in SQS.
If you have messages that need to be processed at all times and others that need to be processed only sometimes or in batches, you should put them in separate queues and read from them separately.
You don't say anything about the infrastructure that reads from the queue, but if it's a process on EC2, you could just stop it while live matches are happening and restart it later. SQS is built for asynchronous messaging and will store the messages for up to 14 days (depending on your configuration) until a consumer is available to read them.
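If the consumer is something you run yourself (an EC2 process, for example), a rough sketch of pausing consumption around the conflicting event could look like this; the queue URL, is_live_match_happening() and handle() are placeholders for your own logic:

```python
import time
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/football-news"  # placeholder

def is_live_match_happening():
    return False  # placeholder: check your own flag/API here

def handle(body):
    ...  # placeholder for however you process a news item

def poll_forever():
    while True:
        if is_live_match_happening():
            time.sleep(60)  # pause consumption; messages simply wait in the queue
            continue
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling
        )
        for msg in resp.get("Messages", []):
            handle(msg["Body"])
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

While the flag is set, nothing is lost - the messages just sit in the queue until you resume polling (within the retention period).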

SNS > AWS Lambda asyncronous invocation queue vs. SNS > SQS > Lambda

Background
This architecture relies solely on Lambda's asynchronous invocation mechanism as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data that can vary drastically in size (tens of KB to potentially 1-3 MB). The data contains a JSON array containing one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subscribed to the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in AWS-managed Lambda asynchronous queue begins to fill up as the instances of the parser max out at 3. The Lambda queueing mechanism initiates retries with incremental backoff when throttling occurs, until the invocation request can be processed by the parser function.
It is imperative that a record does not get lost during this process, as records cannot be resurrected. I will be using dead letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method out resulted in no lost invocations. Everything worked as expected. Lambda reported hundreds of throttle responses, but I'm relying on this to initiate the Lambda retry behaviour for async invocations. My understanding is that this behaviour is effectively the same as that which I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
Questions
1. Is the built-in AWS-managed Lambda asynchronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prolonged periods, so I want to understand whether the Lambda queue can handle this as sensibly as SQS would. The main part that concerns me is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a definitive answer, but I'm hoping the message would end up in the configured dead letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/@zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail; I can't find it at the moment. The premise is the same as this: queues are leaky buckets.
So what if I add more leaky buckets? Well, you've delayed the leaking, however it's now leaking into another bucket. Have you solved the problem or just delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
An MVCE is missing from this question, so I cannot address the precise problem you are having.
As for an opinion on which to choose between SQS and the Lambda queue, I'll point to the Meta discussion on this:
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR;
It depends
I think the biggest advantage of using your own queue is the fact that you as a user have visibility into the state of your backpressure.
Using the Lambda async invoke method, you have the potential to get throttled exceptions with the 'guarantee' that Lambda will retry over an interval. If using an SQS source queue instead, you have complete visibility into the state of your message processing at all times, with no ambiguity.
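For instance, with a source queue you can inspect the backlog yourself at any time; a quick boto3 sketch (the queue URL is a placeholder):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-source-queue"  # placeholder

attrs = sqs.get_queue_attributes(
    QueueUrl=queue_url,
    AttributeNames=[
        "ApproximateNumberOfMessages",           # backlog waiting to be processed
        "ApproximateNumberOfMessagesNotVisible", # currently in flight with consumers
    ],
)["Attributes"]

print("backlog:", attrs["ApproximateNumberOfMessages"])
print("in flight:", attrs["ApproximateNumberOfMessagesNotVisible"])
```

The same figures are also available as CloudWatch metrics, so you can alarm on a growing backlog rather than guessing from throttle counts.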
Secondly, regarding overpulling: in theory this is a concern, but in practice it's never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously set your retry policy appropriately and use a DLQ, as transient/unpredictable errors CAN occur.

What is the difference between Kinesis and SQS?

I know there are a lot of materials online about this question, however I have not found any that explain it quite clearly to a rookie like me... I'd appreciate it if someone could help me understand the key differences between these two services, and their use cases, with real-life examples. Thank you!
Amazon SQS is a queue. The basic process is:
Messages are sent to the queue. They stay there for up to 14 days.
Worker programs can request a message (or up to 10 messages) from the queue.
When a message is retrieved from the queue:
It stays in the queue but is marked as invisible
When the worker has finished processing the message, it tells SQS to delete the message from the queue
If the worker does not delete the message within the queue's invisibility timeout period, then the message reappears on the queue for another worker to process
The worker can, if desired, periodically tell SQS to keep a message invisible because it is still being processed
Thus, once a message is processed, it is deleted.
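In code, that receive → process → delete cycle looks roughly like this (a boto3 sketch; the queue URL and do_work() are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def do_work(body):
    ...  # placeholder for the worker's processing

resp = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,  # up to 10 messages per request
    WaitTimeSeconds=20,      # long polling
    VisibilityTimeout=60,    # message stays invisible for 60s while we work on it
)

for msg in resp.get("Messages", []):
    do_work(msg["Body"])
    # deleting tells SQS we're done; otherwise the message reappears after 60s
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```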
In Amazon Kinesis, a message is sent to a stream. The stream is divided into shards (think of them as mini-streams). When a message is received, Kinesis stores the message in sequential order. Then, workers can request a message from the start of the stream, or from a specific spot in the stream. For example, if it has already processed 5 messages, it can ask for the 6th message. The messages are retained in the stream for a period of time (eg 24 hours).
I like to think of it like a film strip — each frame in a film is kept in order. You can play a film from the start, or you can fast-forward to the middle and start playing from there. In addition, you can rewind to an earlier part and watch it. The same is true for a Kinesis stream, and multiple consumers can read from various parts of the stream simultaneously.
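For comparison, reading a Kinesis stream means asking for a shard iterator at a chosen position ("start of the film", only new frames, or a specific spot) and walking forward from it. A rough boto3 sketch, with the stream name as a placeholder and only the first shard read for brevity:

```python
import boto3

kinesis = boto3.client("kinesis")
stream_name = "my-stream"  # placeholder

# Pick where to start playing from: TRIM_HORIZON = oldest retained record,
# LATEST = only new records, AT_SEQUENCE_NUMBER = a specific spot.
shard_id = kinesis.describe_stream(StreamName=stream_name)[
    "StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream_name,
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in resp["Records"]:
    # records stay in the stream after reading, so other consumers can replay them
    print(record["SequenceNumber"], record["Data"])
```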
So, which to choose?
If a message is used once and then discarded, a queue is probably the better choice.
If retaining message order is important and/or messages will be used more than once, then a stream is probably better.
This article sums it up pretty nicely, imo:
https://sookocheff.com/post/aws/comparing-kinesis-and-sqs/
but basically, if you don't know which one you need, start with SQS until it can't do what you want. SQS is dead simple to set up and use, and requires almost no expertise to use it well.
Kinesis takes a lot more time and expertise to set up and use, so unless you need it, don't bother - even though it could be used for many of the same things as SQS.
One big difference: with SQS, if you have multiple consumers reading from the queue, then each consumer will only ever see the messages they consume - because other consumers will be blocked from seeing them; with Kinesis, many consumers can access the stream at the same time, and each consumer sees the entire stream - so SQS is good for taking a large number of tasks and doling out pieces to lots of consumers to work on in parallel (among other things), whereas with Kinesis multiple consumers can read and see the entire stream and do something with ALL of the data in it.
The linked article explains it better than me.
I'll try to give a simple answer based on my practical experience:
Consider SQS as a temporary storage service. Use cases:
manage data with different queue priorities
store data for a limited period of time
Lambda DLQ
reduce costs with long polling
create a FIFO
Consider Kinesis as a collector of large streams of real-time data. Use cases:
very very large stream of data from different sources
backup of data just by enabling Firehose (you get a data lake for free)
get statistics at once during the collection phase by integrating Kinesis Analytics
have checkpoints to keep track in DynamoDB of records processed/failed
Note: consider that both services can be integrated with Lambda functions very easily, so there are plenty of use cases that can be solved with either SQS or Kinesis. Anyway, I tried to list some use cases where I found that one of the two performed distinctly better than the other. Hope it can be helpful :)

What if my lambda job, which is subscribed to an AWS SNS topic, goes down or stops working?

I have one publisher and one subscriber for my SNS topic in AWS.
Suppose my subscriber fails and exits with an error.
Will SNS repush those failed messages?
If not...
Is there another way to achieve that goal where my system starts processing from the last successful lambda execution?
There is a retry policy, but if your application has already received the message, then no. If something goes wrong you won't see it again, and since Lambdas don't carry state... you could be in trouble.
I might consider looking at SQS instead of SNS. Remember, messages in SQS won't be removed until you remove them and you can set a window of invisibility. Therefore, you can easily ensure the next Lambda execution picks up where things left off (depending on your settings). Each Lambda would then be responsible for removing that message from SQS and that's how you'd know the message was processed.
Without knowing more about your application and needs, I couldn't say for sure... But I would take a look at it. I've built a "taskmaster" Lambda before that ran on a schedule and read from an SQS queue (multiple queues actually - the scheduled job passed a different JSON event based on which queue to read from). It would then pass the job off to the appropriate Lambda "worker", which would then remove that message. Should it stop working... well, the invisibility period would time out (and 5 minutes isn't bad here, given that's all Lambdas can execute for) and the next Lambda would pick it up. The taskmaster would then run as often as needed and read as many jobs from the queue as necessary. This really helps you have complete control over the rate at which you process things, how many times you retry things, etc. Then you can also make use of a dead-letter queue to catch anything that may have failed (also, think about sticking things back into the queue).
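Just to make the shape of that taskmaster concrete, a stripped-down sketch might look like the following; the queue URL and worker function name are placeholders, and it assumes the worker deletes its own message once the job is done:

```python
import json
import boto3

sqs = boto3.client("sqs")
lam = boto3.client("lambda")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder
WORKER_FUNCTION = "my-worker"                                        # placeholder

def handler(event, context):
    # runs on a schedule; pulls a batch of jobs and fans them out to workers
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
    for msg in resp.get("Messages", []):
        lam.invoke(
            FunctionName=WORKER_FUNCTION,
            InvocationType="Event",  # async; the worker deletes the message itself
            Payload=json.dumps({
                "body": msg["Body"],
                "receiptHandle": msg["ReceiptHandle"],
            }),
        )
```

If a worker never deletes its message, the visibility timeout expires and the next scheduled run hands the job out again.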
You have a LOT of flexibility with SQS that I'm not really sure you get with SNS, to be honest. I was never fond of SNS, though it too has a time and place, so again, without knowing more here I couldn't say if SQS would be the right fit for you... But I think your concerns can be taken care of with SQS if it makes sense for your application.

SQS/task-queue job retry count strategy?

I'm implementing a task queue with Amazon SQS (but I guess the question applies to any task queue), where the workers are expected to take different actions depending on how many times the job has already been retried (move it to a different queue, increase the visibility timeout, send an alert, etc.).
What would be the best way to keep track of the failed-job count? I'd like to avoid having to keep a centralized DB for job:retry-count records. Should I instead look at time spent in the queue, in a monitoring process? IMO that would be ugly or unclean at best, iterating over jobs until I find ancient ones...
thanks!
Andras
There is another, simpler way. With your message you can request the ApproximateReceiveCount attribute and base your retry logic on that. This way you won't have to keep it in a database and can calculate it from the message itself.
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/APIReference/API_ReceiveMessage.html
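In boto3 terms, that attribute is requested at receive time; a small sketch (the queue URL is a placeholder and the threshold is just an example):

```python
import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/tasks"  # placeholder

resp = sqs.receive_message(
    QueueUrl=queue_url,
    AttributeNames=["ApproximateReceiveCount"],  # times this message has been received
    MaxNumberOfMessages=1,
)

for msg in resp.get("Messages", []):
    receive_count = int(msg["Attributes"]["ApproximateReceiveCount"])
    if receive_count > 3:
        pass  # e.g. move to another queue, extend the visibility timeout, raise an alert
```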
I've had good success combining SQS with SimpleDB. It is "centralized", but only as much as SQS is.
Every job gets a record in SimpleDB and a task in SQS. You can put any information you like in SimpleDB, like the job creation time. When a worker pulls a job from the queue, it can grab the corresponding record from SimpleDB to determine its history. You can see how old the job is, and you can see how many times it has been attempted. Once you're done, you can add worker data to the SimpleDB record (completion time, outcome, logs, errors, stack trace, whatever) and acknowledge the message from SQS.
I prefer this method because it helps diagnose faults by providing lots of debug info for failed tasks. It also allows workers to handle the job differently depending on how long the job has been queued, how many failures it's had, etc.
It also gives you the ability to query SimpleDB directly and calculate things like average time per task, percent failure rate, etc.
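Very roughly, the SQS + SimpleDB combination could be sketched like this (queue URL and domain name are placeholders, using boto3's sdb client; treat it as an outline rather than a drop-in implementation):

```python
import time
import boto3

sqs = boto3.client("sqs")
sdb = boto3.client("sdb")  # SimpleDB

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # placeholder
DOMAIN = "jobs"                                                      # placeholder

def submit_job(job_id, payload):
    # one record per job in SimpleDB, one task per job in SQS
    sdb.put_attributes(
        DomainName=DOMAIN,
        ItemName=job_id,
        Attributes=[
            {"Name": "created_at", "Value": str(time.time()), "Replace": True},
            {"Name": "attempts", "Value": "0", "Replace": True},
        ],
    )
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=job_id)

def load_history(job_id):
    # the worker reads the job's history before deciding how to handle it
    attrs = sdb.get_attributes(DomainName=DOMAIN, ItemName=job_id, ConsistentRead=True)
    return {a["Name"]: a["Value"] for a in attrs.get("Attributes", [])}
```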
Amazon just released Simple Workflow Service (SWF), which you can think of as a more sophisticated/flexible version of GAE task queues.
It will let you monitor your tasks (with heartbeats), configure retry strategies and create complicated workflows. It looks pretty promising for abstracting out task dependencies, scheduling and fault tolerance for tasks (esp. asynchronous ones).
Check out http://docs.amazonwebservices.com/amazonswf/latest/developerguide/swf-dg-intro-to-swf.html for an overview.
SQS stands for "Simple Queue Service", which is, in concept, an incorrect name for the service. The first and foremost feature of a "queue" is FIFO (first in, first out), and SQS lacks that. Just wanting to clarify.
Also, Azure Queue Services lacks that as well. For the best cloud queue service, use Azure's Service Bus, since it's a TRUE queue concept.