Tracking a message on multiple AWS SQS queues - amazon-web-services

I have around 3 AWS Lambda functions taking the following form:
Lambda function 1: Reads from an SQS queue and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 2: Reads the message from Lambda function 1, and puts a Message on an SQS queue (the incoming and outgoing message formats are different)
Lambda function 3: Reads the message from Lambda function 3, and updates storage.
There are 3 queues involved and the message format (structure) in each queue is different, however they have one uniqueId which are the same and can be used the relate one to each other. So much question is, is there any way in SQS on or some other tool to track the messages, what I'm specifically looking at is stuff like:
Time the message was entered into the queue
Time the message was taken by the Lambda function for processing
My problem is that the 3 Lambda functions individually perform within a couple of milliseconds, but the time taken for end to end execution is way too long, I suspect that the messages are taking too long in transit.
I'm open for any other ideas on optimisation.

AWS Step Functions is specifically designed for passing information between Lambda functions and orchestrating the whole process.
However, you would need to change the way you have written your functions to take advantage of Step Functions.
Since your only real desire is to explore why it takes "way too long", then AWS X-Ray should be a good way to gather this information. It can track a single transaction end-to-end through each process. I think it's just a matter of including a library and activating X-Ray.
See: AWS Lambda and AWS X-Ray - AWS X-Ray
Or, just start doing some manual investigations in the log files. The log shows how long each function takes to run, so you should be able to identify whether the time taken is within a particular Lambda function, or whether the time is spent between functions, waiting for them to trigger.

Related

SQS → Lambda Problem With maximumBatchingWindow

Our intention is to trigger a lambda when messages are received in an SQS queue.
we only want one invocation of the lambda to run at a time (maximum concurrency of one)
We would like for the lambda to be triggered every time one of the following is true:
There are 10,000 messages in the queue
Five minutes has passed since the last invocation of the lambda
Our consumer lambda is dealing with an API with limited API calls and strict concurrency limits. The above solution ensures we never encounter concurrency issues and we can batch our calls together, ensuring we never consume too many API calls.
Here is our serverless.yml configuration
functions:
sqs-consumer:
name: sqs-consumer
handler: handlers.consume_handler
reservedConcurrency: 1 // maximum concurrency of 1
events:
- sqs:
arn: !GetAtt
- SqsQueue
- Arn
batchSize: 10000
maximumBatchingWindow: 300
timeout: 900
resources:
Resources:
SqsQueue:
Type: 'AWS::SQS::Queue'
Properties:
QueueName: sqs-queue
VisibilityTimeout: 5400 # 6x greater than the lambda timeout
The above does not give us the desired behavior. We are seeing our lambda triggered every 1 to 3 minutes (instead of 5). It indeed is using batches because we’ll see multiple messages being processed in a single invocation, but with even just one or two messages in the queue at a time it doesn’t wait 5 minutes to trigger the lambda.
Our messages are extremely small, so it's not possible we're coming anywhere close to the 6mb limit.
We would expect the only time the lambda is triggered to be when either 10,000 messages have accumulated in the queue or five minutes have transpired since the previous invocation. Instead we are seeing the lambda invoked anywhere in between every 1 to 3 minutes with a batch size that never even breaks 100, much less 10,000.
The largest batch size I’ve seen it invoke the lambda with so far has been 28, and sometimes with only one message in the queue it’ll invoke the function when it’s only been one minute since the previous invocation.
We would like to avoid using Kinesis, as the volume we’re dealing with truly doesn’t warrant it.
Reply from AWS Support:
As per the Case ID 10802672001, I understand that you have an SQS
event source mapping on Lambda with a batch size of 500 and batch
Window of 60 seconds. I further understand that you have observed the
lambda function invocation has fewer messages than 500 in a batch and
is not waiting for batch window time configured while receiving the
messages. You would like to know why lambda is being invoked prior to
meeting any of the above configured conditions and seek our assistance
in troubleshooting the same. Please correct me if I misunderstood your
query by any means.
Initially, I would like to thank you for sharing the detailed
correspondence along with the screenshot of the logs, it was indeed
very helpful in troubleshooting the issue.
Firstly, I used the internal tools to check the configuration of your
lambda function "sd_dch_archivebatterydata" and observed that there
is no throttling in the lambda function and there is no reserved
concurrency configured. As you might already be aware that Lambda is
meant to scale while polling from SQS queues and thus it is
recommended not to use reserving concurrency, as it is going against
the design of the event source. On checking log screenshot shared by
you, I observed there were no errors.
Regarding your query, please allow me to answer them as follows:
Please understand here that Batch size is the maximum number of messages that lambda will read from the queue in one batch for a
single invocation. It should be considered as the maximum number of
messages (up to) that can be received in a single batch but not as a
fixed value that can be received at all times in a single invocation.
-- Please see "When Lambda invokes the target function, the event can contain multiple items, up to a configurable maximum batch size" in
the official documentation here [1] for more information on the same.
I would also like to add that, according to the internal architecture of how the SQS service is designed, Lambda pollers will
poll the messages from the queue using the "ReceiveMessage" API
calls and invokes the Lambda function.
-- Please refer the documentation [2] which states the following "If the number of messages in the queue is small (fewer than 1,000), you
most likely get fewer messages than you requested per ReceiveMessage
call. If the number of messages in the queue is extremely small, you
might not receive any messages in a particular ReceiveMessage
response. If this happens, repeat the request".
-- Thus, we can see that the number of messages that can be obtained in a single lambda invocation with a certain batch size depends on the
number of messages in an SQS queue and the SQS service internal
implementation.
Also, batch window is the maximum amount of time that the poller waits to gather the messages from the queue before invoking the
function. However, this applies when there are no messages in the
queue. Thus, as soon as there is a message in the queue, the Lambda
function will be invoked without any further due without waiting for
the batch window time specified. You can refer to the
"WaitTimeSeconds" parameter in the "ReceiveMessage" API.
-- The batch window just ensures that lambda starts polling after certain time so that enough messages are present in the queue.
However, there are other factors like size of messages, incoming
volume, etc that can affect this behavior.
Additionally, I would like to confirm that Polls from SQS in Lambda is of Synchronous invocation type and it has an invocation payload
limit size of 6MB. Please refer the following AWS Documentation for
more information on the same [3].
Having said that, I can confirm that this Lambda polling behaviour is
by design and not a bug. Please be rest assured that there are no
issues with the lambda and SQS service.
Our scenario is to archive to S3, and we want fewer larger files. Looks like our options are potentially kinesis, or running a custom receive application on something like ECS...

SNS > AWS Lambda asyncronous invocation queue vs. SNS > SQS > Lambda

Background
This archhitecture relies solely on Lambda's asyncronous invocation mechanism as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data in that can vary drastically in size (tens of of KB to potentially 1-3MB). The data contains a JSON array containing one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subribed the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in AWS managed Lambda asyncronous queue begins to fill up as the instances of the parser maxes out at 3. The Lambda queueing mechanism initiates retries at incremental backups when throttling occurs, until the invocation request can be processed by the parser function.
It is imperitive that a record does not get lost during this process as they can not be resurrected. I will be using dead letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method out resulted in no lost invocation. Everything worked as expected. Lambda reported hundreds of throttle responses but I'm relying on this to initiate the Lambda retry behaviour for async invocations. My understanding is that this behaivour is effectively the same as that which I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
Questions
1. Is the built-in AWS managed Lambda asyncronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prelonged periods so I want to understand whether the Lambda queue can handle this as sensibly as an SQS service. The main part that concerns me is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a difinitive answer but I'm hoping the message would end up in the configured dead letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/#zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail. I can't find it at the moment. The premise is the same as this queues are leaky buckets
So what if I add more Leaky Buckets. We'll you've delayed the leaking, however it's now leaking into another bucket. Have you solved the problem or delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
mvce is missing from this question so I cannot address the the precise problem you are having.
As for an opinion on which to choose for SQS and Lambda queue I'll point to the Meta on this
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR;
It depends
I think the biggest advantage of using your own queue is the fact that you as a user have visibility into the state of the your backpressure.
Using the Lambda async invoke method, you have the potential to get throttled exceptions with the 'guarantee' that lambda will retry over an interval. If using a SQS source queue instead, you have complete visibility into the state of your message processing at all times with no ambiguity.
Secondly regarding overpulling. In theory this is a concern but in practice its never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously set your retry policy appropriately and use a DLQ as transient/unpredictable errors CAN occur.

Processing AWS SQS messages with separate Lambda at a time

Like the title suggests, I have a scenario that I would like to explore but do not know how to go about it.
I have a lambda function processCSVFile. I also have a SQS queue that at a set time everyday, it gets populated with link of csv files from S3, let's say about 2000 messages. Now I want to process 25 messages at a time once the SQS queue has the messages.
The scenario I am looking for is to process 25 messages concurrently, I want the 25 messages to be processed by 25 lambda invocations separately. I thought I could use SendMessageBatch function in SQS but this only delivers messages to the queue, it does not seem to apply to my use case.
My question is, am I able to perform the action explained above and if it is possible, what documentation or use cases can explain what I am looking for.
Also, if this use case is impossible, what do you recommend as an alternative way to do the processing I want done concurrently.
To process 25 messages from Amazon SQS with 25 concurrent Lambda functions (1 message per running Lambda function), you would need:
A maximum concurrency of 25 configured for the Lambda function (otherwise it might go higher than this when more messages are available)
A batch size of 1 configured on the Lambda trigger so that SQS only passes it one message at a time
See:
AWS Lambda Function Scaling (Maximum concurrency)
Configuring a Queue as an Event Source (Batch size)
I think that combination of lambda's event source mapping for sqs
and setting reserved concurrency to 25 could be the way do go.
The lambda uses long pooling to prepare message batches for concurrent processing by lambda. Thus each invocation of your function could get more than 1 message at a time.
I don't think there is a way to set event source mapping to serve just one message per batch. If you absolute must ensure only one message is processed by lambda, then you process one and disregards others (put them back to queue).
The reserved concurrency of 25 guarantees that you wont be running more than 25 functions in parallel. If you leave it at its default value, you can run up to whatever free concurrency you have in your account.
Edit:
#JohnRotenstein already confirmed that there is a way to set lambda to pass message a time to your function.
Hope this helps.

How to trigger AWS Lambda just once on multiple S3 notifications

We are designing a pipeline. We get a number of raw files which come into S3 buckets and then we apply a schema and then save them as parquet.
As of now we are triggering a lambda function for each file written but ideally we would like to start this process only after all the files are written. How we can we trigger the lambda just once?
I encourage you to use an alternative that maintains the separation between the publisher (whoever is writing) and the subscriber (you). The publisher tells you when things are written; it's your responsibility to choose when to process those things. The neat pattern here would be for the publisher to write its files in batches and publish manifests for you to trigger on: i.e. a list which says "I just wrote all these things, you can find them in these places". Since you don't have that / can't change the publisher, I suggest the following:
Send the notifications from the publisher to an SQS queue.
Schedule your lambda to run on a schedule; how often is determined by how long you're willing to delay ingestion. If you want data to be delayed at most 5min between being published and getting ingested by your system, set your lambda to trigger every 4min. You can use Cloudwatch notifications for this.
When your lambda runs, poll the queue. Keep going until you accumulate the maximum amount of notifications, X, you want to process in one go, or the queue is empty.
Process. If the queue wasn't empty when you stopped polling, immediately trigger another lambda execution.
Things to keep in mind on the above:
As written, it's not parallel, so if your rate of lambda execution is slower than the rate at which the queue fills up, you'll need to 1. run more frequently or 2. insert a load-balancing step: a lambda that is triggered on a schedule, polls the queue, and calls as many processing lambdas as necessary so that each one gets X notifications.
SNS in general and SQS non-FIFO queues specifically don't guarantee exactly-once delivery. They can send you duplicate notifications. Make sure you can handle duplicate processing cleanly.
Hook your Lambda up to a Webhook (API Gateway) and then just call it from your client app once your client app is done.
Solutions:
Zip all files together, Lambda unzip it
create a UI code and send files one by one, trigger lambda from it when the last one is sent
Lambda check files, if didn't find all files, silent quit. if it finds all files, then handle all files in one thread

Make Lambda function execute now, and/or in an hour

I'm trying to implement an AWS Lambda function that should send an HTTP request. If that request fails (response is anything but status 200) I should wait another hour before retrying (longer that the Lambda stays hot). What the best way to implement this?
What comes to mind is to persist my HTTP request in some way and being able to trigger the Lambda function again in a specified amount of time in case of a persisted HTTP request. But I'm not completely sure which AWS service that would provide that functionality for me. Is SQS an option that can help here?
Or, can I dynamically schedule Lambda execution for this? Note that the request to be retried should be identical to the first one.
Any other suggestions? What's the best practice for this?
(Lambda function is my option. No EC2 or such things are possible)
You can't directly trigger Lambda functions from SQS (at the time of writing, anyhow).
You could potentially handle the non-200 errors by writing the request data (with appropriate timestamp) to a DynamoDB table that's configured for TTL. You can use DynamoDB Streams to detect when DynamoDB deletes a record and that can trigger a Lambda function from the stream.
This is obviously a roundabout way to achieve what you want but it should be simple to test.
As jarmod mentioned, you cannot trigger Lambda functions directly by SQS. But a workaround (one I've used personally) would be to do the following:
If the request fails, push an item to an SQS Delay Queue (docs)
This SQS message will only become visible on the queue after a certain delay (you mentioned an hour).
Then have a second scheduled lambda function which is triggered by a cron value of a smaller timeframe (I used a minute).
This second function would then scan the SQS queue and if an item is on the queue, call your first Lambda function (either by SNS or with the AWS SDK) to retry it.
PS: Note that you can put data in an SQS item, since you mentioned you needed the lambda functions to be identical you can store your first function's input in here to be reused after an hour.
I suggest that you take a closer look at the AWS Step Functions for this. Basically, Step Functions is a state machine that allows you to execute a Lambda function, i.e. a task in each step.
More information can be found if you log in to your AWS Console and choose the "Step Functions" from the "Services" menu. By pressing the Get Started button, several example implementations of different Step Functions are presented. First, I would take a closer look at the "Choice state" example (to determine wether or not the HTTP request was successful). If not, then proceed with the "Wait state" example.