AWS SQS Polling Issue

I have encountered a weird SQS situation for which I can't find a satisfying answer.
I created a delay queue that should delay (what a surprise) incoming events for 4 seconds, after which they should be processed by Lambda. Order is not an issue here.
The issue, though, is that the "approximate age of the oldest message" metric (stat: Max) sometimes reaches over 1 minute, which is weird since there aren't that many messages, as you can see in the picture. My expectation would be that each event gets processed immediately after the 4-second delay.
The reserved concurrency of that Lambda is 50, so the SQS poller should have no problem invoking more Lambda instances if there is too much traffic. But traffic isn't really a problem, as you can see.
The queue is configured like this:
Default visibility timeout: 120 sec
Delivery delay: 4 sec
Dead-letter queue: No (it is only one event generated by AWS, so no bad pills)
Message retention period: 4 days
The Lambda config:
Batch size: 5 (tried also 1 or 10; not much of a difference for the mentioned metric)
Batch window: None
Reserved concurrency: 50
Timeout: 20 secs
I can't explain the reason for those old messages (ApproximateAgeOfOldestMessage). Any help would be highly appreciated.
Best
Patrick

I contacted AWS Support. Apparently it is a bug on the AWS side:
Response from AWS Support:
I have just received an update from the backend service team, and the team has confirmed that they have identified an issue of unexpected spikes in the "ApproximateAgeOfOldestMessage" metric that triggers when messages are sent to SQS with a configured delay. The root cause of this issue is that our internal system uses recently processed delayed messages to calculate "ApproximateAgeOfOldestMessage", which results in a higher value than the actual one. They have now identified a fix for this issue and will start deploying it soon. After this update, when messages are sent to Amazon SQS with a configured delay, you may see the "ApproximateAgeOfOldestMessage" metric value come down to the accurate value.
So if you encounter the same problem, you have to wait for the mentioned fix. Hopefully it will come soon.

Related

Not all SQS messages end up in Lambda: most just disappear

I have an AWS SQS Queue (standard, non-FIFO) that has a Lambda function as a consumer.
Whenever I send a bunch of messages (usually around 10 at a time) to the queue, only about 2 get picked up by Lambda (verified in CloudWatch Logs). The others disappear from the queue.
The Lambda batch size is set to 1, so I would expect all 10 messages to sit in the queue and get picked up by Lambda one by one, but that's not happening. I'm using CloudWatch to check what Lambda is doing, and there is no trace of the missing messages.
I verified in Lambda that it only gets one message each time, by logging the size of the event.Records array (which is always 1).
The queue also has a dead-letter queue. Initially the Maximum Receives was set to 1. When I increased that to 3, more messages were getting picked up after the queue's visibility timeout, but still only a few.
My Queue settings
Visibility timeout: 2 minutes
Delivery delay: 0 seconds
Receive message wait time: 5 seconds
Message retention period: 4 days
Maximum message size: 256 KB
I'm wondering why the messages aren't being processed, but instead disappear?
The typical reason why messages are lost is that the Lambda function triggered by Amazon SQS is not correctly processing all of the Records passed to the function.
Make sure the code loops through all Records passed in the event parameter, since multiple messages can be provided in each Lambda invocation.
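As an illustration, a minimal Python handler that touches every record (process_record is a hypothetical helper standing in for your own logic):

def lambda_handler(event, context):
    # SQS can deliver several messages per invocation; handle each one.
    for record in event["Records"]:
        process_record(record["body"])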
Turns out this was related to the Reserved Concurrency of the Lambda function. My concurrency was set to 1, which caused issues.
My expectation
SQS messages will remain in the queue until there's a Lambda function available to pick them up.
In reality
Messages that are not picked up by Lambda because the function is throttled are, after the visibility timeout, treated as failed messages.
There's an excellent blog post about this issue: https://data.solita.fi/lessons-learned-from-combining-sqs-and-lambda-in-a-data-project/

First Message is Not Reaching the subscribed Actor

We have a Messaging Platform built on top of Akka (2.5) using Akka Cluster and Distributed PubSub. We currently have a cluster of 25 servers.
The scenario is as follows.
Actor1, created on Server1, subscribes to a topic, Chat1.
Actor2, created on Server2, publishes a message over Chat1 (around 100 ms after the subscription).
Sometimes the first message is not received by Actor1, but subsequent messages always are.
We could deduce that this is happening because a subscription takes some time to register on all the nodes of the cluster. These are the actions we took to solve this (see the configuration sketch below):
Decreased the gossip-interval from 1 sec (the default) to 50 ms.
Added a delay of another 400 ms, thus giving the cluster 500 ms in total to register the subscription. This reduced the probability of the issue happening, but it is still pretty frequent (around 1 in 6 times).
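For reference, a hedged sketch of that first tweak in application.conf, assuming the standard Distributed PubSub settings of Akka 2.5:

akka.cluster.pub-sub {
  # How often subscription state is gossiped to other nodes (default: 1s).
  gossip-interval = 50ms
}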
So, a few questions here:
Is it expected for PubSub to take more than 400 ms in a cluster of just 25 nodes (that too in a private network of servers in the same data centre)?
Are there additional configurations in Akka which can help in tweaking the time taken for subscription propagation?
What are our options for monitoring the average time taken by PubSub for subscription propagation within the cluster? This would help in getting the right estimate of the delay to be introduced (if needed at all).
If the above-mentioned delay is expected, are there any workarounds which have been used by someone in the past to overcome this issue?

How to modify/check google cloud run's retry limit on failure?

I have a topic which, on publish, pushes the event to a Cloud Run endpoint, and I have a trigger on a storage bucket to publish to this topic. The container in Cloud Run fails to process the event and has been restarted over a hundred times, and I don't want to waste money on this. How can I limit the retries on failure of a Cloud Run container?
A possible answer to the puzzle might be the following notion.
If we read the documentation on PUSH subscriptions found here, we find the following:
... Pub/Sub retries delivery until the message expires after the subscription's message retention period.
What this means is that if Pub/Sub pushes a message to Cloud Run and Cloud Run does not acknowledge the message by returning a 200 response code, then the message will be re-pushed for the "message retention period". By default, this is 7 days but, according to the documentation, it can be set to a minimum value of 10 minutes. What this seems to say to me is that we can stop a poison message after 10 minutes (minimum) of retries.
If a message is pushed and not acked, then it won't be pushed again immediately; instead, it will be re-pushed as a function of a back-off algorithm described here.
If we look at the gcloud documentation, we find reference to the concept of a maximum number of delivery attempts (--max-delivery-attempts). Associated with this is a topic called the dead letter topic (--dead-letter-topic). What this appears to define is that if an attempt is made to deliver a Pub/Sub message more than the maximum number of times, the message will be removed from the queue of messages associated with the subscription and moved to the topic associated with the dead letter. If you define this for your environment, then your Cloud Run will only execute a finite number of times, after which the poison messages will be moved elsewhere.
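As an illustration, a hedged gcloud sketch of that configuration (the subscription and topic names are placeholders):

gcloud pubsub subscriptions update my-push-subscription \
    --dead-letter-topic=my-dead-letter-topic \
    --max-delivery-attempts=5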

What is the expected SLA (Service-Level Agreement) on Amazon SNS Messages?

I was trying to evaluate SNS for a realtime application I am building and needed a really fast turnaround time of < 2 seconds in delivering the message.
Since I am located in the APAC region, I have an SNS topic in Singapore which has a Lambda subscriber in us-east-1.
Given this setup, I ran some code to try to figure out the latency in invoking Lambda, doing zero processing, and just logging the time. One might argue that Lambda invocation latency is also accounted for in this instance, which is true. I need Lambda to be invoked, executed, and replied to within < 2 seconds.
I sent 23,914 messages and measured an average of 653.520 ms for transport + Lambda invocation, with peaks around 600,995 ms (~10 minutes), which is terrible latency for a technology like pubsub.
About 20,117 messages got sent and received by Lambda in < 653 ms, which means 3,797 messages, or about 15%, took more than the average time.
2,958 messages, or 12.36%, took over 1 second to be executed.
379 messages, or 1.59%, took over 2 seconds to be invoked and executed (which means 1.6% of my messages cannot be considered realtime and have to be ignored).
82 messages took over 10 seconds.
64 took over 20 seconds.
It goes on till ~45 seconds, after which the delay is 10 minutes. I have 3 packets with a 10-minute delay.
What bothers me is that about 2% (if you include the processing time as well) of my messages cannot be processed in realtime at a tiny scale of ~24K messages.
The scale calculation I am trying to present requires me to process about 216 billion messages per month. At this scale, I am worried that I will not be able to process 4.3 billion messages in realtime.
Given this experiment, I am not sure how well SNS would scale: would the number of slower-than-realtime messages (read: > 2-second delay) increase, or would it decrease?
Now, there might be a tendency to question my internet connection's reliability, so I re-did this experiment on EC2 and got very similar results. In fact, the delays roughly matched around the same times.
Specific questions:
What are the SLAs for SNS performance?
Indirectly: how do these SLAs translate to those of the AWS Lambda service?
Any reasons as to where these delays might be happening?
Most likely what happened here was throttling on the Lambda function. The default limit for concurrent Lambda invocations is 100. If you sent 20K messages, you likely exceeded that limit, despite the short runtime of the Lambda. When your Lambda function is throttled while executing an SNS request, the request goes onto a retry queue and is re-executed up to 3 times, and those retries often occur over a long period of time (up to an hour).
You can see the number of throttles in the CloudWatch metrics for the function (unfortunately, you ran your test before 6 months of CloudWatch retention was released).
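For anyone re-running such a test, a hedged boto3 sketch for pulling that metric (the function name is a placeholder):

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Lambda",
    MetricName="Throttles",
    Dimensions=[{"Name": "FunctionName", "Value": "my-sns-consumer"}],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,            # 5-minute buckets
    Statistics=["Sum"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])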
Last I checked, there was no SLA for SNS. SNS is designed to be horizontally scalable and to (almost) never drop a message, not to deliver it quickly.
Update: Since March 2019 there is a SLA for SNS:
https://aws.amazon.com/messaging/sla/
Is there any reason why you can't invoke the Lambda from the publisher via the API and store the data within the event passed to the invocation?
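For instance, a minimal boto3 sketch of such a direct asynchronous invocation (the function name and payload are placeholders):

import json
import boto3

lambda_client = boto3.client("lambda")
lambda_client.invoke(
    FunctionName="my-function",
    InvocationType="Event",          # asynchronous, fire-and-forget
    Payload=json.dumps({"data": "..."}),
)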

SQS - Delivery Delay of 30 minutes

From the SQS documentation, the maximum time delay we can configure for a message to be hidden from its consumers is 15 minutes: http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
Suppose I need to hide the messages for a day. What is the pattern?
For example, I want to mimic a daily cron for doing some action.
Thanks
The simplest way to do this is as follows:
# Minimal Python/boto3 sketch; QUEUE_URL, due_epoch, process_message are placeholders.
import json, time, boto3
sqs = boto3.client("sqs")
# Initial enqueue: record the real due time, delayed by the 15-minute maximum.
sqs.send_message(QueueUrl=QUEUE_URL,
                 MessageBody=json.dumps({"perform_message_at": due_epoch}),
                 DelaySeconds=900)

Inside your worker:

def handle(message):
    body = json.loads(message["Body"])
    if body["perform_message_at"] > time.time():
        # Not due yet: push it back with the maximum 15-minute delay.
        sqs.send_message(QueueUrl=QUEUE_URL,
                         MessageBody=message["Body"], DelaySeconds=900)
    else:
        process_message(body)
Basically, push the message back to the queue with the maximum delay and only process it once its scheduled processing time is earlier than the current time.
HTH.
The visibility timeout can be up to 12 hours. I think you can hack something together where you process a message but don't delete it, so that by the next time it is processed, 12 hours have passed. So a queue with one message and a visibility timeout of 12 hours gets you a 12-hour cron.
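A hedged boto3 sketch of that hack, assuming the queue's visibility timeout is set to 12 hours (QUEUE_URL and do_daily_work are placeholders):

import boto3

sqs = boto3.client("sqs")
# Receive the single "timer" message; since we never delete it,
# it becomes visible again after the 12-hour visibility timeout.
resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                           WaitTimeSeconds=20)
for message in resp.get("Messages", []):
    do_daily_work(message)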
CloudWatch is likely a better way to do it. You can use a createEvent API with the timer, and have it trigger either a Lambda function or an API call to whatever comes next.
Another way to do it is to use the "wait" state in an AWS Step Functions state machine.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
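For illustration, a minimal state machine sketch in the Amazon States Language (state names and the task ARN are placeholders):

{
  "StartAt": "WaitOneDay",
  "States": {
    "WaitOneDay": { "Type": "Wait", "Seconds": 86400, "Next": "DoWork" },
    "DoWork": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-handler",
      "End": true
    }
  }
}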
In any case, unless you are extremely sure you will never need anything more than 15 minutes, the SQS backdoor to add the delay seems hacky.
You can do this by adding a DLQ with MaxReceives set to 1 on the first queue.
Add a simple Lambda on the first queue and fail the message via the Lambda, so the message will be moved to the DLQ automatically; then you can consume from the DLQ.
Both the primary queue and the DLQ can have a 15-minute max delay, so in total you get a 30-minute delay.
So your consumer app receives the message after 30 minutes, without adding any custom logic to it.
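A hedged boto3 sketch of the setup described above (queue names are illustrative; the failing Lambda is wired up separately):

import json
import boto3

sqs = boto3.client("sqs")

# Second-stage queue (the "DLQ"), also with the maximum 15-minute delay.
dlq_url = sqs.create_queue(QueueName="delay-stage-2",
                           Attributes={"DelaySeconds": "900"})["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"])["Attributes"]["QueueArn"]

# First queue: 15-minute delay, redrive to the DLQ after a single receive.
sqs.create_queue(
    QueueName="delay-stage-1",
    Attributes={
        "DelaySeconds": "900",
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "1"}),
    },
)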
Two thoughts.
Untested: perhaps publish to an SNS topic that has no SQS queues. When delivery needs to happen, subscribe the queue to the topic. (I've not done this; I'm not sure it would work as expected.)
Push messages as files to a central store (like S3). Create a worker that looks at the time-created timestamp and decides whether to publish them to a queue or not. If created >= 1 day ago, publish (see the sketch below).
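A hedged boto3 sketch of such a worker (the bucket name and QUEUE_URL are placeholders):

from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
cutoff = datetime.now(timezone.utc) - timedelta(days=1)

# Publish every stored message that is at least one day old, then remove it.
for obj in s3.list_objects_v2(Bucket="my-delayed-messages").get("Contents", []):
    if obj["LastModified"] <= cutoff:
        body = s3.get_object(Bucket="my-delayed-messages",
                             Key=obj["Key"])["Body"].read().decode()
        sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=body)
        s3.delete_object(Bucket="my-delayed-messages", Key=obj["Key"])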
This was a challenge for us as well, and I never found a perfect solution, so I ended up building a service to address it. Obviously self-promotion here, but the system allows you to work around the DelaySeconds limitation and set arbitrary dates/times at scale.
https://anticipated.io
Some of the challenges working with Step Functions are around the scale of registered state machines (if your system has that requirement). If you use EventBridge to fire them, you run out of allowable rule sets (the limit is 200 as of this posting). Example: if you need to set 150,000 arbitrary events a month, you run into limits quickly.