What is the expected SLA (service-level agreement) on Amazon SNS messages?

I was evaluating SNS for a real-time application I am building, which needs a very fast turnaround time (< 2 seconds) for delivering a message.
Since I am located in the APAC region, I have an SNS topic in Singapore with a Lambda subscriber in us-east-1.
Given this setup, I ran some code to measure the latency of invoking a Lambda function that does zero processing and just logs the time. One might argue that this measurement also includes the Lambda invocation latency, which is true: I need the Lambda function to be invoked, executed, and to reply within 2 seconds.
I sent 23,914 messages, with an average of 653.520 ms for transport + Lambda invocation,
and peaks around 600,995 ms (~10 minutes), which is terrible latency for a technology like pub/sub.
About 20,117 messages were sent and received by Lambda in < 653 ms, which means 3,797 messages (about 16%) took more than the average time.
2,958 messages (12.37%) took over 1 second to be executed.
379 messages (1.58%) took over 2 seconds to be invoked and executed (which means about 1.6% of my messages cannot be considered real-time and have to be ignored).
82 messages took over 10 seconds.
64 took over 20 seconds.
This goes on up to ~45 seconds, after which the next delay is 10 minutes. I have 3 messages with a 10-minute delay.
What bothers me is that about 2% (if you include the processing time as well) of my messages cannot be processed in real time, even at a tiny scale of ~24K messages.
The scale I am planning for requires processing about 216 billion messages per month. At this scale, I am worried that about 4.3 billion messages would not be processed in real time.
Given this experiment, I am not sure how well SNS would scale. Would the number of less-than-real-time messages (read: > 2 second delay) increase or decrease?
There might be a tendency to question the reliability of my internet connection, so I redid this experiment on EC2 and got very similar results.
In fact, the delays occurred at around the same times.
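The percentages and the monthly projection above can be rechecked with a few lines of arithmetic. This is only a sketch: the counts are the ones reported in the question, and the assumption that the ~2% late fraction stays constant at higher volume is hypothetical, not something the test demonstrates.

```python
TOTAL = 23_914  # messages sent in the test

# Counts reported in the question, keyed by latency threshold.
over_threshold = {
    "the 653 ms average": TOTAL - 20_117,
    "1 s": 2_958,
    "2 s": 379,
    "10 s": 82,
    "20 s": 64,
}


def share(count: int, total: int = TOTAL) -> float:
    """Percentage of all messages, rounded to two decimals."""
    return round(100 * count / total, 2)


for label, count in over_threshold.items():
    print(f"over {label}: {count} messages ({share(count)}%)")

# Projection to the target volume, ASSUMING the ~2% late fraction
# does not change with scale (integer arithmetic avoids float error).
MONTHLY_MESSAGES = 216_000_000_000
late_per_month = MONTHLY_MESSAGES * 2 // 100
print(f"late messages per month at 2%: {late_per_month:,}")
```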
Specific Questions
What is the SLA for SNS performance?
Indirectly: how does this SLA translate to that of the AWS Lambda service?
Any ideas as to where these delays might be happening?

Most likely what happened here was throttling on the Lambda function. The default limit for concurrent Lambda invocations was 100 at the time. If you sent 20K messages, you likely exceeded that limit, despite the short runtime of the function. When a Lambda function is throttled while handling an SNS delivery, the request goes onto a retry queue and is retried up to 3 times, and those retries can be spread over a long period of time (up to an hour).
You can see the number of throttles in the CloudWatch metrics for the function (unfortunately, you ran your test before the 6-month CloudWatch retention was released).
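A minimal sketch of reading that throttle count programmatically, assuming a boto3 CloudWatch client is passed in. The helper name `count_throttles` and the function name in the usage comment are hypothetical; `get_metric_statistics`, the `AWS/Lambda` namespace, and the `Throttles` metric are the standard CloudWatch names.

```python
from datetime import datetime, timedelta, timezone


def count_throttles(cloudwatch, function_name, start, end, period=300):
    """Sum the Lambda `Throttles` metric for one function between start and end."""
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Lambda",
        MetricName="Throttles",
        Dimensions=[{"Name": "FunctionName", "Value": function_name}],
        StartTime=start,
        EndTime=end,
        Period=period,          # seconds per datapoint
        Statistics=["Sum"],
    )
    return int(sum(point["Sum"] for point in response["Datapoints"]))


# Usage (requires boto3 and AWS credentials; "latency-logger" is a made-up name):
#   import boto3
#   now = datetime.now(timezone.utc)
#   cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
#   print(count_throttles(cloudwatch, "latency-logger", now - timedelta(hours=6), now))
```

A non-zero result for the test window would support the throttling explanation; zero would point elsewhere.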

Last I checked there is no SLA for SNS. SNS is designed to be horizontally scalable and to (almost) never drop a message, not to deliver it quickly.
Update: Since March 2019 there is an SLA for SNS:
https://aws.amazon.com/messaging/sla/
Is there any reason why you can't invoke the Lambda function from the publisher via the API and pass the data in the event for that invocation?

Related

Kinesis vs SQS, which is the best for this particular case?

I have been reading about Kinesis vs SQS differences and when to use each but I'm struggling to know which is the appropriate solution for this particular problem:
Strava-like app where users record their runs
50 incoming runs per second
The processing of each run takes exactly 1 minute
I want the user to have their results in less than 5 minutes
A run is just a GUID; the job that processes it will get all the info from S3
If I understand correctly, in Kinesis you can have 1 worker per shard, correct? That would mean 1 run per minute per shard. Since I have 3000 incoming runs per minute, meeting the 5-minute deadline would mean I need 600 shards with 1 worker each.
Is this assumption correct?
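As a rough check on that estimate, here is a back-of-the-envelope sketch using Little's law. It assumes a steady 50 runs per second, exactly 60 seconds per run, and workers that handle one run at a time; the function names are illustrative only.

```python
import math


def workers_to_keep_up(arrivals_per_second: float, seconds_per_run: float) -> int:
    """Little's law: average runs in progress = arrival rate x processing time."""
    return math.ceil(arrivals_per_second * seconds_per_run)


def backlog_after(minutes: int, arrivals_per_minute: int, workers: int) -> int:
    """Unprocessed runs after `minutes`, if each worker finishes one run per minute."""
    return max(0, (arrivals_per_minute - workers) * minutes)


print(workers_to_keep_up(50, 60))    # 3000 workers busy in steady state
print(backlog_after(60, 3000, 600))  # 144000 runs waiting after one hour
```

Under these assumptions, 600 workers complete only 600 runs per minute against 3000 arriving, so the backlog keeps growing. The 5-minute deadline adds some queueing slack but does not lower the number of workers needed to keep up.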
With SQS I can just have 1 queue and as many workers as I like, up to SQS's limit of 120,000 inflight messages.
If 1 run errors during processing I want to reprocess it a few more times and then store it for further inspection.
I don't need to process messages in order, and duplicates are totally fine.
1 worker per message; after it's processed I no longer care about the message
In that case, a queuing service such as SQS should be used. Kinesis is a streaming service, which persists data. This means that multiple workers can read messages from a stream for as long as they are valid. None of your workers would be able to remove the message from the stream.
Also, with SQS you can set up dead-letter queues, which allow you to capture messages that fail to process after a predefined number of attempts.
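A dead-letter queue is attached through the source queue's `RedrivePolicy` attribute. The sketch below only builds that attribute; the helper name, the ARN, and the retry count of 3 are placeholders chosen for illustration.

```python
import json


def redrive_attributes(dead_letter_queue_arn: str, max_receive_count: int = 3) -> dict:
    """Queue attributes that move a message to the DLQ after N failed receives."""
    return {
        "RedrivePolicy": json.dumps(
            {
                "deadLetterTargetArn": dead_letter_queue_arn,
                "maxReceiveCount": max_receive_count,
            }
        )
    }


# Placeholder ARN, not a real queue.
attributes = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:runs-dlq")
print(attributes)

# Applying it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("sqs").set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```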

Delay in getting messages from AWS SQS

I am adding messages to SQS from Lambda and then receiving the messages inside a container on ECS.
The problem is that there is a delay of 10-15 seconds before I receive the messages in the container.
In the container, a loop runs indefinitely, once every second; it fetches messages and processes them if any are available.
Example:
Suppose the message is added in SQS at 15:20:00 but I am able to get that message at 15:20:15 on ECS. These 15 seconds are too long for my use case.
Can this time be reduced?
Assuming that there are multiple producers and consumers, is there any alternative solution?
If your workers are continually polling the Amazon SQS queue, they can reduce the number of requests by specifying WaitTimeSeconds=20 (which is its maximum value).
This tells Amazon SQS to wait until at least one message is available, to a maximum of 20 seconds. If no messages are available after 20 seconds, an empty set of messages is returned. However, if one or more messages appear in the queue, then the call returns immediately without waiting for 20 seconds.
This reduces the frequency of calls to SQS and might increase stability in your application.
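A minimal sketch of one long-polling iteration, assuming a boto3 SQS client is passed in. `poll_once` and `handle` are hypothetical names; `receive_message` and `delete_message` are the standard client calls.

```python
def poll_once(sqs, queue_url, handle, wait_seconds=20):
    """Long-poll the queue once, process each message, then delete it."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # SQS returns at most 10 per call
        WaitTimeSeconds=wait_seconds,  # returns early as soon as a message arrives
    )
    messages = response.get("Messages", [])  # key is absent when the queue is empty
    for message in messages:
        handle(message["Body"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return len(messages)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   sqs = boto3.client("sqs")
#   while True:
#       poll_once(sqs, queue_url, handle=print)
```

With long polling the loop needs no extra one-second sleep between calls, since each call already blocks until a message is available or the wait time expires.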

AWS SQS Polling Issue

I have encountered a weird SQS situation for which I can't find a satisfying answer.
I created a delay queue that should delay (what a surprise) incoming events for 4 seconds, after which they should be processed by Lambda. Order is not an issue here.
The issue, though, is that the "approximate age of the oldest message" metric (stat: Max) sometimes reaches over 1 minute, which is weird since there aren't that many messages, as you can see in the picture. My expectation would be that each event gets processed immediately after the 4-second delay.
The reserved concurrency of that Lambda is 50, so the SQS poller should have no problem invoking more Lambda instances if there is too much traffic. But traffic isn't really a problem, as you can see.
The queue is configured like this:
Default visibility timeout: 120 sec
Delivery delay: 4 sec
Dead-letter queue: No (it is only one event generated by AWS, so no bad pills)
Message retention period: 4 days
The lambda config:
Batch size: 5 (Tried also 1 or 10. Not much of a difference for the mentioned metric)
Batch window: None
reserved concurrency: 50
timeout: 20 secs
I can't explain the reason for those old messages (ApproximateAgeOfOldestMessage). Any help would be highly appreciated
Best
Patrick
I contacted AWS Support. Apparently it is a bug on the AWS side:
Response from AWS Support:
I have just received an update from the backend service team and the
team has confirmed that they have identified an issue of unexpected
spikes in "ApproximateAgeOfOldestMessage" metrics that triggers when
messages are sent to SQS with a configured delay. This issue's root
cause is that our internal system uses recently processed delayed
messages to calculate the "ApproximateAgeOfOldestMessage," which
results in a higher than the actual value for
"ApproximateAgeOfOldestMessage" metrics. They have now identified a
fix for this issue and will start deploying the fix soon. After this
update, when messages are sent to Amazon SQS with a configured delay,
you may see the "ApproximateAgeOfOldestMessage" metrics value come
down for the queues to the accurate value.
So if you encounter the same problem you have to wait for that mentioned fix. Hope it will come soon.

SQS batching for Lambda trigger doesn't work as expected

I have 2 Lambda functions and an SQS queue in between.
The first Lambda sends the messages to the queue.
The second Lambda has a trigger for this queue with a batch size of 250 and a batch window of 65 seconds.
I expect the second Lambda to be triggered with batches of 250 messages about every 65 seconds. In the second Lambda I'm calling a 3rd-party API that is limited to 250 API calls per minute (I get 250 tokens per minute).
I tested this setup with 32,000 messages being added to the queue, and the second Lambda didn't pick up the messages in batches as expected. At first it got executed for 15,000 messages, and then there were not enough tokens, so it did not process the remaining messages.
The 3rd-party API is based on a token bucket with a fill rate of 250 per minute and a maximum capacity of 15,000. It managed to process the first 15,000 messages thanks to the bucket capacity and then didn't have enough capacity to handle the rest.
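The behaviour described above is what a token bucket does when a whole backlog arrives at once. A small simulation, assuming the bucket starts full; the `TokenBucket` class is illustrative and not the 3rd-party API's actual implementation.

```python
class TokenBucket:
    """Token bucket with a fixed capacity and a per-minute fill rate."""

    def __init__(self, capacity: int, fill_per_minute: int):
        self.capacity = capacity
        self.fill_per_minute = fill_per_minute
        self.tokens = capacity  # ASSUMPTION: the bucket starts full

    def refill(self, minutes: int) -> None:
        self.tokens = min(self.capacity, self.tokens + self.fill_per_minute * minutes)

    def take(self, requested: int) -> int:
        """Grant as many tokens as are available, up to `requested`."""
        granted = min(requested, self.tokens)
        self.tokens -= granted
        return granted


bucket = TokenBucket(capacity=15_000, fill_per_minute=250)
accepted = bucket.take(32_000)  # the whole backlog hits the API at once
print(accepted)                 # 15000 calls succeed
print(32_000 - accepted)        # 17000 calls are rejected
bucket.refill(minutes=1)
print(bucket.tokens)            # 250 tokens available a minute later
```

Once the burst capacity is spent, only the fill rate of 250 calls per minute remains, which is the rate the consumer has to be paced to.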
I don't understand what went wrong.
The misunderstanding is probably related to how Lambda handles scaling.
Whenever there are more events than a single Lambda execution context/instance can handle, Lambda just creates more execution contexts/instances to process these events.
What probably happened is that Lambda saw a bunch of messages in the queue and tried to work on them as fast as possible. It created a Lambda instance to handle the first batch and then asked SQS for more work. When it got the next batch of messages, the first instance was still busy, so it scaled out and created a second instance that worked on the second batch in parallel, and so on.
That's how you ended up going through your token budget in a few minutes.
You can limit how many functions Lambda is allowed to execute in parallel by using reserved concurrency (see the Lambda documentation on reserved concurrency). If you set the reserved concurrency to 1, there will be no parallelization and only one Lambda instance is allowed to work on the messages.
This however opens you up to another issue. If that single Lambda takes less than 60 seconds to process the messages, Lambda will call it again with another batch ASAP and you might go over your budget again.
At this point a relatively simple approach would be to make sure that your lambda function always takes about 60 seconds by adding a sleep for the remaining time at the end.
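A sketch of that padding approach, assuming an SQS-triggered Python handler. `process_batch_padded` and the placeholder `call_api` are hypothetical names; the clock and sleep functions are injectable only so the logic can be exercised without waiting.

```python
import time

MIN_SECONDS_PER_BATCH = 60  # one batch of 250 calls per token refill interval


def process_batch_padded(records, call_api, clock=time.monotonic, sleep=time.sleep):
    """Process one batch, then sleep so the invocation lasts at least 60 seconds."""
    started = clock()
    for record in records:
        call_api(record)
    remaining = MIN_SECONDS_PER_BATCH - (clock() - started)
    if remaining > 0:
        sleep(remaining)
    return max(remaining, 0)  # seconds actually spent sleeping


def handler(event, context):
    # SQS-triggered Lambda events carry the batch under "Records".
    # The lambda below is a placeholder for the real 3rd-party API call.
    process_batch_padded(event["Records"], call_api=lambda record: None)
```

For this to work, the function timeout has to be longer than 60 seconds plus the processing time, and the sleep time is billed like any other execution time.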

SQS - Delivery Delay of 30 minutes

According to the SQS documentation, the maximum delay we can configure for hiding a message from its consumers is 15 minutes - http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-delay-queues.html
Suppose I need to hide the messages for a day. What is the pattern?
For example, I want to mimic a daily cron for doing some action.
Thanks
The simplest way to do this is as follows:
SQS.push_to_queue({perform_message_at: "Thursday November 2022"}, delay: 15 mins)
Inside your worker:
message = SQS.poll_messages
if message.perform_message_at > Time.now
  # Not due yet: put it back with the maximum delay, keeping its due time
  SQS.push_to_queue({perform_message_at: message.perform_message_at}, delay: 15 mins)
else
  # Due (or overdue): process it now
  process_message(message)
end
Basically, push the message back to the queue with the maximum delay, and only process it once its scheduled time is no later than the current time.
HTH.
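The re-queue decision above can also be sketched in Python. The helper names are hypothetical and the dates are arbitrary examples; the only SQS fact used is the 15-minute (900-second) cap on DelaySeconds.

```python
import math
from datetime import datetime, timedelta, timezone

MAX_DELAY_SECONDS = 900  # SQS caps DelaySeconds at 15 minutes


def next_delay_seconds(perform_at: datetime, now: datetime) -> int:
    """Delay for the next hop: 0 means the message is due and should be processed."""
    remaining = (perform_at - now).total_seconds()
    return int(max(0, min(MAX_DELAY_SECONDS, remaining)))


def hops_needed(total_delay: timedelta) -> int:
    """How many times a message is queued to cover `total_delay`."""
    return math.ceil(total_delay.total_seconds() / MAX_DELAY_SECONDS)


now = datetime(2030, 1, 1, 9, 0, tzinfo=timezone.utc)       # arbitrary example time
print(next_delay_seconds(now + timedelta(days=1), now))     # 900
print(next_delay_seconds(now + timedelta(minutes=4), now))  # 240
print(next_delay_seconds(now - timedelta(minutes=1), now))  # 0
print(hops_needed(timedelta(days=1)))                       # 96
```

Shortening the last hop to the remaining time, rather than always using the full 15 minutes, keeps the message from being processed up to 15 minutes late. A one-day delay still costs 96 sends per message.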
Visibility timeout can go up to 12 hours. I think you can hack something together where you process a message but don't delete it, so that the next time it is received 12 hours have passed. So: a queue with one message and a visibility timeout of 12 hours. That gets you a 12-hour cron.
CloudWatch Events (now Amazon EventBridge) is likely a better way to do it. You can create a scheduled rule and have it trigger either a Lambda function or an API call to whatever comes next.
Another way to do it is to use the Wait state in an AWS Step Functions state machine.
https://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-wait-state.html
In any case, unless you are extremely sure you will never need anything more than 15 minutes, the SQS backdoor to add the delay seems hacky.
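For reference, the Wait-state approach mentioned above might look like the following state machine definition, built here as a Python dict and printed as Amazon States Language JSON. The state names and the Lambda ARN are placeholders.

```python
import json


def delayed_task_definition(task_arn: str, wait_seconds: int) -> dict:
    """Amazon States Language definition: wait, then run a single task."""
    return {
        "Comment": "Wait for a fixed time, then run one task",
        "StartAt": "WaitBeforeTask",
        "States": {
            "WaitBeforeTask": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "RunTask",
            },
            "RunTask": {
                "Type": "Task",
                "Resource": task_arn,
                "End": True,
            },
        },
    }


# Placeholder ARN, not a real function.
definition = delayed_task_definition(
    "arn:aws:lambda:us-east-1:123456789012:function:daily-job",
    wait_seconds=24 * 60 * 60,
)
print(json.dumps(definition, indent=2))
```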
You can do this by adding a DLQ to the first queue, with maxReceiveCount set to 1.
Add a simple Lambda on the first queue and fail the message via Lambda. The message will then be moved to the DLQ automatically, and you can consume it from the DLQ.
Both the primary queue and the DLQ can have a delay of at most 15 minutes, so in total you get a 30-minute delay.
So your consumer app receives the message after 30 minutes, without adding any custom logic to it.
Two thoughts.
Untested. Perhaps publish to an SNS topic that has no SQS queues subscribed. When delivery needs to happen, subscribe the queue to the topic. (I've not done this, and I'm not sure whether it would work as expected.)
Push messages as files to a central store (like S3). Create a worker that looks at the time created timestamp and decides whether to publish them to a queue or not. If created >= 1d ago, publish.
This was a challenge for us as well and I never found a perfect solution so I ended up building a service to address it. Obviously self promotion here but the system allows you to work around the DelaySeconds limitation and set arbitrary date/times at scale.
https://anticipated.io
Some of the challenges working with Step Functions are scale of registered machines (if your system had that requirement). If you use EventBridge to fire them you run out of allowable rulesets (limit is 200 as of this posting). Example: if you need to set 150,000 arbitrary events a month you run into limits quickly.