We are currently implementing a distributed Spring Boot microservice architecture on AWS, using SNS/SQS as our messaging system:
Events are published by a Spring Boot service to an SNS FIFO topic using Spring Cloud AWS. The topic hands over the events to multiple SQS queues subscribed to the topic, and the queues are then in turn consumed by different consumer services (again Spring Boot using Spring Cloud AWS).
Everything works as intended, but we are sometimes seeing very high latency on our production services.
Our product isn't released yet (we are currently in testing), meaning we have very, very low traffic on prod, i.e., only a few messages a day.
Unfortunately, we see very high latency until a message is delivered to its subscribers after a long period of inactivity (typically up to 6 seconds, but can be as high as 60 seconds). Things speed up considerably afterwards with message delivery times dropping to below 100ms for the next messages being sent to the topic.
Turning on logging on the SNS topic in AWS revealed that most of the delay for the first message is spent at the SNS part of things, where the SNS dwellTime roughly correlates with the delays we are seeing in message delivery. Spring Cloud AWS seems fine.
Is this something expected? Is there something like a "cold startup" time for idle SNS FIFO topics (as seen when using AWS lambdas)? Will this latency simply go away once we increase the load and heat up the topic? Or is there something we missed configuring?
We are using fairly standard SQS subscriptions, btw, no subscription throttling in place. The Spring Boot services run on a Fargate ECS cluster.
It seems that AWS somehow deactivates unused SNS topics. What we are doing now is sending a "dummy" keep-alive message to the topic every ten minutes, which keeps the dwellTime reasonably low for us (<500ms).
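For illustration, the keep-alive workaround could be sketched as a small standalone Python script (not part of the Spring Boot services). This is only a sketch under assumptions: `sns_client` is expected to be a boto3 SNS client (for example `boto3.client("sns")`), the group id `keep-alive` and the `type` message attribute are made-up names, and the subscribers would have to ignore or filter out the dummy message themselves.

```python
import time
import uuid


def publish_keep_alive(sns_client, topic_arn, group_id="keep-alive"):
    """Publish one dummy message to a FIFO topic.

    FIFO topics require a MessageGroupId; a fresh MessageDeduplicationId
    keeps deduplication from dropping the repeated dummy payload.
    """
    return sns_client.publish(
        TopicArn=topic_arn,
        Message="keep-alive",
        MessageGroupId=group_id,
        MessageDeduplicationId=str(uuid.uuid4()),
        # Hypothetical attribute so consumers can recognise the dummy message.
        MessageAttributes={
            "type": {"DataType": "String", "StringValue": "keep-alive"}
        },
    )


def run_keep_alive(sns_client, topic_arn, interval_seconds=600,
                   iterations=None, sleep=time.sleep):
    """Send a keep-alive every interval_seconds (600 s = ten minutes).

    iterations=None loops forever; a number is handy for testing.
    """
    sent = 0
    while iterations is None or sent < iterations:
        publish_keep_alive(sns_client, topic_arn)
        sent += 1
        sleep(interval_seconds)
    return sent
```

In practice the same thing could be done from a scheduled task inside one of the services; the separate script just keeps the example short.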
I have a Cloud Function that is being triggered from a Pub/Sub topic.
I want to rate limit my Cloud Function, so I set the max instances to 5. In my case, there will be a lot more produced messages than Cloud Functions (and I want to limit the number of running Cloud Functions).
I expected this process to behave like Kafka or a queue: the messages accumulate in the topic, and the Cloud Function slowly consumes them until the topic is empty.
But it seems that all the messages that did not trigger a Cloud Function (ack) simply got an UNACK and were left behind. My subscription details:
The maximum ack deadline is too low for me (it may take a few hours until the Cloud Function gets to the messages because of the rate limiting).
Is there anything I can change in Pub/Sub to fit my needs? Or will I need to add a queue (Pub/Sub sends to a Task Queue, and the Cloud Function consumes the Task Queue)?
BTW, The pub/sub data is actually GCS events.
If this was AWS, I would simply send S3 file-created events to SQS and have Lambdas on the other side of the queue to consume.
Any help would be appreciated.
The ideal solution is simply to change the retrying policy.
When using "Retry after exponential backoff delay", Pub/Sub keeps retrying even after the delay has reached the maximum exponential delay (600 seconds).
This way, you can have a lot of messages waiting in Pub/Sub and work through them slowly with a few Cloud Functions, which fits our need for rate limiting.
Basically, everything is the same but this configuration changed, and the result is:
Which is exactly what I was looking for :)
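To get a feeling for what "exponential backoff capped at 600 seconds" means for a message that keeps getting rejected, here is a rough sketch. The 600-second cap comes from the answer above; the 10-second starting delay and the doubling factor are assumptions for illustration only, since the exact growth curve Pub/Sub uses is not stated here.

```python
def backoff_schedule(attempts, min_backoff=10.0, max_backoff=600.0, factor=2.0):
    """Return illustrative retry delays (seconds) for successive attempts.

    Each delay grows by `factor` until it hits `max_backoff`, after which
    every further retry waits `max_backoff` seconds.
    """
    return [min(max_backoff, min_backoff * factor ** n) for n in range(attempts)]


# Delays for the first eight retries under the assumed parameters.
print(backoff_schedule(8))
```

Under these assumptions the delay stops growing after a handful of attempts, and from then on the message is simply offered again every 600 seconds until an instance is free to take it.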
You cannot compare this to Kafka, because a Kafka consumer pulls messages at its convenience, while a Cloud Function (CF) creates a push subscription that pushes messages to your CF.
So some alternatives:
Create an HTTP CF, triggered by Cloud Scheduler, that pulls messages from your PULL subscription. The maximum retention of unacked messages is 7 days (hopefully that's enough).
Use Cloud Run, for which you can increase the max concurrency (max concurrent requests), with proper sizing for CPU and RAM. Of course, you can also control the max number of Cloud Run instances (which is different from max concurrency). Then use a PUSH subscription pushing to Cloud Run. But here, too, you will be limited by the 10-minute ack deadline.
I have a simple lambda app that is not in production right now, only being used for testing and debugging. The function sends a message to SQS to perform CRUD operations on an external application. I've set this function to be invoked by SQS when it receives a message, so the same function is sending and receiving.
I've just received an email saying I've used over 85% of my free tier SQS requests quota, or over 850,000 requests in just the past 2 weeks. I'm certain these requests are not messages being sent to queue, or received. The number of sends/receives has to be under 1000 for how often I've used this app. I've also verified using SQS monitoring that there are no messages stuck in queue. And the number of sent messages is more or less what I expected, a low number.
Like I said this app is only being used by myself for testing, a few days per week. Where does the 850,000+ requests come from?
Amazon SQS is charged at $0.40 per million API calls. Calls include send, receive and delete, so it is possible that a message might use 3+ API calls.
From AWS Lambda Adds Amazon Simple Queue Service to Supported Event Sources | AWS News Blog:
There are no additional charges for this feature, but because the Lambda service is continuously long-polling the SQS queue the account will be charged for those API calls at the standard SQS pricing rates.
Long-polling takes 20 seconds, which makes 4320 polls per day. This equates to 60,480 over two weeks or 129,600 per month. Admittedly, it would be more if messages are flowing, since long polling exits whenever there are messages.
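The arithmetic above can be checked with a few lines of Python. This is just a sketch of the calculation for an idle queue; the `pollers` parameter is a hypothetical knob for modelling several connections polling in parallel and is not something stated in the answer.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def empty_polls(days, wait_seconds=20, pollers=1):
    """Number of long polls against an idle queue over `days` days.

    Assumes every poll waits the full `wait_seconds` because no
    messages arrive, so this is a lower bound on the request count.
    """
    return pollers * days * SECONDS_PER_DAY // wait_seconds


print(empty_polls(1))   # one day
print(empty_polls(14))  # two weeks
print(empty_polls(30))  # one month
```

With a single poller this gives 4,320, 60,480 and 129,600, matching the figures in the answer, and still far below 850,000 for two weeks.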
So, either the queue is being used a lot (and you are getting excellent value for your $0.40) or you have something else generating lots of SQS API calls.
If you use the same function for sending to SQS and receiving from SQS, it means that:
Lambda sends a message to SQS -> SQS receives the message -> SQS triggers Lambda -> Lambda sends a message to SQS
And... It's an infinite loop :)
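One way to break such a loop, if it really is what is happening, is to have the handler check how it was invoked and never send a new message while it is processing one that came from the queue. A minimal sketch, assuming the standard SQS event shape (a `Records` list whose entries carry `eventSource == "aws:sqs"` and a `body`); `send_to_queue` and `process_record` are hypothetical callables standing in for the real send and CRUD logic.

```python
def make_handler(send_to_queue, process_record):
    """Build a Lambda-style handler that never re-enqueues SQS deliveries."""

    def handler(event, context=None):
        records = event.get("Records", [])
        sqs_records = [r for r in records if r.get("eventSource") == "aws:sqs"]

        if sqs_records:
            # Invoked by the queue: only process, do not send again.
            for record in sqs_records:
                process_record(record["body"])
            return {"processed": len(sqs_records), "sent": 0}

        # Invoked directly (e.g. a manual test): enqueue the request once.
        send_to_queue(event)
        return {"processed": 0, "sent": 1}

    return handler
```

Splitting sending and receiving into two separate functions would achieve the same thing with less room for mistakes.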
I have over 5M subscribers to an SNS topic. I want to send the push notifications to these users slowly, say at a rate of 20,000 per second. AWS tries to deliver the message to all 5M as fast as possible. Is there any way I can slow down the sends?
There is no configuration setting for Amazon SNS to limit the speed of notifications.
You would probably need to create separate SNS topics and send to each in turn (with a delay).
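As a rough sketch of the "separate topics, one after the other" idea: split the subscribers over several topics and wait between publishes so that the average rate stays near the target. The figure of 100,000 subscribers per topic is an arbitrary assumption, and `publish` is a hypothetical callable that publishes the notification to one topic ARN. Note that delivery within each topic still happens as fast as SNS can manage, so only the average rate is controlled.

```python
import math
import time


def plan_topics(total_subscribers, target_rate_per_sec, subscribers_per_topic):
    """Return (number of topics, delay in seconds between publishes)."""
    topics = math.ceil(total_subscribers / subscribers_per_topic)
    delay_seconds = subscribers_per_topic / target_rate_per_sec
    return topics, delay_seconds


def publish_in_turn(publish, topic_arns, delay_seconds, sleep=time.sleep):
    """Publish to each topic in order, pausing between topics."""
    for index, arn in enumerate(topic_arns):
        if index > 0:
            sleep(delay_seconds)
        publish(arn)


# 5M subscribers at roughly 20,000 per second, 100,000 per topic (assumed).
print(plan_topics(5_000_000, 20_000, 100_000))
```

With these numbers that means 50 topics with a 5-second pause between them, so the whole send takes a little over four minutes.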
If you are doing some form of marketing to mobile apps, you could consider using Amazon Pinpoint instead.
I'm using Spring JMS to communicate with Amazon SQS queues. I set up a handful of queues and wired up the listeners, but the app isn't sending any messages through them currently. AWS allows 1 million requests per month for free, which I thought should be no problem, but after a month I got billed a small amount for going over that limit.
Is there a way to tune SQS or Spring JMS to keep the requests down?
I'm assuming a request is whenever my app polls the queue to check for new messages. Some queues don't need to be near realtime so I could definitely reduce those requests. I'd appreciate any insights you can offer into how SQS and Spring JMS communicate.
"Normal" JMS clients, when polled for messages, don't poll the server - the server pushes messages to the client and the poll is just done locally.
If the SQS client polls the server, that would be unusual, to say the least, but if it's using REST calls, I can see why it would happen.
Increasing the container's receiveTimeout (default 1 second) might help, but without knowing what the client is doing under the covers, it's hard to tell.
I have a first web service that sends messages to AWS SQS; it is deployed on a separate EC2 instance and runs under IIS 8. This web service is able to handle 500 requests per second from each of two machines, meaning 1000 requests per second in total. It can handle more requests.
I have a second web service deployed on another EC2 instance of the same power/configuration. This web service will be used to process the messages stored in SQS. For testing purposes, I am currently only receiving each message from SQS and deleting it.
I have an AWS SNS topic that tells the second web service that a message has arrived in SQS, so that it goes and receives that message for processing.
But I observe that my second web service is not as fast as my first web service: every time I run the test, messages are left in SQS, while ideally no message should remain there.
Please advise on the possible reasons for this and the areas I should focus on.
Thanks in advance.
The receiver has double the work to do since it both receives and deletes the message which is done in two separate calls. You may need double the instances to process the sent messages if you have high volume.
How many messages are you receiving at once? I highly recommend setting MaxNumberOfMessages to 10 and then using DeleteMessageBatch with batches of 10. Not only will this greatly increase throughput, it will also cut your SQS bill by about 60%.
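The "about 60%" figure can be reproduced by counting API calls per message. This sketch assumes the sender still sends one message per call and that every receive returns a full batch of 10, which is the best case; with partly filled batches the saving is smaller.

```python
import math


def api_calls(messages, send_batch=1, receive_batch=1, delete_batch=1):
    """Total SQS API calls to send, receive and delete `messages` messages."""
    return (
        math.ceil(messages / send_batch)
        + math.ceil(messages / receive_batch)
        + math.ceil(messages / delete_batch)
    )


unbatched = api_calls(1000)                                   # 1 call each
batched = api_calls(1000, receive_batch=10, delete_batch=10)  # batches of 10
saving = 1 - batched / unbatched

print(unbatched, batched, round(saving * 100))
```

For 1,000 messages that is 3,000 calls without batching versus 1,200 with batched receives and deletes, i.e. 60% fewer requests.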
Also, I'm confused about the SNS topic. There is no need to have an SNS topic tell the other web service that a message exists. If every message generates a publish to that topic, then you are adding a lot of extra work and expense. Instead, you should use long polling, set WaitTimeSeconds to 20, and just always be calling SQS. Even if you get 0 messages for a month, 2 servers constantly long polling will be well within the free tier. If you are above the free tier, the total cost of 2 servers constantly long polling an SQS queue is $0.13/month.
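The cost estimate can be sanity-checked with a short calculation. The price per million requests is an assumption here (it is not stated in the answer, and SQS pricing has changed over time); $0.50 per million reproduces the $0.13 figure, while $0.40 per million gives about $0.10.

```python
def monthly_long_poll_cost(servers, wait_seconds=20, days=30,
                           price_per_million=0.50, free_requests=0):
    """Return (requests, cost in dollars) for servers polling an empty queue.

    Assumes one long poll at a time per server and that every poll
    waits the full `wait_seconds`.
    """
    requests = servers * days * 24 * 60 * 60 // wait_seconds
    billable = max(0, requests - free_requests)
    return requests, billable * price_per_million / 1_000_000


requests, cost = monthly_long_poll_cost(servers=2)
print(requests, round(cost, 2))
```

Two servers come to 259,200 requests per month, which is also why they stay well inside a free tier of 1 million requests.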