SQS metric NumberOfMessagesReceived growing - amazon-web-services

NumberOfMessagesReceived is larger in SQS than NumberOfMessagesSent.
For almost a month the counts for both of the above metrics, NumberOfMessagesReceived and NumberOfMessagesSent, were equal, but suddenly the NumberOfMessagesReceived count started increasing relative to NumberOfMessagesSent.
As per my understanding, NumberOfMessagesSent is the count of messages added to the SQS queue and NumberOfMessagesReceived is the count of messages received by the consumer. So how can the number of received messages be larger than the number of sent messages?
Also, the approximate number of visible messages is growing.
Can anyone please help by explaining how this can happen?
Thanks in advance.

The metric includes retries. From the docs:
These metrics are calculated from a service perspective, and can include retries. Don't rely on the absolute values of these metrics, or use them to estimate current queue status.
So your application may be trying to process the same messages multiple times.

Related

Why is NumberOfMessagesDeleted > NumberOfMessagesSent

I don't understand the metrics I'm looking at for my SQS non-FIFO queue (images below), so I'm hoping someone can help me. The attached images show how I've configured the metrics; they are the sums of the number of messages sent and the number of messages deleted for the lifetime of this SQS queue (the queue is less than 1 week old, but I've set the metrics period to 2 weeks).
It's my understanding that NumberOfMessagesSent refers to the number of messages that have been successfully enqueued and that NumberOfMessagesDeleted is the number of messages that have been successfully dequeued. Given that line of thinking, I would think that NumberOfMessagesDeleted should always be <= NumberOfMessagesSent, but this is clearly not the case.
What am I missing here?
For every message you consume you get a receipt handle. You can call DeleteMessage with this handle multiple times; each of these calls is recorded as successful, increasing the value of the NumberOfMessagesDeleted metric.
In fact, the AWS docs provide two examples of when NumberOfMessagesDeleted can be larger than expected:
In the case of multiple consumers for the same queue:
If the message is not processed before the visibility timeout expires, the message becomes available to other consumers that can process it and delete it again, increasing the value of the NumberOfMessagesDeleted metric.
Calling DeleteMessage multiple times for the same message:
If the message is processed and deleted but you call the DeleteMessage action again using the same receipt handle, a success status is returned, increasing the value of the NumberOfMessagesDeleted metric.
The second one may occur if you have a bug in your code. For example, the library you use automatically deletes the message after it is received, but you also attempt to delete the message manually.
Furthermore, non-FIFO SQS queues may encounter message duplication, which can also increase the number of messages deleted.
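To make the second case concrete, here is a minimal boto3 sketch (the queue URL is a placeholder, and it assumes at least one message is available) showing that deleting the same message twice with the same receipt handle succeeds both times, and each call counts toward NumberOfMessagesDeleted:

import boto3

sqs = boto3.client("sqs")
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Receive a single message from the queue (assumes one is available).
response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1)
message = response["Messages"][0]
receipt_handle = message["ReceiptHandle"]

# First delete: the message is removed from the queue.
sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)

# Second delete with the same receipt handle: SQS still returns success,
# and NumberOfMessagesDeleted is incremented again.
sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=receipt_handle)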

AWS SQS - Relationship between number of queue consumers and the number of in-flight messages

I have a standard AWS SQS queue and multiple EC2 instances (~2K) actively polling that queue at an interval of 2 seconds.
I'm using the AWS Java SDK to poll the queue, using ReceiveMessageRequest with a single message in the response for each request.
My expectation is that the number of in-flight messages shown in the SQS console is the number of messages received by the consumers and not yet deleted from the queue (i.e. the number of active messages under processing at an instant). But the problem is that the number of in-flight messages is much less than the number of consumers I have at an instant. As I mentioned, I have ~2K consumers, but I only see an in-flight message count in approximately the 300-600 range.
Is my assumption wrong that the in-flight message count equals the number of messages currently under processing? Also, is there any limitation in SQS, EC2, or the SQS Java SDK that limits the number of messages that can be processed at an instant?
This might point to a larger than expected amount of time that your hosts are NOT actively processing messages.
From your example of 2000 consumers polling at an interval of 2s but only topping out at 600 in-flight messages, some very rough math (600/2000 = 0.3) would indicate your hosts are only spending 30% of their time actually processing. In the simplest case, this would happen if the poll/process/delete of a message takes only 600ms, leaving an average of 1400ms of idle time between deleting one message and receiving the next.
A good pattern for doing high-volume message processing is to think of message processing in terms of thread pools - one for fetching messages, one for processing, and one for deleting (with a local in-memory queue to transition messages between each pool); a rough sketch in Python follows after this list. Each pool has a very specific purpose, and can be more easily tuned to do that purpose really well:
Have enough fetchers (using the batch ReceiveMessage API) to keep your processors unblocked
Limit the size of the in-memory queue between fetchers and processors so that a single host doesn't put too many messages in flight (blocking other hosts from handling them)
Add as many processor threads as your host can handle
Keep metrics on how long processing takes, and provide ability to abort processing if it exceeds a certain time threshold (related to visibility timeout)
Use enough deleters to keep up with processing (also using the batch DeleteMessage API)
By recording metrics on each stage and the in-memory queues between each stage, you can easily pinpoint where your bottlenecks are and fine-tune the system further.
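As a rough illustration of that pattern, here is a minimal Python sketch using boto3 and standard-library threads; the queue URL, thread counts, buffer sizes, and process_message are placeholder assumptions rather than a reference implementation, and the timing/abort metrics mentioned above are omitted for brevity:

import queue
import threading
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder

# Bounded hand-off queues so a single host doesn't put too many messages in flight.
to_process = queue.Queue(maxsize=50)
to_delete = queue.Queue(maxsize=50)

def fetcher():
    # Fetch in batches with long polling to keep the processors fed.
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,   # batch receive
            WaitTimeSeconds=20,       # long polling
        )
        for msg in resp.get("Messages", []):
            to_process.put(msg)       # blocks when the local buffer is full

def processor():
    while True:
        msg = to_process.get()
        process_message(msg["Body"])  # hypothetical business logic
        to_delete.put(msg)

def deleter():
    # Delete in batches of up to 10 to reduce round trips to SQS.
    while True:
        batch = [to_delete.get()]
        while len(batch) < 10 and not to_delete.empty():
            batch.append(to_delete.get())
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
                for i, m in enumerate(batch)
            ],
        )

def process_message(body):
    pass  # placeholder for the real work

# Assumed pool sizes: tune each stage independently based on your metrics.
for target, count in ((fetcher, 2), (processor, 8), (deleter, 1)):
    for _ in range(count):
        threading.Thread(target=target, daemon=True).start()

# Keep the main thread alive while the worker threads run.
threading.Event().wait()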
Other things to consider:
Use long polling - set WaitTimeSeconds property in the ReceiveMessage API to minimize empty responses
When you see low throughput, make sure your queue is saturated - if there are very few items in the queue and a lot of processors, many of those processors are going to sit idle waiting for messages.
Don't poll on an interval - poll as soon as you're done processing the previous messages.
Use batching to request/delete multiple messages at once, reducing time spent on round-trip calls to SQS
Generally speaking, as the number of consumers goes up, the number of messages in flight will go up as well. Each consumer can request up to 10 messages per read request, but in reality, even if each consumer always requests 10, it will get anywhere from 0-10 messages, especially when the number of messages is low and the number of consumers is high.
So your thinking is more or less correct, but you can't accurately predict precisely how many messages are in flight at any given time based on the number of consumers currently running; there is only a rough correlation between the two.

GCloud Pub/Sub Push Subscription: Limit max outstanding messages

Is there a way in a push subscription configuration to limit the maximum number of outstanding messages? In the high-level subscriber docs (https://cloud.google.com/pubsub/docs/push) it says "With slow-start, Google Cloud Pub/Sub starts by sending a single message at a time, and doubles up with each successful delivery, until it reaches the maximum number of concurrent messages outstanding." I want to be able to limit the maximum number of messages being processed; can this be done through the Pub/Sub config?
I've also thought of a number of other ways to effectively achieve this, but none seem great:
Have some semaphore-type system implemented in my push endpoint that returns a 429 once my max concurrency level is hit?
Similarly, have it deregister the push endpoint (turning it into a pull subscription) until the current messages have been processed?
My push endpoints are all on GAE, so there could also be something in the GAE configs to limit the simultaneous push subscription requests?
Push subscriptions do not offer any way to limit the number of outstanding messages. If one wants that level of control, then it is necessary to use pull subscriptions and flow control.
Returning 429 errors as a means to limit outstanding messages may have undesirable side effects. On errors, Cloud Pub/Sub will reduce the rate of sending messages to a push subscriber. If a sufficient number of 429 errors are returned, it is entirely possible that the subscriber will receive a smaller number of messages than it can handle for a time while Cloud Pub/Sub ramps the delivery rate back up.
Switching from push to pull is a possibility, though still may not be a good solution. It would really depend on the frequency with which the push subscriber exceeds the desired number of outstanding messages. The change between push and pull and back may not take place instantaneously, meaning the subscriber could still exceed the desired limit for some period of time and may also experience a delay in receiving new messages when switching back to a push subscriber.
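If you do move to a pull subscription, the client libraries expose flow control for exactly this purpose. A minimal sketch assuming the streaming-pull client from the google-cloud-pubsub Python library (the project, subscription, and handle function are placeholders):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "my-subscription")  # placeholders

def callback(message):
    handle(message.data)  # hypothetical processing function
    message.ack()

def handle(data):
    pass  # placeholder for the real work

# Flow control caps the number of messages held by this client at once.
flow_control = pubsub_v1.types.FlowControl(max_messages=200)

future = subscriber.subscribe(subscription_path, callback=callback, flow_control=flow_control)
future.result()  # block and keep pulling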

Cloud pubsub slow poll rate

I have a pubsub topic, with one subscription, and two different subscribers are pulling from it.
Using stackdriver, I can see that the subscription has ~1000 messages.
Each subscriber runs the following poll loop:
from google.cloud import pubsub

client = pubsub.Client()
topic = client.topic(topic_name)
subscription = pubsub.Subscription(subscription_name)
while True:
    messages = subscription.pull(return_immediately=True, max_messages=100, client=client)
    print(len(messages))
    # put messages in a local queue for later processing. Those processes will ack the subscription
My issue is a slow poll rate - even though I have plenty of messages waiting to be pulled, I'm getting only a few messages each time. Also, lots of responses come back without any messages. According to Stackdriver, my messages-pulled rate is ~1.5 messages/sec.
I tried to use return_immediately=False, and it improved it a bit - the pull rate increased to ~2.5 messages/sec, but still - not the rate I would expect to have.
Any ideas how to increase pull rate? Any pubsub poll best practices?
In order to increase your pull rate, you need to have more than one outstanding pull request at a time. How many depends on how fast and from how many places you publish. You'll need at least a few outstanding at all times. As soon as one of them returns, create another pull request. That way, whenever Cloud Pub/Sub is ready to deliver messages to your subscriber, you have requests waiting to receive messages.
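One way to keep several pull requests outstanding with the client shown in the question is to issue them from a small pool of threads, reusing the subscription and client objects from the snippet above. This is only a sketch: the thread count and hand-off queue are assumptions, and it presumes the client can be shared across threads (if not, create one per thread):

import queue
import threading

local_queue = queue.Queue()  # hand off to the existing processing/acking workers
NUM_PULLERS = 5              # assumed number of concurrent outstanding pull requests

def puller():
    while True:
        # Each thread keeps one pull request outstanding at all times;
        # as soon as a response returns, it immediately issues another.
        messages = subscription.pull(return_immediately=False, max_messages=100, client=client)
        for message in messages:
            local_queue.put(message)

for _ in range(NUM_PULLERS):
    threading.Thread(target=puller, daemon=True).start()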

How do I limit the number of messages in my SQS queue?

I want to have only 200 messages there.
All others should move to the dead letter queue.
We just don't have the capacity to process more messages due to dependency on other services.
I don't think it is possible to limit the number of messages in a queue. You can set a limit to the size of a message in a queue but not the number of messages.
Source: SetQueueAttributes
You definitely can't limit the number of messages in the queue.
What is the nature of your application? Maybe there is a better solution if we knew more about why you need to limit the queue size...
SQS does not have such a limiting feature.
So don't try to do it at the SQS level. Instead, implement this limiting logic as you're pulling messages from the queue.
Keep track of the messages you pull from the queue and send to the 3rd party service. Once you hit your limit (200, in your case), junk the message:
Have a counter of messages that are "being processed".
1. Pull a message from the queue.
2. Check the counter and if it's less than 200, increment the counter and send the message to the 3rd party service.
3. When the 3rd party service call returns, decrement the counter.
4. When you check the counter in step 2 above and it's already at 200, junk the message.
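A minimal sketch of that counter in Python with boto3, using a semaphore as the counter. The queue URL, the 200 limit, and call_third_party are placeholder assumptions, and handle_one is meant to be run from multiple consumer threads:

import threading
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
LIMIT = 200

# The semaphore is the "being processed" counter: acquire = increment, release = decrement.
in_flight = threading.Semaphore(LIMIT)

def handle_one():
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
    for msg in resp.get("Messages", []):
        if not in_flight.acquire(blocking=False):
            # Already at the limit: junk the message (here it is simply deleted;
            # you could instead forward it to your own dead letter queue).
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
            continue
        try:
            call_third_party(msg["Body"])  # hypothetical downstream call
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
        finally:
            in_flight.release()

def call_third_party(body):
    pass  # placeholder for the real dependency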
UPDATE:
If you don't have delayed visibility set on the queue, you could check ApproximateNumberOfMessagesVisible via the CloudWatch API before allowing the message to go through:
The number of messages available for retrieval from the queue.
Units: Count
Valid Statistics: Average
If you do have delayed visibility greater than 0, you could do two checks, the second with ApproximateNumberOfMessagesNotVisible.
If this solution doesn't work (yes, this seems a bit much), you could pull NumberOfMessagesDeleted and NumberOfMessagesSent and derive the number of messages still in the queue.
So the (pseudo) "code" would look like:
if (ApproximateNumberOfMessagesVisible < 200)
    //Send message

//OR

var x = NumberOfMessagesSent - NumberOfMessagesDeleted;
if (x < 200)
    //Send message
HERE is the documentation of the above calls
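A rough boto3 version of the first branch of that pseudo-code, reading ApproximateNumberOfMessagesVisible from CloudWatch before sending (the queue name, URL, time window, and message body are placeholders):

import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")
sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"  # placeholder
QUEUE_NAME = "my-queue"                                                  # placeholder

def approximate_visible():
    # Read the most recent ApproximateNumberOfMessagesVisible datapoint.
    now = datetime.datetime.utcnow()
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateNumberOfMessagesVisible",
        Dimensions=[{"Name": "QueueName", "Value": QUEUE_NAME}],
        StartTime=now - datetime.timedelta(minutes=5),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else 0

if approximate_visible() < 200:
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody="hello")  # placeholder payload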
Information Check?
After a second look, I do not believe the below configuration will solve this problem. I will leave it in the answer until confirmed incorrect.
I believe this is possible during the setup process by adjusting the Dead Letter Queue settings.
In the SQS setup you will see Use Redrive Policy, which states:
Send messages into a dead letter queue after exceeding the Maximum Receives.
And just below that, Maximum Receives, which states:
The maximum number of times a message can be received before it is sent to the Dead Letter Queue.
This setting should send all overflow to a secondary queue, which is an optional value. So in other words, you can enable the Redrive Policy, leave the Dead Letter Queue blank, and set Maximum Receives to 200.
amazon-web-services amazon-sqs amazon-cloudwatch