Why do SQS messages sometimes remain in flight on the queue? - amazon-web-services

I'm using Amazon SQS queues in a very simple way. Usually, messages are written, become visible immediately, and are read. Occasionally, a message is written and remains In-Flight (Not Visible) on the queue for several minutes. I can see it from the console. Receive Message Wait Time is 0, and Default Visibility Timeout is 5 seconds. The message stays that way for several minutes, or until a new message gets written that somehow releases it. A delay of a few seconds is OK, but more than 60 seconds is not.
There are 8 reader threads that are always long polling, so it's not that nothing is trying to read it; they are.
Edit: To be clear, none of the consumer reads are returning any messages at all, and it happens regardless of whether or not the console is open. In this scenario, only one message is involved, and it is just sitting in the queue, invisible to the consumers.
Has anyone else seen this behavior, and what can I do to improve it?
Here is the SDK for Java I am using:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.5.2</version>
</dependency>
Here is the code that does the reading (max=10, maxWait=0 startup config):
void read(MessageConsumer consumer) {
    List<Message> messages = read(max, maxWait);
    for (Message message : messages) {
        if (tryConsume(consumer, message)) {
            delete(message.getReceiptHandle());
        }
    }
}

private List<Message> read(int max, int maxWait) {
    AmazonSQS sqs = getClient();
    ReceiveMessageRequest rq = new ReceiveMessageRequest(queueUrl);
    rq.setMaxNumberOfMessages(max);
    rq.setWaitTimeSeconds(maxWait);
    List<Message> messages = sqs.receiveMessage(rq).getMessages();
    if (messages.size() > 0) {
        LOG.info("read {} messages from SQS queue", messages.size());
    }
    return messages;
}
The log line for "read .." never appears when this is happening, and that is what prompts me to go into the console and check whether the message is there, and it is.

It sounds like you are misinterpreting what you are seeing.
Messages "in flight" are not pending delivery, they're messages that have already been delivered but not further acted on by the consumer.
Messages are considered to be in flight if they have been sent to a client but have not yet been deleted or have not yet reached the end of their visibility window.
— https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
When a consumer receives a message, it has to, at some point, either delete the message or send a request to increase the timeout for that message. If it does neither, the message automatically becomes visible again once the timeout expires. The visibility timeout is how long the consumer has before one of these things must be done.
Messages should not be "in flight" without something having already received them -- but that "something" can include the console itself, as you'll note on the pop-up you see when you choose "View/Delete Messages" in the console (unless you already checked the "Don't show this again" checkbox):
Messages displayed in the console will not be available to other applications until the console stops polling for messages.
Messages displayed in the console are "in flight" while the console is observing the queue from the "View/Delete Messages" screen.
The part that does not make obvious sense is messages being in flight "for several minutes" when your default visibility timeout is only 5 seconds and nothing in your code is increasing that timeout. However, that could be explained almost perfectly by your consumers not properly disposing of the message: it times out and is immediately redelivered, giving the impression that a single instance of the message is remaining in flight. In fact, the message is briefly transitioning back to visible, only to be claimed almost immediately by another consumer, which takes it back to in flight again.
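One way to test the redelivery hypothesis is to ask SQS for each message's ApproximateReceiveCount when receiving. The following is a minimal sketch, assuming a boto3-style SQS client is passed in as sqs; the function name redelivered_messages is hypothetical. Note that this call is itself a receive, so the returned messages go in flight until their visibility timeout expires, and the sketch does not delete them.

```python
def redelivered_messages(sqs, queue_url, wait_seconds=20):
    """Return (message, receive_count) pairs for messages delivered more than once.

    `sqs` is assumed to be a boto3-style SQS client.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,  # long poll instead of returning immediately
        AttributeNames=["ApproximateReceiveCount"],
    )
    repeats = []
    # The "Messages" key is absent when nothing was received.
    for message in response.get("Messages", []):
        count = int(message["Attributes"]["ApproximateReceiveCount"])
        if count > 1:
            repeats.append((message, count))
    return repeats
```

A count that keeps climbing on the same message would suggest it is being received and never deleted, rather than sitting undelivered.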

This may happen when you send or lock a message and then try to get a fresh list of messages within a few seconds. Amazon SQS stores the data on multiple servers and in multiple data centers: http://aws.amazon.com/sqs/faqs/#How_reliably_is_my_data_stored_in_Amazon_SQS.
To avoid this, wait longer before reading, so that the queue has more time to return consistent results.

Related

Understanding SQS message receive amount

I have a queue that is supposed to receive the messages sent by a Lambda function. The function is supposed to send each distinct message only once. However, I saw a scarily high receive count in the console.
Since I cannot find a plain-English explanation of receive count, I am asking the Stack Overflow community. I have 2 theories to verify:
There are actually not that many messages, and the receive count is that high simply because I polled for messages for a long time, so the same messages were received more than once;
The function that sends the messages to the queue is SQS-triggered, so those messages might be processed by multiple processors. Although I have already set VisibilityTimeout, are processed messages deleted? If they do not remain in the queue, there is no reason for them to be received and processed a second time.
Any debugging suggestion will be appreciated!!
So, receive count is basically the number of times the Lambda (or any other consumer) has received the message. A consumer can receive the same message more than once (this is by design, and you should handle it in your logic).
That being said, the receive count also increases if your lambda fails to process the message (or even hits the execution limits). The default is 3 times, so if something with your lambda is wrong, you will have at least 3 receives per message.
Also, when you poll for messages via the AWS console, you increase the receive count.

SQS queue sometimes freezes

SQS sometimes stops receiving messages or allowing message consumption, then resumes after ~5 mins. Do you know if there is a setting that can produce this behavior? I was playing around with the settings but could not change this behavior.
Notice: When I send a message, I get the ID and the OK as it was received, but the message is not in the queue.
If you are getting an ID but the message is not in the queue, I believe you are using a FIFO queue, which ignores duplicate messages within the deduplication interval (5 minutes). Whatever is feeding the queue needs to supply a distinct deduplication ID for each message if you want duplicates to be processed.
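As an illustration of the deduplication point, here is a minimal sketch that attaches a fresh deduplication ID to every send, so that intentionally repeated bodies are not dropped by a FIFO queue. It assumes a boto3-style SQS client passed in as sqs; the function name send_distinct is hypothetical.

```python
import uuid


def send_distinct(sqs, queue_url, body, group_id):
    """Send `body` to a FIFO queue with a unique deduplication ID.

    `sqs` is assumed to be a boto3-style SQS client. Returns the
    deduplication ID used and the MessageId reported by SQS.
    """
    dedup_id = str(uuid.uuid4())  # unique per call, so repeats are not treated as duplicates
    response = sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=body,
        MessageGroupId=group_id,          # required for FIFO queues
        MessageDeduplicationId=dedup_id,
    )
    return dedup_id, response["MessageId"]
```

The trade-off is that a random ID also disables protection against accidental resends, so this only makes sense when duplicates really are meant to be processed.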

Amazon SQS message disappeared

I have an Amazon SQS queue and a dead letter queue.
My Python program gets a message from the SQS queue and, if it raises an exception, sends the message to the dead letter queue.
I also have a program that checks whether the messages in the dead letter queue can still be processed. If they can, they are sent back to the main SQS queue. What I expect in my testing is an infinite loop of sorts, but apparently the message disappears after 2 tries. Why is that?
When I put an extra field with a random value in the message, it does what I expect (an infinite loop of sending back and forth). Is there a mechanism in SQS that prevents this when the message is the same?
def handle_retrieved_messages(self):
    if not self._messages:
        return None
    for message in self._messages:
        try:
            logger.info(
                "Processing Dead Letter message: {}".format(
                    message.get("Body")
                )
            )
            message_body = self._convert_json_to_dict(message.get("Body"))
            reprocessed = self._process_message(
                message_body, None, message_body
            )
        except Exception as e:
            logger.exception(
                "Failed to process the following SQS message:\n"
                "Message Body: {}\n"
                "Error: {}".format(message.get("Body", "<empty body>"), e)
            )
            # Send to error queue
            self._delete_message(message)
            self._sqs_sender.send_message(message_body)
        else:
            self._delete_message(message)
            if not reprocessed:
                # Send to error queue
                self._sqs_sender.send_message(message_body)
self._process_message checks whether message_body has the reprocess flag set to true. If so, it sends the message back to the main queue.
I deliberately made the message contents invalid, so every time the message is processed from the main queue it goes to the dead letter queue. I expected this to loop forever, but SQS seems to have a mechanism that stops this from happening (which is good).
The question is: which setting is that?
The normal way that an Amazon SQS queue works is:
Messages are sent to the queue
An application calls ReceiveMessage() on the queue to receive a message (or multiple messages). This increments the Receive Count on a message.
This puts the message(s) into an invisible state. This means that the message is still in the queue, but it is not visible if another application tries to receive messages from the queue
Once the application has finished processing the message, it calls DeleteMessage(), providing the receipt handle of the message. This removes the message from the queue.
However, if the application does not delete the message within the invisibility timeout period, then the message appears on the queue again. This is done in case the application has crashed. Instead of losing the message, it is put back on the queue so that another (or the same) application can process it again.
If a message exceeds the invisibility timeout period AND its Receive Count now exceeds the Maximum Receives setting, it is not put back on the queue. Instead, it is placed on the Dead Letter Queue (DLQ).
So, the normal process is that Amazon SQS moves messages to the DLQ after the message has been received more than (in your case) 10 attempted Receives. It is NOT the job of your application to move the message to the Dead Letter Queue!
If you want to handle all the 'dead letter' processing yourself (e.g. moving messages to different queues), then turn off the DLQ functionality on the queue itself. Having both in place is probably what is causing your messages to disappear or go to the wrong location.
By the way, when deleting a message, you need to provide the ReceiptHandle of the message, not the message itself.
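The receive, process, delete cycle described above can be sketched as follows. This is an illustration only, assuming a boto3-style SQS client passed in as sqs and a caller-supplied handler; the function name consume_once is hypothetical. On failure the message is deliberately left alone so that SQS itself can redeliver it or move it to the DLQ.

```python
def consume_once(sqs, queue_url, handler):
    """Receive one batch, delete what was handled, and return the number deleted.

    `sqs` is assumed to be a boto3-style SQS client; `handler` takes the
    message body and raises on failure.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,
    )
    deleted = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Do not delete and do not move it ourselves: the message becomes
            # visible again after the visibility timeout, and the queue's
            # redrive policy sends it to the DLQ once Maximum Receives is exceeded.
            continue
        # Deletion needs the receipt handle from this receive, not the message itself.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        deleted += 1
    return deleted
```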

How to prevent other workers from accessing a message which is being currently processed?

I am working on a project that will require multiple workers to access the same queue to get information about a file which they will manipulate. Files range in size from mere megabytes to hundreds of gigabytes. For this reason, a visibility timeout doesn't seem to make sense, because I cannot be certain how long processing will take. I have thought of a couple of ways, but if there is a better way, please let me know.
The message is deleted from the original queue and put into a
‘waiting’ queue. When the program finished processing the file, it
deletes it, otherwise the message is deleted from the queue and put
back into the original queue.
The message id is checked with a database. If the message id is
found, it is ignored. Otherwise the program starts processing the
message and inserts the message id into the database.
Thanks in advance!
Use the default-provided SQS timeout but take advantage of ChangeMessageVisibility.
You can specify the timeout in several ways:
When the queue is created (default timeout)
When the message is retrieved
By having the worker call back to SQS and extend the timeout
If you are worried that you do not know the appropriate processing time, use a default value that is good for most situations, but don't make it so big that things become unnecessarily delayed.
Then, modify your workers to make a ChangeMessageVisibility call to SQS periodically to extend the timeout. If a worker dies, the message stops being extended and it will reappear on the queue to be processed by another worker.
See: MessageVisibility documentation
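The periodic extension can be sketched with a background thread. This is a rough illustration, assuming a boto3-style SQS client passed in as sqs; the function name process_with_heartbeat and its parameters are hypothetical. If the worker process dies, the heartbeat thread dies with it and the message reappears after the last extension runs out.

```python
import threading


def process_with_heartbeat(sqs, queue_url, message, work, timeout=60, interval=None):
    """Run `work` on the message body while periodically extending its visibility.

    `sqs` is assumed to be a boto3-style SQS client. `timeout` is the visibility
    timeout in seconds requested on each extension; `interval` is how often to
    extend (defaults to half the timeout).
    """
    if interval is None:
        interval = timeout / 2
    stop = threading.Event()

    def heartbeat():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            sqs.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
                VisibilityTimeout=timeout,
            )

    thread = threading.Thread(target=heartbeat, daemon=True)
    thread.start()
    try:
        result = work(message["Body"])
    finally:
        stop.set()
        thread.join()
    # Only reached when work() succeeded; on failure the message is left to reappear.
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
    return result
```

Keep in mind that SQS caps the total visibility timeout of a message at 12 hours, so extensions cannot continue indefinitely.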

Celery on SQS - Handling Duplicates [duplicate]

I know that it is possible to consume a SQS queue using multiple threads. I would like to guarantee that each message will be consumed once. I know that it is possible to change the visibility timeout of a message, e.g., equal to my processing time. If my process spend more time than the visibility timeout (e.g. a slow connection) other thread can consume the same message.
What is the best approach to guarantee that a message will be processed once?
What is the best approach to guarantee that a message will be processed once?
You're asking for a guarantee - you won't get one. You can reduce probability of a message being processed more than once to a very small amount, but you won't get a guarantee.
I'll explain why, along with strategies for reducing duplication.
Where does duplication come from
When you put a message in SQS, SQS might actually receive that message more than once
For example: a minor network hiccup while sending the message caused a transient error that was automatically retried - from the message sender's perspective, it failed once, and successfully sent once, but SQS received both messages.
SQS can internally generate duplicates
Similar to the first example - there are a lot of computers handling messages under the covers, and SQS needs to make sure nothing gets lost - messages are stored on multiple servers, and this can result in duplication.
For the most part, by taking advantage of SQS message visibility timeout, the chances of duplication from these sources are already pretty small - like fraction of a percent small.
If processing duplicates really isn't that bad (strive to make your message consumption idempotent!), I'd consider this good enough - reducing chances of duplication further is complicated and potentially expensive...
What can your application do to reduce duplication further?
Ok, here we go down the rabbit hole... at a high level, you will want to assign unique ids to your messages, and check against an atomic cache of ids that are in progress or completed before starting processing:
Make sure your messages have unique identifiers provided at insertion time
Without this, you'll have no way of telling duplicates apart.
Handle duplication at the 'end of the line' for messages.
If your message receiver needs to send messages off-box for further processing, then it can be another source of duplication (for similar reasons to above)
You'll need somewhere to atomically store and check these unique ids (and flush them after some timeout). There are two important states: "InProgress" and "Completed"
InProgress entries should have a timeout based on how fast you need to recover in case of processing failure.
Completed entries should have a timeout based on how long you want your deduplication window to be.
The simplest is probably a Guava cache, but would only be good for a single processing app. If you have a lot of messages or distributed consumption, consider a database for this job (with a background process to sweep for expired entries)
Before processing the message, attempt to store the messageId in "InProgress". If it's already there, stop - you just handled a duplicate.
Check if the message is "Completed" (and stop if it's there)
Your thread now has an exclusive lock on that messageId - Process your message
Mark the messageId as "Completed" - As long as this messageId stays here, you won't process any duplicates for that messageId.
You likely can't afford infinite storage though.
Remove the messageId from "InProgress" (or just let it expire from here)
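The steps above can be sketched as a small in-memory store. This is a single-process illustration only (comparable in scope to the Guava cache suggestion); the class name DedupStore and its methods are hypothetical, and a distributed setup would need a shared database instead.

```python
import threading
import time


class DedupStore:
    """Tracks message IDs in "InProgress" and "Completed" states with expiry."""

    def __init__(self, in_progress_ttl, completed_ttl, clock=time.monotonic):
        self._lock = threading.Lock()
        self._in_progress = {}  # message_id -> expiry time
        self._completed = {}    # message_id -> expiry time
        self._in_progress_ttl = in_progress_ttl
        self._completed_ttl = completed_ttl
        self._clock = clock

    def try_start(self, message_id):
        """Atomically claim message_id; return False if it is a duplicate."""
        now = self._clock()
        with self._lock:
            self._drop_expired(now)
            if message_id in self._in_progress or message_id in self._completed:
                return False
            self._in_progress[message_id] = now + self._in_progress_ttl
            return True

    def mark_completed(self, message_id):
        """Record completion first, then release the InProgress entry."""
        now = self._clock()
        with self._lock:
            self._completed[message_id] = now + self._completed_ttl
            self._in_progress.pop(message_id, None)

    def _drop_expired(self, now):
        for table in (self._in_progress, self._completed):
            for key in [k for k, expiry in table.items() if expiry <= now]:
                del table[key]
```

A consumer would call try_start(message_id) before processing and mark_completed(message_id) afterwards, skipping the message whenever try_start returns False.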
Some notes
Keep in mind that chances of duplicate without all of that is already pretty low. Depending on how much time and money deduplication of messages is worth to you, feel free to skip or modify any of the steps
For example, you could leave out "InProgress", but that opens up the small chance of two threads working on a duplicated message at the same time (the second one starting before the first has "Completed" it)
Your deduplication window is as long as you can keep messageIds in "Completed". Since you likely can't afford infinite storage, make this last at least as long as 2x your SQS message visibility timeout; there are reduced chances of duplication after that (on top of the already very low chances, but still not guaranteed).
Even with all this, there is still a chance of duplication - all the precautions and SQS message visibility timeouts help reduce this chance to very small, but the chance is still there:
Your app can crash/hang/do a very long GC right after processing the message, but before the messageId is "Completed" (maybe you're using a database for this storage and the connection to it is down)
In this case, the "InProgress" entry will eventually expire, and another thread could process this message (either after the SQS visibility timeout also expires or because SQS had a duplicate in it).
Store the message, or a reference to the message, in a database with a unique constraint on the Message ID, when you receive it. If the ID exists in the table, you've already received it, and the database will not allow you to insert it again -- because of the unique constraint.
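A minimal sketch of the unique-constraint approach, using SQLite from the standard library purely for illustration; the table name received and the function names are hypothetical, and any database that enforces a unique or primary key works the same way.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the table that remembers received message IDs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS received ("
        "message_id TEXT PRIMARY KEY, body TEXT)"
    )
    return conn


def record_if_new(conn, message_id, body):
    """Insert the message; return False if this ID was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO received (message_id, body) VALUES (?, ?)",
                (message_id, body),
            )
    except sqlite3.IntegrityError:
        # The primary key constraint rejected a duplicate ID.
        return False
    return True
```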
The AWS SQS API doesn't automatically "consume" a message when you read it through the API. The developer needs to make the call to delete the message.
SQS does have a feature called "redrive policy" as part of the "Dead Letter Queue Settings". Set Maximum Receives to 1. If the consuming process crashes, a subsequent read of the same message will move it to the dead letter queue.
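As a rough sketch, the redrive policy can also be set through the API rather than the console. This assumes a boto3-style SQS client passed in as sqs and an existing dead letter queue whose ARN you supply; the function name set_redrive_policy is hypothetical.

```python
import json


def set_redrive_policy(sqs, queue_url, dead_letter_queue_arn, max_receive_count=1):
    """Attach a redrive policy to the source queue and return the policy dict.

    `sqs` is assumed to be a boto3-style SQS client.
    """
    policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        "maxReceiveCount": max_receive_count,
    }
    # The RedrivePolicy attribute is passed as a JSON-encoded string.
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": json.dumps(policy)},
    )
    return policy
```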
The SQS visibility timeout can be set to at most 12 hours. If you have a special need beyond that, you need to implement a process that stores the message handle in a database so that it can be inspected.
You can use setVisibilityTimeout() for both messages and batches, in order to extend the visibility time until the thread has completed processing the message.
This could be done with a ScheduledExecutorService, by scheduling a Runnable to first run after half the initial visibility time. The code snippet below creates a VisibilityTimeExtender and runs it every visibilityTime/2 seconds, after an initial delay of visibilityTime/2. (Each run should extend the timeout by enough, at least visibilityTime/2, to guarantee that the message can still be processed.)
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
ScheduledFuture<?> futureEvent = scheduler.scheduleAtFixedRate(new VisibilityTimeExtender(..), visibilityTime/2, visibilityTime/2, TimeUnit.SECONDS);
VisibilityTimeExtender must implement Runnable, and is where you update the new visibility time.
When the thread is done processing the message, you can delete it from the queue, and call futureEvent.cancel(true) to stop the scheduled event.