Amazon SQS message disappeared

I have an Amazon SQS queue and a dead letter queue.
My Python program gets a message from the SQS queue and then, if it raises an exception, it sends the message to the dead letter queue.
Now I have a program that checks whether the messages in the dead letter queue can still be processed. If they can, they are sent back to the main SQS queue. What I expect here is an infinite loop of sorts in my testing, but apparently the message disappears after 2 tries. Why is that?
When I put an extra field (with a random value) in the message, it somehow does what I expect (an infinite loop of sending back and forth). Is there a mechanism in SQS that prevents what I am doing when the message stays the same?
def handle_retrieved_messages(self):
    if not self._messages:
        return None
    for message in self._messages:
        try:
            logger.info(
                "Processing Dead Letter message: {}".format(
                    message.get("Body")
                )
            )
            message_body = self._convert_json_to_dict(message.get("Body"))
            reprocessed = self._process_message(
                message_body, None, message_body
            )
        except Exception as e:
            logger.exception(
                "Failed to process the following SQS message:\n"
                "Message Body: {}\n"
                "Error: {}".format(message.get("Body", "<empty body>"), e)
            )
            # Send to error queue
            self._delete_message(message)
            self._sqs_sender.send_message(message_body)
        else:
            self._delete_message(message)
            if not reprocessed:
                # Send to error queue
                self._sqs_sender.send_message(message_body)
self._process_message will check whether message_body has the reprocess flag set to true. If true, it sends the message back to the main queue.
Now I deliberately made the message contents erroneous, so every time it is processed in the main queue it goes to the dead letter queue. I expected this to loop forever, but SQS appears to have a mechanism that stops this from happening (which is good).
Question is: what setting is that?

The normal way that an Amazon SQS queue works is:
Messages are sent to the queue
An application calls ReceiveMessage() on the queue to receive a message (or multiple messages). This increments the Receive Count on a message.
This puts the message(s) into an invisible state. This means that the message is still in the queue, but it is not visible if another application tries to receive messages from the queue
Once the application has finished processing the message, it calls DeleteMessage(), providing the receipt handle of the message. This removes the message from the queue.
However, if the application does not delete the message within the invisibility timeout period, then the message appears on the queue again. This is done in case the application has crashed. Instead of losing the message, it is put back on the queue so that another (or the same) application can process it again.
If a message exceeds the invisibility timeout period AND its Receive Count now exceeds the Maximum Receives setting, it is not put back on the queue. Instead, it is placed on the Dead Letter Queue (DLQ).
So, the normal process is that Amazon SQS moves a message to the DLQ after it has been received more than (in your case) 10 times without being deleted. It is NOT the job of your application to move the message to the Dead Letter Queue!
If you want to handle all the 'dead letter' processing yourself (eg moving to different queues), then turn off the DLQ functionality on the queue itself. This is probably causing your messages to disappear or go to the wrong location.
By the way, when deleting a message, you need to provide the ReceiptHandle of the message, not the message itself.
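If you keep the built-in DLQ behaviour, the redrive rules live on the queue itself. As a minimal boto3 sketch (the queue URL and DLQ ARN are hypothetical placeholders), configuring maxReceiveCount and deleting by receipt handle looks roughly like this:

import json

import boto3

sqs = boto3.client("sqs")

# Hypothetical placeholders; substitute your own queue URL and DLQ ARN.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/main-queue"
DLQ_ARN = "arn:aws:sqs:us-east-1:123456789012:main-queue-dlq"

# Let SQS do the dead-lettering: after 10 receives without a delete,
# the message is moved to the DLQ automatically.
sqs.set_queue_attributes(
    QueueUrl=QUEUE_URL,
    Attributes={
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": DLQ_ARN, "maxReceiveCount": "10"}
        )
    },
)

# Consumer: receive, process, then delete using the receipt handle.
response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
for message in response.get("Messages", []):
    print("processing:", message["Body"])  # your processing logic goes here
    sqs.delete_message(
        QueueUrl=QUEUE_URL,
        ReceiptHandle=message["ReceiptHandle"],  # the receipt handle, not the message
    )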

Related

SQS queue sometimes freezes

SQS sometimes stops receiving messages or allowing message consumption, then resumes after roughly 5 minutes. Do you know if there is a setting that can produce this behavior? I was playing around with the settings but could not change this behavior.
Note: when I send a message, I get back the message ID and a successful response as if it was received, but the message is not in the queue.
If you are getting an ID but the message is not in the queue, I believe you are using a FIFO queue, which ignores duplicate messages within the deduplication interval (5 minutes by default). Whatever is feeding the queue needs to use a proper deduplication ID if you want to process duplicate messages.
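For illustration, here is a minimal boto3 sketch of that deduplication behaviour (the FIFO queue URL is a hypothetical placeholder): sending the same body with the same deduplication ID twice within the 5-minute interval returns a message ID both times, but only one message is actually enqueued.

import boto3

sqs = boto3.client("sqs")

# Hypothetical placeholder FIFO queue URL.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders.fifo"

for _ in range(2):
    # The second send still returns a MessageId, but SQS silently drops
    # the duplicate because the deduplication ID repeats within 5 minutes.
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody='{"order": 42}',
        MessageGroupId="orders",
        MessageDeduplicationId="order-42-paid",  # vary this to keep both messages
    )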

How to ensure SQS FIFO is blocked while having a message in the corresponding deadletter queue

Imagine the following lifetime of an Order.
Order is Paid
Order is Approved
Order is Completed
We chose to use an SQS FIFO queue to ensure all these messages are processed in the order they are produced, so that, for example, an order's status changes to Approved only after it was Paid, and not after it has been Completed.
But let's say there is an error while trying to Approve an order, and after several attempts the message is moved to the dead letter queue.
The problem we noticed is that the subsequent message, "Order is Completed", is processed even though the previous message, "Order is Approved", is sitting in the dead letter queue.
How should we handle this?
Should we check the dead letter queue for messages with the same MessageGroupId as the one being consumed, assuming we could do this?
Is there a mechanism that we are missing?
Sounds to me like you are using a single queue for multiple types of events, where I would probably recommend (at least) three separate queues:
An order paid event queue
An order approved event queue
An order completed event queue
When an order payment comes in, an event is put into the first queue. Once your system has successfully processed that payment, it removes the item from the first queue (deletes the message) and then inserts an 'Order Approved' event into the second queue.
The process responsible for processing those events only watches that queue and does what it needs to do. Once complete, it deletes the message and inserts a third message into the third queue so that yet another process can see and act on that message - process it and then delete it.
If anything fails along the way, the message will eventually end up in a dead letter queue - either the same one, or one per queue, it makes no difference - but nothing that was supposed to happen AFTER the failed event would happen. A sketch of this hand-off follows below.
It doesn't even sound to me like you need a FIFO queue at all in this case, though there is no real harm in one (except for the slightly higher cost and lower throughput limits).
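As a minimal sketch of that hand-off (boto3, with hypothetical queue URLs; your SDK will differ), each worker consumes only its own queue and advances the order to the next queue only after a successful delete:

import boto3

sqs = boto3.client("sqs")

# Hypothetical placeholder queue URLs, one per event type.
PAID_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/order-paid"
APPROVED_QUEUE = "https://sqs.us-east-1.amazonaws.com/123456789012/order-approved"

def handle_paid_events():
    response = sqs.receive_message(QueueUrl=PAID_QUEUE, MaxNumberOfMessages=1)
    for message in response.get("Messages", []):
        print("processing payment:", message["Body"])  # your logic here
        # Only on success: remove from this queue, then advance the order.
        sqs.delete_message(
            QueueUrl=PAID_QUEUE, ReceiptHandle=message["ReceiptHandle"]
        )
        sqs.send_message(QueueUrl=APPROVED_QUEUE, MessageBody=message["Body"])

If processing fails, nothing is deleted and nothing is sent onwards, so the 'Completed' stage can never run ahead of a failed 'Approved' stage.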
Source from AWS https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html:
Don't use a dead-letter queue with a FIFO queue if you don't want to break the exact order of messages or operations. For example, don't use a dead-letter queue with instructions in an Edit Decision List (EDL) for a video editing suite, where changing the order of edits changes the context of subsequent edits.

Process messages from Amazon SQS Dead Letter Queue

I want to process messages from an Amazon SQS Dead Letter Queue.
What is the best way to process them?
Receive messages from the dead letter queue and process them directly.
Receive messages from the dead letter queue, put them back in the main queue, and then process them?
I just need to process messages from dead letter queue once in a while.
After careful consideration of the various options, I am going with option 2 that you mentioned: "Receive messages from dead letter queue, put back in main queue and then process it".
Make sure that while transferring the messages from one queue to the other, no messages are lost.
Before putting messages from the DLQ back onto the main queue, make sure that the errors faced in the main listener (mainly coding errors, if any) are resolved, and that any network issues are resolved.
The listener of the main queue has already retried the message, and it is now being retried again. So please make sure to skip the already successful steps of message processing when a message is being retried, and to revert successfully processed steps in case of any errors. (This will help with the message retry as well.)
DLQ is meant for unexpected errors, so you may have an on-demand job for doing this.
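A minimal sketch of such an on-demand redrive job, assuming boto3 and hypothetical queue URLs: send to the main queue first and delete from the DLQ only after the send succeeds, so a crash in between produces a duplicate (which can be retried) rather than a lost message.

import boto3

sqs = boto3.client("sqs")

# Hypothetical placeholder queue URLs.
DLQ_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/main-queue-dlq"
MAIN_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/main-queue"

def redrive_batch():
    # Move up to 10 messages from the DLQ back to the main queue.
    response = sqs.receive_message(QueueUrl=DLQ_URL, MaxNumberOfMessages=10)
    for message in response.get("Messages", []):
        # Send first, delete second: failing in between duplicates the
        # message instead of losing it.
        sqs.send_message(QueueUrl=MAIN_URL, MessageBody=message["Body"])
        sqs.delete_message(
            QueueUrl=DLQ_URL, ReceiptHandle=message["ReceiptHandle"]
        )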
Presumably the message ended up in the Dead Letter Queue for a reason, after failing several times.
It would not be a good idea to put it back in the main queue because, presumably, it would fail again and you would create an infinite loop.
Initially, dead messages should be examined manually to determine the causes of failure. Then, based on this information, an alternate flow could be developed.

MFC Message Queue Limit

My understanding of the size limit on the message queue in a MFC thread comes from the explanation on PostThreadMessage page of MSDN.
https://msdn.microsoft.com/en-us/library/windows/desktop/ms644946%28v=vs.85%29.aspx
As stated, the limit by default is 10000 messages. I am trying to understand exactly what this limit is. I see it being one of two things.
Scenario A
I have a GUI that is handling messages. The rate at which the messages are being placed in the queue is greater than that at which these messages are being pulled off the queue and handled. In this case messages accumulate, eventually there are 10000 messages on the queue, another message tries to join the queue, but it then fails.
Scenario B
I have a GUI that is handling messages. The rate at which messages are being placed in the queue is less than the rate at which these messages are being pulled off the queue and handled. Messages do not accumulate on the queue. But after my queue has seen 10000 messages, it is rendered useless, so effectively my message queue has a limited operational life.
The more I think about it, the answer should be Scenario A... but stranger things have happened.
From the linked article: GetLastError returns ERROR_NOT_ENOUGH_QUOTA when the message limit is hit. So every attempt to send or post a message while the queue is full fails (Scenario A), that's all.
Generally, the destination thread handles the messages and removes them from the queue. PeekMessage with the PM_NOREMOVE flag allows handling a message without removing it. For reference, the PeekMessage function: https://msdn.microsoft.com/en-us/library/windows/desktop/ms644943%28v=vs.85%29.aspx
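To illustrate the quota behaviour in code (a sketch in Python via ctypes, Windows only; ERROR_NOT_ENOUGH_QUOTA and the PostThreadMessageW signature come from the Win32 API, while the helper itself is hypothetical):

import ctypes

user32 = ctypes.WinDLL("user32", use_last_error=True)  # Windows only

ERROR_NOT_ENOUGH_QUOTA = 1816  # from winerror.h
WM_APP = 0x8000

def post_to_thread(thread_id, msg=WM_APP, wparam=0, lparam=0):
    # PostThreadMessageW returns 0 on failure. When the target thread's
    # queue already holds 10000 messages (Scenario A), the post fails
    # with ERROR_NOT_ENOUGH_QUOTA; the queue itself keeps working and
    # drains as the thread pumps messages.
    if user32.PostThreadMessageW(thread_id, msg, wparam, lparam):
        return True
    if ctypes.get_last_error() == ERROR_NOT_ENOUGH_QUOTA:
        return False  # queue full; caller can back off and retry
    raise ctypes.WinError(ctypes.get_last_error())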

Why do SQS messages sometimes remain in-flight on the queue

I'm using Amazon SQS queues in a very simple way. Usually, messages are written and immediately visible and read. Occasionally, a message is written, and remains In-Flight(Not Visible) on the queue for several minutes. I can see it from the console. Receive-message-wait time is 0, and Default Visibility is 5 seconds. It will remain that way for several minutes, or until a new message gets written that somehow releases it. A few seconds delay is ok, but more than 60 seconds is not ok.
There are 8 reader threads that are always long polling, so it's not that nothing is trying to read it; they are.
Edit: To be clear, none of the consumer reads are returning any messages at all, and it happens regardless of whether or not the console is open. In this scenario, only one message is involved, and it is just sitting in the queue, invisible to the consumers.
Has anyone else seen this behavior and what I can do to improve it?
Here is the sdk for java I am using:
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk</artifactId>
    <version>1.5.2</version>
</dependency>
Here is the code that does the reading (max=10,maxwait=0 startup config):
void read(MessageConsumer consumer) {
    List<Message> messages = read(max, maxWait);
    for (Message message : messages) {
        if (tryConsume(consumer, message)) {
            delete(message.getReceiptHandle());
        }
    }
}

private List<Message> read(int max, int maxWait) {
    AmazonSQS sqs = getClient();
    ReceiveMessageRequest rq = new ReceiveMessageRequest(queueUrl);
    rq.setMaxNumberOfMessages(max);
    rq.setWaitTimeSeconds(maxWait);
    List<Message> messages = sqs.receiveMessage(rq).getMessages();
    if (messages.size() > 0) {
        LOG.info("read {} messages from SQS queue", messages.size());
    }
    return messages;
}
The log line for "read ..." never appears when this is happening, and it's what causes me to go into the console and see whether the message is there or not, and it is.
It sounds like you are misinterpreting what you are seeing.
Messages "in flight" are not pending delivery, they're messages that have already been delivered but not further acted on by the consumer.
Messages are considered to be in flight if they have been sent to a client but have not yet been deleted or have not yet reached the end of their visibility window.
— https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-available-cloudwatch-metrics.html
When a consumer receives a message, it has to -- at some point -- either delete the message, or send a request to increase the timeout for that message; otherwise the message becomes visible again after the timeout expires. If a consumer fails to do one of these things, the message automatically becomes visible again. The visibility timeout is how long the consumer has before one of these things must be done.
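In boto3 terms (a sketch; the queue URL is a hypothetical placeholder), those two options for a consumer holding a message look like this:

import boto3

sqs = boto3.client("sqs")

# Hypothetical placeholder queue URL.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

response = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1)
for message in response.get("Messages", []):
    handle = message["ReceiptHandle"]
    # Option 1: still working? Extend the visibility timeout so the
    # message does not reappear mid-processing.
    sqs.change_message_visibility(
        QueueUrl=QUEUE_URL, ReceiptHandle=handle, VisibilityTimeout=60
    )
    # ... process the message ...
    # Option 2: done? Delete it; otherwise it becomes visible again
    # when the timeout expires.
    sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=handle)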
Messages should not be "in flight" without something having already received them -- but that "something" can include the console itself, as you'll note on the pop-up you see when you choose "View/Delete Messages" in the console (unless you already checked the "Don't show this again" checkbox):
Messages displayed in the console will not be available to other applications until the console stops polling for messages.
Messages displayed in the console are "in flight" while the console is observing the queue from the "View/Delete Messages" screen.
The part that does not make obvious sense is messages being in flight "for several minutes" when your default visibility timeout is only 5 seconds and nothing in your code is increasing that timeout. However, that could be explained almost perfectly by your consumers not properly disposing of the message, causing it to time out and immediately be redelivered. That would give the impression that a single instance of the message was remaining in-flight, when in fact the message is briefly transitioning back to visible, only to be claimed almost immediately by another consumer, taking it back to in-flight again.
It may happen when you send or lock a message and, within a few seconds, try to fetch a fresh list of messages. Amazon SQS stores the data on multiple servers and in multiple data centers (http://aws.amazon.com/sqs/faqs/#How_reliably_is_my_data_stored_in_Amazon_SQS), so a read shortly after a write can miss the message.
To get around these issues, wait a bit longer so that the queue has more time to return the appropriate results.