According to the documentation for SQS (emphasis mine):
Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers storing a copy of a message might be unavailable when you receive or delete the message. If that occurs, the copy of the message will not be deleted on that unavailable server, and you might get that message copy again when you receive messages. Because of this, you must design your application to be idempotent (i.e., it must not be adversely affected if it processes the same message more than once).
What time period can reasonably occur between the original and duplicate messages being received? (seconds? hours? months?)
I have no specific proof or link to show you, but in my experience working with SQS you are talking about a range of time that is under a few minutes in most cases. The possibility of a duplicate message happening will be because of activity that took place on the message during the very small lag of time as the message is replicated via very high speed connections to redundant queues within the AWS infrastructure, so in other words, very quickly. It is also likely going to be affected by the visibility timeouts you have specified.
Related
I understand that standard SQS uses "at least once" delivery while FIFO messages are delivered exactly once. I'm trying to weigh standard queues vs FIFO for my application, and one factor is how long it takes for the duplicated message to arrive.
I intend to consume messages from SQS then post the data I received to an idempotent third-party API. I understand that with standard SQS, there's always a risk of me overwriting more recent data with the old duplicated data.
For example:
Message A arrives, I post it onwards.
Message A duplicate arrives, I post it onwards.
Message B arrives, I post it onwards.
All fine ✓
On the other hand:
Message A arrives, I post it onwards.
Message B arrives, I post it onwards.
Message A duplicate arrives - I post it and overwrite the latest data, which was B! ✖
I want to measure this risk, i.e. I want to know how long the duplicate message should take to arrive. Will the duplicate message take roughly the same amount of time to arrive, as the original message?
Maybe it's useful to understand how message duplication occurs. As far as I know this isn't documented in the official docs, but instead it's my mental model of how it works. This is an educated guess.
Whenever you send a message to SQS (SendMessage API), this message arrives at the SQS webservice endpoint, which is one of probably thousands of servers. This endpoint receives your message, duplicates it one or more times and stores these duplicates on more than one SQS server. After it has received confirmation from at least two SQS servers, it acknowledges to the client that the message has been received.
When you call the ReceiveMessage API only a subset of the SQS servers that handle your queue are queried for messages. When a message is returned, these servers communicate to their peers, that this message is currently in-flight and the visibility timeout starts. This doesn't happen instantaneously, as it's a distributed system. While this ReceiveMessage call takes place another consumer might also do a ReceiveMessage call and happen to query one of the servers that have a replica of the message, before it's marked as in-flight. That server hands out the message and now you have to consumers working on it.
This is just one scenario, which is the result of this being a distributed system.
There are a couple of edge cases that can happen as the result of network issues, e.g. when the SQS response to the initial SendMessage gets lost and the client thinks the message didn't arrive and sends it again - poof, you got another duplicate.
The point being: things fail in weird and complex ways. That makes measuring the risk of a delayed message difficult. If your use case can't handle duplicate and out of order messages, you should go for FIFO, but that will inherently limit your throughput. Alternatives are based on distributed locking mechanisms and keeping track of which messages you have already processed, which are complex tools to solve a complex problem.
I've set up an S3 bucket to emit an event on PUT object to SQS, and I'm handling the SQS queue in an EB worker tier.
The schema for the message that SQS sends is here: http://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html
Records is an array, implying that there can be multiple records sent in one POST to my worker's endpoint. Does this actually happen? Or will my worker only ever receive one record per message?
The worker can only return one response, either 200 (message handled successfully) or non-200 (message not handled successfully, which puts it back into the queue), regardless of how many records in the message it receives.
So if my worker receives multiple records in a message, and it handles some successfully (say by doing something with side effects such as inserting into a database) but fails on one or more, how should I handle that? If I return 200, then the ones that failed will not be retried. But if I return non-200, then the ones that were handled successfully will be retried unnecessarily, and possibly re-inserted. So I'd have to make my worker smart enough to retry only the failed ones -- which is logic I'd prefer not having to write.
This would be much easier if only one record was ever sent per message. So if that's the case in practice, despite records being an array, I'd really like to know!
To be clear, it's not the records that "SQS sends." It's the records that S3 sends to SQS (or to SNS, or to Lambda).
Currently, all S3 event notifications have a single event per notification message. We might include multiple records as we add new event types in the future. This is also a message format that is shared across other AWS services, and other services can include multiple records.
— https://forums.aws.amazon.com/thread.jspa?messageID=592264򐦈
So, for the moment, it appears there's only one record per message.
But... you are making a mistake if you assume your application need not be prepared to handle repeated or duplicate messages. In any massive and distributed system like SQS it is extremely difficult to absolutely guarantee that this can never happen, however unlikely:
Q: How many times will I receive each message?
Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies.
— http://aws.amazon.com/sqs/faqs/
Incidentally, in my platform, more than one entry in the records array is considered an error, causing the message to be abandoned and sent to the dead letter queue for review.
According to the documentation for SQS (emphasis mine):
Amazon SQS stores copies of your messages on multiple servers for redundancy and high availability. On rare occasions, one of the servers storing a copy of a message might be unavailable when you receive or delete the message. If that occurs, the copy of the message will not be deleted on that unavailable server, and you might get that message copy again when you receive messages. Because of this, you must design your application to be idempotent (i.e., it must not be adversely affected if it processes the same message more than once).
Are these duplicate messages guaranteed to have the same (SQS) message ID?
Yes, they will have the same IDs, so if you are trying to keep track of what has already been processed, keeping a list (in a cache or db) of what you have done 'recently' should work fine.
I understand the concept of delay queue of Amazon SQS, but I wonder why it is useful.
What's the usage of SQS delay queue?
Thanks
One use case which i can think of is usage in distributed applications which have eventual consistency semantics. The system consuming the message may have an dependency like a co-relation identifier to be available and hence may need to wait for certain guaranteed duration of time before seeing the co-relation data. In this case, it makes sense for the message to be delayed for certain duration of time.
Like you I was confused as to a use-case for delay queues, until I stumbled across one in my own work. My application needs to have an internal queue with each item waiting at least one minute between each check for completion.
So instead of having to manage a "last-checked-time" on every object, I just shove the object's ID into an SQS queue messagewith a delay time of 60 seconds, and my main loop then becomes a simple long-poll against the queue.
A few off the top of my head:
Emails - Let's say you have a service that sends reminder emails triggered from queue messages. You'd have to delay enqueueing the message in that case.
Race conditions - Delivery delays can be used to overcome race conditions in distributed systems. For example, a service could insert a row into a table, and sends a message about its availability to other services. They can't use the new entry just yet, so you have to delay publishing the SQS message.
Handling retries - Sometimes if a message fails you want to retry with exponential backoffs. This requires re-enqueuing the message with longer delays.
I've built a suite of API's to make queue message scheduling easy. You can call our API's to schedule queue messages, cancel, edit, and check on the status of such messages. Think of it like a scheduler microservice.
www.schedulerapi.com
If you are looking for a solution, let me know. I've built these schedulers before at work for delivering emails at high scale, so I have experience with similar use cases.
One use-case can be:
Think of a time critical expression like a scheduled equity trade order.
If one of your system is fetching all the order scheduled in next 60 minutes and putting them in queue (which will be fetched by another sub system).
If you send these order directly, then they will be visible immediately to process in queue and will be processed depending upon their order.
But most likely, they will not execute in exact time (Hour:Minute:Seconds) in which Customer wanted and this will impact the outcome.
So to solve this, what first sub system will do, it will add delay seconds (difference between current and execution time) so message will only be visible after that much delay or at exact time when user wanted.
I am new to Amazon Web Services and am currently trying to get my head around how Simple Queue Service (SQS) works.
In the link ReceiveMessage the following is mentioned:
Short poll is the default behavior where a weighted random set of
machines is sampled on a ReceiveMessage call. This means only the
messages on the sampled machines are returned. If the number of
messages in the queue is small (less than 1000), it is likely you will
get fewer messages than you requested per ReceiveMessage call. If the
number of messages in the queue is extremely small, you might not
receive any messages in a particular ReceiveMessage response; in which
case you should repeat the request.
What I understand there is one queue and many machines/instances can read the messages. What is not clear to me is what does "weighted random set of machines" means? Is there more than one queue on a number of machines? Clearly I am lacking some knowledge on on SQS works.
I believe what this means is that because SQS is geographically distributed, not all of the machines (amazon's servers that have your queue) will have the exact same queue content at all times because they won't always be in sync with each other at every instant.
You don't know or control from which of amazons servers it will serve messages from, it uses an algorithm to figure out which messages are sent to you when you request some. That is why you don't always get messages when you ask for them, and occasionally the same message will get served up more than once; you need to make sure whatever your processing entails it can deal with the possibility that it is processing something that has already been processed by another of your worker machines.