AWS SQS - when will the duplicated message arrive? - amazon-web-services

I understand that standard SQS uses "at least once" delivery while FIFO messages are delivered exactly once. I'm trying to weigh standard queues vs FIFO for my application, and one factor is how long it takes for the duplicated message to arrive.
I intend to consume messages from SQS then post the data I received to an idempotent third-party API. I understand that with standard SQS, there's always a risk of me overwriting more recent data with the old duplicated data.
For example:
Message A arrives, I post it onwards.
Message A duplicate arrives, I post it onwards.
Message B arrives, I post it onwards.
All fine ✓
On the other hand:
Message A arrives, I post it onwards.
Message B arrives, I post it onwards.
Message A duplicate arrives - I post it and overwrite the latest data, which was B! ✖
I want to measure this risk, i.e. I want to know how long the duplicate message should take to arrive. Will the duplicate message take roughly the same amount of time to arrive, as the original message?

Maybe it's useful to understand how message duplication occurs. As far as I know this isn't documented in the official docs, but instead it's my mental model of how it works. This is an educated guess.
Whenever you send a message to SQS (SendMessage API), this message arrives at the SQS webservice endpoint, which is one of probably thousands of servers. This endpoint receives your message, duplicates it one or more times and stores these duplicates on more than one SQS server. After it has received confirmation from at least two SQS servers, it acknowledges to the client that the message has been received.
When you call the ReceiveMessage API only a subset of the SQS servers that handle your queue are queried for messages. When a message is returned, these servers communicate to their peers, that this message is currently in-flight and the visibility timeout starts. This doesn't happen instantaneously, as it's a distributed system. While this ReceiveMessage call takes place another consumer might also do a ReceiveMessage call and happen to query one of the servers that have a replica of the message, before it's marked as in-flight. That server hands out the message and now you have to consumers working on it.
This is just one scenario, which is the result of this being a distributed system.
There are a couple of edge cases that can happen as the result of network issues, e.g. when the SQS response to the initial SendMessage gets lost and the client thinks the message didn't arrive and sends it again - poof, you got another duplicate.
The point being: things fail in weird and complex ways. That makes measuring the risk of a delayed message difficult. If your use case can't handle duplicate and out of order messages, you should go for FIFO, but that will inherently limit your throughput. Alternatives are based on distributed locking mechanisms and keeping track of which messages you have already processed, which are complex tools to solve a complex problem.

Related

Google PubSub Python multiple subscriber clients receiving duplicate messages

I have a pretty straightforward app that starts a PubSub subscriber StreamingPull client. I have this deployed on Kubernetes so I can scale. When I have a single pod deployed, everything works as expected. When I scale to 2 containers, I start getting duplicate messages. I know that some small of duplicate messages is to be expected, but almost half the messages, sometimes more, are received multiple times.
My process takes about 600ms to process a message. The subscription acknowledgement deadline is set to 600s. I published 1000 messages, and the subscription was emptied in less than a minute, but the acknowledge_message_operation metric shows ~1500 calls, with a small amount with response_code expired. There were no failures in my process and all messages were acked upon processing. Logs show that the same message was received by the two containers at the exact same time. The minute to process all the messages was well below the acknowledgement deadline of the subscription, and the Python client is supposed to handle lease management, so I'm not sure why there were any expired messages at all. I also don't understand why the same message is sent to multiple subscriber clients at the same time.
Minimal working example:
import time
from google.cloud import pubsub_v1
PROJECT_ID = 'my-project'
PUBSUB_TOPIC_ID = 'duplicate-test'
PUBSUB_SUBSCRIPTION_ID = 'duplicate-test'
def subscribe(sleep_time=None):
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
PROJECT_ID, PUBSUB_SUBSCRIPTION_ID)
def callback(message):
print(message.data.decode())
if sleep_time:
time.sleep(sleep_time)
print(f'acking {message.data.decode()}')
message.ack()
future = subscriber.subscribe(
subscription_path, callback=callback)
print(f'Listening for messages on {subscription_path}')
future.result()
def publish(num_messages):
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, PUBSUB_TOPIC_ID)
for i in range(num_messages):
publisher.publish(topic_path, str(i).encode())
In two terminals, run subscribe(1). In a third terminal, run publish(200). For me, this will give duplicates in the two subscriber terminals.
It is unusual for two subscribers to get the same message at the same time unless:
The message got published twice due to a retry (and therefore as far as Cloud Pub/Sub is concerned, there are two messages). In this case, the content of the two messages would be the same, but their message IDs would be different. Therefore, it might be worth ensuring that you are looking at the service-provided message ID to ensure the messages are indeed duplicates.
The subscribers are on different subscriptions, which means each of the subscribers would receive all of the messages.
If neither of these is the case, then duplicates should be relatively rare. There is an edge case in dealing with large backlogs of small messages with streaming pull (which is what the Python client library uses). Basically, if messages that are very small are published in a burst and subscribers then consume that burst, it is possible to see the behavior you are seeing. All of the messages would end up being sent to one of the two subscribers and would be buffered behind the flow control limits of the number of outstanding messages. These messages may exceed their ack deadline, resulting in redelivery, likely to the other subscriber. The first subscriber still has these messages in its buffer and will see these messages, too.
However, if you are consistently seeing two subscribers freshly started immediately receive the same messages with the same message IDs, then you should contact Google Cloud support with your project name, subscription name, and a sample of the message IDs. They will better be able to investigate why this immediate duplication is happening.
(Edited as I misread the deadlines)
Looking at the Streaming Pull docs, this seems like an expected behavior:
The gRPC StreamingPull stack is optimized for high throughput and therefore
buffers messages. This can have some consequences if you are attempting to
process large backlogs of small messages (rather than a steady stream of new
messages). Under these conditions, you may see messages delivered multiple times
and they may not be load balanced effectively across clients.
From: https://cloud.google.com/pubsub/docs/pull#streamingpull

SQS Queues/ Visibility Timeouts/ message groups

I am new to AWS. I am trying to understand SQS here. I have gone over a few trainings also but I still could not get some answers there in the discussion forum. I am re-iterating my question here. Note that I know that a few questions below have obvious answers and are therefore more of a rhetoric. My confusion stems from the fact that my understanding of the topic at present leads me to give conflicting answers to the follow up questions that spring up in my mind after the obvious known ones and takes away the confidence of whatever I think I understand alright.
If I have a Standard queue named MyQueue and there are 100 messages, and if there are 2 completely separate applications (as consumers; note they are not a consumer group of the same applications like you have in Kafka; instead they are 2 separate applications) for this queue, then the consumers may receive
(i) out of order messages and
(ii) multiple copies of the messages
Both of my applications do not need to bother about the order of the messages. But for the sake of the question lets say we have a perfect order of delivery, no multiple copies and no network issues and both consumers finish their processing if each message well within the Visibility Timeout window.
Q1: Will both the applications individually receive 100 messages each or will a message that is made available to one consumer won't ever be delivered to the other consumer? If the latter is true ( with no network issues, out of order delivery, multiple deliveries), then:
Is SNS-SQS fanout the way to ensure that the same message is processed by multiple consumers?
Is the consumer supposed to delete the message from the queue after processing? So, if a message is picked up by a processor, and it goes into visibility timeout while the processing happens and then is not deleted by the consumer even after the processing is complete before the visibility timeout, then will the message appear back for other consumers possibly to consume it? If that is the case, then won't the same thing apply to a FIFO queue as well?
Other Questions:
Q2: Is the Visibility timeout applicable to both Standard Queue and FIFO Queue? If it is also applicable to FIFO Queue which promises exactly once delivery, then, if the Visibility Timeout appears before the consumer ends processing a message, then it reappears in the queue only to be delivered again thereby going back to at least once processing. Can someone confirm?
Q3: What are multiple message Groups within a FIFO Queue? Are they like partitions of a queue?
Q: Will both the applications individually receive 100 messages each?
A consumer can request up to 10 messages per API call. These will become 'invisible' and will not be provided to other consumers. (Well, there actually is a small possibility that a message might be provided to multiple consumers. It is rare, but it can happen. If this is bad for your use-case, then you should track the messages in a database to ensure they are only processed once each.)
Q: Is SNS-SQS fanout the way to ensure that the same message is processed by multiple consumers?
It is very strange to want to want a single message consumed by 'multiple consumers'. The normal desire is to process each message once. If you do want a message processed by multiple consumers then, yes, you could send the message to SNS, which could then send it to multiple queues.
Q:Is the consumer supposed to delete the message from the queue after processing?
Yes. Amazon SQS does not know when a message is processed. The consumer must delete the message via the ReceiptHandle provided when the message was received. If a message times-out and another consumer receives it, SQS will provide a different ReceiptHandle so it knows which process requested the delete.
This also applies to FIFO queues.
Q: Is the Visibility timeout applicable to both Standard Queue and FIFO Queue?
Yes. If the visibility timeout expires, the message will be provided to another consumer. The "exactly once delivery" avoids the rare situation mentioned above when a message in a Standard queue might be provided more than once. However, if visibility times-out, even in a FIFO queue, then it will intentionally be visible on the queue again.
Q: What are multiple message Groups within a FIFO Queue? Are they like partitions of a queue?
A message group is a way of grouping messages that must be delivered in-order.
Let's say there are two message groups, A and B, and they send messages in this order: A1, B1, A2, B2
Message B1 can be provided even if A1 is not yet deleted. However, message A2 will not be provided until A1 is deleted. Think of them as 'mini-queues'. This allows processing of lots of messages are are unrelated, without having to wait for all previous messages to be deleted.
See: Using the Amazon SQS Message Group ID - Amazon Simple Queue Service
Q1: Will both the applications individually receive 100 messages each or will a message that is made available to one consumer won't ever be delivered to the other consumer?
Neither of these is quite accurate.
Standard queues never intentionally deliver a message more than once. It is possible that messages may occasionally be delivered more than once -- but this is the exception and is an artifact of the fact that SQS is a distributed system and situations could arise where, for example, the queue had a message stored in multiple replicas and the fact that a message was not known to all replicas due to an internal failure.
If a message is inadvertently delivered more than once, it could be to multiple consumers or the same consumer. The consumer "connections" to SQS are actually stateless, resetting each time a list of messages is delivered, so SQS does not have a sense of which consumer it delivered each message to.
Consumers delete their messages after processing, otherwise their visibilitt timeout expires and they are delivered again and again -- to whichever consumer the luck of the draw delivers them to, each time. As noted, SQS has no concept of consumer identity or state. (In high volume applications, a single consumer may actually have multiple connections to SQS, all receiving messages in parallel, because the network round-trips and cycle of receive/delete will otherwise limit a single consumer to a few hundred messages per second. Whether these connections are handled using asynchronous I/O, threads, etc., is unimportant to SQS, which doesn't care which consumer is on a given connection.)
If you want all messages sent to all consumers, you need fan-out from SNS to SQS.
Q2: Is the Visibility timeout applicable to both Standard Queue and FIFO Queue?
Yes. Because (noted above) the connection to SQS is not a persistent, stateful connection, SQS uses visibility timeout as the indication that a consumer has lost the message or failed ungracefully, so the message needs to be made accessible again. (Dead letter queues prevent this from happening endlessly, moving a message to a different queue, since repeated failures indicate a problem with a consumer, or a "poison pill" message.)
FIFO queues retain in-order delivery, here, and you could argue that they revert to "at least once" delivery, but the idea is that this should never happen. If it does, then your visibility timeout is too short or your consumer is crashing or otherwise misplacing messages.
Q3: What are multiple message Groups within a FIFO Queue?
Message groups allow FIFO queues to support in-order, parallel processing of groups messages whose ordering relative to each other across group boundaries doesn't matter. Messages are delivered in order, within each group.
If a FIFO queue, if all messages are sent with the same group ID, then only one consumer can be working at a time.
In-order delivery (simple illustration) means that message 2 will not be delivered to any consumer until message 1 has been received and deleted -- finished -- by a consumer. In order delivery includes all processing (not merely the initial "delivery"). Or if 20 messages in the queue have the same group ID and two consumers request 10 messages each, one consumer gets 10 and the other gets nothing -- yet -- because those second 10 messages have to be sequestered, until the first 10 have been processed (else we are no longer "in order").
In the 20 messages scenario, if 14 were in group A and 6 were in group B, one consumer would receive A1-A10, A11-A14 would be sequestered until A1-A10 were complete, but while the first consumer is busy, another consumer could have B1-B6 at the same time.
Note again that there is no consumer affinity. If A1-A10 and B1-B6 were deleted at the same instant, A11-A14 would next be delivered to one consumer, but not necessarily the one that handled A1-A10.

AWS SQS FIFO - How to get more than 10 messages at a time?

Currently we want to pull down an entire FIFO queue, and process the contents, and if any issues, release messages back into the queue.
The problem is, that currently AWS only gives us 10 messages, and won't give us 10 more (which is the way you get bulk messages in SQS, multiple 10 max message requests) until we delete or release the first 10.
We need to get more than 10 though. Is this not possible? We understand we can set the group_id to a random string, and that allows processing more, but then the order isn't guaranteed, which defeats the purpose of FIFO.
I managed to reproduce your results -- I could retrieve 10 messages, but then running the same command again would not return another set of messages.
The relevant documentation seems to be:
While messages with a particular MessageGroupId are invisible, no more messages belonging to the same MessageGroupId are returned until the visibility timeout expires. You can still receive messages with another MessageGroupId as long as it is also visible.
I suspect (just a theory!) that this is to preserve the ordering of messages... If a client asked for a set of messages and they are still being processed, there is the chance that the messages might be returned to the queue. Therefore, no further messages are provided until the original messages are deleted or pass their visibility timeout.
This is only a behaviour of FIFO queues.
It seems that you will need to receive and delete all messages to be able to access them all. I would suggest:
Receive one (or more) message.
Process it. If everything worked, delete the message.
If there were problems, push the message to a new queue.
Once the queue is empty, you would need to read from the new queue and send them back to the original queue (which should preserve ordering).
If you frequently require more capabilities that Amazon SQS provides, you could consider using Amazon MQ – Managed message broker service for ActiveMQ. It has many more capabilities (but is accordingly less 'simple').
If you set another MessageGroupId, you can get another 10 messages, even you don't release or delete the previous ones.
https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/using-messagegroupid-property.html

Does SQS really send multiple S3 PUT object records per message?

I've set up an S3 bucket to emit an event on PUT object to SQS, and I'm handling the SQS queue in an EB worker tier.
The schema for the message that SQS sends is here: http://docs.aws.amazon.com/AmazonS3/latest/dev/notification-content-structure.html
Records is an array, implying that there can be multiple records sent in one POST to my worker's endpoint. Does this actually happen? Or will my worker only ever receive one record per message?
The worker can only return one response, either 200 (message handled successfully) or non-200 (message not handled successfully, which puts it back into the queue), regardless of how many records in the message it receives.
So if my worker receives multiple records in a message, and it handles some successfully (say by doing something with side effects such as inserting into a database) but fails on one or more, how should I handle that? If I return 200, then the ones that failed will not be retried. But if I return non-200, then the ones that were handled successfully will be retried unnecessarily, and possibly re-inserted. So I'd have to make my worker smart enough to retry only the failed ones -- which is logic I'd prefer not having to write.
This would be much easier if only one record was ever sent per message. So if that's the case in practice, despite records being an array, I'd really like to know!
To be clear, it's not the records that "SQS sends." It's the records that S3 sends to SQS (or to SNS, or to Lambda).
Currently, all S3 event notifications have a single event per notification message. We might include multiple records as we add new event types in the future. This is also a message format that is shared across other AWS services, and other services can include multiple records.
— https://forums.aws.amazon.com/thread.jspa?messageID=592264&#592264
So, for the moment, it appears there's only one record per message.
But... you are making a mistake if you assume your application need not be prepared to handle repeated or duplicate messages. In any massive and distributed system like SQS it is extremely difficult to absolutely guarantee that this can never happen, however unlikely:
Q: How many times will I receive each message?
Amazon SQS is engineered to provide “at least once” delivery of all messages in its queues. Although most of the time each message will be delivered to your application exactly once, you should design your system so that processing a message more than once does not create any errors or inconsistencies.
— http://aws.amazon.com/sqs/faqs/
Incidentally, in my platform, more than one entry in the records array is considered an error, causing the message to be abandoned and sent to the dead letter queue for review.

sampled machines when using queues

I am new to Amazon Web Services and am currently trying to get my head around how Simple Queue Service (SQS) works.
In the link ReceiveMessage the following is mentioned:
Short poll is the default behavior where a weighted random set of
machines is sampled on a ReceiveMessage call. This means only the
messages on the sampled machines are returned. If the number of
messages in the queue is small (less than 1000), it is likely you will
get fewer messages than you requested per ReceiveMessage call. If the
number of messages in the queue is extremely small, you might not
receive any messages in a particular ReceiveMessage response; in which
case you should repeat the request.
What I understand there is one queue and many machines/instances can read the messages. What is not clear to me is what does "weighted random set of machines" means? Is there more than one queue on a number of machines? Clearly I am lacking some knowledge on on SQS works.
I believe what this means is that because SQS is geographically distributed, not all of the machines (amazon's servers that have your queue) will have the exact same queue content at all times because they won't always be in sync with each other at every instant.
You don't know or control from which of amazons servers it will serve messages from, it uses an algorithm to figure out which messages are sent to you when you request some. That is why you don't always get messages when you ask for them, and occasionally the same message will get served up more than once; you need to make sure whatever your processing entails it can deal with the possibility that it is processing something that has already been processed by another of your worker machines.