Google PubSub Python multiple subscriber clients receiving duplicate messages - google-cloud-platform

I have a pretty straightforward app that starts a Pub/Sub StreamingPull subscriber client. I have this deployed on Kubernetes so I can scale. When I have a single pod deployed, everything works as expected. When I scale to 2 containers, I start getting duplicate messages. I know that a small number of duplicate messages is to be expected, but almost half the messages, sometimes more, are received multiple times.
My process takes about 600ms to process a message. The subscription acknowledgement deadline is set to 600s. I published 1000 messages, and the subscription was emptied in less than a minute, but the acknowledge_message_operation metric shows ~1500 calls, a small number of which have response_code expired. There were no failures in my process and all messages were acked upon processing. Logs show that the same message was received by the two containers at the exact same time. The minute it took to process all the messages was well below the acknowledgement deadline of the subscription, and the Python client is supposed to handle lease management, so I'm not sure why there were any expired messages at all. I also don't understand why the same message is sent to multiple subscriber clients at the same time.
Minimal working example:
import time

from google.cloud import pubsub_v1

PROJECT_ID = 'my-project'
PUBSUB_TOPIC_ID = 'duplicate-test'
PUBSUB_SUBSCRIPTION_ID = 'duplicate-test'


def subscribe(sleep_time=None):
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(
        PROJECT_ID, PUBSUB_SUBSCRIPTION_ID)

    def callback(message):
        print(message.data.decode())
        if sleep_time:
            time.sleep(sleep_time)
        print(f'acking {message.data.decode()}')
        message.ack()

    future = subscriber.subscribe(
        subscription_path, callback=callback)
    print(f'Listening for messages on {subscription_path}')
    future.result()


def publish(num_messages):
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, PUBSUB_TOPIC_ID)
    for i in range(num_messages):
        publisher.publish(topic_path, str(i).encode())
In two terminals, run subscribe(1). In a third terminal, run publish(200). For me, this will give duplicates in the two subscriber terminals.

It is unusual for two subscribers to get the same message at the same time unless:
1. The message got published twice due to a retry (and therefore, as far as Cloud Pub/Sub is concerned, there are two messages). In this case, the content of the two messages would be the same, but their message IDs would be different. Therefore, it is worth checking the service-provided message ID to confirm that the messages are indeed duplicates (see the sketch after this list).
2. The subscribers are on different subscriptions, which means each of the subscribers would receive all of the messages.
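To check which case applies, the callback from the example above can print the service-assigned message ID next to the payload. A minimal sketch, reusing the names from the example:

def callback(message):
    # Redeliveries of one message keep the same service-assigned message_id;
    # a message that was published twice gets two different message_ids.
    print(f'{message.message_id}: {message.data.decode()}')
    message.ack()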
If neither of these is the case, then duplicates should be relatively rare. There is an edge case in dealing with large backlogs of small messages with streaming pull (which is what the Python client library uses). Basically, if very small messages are published in a burst and subscribers then consume that burst, it is possible to see the behavior you are seeing: all of the messages end up being sent to one of the two subscribers and are buffered in that client behind the flow control limit on the number of outstanding messages. These buffered messages may exceed their ack deadline, resulting in redelivery, likely to the other subscriber. The first subscriber still has these messages in its buffer and will see them, too.
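If that backlog scenario is the issue, one mitigation is to shrink the streaming-pull buffer by tightening flow control on each subscriber. A minimal sketch with the Python client (the max_messages value of 10 is just an illustrative assumption, not a recommendation from the answer above):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'duplicate-test')

def callback(message):
    print(message.data.decode())
    message.ack()

# Each streaming-pull client may hold at most this many outstanding messages,
# so fewer messages sit in a client buffer long enough to expire.
flow_control = pubsub_v1.types.FlowControl(max_messages=10)

future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control)
future.result()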
However, if you are consistently seeing two freshly started subscribers immediately receive the same messages with the same message IDs, then you should contact Google Cloud support with your project name, subscription name, and a sample of the message IDs. They will be better able to investigate why this immediate duplication is happening.

(Edited as I misread the deadlines)
Looking at the Streaming Pull docs, this seems like an expected behavior:
The gRPC StreamingPull stack is optimized for high throughput and therefore buffers messages. This can have some consequences if you are attempting to process large backlogs of small messages (rather than a steady stream of new messages). Under these conditions, you may see messages delivered multiple times and they may not be load balanced effectively across clients.
From: https://cloud.google.com/pubsub/docs/pull#streamingpull
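For workloads that really are a large backlog of small messages, a possible workaround (not something the question tried; a sketch assuming the google-cloud-pubsub 2.x API) is unary pull, which fetches a bounded batch per request instead of streaming into a client-side buffer:

from google.api_core import exceptions
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'duplicate-test')

try:
    # One bounded request; nothing beyond max_messages is buffered client-side.
    response = subscriber.pull(
        request={'subscription': subscription_path, 'max_messages': 100},
        timeout=30,
    )
except exceptions.DeadlineExceeded:
    response = None  # no messages arrived before the timeout

if response and response.received_messages:
    for received in response.received_messages:
        print(received.message.message_id, received.message.data.decode())
    subscriber.acknowledge(
        request={
            'subscription': subscription_path,
            'ack_ids': [r.ack_id for r in response.received_messages],
        }
    )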

Related

Why am I getting 50% of GCP Pub/Sub messages duplicated?

I'm running an analytics pipeline.
Throughput is ~11 messages per second.
My Pub/Sub topic holds around 2M scheduled messages.
80 GCE instances are pulling messages in parallel.
Here is my topic and the subscription:
gcloud pubsub topics create pipeline-input

gcloud beta pubsub subscriptions create pipeline-input-sub \
    --topic pipeline-input \
    --ack-deadline 600 \
    --expiration-period never \
    --dead-letter-topic dead-letter
Here is how I pull messages:
import { PubSub, Message } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()
const queue: Message[] = []

const populateQueue = async () => {
  const subscription = pubSubClient.subscription('pipeline-input-sub', {
    flowControl: {
      maxMessages: 5
    }
  })
  const messageHandler = async (message: Message) => {
    queue.push(message)
  }
  subscription.on('message', messageHandler)
}

const processQueueMessage = () => {
  const message = queue.shift()
  try {
    ...
    message.ack()
  } catch {
    ...
    message.nack()
  }
  processQueueMessage()
}

processQueueMessage()
Processing time is ~7 seconds.
Here is one of the many similar dup cases.
The same message is delivered 5 (!!!) times to different GCE instances:
03:37:42.377
03:45:20.883
03:48:14.262
04:01:33.848
05:57:45.141
All 5 times the message was successfully processed and .ack()ed. The output includes 50% more messages than the input! I'm well aware of the "at least once" behavior, but I thought it may duplicate like 0.01% of messages, not 50% of them.
The topic input is 100% free of duplicates. I verified both the topic input method AND the number of unacked messages through Cloud Monitoring. The numbers match: there are no duplicates in the Pub/Sub topic.
UPDATE:
It looks like all those duplicates were created due to ack deadline expiration. I'm 100% sure that I'm acknowledging 99.9% of messages before the 600-second deadline.
Some duplicates are expected, though a 50% duplicate rate is definitely high. The first question is, are these publish-side duplicates or subscribe-side duplicates? The former are created when a publish of the same message is retried, resulting in multiple publishes of the same message. These messages will have different message IDs. The latter is caused by redeliveries of the same message to the subscriber. These messages have the same message ID (though different ack IDs).
It sounds like you have verified that these are subscribe-side duplicates. Therefore, the likely cause, as you mention, is an expired ack deadline. The question is, why are the messages exceeding the ack deadline? One thing to note is that when using the client library, the ack deadline set on the subscription is not the one used. Instead, the client library tries to optimize ack deadlines based on client library settings and the 99th percentile ack latency. It then renews leases on messages until the max_lease_duration specified in the FlowControl object passed into the subscribe method is reached. This defaults to one hour.
Therefore, in order for messages to remain leased, it is necessary for the client library to be able to send modifyAckDeadline requests to the server. One possible cause of duplicates would be the inability of the client to send these requests, possibly due to overload on the machine. Are the machines running this pipeline doing any other work? If so, it is possible they are overloaded in terms of CPU, memory, or network and are unable to send the modifyAckDeadline requests and unable to process messages in a timely fashion.
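The question's pipeline is in Node.js, but the lease-length knob described above is easiest to show with the Python client (the Node library exposes an equivalent option under a different name). A hedged sketch; the project ID is a placeholder:

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path('my-project', 'pipeline-input-sub')

def callback(message):
    message.ack()  # real processing would go here

# max_lease_duration caps how long the library keeps renewing a message's
# lease via modifyAckDeadline; per the answer above, the default is one hour.
flow_control = pubsub_v1.types.FlowControl(
    max_messages=5,          # mirrors flowControl.maxMessages in the question
    max_lease_duration=600,  # seconds
)

streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control)
streaming_pull_future.result()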
It is also possible that message batching could be affecting your ability to ack messages. As an optimization, the Pub/Sub system stores acknowledgements for batches of messages instead of individual messages. As a result, all messages in a batch must be acknowledged in order for the acknowledgement to take effect for any of them. Therefore, if you have five messages in a batch and acknowledge four of them but do not ack the final message, all five will be redelivered. There are some caches in place to try to minimize this, but it is still a possibility. There is a Medium post that discusses this in more detail (see the "Message Redelivery & Duplication Rate" section). It might be worth checking that all messages are acked and not nacked in your code by printing out the message ID as soon as the message is received and right before the calls to ack and nack. If your messages were published in batches, it is possible that a single nack is causing redelivery of more messages.
This coupling between batching and duplicates is something we are actively working on improving. I would expect this issue to stop at some point. In the meantime, if you have control over the publisher, you could set the max_messages property in the batch settings to 1 to prevent the batching of messages.
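For a Python publisher (used here only for illustration, since the question does not show its publisher code), disabling batching looks roughly like this:

from google.cloud import pubsub_v1

# max_messages=1 means each message is published in its own batch, so an
# acknowledgement never depends on other messages in the same batch.
batch_settings = pubsub_v1.types.BatchSettings(max_messages=1)
publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)

topic_path = publisher.topic_path('my-project', 'pipeline-input')
future = publisher.publish(topic_path, b'payload')
future.result()  # block until the publish succeeds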
If none of that helps, it would be best to open up a support case and provide the project name, subscription name, and message IDs of some duplicated messages. Engineers can investigate in more detail why individual messages are getting redelivered.

Pub/Sub - unable to pull undelivered messages

There is an issue with my company's Pub/Sub. Some of our messages are stuck and the oldest unacked message age is increasing over time.
1 day charts:
When I go to Metrics Explorer and select the "Expired ack deadlines count" metric, this is the one-week chart.
I decided to find out why these messages are stuck, but when I ran the pull command (below), I got a "Listed 0 items" response. It is therefore not possible to see them.
Is there a way to figure out why some of the messages are shown as unacknowledged?
Also, the unacked message count shows the same number of messages (around 2k) for the whole month, even though new messages are published every day.
Here are the parameters we use for this subscription:
I tried to fix this error by setting the deadline to 600 seconds, but it didn't help.
Additionally, I want to mention that we use the Node.js Pub/Sub client library to handle the messages.
The most common causes of messages not being able to be pulled are:
1. The subscriber client already received the messages and "forgot" about them, perhaps due to an exception being thrown and not handled. In this case, the message will continue to be leased by the client until the deadline passes. The client libraries all extend the lease automatically until the maxExtension time is reached. If these messages are always forgotten, they could be redelivered to the subscriber and forgotten again, resulting in them not being pullable via the gcloud command-line tool or UI.
2. There could be a rogue subscriber. It could be that another subscriber is running somewhere for the same subscription and is "stealing" these messages. Sometimes this is a test job or something that was used early on to see if the subscription works as expected and wasn't turned down.
3. You could be falling into the case of a large backlog of small messages. This should be fixed in more recent versions of the client library (v2.3.0 of the Node client has the fix).
4. The gcloud pubsub subscriptions pull command and UI are not guaranteed to return messages, even if there are some available to pull. Sometimes, rerunning the command multiple times in quick succession helps to pull messages.
The fact that you see expired ack deadlines likely points to 1, 2, or 3, so it is worth checking for those things. Otherwise, you should open a support case so the engineers can look more specifically at the backlog and determine where the messages are.

AWS SQS - when will the duplicated message arrive?

I understand that standard SQS uses "at least once" delivery while FIFO messages are delivered exactly once. I'm trying to weigh standard queues vs FIFO for my application, and one factor is how long it takes for the duplicated message to arrive.
I intend to consume messages from SQS then post the data I received to an idempotent third-party API. I understand that with standard SQS, there's always a risk of me overwriting more recent data with the old duplicated data.
For example:
Message A arrives, I post it onwards.
Message A duplicate arrives, I post it onwards.
Message B arrives, I post it onwards.
All fine ✓
On the other hand:
Message A arrives, I post it onwards.
Message B arrives, I post it onwards.
Message A duplicate arrives - I post it and overwrite the latest data, which was B! ✖
I want to measure this risk, i.e. I want to know how long the duplicate message should take to arrive. Will the duplicate message take roughly the same amount of time to arrive, as the original message?
Maybe it's useful to understand how message duplication occurs. As far as I know this isn't documented in the official docs, but instead it's my mental model of how it works. This is an educated guess.
Whenever you send a message to SQS (SendMessage API), this message arrives at the SQS webservice endpoint, which is one of probably thousands of servers. This endpoint receives your message, duplicates it one or more times and stores these duplicates on more than one SQS server. After it has received confirmation from at least two SQS servers, it acknowledges to the client that the message has been received.
When you call the ReceiveMessage API, only a subset of the SQS servers that handle your queue are queried for messages. When a message is returned, these servers communicate to their peers that this message is currently in-flight and the visibility timeout starts. This doesn't happen instantaneously, as it's a distributed system. While this ReceiveMessage call takes place, another consumer might also do a ReceiveMessage call and happen to query one of the servers that holds a replica of the message before it's marked as in-flight. That server hands out the message, and now you have two consumers working on it.
This is just one scenario, which is the result of this being a distributed system.
There are a couple of edge cases that can happen as the result of network issues, e.g. when the SQS response to the initial SendMessage gets lost and the client thinks the message didn't arrive and sends it again - poof, you got another duplicate.
The point being: things fail in weird and complex ways. That makes measuring the risk of a delayed message difficult. If your use case can't handle duplicate and out of order messages, you should go for FIFO, but that will inherently limit your throughput. Alternatives are based on distributed locking mechanisms and keeping track of which messages you have already processed, which are complex tools to solve a complex problem.
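A rough sketch of the "keep track of what you have already processed" approach with boto3; the queue URL and the process() handler are placeholders, and a real system would use a shared store such as DynamoDB instead of an in-memory set:

import boto3

QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'  # placeholder

sqs = boto3.client('sqs')
seen_message_ids = set()  # use a shared, persistent store across consumers in practice

def process(body):
    print('processing', body)  # placeholder for the real, ideally idempotent, work

while True:
    response = sqs.receive_message(
        QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for message in response.get('Messages', []):
        if message['MessageId'] not in seen_message_ids:
            process(message['Body'])
            seen_message_ids.add(message['MessageId'])
        # Delete the message either way so duplicate copies are not re-received.
        sqs.delete_message(
            QueueUrl=QUEUE_URL, ReceiptHandle=message['ReceiptHandle'])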

Google Cloud PubSub Message Delivered More than Once before reaching deadline acknowledgement time

Background:
We configured a Cloud Pub/Sub topic to interact between multiple App Engine services,
with push-based subscribers. The subscription's acknowledgement deadline is configured to 600 seconds.
Issue:
We have observed Pub/Sub pushing the same message twice (more than twice from some other topics) to its subscribers. Looking at the logs, I can see these pushes happened with a gap of just one second. Since we have configured the ackDeadline to 600 seconds, Pub/Sub should ideally re-attempt message delivery only after 600 seconds.
I need answers to the following:
Why was the same message delivered more than once within just one second?
Does Pub/Sub not honor the ackDeadline configuration before reattempting message delivery?
References:
- https://cloud.google.com/pubsub/docs/subscriber
Message redelivery can happen for a couple of reasons. First of all, it is possible that a message got published twice. Sometimes the publisher will get back an error like a deadline exceeded, meaning the publish took longer than anticipated. The message may or may not have actually been published in this situation. Often, the correct action is for the publisher to retry the publish and in fact that is what the Google-provided client libraries do by default. Consequently, there may be two copies of the message that were successfully published, even though the client only got confirmation for one of them.
Secondly, Google Cloud Pub/Sub guarantees at-least-once delivery. This means that occasionally, messages can be redelivered, even if the ackDeadline has not yet passed or an ack was sent back to the service. Acknowledgements are best effort and most of the time, they are successfully processed by the service. However, due to network glitches, server restarts, and other regular occurrences of that nature, sometimes the acknowledgements sent by the subscriber will not be processed, resulting in message redelivery.
A subscriber should be designed to be resilient to these occasional redeliveries, generally by ensuring that operations are idempotent, i.e., that the results of processing the message multiple times are the same, or by tracking and catching duplicates. Alternatively, one can use Cloud Dataflow as a subscriber to remove duplicates.
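As a sketch of "tracking and catching duplicates" for a push subscriber: the envelope format below is the documented Pub/Sub push payload, while Flask, the route, and the in-memory set are assumptions (a shared store such as Redis would replace the set in production). Returning a 2xx status acknowledges the pushed message:

import base64

from flask import Flask, request

app = Flask(__name__)
processed_ids = set()  # replace with a shared store in a real deployment

@app.route('/pubsub/push', methods=['POST'])
def pubsub_push():
    envelope = request.get_json()
    message = envelope['message']
    message_id = message['messageId']
    if message_id in processed_ids:
        # Already handled: return success so Pub/Sub stops redelivering.
        return '', 204
    payload = base64.b64decode(message.get('data', '')).decode()
    print('handling', message_id, payload)  # placeholder for the idempotent work
    processed_ids.add(message_id)
    return '', 204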

Cloud pubsub slow poll rate

I have a pubsub topic, with one subscription, and two different subscribers are pulling from it.
Using stackdriver, I can see that the subscription has ~1000 messages.
Each subscriber runs the following poll loop:
client = pubsub.Client()
topic = client.topic(topic_name)
subscription = pubsub.Subscription(subscription_name)
while True:
    messages = subscription.pull(return_immediately=True, max_messages=100, client=client)
    print len(messages)
    # put messages in a local queue for later processing; those processes will ack the subscription
My issue is a slow poll rate: even though I have plenty of messages waiting to be pulled, I'm getting only a few messages each time, and lots of responses come back without any messages at all. According to Stackdriver, my pulled-messages rate is ~1.5 messages/sec.
I tried using return_immediately=False, and it improved things a bit: the pull rate increased to ~2.5 messages/sec, but that is still not the rate I would expect.
Any ideas how to increase pull rate? Any pubsub poll best practices?
In order to increase your pull rate, you need to have more than one outstanding pull request at a time. How many depends on how fast and from how many places you publish. You'll need at least a few outstanding at all times. As soon as one of them returns, create another pull request. That way, whenever Cloud Pub/Sub is ready to deliver messages to your subscriber, you have requests waiting to receive messages.
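A rough sketch of that pattern with the current Python client (the question's code uses the older 0.x pubsub.Client API; the project and subscription names here are placeholders). Several unary pull requests are kept in flight at once, and each worker re-issues its request as soon as the previous one returns:

from concurrent.futures import ThreadPoolExecutor

from google.api_core import exceptions
from google.cloud import pubsub_v1

PROJECT_ID = 'my-project'            # placeholder
SUBSCRIPTION_ID = 'my-subscription'  # placeholder
NUM_WORKERS = 5                      # number of pull requests kept outstanding

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)

def pull_worker(worker_id):
    # Keep one pull request outstanding at all times so the service always
    # has a request waiting when it is ready to deliver messages.
    while True:
        try:
            response = subscriber.pull(
                request={'subscription': subscription_path, 'max_messages': 100},
                timeout=60,
            )
        except exceptions.DeadlineExceeded:
            continue  # nothing available before the timeout; pull again
        if not response.received_messages:
            continue
        for received in response.received_messages:
            print(worker_id, received.message.data.decode())  # enqueue for processing instead
        subscriber.acknowledge(
            request={
                'subscription': subscription_path,
                'ack_ids': [r.ack_id for r in response.received_messages],
            }
        )

with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
    for i in range(NUM_WORKERS):
        executor.submit(pull_worker, i)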