Why am I getting 50% of GCP Pub/Sub messages duplicated? - google-cloud-platform

I'm running an analytics pipeline.
Throughput is ~11 messages per second.
Around 2M messages are queued in my Pub/Sub topic.
80 GCE instances are pulling messages in parallel.
Here is my topic and the subscription:
gcloud pubsub topics create pipeline-input
gcloud beta pubsub subscriptions create pipeline-input-sub \
--topic pipeline-input \
--ack-deadline 600 \
--expiration-period never \
--dead-letter-topic dead-letter
Here is how I pull messages:
import { PubSub, Message } from '@google-cloud/pubsub'

const pubSubClient = new PubSub()
const queue: Message[] = []

const populateQueue = async () => {
  const subscription = pubSubClient.subscription('pipeline-input-sub', {
    flowControl: {
      maxMessages: 5
    }
  })
  const messageHandler = async (message: Message) => {
    queue.push(message)
  }
  subscription.on('message', messageHandler)
}
const processQueueMessage = () => {
  const message = queue.shift()
  try {
    ...
    message.ack()
  } catch {
    ...
    message.nack()
  }
  processQueueMessage()
}
processQueueMessage()
Processing time is ~7 seconds.
Here is one of the many similar dup cases.
The same message is delivered 5 (!!!) times to different GCE instances:
03:37:42.377
03:45:20.883
03:48:14.262
04:01:33.848
05:57:45.141
All 5 times the message was successfully processed and .ack()ed. The output includes 50% more messages than the input! I'm well aware of the "at least once" behavior, but I thought it may duplicate like 0.01% of messages, not 50% of them.
The topic input is 100% free of duplicates. I verified both the topic input method AND the number of un-acked messages through Cloud Monitoring. The numbers match: there are no duplicates in the Pub/Sub topic.
UPDATE:
It looks like all those duplicates were created due to ack deadline expiration. I'm 100% sure that I'm acknowledging 99.9% of messages before the 600-second deadline.

Some duplicates are expected, though a 50% duplicate rate is definitely high. The first question is, are these publish-side duplicates or subscribe-side duplicates? The former are created when a publish of the same message is retried, resulting in multiple publishes of the same message. These messages will have different message IDs. The latter is caused by redeliveries of the same message to the subscriber. These messages have the same message ID (though different ack IDs).
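One way to make that distinction concrete is to compare message IDs across your delivery logs. The following sketch (plain Python, over hypothetical log records) flags message IDs seen more than once as subscribe-side redeliveries, and identical payloads appearing under different IDs as likely publish-side duplicates:

```python
from collections import Counter

def classify_duplicates(deliveries):
    """Given (message_id, payload) pairs from delivery logs, split duplicates
    into subscribe-side (same service-assigned message ID delivered more than
    once) and likely publish-side (same payload under different message IDs)."""
    id_counts = Counter(mid for mid, _ in deliveries)
    subscribe_side = {mid for mid, n in id_counts.items() if n > 1}
    payload_to_ids = {}
    for mid, payload in deliveries:
        payload_to_ids.setdefault(payload, set()).add(mid)
    publish_side = {p for p, ids in payload_to_ids.items() if len(ids) > 1}
    return subscribe_side, publish_side

# Example: message "A" redelivered (same ID twice),
# message "B" published twice (two different IDs).
log = [("id-1", "A"), ("id-1", "A"), ("id-2", "B"), ("id-3", "B")]
sub_dups, pub_dups = classify_duplicates(log)
```

Payload equality is only a heuristic for publish-side duplicates, of course; if your payloads carry a unique business key, compare on that instead.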
It sounds like you have verified that these are subscribe-side duplicates. Therefore, the likely cause, as you mention, is an expired ack deadline. The question is, why are the messages exceeding the ack deadline? One thing to note is that when using the client library, the ack deadline set in the subscription is not the one used. Instead, the client library tries to optimize ack deadlines based on client library settings and the 99th percentile ack latency. It then renews leases on messages until the max_lease_duration property of the FlowControl object passed into the subscribe method. This defaults to one hour.
Therefore, in order for messages to remain leased, it is necessary for the client library to be able to send modifyAckDeadline requests to the server. One possible cause of duplicates would be the inability of the client to send these requests, possibly due to overload on the machine. Are the machines running this pipeline doing any other work? If so, it is possible they are overloaded in terms of CPU, memory, or network and are unable to send the modifyAckDeadline requests and unable to process messages in a timely fashion.
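As a rough model of that renewal loop (not the library's actual scheduler, and ignoring network latency), the client keeps sending modifyAckDeadline just before each deadline expires, until the max lease duration is hit:

```python
def lease_extension_times(ack_latency_p99, max_lease_duration):
    """Approximate times (seconds after receipt) at which a client library
    would send modifyAckDeadline to keep a message leased. Each renewal
    extends the deadline by roughly the observed p99 ack latency; once the
    max lease duration is reached, the message is allowed to expire."""
    times, deadline = [], ack_latency_p99
    while deadline < max_lease_duration:
        times.append(deadline)  # modAck sent just before this deadline
        deadline += ack_latency_p99
    return times

# With a 10 s p99 ack latency and a 60 s max lease, roughly 5 renewals
# would be sent before the lease is released.
renewals = lease_extension_times(10, 60)
```

The point of the model: if the machine is too overloaded to send those periodic renewals, the deadline silently expires and the message is redelivered, even though your handler eventually acks it.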
It is also possible that message batching could be affecting your ability to ack messages. As an optimization, the Pub/Sub system stores acknowledgements for batches of messages instead of individual messages. As a result, all messages in a batch must be acknowledged in order for all of them to be acknowledged. Therefore, if you have five messages in a batch and acknowledge four of them, but then do not ack the final message, all five will be redelivered. There are some caches in place to try to minimize this, but it is still a possibility. There is a Medium post that discusses this in more detail (see the "Message Redelivery & Duplication Rate" section). It might be worth checking that all messages are acked and not nacked in your code by printing out the message ID as soon as the message is received and right before the calls to ack and nack. If your messages were published in batches, it is possible that a single nack is causing redelivery of more messages.
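A simplified model of that batch-acknowledgement behavior (the real service has caches that soften this, as noted above, so treat this as an illustration, not an exact description):

```python
def redelivered(batch, acked_ids):
    """Model server-side batch acknowledgement: the whole batch is
    redelivered unless every message in it was acked."""
    return [] if set(batch) <= set(acked_ids) else list(batch)

batch = ["m1", "m2", "m3", "m4", "m5"]
# Four of five acked: the single unacked m5 drags the whole batch back,
# so m1..m4 show up again as duplicates.
assert redelivered(batch, ["m1", "m2", "m3", "m4"]) == batch
# All five acked: nothing is redelivered.
assert redelivered(batch, batch) == []
```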
This coupling between batching and duplicates is something we are actively working on improving. I would expect this issue to stop at some point. In the meantime, if you have control over the publisher, you could set the max_messages property in the batch settings to 1 to prevent the batching of messages.
If none of that helps, it would be best to open up a support case and provide the project name, subscription name, and message IDs of some duplicated messages. Engineers can investigate in more detail why individual messages are getting redelivered.

Related

Message stays in GCP Pub/Sub even after acknowledge using GCP REST API

I am using the following GCP Pub/Sub REST APIs for pulling and acknowledging messages.
To pull messages:
POST https://pubsub.googleapis.com/v1/projects/myproject/subscriptions/mysubscription:pull
{
"returnImmediately": "false",
"maxMessages": "10"
}
To acknowledge a message:
POST https://pubsub.googleapis.com/v1/projects/myproject/subscriptions/mysubscription:acknowledge
{
"ackIds": [
"dQNNHlAbEGEIBERNK0EPKVgUWQYyODM2LwgRHFEZDDsLRk1SK..."
]
}
I am using the Postman tool to call the above APIs. But even after acknowledging, I can see the same message, with the same messageId and a different ackId, the next time I pull. Is there any mechanism to exclude acknowledged messages from the GCP pull (subscriptions/mysubscription:pull)?
Cloud Pub/Sub is an at-least-once delivery system, so some duplicates are expected. However, if you are always seeing duplicates, it is likely that you are not acknowledging the message before the ack deadline passes. The default ack deadline is 10 seconds. If you do not call ack within that time period, then the message will be redelivered. You can set the ack deadline on a subscription to up to 600 seconds.
If all of your messages are expected to take a longer time to process, then it is best to increase the ack deadline. If only a couple of messages will be slow and most will be processed quickly, then it's better to use the modifyAckDeadline call to increase the ack deadline on a per-message basis.
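Continuing with the REST approach used in the question, a per-message extension is a POST to the subscription's :modifyAckDeadline method. Here is a small helper that builds that request (the subscription name and ack ID below are placeholders, and you would send the result with any HTTP client plus an auth token):

```python
import json

def mod_ack_request(subscription, ack_ids, seconds):
    """Build the v1 REST modifyAckDeadline call for messages that need
    more processing time. Returns the URL and the JSON body to POST."""
    url = f"https://pubsub.googleapis.com/v1/{subscription}:modifyAckDeadline"
    body = {"ackIds": ack_ids, "ackDeadlineSeconds": seconds}
    return url, json.dumps(body)

url, body = mod_ack_request(
    "projects/myproject/subscriptions/mysubscription",
    ["example-ack-id"], 300)
```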

Pub/Sub - unable to pull undelivered messages

There is an issue with my company's Pub/Sub. Some of our messages are stuck and the oldest unacked message age is increasing over time.
1 day charts:
When I go to Metrics Explorer and select "Expired ack deadlines count", this is the one-week chart.
I decided to find out why these messages are stuck, but when I ran the pull command (below), I got a "Listed 0 items" response. It is therefore not possible to see them.
Is there a way how I can figure out why some of the messages are displayed as unacknowledged?
Also, the Unacked message count shows the same amount (around 2k) messages for the whole month, even though there are new messages published every day.
Here are the parameters we use for this subscription:
I tried to fix this error by setting the deadline to 600 seconds, but it didn't help.
Additionally, I want to mention that we use node.js Pub/Sub client library to handle the messages.
The most common causes of messages not being able to be pulled are:
The subscriber client already received the messages and "forgot" about them, perhaps due to an exception being thrown and not handled. In this case, the message will continue to be leased by the client until the deadline passes. The client libraries all extend the lease automatically until the maxExtension time is reached. If these are messages that are always forgotten, then it could be that they are redelivered to the subscriber and forgotten again, resulting in them not being pullable via the gcloud command-line tool or UI.
There could be a rogue subscriber. It could be that another subscriber is running somewhere for the same subscription and is "stealing" these messages. Sometimes this can be a test job or something that was used early on to see if the subscription works as expected and wasn't turned down.
You could be falling into the case of a large backlog of small messages. This should be fixed in more recent versions of the client library (v2.3.0 of the Node client has the fix).
The gcloud pubsub subscription pull command and UI are not guaranteed to return messages, even if there are some available to pull. Sometimes, rerunning the command multiple times in quick succession helps to pull messages.
The fact that you see expired ack deadlines likely points to 1, 2, or 3, so it is worth checking for those things. Otherwise, you should open a support case so the engineers can look more specifically at the backlog and determine where the messages are.

Google PubSub Python multiple subscriber clients receiving duplicate messages

I have a pretty straightforward app that starts a PubSub subscriber StreamingPull client. I have this deployed on Kubernetes so I can scale. When I have a single pod deployed, everything works as expected. When I scale to 2 containers, I start getting duplicate messages. I know that a small number of duplicate messages is to be expected, but almost half the messages, sometimes more, are received multiple times.
My process takes about 600ms to process a message. The subscription acknowledgement deadline is set to 600s. I published 1000 messages, and the subscription was emptied in less than a minute, but the acknowledge_message_operation metric shows ~1500 calls, with a small amount with response_code expired. There were no failures in my process and all messages were acked upon processing. Logs show that the same message was received by the two containers at the exact same time. The minute to process all the messages was well below the acknowledgement deadline of the subscription, and the Python client is supposed to handle lease management, so I'm not sure why there were any expired messages at all. I also don't understand why the same message is sent to multiple subscriber clients at the same time.
Minimal working example:
import time

from google.cloud import pubsub_v1

PROJECT_ID = 'my-project'
PUBSUB_TOPIC_ID = 'duplicate-test'
PUBSUB_SUBSCRIPTION_ID = 'duplicate-test'

def subscribe(sleep_time=None):
    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path(
        PROJECT_ID, PUBSUB_SUBSCRIPTION_ID)

    def callback(message):
        print(message.data.decode())
        if sleep_time:
            time.sleep(sleep_time)
        print(f'acking {message.data.decode()}')
        message.ack()

    future = subscriber.subscribe(
        subscription_path, callback=callback)
    print(f'Listening for messages on {subscription_path}')
    future.result()

def publish(num_messages):
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path(PROJECT_ID, PUBSUB_TOPIC_ID)
    for i in range(num_messages):
        publisher.publish(topic_path, str(i).encode())
In two terminals, run subscribe(1). In a third terminal, run publish(200). For me, this will give duplicates in the two subscriber terminals.
It is unusual for two subscribers to get the same message at the same time unless:
The message got published twice due to a retry (and therefore as far as Cloud Pub/Sub is concerned, there are two messages). In this case, the content of the two messages would be the same, but their message IDs would be different. Therefore, it might be worth ensuring that you are looking at the service-provided message ID to ensure the messages are indeed duplicates.
The subscribers are on different subscriptions, which means each of the subscribers would receive all of the messages.
If neither of these is the case, then duplicates should be relatively rare. There is an edge case in dealing with large backlogs of small messages with streaming pull (which is what the Python client library uses). Basically, if messages that are very small are published in a burst and subscribers then consume that burst, it is possible to see the behavior you are seeing. All of the messages would end up being sent to one of the two subscribers and would be buffered behind the flow control limits of the number of outstanding messages. These messages may exceed their ack deadline, resulting in redelivery, likely to the other subscriber. The first subscriber still has these messages in its buffer and will see these messages, too.
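To see why a buffered burst can blow past the ack deadline, here is a toy model. It ignores lease extension and other real-world details, so it only illustrates the shape of the problem, not actual redelivery rates:

```python
def duplicate_fraction(n_messages, acks_per_sec, ack_deadline):
    """Toy model of the small-message backlog edge case: a burst of
    n_messages lands in one subscriber's buffer. It acks acks_per_sec
    messages per second; anything still buffered when the ack deadline
    passes is redelivered (often to the other subscriber), producing
    duplicates."""
    acked_in_time = min(n_messages, acks_per_sec * ack_deadline)
    redelivered = n_messages - acked_in_time
    return redelivered / n_messages

# 1000 buffered messages, 10 acked/sec, 60 s deadline:
# only 600 are acked in time, so 400 (40%) come back as duplicates.
frac = duplicate_fraction(1000, 10, 60)
```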
However, if you are consistently seeing two subscribers freshly started immediately receive the same messages with the same message IDs, then you should contact Google Cloud support with your project name, subscription name, and a sample of the message IDs. They will better be able to investigate why this immediate duplication is happening.
(Edited as I misread the deadlines)
Looking at the Streaming Pull docs, this seems like an expected behavior:
The gRPC StreamingPull stack is optimized for high throughput and therefore
buffers messages. This can have some consequences if you are attempting to
process large backlogs of small messages (rather than a steady stream of new
messages). Under these conditions, you may see messages delivered multiple times
and they may not be load balanced effectively across clients.
From: https://cloud.google.com/pubsub/docs/pull#streamingpull

Google Cloud PubSub Message Delivered More than Once before reaching deadline acknowledgement time

Background:
We configured cloud pubsub topic to interact within multiple app engine services,
There we have configured push based subscribers. We have configured its acknowledgement deadline to 600 seconds
Issue:
We have observed Pub/Sub pushing the same message twice (more than twice from some other topics) to its subscribers. Looking at the logs, I can see the pushes happened with a gap of just 1 second. Ideally, since we have configured the ackDeadline to 600 seconds, Pub/Sub should re-attempt message delivery only after 600 seconds.
Need following answers:
Why was the same message delivered more than once within 1 second?
Does Pub/Sub not honor the ackDeadline configuration before reattempting
message delivery?
References:
- https://cloud.google.com/pubsub/docs/subscriber
Message redelivery can happen for a couple of reasons. First of all, it is possible that a message got published twice. Sometimes the publisher will get back an error like a deadline exceeded, meaning the publish took longer than anticipated. The message may or may not have actually been published in this situation. Often, the correct action is for the publisher to retry the publish and in fact that is what the Google-provided client libraries do by default. Consequently, there may be two copies of the message that were successfully published, even though the client only got confirmation for one of them.
Secondly, Google Cloud Pub/Sub guarantees at-least-once delivery. This means that occasionally, messages can be redelivered, even if the ackDeadline has not yet passed or an ack was sent back to the service. Acknowledgements are best effort and most of the time, they are successfully processed by the service. However, due to network glitches, server restarts, and other regular occurrences of that nature, sometimes the acknowledgements sent by the subscriber will not be processed, resulting in message redelivery.
A subscriber should be designed to be resilient to these occasional redeliveries, generally by ensuring that operations are idempotent, i.e., that the results of processing the message multiple times are the same, or by tracking and catching duplicates. Alternatively, one can use Cloud Dataflow as a subscriber to remove duplicates.
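A minimal sketch of such a deduplicating (idempotent) handler, keyed on the service-assigned message ID. In a real deployment the seen-ID set would live in shared storage with a TTL (e.g. Redis or a database) rather than in process memory:

```python
processed_ids = set()

def handle(message_id, payload, apply_effect):
    """Apply the side effect only on the first delivery of a given message
    ID; treat repeats as no-ops. Duplicates should still be acked so that
    Pub/Sub stops redelivering them."""
    if message_id in processed_ids:
        return False  # duplicate delivery: skip, but still ack upstream
    apply_effect(payload)
    processed_ids.add(message_id)
    return True

results = []
assert handle("id-1", "order-42", results.append) is True
assert handle("id-1", "order-42", results.append) is False  # redelivery ignored
```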

Cloud pubsub slow poll rate

I have a pubsub topic, with one subscription, and two different subscribers are pulling from it.
Using stackdriver, I can see that the subscription has ~1000 messages.
Each subscriber runs the following poll loop:
client = pubsub.Client()
topic = client.topic(topic_name)
subscription = pubsub.Subscription(subscription_name)

while True:
    messages = subscription.pull(return_immediately=True, max_messages=100, client=client)
    print len(messages)
    # put messages in local queue for later processing. Those processes will ack the subscription
My issue is a slow poll rate - even though I have plenty of messages waiting to be pulled, I'm getting only a few messages on each request. Also, many responses come back without any messages. According to Stackdriver, my pull rate is ~1.5 messages/sec.
I tried to use return_immediately=False, and it improved it a bit - the pull rate increased to ~2.5 messages/sec, but still - not the rate I would expect to have.
Any ideas how to increase pull rate? Any pubsub poll best practices?
In order to increase your pull rate, you need to have more than one outstanding pull request at a time. How many depends on how fast and from how many places you publish. You'll need at least a few outstanding at all times. As soon as one of them returns, create another pull request. That way, whenever Cloud Pub/Sub is ready to deliver messages to your subscriber, you have requests waiting to receive messages.
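A sketch of that pattern: keep several pull requests in flight at once, and reissue each one as soon as it returns. Here pull_once is injected as a stand-in for subscription.pull() so the example runs without a Pub/Sub backend; in real code each worker would issue a synchronous pull and ack what it processes:

```python
import queue
import threading

def run_pullers(pull_once, n_outstanding, n_rounds, out):
    """Keep n_outstanding pull requests in flight at all times: each worker
    issues a new pull the moment its previous one returns, so the server
    always has a waiting request to deliver messages into."""
    def worker():
        for _ in range(n_rounds):
            for msg in pull_once():  # blocks like a synchronous pull would
                out.put(msg)

    threads = [threading.Thread(target=worker) for _ in range(n_outstanding)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

# Fake pull returning two messages per request:
# 3 concurrent pullers x 4 rounds x 2 messages = 24 messages received.
out = queue.Queue()
run_pullers(lambda: ["a", "b"], 3, 4, out)
```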