Is there a way to retrieve the count of messages in a PubSub subscription (in real time)? - google-cloud-platform

I want to achieve batch consumption of a Pub/Sub subscription, retrieving all the messages that were in the subscription at the beginning of my process. To do so, I use Pub/Sub's asynchronous pull for Java, with the consumer.ack() and consumer.nack() functions, to process exactly the number of messages that I want and make the subscription redeliver the messages that I have received but not yet processed. My problem is that I have not managed to find a way to retrieve the real-time count of messages in my subscription.
I have started requesting the pubsub.googleapis.com/subscription/num_undelivered_messages metric from Google Cloud Monitoring, but unfortunately the metric lags roughly 3 minutes behind the real count of undelivered messages in the subscription.
Is there any way to retrieve this message count in real time?
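For reference, querying this metric looks roughly like the following (a Python sketch against the Cloud Monitoring monitoring_v3 client; the project and subscription IDs are placeholders):
import time
from google.cloud import monitoring_v3
project_id = "my-project"            # placeholder
subscription_id = "my-subscription"  # placeholder
client = monitoring_v3.MetricServiceClient()
now = int(time.time())
# Look back five minutes so the window covers at least one data point.
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 300}, "end_time": {"seconds": now}}
)
results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": (
            'metric.type = "pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id = "{subscription_id}"'
        ),
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    # Points come back newest first; this prints the (delayed) backlog size.
    print(series.points[0].value.int64_value)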

There is no way to retrieve the message count in real time, no. Also keep in mind that such a number would not be sufficient to retrieve all of the messages that were in the subscription at the beginning of the process unless you can guarantee that no publishing is happening at the same time.
If there is publishing, then your subscriber could get those messages before messages published earlier, unless you are using ordered message delivery, and even then, those delivery guarantees are per ordering key, not a total ordering guarantee. If you can guarantee that there are no publishes during this time and/or you are only bringing the subscriber up periodically, then it sounds more like a batch case, which means you may want to consider a database or a GCS file as an alternative place to store the messages for processing.

Related

Google Cloud Pub/Sub retrieve message by ID

Problem: My use case is that I want to publish thousands of messages to Google Cloud Pub/Sub with a 5-minute retention period, but only retrieve specific messages by their ID. A Cloud Function will retrieve one message by ID using the Node.js SDK, and all the untreated messages will be deleted by the retention policy. All the examples I have found handle arbitrary messages from the subscriber.
Is it possible to pull just one message by ID, or by any other metadata, and close the connection?
There is no way to retrieve individual messages by ID, no. It doesn't really fit into the expected use cases for Cloud Pub/Sub where the publishers and subscribers are meant to be decoupled, meaning the subscriber inherently doesn't know the message IDs prior to receiving the messages.
You may instead want to transmit the messages via whatever mechanism you are using to make the subscribers aware of the message IDs. Or, if you know at publish time which messages will ultimately need to be retrieved, you could add an attribute to the message to indicate this and use filtering, as sketched below.
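For illustration, the attribute-plus-filter approach might look like this (a sketch with the Python pubsub_v1 client; the retrieve attribute and all names are hypothetical):
from google.cloud import pubsub_v1
project_id = "my-project"  # placeholder
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "my-topic")
# Publisher side: flag, at publish time, the messages that will need to be
# retrieved later. Attribute values must be strings.
publisher.publish(topic_path, b"payload", retrieve="true").result()
# One-time setup: a subscription whose filter delivers only flagged messages.
# Note that a filter can only be set when the subscription is created.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "flagged-only")
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "filter": 'attributes.retrieve = "true"',
    }
)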

Verify that the data sent to GCP Pub/Sub has reached it

We have a project which receives data from sensors and then sends this data to GCP. For this we have used GCP's Pub/Sub model. The issue is that when we pull the messages, they are not in order, so we are not able to verify whether the data we have sent to GCP has reached it.
Also, GCP has mentioned that they don't guarantee the order of messages: https://cloud.google.com/pubsub/docs/ordering
Is there any better way to verify these messages, other than the solutions recommended by GCP?
Ordering is not guaranteed in general in Pub/Sub, it is true. However, when using ordering keys as described in the ordering documentation to which you link, ordering is guaranteed. You would need to set an ordering key on published messages and enable message ordering on your subscription. Right now, the documentation only shows how to do this in Java, though other language examples will be coming soon.
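For illustration, here is a minimal sketch of that setup using the Python pubsub_v1 client (all names are placeholders; as noted, the official examples are currently Java-only):
from google.cloud import pubsub_v1
project_id = "my-project"  # placeholder
# Ordering must be enabled in the publisher's options...
publisher = pubsub_v1.PublisherClient(
    publisher_options=pubsub_v1.types.PublisherOptions(enable_message_ordering=True)
)
topic_path = publisher.topic_path(project_id, "sensor-data")
# ...and on the subscription itself.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "sensor-data-ordered")
subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "enable_message_ordering": True,
    }
)
# Messages that share an ordering key are delivered in publish order.
for reading in [b"reading-1", b"reading-2", b"reading-3"]:
    publisher.publish(topic_path, reading, ordering_key="sensor-42").result()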
Without using ordering, you could potentially monitor the backlog to see when num_undelivered_messages is 0. However, this has some drawbacks:
You would have to continuously query the metric to see its value.
The delay in computing the metric is O(minutes), so it may be stale, resulting either in not counting messages that were very recently published (showing a value less than the actual size of the backlog) or in not yet recording that some messages were delivered and acked (showing a value greater than the actual size of the backlog).
In general, it is preferred with Pub/Sub that your subscribers are always running and ready to receive data when it is published. Cloud Pub/Sub guarantees that messages successfully published will be received by subscribers, assuming subscribers are able to receive the messages within the message retention duration, which defaults to seven days.

How is Google Cloud Pub/Sub avoiding clock skew

I am looking into ways to order a list of messages from Google Cloud Pub/Sub. The documentation says:
Have a way to determine from all messages it has currently received whether or not there are messages it has not yet received that it needs to process first.
...is possible by using Cloud Monitoring to keep track of the pubsub.googleapis.com/subscription/oldest_unacked_message_age metric. A subscriber would temporarily put all messages in some persistent storage and ack the messages. It would periodically check the oldest unacked message age and check against the publish timestamps of the messages in storage. All messages published before the oldest unacked message are guaranteed to have been received, so those messages can be removed from persistent storage and processed in order.
I tested it locally and this approach seems to be working fine.
I have one gripe with it, however, and it is not something I can easily test myself.
This solution relies on the server-side publish_time attribute, assigned by Google. How does Google avoid the issue of skewed clocks?
If my producer publishes messages A and then immediately B, how can I be sure that A.publish_time < B.publish_time is true? Especially considering that the same documentation page mentions internal load-balancers in the architecture of the solution. Is Google Pub/Sub using atomic clocks to synchronize time on the very first machines which see messages and enrich those messages with the current time?
There is an implicit assumption in the recommended solution that the clocks on all the servers are synchronized. But the documentation never explains whether that is true or how it is achieved, so I feel a bit uneasy about the solution. Does it work under very high load?
Notice I am only interested in relative order of confirmed messages published after each other. If two messages are published simultaneously, I don't care about the order of them between each other. It can be A, B or B, A. I only want to make sure that if B is published after A is published, then I can sort them in that order on retrieval.
Is the aforementioned solution only "best-effort" or are there actual guarantees about this behavior?
There are two sides to ordered message delivery: establishing an order of messages on the publish side and having an established order of processing messages on the subscribe side. The document to which you refer is mostly concerned with the latter, particularly when it comes to using oldest_unacked_message_age. When using this method, one can know that if message A has a publish timestamp that is less than the publish timestamp for message B, then a subscriber will always process message A before processing message B. Essentially, once the order is established (via publish timestamps), it will be consistent. This works if it is okay for the Cloud Pub/Sub service itself to establish the ordering of messages.
Publish timestamps are not synchronized across servers, so if it is necessary for the order to be established by the publishers, the publishers must provide a timestamp (or sequence number) as an attribute that is used for ordering in the subscriber (and synchronized across publishers). The subscriber would sort messages by this user-provided timestamp instead of by the publish timestamp. The oldest_unacked_message_age will no longer be exact because it is tied to the publish timestamp. One could be more conservative and only consider messages ordered that are older than oldest_unacked_message_age minus some delta to account for this discrepancy.
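A minimal sketch of the publisher-supplied timestamp approach (Python pubsub_v1 client; the seq_time attribute name and the handle function are hypothetical):
import time
from google.cloud import pubsub_v1
project_id = "my-project"  # placeholder
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "events")
# Publisher side: stamp each message with a publisher-controlled clock. For a
# meaningful order, this clock must be synchronized across all publishers.
publisher.publish(topic_path, b"event-payload", seq_time=str(time.time_ns()))
# Subscriber side: sort the buffered messages (previously pulled, acked, and
# saved to persistent storage) by the publisher's timestamp rather than by
# the server-assigned publish_time.
def process_in_order(buffered_messages):
    for message in sorted(buffered_messages, key=lambda m: int(m.attributes["seq_time"])):
        handle(message)  # hypothetical processing step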
Google Cloud Pub/Sub does not guarantee that events reach consumers in the order in which they were produced. The reason is that Google Cloud Pub/Sub itself runs on a cluster of nodes, so it is possible for an event B to reach the consumer before an event A. To ensure ordering, you have to make changes on both the producer and the consumer to identify the order of events; see the section on ordering in the docs.

GCloud Pub/Sub Push Subscription: Limit max outstanding messages

Is there a way in a push subscription configuration to limit the maximum number of outstanding messages? In the high-level subscriber docs (https://cloud.google.com/pubsub/docs/push) it says: "With slow-start, Google Cloud Pub/Sub starts by sending a single message at a time, and doubles up with each successful delivery, until it reaches the maximum number of concurrent messages outstanding." I want to be able to limit the maximum number of messages being processed. Can this be done through the Pub/Sub config?
I've also thought of a number of other ways to effectively achieve this, but none seem great:
Have some semaphore-type system implemented in my push endpoint that returns a 429 once my max concurrency level is hit?
Similarly, have it deregister the push endpoint (turning it into a pull subscription) until the current messages have been processed?
My push endpoints are all on GAE, so there could also be something in the GAE configs to limit the simultaneous push subscription requests?
Push subscriptions do not offer any way to limit the number of outstanding messages. If one wants that level of control, then it is necessary to use pull subscriptions and flow control, as sketched below.
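For reference, client-side flow control on a pull subscription looks roughly like this in the Python pubsub_v1 client (a sketch; the limit of 10 and all names are placeholders):
from google.cloud import pubsub_v1
project_id = "my-project"  # placeholder
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "my-subscription")
def callback(message):
    handle(message)  # hypothetical processing step
    message.ack()
# At most 10 messages will be outstanding (delivered but not yet acked).
flow_control = pubsub_v1.types.FlowControl(max_messages=10)
streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)
streaming_pull_future.result()  # block and keep processing indefinitely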
Returning 429 errors as a means to limit outstanding messages may have undesirable side effects. On errors, Cloud Pub/Sub will reduce the rate of sending messages to a push subscriber. If a sufficient number of 429 errors are returned, it is entirely possible that the subscriber will receive a smaller number of messages than it can handle for a time while Cloud Pub/Sub ramps the delivery rate back up.
Switching from push to pull is a possibility, though still may not be a good solution. It would really depend on the frequency with which the push subscriber exceeds the desired number of outstanding messages. The change between push and pull and back may not take place instantaneously, meaning the subscriber could still exceed the desired limit for some period of time and may also experience a delay in receiving new messages when switching back to a push subscriber.

Cloud pubsub slow poll rate

I have a pubsub topic, with one subscription, and two different subscribers are pulling from it.
Using stackdriver, I can see that the subscription has ~1000 messages.
Each subscriber runs the following poll loop:
from google.cloud import pubsub
client = pubsub.Client()
topic = client.topic(topic_name)
subscription = topic.subscription(subscription_name)
while True:
    # Each pull returns a list of (ack_id, message) tuples.
    messages = subscription.pull(return_immediately=True, max_messages=100)
    print(len(messages))
    # Put messages on a local queue for later processing. Those processes
    # will ack the subscription.
My issue is a slow poll rate: even though I have plenty of messages waiting to be pulled, I'm getting only several messages each time, and lots of responses come back without any messages. According to Stackdriver, my message pull rate is ~1.5 messages/sec.
I tried using return_immediately=False, and it improved things a bit: the pull rate increased to ~2.5 messages/sec, but that is still not the rate I would expect.
Any ideas how to increase pull rate? Any pubsub poll best practices?
In order to increase your pull rate, you need to have more than one outstanding pull request at a time. How many depends on how fast and from how many places you publish. You'll need at least a few outstanding at all times. As soon as one of them returns, create another pull request. That way, whenever Cloud Pub/Sub is ready to deliver messages to your subscriber, you have requests waiting to receive messages.
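A rough sketch of this with the same legacy client as in the question (the thread count and the enqueue_for_processing helper are placeholders, and messages are acked immediately here for brevity):
import threading
from google.cloud import pubsub
NUM_PULLERS = 5  # keep at least a few requests outstanding at all times
def pull_loop(subscription):
    while True:
        # Block until messages are available instead of returning immediately.
        messages = subscription.pull(return_immediately=False, max_messages=100)
        if messages:
            for ack_id, message in messages:
                enqueue_for_processing(message)  # hypothetical local queue
            subscription.acknowledge([ack_id for ack_id, _ in messages])
client = pubsub.Client()
topic = client.topic(topic_name)
subscription = topic.subscription(subscription_name)
for _ in range(NUM_PULLERS):
    threading.Thread(target=pull_loop, args=(subscription,)).start()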