I have a topic that, on publish, pushes the event to a Cloud Run endpoint, and a trigger on a storage bucket that publishes to this topic. The container in Cloud Run fails to process the event and has been retried hundreds of times, and I don't want to waste money on this. How can I limit the retries on failure for a Cloud Run container?
A possible answer is the following.
If we read the documentation on PUSH subscriptions found here, we find the following:
... Pub/Sub retries delivery until the message expires after the
subscription's message retention period.
What this means is that if Pub/Sub pushes a message to Cloud Run and Cloud Run does not acknowledge the message by returning a 200 response code, then the message will be re-pushed for the "message retention period". By default, this is 7 days but according to the documentation, can be set to a minimum value of 10 minutes. What this seems to say to me is that we can stop a poison message after 10 minutes (minimum) of retries.
If a message is pushed and not acked, then it won't be pushed again immediately but instead be pushed as a function of a back-off algorithm described here.
If we look at the gcloud documentation we find reference to the concept of a maximum number of delivery attempts (--max-delivery-attempts). Associated with this is a topic called the dead letter topic (--dead-letter-topic). What this appears to define is that if delivery of a Pub/Sub message is attempted more than the maximum number of times, the message will be removed from the queue of messages associated with the subscription and moved to the topic associated with the dead letter. If you define this for your environment, then your Cloud Run will only execute a finite number of times, after which the poison messages will be moved elsewhere.
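As a rough example, assuming a subscription named my-run-subscription that pushes to the Cloud Run service and a separately created topic my-dead-letter-topic (both names are illustrative), the configuration could look like this:

# Illustrative names; substitute your own subscription and topic.
gcloud pubsub topics create my-dead-letter-topic

gcloud pubsub subscriptions update my-run-subscription \
    --dead-letter-topic=my-dead-letter-topic \
    --max-delivery-attempts=5

Note that the Pub/Sub service account also needs permission to publish to the dead-letter topic and to subscribe to the source subscription, otherwise dead lettering will not take effect.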
Related
We (as a company) experience large spikes every day. We use Pub/Sub -> Cloud Run combination.
The issue we experience is that when high traffic hits, Pub/Sub tries to push messages to Cloud Run all at the same time without any flow control. The result?
429: The request was aborted because there was no available instance.
Although this is marked as a warning, every 4xx HTTP response results in the message being retried.
Messages, therefore, come back to the queue and wait. If a message repeats this process and the instances are still taken, Cloud Run returns 429 again, and the message is sent back to the queue. This process repeats x times (depending on what value we set in Maximum delivery attempts). After that, the message goes to the dead-letter queue.
We want to avoid this and ideally not get any 429s, so that messages don't travel back and forth and don't end up in the dead-letter subscription; a 429 is not one of the application errors we want to keep there, but rather a warning caused by Pub/Sub not controlling the flow and coordinating with Cloud Run.
Neither Pub/Sub nor a push subscription (which we are required to use for Cloud Run) has any flow control feature.
Is there any way to control how many messages are sent to Cloud Run to avoid getting the 429 response? And also, why does Pub/Sub even try to deliver when it is obvious that Cloud Run has hit its instance limit? The best option would be to keep the messages in a queue until instances free up.
Most answers would probably suggest increasing the instance limit. We have already set it to 1000. This would not be scalable, because even if we set the limit to 1500 and a huge spike comes, we would exceed the limit and get 429 responses again.
The only option I can think of is some kind of flow control. So far, we have read about Cloud Tasks, but we are not sure if it can help us. Ideally, we don't want to introduce any new service, but if necessary, we will.
Thank you for all your tips and time! :)
There is an issue with my company's Pub/Sub. Some of our messages are stuck and the oldest unacked message age is increasing over time.
Here are the 1-day charts:
When I go to Metrics Explorer and select "Expired ack deadlines count", this is the one-week chart.
I decided to find out why these messages are stuck, but when I ran the pull command (below), I got a "Listed 0 items" response. It is therefore not possible to see them.
Is there a way to figure out why some of the messages are displayed as unacknowledged?
Also, the Unacked message count shows the same number of messages (around 2k) for the whole month, even though new messages are published every day.
Here are the parameters we use for this subscription:
I tried to fix this error by setting the deadline to 600 seconds, but it didn't help.
Additionally, I want to mention that we use the Node.js Pub/Sub client library to handle the messages.
The most common causes of messages not being able to be pulled are:
The subscriber client already received the messages and "forgot" about them, perhaps due to an exception being thrown and not handled (a guard against this is sketched below). In this case, the message will continue to be leased by the client until the deadline passes. The client libraries all extend the lease automatically until the maxExtension time is reached. If these are messages that are always forgotten, then it could be that they are redelivered to the subscriber and forgotten again, resulting in them not being pullable via the gcloud command-line tool or UI.
There could be a rogue subscriber. It could be that another subscriber is running somewhere for the same subscription and is "stealing" these messages. Sometimes this can be a test job or something that was used early on to see if the subscription works as expected and wasn't turned down.
You could be falling into the case of a large backlog of small messages. This should be fixed in more recent versions of the client library (v2.3.0 of the Node client has the fix).
The gcloud pubsub subscriptions pull command and the UI are not guaranteed to return messages, even if there are some available to pull. Sometimes, rerunning the command multiple times in quick succession helps to pull messages.
The fact that you see expired ack deadlines likely points to 1, 2, or 3, so it is worth checking for those things. Otherwise, you should open a support case so the engineers can look more specifically at the backlog and determine where the messages are.
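For case 1 in particular, the usual guard is to make sure the subscriber callback never drops a message silently: ack only after successful processing and nack on failure, so the message is redelivered promptly instead of having its lease extended until maxExtension. A minimal sketch with the Python client library (the same idea applies to the Node.js client the question uses; process_message and the resource names are placeholders):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

def callback(message):
    try:
        process_message(message.data)  # hypothetical business logic
        message.ack()                  # acknowledge only after success
    except Exception:
        message.nack()                 # hand the message back instead of holding the lease

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
streaming_pull_future.result()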
How can I bulk move messages from one topic to another in GCP Pub/Sub?
I am aware of the Dataflow templates that provide this; unfortunately, restrictions do not allow me to use the Dataflow API.
Any suggestions on ad-hoc movement of messages between topics (besides copying and pasting them one by one)?
Specifically, the use case is for moving messages in a deadletter topic back into the original topic for reprocessing.
You can't use snapshots, because snapshots can be applied only to subscriptions of the same topic (to avoid message ID overlapping).
The easiest way is to write a function that pulls from your subscription. Here is how I would do it:
Create a topic (named, for example, "transfer-topic") with a push subscription. Set the acknowledgement deadline to 10 minutes.
Create an HTTP Cloud Function triggered by the Pub/Sub push subscription (or a Cloud Run service). When you deploy it, set the timeout to 9 minutes for the Cloud Function and to 10 minutes for Cloud Run. The processing is the following:
Read a chunk of messages (for example, 1000) from the dead-letter pull subscription
Publish the messages (in bulk mode) to the initial topic
Acknowledge the messages on the dead-letter subscription
Repeat until the pull subscription is empty
Return a 200 response code
The global process:
Publish a message in the transfer-topic
The message triggers the function/Cloud Run service via an HTTP push
The process pulls the messages and republishes them to the initial topic
If the timeout is reached, the function crashes and Pub/Sub retries the HTTP request (following an exponential backoff)
If all the messages are processed, the HTTP 200 response code is returned and the process stops (and the message in the transfer-topic subscription is acked)
This process allows you to process a very large number of messages without worrying about the timeout; a sketch of the pull-publish-ack loop is below.
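A minimal sketch of that loop, assuming the Python client library (google-cloud-pubsub v2+) and placeholder resource names:

from google.cloud import pubsub_v1

# Placeholder names; substitute your own dead-letter subscription and topic.
DEAD_LETTER_SUB = "projects/my-project/subscriptions/dead-letter-sub"
ORIGINAL_TOPIC = "projects/my-project/topics/original-topic"

subscriber = pubsub_v1.SubscriberClient()
publisher = pubsub_v1.PublisherClient()

while True:
    # 1. Read a chunk of messages from the dead-letter pull subscription.
    response = subscriber.pull(
        request={"subscription": DEAD_LETTER_SUB, "max_messages": 1000}
    )
    if not response.received_messages:
        break  # the subscription is empty: return 200 and stop

    # 2. Publish the messages to the initial topic and wait for the publishes to complete.
    futures = [
        publisher.publish(ORIGINAL_TOPIC, msg.message.data, **dict(msg.message.attributes))
        for msg in response.received_messages
    ]
    for future in futures:
        future.result()

    # 3. Acknowledge the messages on the dead-letter subscription.
    ack_ids = [msg.ack_id for msg in response.received_messages]
    subscriber.acknowledge(request={"subscription": DEAD_LETTER_SUB, "ack_ids": ack_ids})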
I suggest that you use a Python script for that.
You can use the Pub/Sub client library to read the messages and publish them to another topic, as below:
from google.cloud import pubsub
from google.cloud.pubsub import types

# Defining parameters
PROJECT = "<your_project_id>"
SUBSCRIPTION = "<your_current_subscription_name>"
NEW_TOPIC = "projects/<your_project_id>/topics/<your_new_topic_name>"

# Creating clients for publishing and subscribing. Adjust the max_messages for your purpose
subscriber = pubsub.SubscriberClient()
publisher = pubsub.PublisherClient(
    batch_settings=types.BatchSettings(max_messages=500),
)

# Get your messages. Adjust the max_messages for your purpose
subscription_path = subscriber.subscription_path(PROJECT, SUBSCRIPTION)
response = subscriber.pull(subscription_path, max_messages=500)

# Publish your messages to the new topic
for msg in response.received_messages:
    publisher.publish(NEW_TOPIC, msg.message.data)

# Ack the old subscription if necessary
ack_ids = [msg.ack_id for msg in response.received_messages]
subscriber.acknowledge(subscription_path, ack_ids)
Before running this code, you will need to install the Pub/Sub client library in your Python environment. You can do that by running pip install google-cloud-pubsub.
One way to execute your code is with Cloud Functions. If you decide to use it, pay attention to two points:
The maximum time that your function can take to run is 9 minutes. If this timeout is exceeded, your function will terminate without finishing the job.
In Cloud Functions, you can just put google-cloud-pubsub on a new line of your requirements file instead of running a pip command.
Background:
We configured a Cloud Pub/Sub topic to interact with multiple App Engine services.
There we have configured push-based subscribers, with an acknowledgement deadline of 600 seconds.
Issue:
We have observed that Pub/Sub pushed the same message twice (more than twice from some other topics) to its subscribers. Looking at the logs, I can see the pushes happened with a gap of just 1 second. Ideally, since we have configured the ackDeadline to 600 seconds, Pub/Sub should re-attempt message delivery only after 600 seconds.
I need answers to the following:
Why was the same message delivered more than once within 1 second?
Does Pub/Sub not honor the ackDeadline configuration before reattempting message delivery?
References:
- https://cloud.google.com/pubsub/docs/subscriber
Message redelivery can happen for a couple of reasons. First of all, it is possible that a message got published twice. Sometimes the publisher will get back an error like a deadline exceeded, meaning the publish took longer than anticipated. The message may or may not have actually been published in this situation. Often, the correct action is for the publisher to retry the publish and in fact that is what the Google-provided client libraries do by default. Consequently, there may be two copies of the message that were successfully published, even though the client only got confirmation for one of them.
Secondly, Google Cloud Pub/Sub guarantees at-least-once delivery. This means that occasionally, messages can be redelivered, even if the ackDeadline has not yet passed or an ack was sent back to the service. Acknowledgements are best effort and most of the time, they are successfully processed by the service. However, due to network glitches, server restarts, and other regular occurrences of that nature, sometimes the acknowledgements sent by the subscriber will not be processed, resulting in message redelivery.
A subscriber should be designed to be resilient to these occasional redeliveries, generally by ensuring that operations are idempotent, i.e., that the results of processing the message multiple times are the same, or by tracking and catching duplicates. Alternatively, one can use Cloud Dataflow as a subscriber to remove duplicates.
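As an illustration, a push endpoint can approximate idempotency by deduplicating on the Pub/Sub message ID before doing any work. This is only a sketch: it uses Flask and an in-memory set, whereas a real subscriber would use a durable store (database or cache) because duplicates can arrive on different instances, and handle() is a hypothetical business function:

import base64
from flask import Flask, request

app = Flask(__name__)
processed_ids = set()  # illustration only; use a durable store in production

@app.route("/pubsub/push", methods=["POST"])
def pubsub_push():
    envelope = request.get_json()
    message = envelope["message"]
    message_id = message["messageId"]

    if message_id in processed_ids:
        return "", 204  # duplicate delivery: acknowledge without reprocessing

    data = base64.b64decode(message.get("data", ""))
    handle(data)  # hypothetical business logic
    processed_ids.add(message_id)
    return "", 204  # any 2xx response acknowledges the pushed message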
Is there a way in a push subscription configuration to limit the maximum number of outstanding messages? In the high-level subscriber docs (https://cloud.google.com/pubsub/docs/push) it says "With slow-start, Google Cloud Pub/Sub starts by sending a single message at a time, and doubles up with each successful delivery, until it reaches the maximum number of concurrent messages outstanding." I want to be able to limit the maximum number of messages being processed; can this be done through the Pub/Sub config?
I've also thought of a number of other ways to effectively achieve this, but none seem great:
Have some semaphore type system implemented in my push endpoint that returns a 429 once my max concurrency level is hit?
Similar, but have it deregister the push endpoint (turning it into a pull subscription) until the current messages have been processed
My push endpoints are all on GAE, so there could also be something in the GAE configs to limit the simultaneous push subscription requests?
Push subscriptions do not offer any way to limit the number of outstanding messages. If one wants that level of control, then it is necessary to use pull subscriptions and flow control.
Returning 429 errors as a means to limit outstanding messages may have undesirable side effects. On errors, Cloud Pub/Sub will reduce the rate of sending messages to a push subscriber. If a sufficient number of 429 errors are returned, it is entirely possible that the subscriber will receive a smaller number of messages than it can handle for a time while Cloud Pub/Sub ramps the delivery rate back up.
Switching from push to pull is a possibility, though it still may not be a good solution. It would really depend on the frequency with which the push subscriber exceeds the desired number of outstanding messages. The change from push to pull and back may not take place instantaneously, meaning the subscriber could still exceed the desired limit for some period of time and may also experience a delay in receiving new messages when switching back to a push subscriber.
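For reference, this is roughly what flow control looks like with a pull subscription and the Python client library; max_messages caps how many messages the client holds outstanding at once (resource names and handle() are placeholders):

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# Placeholder project and subscription names.
subscription_path = subscriber.subscription_path("my-project", "my-subscription")

# Never hold more than 10 messages outstanding in this client at a time.
flow_control = pubsub_v1.types.FlowControl(max_messages=10)

def callback(message):
    handle(message.data)  # hypothetical processing
    message.ack()

streaming_pull_future = subscriber.subscribe(
    subscription_path, callback=callback, flow_control=flow_control
)
streaming_pull_future.result()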