I have enabled dead lettering and set the maximum delivery attempts for my pull subscription. I have verified that the delivery attempt count increases as it is supposed to when I do not acknowledge the pulled message.
However, if I do not pull the message for some time (around 10 minutes, it seems), the delivery attempt count appears to go back to 1, i.e., it gets reset...
Is this the expected behavior or is this only happening to me?
In the docs there is nothing regarding a reset after a specific period of time: https://cloud.google.com/pubsub/docs/handling-failures
Delivery attempts are tracked on a best-effort basis and are not persisted, so yes, it is possible that if a message is not retrieved for a while, the count will be reset to zero.
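To make that counter concrete: in the Node.js client it surfaces as the message's deliveryAttempt property. A minimal sketch, assuming a hypothetical subscription my-dead-letter-sub that has a dead-letter policy attached:

```typescript
import { PubSub, Message } from '@google-cloud/pubsub';

const pubsub = new PubSub();
// Hypothetical subscription; deliveryAttempt is only populated when the
// subscription has a dead-letter policy attached.
const subscription = pubsub.subscription('my-dead-letter-sub');

subscription.on('message', (message: Message) => {
  // Treat this as a best-effort hint: as noted above, it can reset if
  // the message is not retrieved for a while.
  console.log(`delivery attempt ${message.deliveryAttempt} for ${message.id}`);
  message.nack(); // leaving it unacked lets you watch the counter climb
});
```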
Related
Today I experienced something I found rather interesting.
I had a batch of unacknowledged messages that were all published within the same second, and for an expected reason, one of these messages was left unacknowledged. However, the remaining messages kept being redelivered even though they were processed and acknowledged successfully.
Why does this happen? Is this expected behavior? The messages did not have an ordering key, nor was message ordering enabled on the given subscription.
Also, I even attempted to ACK these messages manually in the Google Cloud console, but it did not seem to do anything: when I pulled after ACKing, the same messages showed up.
You are probably running into the case described in the note in the "dealing with duplicates" section of the documentation. If messages are batched together, all messages in the batch must be acknowledged or the entire batch of messages may be redelivered. This means that if 100 messages were batched together in a single publish request and 99 of them are acked, but 1 is not acked, all 100 may be redelivered. There are some efforts to avoid this duplicate delivery as much as possible in the service, but it is not guaranteed.
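To see where that batching happens: publish-side batching is configurable in the client libraries, and shrinking the batch limits how many messages can share this all-or-nothing redelivery. A sketch with the Node.js client (topic name hypothetical):

```typescript
import { PubSub } from '@google-cloud/pubsub';

const pubsub = new PubSub();
const topic = pubsub.topic('my-topic'); // hypothetical topic name

// Smaller batches mean fewer messages share the all-or-nothing
// redelivery behavior described above, at the cost of more requests.
topic.setPublishOptions({
  batching: {
    maxMessages: 10,      // flush after at most 10 messages...
    maxMilliseconds: 100, // ...or after 100 ms, whichever comes first
  },
});

async function publishOne(): Promise<void> {
  await topic.publishMessage({ data: Buffer.from('hello') });
}
```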
We have set up alerts in our GCP environments. Basically, GCP Stackdriver raises alerts based on parameters we configured (both at the infrastructure level and the application level).
The issue is that we get too many alerts while a problem remains unresolved. For example, if a Compute Engine instance is down and we are still investigating, we keep receiving alerts. I am looking for a way to reduce alert noise so that once we acknowledge an issue, the alert frequency drops until we resolve it (say, one email every three hours rather than one every 10 minutes, or none after the problem is fixed).
Posting this as an answer for better usability.
When the alert is triggered, you will receive notifications every 10 minutes or so until you acknowledge the incident.
Once you do, the notifications will stop, but the incident will stay open until you close it.
You can also silence the incident; however, be aware that this will also close other incidents that were triggered by the same condition as this one.
You may also have a look at the alerting behavior docs since they may prove useful in such cases.
We (as a company) experience large traffic spikes every day. We use a Pub/Sub -> Cloud Run combination.
The issue is that when high traffic hits, Pub/Sub tries to push messages to Cloud Run all at once, without any flow control. The result?
429: The request was aborted because there was no available instance.
Although this is marked as a warning, every 4xx HTTP response causes the message to be redelivered.
Messages therefore go back to the queue and wait. If a message is redelivered while all instances are still busy, Cloud Run returns 429 again and the message is sent back to the queue. This repeats up to the value we set in Maximum delivery attempts; after that, the message goes to the dead-letter queue.
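For reference, here is a sketch of where these knobs live when a subscription is created with the Node.js client (all names hypothetical); the retryPolicy is the piece that spaces out redeliveries with backoff instead of retrying immediately:

```typescript
import { PubSub } from '@google-cloud/pubsub';

async function createPushSubscription(): Promise<void> {
  const pubsub = new PubSub();
  // All names are hypothetical.
  await pubsub.createSubscription('ingest-topic', 'cloud-run-push-sub', {
    pushConfig: { pushEndpoint: 'https://my-service.a.run.app/' },
    deadLetterPolicy: {
      deadLetterTopic: 'projects/my-project/topics/ingest-dead-letter',
      maxDeliveryAttempts: 5, // the "Maximum delivery attempts" knob
    },
    retryPolicy: {
      minimumBackoff: { seconds: 10 },  // space out redeliveries...
      maximumBackoff: { seconds: 600 }, // ...with exponential backoff
    },
  });
}
```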
We want to avoid this round-tripping and ideally not get any 429s at all, so messages don't travel back and forth and don't end up in the dead-letter subscription: a 429 is not one of the application errors we want to keep there, but rather a warning caused by Pub/Sub not controlling the flow and coordinating with Cloud Run.
Neither Pub/Sub nor a push subscription (which we have to use for Cloud Run) has any flow-control feature.
Is there any way to control how many messages are sent to Cloud Run so we can avoid the 429 responses? And why does Pub/Sub even try to deliver when Cloud Run has clearly hit its instance limit? The best behavior would be to keep the messages queued until instances free up.
Most answers would probably suggest increasing the instance limit. We have already set it to 1000, and raising it does not scale: even at a limit of 1500, a big enough spike would push us past it and we would see the 429s again.
The only option I can think of is some form of flow control, as sketched below. So far we have read about Cloud Tasks, but we are not sure whether it can help us. Ideally we don't want to introduce a new service, but if necessary, we will.
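To illustrate the kind of flow control we mean: the pull-based client libraries do expose it. A sketch with the Node.js client of a pull worker that caps in-flight messages so the backlog waits in Pub/Sub (names and limits hypothetical, and not a drop-in fix, since it bypasses push delivery):

```typescript
import { PubSub, Message } from '@google-cloud/pubsub';

const pubsub = new PubSub();
// Subscription name and limits are hypothetical.
const subscription = pubsub.subscription('spiky-sub', {
  flowControl: {
    maxMessages: 50,            // at most 50 unacked messages in flight
    allowExcessMessages: false, // do not pre-buffer beyond the limit
  },
});

subscription.on('message', async (message: Message) => {
  await forwardToService(message); // e.g., an HTTP call we pace ourselves
  message.ack();
});

async function forwardToService(message: Message): Promise<void> {
  // hypothetical: call the downstream service at a rate it can absorb
}
```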
Thank you for all your tips and time! :)
There is an issue with my company's Pub/Sub. Some of our messages are stuck and the oldest unacked message age is increasing over time.
(1-day charts omitted.) In Metrics Explorer, I also looked at the one-week chart for the "Expired ack deadlines count" metric (chart omitted).
I decided to find out why these messages are stuck, but when I ran the pull command (gcloud pubsub subscriptions pull), I got a "Listed 0 items" response, so it is not possible to see them.
Is there a way to figure out why some of the messages are shown as unacknowledged?
Also, the unacked message count has shown roughly the same number (around 2k messages) for the whole month, even though new messages are published every day.
(The parameters we use for this subscription are omitted here.)
I tried to fix this by setting the acknowledgement deadline to 600 seconds, but it didn't help.
Additionally, I want to mention that we use the Node.js Pub/Sub client library to handle the messages.
The most common causes of messages not being pullable are:
1. The subscriber client already received the messages and "forgot" about them, perhaps due to an exception being thrown and not handled (see the sketch below). In this case, the message will continue to be leased by the client until the deadline passes. The client libraries all extend the lease automatically until the maxExtension time is reached. If these messages are always forgotten, they may be redelivered to the subscriber and forgotten again, resulting in them never being pullable via the gcloud command-line tool or UI.
2. There could be a rogue subscriber. Another subscriber may be running somewhere for the same subscription and "stealing" these messages. Sometimes this is a test job or something that was used early on to check that the subscription works as expected and was never turned down.
3. You could be falling into the case of a large backlog of small messages. This should be fixed in more recent versions of the client library (v2.3.0 of the Node client has the fix).
4. The gcloud pubsub subscriptions pull command and UI are not guaranteed to return messages, even if there are some available to pull. Sometimes, rerunning the command multiple times in quick succession helps to pull messages.
The fact that you see expired ack deadlines likely points to 1, 2, or 3, so it is worth checking for those things. Otherwise, you should open a support case so the engineers can look more specifically at the backlog and determine where the messages are.
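To illustrate cause 1: if the message handler throws and neither acks nor nacks, the client keeps extending the lease and the message looks stuck. Wrapping the work in a try/catch and nacking on failure hands the message back to Pub/Sub. A sketch with the Node.js client (subscription name and processing logic are hypothetical):

```typescript
import { PubSub, Message } from '@google-cloud/pubsub';

const pubsub = new PubSub();
// Subscription name is hypothetical.
const subscription = pubsub.subscription('stuck-sub');

subscription.on('message', async (message: Message) => {
  try {
    await doWork(message); // stand-in for real processing; may throw
    message.ack();
  } catch (err) {
    // Without this catch, the message would be neither acked nor nacked:
    // the client would keep extending its lease and it would look "stuck".
    console.error('processing failed, nacking for redelivery', err);
    message.nack();
  }
});

async function doWork(message: Message): Promise<void> {
  // hypothetical processing logic
}
```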
Background:
We configured a Cloud Pub/Sub topic to interact with multiple App Engine services.
There we have configured push-based subscribers, with the acknowledgement deadline set to 600 seconds.
Issue:
We have observed that Pub/Sub pushed the same message twice (and more than twice from some other topics) to its subscribers. Looking at the logs, I can see the pushes happened just 1 second apart. Since we have configured the ackDeadline to 600 seconds, Pub/Sub should ideally re-attempt message delivery only after 600 seconds.
We need answers to the following:
1. Why was the same message delivered more than once, only 1 second apart?
2. Does Pub/Sub not honor the ackDeadline configuration before reattempting message delivery?
References:
- https://cloud.google.com/pubsub/docs/subscriber
Message redelivery can happen for a couple of reasons. First of all, it is possible that a message got published twice. Sometimes the publisher will get back an error like a deadline exceeded, meaning the publish took longer than anticipated. The message may or may not have actually been published in this situation. Often, the correct action is for the publisher to retry the publish and in fact that is what the Google-provided client libraries do by default. Consequently, there may be two copies of the message that were successfully published, even though the client only got confirmation for one of them.
Secondly, Google Cloud Pub/Sub guarantees at-least-once delivery. This means that occasionally, messages can be redelivered, even if the ackDeadline has not yet passed or an ack was sent back to the service. Acknowledgements are best effort and most of the time, they are successfully processed by the service. However, due to network glitches, server restarts, and other regular occurrences of that nature, sometimes the acknowledgements sent by the subscriber will not be processed, resulting in message redelivery.
A subscriber should be designed to be resilient to these occasional redeliveries, generally by ensuring that operations are idempotent, i.e., that the results of processing a message multiple times are the same, or by tracking and discarding duplicates. Alternatively, one can use Cloud Dataflow as a subscriber to remove duplicates.
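To illustrate the idempotency approach, here is a minimal sketch with the Node.js client. The dedupKey attribute is hypothetical (any stable application-level key works), and the in-memory Set stands in for the durable store you would use in production:

```typescript
import { PubSub, Message } from '@google-cloud/pubsub';

const pubsub = new PubSub();
const subscription = pubsub.subscription('orders-sub'); // hypothetical

// In production this should be a durable store (e.g., a table with a
// unique constraint); an in-memory Set only illustrates the idea.
const seen = new Set<string>();

subscription.on('message', (message: Message) => {
  // 'dedupKey' is a hypothetical publisher-set attribute; falling back
  // to message.id catches service-side redeliveries but not duplicate
  // publishes, which get distinct IDs.
  const key = message.attributes['dedupKey'] ?? message.id;
  if (seen.has(key)) {
    message.ack(); // already processed: ack without redoing the work
    return;
  }
  seen.add(key);
  handleOrder(message); // stand-in for real, run-once processing
  message.ack();
});

function handleOrder(message: Message): void {
  // hypothetical idempotent processing
}
```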