Error code 503 in GCP pubsub.v1.Subscriber.StreamingPull - google-cloud-platform

I am trying to use the Pub/Sub service and I noticed the following error code in my dashboard.
Here is a link explaining what code 503 is.
Is there anything that allows me to prevent it?
-Askar

As explained in the Error Codes documentation that you linked, HTTP code 503 ("UNAVAILABLE") is returned when the Pub/Sub service was not able to process a request. These errors tend to be transient and there is no way to avoid them; you can only work around them with a retry strategy such as the one described below.
The Google Cloud Pub/Sub SLA shows the guaranteed uptime for this service. As you can see, it is not 100%, because transient errors may happen. They should not disturb your service greatly, provided that you follow the recommended practice of implementing a retry strategy with exponential backoff.
This documentation page shows an example implementation of an exponential backoff retry strategy. The example is for Google Cloud Storage, but it can (and should) be applied to any similar service. It consists of retrying the failed Pub/Sub requests with an increasing delay between attempts, which raises the probability that a request eventually succeeds. This is the recommended approach for overcoming transient issues.
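As a rough illustration of the retry strategy described above, here is a minimal Python sketch. It is not taken from the linked documentation: the TransientError class, the retry_with_backoff helper, and all default values are hypothetical names and numbers chosen for the example, and the Pub/Sub client libraries may already retry some calls for you.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable failure such as HTTP 503."""


def retry_with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=32.0,
                       sleep=time.sleep):
    """Run call(); on TransientError wait 1s, 2s, 4s, ... (plus jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            # Double the delay on each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Up to one second of random jitter spreads out retries from
            # many clients that failed at the same moment.
            sleep(delay + random.uniform(0, 1))
```

You would wrap the failing request in a function that raises TransientError on a 503 and pass it as call. The sleep parameter exists only so the waiting can be replaced in tests.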

StreamingPull has a 100% error rate.
StreamingPull streams are always terminated with a non-OK status (HTTP 503). Note that, unlike in regular RPCs, the status here is simply an indication that the stream has been broken, not that requests are failing.
https://cloud.google.com/pubsub/docs/pull#streamingpull_has_a_100_error_rate_this_is_to_be_expected

Related

GCP PubSub: "The request was aborted because there was no available instance." - Doesn't Retry on Failure

We have a pubsub subscription setup passing requests to a Google Cloud Function.
Both the cloud function and the subscription to it are set to "Retry on Failure" (both with exponential back-off policies fwiw).
The Google Cloud Function is limited to 40 concurrent instances.
When the subscription queue is larger than the number of available instances, the expected behaviour is that delivery fails and is retried later.
What seems to be happening instead is that the logs fill up with messages saying:
{
  "textPayload": "The request was aborted because there was no available instance.",
  "insertId": "6109fbbb0007ec4aaa3855a9",
  ...
}
And the subscription messages are just dropped and not retried.
Is this the expected behaviour? It seems crazy to me but if so, what architecture should you put in place to catch these dropped messages?
Edit: These issues started showing up in our logs on July 5 2021 and can't be found in logs before that date. Before that, the pubsub/gcf combo used to work as expected.
The error you are encountering is a known issue, and updates can be tracked through this Issue Tracker. You can also star the issue at this link to receive automatic updates and give it traction. The tracker also discusses workarounds to mitigate the request aborts. Since you have already implemented retries with exponential backoff, please take a look at the other solutions provided here.
If your concern is about Google Cloud Functions scalability, or if these errors need further investigation, please reach out to GCP support if you have a support plan. Otherwise, please open an issue in the issue tracker.

Elastic search 403 Request throttled due to too many requests /_bulk

I am trying to sync 1 million records to ES, and I am doing it using the bulk API in batches of 2k.
But after inserting around 25k-32k records, Elasticsearch throws the following exception.
Unable to parse response body: org.elasticsearch.ElasticsearchStatusException
ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [**********], URI [/_bulk?timeout=1m], status line [HTTP/1.1 403 Request throttled due to too many requests]
403 Request throttled due to too many requests /_bulk]; nested: ResponseException[method [POST], host [************], URI [/_bulk?timeout=1m], status line [HTTP/1.1 403 Request throttled due to too many requests]
403 Request throttled due to too many requests /_bulk];
I am using AWS Elasticsearch.
I think I need to implement a wait strategy to handle this, something like repeatedly checking the ES status and calling bulk insert only when ES reports that everything is okay.
But I am not sure how to implement it. Does ES offer anything pre-built for this?
Or is there a better way to handle it?
Thanks in advance.
Update:
I am using AWS Elasticsearch version 6.8
Thanks @dravit for including my previous SO answer in the comments. From the comments, it seems the OP wants to improve the performance of bulk indexing and wants exponential backoff, which I don't think Elasticsearch provides out of the box.
I see that you are putting in a pause of 1 second after every second, which will not work in all cases, and if you have a large number of batches and documents to index, it will certainly take a lot of time. Here are a few more suggestions from my side to improve the performance.
Follow my tips to improve the reindex speed in Elasticsearch, see which of the items listed there apply to you, and measure by what factor each one improves the speed.
Find the batching strategy that best suits your environment. I am not sure, but this article from @spinscale, who is the developer of the Java high-level REST client, might help, or you can ask a question on https://discuss.elastic.co/. I remember he shared a very good batching strategy in one of his webinars, but I couldn't find the link to it.
Monitor the various ES metrics apart from the bulk thread pool and queue size. If your ES cluster still has capacity, see whether you can increase the queue size and the rate at which you send requests to ES.
Check the error handling guide here:
If you receive persistent 403 Request throttled due to too many requests or 429 Too Many Requests errors, consider scaling vertically. Amazon Elasticsearch Service throttles requests if the payload would cause memory usage to exceed the maximum size of the Java heap.
In short: scale vertically or increase the delay between requests.
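To illustrate the "increase the delay between requests" option, here is a minimal Python sketch of a bulk loop that backs off only when a batch is throttled, instead of pausing a fixed amount every time. It does not use any Elasticsearch client API: send_batch is a hypothetical callable that you would implement on top of your own client and that raises the (also hypothetical) Throttled exception when the response status is 403 or 429. The batch size and delay values are assumptions for illustration.

```python
import time


class Throttled(Exception):
    """Hypothetical: raised by send_batch on a 403/429 throttling response."""


def chunks(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def index_all(docs, send_batch, batch_size=2000, initial_delay=1.0,
              max_delay=60.0, max_retries=8, sleep=time.sleep):
    """Send docs in batches; when throttled, wait and retry the same batch."""
    sent = 0
    for batch in chunks(docs, batch_size):
        delay = initial_delay
        for attempt in range(max_retries + 1):
            try:
                send_batch(batch)
                break  # batch accepted, move on to the next one
            except Throttled:
                if attempt == max_retries:
                    raise  # persistent throttling: consider scaling instead
                sleep(delay)
                delay = min(max_delay, delay * 2)  # wait longer each time
        sent += len(batch)
    return sent
```

Because the delay starts over for every batch, the loop runs at full speed while the cluster keeps up and slows down only while it is being throttled. A real bulk response can also report failures for individual documents, which this sketch does not handle.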

How do I handle Google DLP rate limiting when using the Java library?

At one point when doing some testing with the Google DLP Java library, I got an exception that indicated that I had exceeded the API rate limit. Unfortunately I don't have the stack trace anymore, so I can't give any more detail at this point. However, it made me realize that I'm not handling that situation in the code. What is the recommended way of dealing with this from a Java application? I haven't seen any examples in the GitHub repo that give any guidance on this. I'm aware of the ability to request quota increases, and I have already put in a request. My question is how to gracefully handle this in the code, should I run into the quota-exceeded situation again. Thanks.
It depends greatly on your design and on where you are making the call from.
Can you afford to retry until it succeeds?
Are users waiting on the response, so that errors are not acceptable?
Is this a batch pipeline working offline where taking longer is okay?
If you never want to hit the error, you'll need to implement your own client-side rate throttling, with accompanying monitoring to ensure that you know when it's time to request more quota.
If you can retry and wait, try retrying using exponential backoff.
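The client-side throttling mentioned above could look roughly like the token bucket below. It is sketched in Python for brevity (the same structure carries over to Java) and is not part of the DLP library: the RateLimiter class and its parameters are hypothetical, and you would pick rate and capacity from your own quota. The throttled_calls counter is a stand-in for the monitoring part, since a steadily growing value suggests it is time to request more quota.

```python
import time


class RateLimiter:
    """Token bucket: allows about `rate` calls per second, bursts up to `capacity`.

    Not thread-safe; guard acquire() with a lock if several threads share it.
    """

    def __init__(self, rate, capacity, clock=time.monotonic, sleep=time.sleep):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.sleep = sleep
        self.last = clock()
        self.throttled_calls = 0  # how many calls had to wait

    def acquire(self):
        """Block until one token is available, then consume it."""
        waited = False
        while True:
            now = self.clock()
            # Refill in proportion to the time elapsed since the last check.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            if not waited:
                self.throttled_calls += 1
                waited = True
            # Sleep just long enough for the missing fraction of a token.
            self.sleep((1 - self.tokens) / self.rate)
```

Calling limiter.acquire() immediately before each API request keeps the request rate under the configured limit. Combining it with retries and exponential backoff still makes sense, because the server-side quota may be shared with other clients.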

Authentication with Cognito - where to find logs

We have 2 React Native apps that use AWS Cognito for authentication. We use the react-native-aws-cognito-js library in our code. The apps were working fine until two days ago. Since then, they have been experiencing intermittent "Internal Server Error" responses.
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
According to CloudTrail, each API call has a "CreateNetworkInterface" event. Many of these API calls have the error code "Client.NetworkInterfaceLimitExceeded". What is the cause of this, and what is the solution?
According to this AWS doc (in Chinese), CloudWatch does not write to the log when the error is due to insufficient IPs/ENIs. That explains why the number of errors increased while no logs appeared in CloudWatch.
Update 2
We have found a scheduled Lambda job which may have exhausted the IP addresses. We stopped the batch job, but we still can't have many users log in to the server because of the "Client.NetworkInterfaceLimitExceeded" error. I noticed that there are many "CreateNetworkInterface" events and few "DeleteNetworkInterface" events. How can I "clean up / reset" all network interfaces in the VPC?
Short answer: CloudTrail.
Long answer with a suggestion:
Assuming your application code is fine, the most likely cause of your 500 error is Cognito's default limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests using CloudTrail for logging API calls.
However, to confirm that the limits are the cause, I would first add some logging around the API call yourself. In development you could then hit your app/API with a high number of calls, and you will most likely see the 500 error caused by the limits.
You could do the following in the terminal:
for i in $(seq 1 1000); do curl --cookie "SecureCookie=TokenValueFromAWS" "http://localhost:desirablePort/SecuredPath"; done

Google Places API error 502 - The server encountered a temporary error

We run a website that obtains location data through the Google Places API. We have 150k daily searches available, which we haven't reached yet, as the website has only been live for a few weeks. We have suddenly received a 502 error. A notification in the Console says: "The server encountered a temporary error and could not complete your request." Is this a temporary error? Are there any suggestions on what we can do? The website has been unavailable for 40 minutes.
When you receive a 5xx status or UNKNOWN_ERROR in the response, you should implement retry logic. Google gives the following recommendation in their web services documentation:
In rare cases something may go wrong serving your request; you may receive a 4XX or 5XX HTTP response code, or the TCP connection may simply fail somewhere between your client and Google's server. Often it is worthwhile re-trying the request as the followup request may succeed when the original failed. However, it is important not to simply loop repeatedly making requests to Google's servers. This looping behavior can overload the network between your client and Google causing problems for many parties.
A better approach is to retry with increasing delays between attempts. Usually the delay is increased by a multiplicative factor with each attempt, an approach known as Exponential Backoff.
https://developers.google.com/maps/documentation/directions/web-service-best-practices#exponential-backoff
However, if retry logic with exponential backoff doesn't help and the error persists for a long time, you should file a bug in the Google issue tracker.
I hope this addresses your doubt!
UPDATE
There was an issue on Google's side yesterday (November 6, 2017). You can refer to the following bug, which explains the issue:
https://issuetracker.google.com/issues/68938173