Cannot achieve 1000 concurrent function requests with Google Cloud Functions

I am trying to run an RNA-Seq application with Google Cloud Functions. To run this application I need to be able to have over 800 functions running concurrently. This has been achieved using AWS Lambda, but I have not been able to do this on Google Cloud Functions.
When I attempt to make hundreds of concurrent requests against a basic function with the default HTTP trigger, I start getting tons of 500 errors:
<html><head>
<meta http-equiv="content-type" content="text/html;charset=utf-8">
<title>500 Server Error</title>
</head>
<body text=#000000 bgcolor=#ffffff>
<h1>Error: Server Error</h1>
<h2>The server encountered an error and could not complete your request.<p>Please try again in 30 seconds.</h2>
<h2></h2>
</body></html>
If I check the logs for my function, I see no error messages! The cloud console makes it seem like everything is perfect even though my requests are failing.
How should I go about diagnosing this problem? It looks like something is wrong on Google's end, as my code works fine when requests do go through. Does Google limit the number of HTTP requests you can make?
Any help would be really appreciated.

The scaling limits for HTTP-triggered functions are different from those for background functions. Please read the documentation about scalability to be clear on the limits.
Background functions will scale gradually up to 1000 concurrent invocations. Since you're writing an HTTP function, this does not apply.
For HTTP functions, note that rates are limited by the amount of outbound network bandwidth generated by the function (among other things). You will have to take a close look at what your function is actually doing to figure out if it's exceeding the documented rate limits.
If you can limit what the function is doing internally so that it meets the scalability limits, one thing you can try is to shard your functions. Instead of one HTTP function, create two, and split traffic between them as shown below. The stated limits are per-function (not per-project), so you should be able to handle more load that way.
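Something like this minimal Python sketch illustrates client-side sharding; the function URLs and the sample_id routing key are hypothetical:

    import hashlib
    import requests

    # Hypothetical URLs of two identical deployments of the same function.
    FUNCTION_URLS = [
        "https://us-central1-my-project.cloudfunctions.net/process-shard-0",
        "https://us-central1-my-project.cloudfunctions.net/process-shard-1",
    ]

    def call_sharded(sample_id, payload):
        # A stable hash of the work item's ID picks the shard, so load is
        # split roughly evenly and a given sample always hits the same function.
        shard = int(hashlib.md5(sample_id.encode()).hexdigest(), 16) % len(FUNCTION_URLS)
        return requests.post(FUNCTION_URLS[shard], json=payload, timeout=60)

Because the routing is deterministic, you can add more shards later simply by deploying more copies of the function and extending the list.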

Related

Elasticsearch 403 Request throttled due to too many requests /_bulk

I am trying to sync 1 million records to ES using the bulk API in batches of 2k.
But after inserting around 25k-32k documents, Elasticsearch starts returning the following exception:
Unable to parse response body: org.elasticsearch.ElasticsearchStatusException
ElasticsearchStatusException[Unable to parse response body]; nested: ResponseException[method [POST], host [**********], URI [/_bulk?timeout=1m], status line [HTTP/1.1 403 Request throttled due to too many requests]
403 Request throttled due to too many requests /_bulk]; nested: ResponseException[method [POST], host [************], URI [/_bulk?timeout=1m], status line [HTTP/1.1 403 Request throttled due to too many requests]
403 Request throttled due to too many requests /_bulk];
I am using AWS Elasticsearch.
I think I need to implement a wait strategy to handle this, something like repeatedly checking the ES status and only calling bulk insert when ES reports it is okay.
But I am not sure how to implement it. Does ES offer anything pre-built for this?
Or is there a better way to handle this?
Thanks in advance.
Update:
I am using AWS Elasticsearch version 6.8.
Thanks @dravit for including my previous SO answer in the comments. After following the comments, it seems the OP wants to improve the performance of bulk indexing and wants exponential backoff, which I don't think Elasticsearch provides out of the box.
I see that you are putting a fixed pause of 1 second between batches, which will not work in all cases, and if you have a large number of batches and documents to be indexed it will certainly take a lot of time. Here are a few more suggestions from my side to improve performance.
Follow my tips to improve reindexing speed in Elasticsearch, see which of the things listed there apply to you, and measure by what factor doing them improves speed.
Find a batching strategy which best suits your environment. I am not sure, but this article from @spinscale, a developer of the Java high-level REST client, might help, or you can ask a question on https://discuss.elastic.co/; I remember he shared a very good batching strategy in one of his webinars but couldn't find the link to it.
Watch various ES metrics apart from the bulk thread pool and queue size, and if your ES still has capacity, increase the queue size and the rate at which you send requests to ES.
Check the error handling guide here:
If you receive persistent 403 Request throttled due to too many requests or 429 Too Many Requests errors, consider scaling vertically. Amazon Elasticsearch Service throttles requests if the payload would cause memory usage to exceed the maximum size of the Java heap.
Scale your application vertically or increase the delay between requests.
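If you do end up implementing the wait strategy in application code, something like this minimal Python sketch (using the elasticsearch-py client; the endpoint, retry count, and delays are placeholders) shows the backoff idea:

    import time
    from elasticsearch import Elasticsearch, TransportError

    es = Elasticsearch(["https://my-domain.es.amazonaws.com:443"])  # placeholder endpoint

    def bulk_with_backoff(actions, max_retries=5, delay=1.0):
        for attempt in range(max_retries):
            try:
                return es.bulk(body=actions)
            except TransportError as err:
                # Retry only throttling responses (403 on AWS ES, 429 elsewhere).
                if err.status_code not in (403, 429):
                    raise
                time.sleep(delay)
                delay *= 2  # double the wait after each throttled attempt
        raise RuntimeError("bulk request still throttled after retries")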

How to handle long requests in Google Cloud Run?

I have hosted my Node app on Cloud Run, and most of my requests are served within 300-600 ms. But one endpoint gets its data from a third-party service, so that request takes 1.2 s-2.5 s to complete.
My doubts regarding this are:
Are 1.2 s-2.5 s requests suitable for Cloud Run? Or is there any rule that requests should be completed within xx ms?
Also see the screenshot; I got a message along with the request in the logs: "The request caused a new container instance to be started and may thus take longer and use more CPU than a typical request"
What caused a new container instance to be started?
Is there any alternative or work around to handle long requests?
Any advice / suggestions would be greatly appreciated.
Thanks in advance.
I don't think that will be an issue unless you're worried about the cost of the CPU/memory time, which honestly should only matter if you're getting 10k+ requests/day. So it probably doesn't matter, and Cloud Run can handle that just fine (my own app serves requests longer than that with no problem).
It's possible that your service was "scaled to zero", meaning that there were no containers left running to serve requests. In that case, a new instance has to be started, and the request waits for whatever initialization/startup costs are associated with that process. It's also possible that a new instance was auto-scaled because all other instances were at their request limits. Make sure that your setting for maximum concurrent requests per instance is set greater than one; Node/Express can handle multiple requests at once. Plus, you'll only get charged for the total time spent, not per request.
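For example, running gcloud run services update my-service --concurrency 80 raises the per-instance concurrency (the service name is a placeholder and 80 is just an illustrative value).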
In situations where you get very long (30 seconds, minutes+) operations, it may be a good idea to switch to some different data transfer method. You could use polling, where the client makes a request every 5 seconds and checks if the response is ready. You could also switch to some kind of push-based system like WebSockets, but Cloud Run doesn't have support for that.
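As a rough sketch of the polling approach in Python (the /jobs endpoints and the Cloud Run URL are hypothetical; your service would need to expose something similar):

    import time
    import requests

    BASE = "https://my-service-abc123-uc.a.run.app"  # placeholder Cloud Run URL

    # Kick off the long-running job, then poll until the result is ready.
    job = requests.post(BASE + "/jobs", json={"input": "..."}).json()
    while True:
        status = requests.get(BASE + "/jobs/" + job["id"]).json()
        if status["state"] == "done":
            print(status["result"])
            break
        time.sleep(5)  # poll every 5 seconds, as described above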
TL;DR: longer requests (~10-30 seconds) should be fine unless you're worried about the cost of the increased compute time they may incur at scale.

Error code 503 in GCP pubsub.v1.Subscriber.StreamingPull

I am trying to utilize the Pub/Sub service, and I noticed the following error code in my dashboard.
Here is a link to what code 503 is.
Is there anything that would allow me to prevent it?
-Askar
As explained in the documentation link about Error Codes that you shared, the HTTP code 503 ("UNAVAILABLE") is returned when the Pub/Sub service was not able to process a request. In general, these types of errors tend to be transient, and there's no way to avoid them entirely; you can only work around them by following a retry strategy such as the one I will describe shortly.
The Google Cloud Pub/Sub SLA shows the guaranteed uptime for this service. As you can see, it is not 100%, as transient errors may happen. These should not disturb your service greatly, provided that you follow the recommended practice of implementing a retry strategy with exponential backoff.
This documentation page shows an example implementation of an exponential backoff retry strategy. The example is for Google Cloud Storage, but it can (and should) be applied to any other similar service. It consists of retrying the failed Pub/Sub requests with an increasing backoff in order to increase the probability of a request being successful. This is a recommended best practice and the recommended approach to overcome transient issues.
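As a minimal sketch of that strategy with the Python Pub/Sub client (the project and subscription IDs are placeholders, and the backoff numbers are only illustrative):

    from google.api_core import retry
    from google.api_core.exceptions import ServiceUnavailable
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "my-subscription")

    # Retry 503 UNAVAILABLE with exponential backoff: start at 1 s, double
    # each time up to 60 s, and give up after 5 minutes overall.
    backoff = retry.Retry(
        predicate=retry.if_exception_type(ServiceUnavailable),
        initial=1.0,
        multiplier=2.0,
        maximum=60.0,
        deadline=300.0,
    )

    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 10},
        retry=backoff,
    )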
StreamingPull has a 100% error rate.
StreamingPull streams are always terminated with a non-OK status (HTTP 503). Note that, unlike in regular RPCs, the status here is simply an indication that the stream has been broken, not that requests are failing.
https://cloud.google.com/pubsub/docs/pull#streamingpull_has_a_100_error_rate_this_is_to_be_expected

Authentication with Cognito - where to find logs

We have two React Native apps that use AWS Cognito for authentication, via the react-native-aws-cognito-js library. The apps were working fine until the last two days; they are now experiencing intermittent "Internal Server Error" responses.
How can I find more information about this error? Is there any tool that can help us pinpoint the cause?
Update
From CloudTrail, each API call has a "CreateNetworkInterface" event. Many of these API calls have the error code "Client.NetworkInterfaceLimitExceeded". What is the cause, and what is the solution?
According to this AWS doc (in Chinese), CloudWatch will not write to the log when the error is due to insufficient IPs/ENIs. That explains the increase in error count with no corresponding logs in CloudWatch.
Update 2
We have found a scheduled Lambda job which may have exhausted the available IP addresses. We stopped the batch job, but we still can't have many users log in to the server due to the "Client.NetworkInterfaceLimitExceeded" error. I realized that there are many "CreateNetworkInterface" events and few "DeleteNetworkInterface" events. How can I "clean up / reset" all network interfaces in the VPC?
Short answer: CloudTrail.
Long answer with a suggestion:
Assuming your application code is fine, the most likely cause of your 500 errors is Cognito's default limits (e.g., number of calls per user): https://docs.aws.amazon.com/cognito/latest/developerguide/limits.html.
AWS suggests using CloudTrail for logging API calls.
However, to prove the limits are the cause, I would suggest first adding some logging around the API calls yourself; then, in development, you could hit your app/API with a high number of calls, and most likely you will see the 500 error once you exceed the limits.
You could do the following in the terminal:
for i in `seq 1 1000`; do curl --cookie SecureCookie=TokenValueFromAWS http://localhost:desirablePort/SecuredPath; done

Google Places API error 502 - The server encountered a temporary error

We run a website that obtains location data through the Google Places API. We have 150k daily searches available, which we haven't reached yet as the website has only been live for a few weeks. We have suddenly started receiving a 502 error. A notification in the Console says: "The server encountered a temporary error and could not complete your request." Is this a temporary error? Are there any suggestions on what we can do? The website hasn't been available for 40 minutes.
When you receive a 5xx status or UNKNOWN_ERROR in the response, you should implement retrying logic. Google has the following recommendation in their web services documentation:
In rare cases something may go wrong serving your request; you may receive a 4XX or 5XX HTTP response code, or the TCP connection may simply fail somewhere between your client and Google's server. Often it is worthwhile re-trying the request as the followup request may succeed when the original failed. However, it is important not to simply loop repeatedly making requests to Google's servers. This looping behavior can overload the network between your client and Google causing problems for many parties.
A better approach is to retry with increasing delays between attempts. Usually the delay is increased by a multiplicative factor with each attempt, an approach known as Exponential Backoff.
https://developers.google.com/maps/documentation/directions/web-service-best-practices#exponential-backoff
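A minimal Python sketch of that pattern (the request parameters are placeholders and the delays are only illustrative):

    import random
    import time
    import requests

    URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

    def fetch_with_backoff(params, max_attempts=6):
        for attempt in range(max_attempts):
            resp = requests.get(URL, params=params)
            if resp.status_code < 500:
                return resp  # success, or a 4xx that retrying won't fix
            # On 5xx, wait 1, 2, 4, 8, ... seconds plus a little random jitter.
            time.sleep(2 ** attempt + random.random())
        raise RuntimeError("server kept returning 5xx after retries")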
However, if retrying logic with exponential backoff doesn't help and the error persists for a long time, you should file a bug in Google's issue tracker.
I hope this addresses your doubt!
UPDATE
There was an issue on Google's side yesterday (November 6, 2017); you can refer to the following bug, which explains the issue:
https://issuetracker.google.com/issues/68938173