Google App Engine (GAE) basic scaling backend instance serves one request and undeploys - flask

I have deployed an application (frontend and backend) in App Engine. First of all, I am using the free tier and I chose the default F1 for the frontend and B2 for the backend. I don't exactly understand the difference between B and F instances but based on their names, I chose them for backend and frontend respectively.
My backend is a Flask application that reads some data from Firestore in @app.before_first_request and "pre-caches" it for all future requests. This takes about 20-30 seconds before the first request is served, so I really don't want the backend instance to be undeployed all the time.
Right now, my backend successfully serves one request (that I am making from the browser) and then immediately gets undeployed (basically I see no active instances in App Engine dashboard after the request is served). This means that every request once again has the same long delay upon server start that I don't want. I am not sure why this is happening because I've set idle timeout to 5 minutes. I know it is not a problem with my Flask application because it does not crash after a request on a local machine and I've done its memory profiling which is in bounds of B2 limits. This is my app.yaml for the backend:
runtime: python38
service: api
env_variables:
  PORT: 8080
instance_class: B2
basic_scaling:
  max_instances: 1
  idle_timeout: 5m
Any insight would be appreciated!

Based on the behavior you describe, both scaling models are behaving as designed:
“Automatic Scaling: It creates instances based on request rate, response latencies, and other application metrics. You can specify thresholds for each of these metrics, and a minimum number of instances to keep running always.
Basic Scaling: Basic scaling creates instances only when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity.”
For reference on these scaling models and the others, see the documentation page How Instances are Managed.
Information added on 10/12/2021:
Hi,
I think the correct term is "shut down" rather than "undeployed" (see Disabling your application).
Instance States says: "an instance of a manual or basic scaled service can be either running or stopped. All instances of the same service and version share the same state."
Scaling types says: "Basic scaling creates instances when your application receives requests. Each instance will be shut down when the application becomes idle. Basic scaling is ideal for work that is intermittent or driven by user activity."
The table's "Startup and shutdown" row for basic scaling says: "Instances are created on demand to handle requests and automatically shut down when idle, based on the idle_timeout configuration parameter. An instance that is manually stopped has 30 seconds to finish handling requests before it is forcibly terminated."
Scaling down says: "You can specify a minimum number of idle instances. Setting an appropriate number of idle instances for your application based on request volume allows your application to serve every request with little latency".
Could you please verify:
that the instance was not manually stopped?
that the instance is actually becoming idle?
that there are no background threads running?
whether the behavior is the same when max_instances is set to 2?
that there are no logs showing an instance shutdown?
that your requests are reaching the version deployed with the updated idle_timeout?
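To make the log check above easier, here is a minimal sketch (not the asker's code) that loads the cache once per process and logs when the process is asked to terminate. It uses only the standard library; `get_cache`, `install_shutdown_logger`, and the `loader` argument are hypothetical names, and it assumes the platform delivers SIGTERM before stopping an instance, which is worth confirming for your runtime. Note also that `before_first_request` was deprecated in Flask 2.2 and removed in 2.3, so loading the data explicitly like this may be needed anyway on newer Flask versions.

```python
import logging
import signal
import threading
import time

log = logging.getLogger("instance-lifecycle")

_UNSET = object()            # sentinel so a loader returning None is still cached
_cache = _UNSET
_cache_lock = threading.Lock()
_started_at = time.monotonic()


def get_cache(loader):
    """Return the cached data, calling `loader` at most once per process."""
    global _cache
    with _cache_lock:
        if _cache is _UNSET:
            begin = time.monotonic()
            _cache = loader()  # e.g. a function that reads from Firestore
            log.warning("cache loaded in %.1fs", time.monotonic() - begin)
        return _cache


def install_shutdown_logger():
    """Log the process uptime when termination is requested.

    Assumption: the platform sends SIGTERM before stopping the instance.
    Must be called from the main thread.
    """
    def _on_sigterm(signum, frame):
        uptime = time.monotonic() - _started_at
        log.warning("SIGTERM received after %.0fs of uptime", uptime)
        raise SystemExit(0)

    signal.signal(signal.SIGTERM, _on_sigterm)
```

If the uptime logged at shutdown is far shorter than the configured idle_timeout, that suggests the instance is being stopped for a reason other than idleness.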

Related

Spring Data Neo4J - Unable to acquire connection from pool within configured maximum time

We have a reactive REST API using Spring Data Neo4j (Spring Boot v2.7.5) deployed to Kubernetes. When running a stress test to determine the breaking point, once the volume of requests exceeds what the service can handle, the service does not auto-recover, even after the load has dropped back to a level the service can handle.
After the service has fallen over the Neo4J health indicator shows:
“org.neo4j.driver.exceptions.ClientException: Unable to acquire connection from the pool within configured maximum time of 60000ms”
With respect to connection/configuration settings we are using defaults configured by SDN.
Observations:
Up until the point at which the service breaks, only a small number of connections are in use. At the point at which it breaks, the number of connections in use jumps to the max pool size and the above-mentioned error is observed. No matter how much time passes (even well beyond the max connection lifetime), the service is unable to acquire a connection from the pool. After manually shutting down and restarting the service/pod, the service returns to a healthy state.
As an interim solution we now check the Neo4j health indicator; if the mentioned error is present, the liveness state is set to down, which triggers Kubernetes to restart the service automatically. However, I'm wondering if there is an underlying issue with the connections in the pool not getting "cleaned up"?
You can take a look at this discussion https://github.com/spring-projects/spring-data-neo4j/issues/2632
I had the same issue. The problem is that either Spring Framework or Neo4j reactive transaction manager doesn't close connections in a complex reactive flow e.g. when there are a lot of inner calls/mappings and somewhere inside an exception is thrown.
So as a workaround you can add @Transactional in such places to avoid multiple transactions being created.

How does Cloud Run scaling down to zero affect long-computation jobs or external API requests?

I'm new to using Cloud Run and the idea of scaling down to zero is very appealing to me, but I have questions about a few usage scenarios:
If I have a Cloud Run instance querying an external API endpoint, would the instance wind down while waiting for the response if no additional requests come in (i.e. I set the query timeout to 60 min, and no requests are received in those 60 min)?
If the Cloud Run instance is running a computation that lasts longer than 24 hours, or perhaps even days, without receiving requests, can it be trusted to carry out the computation until it's done without being randomly shut down or restarted for servicing or other purposes? (I ask this because Cloud Run is primarily intended for stateless applications, but I have infrequent computation jobs that may take a long time and may be considered "stateful" in a short-term context.)
Does CPU utilization impact auto-scaling (e.g. if I have a computationally intensive job not configured for distributed computing running on one instance, would this trigger Cloud Run to spawn additional instances?)
If you dig into the documentation, I'm quite sure you can find your answers. Here is a summary:
(Interesting read.) Cloud Run instances are shut down only when they aren't in use (usually after about 15 minutes without handling a request; this is only an observation, with no commitment, and it can change at any time). In your case, as long as you are in a request-handling context, no worries: your instance won't be killed, because it is in use. Note: don't send the HTTP response before the processing has finished. Background processes/jobs aren't considered part of a request context. The context runs from the receipt of the request until the response (OK or KO) is sent back. Partial responses/streaming are accepted.
A Cloud Run instance can potentially live for more than 24 hours, but nothing is guaranteed. And because request handling is limited to 1 hour, you can't run a process longer than that. I recommend that you have a look at GKE Autopilot, or run a container on Compute Engine and stop the VM at the end of the processing to save resources and money (or, as a hack, run your container on AI Platform custom training; even if you train nothing, you run a custom container on a serverless platform!). If you can, I recommend designing your workload so that it is split into several small, parallelizable jobs.
Yes, it's described here. But keep in mind that each request is processed on exactly one instance. If you send a request that triggers an intensive compute job, that request is processed only on that one instance (which can have several CPUs, if your workload can make use of them). And if another request comes in during the intensive processing, another Cloud Run instance will be spawned to handle it, but only the new request.
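The advice to split the workload into small jobs can be sketched as follows. This is only an illustration under stated assumptions: `batch_size_for_budget` and `split_into_batches` are hypothetical helpers, the per-item processing time is something you would have to measure yourself, and the 3600-second default simply mirrors the 1-hour request limit mentioned above.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batch_size_for_budget(seconds_per_item: float,
                          budget_seconds: float = 3600.0,
                          safety_factor: float = 0.5) -> int:
    """How many items fit in one request, keeping a safety margin."""
    if seconds_per_item <= 0:
        raise ValueError("seconds_per_item must be positive")
    return max(1, int(budget_seconds * safety_factor / seconds_per_item))


def split_into_batches(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each small enough for one request."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

Each batch can then be sent as its own request, so every piece of work stays inside a request-handling context and can run in parallel on separate instances.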

VMware Tanzu (former PCF) App Autoscaler force scale-down?

I am autoscaling my application based on the HTTP throughput.
My question: when the metric reaches the minimum threshold, the autoscaler tries to reduce the number of instances. But what if an instance is still processing a previous HTTP request while the instance count is being reduced?
In that case, will it wait until the processing completes, or will it forcibly reduce the instance count once the threshold is reached?
I have the same question. As far as I understand from the App Container Lifecycle documentation, it is up to your app to shut down gracefully, but that might not be possible within the given 10 seconds, as some processes might take longer.
Shutdown
CF requests a shutdown of your app instance in the following scenarios:
When a user runs cf scale, cf stop, cf push, cf delete, or cf restart-app-instance
As a result of a system event, such as the replacement procedure during Diego Cell evacuation or when an app instance stops because of a failed health check probe
To shut down the app, CF sends the app process in the container a SIGTERM. By default, the process has ten seconds to shut down gracefully. If the process has not exited after ten seconds, CF sends a SIGKILL.
By default, apps must finish their in-flight jobs within ten seconds of receiving the SIGTERM before CF terminates the app with a SIGKILL. For instance, a web app must finish processing existing requests and stop accepting new requests.
Note: One exception to the cases mentioned above is when monit restarts a crashed Diego Cell rep or Garden server. In this case, CF immediately stops the apps that are still running using SIGKILL.
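The SIGTERM behaviour quoted above suggests a handler that stops taking new work while letting the current job finish. The sketch below is a generic, standard-library illustration, not CF-specific code: `request_shutdown`, `install_handler`, and `run_worker` are hypothetical names, and a real web app would normally rely on its server's own graceful-shutdown support.

```python
import signal
import threading

shutdown_requested = threading.Event()


def request_shutdown(signum=None, frame=None):
    """Signal handler: only record that a shutdown was requested."""
    shutdown_requested.set()


def install_handler():
    """Register the handler for SIGTERM (must run in the main thread)."""
    signal.signal(signal.SIGTERM, request_shutdown)


def run_worker(jobs, handle):
    """Process jobs in order until shutdown is requested.

    The job that is in flight when the signal arrives is allowed to finish;
    jobs that were never started are returned so they can be retried elsewhere.
    """
    remaining = list(jobs)
    while remaining and not shutdown_requested.is_set():
        job = remaining.pop(0)
        handle(job)
    return remaining
```

This only helps if a single job reliably finishes within the grace period (ten seconds by default, per the text above); anything longer still risks being cut off by SIGKILL.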
"In this case it will wait till the processing completes or it
forcibly reduces the instance count when reached threshold."
Answer:
No, App Autoscaler will not force anything. After the decision cycle, it prepares the instance to be scaled down (shut down); the intention is to avoid losing requests or data during this process.
Please take a look at the documentation below; it will help you better understand the App Autoscaler mechanism.
How App Autoscaler Determines When to Scale:
Every 35 seconds, App Autoscaler makes a decision about whether to scale up, scale down, or keep the same number of instances.
To make a scaling decision, App Autoscaler averages the values of a given metric for the most recent 120 seconds.
[Diagram: example of how App Autoscaler makes scaling decisions]
Reference:
VMware Tanzu App Autoscaler documentation
VMware Tanzu was formerly Pivotal Cloud Foundry (PCF).
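The decision cycle quoted above (average a metric over the most recent 120 seconds, then decide) can be sketched roughly as below. This is a simplified model for intuition only, not App Autoscaler's implementation: the `ScalingDecider` class, its threshold parameters, and the "scale_up" / "scale_down" / "hold" labels are all hypothetical.

```python
from collections import deque

WINDOW_SECONDS = 120            # averaging window from the quoted documentation
DECISION_INTERVAL_SECONDS = 35  # how often a decision is made, per the same text


class ScalingDecider:
    def __init__(self, min_threshold, max_threshold, window_seconds=WINDOW_SECONDS):
        self.min_threshold = min_threshold
        self.max_threshold = max_threshold
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp_seconds, metric_value), oldest first

    def record(self, timestamp, value):
        self.samples.append((timestamp, value))

    def decide(self, now):
        # Discard samples that fall outside the averaging window.
        while self.samples and self.samples[0][0] < now - self.window_seconds:
            self.samples.popleft()
        if not self.samples:
            return "hold"
        average = sum(value for _, value in self.samples) / len(self.samples)
        if average > self.max_threshold:
            return "scale_up"
        if average < self.min_threshold:
            return "scale_down"
        return "hold"
```

Because the decision uses an average over the window, a brief dip in throughput does not immediately trigger a scale-down.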

How to handle long requests in Google Cloud Run?

I have hosted my Node app on Cloud Run, and all of my requests are served within 300-600 ms. But one endpoint gets data from a third-party service, so that request takes 1.2-2.5 s to complete.
My questions about this are:
Are 1.2-2.5 s requests suitable for Cloud Run? Or is there a rule that requests should complete within xx ms?
Also see the screenshot, I got a message along with the request in logs "The request caused a new container instance to be started and may thus take longer and use more CPU than a typical request"
What caused a new container instance to be started?
Is there any alternative or workaround to handle long requests?
Any advice / suggestions would be greatly appreciated.
Thanks in advance.
I don't think that will be an issue unless you're worried about the cost of the CPU/memory time, which honestly should only matter if you're getting 10k+ requests/day. So, probably doesn't matter and cloud run can handle that just fine (my own app does requests longer than that with no problem)
It's possible that your service was "scaled to zero", meaning that there were no containers left running to serve requests. In that case, it would be necessary to start up a new instance and wait for whatever initialization/startup costs are associated with that process. It's also possible that it was auto-scaled because all other instances were at their request limits. Make sure that your setting for max concurrent requests per instance is greater than one; Node/Express can handle multiple requests at once. Plus, you'll only get charged for the total time spent, not per request.
In situations where you get very long operations (30 seconds, minutes or more), it may be a good idea to switch to a different data transfer method. You could use polling, where the client makes a request every 5 seconds and checks whether the response is ready. You could also switch to some kind of push-based system like WebSockets, though Cloud Run did not support WebSockets when this answer was written (support has since been added).
TL;DR longer requests (~10-30 seconds) should be fine unless you're worried about the cost of the increased compute time they may incur at scale.
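The polling approach mentioned above could look roughly like this on the client side. It is a minimal sketch: `poll_until_ready` and its `check` callback are hypothetical, with `check` standing in for whatever request your client makes to ask whether the result is ready (returning `None` while it is not).

```python
import time


def poll_until_ready(check, interval_seconds=5.0, timeout_seconds=300.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Call `check()` every `interval_seconds` until it returns a non-None result.

    Raises TimeoutError if the result is still not ready when waiting for
    another interval would exceed `timeout_seconds`.
    """
    deadline = clock() + timeout_seconds
    while True:
        result = check()
        if result is not None:
            return result
        if clock() + interval_seconds > deadline:
            raise TimeoutError("result not ready within %.0fs" % timeout_seconds)
        sleep(interval_seconds)
```

The `sleep` and `clock` parameters exist only so the loop can be tested without real waiting; in normal use the defaults are fine.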

How to determine that a jvm app does more GC than normal work?

We recently had a problem where our EC2 instances ran at 90-100 percent CPU load because of a bug in a library we include, which created too many objects instead of reusing them (this was easy to fix), so we spent too much time in GC.
Unfortunately, the AWS health checks and instance status metrics didn't cause the overloaded instances to be stopped and replaced with new ones, so after some time we hit the maximum autoscaling number and... died. Also, our own in-app health checks, which are used for the ELB, are so simple that they still answered often enough that the instances were not terminated and restarted, which would have mitigated the problem for quite some time.
My idea now is to have our custom health check, which is already included in the ELB health checks, report a failure if we spend too much time in GC.
How would I do such a thing inside the app?
There are a number of JVM parameters that allow GC monitoring
-Xloggc:<file> // logs gc activity to a file
-XX:+PrintGCDetails // tells you how different generations are impacted
You can either parse these logs yourself or use specific tool such as GCViewer to analyse gc activity.
Use GarbageCollectorMXBean:
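import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Sum the cumulative collection time (ms) reported by every collector.
long gcTime = 0;
for (GarbageCollectorMXBean gcBean : ManagementFactory.getGarbageCollectorMXBeans()) {
    gcTime += gcBean.getCollectionTime();
}
// Compare against JVM uptime (ms) to get the share of time spent in GC.
long jvmUptime = ManagementFactory.getRuntimeMXBean().getUptime();
System.out.println("GC ratio: " + (100 * gcTime / jvmUptime) + "%");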
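One caveat for a health check: the ratio above is cumulative since JVM start, so a recent GC storm on a long-running instance barely moves it. Comparing two successive samples of the same counters gives the ratio for the most recent interval instead. The arithmetic is sketched below in Python purely to keep it compact; it ports directly to Java using the same MXBean values. The function names and the 50% threshold are hypothetical, not recommendations.

```python
def gc_ratio_percent(prev_gc_ms, prev_uptime_ms, cur_gc_ms, cur_uptime_ms):
    """Share of wall-clock time spent in GC between two samples, in percent."""
    elapsed_ms = cur_uptime_ms - prev_uptime_ms
    if elapsed_ms <= 0:
        return 0.0  # no time has passed, so there is nothing to measure
    return 100.0 * (cur_gc_ms - prev_gc_ms) / elapsed_ms


def is_unhealthy(ratio_percent, threshold_percent=50.0):
    """Hypothetical rule: fail the health check above the threshold."""
    return ratio_percent >= threshold_percent


# Example: a JVM that has been up for 10 hours with 600 s of total GC time
# has a lifetime ratio under 2%, yet 45 s of GC in the last 60 s is 75%.
lifetime = gc_ratio_percent(0, 0, 600_000, 36_000_000)
recent = gc_ratio_percent(555_000, 35_940_000, 600_000, 36_000_000)
```

Sampling once per health-check interval and keeping only the previous sample is enough to implement this.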
You can use VisualVM to monitor what happens inside the JVM, and you can monitor remote instances via JMX. You did not say which application container you are using (Apache Tomcat, GlassFish, etc.); in the case of Tomcat, you can set up a JMX connector like this.
Don't forget to adjust Security Groups in AWS to have the proper permission to access the JMX port.
The JVM flags PrintGCApplicationConcurrentTime and PrintGCApplicationStoppedTime will log how long the application was active or suspended. They're a bit of a misnomer, since they actually measure time spent in and out of safepoints, not just GCs.