The issue:
Cloud Run services are supposed to be billed only for the time spent processing requests.
We have deployed a Python service and enabled authentication. According to the Google docs (https://cloud.google.com/run/pricing), only authenticated requests from services are billed; unauthenticated requests (from bots, scanners, etc.) get a 403.
The logs show that requests reach the Cloud Run service during working hours (the 8:00-16:00 UTC window), though not frequently, 4-5 times a day. The requests come from our internal services to the auto-generated Cloud Run URL, trigger some data to be generated, and send it back to the calling service.
We are paying ~$100/month per Cloud Run service and would like to decrease the costs.
Expected behaviour:
A request comes to the service. The container is spun up, the request is processed, and we are billed only for the time that the container exists. Then the container is shut down and the "Billable container instance time" metric drops to 0.
Real behaviour:
The metrics show a flat, non-zero line for "Billable container instance time", which means the container never stops being billed.
Please assist on the matter.
UPD: The solution was to decrease minimum instances to 0. Previously it was set to min=1, max=4, so one instance was always running idle.
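For reference, the same change expressed in code. This is only a sketch using the google-cloud-run client library; the project, region, and service names are placeholders, and the change can equally be made in the console or with gcloud.

```python
# Sketch: drop min instances to 0 so no idle instance is kept (and billed) around.
# Assumes the google-cloud-run client library; project/region/service are placeholders.
from google.cloud import run_v2

client = run_v2.ServicesClient()
name = client.service_path("my-project", "europe-west1", "my-service")

service = client.get_service(name=name)
service.template.scaling.min_instance_count = 0  # was 1: one idle instance was always billed
service.template.scaling.max_instance_count = 4

client.update_service(service=service).result()  # wait for the new revision to roll out
```

With min instances at 0, "Billable container instance time" should drop to 0 between requests, at the cost of cold starts when the next request arrives.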
Hi, I am new to Fargate and confused about its cost calculation.
How is the 'Average duration' calculated and charged? Is it charged only for the time between a request arriving and the response returning, or are pods running continuously and charged 24/7/365?
Also, does Fargate fetch the image from ECR every time a request arrives?
Does Fargate cost anything even when there is no request and nothing is being processed?
What is the correct way of calculating the 'Average duration' section?
This can make a huge difference in cost.
You can learn more from AWS Fargate Pricing and from the AWS Pricing Calculator. The first link explains how duration is calculated and includes three examples.
How is the 'Average duration' calculated and charged? Is it charged only for the time between a request arriving and the response returning, or are pods running continuously and charged 24/7/365?
Fargate is not a request-based service. Fargate runs your pod for the entire time you ask it to run that pod. It doesn't deploy pods when a request comes in; the pods are running 24/7 (or for as long as you have them configured to run).
Fargate is "serverless" in the sense that you don't have to manage the EC2 server the container(s) are running on yourself; Amazon manages the server for you.
Also, does Fargate fetch the image from ECR every time a request arrives?
Fargate pulls the image from ECR when a pod is deployed. The pod has to be deployed and running already in order to accept requests; Fargate does not deploy a pod when a request comes in, as you are suggesting.
Does Fargate cost anything even when there is no request and nothing is being processed?
Fargate charges for the amount of RAM and CPU you have allocated to your pod, regardless of whether it is actively processing requests or not. Fargate does not care about the number of requests. You could even use Fargate for back-end processing services that don't accept requests at all.
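To make that concrete, here is a back-of-the-envelope calculation. The per-hour rates below are illustrative us-east-1 Linux/x86 figures and should be checked against the AWS Fargate Pricing page.

```python
# Rough monthly cost of one always-on Fargate task, billed on allocated resources
# rather than on requests. Rates are illustrative; check the pricing page.
VCPU_PER_HOUR = 0.04048    # USD per vCPU-hour (assumed rate)
GB_PER_HOUR   = 0.004445   # USD per GB of memory per hour (assumed rate)

vcpu, memory_gb = 0.5, 1.0   # resources allocated to the task
hours = 24 * 30              # the task runs (and is billed) whether or not requests arrive

monthly = hours * (vcpu * VCPU_PER_HOUR + memory_gb * GB_PER_HOUR)
print(f"~${monthly:.2f} per month")   # ~$17.77, independent of request count
```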
If you want an AWS service that only runs (and charges) when a request comes in, then you would have to use AWS Lambda.
You could also look at AWS App Runner, which is in kind of a middle ground between Lambda and Fargate. It works like Fargate, but it suspends your containers when requests aren't coming in, in order to save some money on the CPU charges.
For deploying a Django web app to Google Cloud Run, I would like to understand the relationships between the various autoscaling-related parameters of Gunicorn and Cloud Run.
Gunicorn has flags like:
workers
threads
timeout
Google Cloud Run has these configuration options:
CPU limit
Min instances
Max instances
Concurrency
My understanding so far:
The number of workers set in Gunicorn should match the CPU limit of Cloud Run.
We set timeout to 0 in Gunicorn to let Cloud Run handle timeouts and autoscale the instance.
Cloud Run will always keep some instances alive; this number is Min instances.
When more traffic comes, Cloud Run will autoscale up to a certain number of instances; this number is Max instances.
I want to know the role of threads (Gunicorn) and concurrency (Cloud Run) in autoscaling. More specifically:
How does the number of threads in Gunicorn affect autoscaling?
I think this should not affect autoscaling at all. Threads are useful for background tasks such as file operations, making async calls, etc.
How does the Concurrency setting of Cloud Run affect autoscaling?
If the number of workers is set to 1, then a particular instance should be able to handle only one request at a time, so setting this value to anything more than 1 does not help. In fact, we should set the CPU limit, concurrency, and workers to match each other. Please let me know if this is correct.
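To make the relationship concrete, here is one way these settings are commonly lined up. This is only a sketch; the 4-vCPU / concurrency-8 figures are assumptions to be tuned against your own load tests.

```python
# gunicorn.conf.py -- sketch for a Cloud Run service with 4 vCPUs and concurrency=8.
import os

# Roughly one worker per vCPU allocated to the instance.
workers = 4

# Threads let one worker overlap I/O-bound requests (DB calls, outbound HTTP).
# workers * threads should be >= the Cloud Run concurrency setting, otherwise
# requests routed to the instance queue up inside Gunicorn.
threads = 2

# Disable Gunicorn's own worker timeout and rely on Cloud Run's request timeout.
timeout = 0

# Cloud Run tells the container which port to listen on via $PORT.
bind = f"0.0.0.0:{os.environ.get('PORT', '8080')}"
```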
Edit 1:
Adding some details in response to John Hanley's comment.
We expect up to 100 req/s. This is based on what we've seen in the GCP console. If our business grows we'll get more traffic, so I would like to understand how the final decision changes if we expect, say, 200 or 500 req/s.
We expect requests to arrive in bursts. Users are groups of people who perform some activities on our web app during a given time window. There may be only one such event on a given day, but the event will see 1000 or more users using our services within a 30-minute window. On busy days we can have multiple events, some of which may overlap. The service will be idle outside of the event times.
How many simultaneous requests can a Cloud Run instance handle? I am trying to understand this one myself. Without Cloud Run, I could have deployed this with x workers and the answer would have been x. But with Cloud Run, I don't know whether the number of Gunicorn workers has the same meaning.
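One rough way to reason about sizing from these numbers (a sketch only; the average latency below is an assumed figure, not a measurement):

```python
# Back-of-the-envelope sizing using Little's law: in-flight requests = rate * latency.
peak_rps = 100        # expected peak requests per second (from the numbers above)
avg_latency_s = 0.2   # assumed average request latency; measure your own
concurrency = 8       # Cloud Run concurrency: requests one instance handles at once

in_flight = peak_rps * avg_latency_s                  # ~20 concurrent requests
instances_needed = -(-int(in_flight) // concurrency)  # ceiling division -> 3 instances

print(f"~{in_flight:.0f} concurrent requests -> ~{instances_needed} instances at peak")
```

Max instances then needs enough headroom above that estimate to absorb the bursty event traffic described above.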
Edit 2: more details.
The application is stateless.
The web app reads and writes to DB.
I think Chrome to Cloud Run is using HTTP/2, from what I am reading and from looking at developer tools: requests show up with HTTP/2-style headers (at least I don't think Chrome displays headers in HTTP/2 format if the connection is HTTP/1, but I can't tell for sure, because I would think this website is HTTP/1 yet I still see HTTP/2-style request headers in Chrome's dev tools -> https://www.w3.org/Protocols/HTTP/Performance/microscape/).
Anyway, I am wondering, for Cloud Run: if I loop and keep calling a JSON endpoint to deliver pieces of a file to Cloud Storage, will it stay connected to the same instance the entire time, so that my upload will work with the ByteReader on the server? In this way, I could upload large files as long as the upload finishes within the Cloud Run timeout window.
Does anyone know if this will work, or will Cloud Run see each JSON request from Chrome hit the front-end load balancer and possibly round-robin it among Cloud Run instances?
Anyway, I am wondering, for Cloud Run: if I loop and keep calling a
JSON endpoint to deliver pieces of a file to Cloud Storage, will it
stay connected to the same instance the entire time ...
The answer is: sometimes it will and sometimes it will not. Do not design something that depends on that answer.
What you are looking for is often termed sticky sessions or session affinity.
Google Cloud Run is designed as a stateless service.
Google Cloud Run automatically scales container instances and load balances every request. Cloud Run does not offer any session stickiness between requests.
Google Cloud Run: About sticky sessions (session affinity)
Cloud Run offers bidirectional streaming and WebSocket support. The request timeout is still limited to 1 hour, but such a connection is suitable for streaming your large file into the same instance (don't exceed the instance memory limit: remember that even a file you store takes space in memory, because the service is stateless).
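One way to apply this advice while keeping memory bounded is to write the incoming stream to Cloud Storage chunk by chunk inside a single request. A minimal sketch, assuming Flask and the google-cloud-storage client; the bucket name is a placeholder:

```python
# Sketch: stream an uploaded body straight into Cloud Storage in fixed-size chunks
# so the file never has to fit in the instance's memory. Flask and
# google-cloud-storage are assumed; the bucket name is a placeholder.
from flask import Flask, request
from google.cloud import storage

app = Flask(__name__)
client = storage.Client()

@app.route("/upload/<name>", methods=["POST"])
def upload(name):
    blob = client.bucket("my-upload-bucket").blob(name)
    # blob.open("wb") returns a file-like writer that uploads as it is written.
    with blob.open("wb", chunk_size=8 * 1024 * 1024) as writer:
        while True:
            chunk = request.stream.read(1024 * 1024)
            if not chunk:
                break
            writer.write(chunk)
    return "ok", 200
```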
Another option is to set max instances to 1, but it's a bad solution: it's not scalable, and even though most of the time you will have only one instance, Cloud Run can sometimes provision 2 or more instances and only guarantees that no more than one serves traffic at the same time.
I link my Cloud SQL instances to Cloud Run with the --add-cloudsql-instances argument.
Some requests are getting a 500 Internal Server Error in their responses. Looking at the logs, I found that Cloud Run reported "Exceeded maximum of 100 connections per instance...". I know that Cloud Run limits each instance to 100 connections to Cloud SQL.
I have already tried setting lower concurrency levels in my Cloud Run service to keep each instance from exceeding the limit, but the problem never dies. What can I do to mitigate this behaviour and bring my application back to normal stability?
PS: I can't find good and recent answers on this anywhere on the internet, so I decided to ask here.
Details about my last Cloud Run revision: 4 vCPUs, 6GB of RAM, --concurrency of 32.
With a concurrency of 32 and a connection limit of 100, you have a connection-handling problem. You are either not closing connections before returning an HTTP response (leaving unused connections open), or you are opening more than one connection per HTTP request and possibly not closing them.
You will need to do a code review for database connection handling.
Opening database connections is an expensive operation. Opening more than one connection per HTTP request wastes time and resources. Use connection pooling to reuse connections, increase performance, and avoid exhausting the open-connection limit.
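A minimal sketch of what that can look like with SQLAlchemy: one module-level pool per container instance, sized so that even 32 concurrent requests stay well under the 100-connection limit. The database URL is a placeholder.

```python
# One engine (and pool) per container instance, created at import time and
# reused by every request. pool_size + max_overflow stays far below the
# 100-connections-per-instance limit even at concurrency=32.
import os
import sqlalchemy

engine = sqlalchemy.create_engine(
    os.environ["DATABASE_URL"],   # placeholder, e.g. a Cloud SQL connection string
    pool_size=5,                  # steady-state connections kept open
    max_overflow=10,              # extra connections allowed under bursts
    pool_timeout=30,              # seconds to wait for a free connection
    pool_recycle=1800,            # recycle connections to avoid stale sockets
)

def handle_request():
    # Each request borrows a connection and returns it to the pool on exit.
    with engine.connect() as conn:
        return conn.execute(sqlalchemy.text("SELECT 1")).scalar()
```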
Restrict the load on your application by setting up an HTTPS load balancer.
Running AWS "Managed Nodes" for an EKS cluster across 2 AZs.
3 nodes in total. I get random timeouts when attempting to pull container images.
This has been hard to trace because it does work (sometimes), so it's not as if an ACL or a security group is blocking it.
When I SSH into the nodes, sometimes I can pull the image manually and sometimes I cannot. When I run curl -I https://hub.docker.com, it sometimes takes 2 minutes to get a response back. I'm guessing this is why the image pulls are timing out.
I don't know of a way to increase the timeout for Kubernetes to pull the image, but I also can't figure out why the latency of the curl request is so bad.
Any suggestions are greatly appreciated.
FYI: the worker nodes are in a private subnet, proper routes to the NAT gateway are in place, and the VPC Flow Logs look good.
Random is the hardest thing to trace 🤷.
🥼 You could move your images to a private ECR registry, or simply run a registry in your cluster, to rule out an issue with your Kubernetes networking. Are you running the AWS CNI❓
It could also just be rate limiting from Docker Hub itself. Are you using the same external NAT IP address to pull from multiple nodes/clusters❓:
Docker will gradually impose download rate limits with an eventual limit of 300 downloads per six hours for anonymous users.
Logged in users will not be affected at this time. Therefore, we recommend that you log into Docker Hub as an authenticated user. For more information, see the following section How do I authenticate pull requests.
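If you want to confirm whether the nodes are being rate limited, Docker Hub reports the current limit and remaining pulls as response headers on a special test image's manifest. A quick check you could run from one of the nodes (a sketch, assuming Python and the requests library are available there):

```python
# Query Docker Hub's rate-limit headers via the ratelimitpreview/test image.
# A HEAD request does not count against the pull limit.
import requests

token = requests.get(
    "https://auth.docker.io/token"
    "?service=registry.docker.io&scope=repository:ratelimitpreview/test:pull",
    timeout=30,
).json()["token"]

resp = requests.head(
    "https://registry-1.docker.io/v2/ratelimitpreview/test/manifests/latest",
    headers={"Authorization": f"Bearer {token}"},
    timeout=30,
)

# Values look like "100;w=21600", i.e. pulls per six-hour window.
print("ratelimit-limit:    ", resp.headers.get("ratelimit-limit"))
print("ratelimit-remaining:", resp.headers.get("ratelimit-remaining"))
print("source seen by Docker Hub:", resp.headers.get("docker-ratelimit-source"))
```

If every node reports the same source address (the NAT gateway's IP), they all share one anonymous quota, which would explain intermittent failures.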
✌️