Cloud Run, ideal vCPU and memory amount per instance? - google-cloud-platform

When setting up Cloud Run, I am unsure how much memory and how many vCPUs I should set per server instance.
I use Cloud Run for mobile apps.
I am confused about when to increase vCPU and memory instead of adding server instances, and when to add server instances instead of increasing vCPU and memory.
How should I calculate it?

There isn't a good answer to that question. You have to know the limits:
The maximum number of requests that an instance can handle concurrently with 4 vCPUs and/or 32 GB of memory (up to 1000 concurrent requests)
The maximum number of instances on Cloud Run (1000)
Then it's a matter of tradeoffs, and it's highly dependent on your use case.
Bigger instances reduce the number of cold starts (and therefore the latency spikes when your service scales up). But if you only have one request at a time, you will pay for a BIG instance to do a small amount of processing.
Smaller instances allow you to optimize cost and to add only a small slice of resources at a time, but you will have to spawn new instances more often and endure more cold starts.
Optimize for whatever you prefer and find the right balance. There is no magic formula!

You can simulate a load of requests against your current settings using k6.io, check the memory and CPU percentage of your container, and adjust them to a lower or higher setting to see if you can get more RPS out of a single container.
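For instance, a minimal way to drive such a test from the command line (the service URL, file name and numbers below are placeholders, not from the original post) could look like this:

    # Sketch, assuming k6 is installed: write a one-request k6 script and replay it
    # with 100 virtual users for 2 minutes against a placeholder Cloud Run URL.
    printf '%s\n' \
      "import http from 'k6/http';" \
      "export default function () { http.get('https://my-service-xxxxx-uc.a.run.app/'); }" \
      > loadtest.js
    k6 run --vus 100 --duration 2m loadtest.js

While it runs, watch the container's CPU and memory utilization in the Cloud Run metrics to decide whether to size up or down.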
Once you are satisfied with a single container instance's throughput, let's say 100 RPS per container instance, you can then use gcloud to specify the --min-instances and --max-instances flags, depending of course on the --concurrency flag, which in this explanation would be set to 100.
Also note that concurrency starts at a default of 80 and can go up to 1000.
More info about this can be read on the links below:
https://cloud.google.com/run/docs/about-concurrency
https://cloud.google.com/sdk/gcloud/reference/run/deploy
I would also recommend investigating whether you need to pass the --cpu-throttling or the --no-cpu-throttling flag, depending on how much you need to compensate for cold starts.
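For reference, a deploy command combining these flags might look like the sketch below; the service name, image, region and numbers are placeholders to be tuned from your own load test:

    # Placeholder values; adjust --cpu/--memory/--concurrency to what a single
    # instance handled comfortably in the load test.
    gcloud run deploy my-service \
      --image gcr.io/MY_PROJECT/my-image \
      --cpu 2 --memory 1Gi \
      --concurrency 100 \
      --min-instances 1 --max-instances 20 \
      --no-cpu-throttling \
      --region us-central1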

Related

Using Concurrent thread in jmeter and AWS DLT with .jmx file - How do I provide inputs so that we can achieve 5000 RPS for 5 minutes duration

We have configured AWS for distributed load testing using - https://aws.amazon.com/solutions/implementations/distributed-load-testing-on-aws/
Our requirement includes achieving 5k RPS.
Please help me understand the inputs that need to be provided here.
Assuming the system supports 5k RPS, what should the Task Count, Concurrency, Ramp Up and Hold For values be in order to achieve 5k RPS using AWS DLT?
We are also trying to achieve it using JMeter concurrent threads. Hoping someone could help with the values and explain their usage.
We don't know.
Have you tried reading the documentation at the link you provided yourself? For example, for Concurrency there is a chapter called Determine the number of users which suggests starting with 200 and increasing/decreasing depending on resource consumption.
The same applies to the Task Count: you may either go with a single container with default resources, increase the container resources, or increase the number of containers.
The number of hits per second will mostly depend on your application's response time. For example, given the 200 recommended users: if the response time is 1 second you will have 200 RPS, if it is 2 seconds you will get 100 RPS, if it is 0.5 seconds you will get 400 RPS, and so on. See the What is the Relationship Between Users and Hits Per Second? article for a more comprehensive explanation if needed. The throughput can also be controlled on the JMeter side using the Concurrency Thread Group and Throughput Shaping Timer, but again, the container(s) must have sufficient resources in order to produce the desired load.
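Put differently, the number of concurrent users you need is roughly the target throughput multiplied by the average response time (Little's law). A quick back-of-the-envelope check with illustrative response times:

    # users needed = target RPS x average response time (plus headroom for variance)
    echo "5000 * 0.5" | bc   # 0.5 s responses -> about 2500 concurrent users
    echo "5000 * 1.0" | bc   # 1 s responses   -> about 5000 concurrent users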
With regards to ramp-up - again, we don't know. Personally I tend to increase the load gradually so I can correlate the increasing load with other metrics. The JMeter documentation recommends starting with a ramp-up period in seconds equal to the number of users.
The same goes for the time to hold the load: after ramping up to the number of users required to produce 5K RPS, I would recommend holding the load for the duration of the ramp-up period to see how the system behaves, whether it stabilizes once the load stops increasing, whether response times stay flat or keep climbing, etc.

Cloud run with CPU always allocated is cheaper than only allocated during request processing. How?

I use Cloud Run for my apps and am trying to predict the costs using the GCP pricing calculator. I can't figure out why it's cheaper with CPU always allocated than with CPU allocated only during request processing, when the documentation says "When you opt in to 'CPU always allocated', you are billed for the entire lifetime of container instances".
Any explanation?
Thank you!
Cloud Run is serverless by default: you pay as you use. When a request comes in, an instance is created and started (that's the cold start) and your request is processed. The timer starts. When your web server sends the answer, the timer stops.
You pay for the memory and the CPU used during request processing, rounded up to the nearest 100ms. The instance continues to live for about 15 minutes (by default, and this can change at any moment) so it is ready to process another request without having to start a new one (and wait through the cold start again).
As you can see, the instance continues to live EVEN IF YOU NO LONGER PAY FOR IT, because you pay only while a request is being processed.
When you set the CPU to always allocated, you pay for the full time the instance runs, whether it is handling requests or not. Google no longer has to keep instances running and unused, waiting for a request, at its own expense as in the pay-per-use model. You pay for that instead, and you pay a lower rate.
It's like a Compute Engine VM that is up full time. And, as with Compute Engine, you get something similar to a sustained use discount. That's why it's cheaper.
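For what it's worth, switching a service between the two billing models is just a flag (the service name and region below are placeholders):

    # "CPU always allocated" corresponds to --no-cpu-throttling; the default
    # pay-per-request model corresponds to --cpu-throttling.
    gcloud run services update my-service --region us-central1 --no-cpu-throttling
    gcloud run services update my-service --region us-central1 --cpu-throttling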
In general it depends on how you use cloud run. Google is giving some hints here: https://cloud.google.com/run/docs/configuring/cpu-allocation
To summarize the biggest pricing differences:
CPU only allocated during request processing - you pay for:
- every request, on a per-request basis
- CPU and memory allocation time during request processing
CPU always allocated - you pay for:
- CPU and memory allocation, at a cheaper rate, for the time the instance is active
Compare the pricing here: https://cloud.google.com/run/pricing
So if you have a lot of requests that do not use a lot of resources, and not much variance between them, then "always allocated" might be a lot cheaper.
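As a very rough sketch of that structure (all volumes and rates below are made-up placeholders, not real Cloud Run prices; take the real ones from the pricing page above):

    # Placeholder numbers, only to show how the two bills are composed (memory ignored for brevity).
    REQ_COUNT=5000000        # requests per month
    BUSY_VCPU_S=1000000      # vCPU-seconds spent actually serving requests
    ALIVE_VCPU_S=1300000     # vCPU-seconds the instances were alive in total
    PER_REQ=0.0000004; CPU_REQ_RATE=0.000024; CPU_ALWAYS_RATE=0.000018   # placeholder rates
    printf 'request-based : '; echo "$REQ_COUNT*$PER_REQ + $BUSY_VCPU_S*$CPU_REQ_RATE" | bc -l
    printf 'always-on     : '; echo "$ALIVE_VCPU_S*$CPU_ALWAYS_RATE" | bc -l

With high utilization (instances rarely idle), the cheaper always-on rate wins; with spiky, low-volume traffic, the pay-per-request model usually does.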

CPU credits on AWS EC2 instance how does that work

I have a micro Amazon EC2 instance, and whenever the application hosted on it is put under a heavy load for a couple of hours, the application slows down and the CPU credits drop almost to zero.
I have turned the auto scaling option on, but it still does not work. Can someone help me figure out how to get around this?
All t2 instances use a burstable model, which is not really intended for sustained heavy usage. When idling, the instance builds up CPU credits up to a cap. When the CPU is maxed out, the credits are spent. Once you run out, you are capped at a very low rate. The number of credits you can accrue and the rate at which you earn them depend on which t2 instance you are using.
Autoscaling is for horizontal scaling. With it you can launch extra instances based on certain triggers, but you need to use a load balancer to spread traffic across instances.
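If you want to confirm that credit exhaustion is what is slowing you down, the CloudWatch CPUCreditBalance metric shows it directly. A sketch with a placeholder instance ID:

    # Requires the AWS CLI and CloudWatch read permissions; the instance ID is a placeholder.
    aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 --metric-name CPUCreditBalance \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --start-time "$(date -u -d '-1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
      --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
      --period 300 --statistics Average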
As to how you can see in the CPU utilization whether you are using all of your credited CPU at 100%: in my experience, you don't. What you see in top or iostat, for example, is a CPU% reported at something quite low, like 30%, while IO is not bottlenecked, and you wonder why it is stuck at such low CPU usage.
But there is a value you might see in top, at the far right end, something like "68% st": that is the "steal" value. It means you only get 32% of that CPU, so your 30% CPU value is actually about 94% of what you actually get.
I have also observed that when you add up the CPU% of the processes that are in the running state (R) in top, you arrive at a number relative to your actually available CPU. For example, I had 24 processes running at 8% each on a t2.medium instance with 2 virtual CPUs; that means 192% actually running, which is 96% of the available CPU cycles, not the 32% reported by top and iostat as %user.
If I were to build an auto-scale trigger, I would look at what I can get from the /proc file system and take the "steal" amount into account.
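For example, a minimal way to sample that from a script (assuming a standard Linux /proc layout) could be:

    # "steal" is the 8th value after the "cpu" label in /proc/stat (awk field $9), in clock ticks.
    s1=$(awk '/^cpu /{print $9}' /proc/stat)
    sleep 5
    s2=$(awk '/^cpu /{print $9}' /proc/stat)
    echo "steal ticks over the last 5 seconds: $((s2 - s1))"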

Auto scaling batch jobs on Elastic Beanstalk

I am trying to set up scalable background image processing using Beanstalk.
My setup is the following:
The application server (running on Elastic Beanstalk) receives a file, puts it on S3 and sends a request over SQS to process it.
The worker server (also running on Elastic Beanstalk) polls the SQS queue, takes the request, loads the original image from S3, processes it into 10 different variants and stores them back on S3.
These upload events are happening at a rate of about 1-2 batches per day, 20-40 pics each batch, at unpredictable times.
Problem:
I am currently using one micro-instance for the worker. Generating one variant of the picture can take anywhere from 3 seconds to 25-30 (it seems the first ones are done in 3, but then the micro instance slows down; I think this is due to its 2-second bursty workload design). Anyway, when I upload 30 pictures, that means the job takes 30 pics * 10 variants each * 30 seconds = 2.5 hours to process?!
Obviously this is unacceptable. I tried using a "small" instance for that; the performance is consistent there, but it's about 5 seconds per variant, so still 30*10*5 = 25 minutes per batch. Still not really acceptable.
What is the best way to attack this problem which will get fastest results and will be price efficient at the same time?
Solutions I can think of:
Rely on Beanstalk auto-scaling. I've tried that, setting up auto scaling based on CPU utilization. It seems very slow to react and unreliable. I've tried setting the measurement time to 1 minute and the breach duration to 1 minute, with thresholds of 70% to scale up and 30% to scale down, in increments of 1 instance. It takes the system a while to scale up and then a while to scale down; I can probably fine-tune it, but it still feels wrong. Ideally I would like to get a faster machine than micro (small, medium?) to use for these spikes of work, but with Beanstalk that means I need to run at least one all the time, and since the system is idle most of the time that doesn't make any sense price-wise.
Abandon Beanstalk for the worker, implement my own monitor of the SQS queue running on a micro, and let it fire up a larger machine (or a group of larger machines) when there are enough pending messages in the queue, then terminate them the moment the queue is detected to be idle. That seems like a lot of work, unless there is a ready-made solution for it out there. In any case, I lose the Beanstalk benefits of deploying code through git, managing environments, etc.
I don't like either of these two solutions.
Is there any other nice approach I am missing?
Thanks
CPU utilization on a micro instance is probably not the best metric to use for autoscaling in this case.
Length of the SQS queue would probably be the better metric to use, and the one that makes the most natural sense.
Needless to say, if you can budget for a bigger baseline machine, everything will run that much faster.
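A rough sketch of what queue-depth-driven scaling could look like with the AWS CLI (the queue name, thresholds and scaling policy ARN are placeholders):

    # Alarm on the SQS backlog and wire its action to the worker environment's scale-out policy.
    aws cloudwatch put-metric-alarm \
      --alarm-name image-queue-backlog \
      --namespace AWS/SQS --metric-name ApproximateNumberOfMessagesVisible \
      --dimensions Name=QueueName,Value=image-processing-queue \
      --statistic Average --period 60 --evaluation-periods 2 \
      --threshold 10 --comparison-operator GreaterThanOrEqualToThreshold \
      --alarm-actions arn:aws:autoscaling:REGION:ACCOUNT_ID:scalingPolicy:POLICY_ID:autoScalingGroupName/WORKER_ASG:policyName/scale-out

A matching alarm on a low threshold can drive the scale-in policy so the extra workers go away once the batch is done.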

Amazon m1.small vs micro instance CPU performance

I have an Amazon micro instance and it looks like the CPU is not enough. I am going to upgrade to the next cheapest instance with more CPU available.
Could that be an m1.small instance? According to the description they have the same number of compute units, and it looks like the micro can even outperform the small instance when more cores become available for short CPU bursts.
Update: note that this information is only really applicable to the previous generation t1.micro instance type, which had a cyclical clamping throttle algorithm. The current generation t2 instance class, including the t2.micro, has much better performance than the t1.micro and an entirely different algorithm controlling the throttling. Throttling on the t2 instance class is driven by CPU credits, which are visible in the CloudWatch metrics for the instance; the throttling is much more graceful and kicks in much later. Throttling on the t1.micro was essentially a black box, and the system would repeatedly shift in and out of the throttled mode under high loads. There is no longer a compelling reason to use a t1 instance unless you are running a PV AMI. The t2 is HVM.
ECUs are "EC2 Compute Units" and represent, approximately, the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron processor.
This Comparison of t1.micro and m1.small explains that a small instance has 1 ECU continually available, while a Micro can operate in short bursts of up to 2 ECU, but with an ongoing baseline of much less.
In my testing, I've found that consuming 100% CPU for about 10-15 seconds on a micro instance gets you throttled down to a fraction of that -- approximately 0.2 ECU -- for about the next 2-3 minutes, when the throttling lifts for a few seconds; then the cycle repeats, though only if you are still pulling the hard burst. They accomplish the throttling via the hypervisor "stealing" a large percentage of your available cycles. You can see this in "top" when it's happening. If you go long enough without demanding 100% CPU, the 2 ECU burst is immediately available when you need it -- it's not as if they are cycling the performance up and down with a timer -- the throttling is reactive to the imposed load.
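If you want to reproduce that observation, a crude sketch (assuming a Linux instance with top installed) is to pin a core and watch the steal column climb once the throttle kicks in:

    # Burn one core for two minutes while printing the CPU summary line every 10 seconds;
    # the "st" (steal) value should jump once the burst credit is exhausted.
    timeout 120 sh -c 'while :; do :; done' &
    top -b -d 10 -n 12 | grep 'Cpu(s)'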
Over time, the small instance will get more processing done, since the micro is throttled so aggressively after a few seconds of heavy usage, long enough to more than counteract the brief periods of nice burstability. This makes sense, though, since the micro is a lower-cost instance.
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/concepts_micro_instances.html
...so, yes, try a small instance.