Google Cloud Run concurrency limits + autoscaling clarifications - google-cloud-platform

Google Cloud Run allows you to specify a request concurrency limit per container instance. The subtext of the input field states: "When this concurrency number is reached, a new container instance is started." Two clarification questions:
Is there any way to set Cloud Run to anticipate the concurrency limit being reached, and spawn a new container a little before that happens to ensure that requests over the concurrency limit of Container 1 are seamlessly handled by Container 2 without the cold start time affecting the requests?
Imagine we have Maximum Instances set to 10, Concurrency set to 10 and there are currently 100 requests being processed (i.e. we've maxed out our capacity and cannot autoscale any more). What happens to the 101st request? Will it be queued up for some period of time, or will a 5XX be returned immediately?

Is there any way to set Cloud Run to anticipate the concurrency limit being reached, and spawn a new container a little before that happens to ensure that requests over the concurrency limit of Container 1 are seamlessly handled by Container 2 without the cold start time affecting the requests?
No. Cloud Run does not try to predict future traffic patterns.
Imagine we have Maximum Instances set to 10, Concurrency set to 10 and there are currently 100 requests being processed (i.e. we've maxed out our capacity and cannot autoscale any more). What happens to the 101st request? Will it be queued up for some period of time, or will a 5XX be returned immediately?
HTTP Error 429 Too Many Requests will be returned.
[EDIT - Google Cloud documentation on request queuing]
Under normal circumstances, your revision scales out by creating new instances to handle incoming traffic load. But when you set a maximum instances limit, in some scenarios there will be insufficient instances to meet that traffic load. In that case, incoming requests queue for up to 60 seconds. During this 60 second window, if an instance finishes processing requests, it becomes available to process queued requests. If no instances become available during the 60 second window, the request fails with a 429 error code on Cloud Run (fully managed).
Source: About maximum container instances (Cloud Run documentation)
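In practice this means a client calling a maxed-out service should be prepared to retry on 429 rather than treat it as a hard failure. Below is a minimal client-side sketch using the Python requests library; the service URL, payload shape and backoff values are illustrative assumptions, not anything mandated by Cloud Run.

```python
import time

import requests

CLOUD_RUN_URL = "https://my-service-abc123-uc.a.run.app/process"  # hypothetical endpoint


def call_with_backoff(payload, max_attempts=5):
    """Retry on HTTP 429, which Cloud Run returns once the 60-second
    request queue is exhausted at the max-instances limit."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        response = requests.post(CLOUD_RUN_URL, json=payload, timeout=120)
        if response.status_code != 429:
            response.raise_for_status()  # surface any non-429 error immediately
            return response.json()
        if attempt == max_attempts:
            response.raise_for_status()  # out of retries: raise the 429 to the caller
        time.sleep(delay)
        delay *= 2  # exponential backoff before retrying
```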

Related

Running thousands of scheduled jobs in AWS on a regular cadence?

I'm architecting an application solution in AWS and am looking into the options AWS has for running one-off jobs on a regular schedule.
For example, we have a task that needs to run every 5 minutes: it makes a call to an external API, interprets the data and then possibly stores some new information in a database. This particular task is expected to run for 30 seconds or so. Where this gets a little more complex is that we're running a multi-tenant application and this task needs to be performed for each tenant individually. It doesn't satisfy the user's requirements to have a single process do the specified task for each tenant in sequence. The task must be performed every x minutes (sometimes as low as every minute) and it must complete for each tenant as quickly as it takes to perform the task exactly once. In other words, all 200 tenants (say) must have the task run for them at midnight, and each tenant's task must complete in the time it takes to query the API and update the database for one tenant.
To add to the complexity a bit, this is not the only task we will be running on a regular schedule for our tenants. In the end we could have dozens of unique tasks, each running for hundreds of tenants, resulting in thousands or tens of thousands of unique concurrent tasks.
I've looked into ECS Scheduled Tasks, which use CloudWatch Events (now EventBridge), but EventBridge has a limit of 300 rules per event bus. I think that means we're going to be out of luck if we need 10,000 rules (one for each task times the number of tenants), but I'm honestly not sure whether each account gets its own event bus or if that's divided up differently.
In any case, even if this did work, having 10,000 different rules set up in EventBridge is still not a very attractive option to me. At the very least, it feels like it might be difficult to manage. To that end, I'm now leaning toward creating a single EventBridge rule per event type that kicks off a parent task, which in turn asynchronously kicks off as many instances of a child task as needed, one per tenant. This would limit our EventBridge rules to somewhere around a few dozen. Each one of these, when triggered, would asynchronously spawn a task for each tenant, and those could all run together. I'm not 100% sure what type of object this would spawn; it wouldn't be a Lambda, since that would easily cause us to hit the 1,000 concurrent Lambda function limit, but it might be something like a Fargate ECS task that executes for a few seconds and then goes away when it's completed.
I'd love to hear others' thoughts on these options, my current direction and any other options I'm currently missing.
You don't necessarily need to look at ECS for this, because 1,000 invocations of a Lambda at a time is only the default concurrency limit. That is something you can request an increase for in the Service Quotas console:
There is no maximum concurrency limit for Lambda functions. However, limit increases are granted only if the increase is required for your use case.
Source: AWS Support article.
Same goes for the 300 rules per event bus limit. That is also a default limit and can be increased upon request in the Service Quotas console.
Since you mentioned branching logic, I wonder if you've looked into AWS Step Functions? In particular, Express Workflows within Step Functions may suit the duration and rates of your executions.
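If you stay on Lambda as this answer suggests, the parent/child fan-out described in the question could look roughly like the sketch below. It is only an illustration of the pattern: the child function name, event shape and tenant lookup are hypothetical, and it assumes the concurrency quota has been raised enough to cover one in-flight child invocation per tenant.

```python
import json

import boto3

lambda_client = boto3.client("lambda")

CHILD_FUNCTION = "per-tenant-task"  # hypothetical child Lambda name


def handler(event, context):
    """Parent Lambda triggered by a single EventBridge rule per event type.

    Fans out one asynchronous child invocation per tenant, so all tenants
    are processed in roughly the time it takes to process one."""
    tenant_ids = get_tenant_ids()  # hypothetical lookup, e.g. from a tenant registry
    for tenant_id in tenant_ids:
        lambda_client.invoke(
            FunctionName=CHILD_FUNCTION,
            InvocationType="Event",  # asynchronous: returns immediately
            Payload=json.dumps({"tenant_id": tenant_id, "task": event.get("task")}),
        )
    return {"dispatched": len(tenant_ids)}


def get_tenant_ids():
    # Placeholder: in practice this would query your tenant store.
    return ["tenant-001", "tenant-002", "tenant-003"]
```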

Why are all requests to an ECS container going to only 1 (of 2) EC2 instances in AWS?

In AWS, I have an ECS cluster that contains a service that has 2 EC2 instances. I sent 3 separate API requests to this service, each of which should take about an hour to run at 100% capacity. I sent the requests a couple of minutes apart. They all went to the same instance and left the other idle. A graph of my service's CPU utilization shows it is not using all of its capacity. What am I missing? Why won't requests go to the second EC2 instance?
An ALB will not perfectly round-robin between two instances. If you sent 100 requests 100 times over, then on average each instance would receive 50 requests, but most of the time it won't be exactly 50 per backend.
For a long-running task like this it is preferable to use something else such as SQS, whereby each container will only process x messages at a time (most of the time you'd want x = 1). Each instance can then poll SQS for work, and won't take on more work whilst it is busy.
You will get other benefits too, such as being able to see how long a message is taking to finish, and error-handling capabilities to account for timeouts or a server dying whilst it is doing work.
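A minimal sketch of that worker loop with boto3 is shown below, assuming a hypothetical queue URL and task handler; the visibility timeout is set a little above the expected one-hour runtime so an unfinished message is redelivered if the instance dies mid-task.

```python
import boto3

sqs = boto3.client("sqs")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/long-tasks"  # hypothetical


def poll_forever():
    """Pull at most one message at a time so a busy instance never
    takes on more work than it can handle."""
    while True:
        response = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,   # x = 1: one long task per worker at a time
            WaitTimeSeconds=20,      # long polling to avoid empty-receive churn
            VisibilityTimeout=3900,  # a bit over the ~1 hour task duration
        )
        for message in response.get("Messages", []):
            process(message["Body"])  # hypothetical task handler
            sqs.delete_message(
                QueueUrl=QUEUE_URL,
                ReceiptHandle=message["ReceiptHandle"],
            )


def process(body):
    # Placeholder for the hour-long job described in the question.
    print("processing", body)
```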

Is Cloud Run limited by cold starts and maximum execution length?

When using Cloud Functions we have limitations related to cold starts and the maximum execution length of 9 minutes. Do any of these limitations also exist on Google Cloud Run?
According to the documentation, there is a limit of 15 minutes before a timeout.
Cloud Run still has cold starts, but they are much less frequent than on Cloud Functions, depending on your traffic patterns and the configured level of concurrency for an instance (max 80 concurrent requests, also from the documentation).

How should I go about handling instances on App Engine when there is a sudden spike during a sale we host on our website, which lasts for an hour?

We are hosting a sale every month. Once we are ready with all the deals data we send a notification to all of our users. As a result we get huge traffic within seconds, and it lasts for about an hour. Currently we are changing the instance class to F4_1G before the sale starts and back to F1 after one hour. Is there a better way to handle this?
Apart from changing the instance class of App Engine Standard based on the demand you expect, you can (and should) also consider a good scaling approach for your application. App Engine Standard offers three different scaling types, which are documented in detail, but let me summarize their main features here:
Automatic scaling: based on request rate, latency in the responses and other application metrics. This is probably the best option for the use case you present, as more instances will be spun up in response to demand.
Manual scaling: continuously running, instances preserve state and you can configure the exact number of instances you want running. This can be useful if you already know how to handle your demand from previous occurrences of the spikes in usage.
Basic scaling: the number of instances scales with the volume of demand, and you can set the maximum number of instances that can be serving.
According to the use case you presented in your question, I think automatic scaling is the scaling type that matches your requirements better. So let me get a little more in-depth on the parameters that you can tune when using it:
Concurrent requests to be handled by each instance: set up the maximum number of concurrent requests that can be accepted by an instance before spinning up a new instance.
Idle instances available: how many idle (not serving traffic) instances should be ready to handle traffic. You can tune this parameter to be higher when you have the traffic spike, so that requests are handled in a short time without having to wait for an instance to be spun up. After the peak you can set it to a lower value to reduce the costs of the deployment.
Latency in the responses: the allowed time that a request can be left in the queue (when no instance can handle it) without starting a new instance.
If you play around with these configuration parameters, you can control quite deterministically the number of instances you run, accommodating the big spikes and later returning to lower values to reduce usage and cost.
An additional note to take into account when using automatic scaling: after a traffic spike you may see more idle instances than you specified (they are not torn down immediately, to avoid having to start new instances again), but you will only be billed for the max_idle_instances that you specified.

Scheduling long-running tasks using AWS services

My application relies heavily on AWS services, and I am looking for an optimal solution based on them. The web application triggers a scheduled job (assume it repeats indefinitely) which requires a certain amount of resources to run. A single run of the task normally takes at most 1 minute.
The current idea is to pass jobs via SQS and spawn workers on EC2 instances depending on the queue size (this part is more or less clear).
But I struggle to find a proper solution for actually triggering the jobs at set intervals. Assume we are dealing with 10,000 jobs. Having a scheduler run 10k cron jobs at the same time (the job itself is quite simple, just passing a job description via SQS) seems like a crazy idea. So the actual question is: how do I autoscale the scheduler itself (covering scenarios where the scheduler is restarted, a new instance is created, etc.)?
Or is the scheduler redundant as an app, and is it wiser to rely on AWS Lambda functions (or other services that provide scheduling)? The problem with Lambda functions is their limits, and the 128 MB of memory provided by a single function is actually too much (20 MB seems like more than enough).
Alternatively, the worker itself can wait for a certain amount of time and notify the scheduler that it should trigger the job one more time. Let's say if the frequency is 1 hour:
1. Scheduler sends job to worker 1
2. Worker 1 performs the job and after one hour sends it back to Scheduler
3. Scheduler sends the job again
The issue here, however, is the possibility that the worker gets scaled in.
Bottom line: I am trying to build a lightweight scheduler which would not require autoscaling and would serve as a hub whose sole purpose is transmitting job descriptions. And it certainly should not get throttled on a service restart.
Lambda is perfect for this. You have a lot of short-running processes (~1 minute) and Lambda is made for short processes (up to five minutes nowadays). It is very important to know that CPU speed is coupled linearly to RAM. A 1 GB Lambda function is equivalent to a t2.micro instance, if I recall correctly, and 1.5 GB of RAM means 1.5x more CPU speed. The cost of these functions is so low that you can just run with this. A 128 MB function has 1/8 the CPU speed of a micro instance, so I do not actually recommend using those.
As a queueing mechanism you can use S3 (yes you read that right). Create a bucket and let the Lambda worker trigger when an object is created. When you want to schedule a job, put a file inside the bucket. Lambda starts and processes it immediately.
Now you have to respect some limits. This way you can only have 100 workers at the same time (the total number of active Lambda instances), but you can ask AWS to increase this.
The costs are as follows:
$0.005 per 1,000 PUT requests, so $5 per million job requests (this is more expensive than SQS).
The Lambda runtime. Assuming normal t2.micro CPU speed (1 GB RAM), this costs $0.0001 per job (60 seconds each; the first 300,000 seconds are free = 5,000 jobs).
The Lambda requests. $0.20 per million triggers (first million is free)
This setup does not require any servers on your part. This cannot go down (only if AWS itself does).
(don't forget to delete the job out of S3 when you're done)
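A rough sketch of such an S3-triggered worker is below. The assumption that each job is a small JSON file and the run_job helper are hypothetical; the trigger wiring and the final delete follow the approach described above.

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Worker Lambda triggered by S3 object-created events.

    Each object in the bucket is one job description; process it, then
    delete it so the "queue" does not accumulate finished work."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        job = json.loads(body)  # hypothetical: jobs stored as JSON files
        run_job(job)

        # Delete the job out of S3 when done, as noted above.
        s3.delete_object(Bucket=bucket, Key=key)


def run_job(job):
    # Placeholder for the ~1 minute task from the question.
    print("running job", job.get("id"))
```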