Google Cloud Run not scaling up despite large backlog and available instances - google-cloud-platform

I am seeing something similar to this post. It looked like additional detail was needed to answer that question, so I'm re-asking with my details since those details weren't provided.
I am running a modified version of the Google Cloud Run image processing tutorial example.
I am inserting tasks into a task queue using this create tasks snippet. The tasks from the queue get pushed to my cloud run instance.
The problem is it isn't scaling up and making it through my tasks in a timely manner.
My cloud run service configuration:
I have tried setting a minimum of both 0 and 50 instances
I have tried a maximum of 100 and 1000 instances
I have tried --concurrency=1 and 2, and 8
I have tried with --async and without --async
With 50 instances pre-allocated even with concurrency set to 1, I am typically seeing ~10 active container instances and ~40 idle container instances. I have ~30,000 tasks in the queue and it is getting through ~5 jobs/minute.
My tasks queue has the default settings. My containers aren't using a lot of cpu, but they are using a lot of memory.
A process takes about a minute to complete. I'm only running one process per container instance. What additional parameters should be set to get higher throughput?
Edit - adding additional logs
I enabled the logs for the queue, I'm seeing some errors for some of the jobs. The errors look like this:
{
insertId: "<my_id>"
jsonPayload: {
#type: "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog"
attemptResponseLog: {
attemptDuration: "19.453155s"
dispatchCount: "1"
maxAttempts: 0
responseCount: "0"
retryTime: "2021-10-20T22:45:51.559121Z"
scheduleTime: "2021-10-20T16:42:20.848145Z"
status: "UNAVAILABLE"
targetAddress: "POST <my_url>"
targetType: "HTTP"
}
task: "<my_task>"
}
logName: "<my_log_name>"
receiveTimestamp: "2021-10-20T22:45:52.418715942Z"
resource: {
labels: {
location: "us-central1"
project_id: "<my_project>"
queue_id: "<my-queue>"
target_type: "HTTP"
}
type: "cloud_tasks_queue"
}
severity: "ERROR"
timestamp: "2021-10-20T22:45:51.459232147Z"
}
I don't see errors in the cloud run logs.
Edit - Additional Debug Information
I tried to take the queue out of the equation to determine if it is cloud run or the queue. Instead I directly used curl to post to the url. Some of the tasks ran successfully, for others I received an error. In the below logs empty lines are successful:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This makes me think cloud run isn't handling all of the incoming requests.
Edit - task completion time test
I wanted to test if the time it takes to complete a task causes any issues with CloudRun and the Queue scaling up and keeping up with the tasks.
In place of the task I actually want completed I put a dummy task that just sleeps for n seconds and prints the task details to stdout (which I can read in the cloud run logs).
With n set to 0, 5, 10 seconds I see the number of instances scale up and it keeps up with the tasks being added to the queue. With n set to 20 seconds or more I see that less CloudRun instances are instantiated and items accumulate in the task queue. I see more errors with the Unavailable status in my logs.
According to this post:
Cloud Run offers a longer request timeout duration of up to 60 minutes
So it seems that long running tasks are expected. Is this a Google bug or am I missing setting some parameter?

I do not think this is a Cloud Run Service problem. I think this is an issue with how you have Tasks setup.
The dates in the log entry look odd. Take a look at the receiveTimestamp and the scheduleTime. The task is scheduled for six hours before the receive time. Do you have a timezone problem?
According to the documentation, if the response_time is not set then the task was not attempted. It looks like you are scheduling tasks incorrectly and the tasks never run.
Search for the text The status of a task attempt. in this link:
Types for Google Cloud Tasks

Related

OperationalError & ConnectTimeoutError When running multiple queries in snowflake (From many cloud run instances)

My platform running over gcp cloud run. The db we use is snowflake.
Once a week, we schedule (with Cloud Schedule) a job that practically triggers up to 200 tasks (currently, will probably grow up in the future). All tasks is being added to certain queue.
Each task is practically push post call to a cloud-run instance.
Each cloud run instance is handling one request (see also environment settings), means - one task at a time. Moreover, each cloud run has 2 active sessions to 2 databases in snowflake (one for each). The first session is for "global_db" and the other one is to specific "person_id" db (Notice: There might be 2 active session to the same person_id db from different cloud run instances)
Issues:
1 - When set the tasks queue "Max concurrent dispatches" to 1000, I get 503 ("The request failed because the instance failed the readiness check.")
Issue was probably gcp autoscaling capacities - SOLVED by decrease the "Max concurrent dispatches" to reasonable number that gcp can handle with.
2- When set the tasks queue "Max concurrent dispatches" to more than 10,
I get multiple ConnectTimeoutError & OperationalError, with the following messages (I removed the long id's and just put {} for make the message shorter):
sqlalchemy.exc.OperationalError: (snowflake.connector.errors. ) 250003: Failed to execute request: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={REQUEST_ID}&databaseName={DB_NAME}&warehouse={NAME}&request_guid={GUID} (Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3e583ff91550>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
(Background on this error at: http://sqlalche.me/e/13/e3q8)
snowflake.connector.vendored.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={ID}&databaseName={NAME}&warehouse={NAME}&request_guid={GUID}(Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3eab877b3ed0>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
Any ideas how can I solve it?
Ask any Q you have, and I will elaborate
environment settings -
cloud tasks queue - Check multiple configurations for "Max concurrent dispatches", from 10 to 1000 concurrency. max attempts is 1, max dispatches is 500.
cloud run - 5 hot instances, 1 request per one. Can autoscaling to max 1000 instances.
snowflake - ACCOUNT parameters were default (MAX_CONCURRENCY_LEVEL=8 and STATEMENT_QUEUED_TIMEOUT_IN_SECONDS=0) and was changed to (in order to handle those errors):
MAX_CONCURRENCY_LEVEL - 32
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS - 600
I want to inform that we've found the problem - When the project was in it's beginning, we've created a VPC with static IP to the cloud run instance.
Unfortunately, the maximum number of connections to a single VPC network is 25..

When I get 'services has reached steady state', in Amazon ECS does it means some tasks had stopped?

Does this means that my service tasks are stopping or it's ok to get these log messages?
actually opposite this. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments.
No it doesn't mean that any of your tasks had stopped. If a task stops you will see an event that clearly states so and will include a link to the specific task that was stopped. For example you will get something like this "service xxx has stopped 1 running tasks: task xxx."
If no tasks have been created or stopped in the last six hours the ECS console will duplicate the last event message to let you know that everything works as expected.
From the ECS docs:
"To ensure that this event view is helpful, we only show the 100 most recent events and duplicate event messages are omitted until either the cause is resolved or six hours passes. If the cause is not resolved within six hours, you will receive another service event message for that cause."
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html
Check this thread here on the aws forums. https://forums.aws.amazon.com/thread.jspa?threadID=182793
This sounds like normal behavior. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments. Are you seeing any issues?

Cloud Run finishes but Cloud Scheduler thinks that job has failed

I have a Cloud Run service setup and I have a Cloud Scheduler task that calls an endpoint on that service. When the task completes (http handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and successfully exists. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.
Does your job take longer than 120 seconds? I was having the same issue and figured out node versions prior to 13 has 120 seconds server.timeout limit. I installed node 13 on docker and problem is gone.
Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute, 8 seconds before, I see ReplaceService. This tells me that your service is not yet in a running state and that if you retry later, you will see success.
I've run an incremental sleep test on my FLASK endpoint which returns 200 within 1 min, 2 min and 10 min of waiting time. Having triggered the endpoint via the Cloud Scheduler, the job failed only in the 10 min test. I've found that it was one of the properties of my Cloud Scheduler job causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called 'attemptDeadline' which was set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update

How to fix CloudRun error 'The request was aborted because there was no available instance'

I'm using managed CloudRun to deploy a container with concurrency=1. Once deployed, I'm firing four long-running requests in parallel.
Most of the time, all works fine -- But occasionally, I'm facing 500's from one of the nodes within a few seconds; logs only provide the error message provided in the subject.
Using retry with exponential back-off did not improve the situation; the retries also end up with 500s. StackDriver logs also do not provide further information.
Potentially relevant gcloud beta run deploy arguments:
--memory 2Gi --concurrency 1 --timeout 8m --platform managed
What does the error message mean exactly -- and how can I solve the issue?
This error message can appear when the infrastructure didn't scale fast enough to catch up with the traffic spike. Infrastructure only keeps a request in the queue for a certain amount of time (about 10s) then aborts it.
This usually happens when:
traffic suddenly largely increase
cold start time is long
request time is long
We also faced this issue when traffic suddenly increased during business hours. The issue is usually caused by a sudden increase in traffic and a longer instance start time to accommodate incoming requests. One way to handle this is by keeping warm-up instances always running i.e. configuring --min-instances parameters in the cloud run deploy command. Another and recommended way is to reduce the service cold start time (which is difficult to achieve in some languages like Java and Python)
I also experiment the problem. Easy to reproduce. I have a fibonacci container that process in 6s fibo(45). I use Hey to perform 200 requests. And I set my Cloud Run concurrency to 1.
Over 200 requests I have 8 similar errors. In my case: sudden traffic spike and long processing time. (Short cold start for me, it's in Go)
I was able to resolve this on my service by raising the max autoscaling container count from 2 to 10. There really should be no reason that 2 would be even close to too low for the traffic, but I suspect something about the Cloud Run internals were tying up to 2 containers somehow.
Setting the Max Retry Attempts to anything but zero will remedy this, as it did for me.

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.