Cloud Run finishes but Cloud Scheduler thinks that job has failed - google-cloud-platform

I have a Cloud Run service set up and a Cloud Scheduler task that calls an endpoint on that service. When the task completes (the HTTP handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and exits successfully. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.

Does your job take longer than 120 seconds? I had the same issue and found that Node.js versions prior to 13 have a default server.timeout of 120 seconds. After I installed Node 13 in my Docker image, the problem went away.
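To check whether this limit applies to your runtime, you can inspect the timeout on the HTTP server object. This is only a minimal sketch: describeServerTimeout is a hypothetical helper name, and it assumes your app uses Node's built-in http server (frameworks such as Express expose the same object as the return value of listen).

```javascript
const http = require('http');

// Hypothetical helper: report the socket inactivity timeout of a server.
// server.timeout is in milliseconds; 0 means the timeout is disabled.
function describeServerTimeout(server) {
  const ms = server.timeout;
  return ms === 0
    ? 'no socket timeout'
    : `sockets time out after ${ms / 1000}s of inactivity`;
}

const server = http.createServer((req, res) => res.end('ok'));
// Prints the default for whichever Node.js version runs this file.
console.log(`Node ${process.version}: ${describeServerTimeout(server)}`);
```

If the reported value is 120s, any request that runs longer than two minutes is likely to be cut off by Node itself before Cloud Run's own timeout is reached.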

Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute, 8 seconds before, I see ReplaceService. This tells me that your service is not yet in a running state and that if you retry later, you will see success.

I ran an incremental sleep test on my Flask endpoint, which returns 200 after 1 min, 2 min and 10 min of waiting time. When I triggered the endpoint via Cloud Scheduler, the job failed only in the 10 min test. It turned out that one of the properties of my Cloud Scheduler job was causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called 'attemptDeadline' which was set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update
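The arithmetic behind the sleep test above can be sketched as follows. The helper names are hypothetical and the parser only handles the plain "<seconds>s" form shown in this answer; it is an illustration of the comparison, not part of any Google client library.

```javascript
// Hypothetical helper: parse a duration such as "180s" into seconds.
function parseDeadlineSeconds(deadline) {
  const match = /^(\d+(?:\.\d+)?)s$/.exec(deadline);
  if (!match) throw new Error(`unsupported duration: ${deadline}`);
  return Number(match[1]);
}

// True when the handler would still be running after the scheduler gives up.
function exceedsAttemptDeadline(handlerSeconds, attemptDeadline) {
  return handlerSeconds > parseDeadlineSeconds(attemptDeadline);
}

// The three sleep durations from the test, against the 180s value.
for (const minutes of [1, 2, 10]) {
  const fails = exceedsAttemptDeadline(minutes * 60, '180s');
  console.log(`${minutes} min handler vs 180s deadline: fails=${fails}`);
}
```

Only the 10 min case exceeds 180s, which matches what I observed. With the deadline raised to 1000s, the same 600-second handler fits.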

Related

Cloud Run Error 504 (Upstream Request Timeout) after successful deploy

I was following this tutorial from Google to deploy a service to Cloud Run (https://codelabs.developers.google.com/codelabs/cloud-run-hello-python3#5). In Cloud Shell my project is deployed successfully (screenshot below). However, once I click on the link I get a timeout. If I test it locally from Cloud Shell it works fine.
Why could this be happening? Where could I get more data about the issue?
As mentioned in the documentation:
For Cloud Run services, the request timeout setting specifies the time
within which a response must be returned by services deployed to Cloud
Run. If a response isn't returned within the time specified, the
request ends and error 504 is returned.
The timeout is set by default to 5 minutes and can be extended up to
60 minutes. You can change this setting when you deploy a container
image or by updating the service configuration. In addition to
changing the Cloud Run request timeout, you should also check your
language framework to see whether it has its own request timeout
setting that you must also update.
You can refer to this Public group issue which will be helpful in resolving the current error.
You can increase the timeout by clicking EDIT & DEPLOY NEW REVISION and then adjusting the Request timeout value.

Cloud Tasks Failing Dispatch with UNKNOWN(2): HTTP status code 0

I am running Cloud Tasks using OIDC authentication to trigger a Cloud Function. While running ~40K queued tasks, I received a number of task failures (noticed by the fact that the Retries counter was incremented) and when I inspected the Previous Run they all said Status code 2 (UNKNOWN), Reason to retry UNKNOWN(2): HTTP status code 0. Additionally, on inspection of the logs, it does not appear that my Cloud Function was triggered. Upon retry of all these tasks, the Cloud Function was triggered and the task was processed successfully.
I am unsure of what this code means and how to respond to it. Are these just par for the course when using Cloud Tasks? Is this definitely not going to trigger my Cloud Function or could it potentially trigger it while also returning this status? Can I protect against this in any way? Am I paying for these failed dispatches?

Airflow web-server produces temporary 502 errors in Cloud Composer

I'm encountering 502 errors on the Airflow (2.0.2) UI hosted in Cloud Composer (1.17.0).
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
They last for a few minutes and happen several times a day. After they are gone, everything works fine.
At the moment of errors:
there is a gap in the logs, after which the logs resume with messages about starting gunicorn:
[1133] [INFO] Starting gunicorn 19.10.0
there is a spike in resource usage of the web server
I didn't spot any other suspicious activity in other parts of the system (workers, scheduler, DB).
I think that this is the result of an OOM error, because we have DAGs with a large number of tasks (2k).
But I'd like to be sure, and I haven't found a way to connect to the App Engine VM in the tenant project (where the Airflow web server is hosted by default) to get additional logs.
Does anyone know a way to get additional logs from the Airflow server VMs, or have any other ideas?
The Cloud Composer documentation has a Troubleshooting DAGs section. It shows how to check individual worker logs, and it even mentions OOM issues (direct link).
In general the troubleshooting section is well documented, so you should be able to find a lot of useful information there. You can also use Cloud Monitoring and Cloud Logging to monitor Composer, but I am not sure whether this will be valuable in this use case (reference).

Google Cloud Run not scaling up despite large backlog and available instances

I am seeing something similar to this post. It looked like additional detail was needed to answer that question, so I'm re-asking with my details since those details weren't provided.
I am running a modified version of the Google Cloud Run image processing tutorial example.
I am inserting tasks into a task queue using this create tasks snippet. The tasks from the queue get pushed to my cloud run instance.
The problem is it isn't scaling up and making it through my tasks in a timely manner.
My cloud run service configuration:
I have tried setting a minimum of both 0 and 50 instances
I have tried a maximum of 100 and 1000 instances
I have tried --concurrency=1, 2, and 8
I have tried with --async and without --async
With 50 instances pre-allocated even with concurrency set to 1, I am typically seeing ~10 active container instances and ~40 idle container instances. I have ~30,000 tasks in the queue and it is getting through ~5 jobs/minute.
My tasks queue has the default settings. My containers aren't using a lot of cpu, but they are using a lot of memory.
A process takes about a minute to complete. I'm only running one process per container instance. What additional parameters should be set to get higher throughput?
Edit - adding additional logs
I enabled the logs for the queue, I'm seeing some errors for some of the jobs. The errors look like this:
{
insertId: "<my_id>"
jsonPayload: {
@type: "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog"
attemptResponseLog: {
attemptDuration: "19.453155s"
dispatchCount: "1"
maxAttempts: 0
responseCount: "0"
retryTime: "2021-10-20T22:45:51.559121Z"
scheduleTime: "2021-10-20T16:42:20.848145Z"
status: "UNAVAILABLE"
targetAddress: "POST <my_url>"
targetType: "HTTP"
}
task: "<my_task>"
}
logName: "<my_log_name>"
receiveTimestamp: "2021-10-20T22:45:52.418715942Z"
resource: {
labels: {
location: "us-central1"
project_id: "<my_project>"
queue_id: "<my-queue>"
target_type: "HTTP"
}
type: "cloud_tasks_queue"
}
severity: "ERROR"
timestamp: "2021-10-20T22:45:51.459232147Z"
}
I don't see errors in the cloud run logs.
Edit - Additional Debug Information
I tried to take the queue out of the equation to determine if it is cloud run or the queue. Instead I directly used curl to post to the url. Some of the tasks ran successfully, for others I received an error. In the below logs empty lines are successful:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This makes me think cloud run isn't handling all of the incoming requests.
Edit - task completion time test
I wanted to test whether the time it takes to complete a task causes any issues with Cloud Run and the queue scaling up and keeping up with the tasks.
In place of the task I actually want completed, I put a dummy task that just sleeps for n seconds and prints the task details to stdout (which I can read in the Cloud Run logs).
With n set to 0, 5, or 10 seconds, I see the number of instances scale up and keep up with the tasks being added to the queue. With n set to 20 seconds or more, I see that fewer Cloud Run instances are instantiated and items accumulate in the task queue. I also see more errors with the UNAVAILABLE status in my logs.
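For anyone who wants to reproduce this, a dummy task of the kind described above might look roughly like the sketch below. It is not my actual service code: it uses Node's built-in http module, and the ?n= query parameter, sleepSecondsFromUrl and createDummyTaskServer are names made up for illustration.

```javascript
const http = require('http');

// Hypothetical helper: read the sleep duration from the query string (?n=20).
// Missing, negative, or non-numeric values fall back to 0 seconds.
function sleepSecondsFromUrl(url) {
  const raw = new URL(url, 'http://localhost').searchParams.get('n');
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 ? n : 0;
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Build (but do not start) a server that sleeps n seconds per request.
function createDummyTaskServer() {
  return http.createServer(async (req, res) => {
    let body = '';
    for await (const chunk of req) body += chunk; // task payload
    const n = sleepSecondsFromUrl(req.url);
    await sleep(n * 1000);
    console.log(`task finished after ${n}s: ${body}`); // goes to stdout
    res.statusCode = 200;
    res.end('ok');
  });
}

// To start it: createDummyTaskServer().listen(process.env.PORT || 8080);
```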
According to this post:
Cloud Run offers a longer request timeout duration of up to 60 minutes
So it seems that long running tasks are expected. Is this a Google bug or am I missing setting some parameter?
I do not think this is a Cloud Run service problem. I think this is an issue with how you have Cloud Tasks set up.
The dates in the log entry look odd. Take a look at the receiveTimestamp and the scheduleTime. The task is scheduled for six hours before the receive time. Do you have a timezone problem?
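To put a number on that gap, here is a small sketch using the two values copied from the log entry in the question. toMillis and gapHours are hypothetical helper names; the fractional seconds are truncated to milliseconds before parsing because that is the precision the JavaScript Date format guarantees.

```javascript
// Hypothetical helper: parse an RFC 3339 timestamp, keeping millisecond
// precision only (extra fractional digits are dropped before parsing).
function toMillis(timestamp) {
  return Date.parse(timestamp.replace(/(\.\d{3})\d+/, '$1'));
}

// Elapsed hours between two timestamps.
function gapHours(earlier, later) {
  return (toMillis(later) - toMillis(earlier)) / 3600000;
}

// Values copied from the log entry in the question.
const scheduleTime = '2021-10-20T16:42:20.848145Z';
const receiveTimestamp = '2021-10-20T22:45:52.418715942Z';

console.log(`gap: ${gapHours(scheduleTime, receiveTimestamp).toFixed(2)} hours`);
// prints "gap: 6.06 hours"
```

Both values carry the Z (UTC) suffix, so the difference is computed on the same clock.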
According to the documentation, if the response_time is not set then the task was not attempted. It looks like you are scheduling tasks incorrectly and the tasks never run.
Search for the text "The status of a task attempt." in this link:
Types for Google Cloud Tasks

Cloud Run crashes after 121 seconds

I'm triggering a long-running scraping Cloud Run function with a Pub/Sub topic and subscription trigger. Every time I run it, it crashes after 121.8 seconds, but I don't understand why.
POST 503 556B 121.8s APIs-Google; (+https://developers.google.com/webmasters/APIs-Google.html) https://????.a.run.app/
The request failed because either the HTTP response was malformed or connection to the instance had an error.
I've got a built-in timeout trigger. When I set it to 1 minute, the function runs without any problems, but when I set it to 2 minutes, the above error is triggered. So it must be something with the Cloud Run or subscription timeout settings, but I've already tried to increase those (read more below).
Things involved
1 x Cloud Run
1 x Pub/Sub subscription
1 x Pub/Sub topic
These are the things I've checked
The timeout of the Cloud Run instance (900 sec)
The timeout of the Pub/Sub subscription (Acknowledgement deadline - 600 sec & Message retention duration - 10 minutes)
I've increased the memory to 4GB, which is well above what is needed.
Anyone who can point me in the right direction?
This is almost certainly due to Node.js' default server timeout of 120 seconds.
Try server.setTimeout(0) to remove this timeout.
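A minimal sketch of where that call goes, assuming the service is built on Node's built-in http server (with Express, the object returned by app.listen is the same kind of server). createLongRunningServer is a hypothetical name and the handler body is a placeholder for your scraping code.

```javascript
const http = require('http');

// Build the server without starting it; the handler is a placeholder.
function createLongRunningServer() {
  const server = http.createServer((req, res) => {
    // ... long-running work would go here ...
    res.end('done');
  });
  // 0 disables the inactivity timeout on incoming sockets.
  server.setTimeout(0);
  return server;
}

const server = createLongRunningServer();
// To start it: server.listen(process.env.PORT || 8080);
```

With the Node-level timeout out of the way, the 900 sec Cloud Run timeout you already configured should become the effective limit.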