Cloud Run crashes after 121 seconds - google-cloud-platform

I'm triggering a long-running scraping Cloud Run function with a Pub/Sub topic and subscription trigger. Every time I run it, it crashes after 121.8 seconds, but I don't understand why.
POST 503 556B 121.8s APIs-Google; (+https://developers.google.com/webmasters/APIs-Google.html) https://????.a.run.app/
The request failed because either the HTTP response was malformed or connection to the instance had an error.
I've got a built-in timeout trigger. When I set it to 1 minute the function runs without any problems, but when I set it to 2 minutes the above error is triggered. So it must be something with the Cloud Run or subscription timeout settings, but I've already tried increasing those (read more below).
Things involved
1 x Cloud Run
1 x Pub/Sub subscription
1 x Pub/Sub topic
These are the things I've checked
The timeout of the Cloud Run instance (900 sec)
The timeout of the Pub/Sub subscription (Acknowledgement deadline - 600 sec & Message retention duration - 10 minutes)
I've increased the memory to 4 GB, which is way above what is needed.
Anyone who can point me in the right direction?

This is almost certainly due to Node.js's default server timeout of 120 seconds.
Try server.setTimeout(0) on the HTTP server to remove this timeout.
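A minimal sketch of that suggestion, assuming the service is built on Node's built-in http module. The handler and the createServer wrapper are hypothetical stand-ins for the real scraping code, and the listen call is left as a comment so the snippet can be run on its own:

```javascript
const http = require('http');

// Hypothetical handler standing in for the long-running scraping job.
function handler(req, res) {
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('done');
}

function createServer() {
  const server = http.createServer(handler);
  // 0 disables the idle socket timeout on this server.
  server.setTimeout(0);
  return server;
}

const server = createServer();
console.log(server.timeout); // 0

// In the real service (assumption: the port comes from the PORT env var):
// server.listen(process.env.PORT || 8080);
```

If the service uses Express instead, app.listen() returns the same kind of http.Server object, so setTimeout(0) can be called on its return value.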

Related

AWS Glue job run not respecting Timeout and not stopping

I am running AWS Glue jobs using PySpark. They have a Timeout of 1440 minutes set (as visible in the screenshot), which is 24 hours. Nevertheless, the jobs keep running past those 24 hours.
When this particular job had been running for over 5 days I stopped it manually (clicking stop icon in column "Run status" in GUI visible on the screenshot). However, since then (it has been over 2 days) it still hasn't stopped - the "Run status" is Stopping, not Stopped.
Additionally, after about 4 hours of running, new logs (column "Logs") in CloudWatch for this Job Run stopped appearing (my PySpark script has print() statements that regularly and frequently log extra data). Also, the last error log in CloudWatch (column "Error logs") was written 24 seconds after the newest log in "Logs".
This behaviour continues for multiple jobs.
My questions are:
What could be the reasons for Job Runs not obeying the set Timeout value? How can I fix that?
Why is the newest log from 4 hours after the start of the Job Run, when logs should appear regularly throughout the (desired) 24 hours of the Job Run?
Why don't the Job Runs stop when I try to stop them manually? How can they be stopped?
Thank you in advance for your advice and hints.

OperationalError & ConnectTimeoutError When running multiple queries in snowflake (From many cloud run instances)

My platform runs on GCP Cloud Run. The database we use is Snowflake.
Once a week, we schedule (with Cloud Scheduler) a job that triggers up to 200 tasks (currently; this will probably grow in the future). All tasks are added to a certain queue.
Each task is essentially a push POST call to a Cloud Run instance.
Each Cloud Run instance handles one request (see also the environment settings), meaning one task at a time. Moreover, each Cloud Run instance has 2 active sessions to 2 databases in Snowflake (one for each). The first session is for "global_db" and the other is for a specific "person_id" db. (Note: there might be 2 active sessions to the same person_id db from different Cloud Run instances.)
Issues:
1 - When I set the task queue's "Max concurrent dispatches" to 1000, I get 503 ("The request failed because the instance failed the readiness check.")
The issue was probably GCP autoscaling capacity. SOLVED by decreasing "Max concurrent dispatches" to a reasonable number that GCP can handle.
2 - When I set the task queue's "Max concurrent dispatches" to more than 10, I get multiple ConnectTimeoutError & OperationalError errors with the following messages (I removed the long IDs and put {} placeholders in their place to make the messages shorter):
sqlalchemy.exc.OperationalError: (snowflake.connector.errors. ) 250003: Failed to execute request: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={REQUEST_ID}&databaseName={DB_NAME}&warehouse={NAME}&request_guid={GUID} (Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3e583ff91550>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
(Background on this error at: http://sqlalche.me/e/13/e3q8)
snowflake.connector.vendored.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={ID}&databaseName={NAME}&warehouse={NAME}&request_guid={GUID}(Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3eab877b3ed0>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
Any ideas how I can solve it?
Ask any questions you have, and I will elaborate.
environment settings -
cloud tasks queue - Checked multiple configurations for "Max concurrent dispatches", from 10 to 1000. Max attempts is 1, max dispatches is 500.
cloud run - 5 hot instances, 1 request per instance. Can autoscale to a maximum of 1000 instances.
snowflake - ACCOUNT parameters were at their defaults (MAX_CONCURRENCY_LEVEL=8 and STATEMENT_QUEUED_TIMEOUT_IN_SECONDS=0) and were changed (in order to handle those errors) to:
MAX_CONCURRENCY_LEVEL - 32
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS - 600
I want to let you know that we've found the problem: when the project was in its early days, we created a VPC with a static IP for the Cloud Run instance.
Unfortunately, the maximum number of connections to a single VPC network is 25.

AWS SQS Polling Issue

I have encountered a weird SQS situation for which I can't find a satisfying answer.
I created a delay queue that should delay (what a surprise) incoming events for 4 seconds and then they should be processed by lambda. Order is not an issue here.
The issue, though, is that the "approximate age of the oldest message" metric (stat. Max) sometimes reaches over 1 minute, which is weird since there aren't that many messages, as you can see in the picture. My expectation would be that each event gets processed immediately after the 4-second delay.
The reserved concurrency level of that lambda is 50 so the sqs poller should have no problem invoking more lambda instances if there is too much traffic. But traffic isn't really a problem as you can see.
The queue is configured like this:
Default visibility timeout: 120 sec
Delivery delay: 4 sec
Dead-letter-queue: No (It is only one event generated by aws, so no bad pills)
Message retention period: 4 days
The lambda config:
Batch size: 5 (Tried also 1 or 10. Not much of a difference for the mentioned metric)
Batch window: None
reserved concurrency: 50
timeout: 20 secs
I can't explain the reason for those old messages (ApproximateAgeOfOldestMessage). Any help would be highly appreciated
Best
Patrick
I contacted AWS Support. Apparently it is a bug on the AWS side:
Response from AWS Support:
I have just received an update from the backend service team and the
team has confirmed that they have identified an issue of unexpected
spikes in "ApproximateAgeOfOldestMessage" metrics that triggers when
messages are sent to SQS with a configured delay. This issue's root
cause is that our internal system uses recently processed delayed
messages to calculate the "ApproximateAgeOfOldestMessage," which
results in a higher than the actual value for
"ApproximateAgeOfOldestMessage" metrics. They have now identified a
fix for this issue and will start deploying the fix soon. After this
update, when messages are sent to Amazon SQS with a configured delay,
you may see the "ApproximateAgeOfOldestMessage" metrics value come
down for the queues to the accurate value.
So if you encounter the same problem, you have to wait for the fix mentioned above. I hope it arrives soon.

Cloud Run finishes but Cloud Scheduler thinks that job has failed

I have a Cloud Run service setup and I have a Cloud Scheduler task that calls an endpoint on that service. When the task completes (http handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and exits successfully. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.
Does your job take longer than 120 seconds? I was having the same issue and figured out that Node versions prior to 13 have a 120-second server.timeout limit. I installed Node 13 in my Docker image and the problem is gone.
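To check which default applies inside a given container image, a small probe like the following could be run with that image's Node binary. This is only a sketch; defaultServerTimeoutMs is a hypothetical helper name, and the probe server is never started:

```javascript
const http = require('http');

// Hypothetical helper: reports the idle timeout (in ms) that this Node
// runtime assigns to a freshly created HTTP server.
function defaultServerTimeoutMs() {
  const probe = http.createServer();
  return probe.timeout;
}

const major = Number(process.versions.node.split('.')[0]);
const timeoutMs = defaultServerTimeoutMs();

// 120000 is expected before Node 13, and 0 (no timeout) from Node 13 onward.
console.log(`Node ${major}: default server.timeout = ${timeoutMs} ms`);
```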
Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute, 8 seconds before, I see ReplaceService. This tells me that your service is not yet in a running state and that if you retry later, you will see success.
I ran an incremental sleep test on my Flask endpoint, which returns 200 after waiting 1 min, 2 min and 10 min. When I triggered the endpoint via Cloud Scheduler, the job failed only in the 10 min test. I found that one of the properties of my Cloud Scheduler job was causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called 'attemptDeadline' which was set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update

Google Cloud Function Timeout Setting doesn't work

I can't get a Google Cloud Function to run for more than 60 seconds, even when the timeout is set to 540 seconds! Any suggestions?
I set the timeout flag on deployment to --timeout=540, and I know the setting goes through, because the 540-second timeout appears in the GCP web UI. I have also tried manually editing the timeout to 540 through the GCP web UI. In either case I still get DEADLINE_EXCEEDED after just ~62000 ms.
I have tried both Pub/Sub and HTTP triggers for the function, but still get the premature timeout at ~60 s.
I'm running the latest CLI, with these function settings:
trigger: http/pubsub (both tested, same result)
availableMemoryMb: 2048
runtime: nodejs6
status: ACTIVE
timeout: 540s
Thanks for any inputs!
Br Markus
I used the delay code from the documentation and ran a Cloud Function with the same specifications as yours. In the documentation, execution is delayed by 120000 ms (2 minutes). I changed that to 500000 ms. Together with the normal time the function takes to execute, this gets close to the desired execution time (around 9 minutes). If you use 540000 instead, the function ends with a timeout error at ~540025 ms, because the delay by itself reaches the function's timeout limit, which is also the maximum timeout of a Cloud Function: 9 minutes.
I also tried creating the function using this command:
gcloud functions deploy [FUNCTION_NAME] --trigger-http --timeout=540
After successful deployment, I updated the code manually in the GCP Cloud Function UI as follows:
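exports.timeoutTest = (req, res) => {
  // Respond only after 500000 ms (~8.3 minutes), below the 540 s function timeout.
  setTimeout(() => {
    const message = req.query.message || req.body.message || 'Hello World today!';
    // send() also ends the response, so no separate res.end() is needed.
    res.status(200).send(message);
  }, 500000);
};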
Both times the Cloud Function was executed and returned with status code 200. This means that you can set a timeout to be more than 60 secs which is the default value.
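The arithmetic behind choosing 500000 ms rather than 540000 ms can be sketched as a small helper. The name clampDelayMs and the 5000 ms margin are assumptions made for illustration and do not come from the Cloud Functions documentation:

```javascript
// Hypothetical helper: keep a requested delay safely below the function timeout.
// marginMs is an assumed allowance for startup and response overhead.
function clampDelayMs(requestedMs, timeoutSeconds, marginMs = 5000) {
  const maxDelayMs = timeoutSeconds * 1000 - marginMs;
  return Math.max(0, Math.min(requestedMs, maxDelayMs));
}

console.log(clampDelayMs(500000, 540)); // 500000: fits within the 540 s timeout
console.log(clampDelayMs(540000, 540)); // 535000: trimmed, since 540000 ms alone would hit the limit
```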
If you have checked everything and still have this issue, I recommend starting afresh: create a new Cloud Function and use the documentation link I provided.
The 60-second timeout is not coming from the GCP Cloud Function setting. For instance, if this is a Django/Gunicorn app, the timeout comes from Gunicorn's own timeout, which is set in app.yaml:
entrypoint: gunicorn -t 3600 -b :$PORT project_name.wsgi
For instance, this will achieve a timeout of 3600 seconds for gunicorn.
I believe I'm some years late but here is my suggestion.
If you're using the "Test the function" button in the "Testing tab" of the Cloud Function (in the gcp "Cloud Console") it says right next to the button that:
Testing in the Cloud Console has a 60s timeout. Note that this is different from the limit set in the function configuration.
I hope you fixed it and this answer can help someone in the future.
Update: Second try ("Test the function") was precisely 9 minutes
From: 23:15:38
Till: 23:24:38
That is exactly 9 minutes, although the message again mentioned only 60 seconds and popped up much earlier than the actual stop.
Function execution took 540004 ms, finished with status: 'timeout'
This time, with a lot of memory (2 GB), it was clearly the timeout that stopped it. My guess is that the message pops up earlier simply because it has not been programmed in detail. You should always look at the logs to see what is happening.
I guess that the core of your question is outdated, then: at least as of 01/2022, you do get the configured timeout regardless of what you may read, and you can simply ignore the messages.
First try ("Test the function"): stopped after 8 minutes, once the memory limit was reached
A screenshot of what it looks like in 2022/01 if you go over the 60 seconds (with a 540 s maximum timeout set for this example function in the "Edit" menu of the Cloud Function):
Function being tested has exceeded the 60s timeout imposed by the Cloud Functions testing utility.
Yet, in reality, when using just the "Testing tab", the timeout is at least 300 s / 5 minutes, as stated next to the "Test the function" button:
Testing in the Cloud Console has a 5 minute timeout. Note that this is different from the limit set in the function configuration.
But it is even more than that. I know from testing (started from the "Testing tab" --> "Test the function" in the Cloud Function) that you have at least 8 minutes:
From: 22:31:43
Till: 22:39:53
This run was stopped first by the 256 MB memory limit and only secondly by time (it is a bit unclear why both messages appeared).
Therefore, rather than asking why you get only a 60-second timeout, your question should perhaps be why these messages are wrong (as in my case). Perhaps GCP did not make the effort to parametrize the messages for each function.
Perhaps you get even slightly more time when you start with gcloud from terminal, but that is not so likely since 9 minutes are the maximum anyway.