OperationalError & ConnectTimeoutError when running multiple queries in Snowflake (from many Cloud Run instances) - google-cloud-platform

My platform runs on GCP Cloud Run. The database we use is Snowflake.
Once a week, we schedule (with Cloud Scheduler) a job that triggers up to 200 tasks (currently; this number will probably grow in the future). All tasks are added to a single queue.
Each task is a push POST call to a Cloud Run instance.
Each Cloud Run instance handles one request (see also the environment settings below), meaning one task at a time. Moreover, each Cloud Run instance holds 2 active sessions to 2 databases in Snowflake (one for each): the first session is for "global_db" and the other is for a specific "person_id" database. (Note: there might be 2 active sessions to the same person_id database from different Cloud Run instances.)
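For context, the per-instance connection setup looks roughly like this (a simplified sketch via snowflake-sqlalchemy; the account, user, warehouse and per-person database naming below are placeholders, not our real values):
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

def open_sessions(person_id):
    common = dict(
        account="xxxxx.us-central1.gcp",  # placeholder account locator
        user="app_user",                  # placeholder credentials
        password="********",
        warehouse="app_wh",               # placeholder warehouse
    )
    # One engine/connection per database: the shared "global" DB and the per-person DB.
    global_conn = create_engine(URL(database="GLOBAL_DB", **common)).connect()
    person_conn = create_engine(URL(database=f"PERSON_{person_id}_DB", **common)).connect()  # hypothetical naming
    return global_conn, person_conn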
Issues:
1 - When the task queue's "Max concurrent dispatches" is set to 1000, I get 503 errors ("The request failed because the instance failed the readiness check.").
The issue was probably GCP autoscaling capacity - SOLVED by decreasing "Max concurrent dispatches" to a number GCP can keep up with (a sketch of the queue update is shown after the error messages below).
2 - When the task queue's "Max concurrent dispatches" is set to more than 10,
I get multiple ConnectTimeoutError & OperationalError exceptions, with the following messages (I removed the long IDs and put {} placeholders to keep the messages shorter):
sqlalchemy.exc.OperationalError: (snowflake.connector.errors. ) 250003: Failed to execute request: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={REQUEST_ID}&databaseName={DB_NAME}&warehouse={NAME}&request_guid={GUID} (Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3e583ff91550>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
(Background on this error at: http://sqlalche.me/e/13/e3q8)
snowflake.connector.vendored.urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='*****.us-central1.gcp.snowflakecomputing.com', port=443): Max retries exceeded with url: /session/v1/login-request?request_id={ID}&databaseName={NAME}&warehouse={NAME}&request_guid={GUID}(Caused by ConnectTimeoutError(<snowflake.connector.vendored.urllib3.connection.HTTPSConnection object at 0x3eab877b3ed0>, 'Connection to *****.us-central1.gcp.snowflakecomputing.com timed out. (connect timeout=60)'))
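For completeness, lowering "Max concurrent dispatches" (the fix for issue 1) can also be done programmatically, roughly as below; the project, location and queue names are placeholders, and the same change can be made from the console:
from google.cloud import tasks_v2
from google.protobuf import field_mask_pb2

client = tasks_v2.CloudTasksClient()
queue = tasks_v2.Queue(
    name=client.queue_path("my-project", "us-central1", "my-queue"),  # placeholders
    rate_limits=tasks_v2.RateLimits(max_concurrent_dispatches=10),
)
# Only touch the concurrent-dispatch limit; leave the rest of the queue config alone.
update_mask = field_mask_pb2.FieldMask(paths=["rate_limits.max_concurrent_dispatches"])
client.update_queue(request={"queue": queue, "update_mask": update_mask})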
Any ideas how I can solve it?
Ask any questions you have, and I will elaborate.
Environment settings -
Cloud Tasks queue - I checked multiple configurations for "Max concurrent dispatches", from 10 to 1000. Max attempts is 1, max dispatches is 500.
Cloud Run - 5 warm instances, 1 request per instance. Can autoscale up to a maximum of 1000 instances.
Snowflake - the ACCOUNT parameters were at their defaults (MAX_CONCURRENCY_LEVEL=8 and STATEMENT_QUEUED_TIMEOUT_IN_SECONDS=0) and were changed (in order to handle those errors) to:
MAX_CONCURRENCY_LEVEL - 32
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS - 600
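The parameter change itself was applied with ALTER ACCOUNT statements, roughly as in the sketch below (it requires a role allowed to alter account parameters, e.g. ACCOUNTADMIN; the credentials are placeholders):
import snowflake.connector

conn = snowflake.connector.connect(
    account="xxxxx.us-central1.gcp",  # placeholder account locator
    user="admin_user",                # placeholder credentials
    password="********",
)
try:
    cur = conn.cursor()
    cur.execute("ALTER ACCOUNT SET MAX_CONCURRENCY_LEVEL = 32")
    cur.execute("ALTER ACCOUNT SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 600")
finally:
    conn.close()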

I want to add that we've found the problem - when the project was in its beginning, we created a VPC with a static IP for the Cloud Run instances.
Unfortunately, the maximum number of connections to a single VPC network is 25, which the concurrent Cloud Run instances quickly exceeded.

Related

Getting Cloud Run Rate exceeded error, when just two requests are being processed

Cloud Run is configured with the default concurrency of 80, so when I was testing two simultaneous connections, how could a "Rate exceeded" error be thrown?
What happens if the number of requests exceeds the concurrency? Suppose concurrency is set to two; if the third, fourth and fifth requests come in while the first and second requests have not finished, do these requests wait for up to the request timeout, or are they not served at all?

Google Cloud Run not scaling up despite large backlog and available instances

I am seeing something similar to this post. It looked like additional detail was needed to answer that question, so I'm re-asking with my details since those details weren't provided.
I am running a modified version of the Google Cloud Run image processing tutorial example.
I am inserting tasks into a task queue using this create tasks snippet. The tasks from the queue get pushed to my cloud run instance.
The problem is it isn't scaling up and making it through my tasks in a timely manner.
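For reference, the task creation mentioned above roughly follows the standard google-cloud-tasks pattern; the project, queue, URL and payload below are placeholders rather than my real values:
from google.cloud import tasks_v2
import json

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")  # placeholders

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://my-service-xxxxx-uc.a.run.app/process",  # placeholder Cloud Run URL
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"image": "gs://my-bucket/input.png"}).encode(),  # hypothetical payload
    }
}
client.create_task(request={"parent": parent, "task": task})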
My cloud run service configuration:
I have tried setting a minimum of both 0 and 50 instances
I have tried a maximum of 100 and 1000 instances
I have tried --concurrency=1, 2, and 8
I have tried with --async and without --async
With 50 instances pre-allocated even with concurrency set to 1, I am typically seeing ~10 active container instances and ~40 idle container instances. I have ~30,000 tasks in the queue and it is getting through ~5 jobs/minute.
My task queue has the default settings. My containers aren't using a lot of CPU, but they are using a lot of memory.
A process takes about a minute to complete. I'm only running one process per container instance. What additional parameters should be set to get higher throughput?
Edit - adding additional logs
I enabled the logs for the queue, I'm seeing some errors for some of the jobs. The errors look like this:
{
  insertId: "<my_id>"
  jsonPayload: {
    #type: "type.googleapis.com/google.cloud.tasks.logging.v1.TaskActivityLog"
    attemptResponseLog: {
      attemptDuration: "19.453155s"
      dispatchCount: "1"
      maxAttempts: 0
      responseCount: "0"
      retryTime: "2021-10-20T22:45:51.559121Z"
      scheduleTime: "2021-10-20T16:42:20.848145Z"
      status: "UNAVAILABLE"
      targetAddress: "POST <my_url>"
      targetType: "HTTP"
    }
    task: "<my_task>"
  }
  logName: "<my_log_name>"
  receiveTimestamp: "2021-10-20T22:45:52.418715942Z"
  resource: {
    labels: {
      location: "us-central1"
      project_id: "<my_project>"
      queue_id: "<my-queue>"
      target_type: "HTTP"
    }
    type: "cloud_tasks_queue"
  }
  severity: "ERROR"
  timestamp: "2021-10-20T22:45:51.459232147Z"
}
I don't see errors in the cloud run logs.
Edit - Additional Debug Information
I tried to take the queue out of the equation to determine whether the problem is Cloud Run or the queue. Instead, I used curl directly to POST to the URL. Some of the tasks ran successfully; for others I received an error. In the logs below, empty lines are successful requests:
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
upstream connect error or disconnect/reset before headers. reset reason: connection termination
This makes me think cloud run isn't handling all of the incoming requests.
Edit - task completion time test
I wanted to test if the time it takes to complete a task causes any issues with CloudRun and the Queue scaling up and keeping up with the tasks.
In place of the task I actually want to run, I put a dummy task that just sleeps for n seconds and prints the task details to stdout (which I can read in the Cloud Run logs).
With n set to 0, 5, or 10 seconds I see the number of instances scale up and keep up with the tasks being added to the queue. With n set to 20 seconds or more I see that fewer Cloud Run instances are instantiated and items accumulate in the task queue. I also see more errors with the UNAVAILABLE status in my logs.
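The dummy task handler is essentially the following (a simplified sketch; the sleep_seconds field name is made up and the real payload has more details):
import os
import time
from flask import Flask, request

app = Flask(__name__)

@app.route("/", methods=["POST"])
def handle_task():
    payload = request.get_json(silent=True) or {}
    n = int(payload.get("sleep_seconds", 0))  # hypothetical field carrying n
    print(f"task details: {payload}", flush=True)  # visible in the Cloud Run logs
    time.sleep(n)
    return ("", 204)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))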
According to this post:
Cloud Run offers a longer request timeout duration of up to 60 minutes
So it seems that long-running tasks are expected. Is this a Google bug, or am I missing some parameter?
I do not think this is a Cloud Run service problem. I think this is an issue with how you have Cloud Tasks set up.
The dates in the log entry look odd. Take a look at the receiveTimestamp and the scheduleTime. The task is scheduled for six hours before the receive time. Do you have a timezone problem?
According to the documentation, if the response_time is not set then the task was not attempted. It looks like you are scheduling tasks incorrectly and the tasks never run.
Search for the text "The status of a task attempt." in this link:
Types for Google Cloud Tasks
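If a delayed start is actually intended, build schedule_time from an explicit UTC timestamp so no local offset sneaks in; a rough sketch follows (names and URL are placeholders), and if the task should run immediately, simply omit schedule_time:
from datetime import datetime, timedelta, timezone
from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
parent = client.queue_path("my-project", "us-central1", "my-queue")  # placeholders

# Build the schedule time from an explicit UTC datetime.
run_at = datetime.now(timezone.utc) + timedelta(seconds=30)
schedule_time = timestamp_pb2.Timestamp()
schedule_time.FromDatetime(run_at)

task = {
    "http_request": {
        "http_method": tasks_v2.HttpMethod.POST,
        "url": "https://my-service-xxxxx-uc.a.run.app/process",  # placeholder
    },
    "schedule_time": schedule_time,
}
client.create_task(request={"parent": parent, "task": task})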

Cloud Run crashes after 121 seconds

I'm triggering a long-running scraping Cloud Run function with a Pub/Sub topic and subscription trigger. Every time I run it, it crashes after 121.8 seconds, but I don't understand why.
POST 503 556B 121.8s APIs-Google; (+https://developers.google.com/webmasters/APIs-Google.html) https://????.a.run.app/
The request failed because either the HTTP response was malformed or connection to the instance had an error.
I've got a built-in timeout trigger, and when I set it to 1 minute the function runs without any problems, but when I set it to 2 minutes the above error gets triggered. So it must be something with the Cloud Run or subscription timeout settings, but I've already tried increasing those (read more below).
Things involved
1 x Cloud Run
1 x Pub/Sub subscription
1 x Pub/Sub topic
These are the things I've checked
The timeout of the Cloud Run instance (900 sec)
The timeout of the Pubsub subscription (Acknowledgement deadline - 600 sec & Message retention duration - 10 minutes)
I've increased the memory to 4GB, which is way above what is needed.
Anyone who can point me in the right direction?
This is almost certainly due to Node.js' default server timeout of 120 seconds.
Try server.setTimeout(0) to remove this timeout.

Cloud Run finishes but Cloud Scheduler thinks that job has failed

I have a Cloud Run service setup and I have a Cloud Scheduler task that calls an endpoint on that service. When the task completes (http handler returns), I'm seeing the following error:
The request failed because the HTTP connection to the instance had an error.
However, the actual handler returns HTTP 200 and successfully exits. Does anyone know what this error means and under what circumstances it shows up?
I'm also attaching a screenshot of the logs.
Does your job take longer than 120 seconds? I was having the same issue and figured out that Node versions prior to 13 have a 120-second server.timeout limit. I installed Node 13 in the Docker image and the problem is gone.
Error 503 is returned by the Google Frontend (GFE). The Cloud Run service either has a transient issue, or the GFE has determined that your service is not ready or not working correctly.
In your log entries, I see a POST request. 7 ms later is the error 503. This tells me your Cloud Run application is not yet ready (in a ready state determined by Cloud Run).
One minute, 8 seconds before, I see ReplaceService. This tells me that your service is not yet in a running state and that if you retry later, you will see success.
I've run an incremental sleep test on my Flask endpoint, which returns 200 after 1 min, 2 min and 10 min of waiting time. Having triggered the endpoint via Cloud Scheduler, the job failed only in the 10 min test. I found that it was one of the properties of my Cloud Scheduler job causing the failure. The following solved my issue.
gcloud scheduler jobs describe <my_test_scheduler>
There, you'll see a property called 'attemptDeadline' which was set to 180 seconds by default.
You can update that property using:
gcloud scheduler jobs update http <my_test_scheduler> --attempt-deadline 1000s
Ref: scheduler update

Orderer disconnections in a Hyperledger Fabric application

We have a Hyperledger application. The main application is hosted on AWS VMs, whereas the DR is hosted on Azure VMs. Recently the Microsoft team identified that one of the DR VMs became unavailable, and availability was restored in approximately 8 minutes. As per Microsoft: "This unexpected occurrence was caused by an Azure initiated auto-recovery action. The auto-recovery action was triggered by a hardware issue on the physical node where the virtual machine was hosted. As designed, your VM was automatically moved to a different and healthy physical node to avoid further impact." The Zookeeper VM was also redeployed at the same time.
The day after this event occurred, we started noticing that an orderer goes offline and comes back online a few seconds later. This disconnection/reconnection occurs regularly, with a gap of 12 hours and 10 minutes.
We have noticed two things.
In the log we get:
[orderer/consensus/kafka] startThread -> CRIT 24df [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
panic: [channel: testchainid] Cannot set up channel consumer = kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
goroutine 52 [running]:
github.com/hyperledger/fabric/vendor/github.com/op/go-logging.(*Logger).Panicf(0xc4202748a0, 0x108dede, 0x31, 0xc420327540, 0x2, 0x2)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/vendor/github.com/op/go-logging/logger.go:194 +0x134
github.com/hyperledger/fabric/orderer/consensus/kafka.startThread(0xc42022cdc0)
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:261 +0xb33
created by github.com/hyperledger/fabric/orderer/consensus/kafka.(*chainImpl).Start
    /w/workspace/fabric-binaries-x86_64/gopath/src/github.com/hyperledger/fabric/orderer/consensus/kafka/chain.go:126 +0x3f
Another thing we noticed is that in the logs prior to the VM failure event there were 3 Kafka brokers, but we can see only 2 Kafka brokers in the logs after this event.
Can someone guide me on this? How do I resolve this problem?
Additional information - we have been through the Kafka logs from the day after the VM was redeployed and we noticed the following:
org.apache.kafka.common.network.InvalidReceiveException: Invalid receive (size = 1195725856 larger than 104857600)
    at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:132)
    at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:93)
    at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:231)
    at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:192)
    at org.apache.kafka.common.network.Selector.attemptRead(Selector.java:528)
    at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:469)
    at org.apache.kafka.common.network.Selector.poll(Selector.java:398)
    at kafka.network.Processor.poll(SocketServer.scala:535)
    at kafka.network.Processor.run(SocketServer.scala:452)
    at java.lang.Thread.run(Thread.java:748)
It seems that we have a solution but it needs to be validated. Once the solution is validated, I will post it on this site.