App Engine Flexible deployment fails to become healthy in the allotted time - google-cloud-platform

My Flask app deployment via App Engine Flex is timing out. After setting debug=True, I see the following line repeating over and over until the deployment fails. I am not sure what this means and cannot find anything useful in Logs Explorer.
Updating service [default] (this may take several minutes)...working DEBUG: Operation [apps/enhanced-bonito-349015/operations/81b83124-17b1-4d90-abdc-54b3fa28df67] not complete. Waiting to retry.
Could anyone share advice on where to look to resolve this issue?
Here is my app.yaml (I thought this was due to a memory issue..):
runtime: python
env: flex
entrypoint: gunicorn -b :$PORT main:app
runtime_config:
  python_version: 3
resources:
  cpu: 4
  memory_gb: 12
  disk_size_gb: 1000
readiness_check:
  path: "/readines_check"
  check_interval_sec: 5
  timeout_sec: 4
  failure_threshold: 2
  success_threshold: 2
  app_start_timeout_sec: 300
Error logs:
ERROR: (gcloud.app.deploy) Error Response: [4] An internal error occurred while processing task /app-engine-flex/flex_await_healthy/flex_await_healthy>2022-05-10T23:21:10.941Z47607.vt.0: Your deployment has failed to become healthy in the allotted time and therefore was rolled back. If you believe this was an error, try adjusting the 'app_start_timeout_sec' setting in the 'readiness_check' section.

Here are some possible ways to resolve this kind of deployment error:
Increase the value of app_start_timeout_sec, up to the maximum of 1800 seconds.
If you use Endpoints and ESP, make sure that all the Google Cloud services they require are enabled on your project.
If the splitHealthChecks feature is enabled, make sure to follow all the steps needed when migrating from the legacy health checks.
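Before touching timeouts, it is worth confirming that the app actually answers the configured readiness path with HTTP 200; if it never does, Flex retries until app_start_timeout_sec expires and rolls the deployment back. A minimal, framework-agnostic WSGI sketch (the path mirrors the app.yaml above; in a real Flask app this would just be a route handler):

```python
# Minimal WSGI sketch of a readiness endpoint for App Engine Flex.
# Assumption: READINESS_PATH matches readiness_check.path in app.yaml.
READINESS_PATH = "/readines_check"

def app(environ, start_response):
    if environ.get("PATH_INFO") == READINESS_PATH:
        # Flex only marks the instance healthy once this returns 200
        # within timeout_sec.
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [b"ok"]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```

Any equivalent route in Flask (`@app.route("/readines_check")` returning 200) serves the same purpose.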

Related

invalid cloud build timeout?

We have a build that takes anywhere from 1 minute to 15 minutes (a monobuild that is not parallelized yet, so it may build 8 servers or 1). It was timing out, so I modified the build file to
steps:
- name: gcr.io/$PROJECT_ID/continuous-deploy
  timeout: 1200s
I also ran these commands (the last one failed, even though I got it from another post where it apparently worked)...
Deans-MacBook-Pro:orderly dean$ gcloud config set app/cloud_build_timeout 1250
Updated property [app/cloud_build_timeout].
Deans-MacBook-Pro:orderly dean$ gcloud config set builds/timeout 1300
Updated property [builds/timeout].
Deans-MacBook-Pro:orderly dean$ gcloud config set container/build-timeout 1350
ERROR: (gcloud.config.set) Section [container] has no property [build-timeout].
Deans-MacBook-Pro:orderly dean$
I get the following error saying that anything greater than 10 minutes is invalid:
invalid build: invalid timeout in build step #0: build step timeout "20m0s" must be <= build timeout "10m0s"
Why MUST it be less than 10m0s? I really need our builds to be about 20 minutes.
I was going off of
Why can't I override the timeout on my Google Cloud Build?
and
GCP Cloud build ignores timeout settings
thanks,
Dean
The timeout of each step must be less than or equal to the timeout of the whole build.
Setting the step-level timeout to 20 minutes causes the error because the whole build's timeout defaults to 10 minutes.
The way to avoid this is to set the timeout of the full build to be greater than or equal to the timeout of the longest step.
Here is a small example of how to define this:
steps:
- name: gcr.io/$PROJECT_ID/continuous-deploy
  timeout: 1200s # step timeout
timeout: 1200s # timeout for the whole build

trying to debug "502 Bad Gateway" error after deploying react app to gcp?

I've deployed a React app via "gcloud app deploy". The "gcloud app browse" command opens a browser which tries to load for a while but then displays a browser title of "502 Bad Gateway." I found the following troubleshooting page:
https://cloud.google.com/endpoints/docs/openapi/troubleshoot-response-errors#gae_errors
The following info on the troubleshooting page appears to be a good match for my scenario:
"An error code 502 with BAD_GATEWAY in the message usually indicates
that App Engine terminated the application because it ran out of
memory. The default App Engine flexible VM only has 1GB of memory,
with only 600MB available for the application container."
But I don't see any "out of memory" error reference in my logs for this. I think I probably need to ensure that I "gcloud app deploy" with a proper app.yaml file. I'm having problems identifying what is a valid minimum yaml file for my React app for which I can be assured that my "gcloud app deploy" will have the expected result. I found the following reference which appears to be a good starting point:
https://cloud.google.com/endpoints/docs/openapi/get-started-app-engine
^^^ This page refers to the following yaml sample code:
https://github.com/GoogleCloudPlatform/java-docs-samples/blob/master/endpoints/getting-started/src/main/appengine/app.yaml
But the URL refers to "java-docs-samples", so I'm not sure whether this is a valid yaml file for a React app deployment. Can you provide some guidance on this? I'm really just looking for the minimum yaml file that I can use for a successful deployment. This is the structure of the yaml file that I used for my initial "gcloud app deploy", and the deployment process appeared to indicate success, but I'm not sure if there is any type of fatal flaw here or anything else that may be missing:
runtime: nodejs
env: flex
manual_scaling:
  instances: 1
resources:
  cpu: 1
From what I understand, you just want a minimal working app.yaml for React apps, as the out-of-memory error seems to be the issue if everything else is correct.
A sample app.yaml for react is the following:
# [START runtime]
runtime: nodejs
env: flex
# [END runtime]
# [START handlers]
handlers:
- url: /
  static_files: index.html
  upload: index.html
# [END handlers]
But you need to modify the handlers according to your needs and configuration.
A 502 error sometimes indicates that your app itself has an issue, so it's better to test locally first and make sure the app works.
As for the memory part, you can try specifying an instance type with more memory. If it still throws the same error, then most likely the issue is within your app or its dependencies.
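For instance, raising the memory in app.yaml might look like the sketch below (the values are illustrative assumptions, not a recommendation; the Flex default is about 1 GB per VM):

```yaml
runtime: nodejs
env: flex
resources:
  cpu: 2
  memory_gb: 4       # raise above the ~1 GB default if the app is OOM-killed
  disk_size_gb: 10
```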
I think there is something about react-scripts start that Google Cloud doesn't like; I've had trouble with this (React app + Google Cloud deployment) twice in completely different environments (one had Docker and one did not), but the first time I never posted anything to Stack Overflow, so I had to go through the pain again :p
Try changing the package.json file so that npm run start does not use react-scripts start.
Note that this will overwrite the npm run start and npm start commands, so if you go this route, you can also add another script keyword such as local and run npm run local for local development:
"scripts": {
  "start": "serve -s build",
  "local": "react-scripts start",
  "build": "react-scripts build",
  ...
},
A working repo

Cloud Composer GKE Node upgrade results in Airflow task randomly failing

The problem:
I have a managed Cloud composer environment, under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been acting randomly. Tasks that were working properly are failing for no given reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
(screenshot: airflow_tree_view)
Hovering over a failing task only shows an Operator: null message (null_operator). Also, there is no log at all for that task.
I have been able to reproduce the situation with another Composer environment in order to ensure that the upgrade is the cause of the dysfunction.
What I have tried so far :
I assumed the upgrade might have screwed up either the scheduler or Celery (Cloud composer defaults to CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried restarting Airflow completely, the same way I did with airflow-scheduler.
None of these fixed the issue.
Side note: I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). Connecting to localhost:5555 stays in a 'waiting' state forever. I don't know if it is related.
Let me know if I'm missing something!
1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes

Error 100: Unable to render instance groups for deployment

I am deploying Cloud Foundry using bosh. I have completed following things:
Spinning up Director VM
Setting up bosh target
Creating Manifest Stub
Uploading bosh release
Currently I am stuck at bosh deploy.
ubuntu@bosh:~/my-bosh/cf-release$ bosh deploy
Acting as user 'admin' on deployment 'dev' on 'my-bosh'
Getting deployment properties from director...
Detecting deployment changes
.
Cloud Foundry Manifest!! (redacted)
.
Please review all changes carefully
Deploying
---------
Are you sure you want to deploy? (type 'yes' to continue): yes
Director task 44
Deprecation: Ignoring cloud config. Manifest contains 'networks' section.
Started preparing deployment > Preparing deployment. Done (00:00:01)
Error 100: Unable to render instance groups for deployment. Errors are:
- Unable to render jobs for instance group 'doppler_z1'. Errors are:
- Unable to render templates for job 'doppler'. Errors are:
- Error filling in template 'doppler.crt.erb' (line 1: Can't find property '["loggregator.tls.doppler.cert"]')
- Error filling in template 'doppler.key.erb' (line 1: Can't find property '["loggregator.tls.doppler.key"]')
- Unable to render jobs for instance group 'loggregator_trafficcontroller_z1'. Errors are:
- Unable to render templates for job 'loggregator_trafficcontroller'. Errors are:
- Error filling in template 'trafficcontroller.crt.erb' (line 1: Can't find property '["loggregator.tls.trafficcontroller.cert"]')
- Error filling in template 'trafficcontroller.key.erb' (line 1: Can't find property '["loggregator.tls.trafficcontroller.key"]')
Task 44 error
It seems to me that your stub is missing the certificates for the loggregator jobs, which are provided by default in the AWS stub (see minimal-aws.yml).
There is also an open ticket tracking this problem.
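For reference, the missing properties named in the errors would live under the loggregator section of the stub, roughly like the sketch below (the structure is inferred from the property names in the error output, so check it against minimal-aws.yml; certificate material is elided and must be generated yourself):

```yaml
properties:
  loggregator:
    tls:
      doppler:
        cert: |
          -----BEGIN CERTIFICATE-----
          ...
          -----END CERTIFICATE-----
        key: |
          -----BEGIN RSA PRIVATE KEY-----
          ...
          -----END RSA PRIVATE KEY-----
      trafficcontroller:
        cert: |
          ...
        key: |
          ...
```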

Pivotal Cloud Foundry Installation of Elastic Runtime Fails at smoke tests

vSphere 5.5
Ops Manager 1.3.4.0
Elastic Runtime 1.3.4.0
Ops Metrics 1.3.3.0
Install fails on step: Running errand Run Smoke Tests for Pivotal Elastic Runtime
I can't go into the VM to troubleshoot what is going on, as when smoke tests fail the smoke tests vm is removed. I can skip the smoke test errand and it will complete, but I am trying to figure out why the smoke test errand will not complete properly. Any help is greatly appreciated.
Here is a complete link to my install log https://dl.dropboxusercontent.com/u/14091323/cf-install.log
Here is an excerpt from the install log where the failure happens:
Errand `push-app-usage-service' completed successfully (exit code 0)
{"type": "step_finished", "id": "errands.running.cf-9b93ae0464e2a248f279.push-app-usage-service"}
{"type": "step_started", "id": "errands.running.cf-9b93ae0464e2a248f279.smoke-tests"}
46ab6197-dd49-46f1-9631-1249406d452f
Deployment set to `/var/tempest/workspaces/default/deployments/cf-9b93ae0464e2a248f279.yml'
Director task 52
Deprecation: Please use `templates' when specifying multiple templates for a job. `template' for multiple templates will soon be unsupported. (this warning is repeated six times in the log)
Started preparing deployment
Started preparing deployment > Binding deployment. Done (00:00:00)
Started preparing deployment > Binding releases. Done (00:00:00)
Started preparing deployment > Binding existing deployment. Done (00:00:00)
Started preparing deployment > Binding resource pools. Done (00:00:00)
Started preparing deployment > Binding stemcells. Done (00:00:00)
Started preparing deployment > Binding templates. Done (00:00:00)
Started preparing deployment > Binding properties. Done (00:00:00)
Started preparing deployment > Binding unallocated VMs. Done (00:00:00)
Started preparing deployment > Binding instance networks. Done (00:00:00)
Done preparing deployment (00:00:00)
Started preparing package compilation > Finding packages to compile. Done (00:00:00)
Started creating bound missing vms > smoke-tests/0. Done (00:00:37)
Started binding instance vms > smoke-tests/0. Done (00:00:00)
Started updating job smoke-tests > smoke-tests/0 (canary). Done (00:00:45)
Started running errand > smoke-tests/0. Done (00:00:38)
Started fetching logs for smoke-tests/0 > Finding and packing log files. Done (00:00:01)
Started deleting errand instances smoke-tests > vm-0207c40c-3551-4436-834d-7037871efdb5. Done (00:00:05)
Task 52 done
Started 2015-04-12 21:23:27 UTC
Finished 2015-04-12 21:25:36 UTC
Duration 00:02:09
Errand `smoke-tests' completed with error (exit code 1)
[stdout]
################################################################################################################
go version go1.2.1 linux/amd64
CONFIG=/var/vcap/jobs/smoke-tests/bin/config.json
{
  "suite_name" : "CFSMOKETESTS",
  "api" : "https://api.cf.lab.local",
  "apps_domain" : "cf.lab.local",
  "user" : "smoke_tests",
  "password" : "ad445f38ca9bbf21933e",
  "org" : "CFSMOKETEST_ORG",
  "space" : "CFSMOKETEST_SPACE",
  "use_existing_org" : false,
  "use_existing_space" : false,
  "logging_app" : "",
  "runtime_app" : "",
  "skip_ssl_validation": true
}
CONFIG=/var/vcap/jobs/smoke-tests/bin/config.json
GOPATH=/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace:/var/vcap/packages/smoke-tests
GOROOT=/var/vcap/data/packages/golang/aa5f90f06ada376085414bfc0c56c8cd67abba9c.1-f892239e5c78542d10f4d8f098d9b892c0b27bc1
OLDPWD=/var/vcap/bosh
PATH=/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace/bin:/var/vcap/packages/smoke-tests/bin:/var/vcap/packages/cli/bin:/var/vcap/data/packages/golang/aa5f90f06ada376085414bfc0c56c8cd67abba9c.1-f892239e5c78542d10f4d8f098d9b892c0b27bc1/bin:/var/vcap/packages/git/bin:/usr/sbin:/usr/bin:/sbin:/bin
PWD=/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests
SHLVL=1
TMPDIR=/var/vcap/data/tmp
_=/usr/bin/env
################################################################################################################
Running smoke tests...
/var/vcap/data/packages/golang/aa5f90f06ada376085414bfc0c56c8cd67abba9c.1-f892239e5c78542d10f4d8f098d9b892c0b27bc1/bin/go
Running Suite: CF-Smoke-Tests
Random Seed: 1428873898
Will run 2 of 2 specs
Loggregator:
  can see app messages in the logs
  /var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/loggregator_test.go:37
> cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
FAILED
i/o timeout
> cf delete-space CFSMOKETEST_SPACE -f
No API endpoint targeted. Use 'cf login' or 'cf api' to target an endpoint.
• Failure [5.240 seconds]
Loggregator: [BeforeEach]
/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/loggregator_test.go:38
  can see app messages in the logs
  /var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/loggregator_test.go:37
Expected
    <int>: 1
to match exit code:
    <int>: 0
/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace/src/github.com/cloudfoundry-incubator/cf-test-helpers/cf/as_user.go:39
------------------------------
Runtime:
  can be pushed, scaled and deleted
  /var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/runtime_test.go:62
> cf api https://api.cf.lab.local --skip-ssl-validation
Setting api endpoint to https://api.cf.lab.local...
OK
API endpoint: https://api.cf.lab.local (API version: 2.13.0)
Not logged in. Use 'cf login' to log in.
> cf auth smoke_tests ad445f38ca9bbf21933e
API endpoint: https://api.cf.lab.local
Authenticating...
OK
Use 'cf target' to view or set your target org and space
> cf create-quota CFSMOKETEST_ORG_QUOTA -m 10G -r 10 -s 2
Creating quota CFSMOKETEST_ORG_QUOTA as smoke_tests...
FAILED
i/o timeout
> cf delete-space CFSMOKETEST_SPACE -f
FAILED
No org targeted, use 'cf target -o ORG' to target an org.
• Failure [15.910 seconds]
Runtime: [BeforeEach]
/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/runtime_test.go:63
  can be pushed, scaled and deleted
  /var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/runtime_test.go:62
Expected
    <int>: 1
to match exit code:
    <int>: 0
/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/init_test.go:59
------------------------------
Summarizing 2 Failures:
[Fail] [BeforeEach] Loggregator: can see app messages in the logs
/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace/src/github.com/cloudfoundry-incubator/cf-test-helpers/cf/as_user.go:39
[Fail] [BeforeEach] Runtime: can be pushed, scaled and deleted
/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/smoke/init_test.go:59
Ran 2 of 2 Specs in 21.151 seconds
FAIL! -- 0 Passed | 2 Failed | 0 Pending | 0 Skipped
--- FAIL: TestSmokeTests (21.15 seconds)
FAIL
Ginkgo ran 1 suite in 31.489423576s
Test Suite Failed
Smoke Tests Complete; exit status: 1
[stderr]
+ which go
+ local_gopath=/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace
+ mkdir -p /var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace/bin
+ export GOPATH=/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace:/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace:/var/vcap/packages/smoke-tests
+ export PATH=/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace/bin:/var/vcap/packages/smoke-tests/src/github.com/cloudfoundry-incubator/cf-smoke-tests/Godeps/_workspace/bin:/var/vcap/packages/smoke-tests/bin:/var/vcap/packages/cli/bin:/var/vcap/data/packages/golang/aa5f90f06ada376085414bfc0c56c8cd67abba9c.1-f892239e5c78542d10f4d8f098d9b892c0b27bc1/bin:/var/vcap/packages/git/bin:/usr/sbin:/usr/bin:/sbin:/bin
+ go install -v github.com/onsi/ginkgo/ginkgo
io
bytes
bufio
syscall
time
os
fmt
flag
github.com/onsi/ginkgo/config
go/token
strings
path/filepath
go/scanner
go/ast
path
regexp/syntax
regexp
io/ioutil
net/url
text/template/parse
text/template
go/doc
go/parser
log
go/build
text/tabwriter
go/printer
go/format
os/exec
github.com/onsi/ginkgo/ginkgo/convert
github.com/onsi/ginkgo/ginkgo/nodot
github.com/onsi/ginkgo/ginkgo/testsuite
encoding/base64
encoding/json
encoding/xml
github.com/onsi/ginkgo/types
github.com/onsi/ginkgo/reporters/stenographer
github.com/onsi/ginkgo/reporters
hash
crypto
crypto/md5
encoding/binary
net
compress/flate
hash/crc32
compress/gzip
crypto/cipher
crypto/aes
crypto/des
math/big
crypto/elliptic
crypto/ecdsa
crypto/hmac
crypto/rand
crypto/rc4
crypto/rsa
crypto/sha1
crypto/sha256
crypto/dsa
encoding/asn1
crypto/x509/pkix
encoding/hex
encoding/pem
crypto/x509
crypto/tls
mime
net/textproto
mime/multipart
net/http
github.com/onsi/ginkgo/internal/remote
github.com/onsi/ginkgo/ginkgo/testrunner
github.com/onsi/ginkgo/ginkgo/watch
os/signal
github.com/onsi/ginkgo/ginkgo
+ ginkgo -r -v -slowSpecThreshold=300
{"type": "step_finished", "id": "errands.running.cf-9b93ae0464e2a248f279.smoke-tests"}
Exited with 1.
It turns out this was a bug in Pivotal Cloud Foundry 1.3. You will only see this bug if you use a separate Deployment and Infrastructure network (as is recommended). This bug is fixed in Pivotal Cloud Foundry 1.4.
I have outlined in detail what was going on here:
http://www.feeny.org/smoke-tests-fail-pivotal-cloud-foundry-1-3-solution/
Basically, the short of it is, the smoke-tests errand is created with the Ops Manager Infrastructure network IP address in its /etc/resolv.conf. This creates an asymmetrical routing situation and results in a timeout. This can be fixed by changing the following on the Ops Mgr:
To change this behaviour in Pivotal CF v1.3.x, on the Ops Manager VM, change /home/tempest-web/tempest/app/models/tempest/manifests/network_section.rb
Line 20: "dns" => [microbosh_dns_ip] + network.parsed_dns,
to "dns" => network.parsed_dns,
then restart the tempest-web service:
sudo service tempest-web stop
sudo service tempest-web start
Now you can re-enable the smoke-tests errand and re-apply changes and all will be well!