Error running Canary Deployment in Spinnaker - amazon-web-services

I am trying to enable canary deployments on AWS EKS, but my Kayenta pod never becomes ready. When I describe the pod I see the errors below. Can anyone help?
Warning Unhealthy 12m (x2 over 12m) kubelet Readiness probe failed: wget: can't connect to remote host (127.0.0.1): Connection refused
Warning Unhealthy 2m56s (x59 over 12m) kubelet Readiness probe failed: wget: server returned error: HTTP/1.1 503
This is the status of the pods:
NAME READY STATUS RESTARTS AGE
spin-clouddriver-d796bdc59-tpznw 1/1 Running 0 3h40m
spin-deck-77cc75b57d-w7rfp 1/1 Running 0 3h40m
spin-echo-db954bb9-phfd5 1/1 Running 0 3h40m
spin-front50-7c5684cf9-t7vl8 1/1 Running 0 3h40m
spin-gate-78d6779854-7xqz4 1/1 Running 0 3h40m
spin-kayenta-6d7b5fdfc6-p5tcp 0/1 Running 0 21m
spin-kayenta-869c46bfcf-8t5fh 0/1 Running 0 28m
spin-orca-7ddd66758d-mpnkg 1/1 Running 0 3h40m
spin-redis-5975cfcdc8-rnm9g 1/1 Running 0 45h
spin-rosco-b7dbb577-z4szz 1/1 Running 0 3h40m

I will try to address your issue from the Kubernetes perspective.
The errors you were experiencing:
Warning Unhealthy 12m (x2 over 12m) kubelet Readiness probe failed: wget: can't connect to remote host (127.0.0.1): Connection refused
Warning Unhealthy 2m56s (x59 over 12m) kubelet Readiness probe failed: wget: server returned error: HTTP/1.1 503
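indicate that there was a problem with your readiness probe configuration. The first message means that nothing was listening on the probed port yet; the second means that the application answered but reported itself as unavailable (HTTP 503). Removing the readiness probe from the deployment "fixed" the error, but it can cause more issues in the future. To avoid that, I recommend adding it back with a proper configuration: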
Probes have a number of fields that you can use to more precisely control the behavior of liveness and readiness checks:
initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to 0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Default to 10 seconds. Minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness. Minimum value is 1.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of liveness probe means restarting the container. In case of readiness probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
You'll need to adjust the probe's configuration based on your app's behavior (usually by trial and error). Two resources that should help with that:
Configure Liveness, Readiness and Startup Probes
Kubernetes best practices: Setting up health checks with readiness and liveness probes
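As a starting point, the fields above could be combined into a readiness probe like the one in this sketch. It assumes the probe was removed from the deployment as described, that the container is named kayenta, that Kayenta serves its health check on port 8090 at /health, and that Spinnaker runs in a namespace called spinnaker. None of these come from your post, so check them against your deployment. The timing values are illustrative, not recommendations.

```shell
# Write a patch that adds a readiness probe to the Kayenta container.
# Assumptions (not from the question): container name "kayenta",
# health endpoint /health on port 8090, illustrative timing values.
cat > kayenta-readiness-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
      - name: kayenta
        readinessProbe:
          httpGet:
            path: /health
            port: 8090
          initialDelaySeconds: 60   # give the service time to start listening
          periodSeconds: 10         # probe every 10 seconds
          timeoutSeconds: 5         # wait up to 5 seconds for a response
          failureThreshold: 6       # mark Unready after 6 consecutive failures
EOF

# Apply it against the cluster (namespace "spinnaker" is an assumption):
# kubectl -n spinnaker patch deployment spin-kayenta --patch "$(cat kayenta-readiness-patch.yaml)"
```

If the pod still reports HTTP 503 long after startup, the probe timing is probably not the cause, and the Kayenta logs (kubectl logs) are the next place to look.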

Related

ingress-nginx-controller unable to start with CrashLoopBackOff

I am trying to install the ingress-nginx controller on a kubeadm (bare-metal) cluster. For some reason it fails to start, and when I try to apply my Ingress rule it throws the error below:
Error from server (InternalError): error when creating "ing.yaml": Internal error occurred: failed calling webhook "validate.nginx.ingress.kubernetes.io": Post https://ingress-nginx-controller-admission.ingress-nginx.svc:443/extensions/v1beta1/ingresses?timeout=30s: dial tcp 10.103.2.38:443: connect: connection refused
I took this to mean that the ingress-nginx-controller pods could not be reached, and on checking I could see that the controller pod had not started:
# kubectl get all -n ingress-nginx
NAME READY STATUS RESTARTS AGE
pod/ingress-nginx-admission-create-wwl67 0/1 CrashLoopBackOff 4 4m17s
pod/ingress-nginx-admission-patch-zclsr 0/1 CrashLoopBackOff 4 4m17s
pod/ingress-nginx-controller-75589bd5f6-hjk4z 0/1 ContainerCreating 0 4m27s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/ingress-nginx-controller NodePort 10.96.192.255 <none> 80:30044/TCP,443:32048/TCP 4m27s
service/ingress-nginx-controller-admission ClusterIP 10.102.71.188 <none> 443/TCP 4m28s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/ingress-nginx-controller 0/1 1 0 4m27s
NAME DESIRED CURRENT READY AGE
replicaset.apps/ingress-nginx-controller-75589bd5f6 1 1 0 4m27s
NAME COMPLETIONS DURATION AGE
job.batch/ingress-nginx-admission-create 0/1 4m17s 4m27s
job.batch/ingress-nginx-admission-patch 0/1 4m17s 4m27s
I used the following to install the ingress-nginx:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.34.0/deploy/static/provider/baremetal/deploy.yaml
Also, https://raw.githubusercontent.com/kubernetes/ingress-nginx/master/deploy/mandatory.yaml returns a 404 error.
I searched but could not find any relevant posts online. Please help me understand and resolve the issue.
Thanks.

istio upgrade from 1.4.6 -> 1.5.0 throws istiod errors : remote error: tls: error decrypting message

I just upgraded Istio from 1.4.6 (Helm) to 1.5.0 (istioctl) by purging Istio and reinstalling with istioctl, but the istiod logs keep showing the following:
2020-03-16T18:25:45.209055Z info grpc: Server.Serve failed to complete security handshake from "10.150.56.111:56870": remote error: tls: error decrypting message
2020-03-16T18:25:46.792447Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.112:49162": remote error: tls: error decrypting message
2020-03-16T18:25:46.930483Z info grpc: Server.Serve failed to complete security handshake from "10.150.56.160:36878": remote error: tls: error decrypting message
2020-03-16T18:25:48.284122Z info grpc: Server.Serve failed to complete security handshake from "10.150.52.230:44758": remote error: tls: error decrypting message
2020-03-16T18:25:48.288180Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.149:56756": remote error: tls: error decrypting message
2020-03-16T18:25:49.108515Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.151:53970": remote error: tls: error decrypting message
2020-03-16T18:25:49.111874Z info Handling event update for pod contentgatewayaidest-7f4694d87-qmq8z in namespace djin-content -> 10.150.53.50
2020-03-16T18:25:49.519861Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.91:59510": remote error: tls: error decrypting message
2020-03-16T18:25:50.133664Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.203:59726": remote error: tls: error decrypting message
2020-03-16T18:25:50.331020Z info grpc: Server.Serve failed to complete security handshake from "10.150.57.195:59970": remote error: tls: error decrypting message
2020-03-16T18:25:52.110695Z info Handling event update for pod contentgateway-d74b44c7-dtdxs in namespace djin-content -> 10.150.56.215
2020-03-16T18:25:53.312761Z info Handling event update for pod dysonpriority-b6dbc589b-mk628 in namespace djin-content -> 10.150.52.91
2020-03-16T18:25:53.496524Z info grpc: Server.Serve failed to complete security handshake from "10.150.56.111:57276": remote error: tls: error decrypting message
As a result, no sidecars launch successfully; they fail with:
2020-03-16T18:32:17.265394Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-03-16T18:32:19.269334Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-03-16T18:32:21.265214Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful, 0 rejected
2020-03-16T18:32:23.266159Z info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 16 successful, 0 rejected; lds updates: 0 successful,
Oddly, the other clusters that I upgraded went through fine. Any idea where this error might be coming from? istioctl analyze works fine.
The error goes away after killing (recreating) the nodes, but the istio-proxy sidecars still fail with:
info Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?): cds updates: 1 successful, 0 rejected; lds updates: 0 successful, 0 rejected
As far as I know, istioctl upgrade was added in version 1.4.4, and it should be used when you want to upgrade Istio from 1.4.x to 1.5.0.
The istioctl upgrade command performs an upgrade of Istio. Before performing the upgrade, it checks that the Istio installation meets the upgrade eligibility criteria. Also, it alerts the user if it detects any changes in the profile default values between Istio versions.
The upgrade command can also perform a downgrade of Istio.
See the istioctl upgrade reference for all the options provided by the istioctl upgrade command.
istioctl upgrade --help
The upgrade command checks for upgrade version eligibility and, if eligible, upgrades the Istio control plane components in-place. Warning: traffic may be disrupted during upgrade. Please ensure PodDisruptionBudgets are defined to maintain service continuity.
I ran a test on a GCP cluster with Istio 1.4.6 installed with istioctl, then ran istioctl upgrade from version 1.5.0, and everything worked fine.
kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
istio-ingressgateway-598796f4d9-lvzdb 1/1 Running 0 12m
istiod-7d9c7bdd6-mggx7 1/1 Running 0 12m
prometheus-b47d8c58c-7spq5 2/2 Running 0 12m
I checked the logs and deployed some simple examples, and no errors like the ones in your example occurred in istiod.
Upgrade prerequisites for istioctl upgrade
Ensure you meet these requirements before starting the upgrade process:
Istio version 1.4.4 or higher is installed.
Your Istio installation was installed using istioctl.
I assume that, because of the differences between 1.4.x and 1.5.0, there may be issues when you mix both installation methods, Helm and istioctl. The best option here would be to install Istio 1.4.6 with istioctl and then upgrade it to 1.5.0.
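A rough sketch of that sequence, assuming both release archives have been unpacked side by side (the two binary paths are hypothetical) and that the default profile is acceptable:

```shell
# Paths to the two istioctl binaries are hypothetical; adjust to your layout.
ISTIOCTL_OLD="${ISTIOCTL_OLD:-./istio-1.4.6/bin/istioctl}"
ISTIOCTL_NEW="${ISTIOCTL_NEW:-./istio-1.5.0/bin/istioctl}"

# Install 1.4.6 with istioctl (default profile), then upgrade in place to 1.5.0.
reinstall_then_upgrade() {
  "$ISTIOCTL_OLD" manifest apply && "$ISTIOCTL_NEW" upgrade
}

# Usage, once any Helm-based installation has been removed:
# reinstall_then_upgrade
```

The upgrade step only runs if the 1.4.6 installation succeeds, which keeps a failed install from being masked by the upgrade output.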
I hope this answers your question. Let me know if you have any more questions.

CockroachDB on AWS EKS cluster - [n?] no stores bootstrapped

I am attempting to deploy CockroachDB:v2.1.6 to a new AWS EKS cluster. Everything is deployed successfully; statefulset, services, pv's & pvc's are created. The AWS EBS volumes are created successfully too.
The issue is the pods never get to a READY state.
pod/cockroachdb-0 0/1 Running 0 14m
pod/cockroachdb-1 0/1 Running 0 14m
pod/cockroachdb-2 0/1 Running 0 14m
If I 'describe' the pods I get the following:
Normal Pulled 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Container image "cockroachdb/cockroach:v2.1.6" already present on machine
Normal Created 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Created container cockroachdb
Normal Started 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Started container cockroachdb
Warning Unhealthy 1s (x8 over 36s) kubelet, ip-10-5-109-70.eu-central-1.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 503
If I examine the logs of a pod I see this:
I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W200409 11:45:18.076826 87 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host". Reconnecting...
W200409 11:45:18.076942 21 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host"
I came across this comment from the CockroachDB forum (https://forum.cockroachlabs.com/t/http-probe-failed-with-statuscode-503/2043/6)
Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out no stores bootstrapped during startup and prefix all their log lines with n?, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster.
I have deleted everything including pv's, pvc's & AWS EBS volumes through the kubectl delete command and reapplied with the same issue.
Any thoughts would be very much appreciated. Thank you
I was not aware that you had to initialize the CockroachDB cluster after creating it. I did the following to resolve my issue:
kubectl exec -it cockroachdb-0 -n <namespace> -- /bin/sh
/cockroach/cockroach init
See here for more details - https://www.cockroachlabs.com/docs/v19.2/cockroach-init.html
After this the pods started running correctly.
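The same step can also be run without opening an interactive shell. This is only a sketch: the namespace argument is a placeholder, and the command is run with no extra flags, as in the steps above.

```shell
# Initialize the cluster once, from the first pod.
# The namespace is passed as $1 and defaults to "default" (an assumption).
# Add --insecure after "init" if the cluster runs in insecure mode
# (also an assumption; the original command used no flags).
init_cockroach() {
  namespace="${1:-default}"
  kubectl exec -n "$namespace" cockroachdb-0 -- /cockroach/cockroach init
}

# Usage:
# init_cockroach my-namespace
# kubectl get pods -n my-namespace   # check whether the pods become READY
```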

Where is Airflow webserver running on Google Composer?

I have the following pods:
NAME READY STATUS RESTARTS AGE
airflow-database-init-job-ggk95 0/1 Completed 0 3h
airflow-redis-0 1/1 Running 0 3h
airflow-scheduler-7594cd584-mlfrt 2/2 Running 9 3h
airflow-sqlproxy-74f64b8b97-csl8h 1/1 Running 0 3h
airflow-worker-5fcd4fffff-7w2sg 2/2 Running 0 3h
airflow-worker-5fcd4fffff-m44bs 2/2 Running 0 3h
airflow-worker-5fcd4fffff-mm55s 2/2 Running 0 3h
composer-agent-0034135a-3fed-49a6-b173-9d3f9d0569db-ktwwt 0/1 Completed 0 3h
composer-agent-0034135a-3fed-49a6-b173-9d3f9d0569db-nmjvw 0/1 Error 0 3h
composer-agent-d043348f-025a-4aa1-89b4-d4a5fae91653-8zdwk 0/1 Completed 0 3h
composer-fluentd-daemon-grwsp 1/1 Running 0 3h
composer-fluentd-daemon-rxhjc 1/1 Running 0 3h
composer-fluentd-daemon-xxrmr 1/1 Running 0 3h
I don't know which of them are webserver pods. airflow-worker is probably not the webserver, right? I want to poke it to check whether it is working properly, because it seems not to be.
As explained in the documentation about Cloud Composer's architecture, the Airflow webserver is running in an App Engine flexible environment hosted in a Google-managed tenant project to which users do not have access.
Unfortunately, the webserver logs are not forwarded to Composer's main project (i.e. your project), although there is an open Feature Request in the Public Issue Tracker. Feel free to click the star icon and comment on it to let the Composer engineering team know about the importance of this feature and your use case. If you believe you have any other issue with the webserver itself, I recommend that you either contact support (if you are eligible to do so) or open an issue in the corresponding Public Issue Tracker so that it can be investigated by the GCP Support Team.
In case you want to know more about the Airflow Webserver, you can find some additional information in its documentation page too.
With regard to the Airflow webserver logs: they are visible in Stackdriver Logging.
Navigate to GCP Menu -> Logging -> Logs Viewer.
If you are using the classic Stackdriver UI, select "Cloud Composer Environment" in the "resource" dropdown and then select "airflow-webserver" in the second dropdown.
If you are using the new Stackdriver UI, put the following query into the query box:
resource.type="cloud_composer_environment"
logName="projects/<your project name>/logs/airflow-webserver"
... and you will get the logs generated by airflow-webserver.
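The same filter can also be used from the command line. A small sketch, assuming the gcloud CLI is installed and authenticated; my-project is a placeholder project ID and the helper function name is made up for this example:

```shell
# Build the logging filter for the Airflow webserver log of a given project.
composer_webserver_filter() {
  printf 'resource.type="cloud_composer_environment" AND logName="projects/%s/logs/airflow-webserver"' "$1"
}

# Usage (requires gcloud; "my-project" is a placeholder):
# gcloud logging read "$(composer_webserver_filter my-project)" --project my-project --limit 20
```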

istio-ingress can't start up

When I start minikube and apply istio.yaml, the ingress can't start up:
eumji@eumji:~$ kubectl get pods -n istio-system
NAME READY STATUS RESTARTS AGE
istio-ca-76dddbd695-bdwm9 1/1 Running 5 2d
istio-ingress-85fb769c4d-qtbcx 0/1 CrashLoopBackOff 67 2d
istio-mixer-587fd4bbdb-ldvhb 3/3 Running 15 2d
istio-pilot-7db8db896c-9znqj 2/2 Running 10 2d
When I check the logs, I get the following output:
eumji@eumji:~$ kubectl logs -f istio-ingress-85fb769c4d-qtbcx -n istio-system
ERROR: logging before flag.Parse: I1214 05:04:26.193386 1 main.go:68] Version root@24c944bda24b-0.3.0-24ec6a3ac3a1d592d1873d2d8198278a849b8301
ERROR: logging before flag.Parse: I1214 05:04:26.193463 1 main.go:109] Proxy role: proxy.Node{Type:"ingress", IPAddress:"", ID:"istio-ingress-85fb769c4d-qtbcx.istio-system", Domain:"istio-system.svc.cluster.local"}
ERROR: logging before flag.Parse: I1214 05:04:26.193480 1 resolve.go:35] Attempting to lookup address: istio-mixer
ERROR: logging before flag.Parse: I1214 05:04:41.195879 1 resolve.go:42] Finished lookup of address: istio-mixer
Error: lookup failed for udp address: i/o timeout
Usage:
agent proxy [flags]
--serviceregistry string Select the platform for service registry, options are {Kubernetes, Consul, Eureka} (default "Kubernetes")
--statsdUdpAddress string IP Address and Port of a statsd UDP listener (e.g. 10.75.241.127:9125)
--zipkinAddress string Address of the Zipkin service (e.g. zipkin:9411)
Global Flags:
--log_backtrace_at traceLocation when logging hits line file:N, emit a stack trace (default :0)
-v, --v Level log level for V logs (default 0)
--vmodule moduleSpec comma-separated list of pattern=N settings for file-filtered logging
ERROR: logging before flag.Parse: E1214 05:04:41.198640 1 main.go:267] lookup failed for udp address: i/o timeout
What could be the reason?
There is not enough information in your post to figure out what may be wrong. In particular, it seems that your ingress isn't able to resolve istio-mixer, which is unexpected.
Can you file a detailed issue at https://github.com/istio/issues/issues/new so we can take it from there?
Thanks
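One way to gather that detail is to check whether cluster DNS resolves istio-mixer at all. A minimal sketch, assuming the busybox:1.28 image can be pulled in your minikube; the pod name dnstest and the function name are arbitrary:

```shell
# Start a throwaway pod in istio-system and resolve the mixer service name.
check_mixer_dns() {
  kubectl run dnstest -n istio-system --rm -i --restart=Never \
    --image=busybox:1.28 -- nslookup istio-mixer
}

# Usage:
# check_mixer_dns
# kubectl get pods -n kube-system    # confirm the cluster DNS pod is Running
```

If the lookup times out here as well, the problem is likely with cluster DNS rather than with the ingress itself.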
Are you using something like minikube? The quick-start docs give this hint: "Note: If your cluster is running in an environment that does not support an external load balancer (e.g., minikube), the EXTERNAL-IP of istio-ingress says <pending>. You must access the application using the service NodePort, or use port-forwarding instead."
https://istio.io/docs/setup/kubernetes/quick-start.html