SAP Vora 2.1 on AWS intermittently goes down

I have set up an SAP Vora 2.1 installation on AWS using kops. It is a 4-node cluster with 1 master and 3 worker nodes. The persistent volume requirement for vsystem-vrep is provided using AWS EFS, and the other stateful components use AWS EBS. The installation goes through fine and runs for a few days, but after 3-4 days the following 5 Vora pods start showing issues:
vora-catalog
vora-relational
vora-timeseries
vora-tx-coordinator
vora-disk
Each of these pods has 2 containers, and both should be up and running. However, after 3-4 days one of the containers goes down on its own, although the Kubernetes cluster itself is up and running. I have tried various ways to bring these pods back up with all the required containers, but they do not come up.
I have captured the events for vora-disk as a sample, but all of the pods show the same trace:
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 7m 21 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Liveness probe failed: dial tcp 100.96.7.21:10002: getsockopt: connection refused
1h 2m 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Killing Killing container with id docker://disk:pod "vora-disk-0_vora(2f5ea6df-545b-11e8-90fd-029979a0ef92)" container "disk" is unhealthy, it will be killed and re-created.
1h 58s 51 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal Warning FailedSync Error syncing pod
1h 58s 41 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning BackOff Back-off restarting failed container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Started Started container
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Pulled Container image "ip-172-31-13-236.ap-southeast-2.compute.internal:5000/vora/dqp:2.1.32.19-vora-2.1" already present on machine
1h 46s 11 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Normal Created Created container
1h 1s 988 kubelet, ip-172-31-64-23.ap-southeast-2.compute.internal spec.containers{disk} Warning Unhealthy Readiness probe failed: HTTP probe failed with statuscode: 503
I would appreciate any pointers to resolve this issue.
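For reference, the state of the failing container and its last output can be inspected directly; a minimal sketch, with the pod, container and namespace names taken from the events above:
kubectl -n vora describe pod vora-disk-0
kubectl -n vora logs vora-disk-0 -c disk --previous   # output of the previously crashed 'disk' container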
Thanks Frank for your suggestion and pointer. It has definitely helped to overcome a few issues, but not all of them.
We have specifically observed issues related to Vora services going down for no apparent reason. While we understand there may be a reason why Vora goes down, the recovery procedure is not documented in the admin guide or anywhere else on the internet. We have seen the Vora services created by the vora-deployment operator go down (each of these pods contains one security container and one service-specific container; the service-specific container goes down and does not come back up). We tried various options, such as restarting all Vora pods or only the pods related to the Vora deployment operator, but these pods do not come up. We are re-deploying Vora in such cases, but that essentially means all previous work is lost. Is there any command or way to bring the Vora pods back up with all their containers?

This issue is described in SAP Note 2631736 (Liveness and Readiness issue in Vora 2.x); the suggested mitigation is to increase the health check interval, as sketched below.
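As a rough illustration of what increasing the health check interval means at the Kubernetes level. This is a minimal sketch only: it assumes vora-disk is managed as a StatefulSet in the vora namespace (inferred from the pod name vora-disk-0 above), the container index 0 is an assumption that must be verified, and the Vora deployment operator may overwrite direct edits, so the procedure in the Note takes precedence:
kubectl -n vora patch statefulset vora-disk --type=json -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/periodSeconds", "value": 60},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 30}
]'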

Related

Deploying JupyterHub on AWS EKS: Readiness probe failed, connection refused

I want to deploy JupyterHub on AWS' EKS service, and I am following the Zero to JupyterHub with Kubernetes guide to achieve this.
I am using the eksctl tool to deploy a cluster with one node group consisting of a single node, a t3.medium EC2 instance. After I deployed JupyterHub according to the instructions given in the guide, I get the following output when running kubectl get pods:
NAME READY STATUS RESTARTS AGE
continuous-image-puller-kl67x 1/1 Running 0 56s
hub-84b6467ff8-spjws 0/1 Pending 0 56s
proxy-79d75ddf8d-76rqm 1/1 Running 0 56s
user-scheduler-795f7d845f-7b8bn 1/1 Running 0 56s
user-scheduler-795f7d845f-mgks9 1/1 Running 0 56s
One pod, hub-84b6467ff8-spjws, remains in Pending state. kubectl describe pods outputs the following at the end:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 3m9s default-scheduler Successfully assigned jh/user-scheduler-795f7d845f-mgks9 to ip-192-168-20-191.eu-west-1.compute.internal
Normal Pulling 3m8s kubelet Pulling image "k8s.gcr.io/kube-scheduler:v1.23.10"
Normal Pulled 3m6s kubelet Successfully pulled image "k8s.gcr.io/kube-scheduler:v1.23.10" in 2.371033007s
Normal Created 3m3s kubelet Created container kube-scheduler
Normal Started 3m3s kubelet Started container kube-scheduler
Warning Unhealthy 3m3s kubelet Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused
I am having trouble understanding what "Readiness probe failed: Get "https://192.168.8.94:10259/healthz": dial tcp 192.168.8.94:10259: connect: connection refused" really means. I know there are similar questions relating to this, but so far their answers haven't worked for me. I tried using multiple nodes in the node group, nodes with more storage, and I made sure the role has the right permissions (according to the guide).
I am clearly missing something here, and I would be more than happy if someone could shed some light on this situation for me.
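A Pending pod usually means the scheduler cannot place it; in this setup, common causes are insufficient CPU/memory on the single t3.medium or an unbound PersistentVolumeClaim for the hub. A minimal sketch of where to look, with the pod name taken from the output above and the jh namespace taken from the events:
kubectl -n jh describe pod hub-84b6467ff8-spjws            # the Events section explains why scheduling fails
kubectl -n jh get pvc                                      # check whether the hub's PVC is Bound or Pending
kubectl describe nodes | grep -A 5 "Allocated resources"   # how much CPU/memory is left on the node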

Setting up K8S on AWS machine fails: api server not starting

I followed this guide to set up K8S on an EC2 instance, and I can't get it to work. I am not using EKS; instead I want to make my own installation of K8S by running three EC2 instances and making one the master and the other two the workers.
I installed Docker, which works fine:
docker version
Client: Docker Engine - Community
Version: 20.10.17
API version: 1.41
Go version: go1.17.11
Git commit: 100c701
Built: Mon Jun 6 23:02:46 2022
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.17
API version: 1.41 (minimum version 1.12)
Go version: go1.17.11
Git commit: a89b842
Built: Mon Jun 6 23:00:51 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.7
GitCommit: 0197261a30bf81f1ee8e6a4dd2dea0ef95d67ccb
runc:
Version: 1.1.3
GitCommit: v1.1.3-0-g6724737
docker-init:
Version: 0.19.0
GitCommit: de40ad0
I also installed kubectl, kubeadm and kubelet.
I ran kubeadm init to initialize my cluster, but after doing so the cluster doesn't manage to start successfully. It is possible to run kubectl cluster-info for a while, but after some time kubelet seems to just give up and commands respond with The connection to the server 172.31.10.55:6443 was refused - did you specify the right host or port?.
When looking at the output of kubectl describe pod kube-apiserver-master -n=kube-system I can see the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Killing 30m kubelet Stopping container kube-apiserver
Normal SandboxChanged 29m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29m kubelet Container image "k8s.gcr.io/kube-apiserver:v1.24.4" already present on machine
Normal Created 29m kubelet Created container kube-apiserver
Normal Started 29m kubelet Started container kube-apiserver
Warning Unhealthy 29m kubelet Readiness probe failed: Get "https://172.31.10.55:6443/readyz": dial tcp 172.31.10.55:6443: connect: connection refused
Warning Unhealthy 29m kubelet Liveness probe failed: Get "https://172.31.10.55:6443/livez": dial tcp 172.31.10.55:6443: connect: connection refused
Normal SandboxChanged 29m kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 29m kubelet Container image "k8s.gcr.io/kube-apiserver:v1.24.4" already present on machine
Normal Created 29m kubelet Created container kube-apiserver
Normal Started 29m kubelet Started container kube-apiserver
Warning Unhealthy 27m (x3 over 27m) kubelet Liveness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 27m (x15 over 27m) kubelet Readiness probe failed: HTTP probe failed with statuscode: 500
Warning Unhealthy 25m (x2 over 25m) kubelet Startup probe failed: Get "https://172.31.10.55:6443/livez": dial tcp 172.31.10.55:6443: connect: connection refused
Normal Killing 3m52s (x7 over 29m) kubelet Stopping container kube-apiserver
Looking at this, it seems that the issue is that K8S fails to access its own API on port 6443, but the port is not restricted in AWS, and these should be internal requests anyway, never leaving the internal network.
I feel like I might be missing something obvious, but when I look at other people with similar problems (port 6443 reporting connection refused), their issues always seem unrelated to the one I have here. I assume it's due to some network constraint in AWS, but I have no idea which, since it's the first time I have managed my own EC2 instances.
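A minimal sketch of how to see why the apiserver keeps being killed (this assumes a containerd runtime with crictl available; with a Docker runtime the equivalent docker ps / docker logs commands apply, and the container id below is a placeholder):
sudo journalctl -u kubelet -f --no-pager              # kubelet explains why it restarts the static pods
sudo crictl ps -a | grep kube-apiserver               # find the (possibly exited) apiserver container
sudo crictl logs --tail=50 <apiserver-container-id>   # read the apiserver's own error output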

EKS: Unhealthy nodes in the kubernetes cluster

I'm getting an error when using Terraform to provision a node group on AWS EKS.
Error: error waiting for EKS Node Group (xxx) creation: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster.
I went to the console and inspected the node. There is a message "runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker network plugin is not ready: cni config uninitialized".
I have 5 private subnets that connect to the Internet via NAT.
Is someone able to give me some hint on how to debug this?
Here are some details on my env.
Kubernetes version: 1.18
Platform version: eks.3
AMI type: AL2_x86_64
AMI release version: 1.18.9-20201211
Instance types: m5.xlarge
There are three workloads set up in the cluster.
coredns, STATUS (2 Desired, 0 Available, 0 Ready)
aws-node STATUS (5 Desired, 5 Scheduled, 0 Available, 0 Ready)
kube-proxy STATUS (5 Desired, 5 Scheduled, 5 Available, 5 Ready)
Going inside coredns, both pods are in Pending state, and the conditions show "Available=False, Deployment does not have minimum availability" and "Progress=False, ReplicaSet xxx has timed out progressing".
Going inside one of the aws-node pods, the status shows "Waiting - CrashLoopBackOff".
Add pod network add-on
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml
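Since the question reports aws-node (the EKS VPC CNI) in CrashLoopBackOff, it may also be worth inspecting that DaemonSet before switching CNIs. A sketch, assuming the default amazon-vpc-cni labels and container name:
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide
kubectl -n kube-system logs daemonset/aws-node -c aws-node --tail=50   # why the CNI pod is crashing
kubectl -n kube-system describe daemonset aws-node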

AWS ECS service Tasks getting replaced with (reason Request timed out)

We have been running ECS as our container orchestration layer for more than 2 years, but there is one problem we have not been able to figure out the reason for. In a few of our (Node.js) services we have started observing errors in the ECS events such as:
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
This causes our dependent services to start experiencing 504 Gateway Timeout errors, which impacts them in a big way.
What we have tried so far:
1. Upgraded the Docker storage driver from devicemapper to overlay2
2. Increased the resources for all ECS instances, including CPU, RAM and EBS storage, based on what we saw in a few containers
3. Increased the health check grace period for the service from 0 to 240 secs
4. Increased KeepAliveTimeout and SocketTimeout to 180 secs
5. Enabled awslogs on containers instead of stdout, but there was no unusual behavior
6. Enabled ECSMetaData at the container level and piped all of that information into our application logs; this helped us look at the logs for the problematic container only
7. Enabled Container Insights for better container-level debugging
Out of these, the things that helped the most were upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period.
The number of errors has come down dramatically with these two changes, but we still get this issue once in a while.
We have looked at all the graphs for the instance and container that went down; below are the logs for it.
ECS Container Insights logs for the affected container:
Query :
fields CpuUtilized, MemoryUtilized, #message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"
Example log returned:
{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}
None of the logs showed CPU or memory utilization that was unreasonably high.
We stopped getting responses from the affected container at, say, t1; we got errors in the dependent services at t1+2 minutes, and the container was taken away by ECS at t1+3 minutes.
Our health check configuration is below:
Protocol: HTTP
Path: /healthcheck
Port: traffic port
Healthy threshold: 10
Unhealthy threshold: 2
Timeout: 5 seconds
Interval: 10 seconds
Success codes: 200
Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:
docker info
Containers: 11
Running: 11
Paused: 0
Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, we found nothing that pointed to the cause.
Your steps 1 to 7 have almost nothing to do with the error.
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
The error is very clear: your ECS service is not reachable by the load balancer's health check.
Target Group Unhealthy
When this is the case, go straight to checking the container security group, port, application status, and the health check status code.
Possible reasons:
There is no route for the path /healthcheck in the backend service
The status code returned by /healthcheck is not 200
The target port is misconfigured; if the application is running on port 8080 or 3000, the target port should be 8080 or 3000 respectively
The security group is not allowing traffic to the target group
The application is not running in the container
These are the usual reasons for a health check timeout; a quick way to check what the load balancer itself sees is sketched below.
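A minimal sketch of inspecting the target group from the AWS side (the target group name is taken from the error above; the ARN returned by the first command feeds the second):
aws elbv2 describe-target-groups --names example-service        # shows the health check path, port and thresholds
aws elbv2 describe-target-health --target-group-arn <target-group-arn>   # shows each target's state and failure reason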
I faced the same issue (reason: Request timed out).
I managed to solve it by updating my security group inbound rules.
There was no rule defined in the inbound rules, so I added a general allow-all IPv4 traffic rule for the time being, since I was still in development at that time.
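As a rough sketch of that change via the CLI (the security group ID is a placeholder, the port is taken from the error message above, and in production the source should be restricted to the load balancer's security group rather than 0.0.0.0/0):
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp \
    --port 1047 \
    --cidr 0.0.0.0/0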

Endpoints are not being updated with new IP address of a pod

Platform: AWS EKS
Output of helm version:
Client: &version.Version{SemVer:"v2.12.3", GitCommit:"eecf22f77df5f65c823aacd2dbd30ae6c65f186e", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.2", GitCommit:"a8b13cc5ab6a7dbef0a58f5061bcc7c0c61598e7", GitTreeState:"clean"}
Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.1", GitCommit:"4485c6f18cee9a5d3c3b4e523bd27972b1b53892", GitTreeState:"clean", BuildDate:"2019-07-18T09:18:22Z", GoVersion:"go1.12.5", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12+", GitVersion:"v1.12.10-eks-2e569f", GitCommit:"2e569fd887357952e506846ed47fc30cc385409a", GitTreeState:"clean", BuildDate:"2019-07-25T23:13:33Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}
Cloud Provider/Platform (AKS, GKE, Minikube etc.): AWS EKS
The problem:
After a Jenkins pod restart, the pod gets a new IP address and the readiness probe is supposed to update the endpoints, but it doesn't.
kubectl get endpoints
jenkins <none>
jenkins-agent <none>
Error:
Readiness probe failed: Get http://192.168.0.109:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I can successfully access above URL from all pods and worker nodes and I get correct Headers.
This happened after Helm failed to upgrade Jenkins and I rolled back the release; the rollback was successful (apart from the endpoints now not being updated).
Now I need to edit the Endpoints object manually to point it to the correct IP address of the pod.
The current readinessProbe from the deployment is:
readinessProbe:
failureThreshold: 3
httpGet:
path: /login
port: http
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
The events from the Jenkins pod are:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 8m13s default-scheduler Successfully assigned default/jenkins-pod-id to <ip>.<region>.compute.internal
Normal SuccessfulAttachVolume 8m6s attachdetach-controller AttachVolume.Attach succeeded for volume "jenkins"
Normal Pulling 8m4s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m57s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m56s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m56s kubelet, <ip>.<region>.compute.internal Started container
Normal Pulling 7m43s kubelet, <ip>.<region>.compute.internal pulling image "jenkins/jenkins:2.176.2-alpine"
Normal Pulled 7m42s kubelet, <ip>.<region>.compute.internal Successfully pulled image "jenkins/jenkins:2.176.2-alpine"
Normal Created 7m42s kubelet, <ip>.<region>.compute.internal Created container
Normal Started 7m42s kubelet, <ip>.<region>.compute.internal Started container
Warning Unhealthy 6m40s kubelet, <ip>.<region>.compute.internal Readiness probe failed: Get http://<IP>:8080/login: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The pod gets an IP almost instantly, but it takes a few minutes for the container to start. How can I get the readiness probe to update the Endpoints, or at least get readiness probe logs? This is running on AWS EKS, so there is no access to the control plane to get more logs.
If I update the endpoints fast enough, the readiness probe won't fail, but this doesn't help the next time the pod restarts.
Update:
Just enabled EKS logs and got this:
deployment_controller.go:484] Error syncing deployment default/jenkins: Operation cannot be fulfilled on deployments.apps "jenkins": the object has been modified; please apply your changes to the latest version and try again
The following helped. The readiness probe is still failing, but this is because Jenkins takes about 90 seconds to start; I will update this.
helm delete jenkins
release "jenkins" deleted
helm rollback jenkins 25
Rollback was a success! Happy Helming!
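Since Jenkins takes roughly 90 seconds to start, one option, a sketch rather than a confirmed fix from the author, is to relax the probe timings in the deployment (or the corresponding Helm values), for example:
readinessProbe:
  failureThreshold: 3
  httpGet:
    path: /login
    port: http
    scheme: HTTP
  initialDelaySeconds: 120   # give Jenkins time to finish starting (original value: 60)
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5          # the original 1s is easy to exceed while Jenkins warms up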