Understanding Kubernetes cluster scaling - amazon-web-services

I am using AWS EKS with t3.medium instances, so each node has 2 vCPU (2000m CPU) and 4 GB RAM.
I am running 6 different apps in the cluster with these CPU request definitions:
name    request   replicas   total CPU
app#1   300m      x2         600m
app#2   100m      x4         400m
app#3   150m      x1         150m
app#4   300m      x1         300m
app#5   100m      x1         100m
app#6   150m      x1         150m
With basic math, the apps request 1700m CPU in total. I also have an HPA with a 60% CPU target for app#1 and app#2. So I was expecting to have just one node, or maybe two (because of kube-system pods), but the cluster is always running 3 nodes. It looks like I misunderstood how autoscaling works.
$ kubectl top nodes
NAME                                 CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-*.eu-central-1.compute.internal   221m         11%    631Mi           18%
ip-*.eu-central-1.compute.internal   197m         10%    718Mi           21%
ip-*.eu-central-1.compute.internal   307m         15%    801Mi           23%
As you can see, the nodes are only at 10-15% CPU utilization. How can I optimize node scaling, and why are there 3 nodes?
$ kubectl get hpa
NAME    REFERENCE                             TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
app#1   Deployment/easyinventory-deployment   37%/60%   1         5         3          5d16h
app#2   Deployment/poolinventory-deployment   64%/60%   1         5         4          4d10h
UPDATE #1
I have PodDisruptionBudgets for the kube-system pods:
kubectl create poddisruptionbudget pdb-event --namespace=kube-system --selector k8s-app=event-exporter --max-unavailable 1
kubectl create poddisruptionbudget pdb-fluentd --namespace=kube-system --selector k8s-app=fluentd-gcp-scaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-heapster --namespace=kube-system --selector k8s-app=heapster --max-unavailable 1
kubectl create poddisruptionbudget pdb-dns --namespace=kube-system --selector k8s-app=kube-dns --max-unavailable 1
kubectl create poddisruptionbudget pdb-dnsauto --namespace=kube-system --selector k8s-app=kube-dns-autoscaler --max-unavailable 1
kubectl create poddisruptionbudget pdb-glbc --namespace=kube-system --selector k8s-app=glbc --max-unavailable 1
kubectl create poddisruptionbudget pdb-metadata --namespace=kube-system --selector app=metadata-agent-cluster-level --max-unavailable 1
kubectl create poddisruptionbudget pdb-kubeproxy --namespace=kube-system --selector component=kube-proxy --max-unavailable 1
kubectl create poddisruptionbudget pdb-metrics --namespace=kube-system --selector k8s-app=metrics-server --max-unavailable 1
#source: https://gist.github.com/kenthua/fc06c6ea52a25a51bc07e70c8f781f8f
UPDATE #2
I figured out that the 3rd node is not always live: Kubernetes scales down to 2 nodes, but after a few minutes scales up to 3 again, then back down to 2, over and over.
kubectl describe nodes
# Node 1
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         1010m (52%)    1300m (67%)
  memory                      3040Mi (90%)   3940Mi (117%)
  ephemeral-storage           0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
# Node 2
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests       Limits
  --------                    --------       ------
  cpu                         1060m (54%)    1850m (95%)
  memory                      3300Mi (98%)   4200Mi (125%)
  ephemeral-storage           0 (0%)         0 (0%)
  attachable-volumes-aws-ebs  0              0
UPDATE #3
I0608 11:03:21.965642 1 static_autoscaler.go:192] Starting main loop
I0608 11:03:21.965976 1 utils.go:590] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0608 11:03:21.965996 1 filter_out_schedulable.go:65] Filtering out schedulables
I0608 11:03:21.966120 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966164 1 filter_out_schedulable.go:130] 0 other pods marked as unschedulable can be scheduled.
I0608 11:03:21.966175 1 filter_out_schedulable.go:90] No schedulable pods
I0608 11:03:21.966202 1 static_autoscaler.go:334] No unschedulable pods
I0608 11:03:21.966257 1 static_autoscaler.go:381] Calculating unneeded nodes
I0608 11:03:21.966336 1 scale_down.go:437] Scale-down calculation: ignoring 1 nodes unremovable in the last 5m0s
I0608 11:03:21.966359 1 scale_down.go:468] Node ip-*-93.eu-central-1.compute.internal - memory utilization 0.909449
I0608 11:03:21.966411 1 scale_down.go:472] Node ip-*-93.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.909449)
I0608 11:03:21.966460 1 scale_down.go:468] Node ip-*-115.eu-central-1.compute.internal - memory utilization 0.987231
I0608 11:03:21.966469 1 scale_down.go:472] Node ip-*-115.eu-central-1.compute.internal is not suitable for removal - memory utilization too big (0.987231)
I0608 11:03:21.966551 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
I0608 11:03:21.966578 1 static_autoscaler.go:453] Starting scale down
I0608 11:03:21.966667 1 scale_down.go:785] No candidates for scale down
UPDATE #4
According to the autoscaler logs, it was ignoring ip-*-145.eu-central-1.compute.internal for scale-down, for some reason. I wondered what would happen, so I terminated the instance directly from the EC2 console. These lines then appeared in the autoscaler logs:
I0608 11:10:43.747445 1 scale_down.go:517] Finding additional 1 candidates for scale down.
I0608 11:10:43.747477 1 cluster.go:93] Fast evaluation: ip-*-145.eu-central-1.compute.internal for removal
I0608 11:10:43.747540 1 cluster.go:248] Evaluation ip-*-115.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747549 1 cluster.go:248] Evaluation ip-*-93.eu-central-1.compute.internal for default/app2-848db65964-9nr2m -> PodFitsResources predicate mismatch, reason: Insufficient memory,
I0608 11:10:43.747557 1 cluster.go:129] Fast evaluation: node ip-*-145.eu-central-1.compute.internal is not suitable for removal: failed to find place for default/app2-848db65964-9nr2m
I0608 11:10:43.747569 1 scale_down.go:554] 1 nodes found to be unremovable in simulation, will re-check them at 2020-06-08 11:15:43.746773707 +0000 UTC m=+151098.489673532
I0608 11:10:43.747596 1 static_autoscaler.go:440] Scale down status: unneededOnly=false lastScaleUpTime=2020-06-08 09:14:54.619088707 +0000 UTC m=+143849.361988520 lastScaleDownDeleteTime=2020-06-06 17:18:02.104469988 +0000 UTC m=+36.847369765 lastScaleDownFailTime=2020-06-06 17:18:02.104470075 +0000 UTC m=+36.847369849 scaleDownForbidden=false isDeleteInProgress=false scaleDownInCooldown=false
As far as I can see, the node is not being scaled down because there is no other node that can fit app2. But app2's memory request is 700Mi, and at the moment the other nodes seem to have enough memory for it:
$ kubectl top nodes
NAME                                          CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
ip-10-0-0-93.eu-central-1.compute.internal    386m         20%    920Mi           27%
ip-10-0-1-115.eu-central-1.compute.internal   298m         15%    794Mi           23%
Still no idea why the autoscaler is not moving app2 to one of the other available nodes and scaling down ip-*-145.

This comes down to how pods with resource requests are scheduled.
A request is the amount guaranteed (reserved) for the container, so the scheduler will not place a pod on a node that does not have enough unreserved capacity, regardless of actual usage. kubectl top shows live usage, but the scheduler and the cluster autoscaler work from requests: in your case the 2 remaining nodes already have roughly 90% and 98% of their memory requested, so ip-*-145 cannot be scaled down because app2 would have nowhere to go.
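To confirm where the memory requests are going, you can compare the per-pod requests against each node's allocatable memory. A minimal sketch using standard kubectl options (the custom-columns expression is just one way of pulling the request field out of the pod spec):
# Memory request per pod, across all namespaces
kubectl get pods --all-namespaces -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,MEM_REQ:.spec.containers[*].resources.requests.memory
# Allocatable vs requested resources per node
kubectl describe nodes | grep -A 8 "Allocated resources"
If the requests turn out to be much larger than what the pods actually use (which is what your kubectl top numbers suggest), lowering the memory requests in the Deployments is what lets the workload bin-pack onto two nodes so the autoscaler can remove the third.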

Related

Error running Canary Deployment in Spinnaker

I am trying to enable canary deployments on AWS EKS, but my Kayenta pod is not starting. When I describe the pod I see this error. Can anyone help?
Warning Unhealthy 12m (x2 over 12m) kubelet Readiness probe failed: wget: can't connect to remote host (127.0.0.1): Connection refused
Warning Unhealthy 2m56s (x59 over 12m) kubelet Readiness probe failed: wget: server returned error: HTTP/1.1 503
This is the status of the pods:
NAME                               READY   STATUS    RESTARTS   AGE
spin-clouddriver-d796bdc59-tpznw   1/1     Running   0          3h40m
spin-deck-77cc75b57d-w7rfp         1/1     Running   0          3h40m
spin-echo-db954bb9-phfd5           1/1     Running   0          3h40m
spin-front50-7c5684cf9-t7vl8       1/1     Running   0          3h40m
spin-gate-78d6779854-7xqz4         1/1     Running   0          3h40m
spin-kayenta-6d7b5fdfc6-p5tcp      0/1     Running   0          21m
spin-kayenta-869c46bfcf-8t5fh      0/1     Running   0          28m
spin-orca-7ddd66758d-mpnkg         1/1     Running   0          3h40m
spin-redis-5975cfcdc8-rnm9g        1/1     Running   0          45h
spin-rosco-b7dbb577-z4szz          1/1     Running   0          3h40m
I will try to address your issue from the Kubernetes perspective.
The errors you were experiencing:
Warning Unhealthy 12m (x2 over 12m) kubelet Readiness probe failed: wget: can't connect to remote host (127.0.0.1): Connection refused
Warning Unhealthy 2m56s (x59 over 12m) kubelet Readiness probe failed: wget: server returned error: HTTP/1.1 503
indicate that there was a problem with your readiness probe configuration. Removing the readiness probe from the deployment "fixed" the error, but it can cause more issues down the road. To avoid that, I recommend adding it back with a proper configuration:
Probes have a number of fields that you can use to more precisely control the behavior of liveness and readiness checks:
initialDelaySeconds: Number of seconds after the container has started before liveness or readiness probes are initiated. Defaults to 0 seconds. Minimum value is 0.
periodSeconds: How often (in seconds) to perform the probe. Defaults to 10 seconds. Minimum value is 1.
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
successThreshold: Minimum consecutive successes for the probe to be considered successful after having failed. Defaults to 1. Must be 1 for liveness. Minimum value is 1.
failureThreshold: When a probe fails, Kubernetes will try failureThreshold times before giving up. Giving up in case of a liveness probe means restarting the container. In case of a readiness probe the Pod will be marked Unready. Defaults to 3. Minimum value is 1.
You'll need to adjust the probe's configuration based on your app's behavior (usually by trial and error). Two resources I'd recommend that will help you with that are:
Configure Liveness, Readiness and Startup Probes
Kubernetes best practices: Setting up health checks with readiness and liveness probes
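As a starting point, a readiness probe for the Kayenta container could look something like the sketch below. The port and path here are assumptions (Kayenta typically listens on 8090 and exposes a /health endpoint), so adjust them to whatever your deployment actually serves, and keep the initial delay generous because Kayenta can take a while to start:
readinessProbe:
  httpGet:
    path: /health        # assumed health endpoint, verify against your Kayenta config
    port: 8090           # assumed Kayenta port
  initialDelaySeconds: 120
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6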

Minikube with Virtualbox or KVM using lots of CPU on Centos 7

I've installed minikube as per the kubernetes instructions.
After starting it, and waiting a while, I noticed that it is using a lot of CPU, even though I have nothing particular running in it.
top shows this:
%Cpu(s): 0.3 us, 7.1 sy, 0.5 ni, 92.1 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 32521856 total, 2259992 free, 9882020 used, 20379844 buff/cache
KiB Swap: 2097144 total, 616108 free, 1481036 used. 20583844 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4847 root 20 0 3741112 91216 37492 S 52.5 0.3 9:57.15 VBoxHeadless
lscpu shows this:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 21
Model: 2
Model name: AMD Opteron(tm) Processor 3365
I see the same effect if I use KVM instead of VirtualBox
kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 20m
I installed metrics-server and it outputs this:
kubectl top node minikube
NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
minikube 334m 16% 1378Mi 76%
kubectl top pods --all-namespaces
NAMESPACE NAME CPU(cores) MEMORY(bytes)
default hello-minikube-56cdb79778-rkdc2 0m 3Mi
kafka-data-consistency zookeeper-84fb4cd6f6-sg7rf 1m 36Mi
kube-system coredns-fb8b8dccf-2nrl4 4m 15Mi
kube-system coredns-fb8b8dccf-g6llp 4m 8Mi
kube-system etcd-minikube 38m 41Mi
kube-system kube-addon-manager-minikube 31m 6Mi
kube-system kube-apiserver-minikube 59m 186Mi
kube-system kube-controller-manager-minikube 22m 41Mi
kube-system kube-proxy-m2fdb 2m 17Mi
kube-system kube-scheduler-minikube 2m 11Mi
kube-system kubernetes-dashboard-79dd6bfc48-7l887 1m 25Mi
kube-system metrics-server-cfb4b47f6-q64fb 2m 13Mi
kube-system storage-provisioner 0m 23Mi
Questions:
1) is it possible to find out why it is using so much CPU? (note that I am generating no load, and none of my containers are processing any data)
2) is that normal?
Are you sure nothing is running? What happens if you type kubectl get pods --all-namespaces? By default kubectl only shows the pods in the default namespace (thus excluding the pods in the system namespaces).
Also, while I am no CPU expert, this seems like a reasonable consumption for the hardware you have.
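If you want to double-check, listing everything and sorting the pod metrics by CPU shows which workloads are actually responsible (the --sort-by flag needs a reasonably recent kubectl and a running metrics-server):
kubectl get pods --all-namespaces
kubectl top pods --all-namespaces --sort-by=cpu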
In response to question 1):
You can ssh into minikube and from there you can run top to see the processes which are running:
minikube ssh
top
There is a lot of Docker and kubelet related stuff running:
top - 21:43:10 up 8:27, 1 user, load average: 10.98, 12.00, 11.46
Tasks: 148 total, 1 running, 147 sleeping, 0 stopped, 0 zombie
%Cpu0 : 15.7/15.7 31[|||||||||||||||||||||||||||||||| ]
%Cpu1 : 6.0/10.0 16[|||||||||||||||| ]
GiB Mem : 92.2/1.9 [ ]
GiB Swap: 0.0/0.0 [ ]
11842 docker 20 0 24.5m 3.1m 0.7 0.2 0:00.71 R `- top
1948 root 20 0 480.2m 77.0m 8.6 4.1 27:45.44 S `- /usr/bin/dockerd -H tcp://0.0.0.0:2376 -H unix:///var/run/docker.sock --tlsverify --tlscacert /etc/docker/ca+
...
3176 root 20 0 10.1g 48.4m 2.0 2.6 17:45.61 S `- etcd --advertise-client-urls=https://192.168.39.197:2379 --cert-file=/var/lib/minikube/certs/etc+
The two processes with the most accumulated processor time, dockerd and etcd, are the culprits.
In response to question 2): No idea but could be. See answer from #alassane-ndiaye

How to optimally set spark driver properties in YARN

I am trying out various options for setting the Spark driver memory in YARN.
Use Case:
I have a Spark cluster with 1 master and 2 slaves:
master: r5d.xlarge - 8 vCores, 32 GB
slave: r5d.xlarge - 8 vCores, 32 GB
I am using Apache Zeppelin to run queries on the Spark cluster. The Spark interpreter is configured with properties provided by Zeppelin. I am using Spark 2.3.1 running on YARN. I want to create 4 interpreters so that 4 users can use this cluster in parallel.
Config 1:
spark.submit.deployMode client
spark.driver.cores 7
spark.driver.memory 24G
spark.driver.memoryOverhead 3072M
spark.executor.cores 1
spark.executor.memory 3G
spark.executor.memoryOverhead 512M
spark.yarn.am.cores 1
spark.yarn.am.memory 3G
spark.yarn.am.memoryOverhead 512M
Below is the spark executor UI:
Config 2:
spark.submit.deployMode client
spark.driver.cores 7
spark.driver.memory 12G
spark.driver.memoryOverhead 3072M
spark.executor.cores 1
spark.executor.memory 3G
spark.executor.memoryOverhead 512M
spark.yarn.am.cores 1
spark.yarn.am.memory 3G
spark.yarn.am.memoryOverhead 512M
Below is the spark executor UI:
Questions:
Why is the container size of the driver 0?
Is spark.memory.fraction calculated as (spark.driver.memory - 300) * 0.6? If so, why is it not exact? (14.22 and 7.02 respectively; see the rough calculation after this list.)
Why is the container size of the executors 3.8 GB? According to my configuration it should be 3G + 512M = 3.5 GB. This issue was not there with Spark 2.1.
The number of vCores available to YARN is 8 per node. How is this possible, since AWS advertises vCPUs with their instances? According to AWS I should only be getting 4 vCores.
https://aws.amazon.com/ec2/instance-types/r5/
If I want to use 4 interpreters, should I distribute the master's 32 GB equally across all the interpreters?
Driver:
spark.driver.cores 2
spark.driver.memory 7G
spark.driver.memoryOverhead 1024M
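For reference on the spark.memory.fraction question, a rough check assuming Spark's documented behaviour (300 MB reserved, then spark.memory.fraction = 0.6 applied to the rest):
(24 * 1024 MB - 300 MB) * 0.6 = 24276 MB * 0.6 ≈ 14566 MB ≈ 14.22 GiB
(12 * 1024 MB - 300 MB) * 0.6 = 11988 MB * 0.6 ≈ 7193 MB ≈ 7.02 GiB
So the UI values do seem to line up with (heap - 300 MB) * 0.6 once the result is expressed in GiB.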

Spark EMR Cluster is removing executors when run because they are idle

I have a Spark application that was running fine in standalone mode. I'm now trying to get the same application to run on an AWS EMR cluster, but currently it's failing.
The message is one I've not seen before and implies that the workers are not receiving jobs and are being shut down.
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 7)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 2
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 2 because it has been idle for 60 seconds (new desired total will be 6)
16/11/30 14:45:00 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 4
16/11/30 14:45:00 INFO ExecutorAllocationManager: Removing executor 4 because it has been idle for 60 seconds (new desired total will be 5)
16/11/30 14:45:01 INFO YarnClientSchedulerBackend: Requesting to kill executor(s) 7
16/11/30 14:45:01 INFO ExecutorAllocationManager: Removing executor 7 because it has been idle for 60 seconds (new desired total will be 4)
The DAG shows the workers initialised, then a collect (one that is relatively small) and then shortly after they all fail.
Dynamic allocation was enabled, so one thought was that the driver wasn't sending the executors any tasks and they timed out. To test that theory I spun up another cluster without dynamic allocation, and the same thing happened.
The master is set to yarn.
Any help is massively appreciated, thanks.
16/11/30 14:49:16 INFO BlockManagerMaster: Removal of executor 21 requested
16/11/30 14:49:16 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
16/11/30 14:49:16 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
16/11/30 14:49:24 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My step is quite simple - spark-submit --deploy-mode client --master yarn --class Run app.jar
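For anyone reproducing the comparison, explicitly disabling dynamic allocation on a YARN submit looks roughly like this (the executor count and sizes are illustrative placeholders, not values taken from this setup):
spark-submit --deploy-mode client --master yarn --class Run \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 6 --executor-cores 2 --executor-memory 4G \
  app.jar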

How to set the frequency of a liveness/readiness probe in Kubernetes

Is probe frequency customizable in liveness/readiness probe?
Also, how many times does the readiness probe have to fail before the pod is removed from the service load balancer? Is that customizable?
The probe frequency is controlled by the sync-frequency command line flag on the Kubelet, which defaults to syncing pod statuses once every 10 seconds.
I'm not aware of any way to customize the number of failed probes needed before a pod is considered not-ready to serve traffic.
If either of these features is important to you, feel free to open an issue explaining what your use case is or send us a PR! :)
You can easily customise the probe's failure threshold and frequency; all the parameters are defined here.
For example:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: 9081
    scheme: HTTP
  initialDelaySeconds: 180
  timeoutSeconds: 10
  periodSeconds: 10
  successThreshold: 1
That probe will run for the first time after 3 minutes, then every 10 seconds, and the pod will be restarted after 3 consecutive failures.
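The same fields apply to a readiness probe, which covers the second part of the question: the pod is taken out of the Service endpoints after failureThreshold consecutive failures, i.e. roughly periodSeconds * failureThreshold seconds of failing checks. A minimal sketch (the path and port are placeholders):
readinessProbe:
  httpGet:
    path: /health    # placeholder path
    port: 8080       # placeholder port
  periodSeconds: 5
  failureThreshold: 2
  successThreshold: 1
With these values the pod stops receiving traffic about 10 seconds after the endpoint starts failing, and is added back after one successful check.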
To customize the liveness/readiness probe frequency and other parameters, we need to add a livenessProbe/readinessProbe element inside the containers element of the YAML for that pod. A simple example is given below:
apiVersion: v1
kind: Pod
metadata:
  name: liveness-exec
spec:
  containers:
  - name: liveness-ex
    image: ubuntu
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
The initialDelaySeconds parameter ensures that the liveness probe is first checked 5 seconds after the container starts, and periodSeconds ensures it is checked every 5 seconds after that. For more parameters see: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/