I have a k8s environment where I am running 3 masters and 7 worker nodes. Every day my pods end up in the Evicted state due to disk pressure.
I am getting the below error on my worker node.
Message: The node was low on resource: ephemeral-storage.
Status: Failed
Reason: Evicted
Message: Pod The node had condition: [DiskPressure].
But my worker node has enough resources to schedule pods.
Having analysed the comments, it looks like pods go into the Evicted state when they use more resources than are available, depending on a particular pod limit. A solution in that case might be to manually delete the evicted pods, since they are no longer using resources at that point. To read more about Node-pressure Eviction, one can visit the official documentation.
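As a concrete illustration, evicted pods remain in the Failed phase until they are removed, so they can be listed and cleaned up with kubectl; the namespace below is only an example:
# List failed pods across all namespaces (evicted pods stay in the Failed phase)
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
# Delete the failed pods in a given namespace (namespace name is illustrative)
kubectl delete pods --field-selector=status.phase=Failed -n default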
I've been running into what should be a simple issue with my airflow scheduler. Every couple of weeks, the scheduler becomes Evicted. When I run a describe on the pod, the issue is: The node was low on resource: ephemeral-storage. Container scheduler was using 14386916Ki, which exceeds its request of 0.
The question is twofold. First, why is the scheduler utilizing ephemeral-storage? And second, is it possible to add ephemeral-storage when running on EKS?
Thanks!
I believe ephemeral storage is not an Airflow question but rather a matter of how your K8S cluster is configured.
Assuming we are talking about OpenShift's ephemeral storage:
https://docs.openshift.com/container-platform/4.9/storage/understanding-ephemeral-storage.html
This can be configured in your cluster and it will make "/var/log" ephemeral.
I think the problem is that /var/log gets full, possibly because of some of the system logs (not from Airflow but from other processes running in the same container). I think a solution would be to have a job that cleans those system logs periodically.
For example, we have this script that cleans up Airflow logs:
https://github.com/apache/airflow/blob/main/scripts/in_container/prod/clean-logs.sh
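Another option is to give the scheduler an explicit ephemeral-storage request and limit, so the kube-scheduler accounts for it at placement time and eviction behaviour becomes predictable. A minimal sketch follows; the pod name, image tag and sizes are illustrative rather than taken from the question, and with a Helm-based Airflow deployment you would set the equivalent resources values in the chart instead:
apiVersion: v1
kind: Pod
metadata:
  name: airflow-scheduler          # illustrative name
spec:
  containers:
    - name: scheduler
      image: apache/airflow:2.7.3  # illustrative tag
      resources:
        requests:
          ephemeral-storage: "2Gi"   # counted when the pod is scheduled onto a node
        limits:
          ephemeral-storage: "10Gi"  # the kubelet evicts the pod if usage exceeds this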
I have an existing cluster running k8s version 1.12.8 on AWS EC2. The cluster contains several pods - some serving web traffic and others configured as scheduled CronJobs. The cluster has been running fine in its current configuration for at least 6 months, with CronJobs running every 5 minutes.
Recently, the CronJobs simply stopped. Viewing the pods via kubectl shows that all the scheduled CronJobs last ran at roughly the same time. Logs sent to AWS CloudWatch show no error output and stop at the same time kubectl shows for the last run.
In trying to diagnose this issue I have found a broader pattern of the cluster being unresponsive to changes, e.g. I cannot retrieve logs or nodes via kubectl.
I deleted Pods in ReplicaSets and they never came back. I set autoscale values on ReplicaSets and nothing happened.
Investigation of the kubelet logs on the master instance revealed repeating errors, coinciding with the time the failure was first noticed:
I0805 03:17:54.597295 2730 kubelet.go:1928] SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z-west-y.compute.internal_kube-system(181xxyyzz)", event: &pleg.PodLifecycleEvent{ID:"181xxyyzz", Type:"ContainerDied", Data:"405ayyzzz"}
...
E0805 03:18:10.867737 2730 kubelet_node_status.go:378] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"NetworkUnavailable\"},{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],\"conditions\":[{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"Ready\"}]}}" for node "ip-172-20-60-88.eu-west-2.compute.internal": Patch https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z-west-y.compute.internal/status?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
...
E0805 03:18:20.869436 2730 kubelet_node_status.go:378] Error updating node status, will retry: error getting node "ip-172-20-60-88.eu-west-2.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-172-20-60-88.eu-west-2.compute.internal?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Running docker ps on the master node shows that both the k8s_kube-controller-manager_kube-controller-manager and k8s_kube-scheduler_kube-scheduler containers were started 6 days ago, whereas the other k8s containers have been up for 8+ months.
tl;dr
A container on my main node (likely kube-scheduler, kube-controller-manager or both) died. The containers have come back up but are unable to communicate with the existing nodes - this is preventing any scheduled CronJobs or new deployments from being satisfied.
How can I re-configure the kubelet and associated services on the master node so they communicate with the worker nodes again?
From the docs on Troubleshoot Clusters
Digging deeper into the cluster requires logging into the relevant machines. Here are the locations of the relevant log files. (note that on systemd-based systems, you may need to use journalctl instead)
Master Nodes
/var/log/kube-apiserver.log - API Server, responsible for serving the API
/var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
/var/log/kube-controller-manager.log - Controller that manages replication controllers
Worker Nodes
/var/log/kubelet.log - Kubelet, responsible for running containers on the node
/var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing
Another way to get logs is to use docker ps to get the container ID and then use docker logs <containerid>.
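For example, on a systemd-based master (the unit name, grep pattern and container ID below are placeholders):
# View kubelet logs via the journal on systemd-based systems
journalctl -u kubelet --since "1 hour ago"
# Or go through the container runtime directly
docker ps | grep kube-scheduler     # find the container ID
docker logs <containerid>           # dump that container's logs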
If you have a monitoring system set up using Prometheus and Grafana (which you should), you can check metrics such as high CPU load on the API server pods.
I have 5 nodes running in a k8s cluster, with around 30 pods.
Some of the pods usually use a lot of memory. At one stage we found that a node went into the "not ready" state when the sum of the memory of all running pods exceeded the node's memory.
Anyhow, I increased the memory resource request to a high value for the high-memory pods, but shouldn't the node controller kill the pods and restart them instead of putting the node into the "not ready" state?
Suppose 4 pods were already running on a node and the scheduler allowed another pod onto that node because its memory request fit within the node's remaining capacity. Over a period of time, for some reason, the memory usage of all the pods starts increasing; although each pod is still under its individual memory limit, the sum of all the pods' memory exceeds the node's memory, and this puts the node into the "not ready" state.
Is there any way to overcome this situation?
Because of this, all pods get shifted to another node, or some pods go to Pending because they have higher resource request values.
Cluster information:
Kubernetes version: 1.10.6
Cloud being used: AWS
You can set a proper eviction threshold for memory on the kubelet, and a restartPolicy in the PodSpec.
See details in https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
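A rough sketch of both pieces, assuming the kubelet is configured via a config file; all names and values below are illustrative, not taken from the question:
# Kubelet config file (e.g. passed via --config) with explicit hard eviction thresholds
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "500Mi"   # evict pods before the node itself runs out of memory
  nodefs.available: "10%"

# Pod spec with memory requests/limits and a restartPolicy
apiVersion: v1
kind: Pod
metadata:
  name: high-memory-app       # illustrative
spec:
  restartPolicy: Always       # restart containers after they are killed
  containers:
    - name: app
      image: example/app:1.0  # illustrative
      resources:
        requests:
          memory: "2Gi"
        limits:
          memory: "4Gi"       # the container is OOM-killed at this limit instead of destabilising the node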
I am new to Kubernetes and I am facing a problem that I do not understand. I created a 4-node cluster in AWS, 1 master node (t2.medium) and 3 normal nodes (c4.xlarge), and they were successfully joined together using kubeadm.
Then I tried to deploy three Cassandra replicas using this yaml, but the pods never leave the Pending state; when I do:
kubectl describe pods cassandra-0
I get the message
0/4 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 3 Insufficient memory.
And I do not understand why, as the machines should be powerful enough to cope with these pods and I haven't deployed any other pods. I am not sure if this means anything but when I execute:
kubectl describe nodes
I see this message:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Therefore my question is: why is this happening, and how can I fix it?
Thank you for your attention
Each node tracks the total amount of requested RAM (resources.requests.memory) for all pods assigned to it. That cannot exceed the total capacity of the machine. I would triple check that you have no other pods. You should see them on kubectl describe node.
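For reference, here is a trimmed-down sketch of the kind of resources stanza to check in the Cassandra StatefulSet; the image and values are illustrative, and the key point is that the sum of requests.memory across the pods on a node must fit within that node's allocatable memory:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
        - name: cassandra
          image: gcr.io/google-samples/cassandra:v13   # illustrative image
          resources:
            requests:
              memory: "1Gi"   # lower this if the requests don't fit the node's allocatable memory
              cpu: "500m"
            limits:
              memory: "1Gi"
You can compare these requests against the Allocatable and Allocated resources sections of kubectl describe nodes.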
Worker nodes were unexpectedly dropped from the cluster by the master, for an unknown reason.
The cluster has the following setup:
AWS
Multi-az configured
Clustered masters, clusters (across AZs)
Flannel networking
Provisioned using CoreOS's kube-aws
An incident of unknown origin occurred in which, within the span of a few seconds, all worker nodes were dropped by the master. The only relevant log entry that we could find was from kube-controller-manager:
I0217 14:19:11.432691 1 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-XX-XX-XX-XX.ec2.internal", UID:"XXX", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ip-XX-XX-XX-XX.ec2.internal status is now: NodeNotReady
The nodes returned to "ready" approximately 10 minutes later.
We have yet to locate the cause of the nodes transitioning to NodeNotReady.
We have so far looked through logs of various system components including:
flannel
kubelet
etcd
controller-manager
One potential noteworthy item is that the active master of the cluster currently resides in a different AZ from the nodes. This should be OK, but could be the source of network connectivity problems. That being said, we have seen no indication in logs / monitoring of inter-AZ connection problems.
Checking the kubelet logs, there was no clear logging event of the nodes changing their state to "not ready" or otherwise, and no clear indication of any fatal events either.
One item that could be noteworthy is that all kubelets logged the following after the outage:
Error updating node status, will retry: error getting node "ip-XX-XX-XX-XX.ec2.internal": Get https://master/api/v1/nodes?fieldSelector=metadata.name%3Dip-XX-XX-XX-XX.ec2.internal&resourceVersion=0: read tcp 10.X.X.X:52534->10.Y.Y.Y:443: read: no route to host".
Again, please note that these log messages were logged after the nodes had rejoined the cluster (there was a clear ~10 min window between the cluster collapse and the nodes rejoining).