How to resolve EKS Fargate nodes disk pressure - amazon-web-services

I am running an EKS cluster with a Fargate profile. I checked the node status using kubectl describe node and it is showing disk pressure:
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Tue, 12 Jul 2022 03:10:33 +0000 Wed, 29 Jun 2022 13:21:17 +0000 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure True Tue, 12 Jul 2022 03:10:33 +0000 Wed, 06 Jul 2022 19:46:54 +0000 KubeletHasDiskPressure kubelet has disk pressure
PIDPressure False Tue, 12 Jul 2022 03:10:33 +0000 Wed, 29 Jun 2022 13:21:17 +0000 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Tue, 12 Jul 2022 03:10:33 +0000 Wed, 29 Jun 2022 13:21:27 +0000 KubeletReady kubelet is posting ready status
There is also a failed garbage collection event:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FreeDiskSpaceFailed 11m (x844 over 2d22h) kubelet failed to garbage collect required amount of images. Wanted to free 6314505830 bytes, but freed 0 bytes
Warning EvictionThresholdMet 65s (x45728 over 5d7h) kubelet Attempting to reclaim ephemeral-storage
I think the disk is filling up quickly because of application logs. The application writes its logs to stdout, which, per the AWS documentation, the container agent in turn writes to log files; I am using the Fargate built-in Fluent Bit to push the application logs to an OpenSearch cluster.
But it looks like the EKS cluster is not deleting the old log files created by the container agent.
I was looking to SSH into the Fargate nodes to debug the issue further, but per AWS Support, SSH into Fargate nodes is not possible.
What can be done to remove disk pressure from fargate nodes?
As suggested in the answers, I am running logrotate in a sidecar. But per the logs of the logrotate container, it is not able to find the directory:
rotating pattern: /var/log/containers/*.log
52428800 bytes (5 rotations)
empty log files are not rotated, old logs are removed
considering log /var/log/containers/*.log
log /var/log/containers/*.log does not exist -- skipping
reading config file /etc/logrotate.conf
Reading state from file: /var/lib/logrotate.status
Allocating hash table for state file, size 64 entries
Creating new state
The YAML file is:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-apis
  namespace: kube-system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: my-apis
        image: 111111xxxxx.dkr.ecr.us-west-2.amazonaws.com/my-apis:1.0.3
        ports:
        - containerPort: 8080
        resources:
          limits:
            cpu: "1000m"
            memory: "1200Mi"
          requests:
            cpu: "1000m"
            memory: "1200Mi"
        readinessProbe:
          httpGet:
            path: "/ping"
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 2
        livenessProbe:
          httpGet:
            path: "/ping"
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 5
      - name: logrotate
        image: realz/logrotate
        volumeMounts:
        - mountPath: /var/log/containers
          name: my-app-logs
        env:
        - name: CRON_EXPR
          value: "*/5 * * * *"
        - name: LOGROTATE_LOGFILES
          value: "/var/log/containers/*.log"
        - name: LOGROTATE_FILESIZE
          value: "50M"
        - name: LOGROTATE_FILENUM
          value: "5"
      volumes:
      - name: my-app-logs
        emptyDir: {}

What can be done to remove disk pressure from fargate nodes?
There is no known configuration that would have Fargate automatically clean a specific log location. You can run logrotate as a sidecar; there are plenty of choices here.
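Note that for a sidecar to have anything to rotate, both containers need to share the same volume and the application has to write its log files into it. An emptyDir mounted only in the sidecar (as in the manifest in the question) stays empty, which is consistent with logrotate reporting that the pattern does not exist. A minimal sketch, assuming a hypothetical application that writes file logs to /var/log/app; the path and images are placeholders, not a confirmed setup:
containers:
- name: my-apis
  image: 111111xxxxx.dkr.ecr.us-west-2.amazonaws.com/my-apis:1.0.3
  volumeMounts:
  - name: app-logs            # shared with the logrotate sidecar
    mountPath: /var/log/app   # hypothetical path the app writes file logs to
- name: logrotate
  image: realz/logrotate
  volumeMounts:
  - name: app-logs
    mountPath: /var/log/app
  env:
  - name: LOGROTATE_LOGFILES
    value: "/var/log/app/*.log"   # must match the shared path, not the node's /var/log/containers
  - name: LOGROTATE_FILESIZE
    value: "50M"
  - name: LOGROTATE_FILENUM
    value: "5"
volumes:
- name: app-logs
  emptyDir: {}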

Found the cause of the disk filling quickly. It was the logging library logback writing logs to both files and the console, and the log rotation policy in logback was retaining a large number of log files for long periods. Removing the appender in the logback config that writes to files fixed the issue.
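For illustration, a minimal logback.xml along those lines, keeping only a console appender so everything goes to stdout (a sketch, not the actual config from this application; the log pattern is arbitrary):
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <!-- the RollingFileAppender that wrote to files was removed here -->
  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>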
I also found out that the STDOUT logs written to files by the container agent are rotated, with a file size of 10 MB and a maximum of 5 files, so they cannot cause disk pressure.

Related

Using python manage.py migrate --check in kubernetes readinessProbe never succeeds

I have a Django deployment on a Kubernetes cluster, and in the readinessProbe I am running python manage.py migrate --check. I can see that the return value of this command is 0, but the pod never becomes ready.
Snippet of my deployment:
containers:
  - name: myapp
    ...
    imagePullPolicy: Always
    readinessProbe:
      exec:
        command: ["python", "manage.py", "migrate", "--check"]
      initialDelaySeconds: 15
      periodSeconds: 5
When I describe the pod which is not yet ready:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 66s default-scheduler Successfully assigned ...
Normal Pulled 66s kubelet Successfully pulled image ...
Normal Created 66s kubelet Created container ...
Normal Started 66s kubelet Started container ...
Warning Unhealthy 5s (x10 over 50s) kubelet Readiness probe failed:
I can see that migrate --check returns 0 by exec'ing into the container, which is still in the not-ready state, and running:
python manage.py migrate
echo $?
0
Is there something wrong with the exec command passed as my readinessProbe?
The version of kubernetes server that I am using is 1.21.7.
The base image for my deployment is python:3.7-slim.
The solution for the issue is to increase the timeoutSeconds parameter, which by default is set to 1 second:
timeoutSeconds: Number of seconds after which the probe times out. Defaults to 1 second. Minimum value is 1.
After increasing the timeoutSeconds parameter, the application is able to pass the readiness probe.
Example snippet of the deployment with timeoutSeconds parameter set to 5:
containers:
  - name: myapp
    ...
    imagePullPolicy: Always
    readinessProbe:
      exec:
        command: ["python", "manage.py", "migrate", "--check"]
      initialDelaySeconds: 15
      periodSeconds: 5
      timeoutSeconds: 5
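To confirm that the probe is hitting the 1-second timeout rather than returning a non-zero exit code, one option is to time the command inside a running pod (a sketch; the pod name and namespace are placeholders, and bash is assumed to be available in the python:3.7-slim image):
kubectl exec -n <namespace> <pod-name> -- bash -c 'time python manage.py migrate --check; echo "exit=$?"'
If the reported real time is above one second while the exit code is 0, increasing timeoutSeconds as shown above is the right fix.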

"Failed to set feature gates from initial flags-based config" err="unrecognized feature gate: CSIBlockVolume"

Steps
I have a use case in which I want to create a kubernetes cluster from scratch using kubeadm.
$ kubeadm init --config admin.yaml --v=7
admin.yaml:
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: /run/containerd/containerd.sock
  ignorePreflightErrors:
    - SystemVerification
localAPIEndpoint:
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: CSIBlockVolume=true,CSIDriverRegistry=true,CSINodeInfo=true,VolumeSnapshotDataSource=true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  CSIBlockVolume: true
  CSIDriverRegistry: true
  CSINodeInfo: true
All operations seem to work until the connection to the kubelet is supposed to be established.
This is the final log before the crash. The GET requests are sent approximately 100 times before it fails.
LOG:
I1216 12:31:45.043530 15460 round_trippers.go:463] GET https://<IP>:6443/healthz?timeout=10s
I1216 12:31:45.043550 15460 round_trippers.go:469] Request Headers:
I1216 12:31:45.043555 15460 round_trippers.go:473] Accept: application/json, */*
I1216 12:31:45.043721 15460 round_trippers.go:574] Response Status: in 0 milliseconds
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp 127.0.0.1:10248: connect: connection refused.
Unfortunately, an error has occurred:
timed out waiting for the condition
This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)
I read the logs for kubelet with
$ journalctl -xeu kubelet
This is the output I received:
- The job identifier is 49904.
Dec 16 13:40:42 <IP> kubelet[24113]: I1216 13:40:42.883879 24113 server.go:198] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in tha>
Dec 16 13:40:42 <IP> kubelet[24113]: E1216 13:40:42.885069 24113 server.go:217] "Failed to set feature gates from initial flags-based config" err="unrecognized feature gate: CSIBlockVolume"
Dec 16 13:40:42 <IP> systemd[1]: kubelet.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- An ExecStart= process belonging to unit kubelet.service has exited.
--
-- The process' exit code is 'exited' and its exit status is 1.
Dec 16 13:40:42 <IP> systemd[1]: kubelet.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
--
-- The unit kubelet.service has entered the 'failed' state with result 'exit-code'.
Setup
Software Versions:
$ kubelet --version
Kubernetes v1.23.0
$ kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.0", GitCommit:"ab69524f795c42094a6630298ff53f3c3ebab7f4", GitTreeState:"clean", BuildDate:"2021-12-07T18:15:11Z", GoVersion:"go1.17.3", Compiler:"gc", Platform:"linux/amd64"}
Kubernetes and platform
$ uname -a
Linux 5.11.0-1022-aws #23~20.04.1-Ubuntu SMP Mon Nov 15 14:03:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
The server is deployed in Amazon AWS.
Container runtime: containerd
Docker installed: No
I also checked the Kubernetes documentation which, if I read it correctly, states that all of these feature gates are now GA, i.e. integrated into Kubernetes and no longer experimental.
...kubelet[24113]: E1216 13:40:42.885069 24113 server.go:217] "Failed to set feature gates from initial flags-based config" err="unrecognized feature gate: CSIBlockVolume"
The CSIBlockVolume feature gate applies to the API server, not the kubelet. You need to enable it in /etc/kubernetes/manifests/kube-apiserver.yaml by adding it to --feature-gates="...,CSIBlockVolume=true".
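In other words, the kubelet fails at startup because the KubeletConfiguration section lists gate names it no longer recognizes. A sketch of the same admin.yaml with that document removed, leaving the gates only on the API server as the answer suggests (and, given the question's own observation that these gates are GA in v1.23, the feature-gates entry may turn out to be unnecessary altogether):
apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
nodeRegistration:
  criSocket: /run/containerd/containerd.sock
  ignorePreflightErrors:
    - SystemVerification
localAPIEndpoint:
  bindPort: 6443
---
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: CSIBlockVolume=true,CSIDriverRegistry=true,CSINodeInfo=true,VolumeSnapshotDataSource=true
# The KubeletConfiguration document with featureGates was removed:
# the v1.23 kubelet rejects those gate names at startup.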

DNS not found from single Docker container running on Beanstalk

I have an AWS Elastic Beanstalk environment and application running the "64bit Amazon Linux 2 v3.0.2 running Docker" solution stack in a standard way. (I followed the AWS documentation.)
I have deployed a Dockerrun.aws.json file to it, and found that it has DNS issues.
To troubleshoot, I SSHed into the EC2 instance where it is running and found that on this instance an nslookup of any of the hostnames in question runs fine.
But, running the nslookup from within Docker such as with . . .
sudo docker run busybox nslookup www.google.com
. . . yields the result:
*** Can't find www.google.com: No answer
Adding the typical solutions, such as passing the --dns x.x.x.x argument or --network=host, does not fix the issue. It still times out contacting DNS.
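For reference, the variants tried look roughly like this (the resolver address is just an example):
sudo docker run --dns 8.8.8.8 busybox nslookup www.google.com
sudo docker run --network=host busybox nslookup www.google.com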
Any thoughts as to what the issue might be? Here is the docker info output:
$ sudo docker info
Client:
Debug Mode: false
Server:
Containers: 30
Running: 1
Paused: 0
Stopped: 29
Images: 4
Server Version: 19.03.6-ce
Storage Driver: devicemapper
Pool Name: docker-docker--pool
Pool Blocksize: 524.3kB
Base Device Size: 107.4GB
Backing Filesystem: ext4
Udev Sync Supported: true
Data Space Used: 5.403GB
Data Space Total: 12.72GB
Data Space Available: 7.314GB
Metadata Space Used: 3.965MB
Metadata Space Total: 16.78MB
Metadata Space Available: 12.81MB
Thin Pool Minimum Free Space: 1.271GB
Deferred Removal Enabled: true
Deferred Deletion Enabled: true
Deferred Deleted Device Count: 0
Library Version: 1.02.135-RHEL7 (2016-11-16)
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: ff48f57fc83a8c44cf4ad5d672424a98ba37ded6
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.181-108.257.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.79GiB
Name:
ID:
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
Live Restore Enabled: false
WARNING: the devicemapper storage-driver is deprecated, and will be removed in a future release.
Thank you

istio 0.2.7 helloworld app init proxy_init stuck at podinitializing

Installing Istio for the first time on Kubernetes 1.7.9. Installed with automatic sidecar injection. When trying the sample applications, although the sidecar and the application containers are started and in the "Running" state, the proxy_init is stuck at PodInitializing and the overall pod state is Init:0/1.
[root@node-8 helloworld]# kubectl describe pods helloworld-v1-3194034472-12rgj
Name: helloworld-v1-3194034472-12rgj
Namespace: default
Node: node-8/136.225.226.159
Start Time: Wed, 01 Nov 2017 19:13:11 +0100
Labels: app=helloworld
pod-template-hash=3194034472
version=v1
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicaSet","namespace":"default","name":"helloworld-v1-3194034472","uid":"5212bc02-bf30-11e7-b818-0050560...
sidecar.istio.io/status=injected-version-0.2.7
Status: Running
IP: 192.168.144.130
Created By: ReplicaSet/helloworld-v1-3194034472
Controlled By: ReplicaSet/helloworld-v1-3194034472
Init Containers:
istio-init:
Container ID:
Image: docker.io/istio/proxy_init:0.2.7
Image ID:
Port: <none>
Args:
-p
15001
-u
1337
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-76kq4 (ro)
Containers:
helloworld:
Container ID: docker://aa89ecc46d273b76d71a0f67d5169519926cc0e01d9d1f2ab960e2b88a46013b
Image: istio/examples-helloworld-v1
Image ID: docker-pullable://docker.io/istio/examples-helloworld-v1@sha256:c671702b11cbcda103720c2bd3e81a4211012bfef085b7326bb7fbfd8cea4a94
Port: 5000/TCP
State: Running
Started: Wed, 01 Nov 2017 19:13:14 +0100
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-76kq4 (ro)
istio-proxy:
Container ID: docker://9bb16159d42229512892feae13614c4c373f3436957b6263c772f62282d75e02
Image: docker.io/istio/proxy:0.2.7
Image ID: docker-pullable://docker.io/istio/proxy@sha256:910546c29a32e11f58bab92e68513a5c8f636621c0e20197833270961fda3713
Port: <none>
Args:
proxy
sidecar
-v
2
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
helloworld
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system:8080
--discoveryRefreshDelay
1s
--zipkinAddress
zipkin.istio-system:9411
--connectTimeout
10s
--statsdUdpAddress
istio-mixer.istio-system:9125
--proxyAdminPort
15000
State: Running
Started: Wed, 01 Nov 2017 19:13:15 +0100
Ready: True
Restart Count: 0
Environment:
POD_NAME: helloworld-v1-3194034472-12rgj (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
Mounts:
/etc/certs/ from istio-certs (ro)
/etc/istio/proxy from istio-envoy (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-76kq4 (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
istio-certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.default
Optional: true
default-token-76kq4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-76kq4
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.alpha.kubernetes.io/notReady:NoExecute for 300s
node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
4m 4m 1 default-scheduler Normal Scheduled Successfully assigned helloworld-v1-3194034472-12rgj to node-8
4m 4m 1 kubelet, node-8 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "istio-envoy"
4m 4m 1 kubelet, node-8 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "default-token-76kq4"
4m 4m 1 kubelet, node-8 Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "istio-certs"
4m 4m 1 kubelet, node-8 spec.initContainers{istio-init} Normal Pulled Container image "docker.io/istio/proxy_init:0.2.7" already present on machine
4m 4m 1 kubelet, node-8 spec.initContainers{istio-init} Normal Created Created container
4m 4m 1 kubelet, node-8 spec.initContainers{istio-init} Normal Started Started container
4m 4m 1 kubelet, node-8 spec.containers{helloworld} Normal Pulled Container image "istio/examples-helloworld-v1" already present on machine
4m 4m 1 kubelet, node-8 spec.containers{helloworld} Normal Created Created container
4m 4m 1 kubelet, node-8 spec.containers{helloworld} Normal Started Started container
4m 4m 1 kubelet, node-8 spec.containers{istio-proxy} Normal Pulled Container image "docker.io/istio/proxy:0.2.7" already present on machine
4m 4m 1 kubelet, node-8 spec.containers{istio-proxy} Normal Created Created container
4m 4m 1 kubelet, node-8 spec.containers{istio-proxy} Normal Started Started container
[root@node-8 helloworld]# kubectl get pods
NAME READY STATUS RESTARTS AGE
helloworld-v1-3194034472-12rgj 0/2 Init:0/1 0 12m
helloworld-v2-717720256-rc06f 0/2 Init:0/1 0 12m
sleep-140275861-vjqf7 0/2 Init:0/1 0 1h
[root@node-8 helloworld]#
The Initializers alpha API is enabled:
[root@node-8 istio-0.2.7]# kubectl api-versions | grep admi
admissionregistration.k8s.io/v1alpha1
[root@node-8 istio-0.2.7]#
From the istio-proxy logs:
[2017-11-02 19:40:19.323][14][warning][main] external/envoy/source/server/server.cc:164] initializing epoch 0 (hot restart version=8.2490552)
[2017-11-02 19:40:19.330][14][warning][main] external/envoy/source/server/server.cc:332] starting main dispatch loop
[2017-11-02 19:40:19.392][14][warning][main] external/envoy/source/server/server.cc:316] all clusters initialized. initializing init manager
[2017-11-02 19:40:19.427][14][warning][config] external/envoy/source/server/listener_manager_impl.cc:451] all dependencies initialized. starting workers
[2017-11-02 19:41:19.429][14][warning][main] external/envoy/source/server/drain_manager_impl.cc:62] shutting down parent after drain
but the proxy_init container is stuck in the Waiting state.
Istio sidecars can be automatically injected into a pod before deployment using an alpha feature in Kubernetes called Initializers. Please ensure your cluster has the Initializers alpha feature enabled. For example, this requires deploying an alpha cluster in GKE. In the IBM Bluemix container service, the alpha feature should already be enabled in 1.7.x k8s clusters.
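On a self-managed cluster, enabling that alpha feature typically means passing flags like the following to the kube-apiserver (a sketch based on the Kubernetes 1.7-era documentation; the "..." stands for whatever admission plugins the cluster already uses, so verify the exact list before changing it):
kube-apiserver \
  --admission-control=...,Initializers \
  --runtime-config=admissionregistration.k8s.io/v1alpha1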
After further research, I figured out that there is a known issue, fixed in 1.8, where the init container can wait in the PodInitializing state: https://github.com/kubernetes/kubernetes/pull/51644. It works fine in 1.8.

How to set the frequency of a liveness/readiness probe in Kubernetes

Is probe frequency customizable in liveness/readiness probe?
Also, how many times readiness probe fails before it removes the pod from service load-balancer? Is it customizable?
The probe frequency is controlled by the sync-frequency command line flag on the Kubelet, which defaults to syncing pod statuses once every 10 seconds.
I'm not aware of any way to customize the number of failed probes needed before a pod is considered not-ready to serve traffic.
If either of these features is important to you, feel free to open an issue explaining what your use case is or send us a PR! :)
You can easily customise the probe's failure threshold and frequency; all of the parameters are defined in the Kubernetes probe documentation.
For example:
livenessProbe:
  failureThreshold: 3
  httpGet:
    path: /health
    port: 9081
    scheme: HTTP
  initialDelaySeconds: 180
  timeoutSeconds: 10
  periodSeconds: 10
  successThreshold: 1
That probe will run for the first time after 3 minutes; it will then run every 10 seconds, and the pod will be restarted after 3 consecutive failures.
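The same fields work for a readinessProbe, which answers the second part of the question: failureThreshold controls how many consecutive failures are needed before the pod is marked not-ready and removed from the Service endpoints. A sketch with a hypothetical /ready endpoint:
readinessProbe:
  httpGet:
    path: /ready        # hypothetical readiness endpoint
    port: 9081
  periodSeconds: 10     # probe frequency
  timeoutSeconds: 5
  failureThreshold: 3   # not-ready after 3 consecutive failures
  successThreshold: 1   # one success marks it ready again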
To customize the liveness/readiness probe frequency and other parameters, we need to add a livenessProbe/readinessProbe element inside the containers element of the YAML associated with that pod. A simple example of the YAML file is given below:
apiVersion: v1
kind: Pod
metadata:
  name: liveness-exec
spec:
  containers:
  - name: liveness-ex
    image: ubuntu
    args:
    - /bin/sh
    - -c
    - touch /tmp/healthy; sleep 30; rm -rf /tmp/healthy; sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/healthy
      initialDelaySeconds: 5
      periodSeconds: 5
The initialDelaySeconds parameter ensures that the liveness probe is first checked 5 seconds after the container starts, and periodSeconds ensures that it is checked every 5 seconds after that. For more parameters, see https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/