Why isn't Kubernetes service DNS working? - amazon-web-services

I have set up DNS in my Kubernetes (v1.1.2+1abf20d) system, on CoreOS/AWS, but I cannot look up services via DNS. I have tried debugging, but cannot for the life of me find out why. This is what happens when I try to look up the kubernetes service, which should always be available:
$ ~/.local/bin/kubectl --kubeconfig=/etc/kubernetes/kube.conf exec busybox-sleep -- nslookup kubernetes.default
Server: 10.3.0.10
Address 1: 10.3.0.10 ip-10-3-0-10.eu-central-1.compute.internal
nslookup: can't resolve 'kubernetes.default'
error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1
I have installed the DNS addon according to this spec:
apiVersion: v1
kind: ReplicationController
metadata:
name: kube-dns-v10
namespace: kube-system
labels:
k8s-app: kube-dns
version: v10
kubernetes.io/cluster-service: "true"
spec:
replicas: 1
selector:
k8s-app: kube-dns
version: v10
template:
metadata:
labels:
k8s-app: kube-dns
version: v10
kubernetes.io/cluster-service: "true"
spec:
containers:
- name: etcd
image: gcr.io/google_containers/etcd-amd64:2.2.1
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 100m
memory: 50Mi
command:
- /usr/local/bin/etcd
- -data-dir
- /var/etcd/data
- -listen-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -advertise-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -initial-cluster-token
- skydns-etcd
volumeMounts:
- name: etcd-storage
mountPath: /var/etcd/data
- name: kube2sky
image: gcr.io/google_containers/kube2sky:1.12
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 100m
memory: 50Mi
args:
# command = "/kube2sky"
- --domain=cluster.local
- name: skydns
image: gcr.io/google_containers/skydns:2015-10-13-8c72f8c
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 100m
memory: 50Mi
args:
# command = "/skydns"
- -machines=http://127.0.0.1:4001
- -addr=0.0.0.0:53
- -ns-rotate=false
- -domain=cluster.local.
ports:
- containerPort: 53
name: dns
protocol: UDP
- containerPort: 53
name: dns-tcp
protocol: TCP
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 1
timeoutSeconds: 5
- name: healthz
image: gcr.io/google_containers/exechealthz:1.0
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
args:
- -cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
- -port=8080
ports:
- containerPort: 8080
protocol: TCP
volumes:
- name: etcd-storage
emptyDir: {}
dnsPolicy: Default # Don't use cluster DNS.
---
apiVersion: v1
kind: Service
metadata:
  name: kube-dns
  namespace: kube-system
  labels:
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: "KubeDNS"
spec:
  selector:
    k8s-app: kube-dns
  clusterIP: 10.3.0.10
  ports:
  - name: dns
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
Why isn't DNS lookup for services working in my Kubernetes setup? Please let me know what other info I need to provide.

There were two things I needed to do:
1) Configure kube2sky via kubeconfig, so that it's properly configured for TLS.
2) Configure kube-proxy via kubeconfig, so that it's properly configured for TLS and finds the master node.
/etc/kubernetes/kube.conf on master node
apiVersion: v1
kind: Config
clusters:
- name: kube
  cluster:
    server: https://127.0.0.1:443
    certificate-authority: /etc/ssl/etcd/ca.pem
users:
- name: kubelet
  user:
    client-certificate: /etc/ssl/etcd/master-client.pem
    client-key: /etc/ssl/etcd/master-client-key.pem
contexts:
- context:
    cluster: kube
    user: kubelet
/etc/kubernetes/kube.conf on worker node
apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    certificate-authority: /etc/ssl/etcd/ca.pem
    server: https://<master IP>:443
users:
- name: kubelet
  user:
    client-certificate: /etc/ssl/etcd/worker.pem
    client-key: /etc/ssl/etcd/worker-key.pem
contexts:
- context:
    cluster: local
    user: kubelet
  name: kubelet-context
current-context: kubelet-context
dns-addon.yaml (install this on master)
apiVersion: v1
kind: ReplicationController
metadata:
name: kube-dns-v11
namespace: kube-system
labels:
k8s-app: kube-dns
version: v11
kubernetes.io/cluster-service: "true"
spec:
replicas: 1
selector:
k8s-app: kube-dns
version: v11
template:
metadata:
labels:
k8s-app: kube-dns
version: v11
kubernetes.io/cluster-service: "true"
spec:
containers:
- name: etcd
image: gcr.io/google_containers/etcd-amd64:2.2.1
resources:
# TODO: Set memory limits when we've profiled the container for large
# clusters, then set request = limit to keep this container in
# guaranteed class. Currently, this container falls into the
# "burstable" category so the kubelet doesn't backoff from restarting
# it.
limits:
cpu: 100m
memory: 500Mi
requests:
cpu: 100m
memory: 50Mi
command:
- /usr/local/bin/etcd
- -data-dir
- /var/etcd/data
- -listen-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -advertise-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -initial-cluster-token
- skydns-etcd
volumeMounts:
- name: etcd-storage
mountPath: /var/etcd/data
- name: kube2sky
image: gcr.io/google_containers/kube2sky:1.14
resources:
# TODO: Set memory limits when we've profiled the container for large
# clusters, then set request = limit to keep this container in
# guaranteed class. Currently, this container falls into the
# "burstable" category so the kubelet doesn't backoff from restarting
# it.
limits:
cpu: 100m
# Kube2sky watches all pods.
memory: 200Mi
requests:
cpu: 100m
memory: 50Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 60
timeoutSeconds: 5
volumeMounts:
- name: kubernetes-etc
mountPath: /etc/kubernetes
readOnly: true
- name: etcd-ssl
mountPath: /etc/ssl/etcd
readOnly: true
readinessProbe:
httpGet:
path: /readiness
port: 8081
scheme: HTTP
# we poll on pod startup for the Kubernetes master service and
# only setup the /readiness HTTP server once that's available.
initialDelaySeconds: 30
timeoutSeconds: 5
args:
# command = "/kube2sky"
- --domain=cluster.local.
- --kubecfg-file=/etc/kubernetes/kube.conf
- name: skydns
image: gcr.io/google_containers/skydns:2015-10-13-8c72f8c
resources:
# TODO: Set memory limits when we've profiled the container for large
# clusters, then set request = limit to keep this container in
# guaranteed class. Currently, this container falls into the
# "burstable" category so the kubelet doesn't backoff from restarting
# it.
limits:
cpu: 100m
memory: 200Mi
requests:
cpu: 100m
memory: 50Mi
args:
# command = "/skydns"
- -machines=http://127.0.0.1:4001
- -addr=0.0.0.0:53
- -ns-rotate=false
- -domain=cluster.local
ports:
- containerPort: 53
name: dns
protocol: UDP
- containerPort: 53
name: dns-tcp
protocol: TCP
- name: healthz
image: gcr.io/google_containers/exechealthz:1.0
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
args:
- -cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
- -port=8080
ports:
- containerPort: 8080
protocol: TCP
volumes:
- name: etcd-storage
emptyDir: {}
- name: kubernetes-etc
hostPath:
path: /etc/kubernetes
- name: etcd-ssl
hostPath:
path: /etc/ssl/etcd
dnsPolicy: Default # Don't use cluster DNS.
/etc/kubernetes/manifests/kube-proxy.yaml on master node
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: gcr.io/google_containers/hyperkube:v1.1.2
    command:
    - /hyperkube
    - proxy
    - --master=https://127.0.0.1:443
    - --proxy-mode=iptables
    - --kubeconfig=/etc/kubernetes/kube.conf
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: ssl-certs-host
      readOnly: true
    - mountPath: /etc/kubernetes
      name: kubernetes
      readOnly: true
    - mountPath: /etc/ssl/etcd
      name: kubernetes-certs
      readOnly: true
  volumes:
  - hostPath:
      path: /usr/share/ca-certificates
    name: ssl-certs-host
  - hostPath:
      path: /etc/kubernetes
    name: kubernetes
  - hostPath:
      path: /etc/ssl/etcd
    name: kubernetes-certs
/etc/kubernetes/manifests/kube-proxy.yaml on worker node
apiVersion: v1
kind: Pod
metadata:
  name: kube-proxy
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-proxy
    image: gcr.io/google_containers/hyperkube:v1.1.2
    command:
    - /hyperkube
    - proxy
    - --kubeconfig=/etc/kubernetes/kube.conf
    - --proxy-mode=iptables
    - --v=2
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /etc/ssl/certs
      name: "ssl-certs"
    - mountPath: /etc/kubernetes/kube.conf
      name: "kubeconfig"
      readOnly: true
    - mountPath: /etc/ssl/etcd
      name: "etc-kube-ssl"
      readOnly: true
  volumes:
  - name: "ssl-certs"
    hostPath:
      path: "/usr/share/ca-certificates"
  - name: "kubeconfig"
    hostPath:
      path: "/etc/kubernetes/kube.conf"
  - name: "etc-kube-ssl"
    hostPath:
      path: "/etc/ssl/etcd"

Related

HPA Kills the pod after a while: But container is processing some launch script

I am trying to enable HPA for a Magento application, which consists of 4 containers in my Kubernetes deployment on GKE. The application is not designed in a container-native way: after the pod launches, it takes 8+ minutes to reach the running state, during which it runs a shell script from the phpfpm container that applies some updates. This is critical for the application to work.
If I use an HPA based on a default metric such as CPU or memory, autoscaling kicks in and attempts to create more replicas. But after around 4 minutes 30 seconds, the new pod is killed automatically and another pod is started, which is again killed after the same period.
Is there any way to make the HPA wait for 8-9 minutes? I know that is too long, but due to the current business case I have no other option.
If I increase the replica count manually, it works perfectly, so it seems that the HPA kills the pod.
My deployment YAML file:
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: magentoappli
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "true"
labels:
environment: "test"
spec:
selector:
matchLabels:
app: magentoappli
replicas: 2
strategy:
type: RollingUpdate
template:
metadata:
labels:
app: magentoappli
spec:
serviceAccountName: magento-sa
terminationGracePeriodSeconds: 10
volumes:
- name: phpfpm-configmap
configMap:
name: phpfpm
items:
- key: php
path: php
- key: phpfpm
path: phpfpm
- name: cluster-credentials
secret:
secretName: cluster-credentials
- name: non-prod-magento-netapp-static-claim
persistentVolumeClaim:
claimName: non-prod-magento-netapp-static-claim
- name: non-prod-magento-netapp-media-claim
persistentVolumeClaim:
claimName: non-prod-magento-netapp-media-claim
securityContext:
runAsUser: 1000
runAsGroup: 1000
fsGroup: 1000
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
initContainers:
- image: gcr.io/google.com/cloudsdktool/cloud-sdk:326.0.0-alpine
name: workload-identity-initcontainer
command:
- /bin/bash
- -c
- |
curl -s -H 'Metadata-Flavor: Google' 'http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/default/token' --retry 30 --retry-connrefused --retry-max-time 30 > /dev/null || exit 1
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "256Mi"
cpu: "200m"
securityContext:
runAsNonRoot: true
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
- name: nfs-fixer
image: alpine
securityContext:
#runAsUser: 0
#runAsGroup: 0
#fsGroup: 0
allowPrivilegeEscalation: false
seccompProfile:
type: RuntimeDefault
volumeMounts:
- name: non-prod-magento-netapp-static-claim
mountPath: /static
- name: non-prod-magento-netapp-media-claim
mountPath: /media
command:
- sh
- -c
- (chmod 0775 /media /static; chown -R 1000:1000 /media /static)
containers:
- name: phpfpm
image: xxx/phpfpm:non-prod-1.50.104
command:
- /bin/sh
- -c
- environmental/entrypoint.sh
securityContext:
allowPrivilegeEscalation: false
#readOnlyRootFilesystem: true
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
ports:
- containerPort: 9000
livenessProbe:
exec:
command:
- php
- /var/www/html/magento/bin/magento
initialDelaySeconds: 15
periodSeconds: 15
timeoutSeconds: 15
volumeMounts:
- name: non-prod-magento-netapp-static-claim
mountPath: /var/www/html/magento/pub/static
- name: non-prod-magento-netapp-media-claim
mountPath: /var/www/html/magento/pub/media
- name: phpfpm-configmap
mountPath: /usr/local/etc/php/php.ini
subPath: php
readOnly: true
- name: phpfpm-configmap
mountPath: /usr/local/etc/php-fpm.conf
subPath: phpfpm
readOnly: true
envFrom:
- secretRef:
name: cluster-credentials
resources:
requests:
memory: "768Mi"
cpu: "2000m"
limits:
memory: "3072Mi"
cpu: "4000m"
- name: httpd
image: xxx/httpd:non-prod-1.50.104
ports:
- containerPort: 8000
securityContext:
allowPrivilegeEscalation: false
#runAsNonRoot: true
#readOnlyRootFilesystem: true
seccompProfile:
type: RuntimeDefault
volumeMounts:
- name: non-prod-magento-netapp-static-claim
mountPath: /var/www/html/magento/pub/static
readOnly: true
- name: non-prod-magento-netapp-media-claim
mountPath: /var/www/html/magento/pub/media
readOnly: true
resources:
requests:
memory: "256Mi"
cpu: "500m"
limits:
memory: "768Mi"
cpu: "750m"
livenessProbe:
httpGet:
port: 8000
path: /health_check.php
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 30
readinessProbe:
httpGet:
port: 8000
path: /health_check.php
initialDelaySeconds: 30
periodSeconds: 15
timeoutSeconds: 30
- name: cloudsql-proxy
image: gcr.io/cloudsql-docker/gce-proxy:1.31.2
command:
- "/cloud_sql_proxy"
- "-ip_address_types=PRIVATE"
- "-instances=GCP_PROJECT:region:db-name=tcp:db_port"
- "-verbose=false"
- "-log_debug_stdout=true"
securityContext:
#runAsNonRoot: true
allowPrivilegeEscalation: false
#readOnlyRootFilesystem: true
seccompProfile:
type: RuntimeDefault
resources:
requests:
memory: "50Mi"
cpu: "50m"
limits:
memory: "100Mi"
cpu: "100m"
HPA:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: magento-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: magentoappli
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 85
Any suggestions?

pod stuck on `ContainerCreating` state in AWS EKS

I deployed a k8s cluster on AWS EKS Fargate and deployed an Elasticsearch container to a pod. The pod is stuck in the ContainerCreating state, and kubectl describe pod shows the error below:
$ kubectl describe pod es-0
Name: es-0
Namespace: default
Priority: 2000001000
Priority Class Name: system-node-critical
Node: fargate-ip-10-0-1-207.ap-southeast-2.compute.internal/10.0.1.207
Start Time: Fri, 28 May 2021 16:39:07 +1000
Labels: controller-revision-hash=es-86f54d94fb
eks.amazonaws.com/fargate-profile=elk_profile
name=es
statefulset.kubernetes.io/pod-name=es-0
Annotations: CapacityProvisioned: 1vCPU 2GB
Logging: LoggingDisabled: LOGGING_CONFIGMAP_NOT_FOUND
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/es
Containers:
es:
Container ID:
Image: elasticsearch:7.10.1
Image ID:
Ports: 9200/TCP, 9300/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 8
Requests:
cpu: 1
memory: 4
Environment: <none>
Mounts:
/usr/share/elasticsearch/config/elasticsearch.yml from es-config (rw,path="elasticsearch.yml")
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6qql4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
es-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: es-config
Optional: false
default-token-6qql4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-6qql4
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 75s (x4252 over 16h) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"read init-p: connection reset by peer\"": unknown
How can I find out what the issue is and how to fix it? I have tried to restart the StatefulSet, but it didn't restart. The pod seems to be stuck.
apiVersion: v1
kind: ConfigMap
metadata:
  name: es-config
data:
  elasticsearch.yml: |
    cluster.name: my-elastic-cluster
    network.host: "0.0.0.0"
    bootstrap.memory_lock: false
    discovery.zen.ping.unicast.hosts: elasticsearch-cluster
    discovery.zen.minimum_master_nodes: 1
    discovery.type: single-node
  ES_JAVA_OPTS: -Xms2g -Xmx4g
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: es
  namespace: default
spec:
  serviceName: es-entrypoint
  replicas: 1
  selector:
    matchLabels:
      name: es
  template:
    metadata:
      labels:
        name: es
    spec:
      volumes:
      - name: es-config
        configMap:
          name: es-config
          items:
          - key: elasticsearch.yml
            path: elasticsearch.yml
      # - name: persistent-storage
      #   persistentVolumeClaim:
      #     claimName: efs-es-claim
      securityContext:
        fsGroup: 1000
        runAsUser: 1000
        runAsGroup: 1000
      containers:
      - name: es
        image: elasticsearch:7.10.1
        resources:
          limits:
            cpu: 2
            memory: 8
          requests:
            cpu: 1
            memory: 4
        ports:
        - name: http
          containerPort: 9200
        - containerPort: 9300
          name: inter-node
        volumeMounts:
        - name: es-config
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          subPath: elasticsearch.yml
        # - name: persistent-storage
        #   mountPath: /usr/share/elasticsearch/data
---
apiVersion: v1
kind: Service
metadata:
  name: es-entrypoint
spec:
  selector:
    name: es
  ports:
  - port: 9200
    targetPort: 9200
    protocol: TCP
  type: NodePort
I figured out why it happens: after removing the resource limits, it works. I'm not sure why the limits are not allowed:
limits:
  cpu: 2
  memory: 8
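One detail that may be relevant here, offered as an assumption rather than a confirmed diagnosis: Kubernetes reads a memory quantity with no suffix as bytes, so memory: 8 is a limit of 8 bytes, not 8 GiB. If gibibytes were intended, the resources block would look roughly like the sketch below. The 8Gi and 4Gi values are guesses at the intent and do not come from the original post.

```yaml
# Sketch only: assumes the author meant gibibytes; a bare number is read as bytes.
resources:
  limits:
    cpu: 2
    memory: 8Gi   # "memory: 8" would be an 8-byte limit
  requests:
    cpu: 1
    memory: 4Gi   # "memory: 4" would be a 4-byte request
```

With explicit unit suffixes such as Mi or Gi, the limits can stay in place instead of being removed.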

Error from server (BadRequest): container "grafana" in pod is waiting to start: PodInitializing

I recently worked on a Deployment for a Grafana instance, in which I edited replicas within the spec: block from "1" to "0". The intention was to scale down the replicas of the Deployment, but it did something totally different, which caused things to end up in the following state:
container "grafana" in pod "grafana-66f99d7dff-qsffd" is waiting to start: PodInitializing
Even though I brought the replicas back to their initial state with the default value, the pod's state still stays at PodInitializing.
Since then, I have tried the following things:
Rolling Restart by running kubectl rollout restart deployment [deployment_name]
Get logs by running kubectl logs [pod name] -c [init_container_name]
Check if nodes are in healthy state by running kubectl get nodes
Get some additional logs for the overall health of the cluster with kubectl cluster-info dump
Here is an output of the yaml for the grafana deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
creationTimestamp: "2019-08-27T11:22:44Z"
generation: 3
labels:
app: grafana
chart: grafana-3.7.2
heritage: Tiller
release: grafana
name: grafana
namespace: default
resourceVersion: "371133807"
selfLink: /apis/apps/v1/namespaces/default/deployments/grafana
uid: fd7a12a5-c8bc-11e9-8b38-42010af0015f
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: grafana
release: grafana
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
annotations:
checksum/config: 26c545fd5de1c9c9af86777a84500c5b1ec229ecb0355ee764271e69639cfd96
checksum/dashboards-json-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/sc-dashboard-provider-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/secret: 940f74350e2a595924ed2ce4d579942346ba465ada21acdcff4916d95f59dbe5
creationTimestamp: null
labels:
app: grafana
release: grafana
spec:
containers:
- env:
- name: GF_SECURITY_ADMIN_USER
valueFrom:
secretKeyRef:
key: admin-user
name: grafana
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
key: admin-password
name: grafana
- name: GF_INSTALL_PLUGINS
valueFrom:
configMapKeyRef:
key: plugins
name: grafana
image: grafana/grafana:6.2.5
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 10
httpGet:
path: /api/health
port: 3000
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 30
name: grafana
ports:
- containerPort: 80
name: service
protocol: TCP
- containerPort: 3000
name: grafana
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /api/health
port: 3000
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/grafana/grafana.ini
name: config
subPath: grafana.ini
- mountPath: /etc/grafana/ldap.toml
name: ldap
subPath: ldap.toml
- mountPath: /var/lib/grafana
name: storage
dnsPolicy: ClusterFirst
initContainers:
- command:
- chown
- -R
- 472:472
- /var/lib/grafana
image: busybox:1.30
imagePullPolicy: IfNotPresent
name: init-chown-data
resources: {}
securityContext:
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/grafana
name: storage
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 472
runAsUser: 472
serviceAccount: grafana
serviceAccountName: grafana
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
name: grafana
name: config
- name: ldap
secret:
defaultMode: 420
items:
- key: ldap-toml
path: ldap.toml
secretName: grafana
- name: storage
persistentVolumeClaim:
claimName: grafana
And this is the output of kubectl describe for the pod:
Name: grafana-66f99d7dff-qsffd
Namespace: default
Priority: 0
Node: gke-micah-prod-new-pool-f3184925-5n50/10.1.15.208
Start Time: Tue, 16 Mar 2021 12:05:25 +0200
Labels: app=grafana
pod-template-hash=66f99d7dff
release=grafana
Annotations: checksum/config: 26c545fd5de1c9c9af86777a84500c5b1ec229ecb0355ee764271e69639cfd96
checksum/dashboards-json-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/sc-dashboard-provider-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/secret: 940f74350e2a595924ed2ce4d579942346ba465ada21acdcff4916d95f59dbe5
kubectl.kubernetes.io/restartedAt: 2021-03-15T18:26:31+02:00
kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container grafana; cpu request for init container init-chown-data
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/grafana-66f99d7dff
Init Containers:
init-chown-data:
Container ID:
Image: busybox:1.30
Image ID:
Port: <none>
Host Port: <none>
Command:
chown
-R
472:472
/var/lib/grafana
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/var/lib/grafana from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from grafana-token-wmgg9 (ro)
Containers:
grafana:
Container ID:
Image: grafana/grafana:6.2.5
Image ID:
Ports: 80/TCP, 3000/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Liveness: http-get http://:3000/api/health delay=60s timeout=30s period=10s #success=1 #failure=10
Readiness: http-get http://:3000/api/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
GF_SECURITY_ADMIN_USER: <set to the key 'admin-user' in secret 'grafana'> Optional: false
GF_SECURITY_ADMIN_PASSWORD: <set to the key 'admin-password' in secret 'grafana'> Optional: false
GF_INSTALL_PLUGINS: <set to the key 'plugins' of config map 'grafana'> Optional: false
Mounts:
/etc/grafana/grafana.ini from config (rw,path="grafana.ini")
/etc/grafana/ldap.toml from ldap (rw,path="ldap.toml")
/var/lib/grafana from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from grafana-token-wmgg9 (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: grafana
Optional: false
ldap:
Type: Secret (a volume populated by a Secret)
SecretName: grafana
Optional: false
storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: grafana
ReadOnly: false
grafana-token-wmgg9:
Type: Secret (a volume populated by a Secret)
SecretName: grafana-token-wmgg9
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 19m (x82 over 169m) kubelet MountVolume.SetUp failed for volume "ldap" : secret "grafana" not found
Warning FailedMount 9m24s (x18 over 167m) kubelet Unable to attach or mount volumes: unmounted volumes=[ldap], unattached volumes=[grafana-token-wmgg9 config ldap storage]: timed out waiting for the condition
Warning FailedMount 4m50s (x32 over 163m) kubelet Unable to attach or mount volumes: unmounted volumes=[ldap], unattached volumes=[storage grafana-token-wmgg9 config ldap]: timed out waiting for the condition
As I am still exploring and researching how to approach this, any advice, or even probing questions that help me think this through, is more than welcome.
Appreciate your time and effort!
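The FailedMount events above name a Secret called grafana that the kubelet cannot find, and the Deployment references that same Secret for the admin-user, admin-password and ldap-toml keys. As a hedged sketch of what such a Secret would need to contain (the key names come from the Deployment above, every value is a placeholder, and in a Helm-managed release the chart would normally create this object):

```yaml
# Sketch only: key names are taken from the Deployment; all values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: grafana
  namespace: default
type: Opaque
stringData:
  admin-user: admin            # placeholder
  admin-password: change-me    # placeholder
  ldap-toml: ""                # placeholder for the ldap.toml contents
```

Whether the Secret was deleted or never recreated is not stated in the question, so this only illustrates what the pod is waiting for.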

Kubernetes Pod Evicted due to disk pressure

I have a k8s environment with one master and two slave nodes. On one of the nodes, two pods (call them pod-A and pod-B) are running. pod-A got evicted due to disk pressure, but pod-B kept running on the same node without being evicted. I have checked the node resources (RAM and disk space), and plenty of space is available. I have also checked Docker using "docker system df": it shows 48% reclaimable space for images and 0% reclaimable for everything else. In the end, I removed all evicted pods of pod-B, and it is running fine now.
1) Why did pod-A get evicted while pod-B kept running on the same node?
2) Why is pod-B evicted when sufficient resources are available?
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
kompose.cmd: kompose convert
kompose.version: 1.17.0 (0c01409)
creationTimestamp: null
labels:
io.kompose.service: zuul
name: zuul
spec:
progressDeadlineSeconds: 2145893647
replicas: 1
revisionHistoryLimit: 2145893647
selector:
matchLabels:
io.kompose.service: zuul
strategy:
type: Recreate
template:
metadata:
creationTimestamp: null
labels:
io.kompose.service: zuul
spec:
containers:
- env:
- name: DATA_DIR
value: /data/work/
- name: log_file_path
value: /data/work/logs/zuul/
- name: spring_cloud_zookeeper_connectString
value: zoo_host:5168
image: repository/zuul:version
imagePullPolicy: Always
name: zuul
ports:
- containerPort: 9090
hostPort: 9090
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /data/work/
name: zuul-claim0
dnsPolicy: ClusterFirst
hostNetwork: true
nodeSelector:
disktype: node1
imagePullSecrets:
- name: regcred
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- hostPath:
path: /opt/DATA_DIR
type: ""
name: zuul-claim0
status: {}
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
kompose.cmd: kompose convert
kompose.version: 1.17.0 (0c01409)
creationTimestamp: null
labels:
io.kompose.service: routing
name: routing
spec:
progressDeadlineSeconds: 2148483657
replicas: 1
revisionHistoryLimit: 2148483657
selector:
matchLabels:
io.kompose.service: routing
strategy:
type: Recreate
template:
metadata:
creationTimestamp: null
labels:
io.kompose.service: routing
spec:
containers:
- env:
- name: DATA_DIR
value: /data/work/
- name: log_file_path
value: /data/logs/routing/
- name: spring_cloud_zookeeper_connectString
value: zoo_host:5168
image: repository/routing:version
imagePullPolicy: Always
name: routing
ports:
- containerPort: 8090
hostPort: 8090
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /data/work/
name: routing-claim0
dnsPolicy: ClusterFirst
hostNetwork: true
nodeSelector:
disktype: node1
imagePullSecrets:
- name: regcred
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
terminationGracePeriodSeconds: 30
volumes:
- hostPath:
path: /opt/DATA_DIR
type: ""
name: routing-claim0
status: {}

Non-persistent Prometheus metrics on Kubernetes

I'm collecting Prometheus metrics from a uwsgi application hosted on Kubernetes, but the metrics are not retained after the pods are deleted. The Prometheus server is hosted on the same Kubernetes cluster, and I have assigned persistent storage to it.
How do I retain the metrics from the pods even after they are deleted?
The Prometheus deployment YAML:
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: prometheus
namespace: default
spec:
replicas: 1
template:
metadata:
labels:
app: prometheus
spec:
containers:
- name: prometheus
image: prom/prometheus
args:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus/"
- "--storage.tsdb.retention=2200h"
ports:
- containerPort: 9090
volumeMounts:
- name: prometheus-config-volume
mountPath: /etc/prometheus/
- name: prometheus-storage-volume
mountPath: /prometheus/
volumes:
- name: prometheus-config-volume
configMap:
defaultMode: 420
name: prometheus-server-conf
- name: prometheus-storage-volume
persistentVolumeClaim:
claimName: azurefile
---
apiVersion: v1
kind: Service
metadata:
labels:
app: prometheus
name: prometheus
spec:
type: LoadBalancer
loadBalancerIP: ...
ports:
- port: 80
protocol: TCP
targetPort: 9090
selector:
app: prometheus
Application deployment yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-app
spec:
replicas: 2
selector:
matchLabels:
app: api-app
template:
metadata:
labels:
app: api-app
spec:
containers:
- name: nginx
image: nginx
lifecycle:
preStop:
exec:
command: ["/usr/sbin/nginx","-s","quit"]
ports:
- containerPort: 80
protocol: TCP
resources:
limits:
cpu: 50m
memory: 100Mi
requests:
cpu: 10m
memory: 50Mi
volumeMounts:
- name: app-api
mountPath: /var/run/app
- name: nginx-conf
mountPath: /etc/nginx/conf.d
- name: api-app
image: azurecr.io/app_api_se:opencv
workingDir: /app
command: ["/usr/local/bin/uwsgi"]
args:
- "--die-on-term"
- "--manage-script-name"
- "--mount=/=api:app_dispatch"
- "--socket=/var/run/app/uwsgi.sock"
- "--chmod-socket=777"
- "--pyargv=se"
- "--metrics-dir=/storage"
- "--metrics-dir-restore"
resources:
requests:
cpu: 150m
memory: 1Gi
volumeMounts:
- name: app-api
mountPath: /var/run/app
- name: storage
mountPath: /storage
volumes:
- name: app-api
emptyDir: {}
- name: storage
persistentVolumeClaim:
claimName: app-storage
- name: nginx-conf
configMap:
name: app
tolerations:
- key: "sku"
operator: "Equal"
value: "test"
effect: "NoSchedule"
---
apiVersion: v1
kind: Service
metadata:
labels:
app: api-app
name: api-app
spec:
ports:
- port: 80
protocol: TCP
targetPort: 80
selector:
app: api-app
Your issue is the type of controller used to deploy Prometheus. The Deployment controller is the wrong choice in this case: it is meant for stateless applications that don't need to maintain any persistent identity or data across Pod rescheduling.
You should switch to the StatefulSet kind* if you require persistence of data (metrics scraped by Prometheus) across Pod (re)scheduling.
*This is how Prometheus is deployed by default with prometheus-operator.
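A minimal sketch of what that switch could look like, reusing the container, args and ConfigMap from the question. The headless Service name, the storage size and the access mode are assumptions for illustration, not values from the original post; the volumeClaimTemplates entry replaces the azurefile claim purely as an example.

```yaml
# Sketch only: serviceName, storage size and access mode are hypothetical.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: default
spec:
  serviceName: prometheus-headless   # hypothetical headless Service
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus
        args:
        - "--config.file=/etc/prometheus/prometheus.yml"
        - "--storage.tsdb.path=/prometheus/"
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: prometheus-config-volume
          mountPath: /etc/prometheus/
        - name: prometheus-storage-volume
          mountPath: /prometheus/
      volumes:
      - name: prometheus-config-volume
        configMap:
          name: prometheus-server-conf
  # One PersistentVolumeClaim is created per replica and kept across rescheduling.
  volumeClaimTemplates:
  - metadata:
      name: prometheus-storage-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi              # hypothetical size
```

The claim created from the template is named after the pod, so a rescheduled prometheus-0 reattaches to the same volume.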
With this configuration for a volume, it will be removed when you release a pod. You are basically looking for a PersistentVolume; see the documentation and example.
Also check PersistentVolumeClaim.