apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: rabbitmq
  namespace: staging
spec:
  chart:
    spec:
      chart: rabbitmq
      sourceRef:
        kind: HelmRepository
        name: bitnami
        namespace: kube-system
      version: "8.22.3"
  interval: 5m
  releaseName: rabbitmq
  values:
    priorityClassName: system-node-critical
    replicaCount: 4
    auth:
      username: staging
      password: rabbitmq-secret
      existingPasswordSecret: rabbitmq-secret
      erlangCookie: **************************
    extraConfiguration: |-
      default_vhost=use2-mmc-1021
      consumer_timeout=86400000
    memoryHighWatermark:
      enabled: "true"
      type: absolute
      value: 1536MB
    resources:
      limits:
        cpu: 500m
        memory: 2048Mi
      requests:
        cpu: 500m
        memory: 2048Mi
    livenessProbe:
      initialDelaySeconds: 120
      timeoutSeconds: 20
      periodSeconds: 30
      failureThreshold: 6
      successThreshold: 1
    readinessProbe:
      initialDelaySeconds: 10
      timeoutSeconds: 20
      periodSeconds: 30
      failureThreshold: 3
      successThreshold: 1
    service:
      type: ClusterIP
    serviceAccount:
      create: true
This is my rabbitmq.yaml file.
I want to prevent my pod from restarting, because restarting the pod takes some time and results in server errors.
The cluster autoscaler should not affect my RabbitMQ pod.
I've already tried using system-node-critical in my code (see priorityClassName above), and it didn't make any difference.
Please help me out with some answers to my problem: how do I stop the cluster autoscaler from affecting my RabbitMQ pod, or how do I correctly make my pod system-node-critical?
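For what it's worth, the cluster autoscaler honours the cluster-autoscaler.kubernetes.io/safe-to-evict annotation and any PodDisruptionBudget covering the pods, while priorityClassName mainly affects scheduling and preemption rather than autoscaler evictions. A minimal sketch, assuming the Bitnami chart exposes a podAnnotations value and labels its pods with app.kubernetes.io/name: rabbitmq (check the values and labels of your chart version):

values:
  podAnnotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
---
# A PodDisruptionBudget that forbids voluntary evictions of the RabbitMQ pods
apiVersion: policy/v1beta1   # policy/v1 on Kubernetes 1.21+
kind: PodDisruptionBudget
metadata:
  name: rabbitmq-pdb
  namespace: staging
spec:
  maxUnavailable: 0
  selector:
    matchLabels:
      app.kubernetes.io/name: rabbitmq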
Related
I deployed a Kubernetes cluster on AWS EKS Fargate and deployed an Elasticsearch container to it. The pod is stuck in the ContainerCreating state, and kubectl describe pod shows the error below:
$ kubectl describe pod es-0
Name: es-0
Namespace: default
Priority: 2000001000
Priority Class Name: system-node-critical
Node: fargate-ip-10-0-1-207.ap-southeast-2.compute.internal/10.0.1.207
Start Time: Fri, 28 May 2021 16:39:07 +1000
Labels: controller-revision-hash=es-86f54d94fb
eks.amazonaws.com/fargate-profile=elk_profile
name=es
statefulset.kubernetes.io/pod-name=es-0
Annotations: CapacityProvisioned: 1vCPU 2GB
Logging: LoggingDisabled: LOGGING_CONFIGMAP_NOT_FOUND
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/es
Containers:
es:
Container ID:
Image: elasticsearch:7.10.1
Image ID:
Ports: 9200/TCP, 9300/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Limits:
cpu: 2
memory: 8
Requests:
cpu: 1
memory: 4
Environment: <none>
Mounts:
/usr/share/elasticsearch/config/elasticsearch.yml from es-config (rw,path="elasticsearch.yml")
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6qql4 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
es-config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: es-config
Optional: false
default-token-6qql4:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-6qql4
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedCreatePodSandBox 75s (x4252 over 16h) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:319: getting the final child's pid from pipe caused \"read init-p: connection reset by peer\"": unknown
How do I find out what the issue is and how to fix it? I have tried to restart the StatefulSet, but it didn't restart; the pod seems stuck. The manifests are below:
apiVersion: v1
kind: ConfigMap
metadata:
name: es-config
data:
elasticsearch.yml: |
cluster.name: my-elastic-cluster
network.host: "0.0.0.0"
bootstrap.memory_lock: false
discovery.zen.ping.unicast.hosts: elasticsearch-cluster
discovery.zen.minimum_master_nodes: 1
discovery.type: single-node
ES_JAVA_OPTS: -Xms2g -Xmx4g
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
name: es
namespace: default
spec:
serviceName: es-entrypoint
replicas: 1
selector:
matchLabels:
name: es
template:
metadata:
labels:
name: es
spec:
volumes:
- name: es-config
configMap:
name: es-config
items:
- key: elasticsearch.yml
path: elasticsearch.yml
# - name: persistent-storage
# persistentVolumeClaim:
# claimName: efs-es-claim
securityContext:
fsGroup: 1000
runAsUser: 1000
runAsGroup: 1000
containers:
- name: es
image: elasticsearch:7.10.1
resources:
limits:
cpu: 2
memory: 8
requests:
cpu: 1
memory: 4
ports:
- name: http
containerPort: 9200
- containerPort: 9300
name: inter-node
volumeMounts:
- name: es-config
mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
subPath: elasticsearch.yml
# - name: persistent-storage
# mountPath: /usr/share/elasticsearch/data
---
apiVersion: v1
kind: Service
metadata:
name: es-entrypoint
spec:
selector:
name: es
ports:
- port: 9200
targetPort: 9200
protocol: TCP
type: NodePort
Figured out why it happens: after removing the resource limits, it works. I'm still not sure why it doesn't accept these limits; most likely the bare numbers are the problem, since a plain 8 under memory is parsed as 8 bytes, far too small for the container runtime to even start the process:
limits:
cpu: 2
memory: 8
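If limits are still wanted, the quantities need units; a bare 8 under memory is read as 8 bytes. A sketch that keeps the same shape (the sizes are illustrative and have to fit within the Fargate profile's capacity):

resources:
  requests:
    cpu: 1
    memory: 4Gi
  limits:
    cpu: 2
    memory: 8Gi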
I recently worked on a deployment for a Grafana instance and edited the replicas field within the spec: block from "1" to "0". The intention was to scale down the deployment, but it did something totally different and left things in the following state:
container "grafana" in pod "grafana-66f99d7dff-qsffd" is waiting to start: PodInitializing
Even though I brought replicas back to its initial default value, the pod still stays in PodInitializing.
Since then, I have tried the following things:
Rolling Restart by running kubectl rollout restart deployment [deployment_name]
Get logs by running kubectl logs [pod name] -c [init_container_name]
Check if nodes are in healthy state by running kubectl get nodes
Get some additional logs for the overall health of the cluster with kubectl cluster-info dump
Here is the YAML of the Grafana deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
creationTimestamp: "2019-08-27T11:22:44Z"
generation: 3
labels:
app: grafana
chart: grafana-3.7.2
heritage: Tiller
release: grafana
name: grafana
namespace: default
resourceVersion: "371133807"
selfLink: /apis/apps/v1/namespaces/default/deployments/grafana
uid: fd7a12a5-c8bc-11e9-8b38-42010af0015f
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: grafana
release: grafana
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
annotations:
checksum/config: 26c545fd5de1c9c9af86777a84500c5b1ec229ecb0355ee764271e69639cfd96
checksum/dashboards-json-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/sc-dashboard-provider-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/secret: 940f74350e2a595924ed2ce4d579942346ba465ada21acdcff4916d95f59dbe5
creationTimestamp: null
labels:
app: grafana
release: grafana
spec:
containers:
- env:
- name: GF_SECURITY_ADMIN_USER
valueFrom:
secretKeyRef:
key: admin-user
name: grafana
- name: GF_SECURITY_ADMIN_PASSWORD
valueFrom:
secretKeyRef:
key: admin-password
name: grafana
- name: GF_INSTALL_PLUGINS
valueFrom:
configMapKeyRef:
key: plugins
name: grafana
image: grafana/grafana:6.2.5
imagePullPolicy: IfNotPresent
livenessProbe:
failureThreshold: 10
httpGet:
path: /api/health
port: 3000
scheme: HTTP
initialDelaySeconds: 60
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 30
name: grafana
ports:
- containerPort: 80
name: service
protocol: TCP
- containerPort: 3000
name: grafana
protocol: TCP
readinessProbe:
failureThreshold: 3
httpGet:
path: /api/health
port: 3000
scheme: HTTP
periodSeconds: 10
successThreshold: 1
timeoutSeconds: 1
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /etc/grafana/grafana.ini
name: config
subPath: grafana.ini
- mountPath: /etc/grafana/ldap.toml
name: ldap
subPath: ldap.toml
- mountPath: /var/lib/grafana
name: storage
dnsPolicy: ClusterFirst
initContainers:
- command:
- chown
- -R
- 472:472
- /var/lib/grafana
image: busybox:1.30
imagePullPolicy: IfNotPresent
name: init-chown-data
resources: {}
securityContext:
runAsUser: 0
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/lib/grafana
name: storage
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 472
runAsUser: 472
serviceAccount: grafana
serviceAccountName: grafana
terminationGracePeriodSeconds: 30
volumes:
- configMap:
defaultMode: 420
name: grafana
name: config
- name: ldap
secret:
defaultMode: 420
items:
- key: ldap-toml
path: ldap.toml
secretName: grafana
- name: storage
persistentVolumeClaim:
claimName: grafana
And this is the output of kubectl describe for the pod:
Name: grafana-66f99d7dff-qsffd
Namespace: default
Priority: 0
Node: gke-micah-prod-new-pool-f3184925-5n50/10.1.15.208
Start Time: Tue, 16 Mar 2021 12:05:25 +0200
Labels: app=grafana
pod-template-hash=66f99d7dff
release=grafana
Annotations: checksum/config: 26c545fd5de1c9c9af86777a84500c5b1ec229ecb0355ee764271e69639cfd96
checksum/dashboards-json-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/sc-dashboard-provider-config: 01ba4719c80b6fe911b091a7c05124b64eeece964e09c058ef8f9805daca546b
checksum/secret: 940f74350e2a595924ed2ce4d579942346ba465ada21acdcff4916d95f59dbe5
kubectl.kubernetes.io/restartedAt: 2021-03-15T18:26:31+02:00
kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container grafana; cpu request for init container init-chown-data
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/grafana-66f99d7dff
Init Containers:
init-chown-data:
Container ID:
Image: busybox:1.30
Image ID:
Port: <none>
Host Port: <none>
Command:
chown
-R
472:472
/var/lib/grafana
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/var/lib/grafana from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from grafana-token-wmgg9 (ro)
Containers:
grafana:
Container ID:
Image: grafana/grafana:6.2.5
Image ID:
Ports: 80/TCP, 3000/TCP
Host Ports: 0/TCP, 0/TCP
State: Waiting
Reason: PodInitializing
Ready: False
Restart Count: 0
Requests:
cpu: 100m
Liveness: http-get http://:3000/api/health delay=60s timeout=30s period=10s #success=1 #failure=10
Readiness: http-get http://:3000/api/health delay=0s timeout=1s period=10s #success=1 #failure=3
Environment:
GF_SECURITY_ADMIN_USER: <set to the key 'admin-user' in secret 'grafana'> Optional: false
GF_SECURITY_ADMIN_PASSWORD: <set to the key 'admin-password' in secret 'grafana'> Optional: false
GF_INSTALL_PLUGINS: <set to the key 'plugins' of config map 'grafana'> Optional: false
Mounts:
/etc/grafana/grafana.ini from config (rw,path="grafana.ini")
/etc/grafana/ldap.toml from ldap (rw,path="ldap.toml")
/var/lib/grafana from storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from grafana-token-wmgg9 (ro)
Conditions:
Type Status
Initialized False
Ready False
ContainersReady False
PodScheduled True
Volumes:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: grafana
Optional: false
ldap:
Type: Secret (a volume populated by a Secret)
SecretName: grafana
Optional: false
storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: grafana
ReadOnly: false
grafana-token-wmgg9:
Type: Secret (a volume populated by a Secret)
SecretName: grafana-token-wmgg9
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedMount 19m (x82 over 169m) kubelet MountVolume.SetUp failed for volume "ldap" : secret "grafana" not found
Warning FailedMount 9m24s (x18 over 167m) kubelet Unable to attach or mount volumes: unmounted volumes=[ldap], unattached volumes=[grafana-token-wmgg9 config ldap storage]: timed out waiting for the condition
Warning FailedMount 4m50s (x32 over 163m) kubelet Unable to attach or mount volumes: unmounted volumes=[ldap], unattached volumes=[storage grafana-token-wmgg9 config ldap]: timed out waiting for the condition
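For what it's worth, the FailedMount events above point at the source of the ldap volume; a quick check of whether that secret actually exists in the namespace used by the deployment (default, per the manifest above) would be:

kubectl get secret grafana -n default
kubectl describe secret grafana -n default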
As I am still exploring and researching how to approach this, any advice, or even probing questions, is more than welcome. I appreciate your time and effort!
I'm trying to deploy a Prometheus node-exporter DaemonSet in my AWS EKS Kubernetes cluster.
apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
labels:
app: prometheus
chart: prometheus-11.12.1
component: node-exporter
heritage: Helm
release: prometheus
name: prometheus-node-exporter
namespace: operations-tools-test
spec:
selector:
matchLabels:
app: prometheus
component: node-exporter
release: prometheus
template:
metadata:
labels:
app: prometheus
chart: prometheus-11.12.1
component: node-exporter
heritage: Helm
release: prometheus
spec:
containers:
- args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --web.listen-address=:9100
image: prom/node-exporter:v1.0.1
imagePullPolicy: IfNotPresent
name: prometheus-node-exporter
ports:
- containerPort: 9100
hostPort: 9100
name: metrics
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/proc
name: proc
readOnly: true
- mountPath: /host/sys
name: sys
readOnly: true
dnsPolicy: ClusterFirst
hostNetwork: true
hostPID: true
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: prometheus-node-exporter
serviceAccountName: prometheus-node-exporter
terminationGracePeriodSeconds: 30
volumes:
- hostPath:
path: /proc
type: ""
name: proc
- hostPath:
path: /sys
type: ""
name: sys
After deploying it, however, it's not getting scheduled on one node.
The pod.yml for that pod looks like this:
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/psp: eks.privileged
generateName: prometheus-node-exporter-
labels:
app: prometheus
chart: prometheus-11.12.1
component: node-exporter
heritage: Helm
pod-template-generation: "1"
release: prometheus
name: prometheus-node-exporter-xxxxx
namespace: operations-tools-test
ownerReferences:
- apiVersion: apps/v1
blockOwnerDeletion: true
controller: true
kind: DaemonSet
name: prometheus-node-exporter
resourceVersion: "51496903"
selfLink: /api/v1/namespaces/namespace-x/pods/prometheus-node-exporter-xxxxx
spec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchFields:
- key: metadata.name
operator: In
values:
- ip-xxx-xx-xxx-xxx.ec2.internal
containers:
- args:
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --web.listen-address=:9100
image: prom/node-exporter:v1.0.1
imagePullPolicy: IfNotPresent
name: prometheus-node-exporter
ports:
- containerPort: 9100
hostPort: 9100
name: metrics
protocol: TCP
resources: {}
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /host/proc
name: proc
readOnly: true
- mountPath: /host/sys
name: sys
readOnly: true
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: prometheus-node-exporter-token-xxxx
readOnly: true
dnsPolicy: ClusterFirst
enableServiceLinks: true
hostNetwork: true
hostPID: true
priority: 0
restartPolicy: Always
schedulerName: default-scheduler
securityContext: {}
serviceAccount: prometheus-node-exporter
serviceAccountName: prometheus-node-exporter
terminationGracePeriodSeconds: 30
tolerations:
- effect: NoExecute
key: node.kubernetes.io/not-ready
operator: Exists
- effect: NoExecute
key: node.kubernetes.io/unreachable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/disk-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/memory-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/pid-pressure
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/unschedulable
operator: Exists
- effect: NoSchedule
key: node.kubernetes.io/network-unavailable
operator: Exists
volumes:
- hostPath:
path: /proc
type: ""
name: proc
- hostPath:
path: /sys
type: ""
name: sys
- name: prometheus-node-exporter-token-xxxxx
secret:
defaultMode: 420
secretName: prometheus-node-exporter-token-xxxxx
status:
conditions:
- lastProbeTime: null
lastTransitionTime: "2020-11-06T23:56:47Z"
message: '0/4 nodes are available: 2 node(s) didn''t have free ports for the requested
pod ports, 3 Insufficient pods, 3 node(s) didn''t match node selector.'
reason: Unschedulable
status: "False"
type: PodScheduled
phase: Pending
qosClass: BestEffort
As seen above, the pod's nodeAffinity looks up metadata.name, which exactly matches the name of my node.
But when I run the command below,
kubectl describe po prometheus-node-exporter-xxxxx
I get the following in the events:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 60m default-scheduler 0/4 nodes are available: 1 Insufficient pods, 3 node(s) didn't match node selector.
Warning FailedScheduling 4m46s (x37 over 58m) default-scheduler 0/4 nodes are available: 2 node(s) didn't have free ports for the requested pod ports, 3 Insufficient pods, 3 node(s) didn't match node selector.
I have also checked the CloudWatch logs for the scheduler, and I don't see any logs for my failed pod.
The node has ample resources left:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 520m (26%) 210m (10%)
memory 386Mi (4%) 486Mi (6%)
I don't see a reason why it should not schedule a pod.
Can anyone help me with this?
TIA
As posted in the comments:
Please add to the question the steps that you followed (editing any values in the Helm chart etc). Also please check if the nodes are not over the limit of pods that can be scheduled on it. Here you can find the link for more reference: LINK.
There are no processes occupying port 9100 on the given node. @DawidKruk The pod limit was reached. Thanks! I expected it to give me some error about that rather than the vague message that the node selector doesn't match.
Not really sure why the following messages were displayed:
node(s) didn't have free ports for the requested pod ports
node(s) didn't match node selector
The issue of Pods not being scheduled on the nodes (stuck in the Pending state) was connected to the Insufficient pods message in the $ kubectl get events output.
The message above is displayed when the nodes have reached their maximum capacity of pods (for example: node1 can schedule a maximum of 30 pods).
More on the Insufficient Pods can be found in this github issue comment:
That's true. That's because the CNI implementation on EKS. Max pods number is limited by the network interfaces attached to instance multiplied by the number of ips per ENI - which varies depending on the size of instance. It's apparent for small instances, this number can be quite a low number.
Docs.aws.amazon.com: AWSEC2: User Guide: Using ENI: Available IP per ENI
-- Github.com: Kubernetes: Autoscaler: Issue 1576: Comment 454100551
Additional resources:
Stackoverflow.com: Questions: Pod limit on node AWS EKS
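As a rule of thumb on EKS, the pod capacity of a node is roughly (number of ENIs × (IPs per ENI − 1)) + 2; a t3.medium, for example, supports 3 ENIs with 6 IPv4 addresses each, giving 17 pods. The allocatable pod count and current usage can be checked directly (the node name below is a placeholder):

kubectl get nodes -o custom-columns=NAME:.metadata.name,MAXPODS:.status.allocatable.pods
kubectl describe node ip-xxx-xx-xxx-xxx.ec2.internal | grep -A8 'Non-terminated Pods'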
I have created GKE Ingress as follows:
apiVersion: cloud.google.com/v1beta1 #tried cloud.google.com/v1 as well
kind: BackendConfig
metadata:
name: backend-config
namespace: prod
spec:
healthCheck:
checkIntervalSec: 30
port: 8080
type: HTTP #case-sensitive
requestPath: /healthcheck
connectionDraining:
drainingTimeoutSec: 60
---
apiVersion: v1
kind: Service
metadata:
name: web-engine-service
namespace: prod
annotations:
cloud.google.com/neg: '{"ingress": true}' # Creates a NEG after an Ingress is created.
cloud.google.com/backend-config: '{"ports": {"web-engine-port":"backend-config"}}' #https://cloud.google.com/kubernetes-engine/docs/how-to/ingress-features#associating_backendconfig_with_your_ingress
spec:
selector:
app: web-engine-pod
ports:
- name: web-engine-port
protocol: TCP
port: 8080
targetPort: 5000
---
apiVersion: apps/v1
kind: Deployment
metadata:
annotations:
deployment.kubernetes.io/revision: "1"
labels:
app: web-engine-deployment
environment: prod
name: web-engine-deployment
namespace: prod
spec:
progressDeadlineSeconds: 600
replicas: 1
revisionHistoryLimit: 10
selector:
matchLabels:
app: web-engine-pod
strategy:
rollingUpdate:
maxSurge: 25%
maxUnavailable: 25%
type: RollingUpdate
template:
metadata:
name: web-engine-pod
labels:
app: web-engine-pod
environment: prod
spec:
containers:
- image: my-image:my-tag
imagePullPolicy: Always
name: web-engine-1
resources: {}
ports:
- name: flask-port
containerPort: 5000
protocol: TCP
readinessProbe:
httpGet:
path: /healthcheck
port: 5000
initialDelaySeconds: 30
periodSeconds: 100
restartPolicy: Always
terminationGracePeriodSeconds: 30
---
apiVersion: networking.gke.io/v1beta2
kind: ManagedCertificate
metadata:
name: my-certificate
namespace: prod
spec:
domains:
- api.mydomain.com #https://cloud.google.com/load-balancing/docs/ssl-certificates/google-managed-certs#renewal
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
name: prod-ingress
namespace: prod
annotations:
kubernetes.io/ingress.allow-http: "false"
kubernetes.io/ingress.global-static-ip-name: load-balancer-ip
networking.gke.io/managed-certificates: my-certificate
spec:
rules:
- http:
paths:
- path: /model
backend:
serviceName: web-engine-service
servicePort: 8080
I don't know what I'm doing wrong, because my health checks are not OK, and based on the perimeter logging I added to the app, nothing is even trying to hit that pod.
I've tried BackendConfig for both 8080 and 5000.
By the way, it's not 100% clear from the docs whether the load balancer should be configured against the targetPort of the corresponding Pods or against the Service port.
The health check is registered with the HTTP load balancer and shows up in Compute Engine.
It seems that something is not right with the backend service IP.
The corresponding backend service configuration:
$ gcloud compute backend-services describe k8s1-85ef2f9a-prod-web-engine-service-8080-b938a707
...
affinityCookieTtlSec: 0
backends:
- balancingMode: RATE
capacityScaler: 1.0
group: https://www.googleapis.com/compute/v1/projects/wnd/zones/europe-west3-a/networkEndpointGroups/k8s1-85ef2f9a-prod-web-engine-service-8080-b938a707
maxRatePerEndpoint: 1.0
connectionDraining:
drainingTimeoutSec: 60
creationTimestamp: '2020-08-01T11:14:06.096-07:00'
description: '{"kubernetes.io/service-name":"prod/web-engine-service","kubernetes.io/service-port":"8080","x-features":["NEG"]}'
enableCDN: false
fingerprint: 5Vkqvg9lcRg=
healthChecks:
- https://www.googleapis.com/compute/v1/projects/wnd/global/healthChecks/k8s1-85ef2f9a-prod-web-engine-service-8080-b938a707
id: '2233674285070159361'
kind: compute#backendService
loadBalancingScheme: EXTERNAL
logConfig:
enable: true
sampleRate: 1.0
name: k8s1-85ef2f9a-prod-web-engine-service-8080-b938a707
port: 80
portName: port0
protocol: HTTP
selfLink: https://www.googleapis.com/compute/v1/projects/wnd/global/backendServices/k8s1-85ef2f9a-prod-web-engine-service-8080-b938a707
sessionAffinity: NONE
timeoutSec: 30
(port 80 looks really suspicious but I thought maybe it's just left there as default and is not in use when NEGs are configured).
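One thing that helps narrow this down is asking the load balancer what it thinks of the endpoints behind that backend service; for the backend service described above that would be something like:

gcloud compute backend-services get-health k8s1-85ef2f9a-prod-web-engine-service-8080-b938a707 --global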
Figured it out. By default, even the latest GKE clusters are created with no IP alias support, also called VPC-native. I didn't even bother to check that initially, for two reasons:
NEGs are supported out of the box; what's more, they seem to be the default on the GKE version I had (1.17.8-gke.17), with no explicit annotation needed. It doesn't make sense not to enable IP aliases by default in that case, because it basically leaves the cluster in a non-functional state by default.
I didn't check VPC-native support initially because the name of the feature is simply misleading. I have extensive prior experience with AWS, and my faulty assumption was that VPC-native is like EC2-VPC, as opposed to the legacy EC2-Classic.
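For reference, a minimal sketch of creating a VPC-native (alias IP) cluster, with the cluster name and zone as placeholders:

gcloud container clusters create my-cluster \
    --zone europe-west3-a \
    --enable-ip-alias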
I have set up DNS in my Kubernetes (v1.1.2+1abf20d) system, on CoreOS/AWS, but I cannot look up services via DNS. I have tried debugging, but cannot for the life of me find out why. This is what happens when I try to look up the kubernetes service, which should always be available:
$ ~/.local/bin/kubectl --kubeconfig=/etc/kubernetes/kube.conf exec busybox-sleep -- nslookup kubernetes.default
Server: 10.3.0.10
Address 1: 10.3.0.10 ip-10-3-0-10.eu-central-1.compute.internal
nslookup: can't resolve 'kubernetes.default'
error: error executing remote command: Error executing command in container: Error executing in Docker Container: 1
I have installed the DNS addon according to this spec:
apiVersion: v1
kind: ReplicationController
metadata:
name: kube-dns-v10
namespace: kube-system
labels:
k8s-app: kube-dns
version: v10
kubernetes.io/cluster-service: "true"
spec:
replicas: 1
selector:
k8s-app: kube-dns
version: v10
template:
metadata:
labels:
k8s-app: kube-dns
version: v10
kubernetes.io/cluster-service: "true"
spec:
containers:
- name: etcd
image: gcr.io/google_containers/etcd-amd64:2.2.1
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 100m
memory: 50Mi
command:
- /usr/local/bin/etcd
- -data-dir
- /var/etcd/data
- -listen-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -advertise-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -initial-cluster-token
- skydns-etcd
volumeMounts:
- name: etcd-storage
mountPath: /var/etcd/data
- name: kube2sky
image: gcr.io/google_containers/kube2sky:1.12
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 100m
memory: 50Mi
args:
# command = "/kube2sky"
- --domain=cluster.local
- name: skydns
image: gcr.io/google_containers/skydns:2015-10-13-8c72f8c
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 100m
memory: 50Mi
requests:
cpu: 100m
memory: 50Mi
args:
# command = "/skydns"
- -machines=http://127.0.0.1:4001
- -addr=0.0.0.0:53
- -ns-rotate=false
- -domain=cluster.local.
ports:
- containerPort: 53
name: dns
protocol: UDP
- containerPort: 53
name: dns-tcp
protocol: TCP
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 30
timeoutSeconds: 5
readinessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 1
timeoutSeconds: 5
- name: healthz
image: gcr.io/google_containers/exechealthz:1.0
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
args:
- -cmd=nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
- -port=8080
ports:
- containerPort: 8080
protocol: TCP
volumes:
- name: etcd-storage
emptyDir: {}
dnsPolicy: Default # Don't use cluster DNS.
---
apiVersion: v1
kind: Service
metadata:
name: kube-dns
namespace: kube-system
labels:
k8s-app: kube-dns
kubernetes.io/cluster-service: "true"
kubernetes.io/name: "KubeDNS"
spec:
selector:
k8s-app: kube-dns
clusterIP: 10.3.0.10
ports:
- name: dns
port: 53
protocol: UDP
- name: dns-tcp
port: 53
protocol: TCP
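A good place to look while debugging is the kube-dns pod itself; its containers' logs can be pulled with something like the following (the pod name suffix is whatever kubectl reports):

$ kubectl --kubeconfig=/etc/kubernetes/kube.conf --namespace=kube-system get pods -l k8s-app=kube-dns
$ kubectl --kubeconfig=/etc/kubernetes/kube.conf --namespace=kube-system logs kube-dns-v10-xxxxx -c kube2sky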
Why isn't DNS lookup for services working in my Kubernetes setup? Please let me know what other info I need to provide.
There were two things I needed to do:
Configure kube2sky via kubeconfig, so that it's properly configured for TLS.
Configure kube-proxy via kubeconfig, so that it's properly configured for TLS and finds the master node.
/etc/kubernetes/kube.conf on master node
apiVersion: v1
kind: Config
clusters:
- name: kube
cluster:
server: https://127.0.0.1:443
certificate-authority: /etc/ssl/etcd/ca.pem
users:
- name: kubelet
user:
client-certificate: /etc/ssl/etcd/master-client.pem
client-key: /etc/ssl/etcd/master-client-key.pem
contexts:
- context:
cluster: kube
user: kubelet
/etc/kubernetes/kube.conf on worker node
apiVersion: v1
kind: Config
clusters:
- name: local
cluster:
certificate-authority: /etc/ssl/etcd/ca.pem
server: https://<master IP>:443
users:
- name: kubelet
user:
client-certificate: /etc/ssl/etcd/worker.pem
client-key: /etc/ssl/etcd/worker-key.pem
contexts:
- context:
cluster: local
user: kubelet
name: kubelet-context
current-context: kubelet-context
dns-addon.yaml (install this on master)
apiVersion: v1
kind: ReplicationController
metadata:
name: kube-dns-v11
namespace: kube-system
labels:
k8s-app: kube-dns
version: v11
kubernetes.io/cluster-service: "true"
spec:
replicas: 1
selector:
k8s-app: kube-dns
version: v11
template:
metadata:
labels:
k8s-app: kube-dns
version: v11
kubernetes.io/cluster-service: "true"
spec:
containers:
- name: etcd
image: gcr.io/google_containers/etcd-amd64:2.2.1
resources:
# TODO: Set memory limits when we've profiled the container for large
# clusters, then set request = limit to keep this container in
# guaranteed class. Currently, this container falls into the
# "burstable" category so the kubelet doesn't backoff from restarting
# it.
limits:
cpu: 100m
memory: 500Mi
requests:
cpu: 100m
memory: 50Mi
command:
- /usr/local/bin/etcd
- -data-dir
- /var/etcd/data
- -listen-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -advertise-client-urls
- http://127.0.0.1:2379,http://127.0.0.1:4001
- -initial-cluster-token
- skydns-etcd
volumeMounts:
- name: etcd-storage
mountPath: /var/etcd/data
- name: kube2sky
image: gcr.io/google_containers/kube2sky:1.14
resources:
# TODO: Set memory limits when we've profiled the container for large
# clusters, then set request = limit to keep this container in
# guaranteed class. Currently, this container falls into the
# "burstable" category so the kubelet doesn't backoff from restarting
# it.
limits:
cpu: 100m
# Kube2sky watches all pods.
memory: 200Mi
requests:
cpu: 100m
memory: 50Mi
livenessProbe:
httpGet:
path: /healthz
port: 8080
scheme: HTTP
initialDelaySeconds: 60
timeoutSeconds: 5
volumeMounts:
- name: kubernetes-etc
mountPath: /etc/kubernetes
readOnly: true
- name: etcd-ssl
mountPath: /etc/ssl/etcd
readOnly: true
readinessProbe:
httpGet:
path: /readiness
port: 8081
scheme: HTTP
# we poll on pod startup for the Kubernetes master service and
# only setup the /readiness HTTP server once that's available.
initialDelaySeconds: 30
timeoutSeconds: 5
args:
# command = "/kube2sky"
- --domain=cluster.local.
- --kubecfg-file=/etc/kubernetes/kube.conf
- name: skydns
image: gcr.io/google_containers/skydns:2015-10-13-8c72f8c
resources:
# TODO: Set memory limits when we've profiled the container for large
# clusters, then set request = limit to keep this container in
# guaranteed class. Currently, this container falls into the
# "burstable" category so the kubelet doesn't backoff from restarting
# it.
limits:
cpu: 100m
memory: 200Mi
requests:
cpu: 100m
memory: 50Mi
args:
# command = "/skydns"
- -machines=http://127.0.0.1:4001
- -addr=0.0.0.0:53
- -ns-rotate=false
- -domain=cluster.local
ports:
- containerPort: 53
name: dns
protocol: UDP
- containerPort: 53
name: dns-tcp
protocol: TCP
- name: healthz
image: gcr.io/google_containers/exechealthz:1.0
resources:
# keep request = limit to keep this container in guaranteed class
limits:
cpu: 10m
memory: 20Mi
requests:
cpu: 10m
memory: 20Mi
args:
- -cmd=nslookup kubernetes.default.svc.cluster.local \
127.0.0.1 >/dev/null
- -port=8080
ports:
- containerPort: 8080
protocol: TCP
volumes:
- name: etcd-storage
emptyDir: {}
- name: kubernetes-etc
hostPath:
path: /etc/kubernetes
- name: etcd-ssl
hostPath:
path: /etc/ssl/etcd
dnsPolicy: Default # Don't use cluster DNS.
/etc/kubernetes/manifests/kube-proxy.yaml on master node
apiVersion: v1
kind: Pod
metadata:
name: kube-proxy
namespace: kube-system
spec:
hostNetwork: true
containers:
- name: kube-proxy
image: gcr.io/google_containers/hyperkube:v1.1.2
command:
- /hyperkube
- proxy
- --master=https://127.0.0.1:443
- --proxy-mode=iptables
- --kubeconfig=/etc/kubernetes/kube.conf
securityContext:
privileged: true
volumeMounts:
- mountPath: /etc/ssl/certs
name: ssl-certs-host
readOnly: true
- mountPath: /etc/kubernetes
name: kubernetes
readOnly: true
- mountPath: /etc/ssl/etcd
name: kubernetes-certs
readOnly: true
volumes:
- hostPath:
path: /usr/share/ca-certificates
name: ssl-certs-host
- hostPath:
path: /etc/kubernetes
name: kubernetes
- hostPath:
path: /etc/ssl/etcd
name: kubernetes-certs
/etc/kubernetes/manifests/kube-proxy.yaml on worker node
apiVersion: v1
kind: Pod
metadata:
name: kube-proxy
namespace: kube-system
spec:
hostNetwork: true
containers:
- name: kube-proxy
image: gcr.io/google_containers/hyperkube:v1.1.2
command:
- /hyperkube
- proxy
- --kubeconfig=/etc/kubernetes/kube.conf
- --proxy-mode=iptables
- --v=2
securityContext:
privileged: true
volumeMounts:
- mountPath: /etc/ssl/certs
name: "ssl-certs"
- mountPath: /etc/kubernetes/kube.conf
name: "kubeconfig"
readOnly: true
- mountPath: /etc/ssl/etcd
name: "etc-kube-ssl"
readOnly: true
volumes:
- name: "ssl-certs"
hostPath:
path: "/usr/share/ca-certificates"
- name: "kubeconfig"
hostPath:
path: "/etc/kubernetes/kube.conf"
- name: "etc-kube-ssl"
hostPath:
path: "/etc/ssl/etcd"