k8s - Using Prometheus with cAdvisor to monitor microservice/Pod data - amazon-web-services

I'm running the Prometheus Operator in a new Kubernetes cluster and I'm trying to get container details.
The Prometheus query dashboard doesn't return any container data. When I look at the targets I see the following:
Maybe it's caused by the roles, but I'm not sure since I'm new to this topic.
I also saw this issue:
https://github.com/coreos/prometheus-operator/issues/867
and I added the authentication-token-webhook flag, which doesn't help, but maybe I didn't add it in the right place...
Any idea what I'm missing here?
My operator.yml config looks like the following:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-operator
subjects:
- kind: ServiceAccount
  name: prometheus-operator
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus-operator
rules:
- apiGroups:
  - extensions
  resources:
  - thirdpartyresources
  verbs:
  - "*"
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - "*"
- apiGroups:
  - monitoring.coreos.com
  resources:
  - alertmanagers
  - prometheuses
  - prometheuses/finalizers
  - servicemonitors
  verbs:
  - "*"
- apiGroups:
  - apps
  resources:
  - statefulsets
  verbs: ["*"]
- apiGroups: [""]
  resources:
  - configmaps
  - secrets
  verbs: ["*"]
- apiGroups: [""]
  resources:
  - pods
  verbs: ["list", "delete"]
- apiGroups: [""]
  resources:
  - services
  - endpoints
  verbs: ["get", "create", "update"]
- apiGroups: [""]
  resources:
  - nodes
  verbs: ["list", "watch"]
- apiGroups: [""]
  resources:
  - namespaces
  verbs: ["list"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus-operator
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    k8s-app: prometheus-operator
  name: prometheus-operator
spec:
  replicas: 1
  template:
    metadata:
      labels:
        k8s-app: prometheus-operator
    spec:
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1
        - --authentication-token-webhook=true
        - --extra-config=kubelet.authentication-token-webhook=true
        image: quay.io/coreos/prometheus-operator:v0.17.0
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        resources:
          limits:
            cpu: 200m
            memory: 100Mi
          requests:
            cpu: 100m
            memory: 50Mi
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
      serviceAccountName: prometheus-operator
My RBAC config looks like the following:
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus-k8s
rules:
- apiGroups:
  - ""
  resources:
  - nodes/metrics
  verbs:
  - get
- nonResourceURLs:
  - /metrics
  verbs:
  - get
If some config file is missing please let me know and I'll add it.

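Add the following parameters to the kubelet configuration on each worker node:
--authentication-token-webhook=true
--authorization-mode=Webhook
(The --extra-config=kubelet.<flag>=<value> form is minikube syntax. With minikube, pass them as --extra-config=kubelet.authentication-token-webhook=true and --extra-config=kubelet.authorization-mode=Webhook.)
Then run the following commands:
systemctl daemon-reload
systemctl restart kubelet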
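With webhook authorization enabled, the kubelet checks each scrape request against the cluster's RBAC rules, and its /metrics and /metrics/cadvisor endpoints fall under the nodes/metrics resource. The question defines a prometheus-k8s ClusterRole that grants this, but shows no binding for it. A minimal sketch of such a binding follows. It assumes the Prometheus pods run as the prometheus ServiceAccount in the default namespace, as in the question's other binding; the binding name is made up for illustration.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-k8s        # hypothetical name, not taken from the question
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus-k8s        # the ClusterRole from the question that grants nodes/metrics
subjects:
- kind: ServiceAccount
  name: prometheus            # assumption: Prometheus pods run as this ServiceAccount
  namespace: default
```

If Prometheus runs under a different ServiceAccount or namespace, the subject needs to be adjusted to match.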

Related

K8s Cluster Autoscaler on Self-Managed Kubernetes setup on AWS: no node group config

I have a cluster with an autoscaling group and I am using Microk8s on the K8s nodes. I deployed the following K8s-cluster autoscaler:
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
resources: ["events", "endpoints"]
verbs: ["create", "patch"]
- apiGroups: [""]
resources: ["pods/eviction"]
verbs: ["create"]
- apiGroups: [""]
resources: ["pods/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["endpoints"]
resourceNames: ["cluster-autoscaler"]
verbs: ["get", "update"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["watch", "list", "get", "update"]
- apiGroups: [""]
resources:
- "namespaces"
- "pods"
- "services"
- "replicationcontrollers"
- "persistentvolumeclaims"
- "persistentvolumes"
verbs: ["watch", "list", "get"]
- apiGroups: ["extensions"]
resources: ["replicasets", "daemonsets"]
verbs: ["watch", "list", "get"]
- apiGroups: ["policy"]
resources: ["poddisruptionbudgets"]
verbs: ["watch", "list"]
- apiGroups: ["apps"]
resources: ["statefulsets", "replicasets", "daemonsets"]
verbs: ["watch", "list", "get"]
- apiGroups: ["storage.k8s.io"]
resources:
["storageclasses", "csinodes", "csidrivers", "csistoragecapacities"]
verbs: ["watch", "list", "get"]
- apiGroups: ["batch", "extensions"]
resources: ["jobs"]
verbs: ["get", "list", "watch", "patch"]
- apiGroups: ["coordination.k8s.io"]
resources: ["leases"]
verbs: ["create"]
- apiGroups: ["coordination.k8s.io"]
resourceNames: ["cluster-autoscaler"]
resources: ["leases"]
verbs: ["get", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["create", "list", "watch"]
- apiGroups: [""]
resources: ["configmaps"]
resourceNames:
["cluster-autoscaler-status", "cluster-autoscaler-priority-expander"]
verbs: ["delete", "get", "update", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
spec:
priorityClassName: system-cluster-critical
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
serviceAccountName: cluster-autoscaler
containers:
- image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0
name: cluster-autoscaler
env:
- name: AWS_DEFAULT_REGION
value: eu-west-1
- name: AWS_REGION
value: eu-west-1
resources:
limits:
cpu: 100m
memory: 600Mi
requests:
cpu: 100m
memory: 600Mi
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/GP-ARM-ASG
volumeMounts:
- name: ssl-certs
mountPath: /etc/ssl/certs/ca-certificates.crt
readOnly: true
imagePullPolicy: "Always"
volumes:
- name: ssl-certs
hostPath:
path: "/etc/ssl/certs/ca-certificates.crt"
And I get this error: Node ip-10-0-0-78 should not be processed by cluster autoscaler (no node group config)
But the node in question has these EC2 tags:
1. k8s.io/cluster-autoscaler/enabled: true
2. k8s.io/cluster-autoscaler/GP-ARM-ASG: owned
3. k8s.io/cluster-autoscaler/node-template/label/type: gp
4. kubernetes.io/cluster/GP-ARM-ASG: owned
I added tag 3 because the autoscaler needs a hint that these nodes will have different specs, so I am using the label type=gp for that.
I also added tag 4 because of this AWS EKS documentation, hoping that it would solve the 'no node group config' error.
Ten minutes after a node joins the cluster it still isn't scaled down (the desired capacity of the ASG doesn't change), and I think this error is the reason.
The ASG has the tags and is set to propagate them on launch, so each individual EC2 instance has the tags as well.
Can anyone provide some help on this?

Creating sidecar Metricbeat with AWS EKS Fargate

I'm trying to create a deployment on AWS EKS with my application and metricbeat as sidecar, so I have the following YML:
---
apiVersion: v1
kind: ConfigMap
metadata:
name: metricbeat-modules
namespace: testframework
labels:
k8s-app: metricbeat
data:
kubernetes.yml: |-
- module: kubernetes
metricsets:
- node
- system
- pod
- container
- volume
period: 10s
host: ${NODE_NAME}
hosts: [ "https://${NODE_IP}:10250" ]
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
ssl.verification_mode: "none"
---
apiVersion: v1
kind: ConfigMap
metadata:
name: metricbeat-config
namespace: testframework
labels:
k8s-app: metricbeat
data:
metricbeat.yml: |-
processors:
- add_cloud_metadata:
- add_tags:
tags: ["EKSCORP_DEV"]
target: "cluster_test"
metricbeat.config.modules:
path: ${path.config}/modules.d/*.yml
reload.enabled: false
output.elasticsearch:
index: "metricbeat-k8s-%{[agent.version]}-%{+yyyy.MM.dd}"
setup.template.name: "metricbeat-k8s"
setup.template.pattern: "metricbeat-k8s-*"
setup.ilm.enabled: false
cloud.id: ${ELASTIC_CLOUD_ID}
cloud.auth: ${ELASTIC_CLOUD_AUTH}
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: testframework-initializr-deploy
namespace: testframework
spec:
replicas: 1
selector:
matchLabels:
app: testframework-initializr
template:
metadata:
labels:
app: testframework-initializr
annotations:
co.elastic.logs/enabled: 'true'
co.elastic.logs/json.keys_under_root: 'true'
co.elastic.logs/json.add_error_key: 'true'
co.elastic.logs/json.message_key: 'message'
spec:
containers:
- name: testframework-initializr
image: XXXXX.dkr.ecr.us-east-1.amazonaws.com/testframework-initializr
ports:
- containerPort: 8080
livenessProbe:
httpGet:
path: /health/liveness
port: 8080
initialDelaySeconds: 300
periodSeconds: 10
timeoutSeconds: 60
failureThreshold: 5
readinessProbe:
httpGet:
port: 8080
path: /health
initialDelaySeconds: 300
periodSeconds: 10
timeoutSeconds: 10
failureThreshold: 3
- name: metricbeat-sidecar
image: docker.elastic.co/beats/metricbeat:7.12.0
args: [
"-c", "/etc/metricbeat.yml",
"-e",
"-system.hostfs=/hostfs"
]
env:
- name: ELASTIC_CLOUD_ID
value: xxxx
- name: ELASTIC_CLOUD_AUTH
value: xxxx
- name: NODE_NAME
valueFrom:
fieldRef:
fieldPath: spec.nodeName
- name: NODE_IP
valueFrom:
fieldRef:
fieldPath: status.podIP
securityContext:
runAsUser: 0
volumeMounts:
- name: config
mountPath: /etc/metricbeat.yml
readOnly: true
subPath: metricbeat.yml
- name: modules
mountPath: /usr/share/metricbeat/modules.d
readOnly: true
volumes:
- name: config
configMap:
defaultMode: 0640
name: metricbeat-config
- name: modules
configMap:
defaultMode: 0640
name: metricbeat-modules
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: prom-admin
rules:
- apiGroups: [""]
resources: ["pods", "nodes"]
verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: prom-rbac
subjects:
- kind: ServiceAccount
name: default
namespace: testframework
roleRef:
kind: ClusterRole
name: prom-admin
apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Service
metadata:
name: testframework-initializr-service
namespace: testframework
spec:
type: NodePort
ports:
- port: 80
targetPort: 8080
selector:
app: testframework-initializr
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
name: testframework-initializr-ingress
annotations:
kubernetes.io/ingress.class: alb
alb.ingress.kubernetes.io/scheme: internal
alb.ingress.kubernetes.io/target-type: ip
spec:
rules:
- host: dev-initializr.test.net
http:
paths:
- backend:
serviceName: testframework-initializr-service
servicePort: 80
After starting the Pod in AWS EKS, I got the following error in the Metricbeat container:
INFO module/wrapper.go:259 Error fetching data for metricset kubernetes.system: error doing HTTP request to fetch 'system' Metricset data: error making http request: Get "https://IP_FROM_FARGATE_HERE:10250/stats/summary": dial tcp IP_FROM_FARGATE_HERE:10250: connect: connection refused
I tried using "NODE_NAME" instead of "NODE_IP", but then I got "No Such Host". Any idea how I can fix it?

External DNS - All records are already up to date, there are no changes for the matching hosted zones

I deployed ExternalDNS on my cluster, but no records are getting created for the ALB endpoints. The logs show "Skipping record because no hosted zone matching record DNS Name was detected" and "All records are already up to date, there are no changes for the matching hosted zones".
Here is my ExternalDNS manifest. I followed this [tutorial][1]:
apiVersion: v1
kind: ServiceAccount
metadata:
name: external-dns
namespace: test
# If you're using Amazon EKS with IAM Roles for Service Accounts, specify the following annotation.
# Otherwise, you may safely omit it.
annotations:
# Substitute your account ID and IAM service role name below.
eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXX:role/ExternalDNSRole
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: external-dns
rules:
- apiGroups: [""]
resources: ["services", "endpoints", "pods"]
verbs: ["get", "watch", "list"]
- apiGroups: ["extensions", "networking.k8s.io"]
resources: ["ingresses"]
verbs: ["get", "watch", "list"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: external-dns-viewer
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: external-dns
subjects:
- kind: ServiceAccount
name: external-dns
magento: test
---
apiVersion: apps/v1
kind: Deployment
metadata:
name: external-dns
magento: test
spec:
strategy:
type: Recreate
selector:
matchLabels:
app: external-dns
template:
metadata:
labels:
app: external-dns
spec:
serviceAccountName: external-dns
containers:
- name: external-dns
image: k8s.gcr.io/external-dns/external-dns:v0.7.3
args:
- --source=service
- --source=ingress
- --domain-filter=test.cloud
- --provider=aws
- --aws-prefer-cname
# - --policy=upsert-only # would prevent ExternalDNS from deleting any records, omit to enable full synchronization
- --aws-zone-type=public # only look at public hosted zones (valid values are public, private or no value for both)
- --registry=txt
- --txt-owner-id=XXXXX
- --txt-prefix={{ test-frontend. }}
- --log-level=debug
resources:
limits:
cpu: 10m
memory: 128Mi
requests:
cpu: 10m
memory: 128Mi
securityContext:
fsGroup: 65534 # For ExternalDNS to be able to read Kubernetes and AWS token files
[1]: https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/aws.md
The following is my Service manifest:
apiVersion: v1
kind: Service
metadata:
name: "test-web"
namespace: magento
annotations:
external-dns.alpha.kubernetes.io/hostname: test-frontend.test.cloud
labels:
app: test-web
k8s-app: test
spec:
ports:
- name: "http"
port: 80
protocol: TCP
targetPort: 80
type: NodePort
selector:
app: test-web
And this is my ingress manifest:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
name: main
namespace: magento
annotations:
kubernetes.io/ingress.class: alb
external-dns.alpha.kubernetes.io/hostname: test.cloud
alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-2:342366666223132:certificate/aac2312b13231213a03-a2d3123123b-433312312324f-b2f9-058ca1213951f30
alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": {"Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
alb.ingress.kubernetes.io/scheme: internet-facing
labels:
app: test-web
spec:
rules:
- host: test-frontend.test.cloud
- http:
paths:
- path: /*
backend:
serviceName: magento-web
servicePort: 80

RabbitMQ only shows one node

I have been trying to set up RabbitMQ on a k8s cluster. I finally got everything set up, but only one node shows up in the management UI. Here are my steps:
1. Dockerfile Setup
I do this to enable autocluster:
FROM rabbitmq:3.8-rc-management-alpine
MAINTAINER kevlai
RUN rabbitmq-plugins --offline enable rabbitmq_peer_discovery_k8s
2. Set up RBAC
apiVersion: v1
kind: ServiceAccount
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - get
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dev
subjects:
- kind: ServiceAccount
  name: borecast-rabbitmq
  namespace: borecast-production
3. Set up Secrets
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-secret
  namespace: borecast-production
type: Opaque
data:
  username: a2V2
  password: Ym9yZWNhc3RydWx6
  secretCookie: c2VjcmV0Y29va2llaGVyZQ==
4. Set up StorageClass
I'm setting up a StorageClass so k8s will automatically provision volumes for me on AWS.
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: rabbitmq-sc
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-2a
reclaimPolicy: Retain
5. Set up StatefulSets and Services
You can see there are two services. The headless service is for the pods themselves. As for the management service, I'll expose the service for an Ingress controller in order for it to be accessible from outside.
---
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-management-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  ports:
  - port: 15672
    targetPort: 15672
    name: http
  - port: 5672
    targetPort: 5672
    name: amqp
  selector:
    app: borecast-rabbitmq
---
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  clusterIP: None
  ports:
  - port: 5672
    name: amqp
  selector:
    app: borecast-rabbitmq
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
spec:
  serviceName: borecast-rabbitmq-service
  replicas: 3
  template:
    metadata:
      labels:
        app: borecast-rabbitmq
    spec:
      serviceAccountName: borecast-rabbitmq
      containers:
      - image: docker.borecast.com/borecast-rabbitmq:v1.0.3
        name: borecast-rabbitmq
        imagePullPolicy: Always
        resources:
          requests:
            memory: "256Mi"
            cpu: "150m"
          limits:
            memory: "512Mi"
            cpu: "250m"
        ports:
        - containerPort: 5672
          name: amqp
        env:
        - name: RABBITMQ_DEFAULT_USER
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: username
        - name: RABBITMQ_DEFAULT_PASS
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: password
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: secretCookie
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: K8S_SERVICE_NAME
          # value: borecast-rabbitmq-service.borecast-production.svc.cluster.local
          value: borecast-rabbitmq-service
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_NODENAME
          value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME)"
          # value: rabbit@$(MY_POD_NAME).borecast-rabbitmq-service.borecast-production.svc.cluster.local
        - name: RABBITMQ_NODE_TYPE
          value: disc
        - name: AUTOCLUSTER_TYPE
          value: "k8s"
        - name: AUTOCLUSTER_DELAY
          value: "10"
        - name: AUTOCLUSTER_CLEANUP
          value: "true"
        - name: CLEANUP_WARN_ONLY
          value: "false"
        - name: K8S_ADDRESS_TYPE
          value: "hostname"
        - name: K8S_HOSTNAME_SUFFIX
          value: ".$(K8S_SERVICE_NAME)"
          # value: .borecast-rabbitmq-service.borecast-production.svc.cluster.local
        volumeMounts:
        - name: rabbitmq-volume
          mountPath: /var/lib/rabbitmq
      imagePullSecrets:
      - name: regcred
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq-volume
      namespace: borecast-production
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: rabbitmq-sc
      resources:
        requests:
          storage: 5Gi
Problem
Everything is working. However, when I access the management UI (i.e. the borecast-rabbitmq-management-service on port 15672), I only see one node showing up, when there should be three:
Also notice that the cluster name is
rabbit@borecast-rabbitmq-0.borecast-rabbitmq-service.borecast-production.svc.cluster.local
but when I log out and log in again, the 0 in borecast-rabbitmq-0 sometimes changes to 1 or 2.
And also notice the node name is
rabbit@borecast-rabbitmq-1.borecast-rabbitmq-service
And you guessed it, sometimes the number is 2 or 0 instead of the 1 in borecast-rabbitmq-1.
I have been trying to debug this, but to no avail. The logs for each pod don't raise any suspicions, and every Service and StatefulSet is working normally. I repeated the five steps multiple times, and if your cluster is on AWS, you can replicate my setup by following them (after creating the namespace borecast-production, of course). If anybody can shed some light on the matter, I'll be eternally grateful.

The problem is with the headless service name definition:
- name: K8S_SERVICE_NAME
  # value: borecast-rabbitmq-service.borecast-production.svc.cluster.local
  value: borecast-rabbitmq-service
which is a building block of the node name:
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME)"
whereas the proper node name should be the FQDN of the Pod (<statefulset name>-<ordinal index>.<headless_svc_name>.<namespace>.svc.cluster.local):
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
Therefore you ended up with the node name
borecast-rabbitmq-1.borecast-rabbitmq-service
instead of:
borecast-rabbitmq-1.borecast-rabbitmq-service.borecast-production.svc.cluster.local
Look up the FQDN of the Pods created by the borecast-rabbitmq StatefulSet (in other words, the SRV records of the Pods) with the nslookup utility from inside your cluster to see what form RABBITMQ_NODENAME is expected to have.
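The suggested node name references $(MY_POD_NAMESPACE), which the StatefulSet in the question does not define. A minimal sketch of the env entries that would make it resolve, assuming the default cluster domain cluster.local and assuming the namespace is injected through the downward API (the MY_POD_NAMESPACE entry is an addition, not part of the original manifest):

```yaml
env:
- name: MY_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
# Assumption: not present in the question's manifest; filled from the pod's own metadata.
- name: MY_POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: K8S_SERVICE_NAME
  value: borecast-rabbitmq-service
- name: RABBITMQ_USE_LONGNAME
  value: "true"
# $(VAR) references expand only for variables declared earlier in this list.
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
# Kept consistent with the node name, matching the commented-out value in the question.
- name: K8S_HOSTNAME_SUFFIX
  value: ".$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
```

Because Kubernetes expands $(VAR) only from previously defined variables, MY_POD_NAME, MY_POD_NAMESPACE and K8S_SERVICE_NAME have to appear before the entries that use them.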

Try exposing port 4369 on the headless service; see the port access section of
https://www.rabbitmq.com/clustering.html
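As a rough illustration of that suggestion, the headless Service from the question could list the clustering ports alongside AMQP. This is only a sketch: the port names epmd and cluster-rpc are made up here, and port 25672 is added on the assumption that the default inter-node port is in use. Since a headless Service resolves straight to pod IPs, the port list is largely declarative and may not change behavior on its own.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  clusterIP: None
  ports:
  - port: 5672
    name: amqp
  - port: 4369          # epmd, the peer discovery daemon RabbitMQ nodes use to find each other
    name: epmd          # hypothetical port name
  - port: 25672         # default inter-node communication port (AMQP port + 20000)
    name: cluster-rpc   # hypothetical port name
  selector:
    app: borecast-rabbitmq
```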

I had the same issue, and it came down to:
1. Deleting all the RabbitMQ resources, including the PVCs created by the StatefulSet.
2. Reinstalling everything from the manifests.

Kubernetes Autoscaler on AWS not working

I am trying to set up the Kubernetes cluster autoscaler on AWS as described in the docs, but I am getting this error in my cluster-autoscaler pod logs:
E0411 09:23:25.529212 1 static_autoscaler.go:118] Failed to update node registry: RequestError: send request failed caused by: Post https://autoscaling.us-west-2a.amazonaws.com/: dial tcp: lookup autoscaling.us-west-2a.amazonaws.com on 10.96.0.10:53: no such host
Context:
I've created an AWS Auto Scaling group named KubeAutoscale from a launch configuration that uses my custom AMI, which has Ubuntu Server 16.04 LTS (HVM), Docker and Kubernetes installed (just a raw install).
In the Auto Scaling group I set a minimum of 2 and a maximum of 5 instances (they are in the us-west-2a region). I logged in to one of the two instances and set up the Kubernetes cluster, logged in to the other instance and joined it to the cluster, then logged in to the master (first) instance again and ran the autoscaler with this configuration:
---
apiVersion: v1
kind: ServiceAccount
metadata:
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
resources: ["events","endpoints"]
verbs: ["create", "patch"]
- apiGroups: [""]
resources: ["pods/eviction"]
verbs: ["create"]
- apiGroups: [""]
resources: ["pods/status"]
verbs: ["update"]
- apiGroups: [""]
resources: ["endpoints"]
resourceNames: ["cluster-autoscaler"]
verbs: ["get","update"]
- apiGroups: [""]
resources: ["nodes"]
verbs: ["watch","list","get","update"]
- apiGroups: [""]
resources: ["pods","services","replicationcontrollers","persistentvolumeclaims","persistentvolumes"]
verbs: ["watch","list","get"]
- apiGroups: ["extensions"]
resources: ["replicasets","daemonsets"]
verbs: ["watch","list","get"]
- apiGroups: ["policy"]
resources: ["poddisruptionbudgets"]
verbs: ["watch","list"]
- apiGroups: ["apps"]
resources: ["statefulsets"]
verbs: ["watch","list","get"]
- apiGroups: ["storage.k8s.io"]
resources: ["storageclasses"]
verbs: ["watch","list","get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: Role
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
rules:
- apiGroups: [""]
resources: ["configmaps"]
verbs: ["create"]
- apiGroups: [""]
resources: ["configmaps"]
resourceNames: ["cluster-autoscaler-status"]
verbs: ["delete","get","update"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
name: cluster-autoscaler
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: RoleBinding
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
k8s-addon: cluster-autoscaler.addons.k8s.io
k8s-app: cluster-autoscaler
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: cluster-autoscaler
subjects:
- kind: ServiceAccount
name: cluster-autoscaler
namespace: kube-system
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
labels:
app: cluster-autoscaler
spec:
replicas: 1
selector:
matchLabels:
app: cluster-autoscaler
template:
metadata:
labels:
app: cluster-autoscaler
spec:
serviceAccountName: cluster-autoscaler
containers:
- image: k8s.gcr.io/cluster-autoscaler:v0.6.0
name: cluster-autoscaler
resources:
limits:
cpu: 100m
memory: 300Mi
requests:
cpu: 100m
memory: 300Mi
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --nodes=2:5:KubeAutoscale
env:
- name: AWS_REGION
value: us-west-2a
volumeMounts:
- name: ssl-certs
mountPath: /etc/ssl/certs/ca-certificates.crt
readOnly: true
imagePullPolicy: "Always"
volumes:
- name: ssl-certs
hostPath:
path: "/etc/ssl/certs/ca-certificates.crt"

You have a configuration issue:
env:
- name: AWS_REGION
  value: us-west-2a
Your AWS region is us-west-2; us-west-2a is an availability zone within it. That's why, when the autoscaler builds the URL of the Auto Scaling endpoint, the result is https://autoscaling.us-west-2a.amazonaws.com/ instead of the correct https://autoscaling.us-west-2.amazonaws.com/.
To fix it, set AWS_REGION to us-west-2 instead of us-west-2a.
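Applied to the Deployment from the question, the corrected container env would look roughly like this (only the value changes; everything else is as posted):

```yaml
env:
- name: AWS_REGION
  value: us-west-2   # the region; us-west-2a is an availability zone inside it
```

After re-applying the manifest, the endpoint in the log line should read autoscaling.us-west-2.amazonaws.com.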