kubectl - cert manager - credentials not found - google-cloud-platform

I want to enable TLS termination on an ingress (on top of Kubernetes) on Google Cloud Platform.
My ingress cluster is working, but my cert manager is failing with this error message:
textPayload: "2018/07/05 22:04:00 Error while processing certificate during sync: Error while creating ACME client for 'domain': Error while initializing challenge provider googlecloud: Unable to get Google Cloud client: google: error getting credentials using GOOGLE_APPLICATION_CREDENTIALS environment variable: open /opt/google/kube-cert-manager.json: no such file or directory"
This is what I did to get to the current state:
Created the cluster, deployment, service, and ingress.
Executed:
gcloud --project 'project' iam service-accounts create kube-cert-manager-sv-security --display-name "kube-cert-manager-sv-security"
gcloud --project 'project' iam service-accounts keys create ~/.config/gcloud/kube-cert-manager-sv-security.json --iam-account kube-cert-manager-sv-security@'project'.iam.gserviceaccount.com
gcloud --project 'project' projects add-iam-policy-binding --member serviceAccount:kube-cert-manager-sv-security@'project'.iam.gserviceaccount.com --role roles/dns.admin
kubectl create secret generic kube-cert-manager-sv-security-secret --from-file=/home/perre/.config/gcloud/kube-cert-manager-sv-security.json
and created the following resources:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: kube-cert-manager-sv-security-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-cert-manager-sv-security
  namespace: default
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: kube-cert-manager-sv-security
rules:
  - apiGroups: ["*"]
    resources: ["certificates", "ingresses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["*"]
    resources: ["secrets"]
    verbs: ["get", "list", "create", "update", "delete"]
  - apiGroups: ["*"]
    resources: ["events"]
    verbs: ["create"]
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: kube-cert-manager-sv-security-service-account
subjects:
  - kind: ServiceAccount
    namespace: default
    name: kube-cert-manager-sv-security
roleRef:
  kind: ClusterRole
  name: kube-cert-manager-sv-security
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: certificates.stable.k8s.psg.io
spec:
  scope: Namespaced
  group: stable.k8s.psg.io
  version: v1
  names:
    kind: Certificate
    plural: certificates
    singular: certificate
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: kube-cert-manager-sv-security
  name: kube-cert-manager-sv-security
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: kube-cert-manager-sv-security
      name: kube-cert-manager-sv-security
    spec:
      serviceAccount: kube-cert-manager-sv-security
      containers:
        - name: kube-cert-manager
          env:
            - name: GCE_PROJECT
              value: solidair-vlaanderen-207315
            - name: GOOGLE_APPLICATION_CREDENTIALS
              value: /opt/google/kube-cert-manager.json
          image: bcawthra/kube-cert-manager:2017-12-10
          args:
            - "-data-dir=/var/lib/cert-manager-sv-security"
            #- "-acme-url=https://acme-staging.api.letsencrypt.org/directory"
            # NOTE: the URL above points to the staging server, where you won't get real certs.
            # Uncomment the line below to use the production LetsEncrypt server:
            - "-acme-url=https://acme-v01.api.letsencrypt.org/directory"
            # You can run multiple instances of kube-cert-manager for the same namespace(s),
            # each watching for a different value for the 'class' label
            - "-class=kube-cert-manager"
            # You can choose to monitor only some namespaces, otherwise all namespaces will be monitored
            #- "-namespaces=default,test"
            # If you set a default email, you can omit the field/annotation from Certificates/Ingresses
            - "-default-email=viae.it@gmail.com"
            # If you set a default provider, you can omit the field/annotation from Certificates/Ingresses
            - "-default-provider=googlecloud"
          volumeMounts:
            - name: data-sv-security
              mountPath: /var/lib/cert-manager-sv-security
            - name: google-application-credentials
              mountPath: /opt/google
      volumes:
        - name: data-sv-security
          persistentVolumeClaim:
            claimName: kube-cert-manager-sv-security-data
        - name: google-application-credentials
          secret:
            secretName: kube-cert-manager-sv-security-secret
Does anyone know what I'm missing?

Your Secret kube-cert-manager-sv-security-secret probably contains a JSON file named kube-cert-manager-sv-security.json, which does not match the path set in GOOGLE_APPLICATION_CREDENTIALS. You can check the file name stored in the Secret with kubectl get secret YOUR-SECRET-NAME -o yaml.
If you change the path to the actual file name, cert-manager should be able to find the credentials:
- name: GOOGLE_APPLICATION_CREDENTIALS
  # value: /opt/google/kube-cert-manager.json
  value: /opt/google/kube-cert-manager-sv-security.json
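If you would rather leave GOOGLE_APPLICATION_CREDENTIALS as it is, another option is to remap the Secret key to the expected file name when mounting the volume. This is only a sketch: it assumes the key in the Secret is kube-cert-manager-sv-security.json (the default when the Secret is created with --from-file and no explicit key), so check the key name first.

```yaml
# Sketch: replaces the google-application-credentials volume in the Deployment.
# Assumption: the Secret key is kube-cert-manager-sv-security.json.
volumes:
  - name: google-application-credentials
    secret:
      secretName: kube-cert-manager-sv-security-secret
      items:
        - key: kube-cert-manager-sv-security.json   # key as stored in the Secret
          path: kube-cert-manager.json              # file name created under /opt/google
```

With this mapping the file appears as /opt/google/kube-cert-manager.json, which is the path the container already expects.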

Related

CloudWatch agent with a role instead of access key and secret key

I got this deployment from the internet and tested it with great results. My question is: are there config parameters that I can use to pass a role ARN instead of an access key and secret key? I tried passing a role ARN in various forms inside aws-credentials, but to no avail.
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cwagent-prometheus
  namespace: amazon-cloudwatch
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cwagent-prometheus
  template:
    metadata:
      labels:
        app: cwagent-prometheus
    spec:
      containers:
        - name: cloudwatch-agent
          image: amazon/cloudwatch-agent:1.247348.0b251302
          imagePullPolicy: Always
          env:
            - name: CI_VERSION
              value: "k8s/1.3.8"
          volumeMounts:
            - name: prometheus-cwagentconfig
              mountPath: /etc/cwagentconfig
            - name: prometheus-config
              mountPath: /etc/prometheusconfig
            - name: aws-credentials
              mountPath: /root/.aws
      volumes:
        - name: prometheus-cwagentconfig
          configMap:
            name: prometheus-cwagentconfig
        - name: prometheus-config
          configMap:
            name: prometheus-config
        - name: aws-credentials
          secret:
            secretName: aws-credentials
      serviceAccountName: cwagent-prometheus
The typical working solution is to provide aws-credentials with the format:
[AmazonCloudWatchAgent]
aws_access_key_id = $AWS_ID
aws_secret_access_key = $AWS_KEY
For instance, I tried changing it to:
[AmazonCloudWatchAgent]
role_arn = $ROLE_ARN
With this change, the CloudWatch agent complains that it cannot find aws_access_key_id in the credentials.
This is a known issue and has not been resolved yet.
Use IAM Roles for Service Accounts issue on amazon-cloudwatch-agent
aws-helm-eks-charts issue

Unable to connect to AWS services from EKS

I have created an EKS cluster using eksctl. I am following these steps to establish connectivity to AWS services such as S3 and CloudWatch from a Spring Boot app:
1. Create the EKS cluster using eksctl. The config has my service account details and OIDC enabled.
2. List the service accounts to check that they were created correctly.
3. Create a deployment using the service account name.
4. Create a service.
I am seeing a 403 in the logs:
User: arn:aws:sts:xxx/xxxx is not authorized to perform:
cloudformation:DescribeStackResources because no identity-based policy allows
the cloudformation:DescribeStackResources action (Service: AmazonCloudFormation; Status Code: 403;
Error Code: AccessDenied; Request ID: xxxx)
Can I get some help here to troubleshoot this issue, please?
What I have figured out since posting this question is that the node provisioned by eksctl has a role applied to it, and this is the role my app is picking up through the default credential chain.
What I still haven't figured out is how to enable the apps in the pod to assume the service account role.
YAML for #1
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: name
  region: ap-south-1
availabilityZones: ["xxxx", "xxxx", "xxxx"]
managedNodeGroups:
  - name: c5large-nodes
    desiredCapacity: 1
    instanceType: c5.large
    labels:
      node-type: large
    volumeSize: 5
cloudWatch:
  clusterLogging:
    enableTypes: [ "*" ]
iam:
  withOIDC: true
  serviceAccounts:
    - metadata:
        name: cluster-autoscaler
        namespace: kube-system
        labels: {aws-usage: "autoscaling"}
      wellKnownPolicies:
        autoScaler: true
      roleName: eksctl-cluster-autoscaler-role
      roleOnly: true
    - metadata:
        name: backend-stage-iam-role
        namespace: backend-stage
        labels: { aws-usage: "all-backend-allow" }
      attachPolicyARNs:
        - "arn:aws:iam::xxxx"
    - metadata:
        name: mq-access
        namespace: backend-stage
        labels: { aws-usage: "MQ" }
      attachPolicyARNs:
        - "arn:aws:iam::aws:policy/AmazonMQFullAccess"
YAML for deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  namespace: backend-stage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: backend-stage-iam-role
      containers:
        - image: xxx/my-app:latest
          imagePullPolicy: Always
          name: my-app
          ports:
            - containerPort: 8080
              protocol: TCP
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: stage
YAML for service
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: backend-stage
spec:
  selector:
    app: my-app
  type: LoadBalancer
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
The role is defined like this for now:
- Effect: Allow
  Action:
    - cloudformation:*
  Resource: "*"
I did some further debugging: when I describe the pod, I can see the role passed as an environment variable:
AWS_ROLE_ARN: arn:aws:iam::MYACCOUNT:role/MyRole
Just add the missing permission to the arn:aws:sts:xxx/xxxx assumed role.
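For illustration, a statement granting only the action named in the error, written in the same format as the role definition in the question, might look like the following. This is a sketch: it has to be attached to whichever role the arn:aws:sts identity in the log actually resolves to, and Resource: "*" is an assumption carried over from the question.

```yaml
# Sketch: policy statement for the role that appears in the 403 message.
# Assumption: Resource "*" as in the question; narrow it to specific stacks if possible.
- Effect: Allow
  Action:
    - cloudformation:DescribeStackResources
  Resource: "*"
```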

External DNS - All records are already up to date, there are no changes for the matching hosted zones

I deployed ExternalDNS on my cluster, but no records are being created for the ALB endpoints. The logs show "Skipping record because no hosted zone matching record DNS Name was detected" and "All records are already up to date, there are no changes for the matching hosted zones".
Here is my ExternalDNS manifest (I followed this [tutorial][1]):
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: test
  # If you're using Amazon EKS with IAM Roles for Service Accounts, specify the following annotation.
  # Otherwise, you may safely omit it.
  annotations:
    # Substitute your account ID and IAM service role name below.
    eks.amazonaws.com/role-arn: arn:aws:iam::XXXXXXXXX:role/ExternalDNSRole
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: external-dns
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints", "pods"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["extensions", "networking.k8s.io"]
    resources: ["ingresses"]
    verbs: ["get", "watch", "list"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
  - kind: ServiceAccount
    name: external-dns
    magento: test
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  magento: test
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
        - name: external-dns
          image: k8s.gcr.io/external-dns/external-dns:v0.7.3
          args:
            - --source=service
            - --source=ingress
            - --domain-filter=test.cloud
            - --provider=aws
            - --aws-prefer-cname
            # - --policy=upsert-only # would prevent ExternalDNS from deleting any records, omit to enable full synchronization
            - --aws-zone-type=public # only look at public hosted zones (valid values are public, private or no value for both)
            - --registry=txt
            - --txt-owner-id=XXXXX
            - --txt-prefix={{ test-frontend. }}
            - --log-level=debug
          resources:
            limits:
              cpu: 10m
              memory: 128Mi
            requests:
              cpu: 10m
              memory: 128Mi
      securityContext:
        fsGroup: 65534 # For ExternalDNS to be able to read Kubernetes and AWS token files
[1]: https://github.com/kubernetes-sigs/external-dns/blob/master/docs/tutorials/aws.md
The following is the service manifest:
apiVersion: v1
kind: Service
metadata:
  name: "test-web"
  namespace: magento
  annotations:
    external-dns.alpha.kubernetes.io/hostname: test-frontend.test.cloud
  labels:
    app: test-web
    k8s-app: test
spec:
  ports:
    - name: "http"
      port: 80
      protocol: TCP
      targetPort: 80
  type: NodePort
  selector:
    app: test-web
And this is my ingress manifest:
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: main
  namespace: magento
  annotations:
    kubernetes.io/ingress.class: alb
    external-dns.alpha.kubernetes.io/hostname: test.cloud
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-2:342366666223132:certificate/aac2312b13231213a03-a2d3123123b-433312312324f-b2f9-058ca1213951f30
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
    alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": {"Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
    alb.ingress.kubernetes.io/scheme: internet-facing
  labels:
    app: test-web
spec:
  rules:
    - host: test-frontend.test.cloud
    - http:
        paths:
          - path: /*
            backend:
              serviceName: magento-web
              servicePort: 80

Read-only user gets full access

The aim is to create a read-only user for the production namespace of my EKS cluster. However, I see that the user has full access. As I am new to EKS and Kubernetes, please let me know what mistake I have made.
I have created an IAM user without any permissions attached. Its ARN is arn:aws:iam::460764xxxxxx:user/eks-prod-readonly-user. I have also noted down the access key ID and secret access key:
aws_access_key_id= AKIAWWxxxxxxxxxxxxx
aws_secret_access_key= lAbtyv3zlbAMPxym8Jktal+xxxxxxxxxxxxxxxxxxxx
Then, I have created the production namespace, role, and role binding as follows –
ubuntu@ip-172-16-0-252:~$ sudo kubectl create namespace production
ubuntu@ip-172-16-0-252:~$ cat role.yaml
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: production
  name: prod-viewer-role
rules:
  - apiGroups: ["", "extensions", "apps"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
ubuntu@ip-172-16-0-252:~$ sudo kubectl apply -f role.yaml
ubuntu@ip-172-16-0-252:~$ cat rolebinding.yaml
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: prod-viewer-binding
  namespace: production
subjects:
  - kind: User
    name: eks-prod-readonly-user
    apiGroup: ""
roleRef:
  kind: Role
  name: prod-viewer-role
  apiGroup: ""
ubuntu@ip-172-16-0-252:~$ sudo kubectl apply -f rolebinding.yaml
Then, I added the newly created user to the aws-auth ConfigMap and applied the changes:
ubuntu@ip-172-16-0-252:~$ sudo kubectl -n kube-system get configmap aws-auth -o yaml > aws-auth-configmap.yaml
ubuntu@ip-172-16-0-252:~$ vi aws-auth-configmap.yaml
The following section was added under ‘mapUsers’:
- userarn: arn:aws:iam::460764xxxxxx:user/eks-prod-readonly-user
  username: eks-prod-readonly-user
  groups:
    - prod-viewer-role
ubuntu@ip-172-16-0-252:~$ sudo kubectl apply -f aws-auth-configmap.yaml
Next, I added this user's details as a new section in the AWS credentials file (~/.aws/credentials) so that the user can be authenticated to the Kubernetes API server:
[eksprodreadonlyuser]
aws_access_key_id= AKIAWWxxxxxxxxxxxxx
aws_secret_access_key= lAbtyv3zlbAMPxym8Jktal+xxxxxxxxxxxxxxxxxxxx
region=eu-west-2
output=json
I activate this AWS profile:
ubuntu@ip-172-16-0-252:~$ export AWS_PROFILE="eksprodreadonlyuser"
ubuntu@ip-172-16-0-252:~$ aws sts get-caller-identity
The output of get-caller-identity shows the correct user ARN.
Listing pods in the default namespace works, although it should not, because the user was given access to the production namespace only:
ubuntu@ip-172-16-0-252:~$ sudo kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
test-autoscaler-697b95d8b-5wl5d   1/1     Running   0          7d20h
ubuntu@ip-172-16-0-252:~$
Please let me know any pointers to resolve this. Thanks in advance!
Please try first to export all your credentials to the terminal as environment variables instead of using profiles:
export AWS_ACCESS_KEY_ID=XXX
export AWS_SECRET_ACCESS_KEY=XXX
export AWS_DEFAULT_REGION=us-east-2
This is just for debugging and making sure that the problem is not in your configuration.
If this doesn't work - try using the configuration below.
ClusterRoleBinding and ClusterRole:
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: eks-ro-user-binding
subjects:
  - kind: User
    name: eks-ro-user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: eks-ro-user-cluster-role
  apiGroup: rbac.authorization.k8s.io
---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: eks-ro-user-cluster-role
rules:
  - apiGroups:
      - ""
    resources:
      - '*'
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - extensions
    resources:
      - '*'
    verbs:
      - get
      - list
      - watch
  - apiGroups:
      - apps
    resources:
      - '*'
    verbs:
      - get
      - list
      - watch
AWS auth config map (after you created an IAM user):
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::<account-id>:role/eks-node-group-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
  mapUsers: |
    - userarn: arn:aws:iam::<account-id>:user/eks-ro-user
      username: eks-ro-user
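Note that a ClusterRoleBinding grants the read access in every namespace. If access should be limited to the production namespace, as the question asks, one possible variant is to bind the same ClusterRole with a namespaced RoleBinding instead. This is a sketch that reuses the eks-ro-user and eks-ro-user-cluster-role names from the configuration above; the binding name is hypothetical.

```yaml
# Sketch: use in place of the ClusterRoleBinding to restrict access to one namespace.
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: eks-ro-user-production-binding   # hypothetical name
  namespace: production                  # permissions apply only in this namespace
subjects:
  - kind: User
    name: eks-ro-user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole                      # a RoleBinding may reference a ClusterRole
  name: eks-ro-user-cluster-role
  apiGroup: rbac.authorization.k8s.io
```

Because the binding is namespaced, the rules in the ClusterRole only take effect inside production.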

RabbitMQ only shows one node

I have been trying to set up RabbitMQ on a k8s cluster. I finally got everything set up, but only one node shows up in the management UI. Here are my steps:
1. Dockerfile Setup
I do this to enable autocluster:
FROM rabbitmq:3.8-rc-management-alpine
MAINTAINER kevlai
RUN rabbitmq-plugins --offline enable rabbitmq_peer_discovery_k8s
2. Set up RBAC
apiVersion: v1
kind: ServiceAccount
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
rules:
  - apiGroups:
      - ""
    resources:
      - endpoints
    verbs:
      - get
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dev
subjects:
  - kind: ServiceAccount
    name: borecast-rabbitmq
    namespace: borecast-production
3. Set up Secrets
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-secret
  namespace: borecast-production
type: Opaque
data:
  username: a2V2
  password: Ym9yZWNhc3RydWx6
  secretCookie: c2VjcmV0Y29va2llaGVyZQ==
4. Set up StorageClass
I'm setting up a StorageClass so that k8s will automatically provision volumes for me on AWS.
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: rabbitmq-sc
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-2a
reclaimPolicy: Retain
5. Set up StatefulSets and Services
You can see there are two services. The headless service is for the pods themselves. As for the management service, I'll expose the service for an Ingress controller in order for it to be accessible from outside.
---
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-management-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  ports:
    - port: 15672
      targetPort: 15672
      name: http
    - port: 5672
      targetPort: 5672
      name: amqp
  selector:
    app: borecast-rabbitmq
---
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  clusterIP: None
  ports:
    - port: 5672
      name: amqp
  selector:
    app: borecast-rabbitmq
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
spec:
  serviceName: borecast-rabbitmq-service
  replicas: 3
  template:
    metadata:
      labels:
        app: borecast-rabbitmq
    spec:
      serviceAccountName: borecast-rabbitmq
      containers:
        - image: docker.borecast.com/borecast-rabbitmq:v1.0.3
          name: borecast-rabbitmq
          imagePullPolicy: Always
          resources:
            requests:
              memory: "256Mi"
              cpu: "150m"
            limits:
              memory: "512Mi"
              cpu: "250m"
          ports:
            - containerPort: 5672
              name: amqp
          env:
            - name: RABBITMQ_DEFAULT_USER
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-secret
                  key: username
            - name: RABBITMQ_DEFAULT_PASS
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-secret
                  key: password
            - name: RABBITMQ_ERLANG_COOKIE
              valueFrom:
                secretKeyRef:
                  name: rabbitmq-secret
                  key: secretCookie
            - name: MY_POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: K8S_SERVICE_NAME
              # value: borecast-rabbitmq-service.borecast-production.svc.cluster.local
              value: borecast-rabbitmq-service
            - name: RABBITMQ_USE_LONGNAME
              value: "true"
            - name: RABBITMQ_NODENAME
              value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME)"
              # value: rabbit@$(MY_POD_NAME).borecast-rabbitmq-service.borecast-production.svc.cluster.local
            - name: RABBITMQ_NODE_TYPE
              value: disc
            - name: AUTOCLUSTER_TYPE
              value: "k8s"
            - name: AUTOCLUSTER_DELAY
              value: "10"
            - name: AUTOCLUSTER_CLEANUP
              value: "true"
            - name: CLEANUP_WARN_ONLY
              value: "false"
            - name: K8S_ADDRESS_TYPE
              value: "hostname"
            - name: K8S_HOSTNAME_SUFFIX
              value: ".$(K8S_SERVICE_NAME)"
              # value: .borecast-rabbitmq-service.borecast-production.svc.cluster.local
          volumeMounts:
            - name: rabbitmq-volume
              mountPath: /var/lib/rabbitmq
      imagePullSecrets:
        - name: regcred
  volumeClaimTemplates:
    - metadata:
        name: rabbitmq-volume
        namespace: borecast-production
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: rabbitmq-sc
        resources:
          requests:
            storage: 5Gi
Problem
Everything is working. However, when I access the management UI (i.e. borecast-rabbitmq-management-service on port 15672), I only see one node showing up, when there should be three:
Also notice that the cluster name is
rabbit@borecast-rabbitmq-0.borecast-rabbitmq-service.borecast-production.svc.cluster.local
but when I log out and log in again, the number 0 in borecast-rabbitmq-0 sometimes changes to 1 or 2.
Also notice that the node name is
rabbit@borecast-rabbitmq-1.borecast-rabbitmq-service
And you guessed it, sometimes the number in borecast-rabbitmq-1 is 2 or 0.
I have been trying to debug this, but to no avail. The logs for each pod don't raise any suspicions, and every service and the StatefulSet are working normally. I repeated the five steps multiple times, and if your cluster is on AWS, you can replicate my setup by following the steps (after creating the namespace borecast-production, of course). If anybody can shed some light on the matter, I'll be eternally grateful.
The problem is with the headless service name definition:
- name: K8S_SERVICE_NAME
  # value: borecast-rabbitmq-service.borecast-production.svc.cluster.local
  value: borecast-rabbitmq-service
which is a building block of the node name:
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME)"
whereas the proper node name should use the FQDN of the Pod (<statefulset name>-<ordinal index>.<headless_svc_name>.<namespace>.svc.cluster.local):
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
Therefore you ended up with the node name
borecast-rabbitmq-1.borecast-rabbitmq-service
instead of:
borecast-rabbitmq-1.borecast-rabbitmq-service.borecast-production.svc.cluster.local
Look up the FQDN of the Pods created by the borecast-rabbitmq StatefulSet (in other words, the SRV records of the Pods) with the nslookup utility from inside your cluster, as explained here, to see what form RABBITMQ_NODENAME is expected to have.
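The corrected node name refers to MY_POD_NAMESPACE, which the StatefulSet in the question does not define. A minimal sketch of how it could be supplied through the downward API, assuming it is declared before RABBITMQ_NODENAME in the container's env list (Kubernetes only expands $(VAR) references to variables defined earlier in that list):

```yaml
# Sketch: relevant part of the container's env list in the StatefulSet.
env:
  - name: MY_POD_NAME
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
  - name: MY_POD_NAMESPACE            # not present in the original manifest
    valueFrom:
      fieldRef:
        fieldPath: metadata.namespace
  - name: K8S_SERVICE_NAME
    value: borecast-rabbitmq-service
  - name: RABBITMQ_NODENAME
    value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
```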
Try exposing port 4369 on the headless service.
https://www.rabbitmq.com/clustering.html
See the port access section.
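A sketch of what that could look like for the headless service from the question. Port 4369 is the one used by epmd, the Erlang peer discovery daemon; the port name is a hypothetical label, and whether this alone resolves the clustering problem is not confirmed here.

```yaml
# Sketch: headless service from the question with the epmd port added.
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  clusterIP: None
  ports:
    - port: 5672
      name: amqp
    - port: 4369
      name: epmd        # hypothetical port name
  selector:
    app: borecast-rabbitmq
```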
I had the same issue, and it came down to:
Deleting all the RabbitMQ resources, including the PVCs created by the StatefulSet.
Then reinstalling everything from the manifests.