I'm trying to set up Prometheus with the Prometheus Operator in a fresh Kubernetes cluster, using the following files.
First, I create a namespace called monitoring.
Then I apply this file, which works fine:
apiVersion: apps/v1beta2
kind: Deployment
metadata:
  labels:
    k8s-app: prometheus-operator
  name: prometheus-operator
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      k8s-app: prometheus-operator
  template:
    metadata:
      labels:
        k8s-app: prometheus-operator
    spec:
      priorityClassName: "operator-critical"
      tolerations:
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoSchedule"
      - key: "WorkGroup"
        operator: "Equal"
        value: "operator"
        effect: "NoExecute"
      containers:
      - args:
        - --kubelet-service=kube-system/kubelet
        - --logtostderr=true
        - --config-reloader-image=quay.io/coreos/configmap-reload:v0.0.1
        - --prometheus-config-reloader=quay.io/coreos/prometheus-config-reloader:v0.29.0
        image: quay.io/coreos/prometheus-operator:v0.29.0
        name: prometheus-operator
        ports:
        - containerPort: 8080
          name: http
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
      nodeSelector:
      serviceAccountName: prometheus-operator
Now I want to apply this file (the Prometheus custom resource):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
  labels:
    prometheus: prometheus
spec:
  replicas: 1
  priorityClassName: "operator-critical"
  serviceAccountName: prometheus
  nodeSelector:
    worker.garden.sapcloud.io/group: operator
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchLabels:
      role: observeable
  tolerations:
  - key: "WorkGroup"
    operator: "Equal"
    value: "operator"
    effect: "NoSchedule"
  - key: "WorkGroup"
    operator: "Equal"
    value: "operator"
    effect: "NoExecute"
And I'm getting this error:
error: unable to recognize "1500-prometheus-crd.yaml": no matches for kind "Prometheus" in version "monitoring.coreos.com/v1"
I found https://github.com/coreos/prometheus-operator/issues/1866 and tried what is suggested there, i.e. waiting a few seconds and deploying again, but it doesn't help. Any ideas?
I also tried deleting the namespace and recreating it with the same configs, and I got the same issue. Please advise.
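You need to install the CustomResourceDefinitions (CRDs) so that the custom resource types exist as available objects in Kubernetes before you can create instances of them. The error says exactly that: the API server does not yet know any kind "Prometheus" in the group/version monitoring.coreos.com/v1.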
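To illustrate what "installing the custom resource" means, here is a minimal sketch of a CRD that would register the Prometheus kind. It is only a skeleton under assumptions: it uses apiextensions.k8s.io/v1, which needs Kubernetes 1.16 or newer (an older cluster that still serves apps/v1beta2 would need apiextensions.k8s.io/v1beta1 instead), and it accepts any fields rather than carrying the real validation schema. In practice you would apply the CRD manifests shipped with the operator release rather than writing one by hand.

```yaml
# Illustrative skeleton only; the CRD shipped with the operator has a full schema.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: prometheuses.monitoring.coreos.com   # must be <plural>.<group>
spec:
  group: monitoring.coreos.com
  scope: Namespaced
  names:
    kind: Prometheus
    plural: prometheuses
    singular: prometheus
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        # Assumption for brevity: accept arbitrary fields instead of validating them.
        x-kubernetes-preserve-unknown-fields: true
```

Once a CRD like this is established, the API server can match kind "Prometheus" in monitoring.coreos.com/v1, and applying the Prometheus object no longer fails with "no matches for kind".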
I am deploying Jenkins on a Kubernetes cluster with one master and one node, and I am getting an error when I try to use dynamic volume provisioning. I'm not sure what went wrong. Please help.
my storageclass file
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
mountOptions:
- debug
volumeBindingMode: Immediate
my PVC file
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jenkins-pvc
  labels:
    type: amazonEBS
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 40Gi
  storageClassName: standard
  volumeMode: Filesystem
Deployment file
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jenkins
spec:
  selector:
    matchLabels:
      app: jcasc
  replicas: 1
  template:
    metadata:
      labels:
        app: jcasc
    spec:
      volumes:
      - name: jenkins-pvc
        persistentVolumeClaim:
          claimName: jenkins-pvc
      containers:
      - name: jenkins
        image: jenkins:latest
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: jenkins-pvc
          mountPath: "/var/jenkins_home"
See the troubleshooting doc on fixing the "Pod has unbound immediate PersistentVolumeClaims" error.
For dynamic provisioning, see the dynamic volume provisioning doc.
For PersistentVolumes in general, see the PV doc.
I have three containers in a pod: nginx, redis, custom django app. It seems like none of them talk to each other with kubernetes. In docker compose they do but I can't use docker compose in production.
The django container gets this error:
[2022-06-20 21:45:49,420: ERROR/MainProcess] consumer: Cannot connect to redis://redis:6379/0: Error 111 connecting to redis:6379. Connection refused..
Trying again in 32.00 seconds... (16/100)
and the nginx container starts but never shows any traffic. Trying to connect to localhost:8000 gets no reply.
Any idea what's wrong with my YAML file?
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  creationTimestamp: null
  name: djangonetwork
spec:
  ingress:
  - from:
    - podSelector:
        matchLabels:
          io.kompose.network/djangonetwork: "true"
  podSelector:
    matchLabels:
      io.kompose.network/djangonetwork: "true"
---
apiVersion: v1
data:
  DB_HOST: db
  DB_NAME: django_db
  DB_PASSWORD: password
  DB_PORT: "5432"
  DB_USER: user
kind: ConfigMap
metadata:
  creationTimestamp: null
  labels:
    io.kompose.service: web
  name: envs--django
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    io.kompose.service: web
  name: web
spec:
  replicas: 1
  selector:
    matchLabels:
      io.kompose.service: web
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        io.kompose.network/djangonetwork: "true"
        io.kompose.service: web
    spec:
      containers:
      - image: nginx:alpine
        name: nginxcontainer
        ports:
        - containerPort: 8000
      - image: redis:alpine
        name: rediscontainer
        ports:
        - containerPort: 6379
        resources: {}
      - env:
        - name: DB_HOST
          valueFrom:
            configMapKeyRef:
              key: DB_HOST
              name: envs--django
        - name: DB_NAME
          valueFrom:
            configMapKeyRef:
              key: DB_NAME
              name: envs--django
        - name: DB_PASSWORD
          valueFrom:
            configMapKeyRef:
              key: DB_PASSWORD
              name: envs--django
        - name: DB_PORT
          valueFrom:
            configMapKeyRef:
              key: DB_PORT
              name: envs--django
        - name: DB_USER
          valueFrom:
            configMapKeyRef:
              key: DB_USER
              name: envs--django
        image: localhost:5000/integration/web:latest
        name: djangocontainer
        ports:
        - containerPort: 8000
        resources: {}
      restartPolicy: Always
status: {}
---
apiVersion: v1
kind: Service
metadata:
  labels:
    io.kompose.service: web
  name: web
spec:
  ports:
  - name: "8000"
    port: 8000
    targetPort: 8000
  selector:
    io.kompose.service: web
You've put all three containers into a single Pod. That's usually not the preferred approach: it means you can't restart one of the containers without restarting all of them (any update to your application code requires discarding your Redis cache) and you can't individually scale the component parts (if you need five replicas of your application, do you also need five reverse proxies and can you usefully use five Redises?).
Instead, a preferred approach is to split these into three separate Deployments (or possibly use a StatefulSet for Redis with persistence). Each has a corresponding Service, and then those Service names can be used as DNS names.
A very minimal example for Redis could look like:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis
spec:
  replicas: 1
  selector: # required in apps/v1; must match the template labels
    matchLabels:
      service: web
      component: redis
  template:
    metadata:
      labels:
        service: web
        component: redis
    spec:
      containers:
      - name: redis
        image: redis
        ports:
        - name: redis
          containerPort: 6379
---
apiVersion: v1
kind: Service
metadata:
  name: redis # <-- this name will be a DNS name
spec:
  selector: # matches the template: { metadata: { labels: } }
    service: web
    component: redis
  ports:
  - name: redis
    port: 6379
    targetPort: redis # matches a container's ports: [{ name: }]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app
spec:
  ...
        env:
        - name: REDIS_HOST
          value: redis # matches the Service
If all three parts are in the same Pod, then the Service can't really distinguish which part it's talking to. In principle, between these containers, they share a network namespace and need to talk to each other as localhost; the containers: [{ name: }] have no practical effect.
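For comparison, here is a rough sketch of what the single-Pod layout would need. It assumes the application reads its Redis host from a REDIS_HOST variable, as in the example above; the Pod name is hypothetical and the images are the ones from the question. Because the containers share one network namespace, the application has to use localhost rather than a redis host name.

```yaml
# Sketch only: all containers in one Pod share a single network namespace.
apiVersion: v1
kind: Pod
metadata:
  name: web-single-pod          # hypothetical name
spec:
  containers:
  - name: rediscontainer
    image: redis:alpine
    ports:
    - containerPort: 6379
  - name: djangocontainer
    image: localhost:5000/integration/web:latest
    env:
    - name: REDIS_HOST          # assumes the application reads this variable
      value: localhost          # not "redis": no Service with that name exists here
```

This works, but it keeps the restart and scaling drawbacks described above, which is why separate Deployments and Services are usually the better choice.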
I have created an EKS cluster using eksctl. I am following these steps to establish connectivity from a Spring Boot application to AWS services such as S3 and CloudWatch:
1. Create the EKS cluster using eksctl. The config has my service account details and OIDC enabled.
2. List the service accounts to check that they were created correctly.
3. Create a deployment using the service account name.
4. Create a service.
I am seeing a 403 in the logs:
User: arn:aws:sts:xxx/xxxx is not authorized to perform:
cloudformation:DescribeStackResources because no identity-based policy allows
the cloudformation:DescribeStackResources action (Service: AmazonCloudFormation; Status Code: 403;
Error Code: AccessDenied; Request ID: xxxx)
Can I get some help here to troubleshoot this issue, please?
What I have figured out since posting this question is that the node provisioned by eksctl has its own IAM role attached. That is the role my app is picking up through the default credential provider chain.
What I still haven't figured out is how to enable the apps in the pod to assume the service account role.
YAML for #1
---
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: name
  region: ap-south-1
availabilityZones: ["xxxx", "xxxx", "xxxx"]
managedNodeGroups:
- name: c5large-nodes
  desiredCapacity: 1
  instanceType: c5.large
  labels:
    node-type: large
  volumeSize: 5
cloudWatch:
  clusterLogging:
    enableTypes: [ "*" ]
iam:
  withOIDC: true
  serviceAccounts:
  - metadata:
      name: cluster-autoscaler
      namespace: kube-system
      labels: {aws-usage: "autoscaling"}
    wellKnownPolicies:
      autoScaler: true
    roleName: eksctl-cluster-autoscaler-role
    roleOnly: true
  - metadata:
      name: backend-stage-iam-role
      namespace: backend-stage
      labels: { aws-usage: "all-backend-allow" }
    attachPolicyARNs:
    - "arn:aws:iam::xxxx"
  - metadata:
      name: mq-access
      namespace: backend-stage
      labels: { aws-usage: "MQ" }
    attachPolicyARNs:
    - "arn:aws:iam::aws:policy/AmazonMQFullAccess"
YAML for deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    app: my-app
  namespace: backend-stage
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: my-app
    spec:
      serviceAccountName: backend-stage-iam-role
      containers:
      - image: xxx/my-app:latest
        imagePullPolicy: Always
        name: my-app
        ports:
        - containerPort: 8080
          protocol: TCP
        env:
        - name: SPRING_PROFILES_ACTIVE
          value: stage
YAML for service
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: backend-stage
spec:
  selector:
    app: my-app
  type: LoadBalancer
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080
The role is defined like this for now:
- Effect: Allow
  Action:
  - cloudformation:*
  Resource: "*"
I did some further debugging: when I describe the pod, I can see the role passed as an environment variable:
AWS_ROLE_ARN: arn:aws:iam::MYACCOUNT:role/MyRole
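Just add the missing permission, cloudformation:DescribeStackResources, to the assumed role that the error names (arn:aws:sts:xxx/xxxx). That is the identity the call is actually being made with.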
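As a sketch, the statement to attach could look like the following, written in the same style as the role definition in the question. It grants only the single action named in the error; leaving Resource as "*" is an assumption carried over from the question and can be narrowed to specific stack ARNs.

```yaml
# Sketch: attach to whichever role appears in the AccessDenied message.
- Effect: Allow
  Action:
  - cloudformation:DescribeStackResources
  Resource: "*"        # assumption: same scope as in the question; narrow if possible
```

If the role in the error turns out to be the node's instance role rather than the service account role, the pod is still using node credentials, and adding the permission there only works around that.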
I'm trying to create a deployment on AWS EKS with my application and metricbeat as sidecar, so I have the following YML:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-modules
  namespace: testframework
  labels:
    k8s-app: metricbeat
data:
  kubernetes.yml: |-
    - module: kubernetes
      metricsets:
        - node
        - system
        - pod
        - container
        - volume
      period: 10s
      host: ${NODE_NAME}
      hosts: [ "https://${NODE_IP}:10250" ]
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      ssl.verification_mode: "none"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: metricbeat-config
  namespace: testframework
  labels:
    k8s-app: metricbeat
data:
  metricbeat.yml: |-
    processors:
      - add_cloud_metadata:
      - add_tags:
          tags: ["EKSCORP_DEV"]
          target: "cluster_test"
    metricbeat.config.modules:
      path: ${path.config}/modules.d/*.yml
      reload.enabled: false
    output.elasticsearch:
      index: "metricbeat-k8s-%{[agent.version]}-%{+yyyy.MM.dd}"
    setup.template.name: "metricbeat-k8s"
    setup.template.pattern: "metricbeat-k8s-*"
    setup.ilm.enabled: false
    cloud.id: ${ELASTIC_CLOUD_ID}
    cloud.auth: ${ELASTIC_CLOUD_AUTH}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: testframework-initializr-deploy
  namespace: testframework
spec:
  replicas: 1
  selector:
    matchLabels:
      app: testframework-initializr
  template:
    metadata:
      labels:
        app: testframework-initializr
      annotations:
        co.elastic.logs/enabled: 'true'
        co.elastic.logs/json.keys_under_root: 'true'
        co.elastic.logs/json.add_error_key: 'true'
        co.elastic.logs/json.message_key: 'message'
    spec:
      containers:
      - name: testframework-initializr
        image: XXXXX.dkr.ecr.us-east-1.amazonaws.com/testframework-initializr
        ports:
        - containerPort: 8080
        livenessProbe:
          httpGet:
            path: /health/liveness
            port: 8080
          initialDelaySeconds: 300
          periodSeconds: 10
          timeoutSeconds: 60
          failureThreshold: 5
        readinessProbe:
          httpGet:
            port: 8080
            path: /health
          initialDelaySeconds: 300
          periodSeconds: 10
          timeoutSeconds: 10
          failureThreshold: 3
      - name: metricbeat-sidecar
        image: docker.elastic.co/beats/metricbeat:7.12.0
        args: [
          "-c", "/etc/metricbeat.yml",
          "-e",
          "-system.hostfs=/hostfs"
        ]
        env:
        - name: ELASTIC_CLOUD_ID
          value: xxxx
        - name: ELASTIC_CLOUD_AUTH
          value: xxxx
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        - name: NODE_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        securityContext:
          runAsUser: 0
        volumeMounts:
        - name: config
          mountPath: /etc/metricbeat.yml
          readOnly: true
          subPath: metricbeat.yml
        - name: modules
          mountPath: /usr/share/metricbeat/modules.d
          readOnly: true
      volumes:
      - name: config
        configMap:
          defaultMode: 0640
          name: metricbeat-config
      - name: modules
        configMap:
          defaultMode: 0640
          name: metricbeat-modules
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prom-admin
rules:
- apiGroups: [""]
  resources: ["pods", "nodes"]
  verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prom-rbac
subjects:
- kind: ServiceAccount
  name: default
  namespace: testframework
roleRef:
  kind: ClusterRole
  name: prom-admin
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Service
metadata:
  name: testframework-initializr-service
  namespace: testframework
spec:
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
  selector:
    app: testframework-initializr
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: testframework-initializr-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internal
    alb.ingress.kubernetes.io/target-type: ip
spec:
  rules:
  - host: dev-initializr.test.net
    http:
      paths:
      - backend:
          serviceName: testframework-initializr-service
          servicePort: 80
After the Pod starts up in AWS EKS, I get the following error in the Metricbeat container:
INFO module/wrapper.go:259 Error fetching data for metricset kubernetes.system: error doing HTTP request to fetch 'system' Metricset data: error making http request: Get "https://IP_FROM_FARGATE_HERE:10250/stats/summary": dial tcp IP_FROM_FARGATE_HERE:10250: connect: connection refused
I tried using "NODE_NAME" instead of "NODE_IP", but then I got "No Such Host". Any idea how I can fix it?
I have been trying to set up RabbitMQ on a k8s cluster. I finally got everything set up, but only one node shows up in the management UI. Here are my steps:
1. Dockerfile Setup
I do this to enable autocluster:
FROM rabbitmq:3.8-rc-management-alpine
MAINTAINER kevlai
RUN rabbitmq-plugins --offline enable rabbitmq_peer_discovery_k8s
2. Set up RBAC
apiVersion: v1
kind: ServiceAccount
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
rules:
- apiGroups:
  - ""
  resources:
  - endpoints
  verbs:
  - get
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: dev
subjects:
- kind: ServiceAccount
  name: borecast-rabbitmq
  namespace: borecast-production
3. Set up Secrets
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-secret
  namespace: borecast-production
type: Opaque
data:
  username: a2V2
  password: Ym9yZWNhc3RydWx6
  secretCookie: c2VjcmV0Y29va2llaGVyZQ==
4. Set up StorageClass
I'm setting up a StorageClass so that k8s will automatically provision volumes for me on AWS.
kind: StorageClass
apiVersion: storage.k8s.io/v1beta1
metadata:
  name: rabbitmq-sc
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: us-east-2a
reclaimPolicy: Retain
5. Set up StatefulSets and Services
You can see there are two services. The headless service is for the pods themselves. As for the management service, I'll expose the service for an Ingress controller in order for it to be accessible from outside.
---
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-management-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  ports:
  - port: 15672
    targetPort: 15672
    name: http
  - port: 5672
    targetPort: 5672
    name: amqp
  selector:
    app: borecast-rabbitmq
---
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  clusterIP: None
  ports:
  - port: 5672
    name: amqp
  selector:
    app: borecast-rabbitmq
---
apiVersion: apps/v1beta1
kind: StatefulSet
metadata:
  name: borecast-rabbitmq
  namespace: borecast-production
spec:
  serviceName: borecast-rabbitmq-service
  replicas: 3
  template:
    metadata:
      labels:
        app: borecast-rabbitmq
    spec:
      serviceAccountName: borecast-rabbitmq
      containers:
      - image: docker.borecast.com/borecast-rabbitmq:v1.0.3
        name: borecast-rabbitmq
        imagePullPolicy: Always
        resources:
          requests:
            memory: "256Mi"
            cpu: "150m"
          limits:
            memory: "512Mi"
            cpu: "250m"
        ports:
        - containerPort: 5672
          name: amqp
        env:
        - name: RABBITMQ_DEFAULT_USER
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: username
        - name: RABBITMQ_DEFAULT_PASS
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: password
        - name: RABBITMQ_ERLANG_COOKIE
          valueFrom:
            secretKeyRef:
              name: rabbitmq-secret
              key: secretCookie
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: K8S_SERVICE_NAME
          # value: borecast-rabbitmq-service.borecast-production.svc.cluster.local
          value: borecast-rabbitmq-service
        - name: RABBITMQ_USE_LONGNAME
          value: "true"
        - name: RABBITMQ_NODENAME
          value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME)"
          # value: rabbit@$(MY_POD_NAME).borecast-rabbitmq-service.borecast-production.svc.cluster.local
        - name: RABBITMQ_NODE_TYPE
          value: disc
        - name: AUTOCLUSTER_TYPE
          value: "k8s"
        - name: AUTOCLUSTER_DELAY
          value: "10"
        - name: AUTOCLUSTER_CLEANUP
          value: "true"
        - name: CLEANUP_WARN_ONLY
          value: "false"
        - name: K8S_ADDRESS_TYPE
          value: "hostname"
        - name: K8S_HOSTNAME_SUFFIX
          value: ".$(K8S_SERVICE_NAME)"
          # value: .borecast-rabbitmq-service.borecast-production.svc.cluster.local
        volumeMounts:
        - name: rabbitmq-volume
          mountPath: /var/lib/rabbitmq
      imagePullSecrets:
      - name: regcred
  volumeClaimTemplates:
  - metadata:
      name: rabbitmq-volume
      namespace: borecast-production
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: rabbitmq-sc
      resources:
        requests:
          storage: 5Gi
Problem
Everything is working. However, when I access the management UI (i.e. borecast-rabbitmq-management-service on port 15672), I only see one node showing up, when there should be three:
Also notice that the cluster name is
rabbit@borecast-rabbitmq-0.borecast-rabbitmq-service.borecast-production.svc.cluster.local
but when I log out and log in again, the 0 in borecast-rabbitmq-0 sometimes changes to 1 or 2.
Also notice that the node name is
rabbit@borecast-rabbitmq-1.borecast-rabbitmq-service
and, you guessed it, the 1 in borecast-rabbitmq-1 is sometimes 2 or 0.
I have been trying to debug this, but to no avail. The logs for each pod don't raise any suspicions, and every Service and StatefulSet is working normally. I repeated the five steps multiple times, and if your cluster is on AWS, you can replicate my setup by following the steps (after creating the namespace borecast-production, of course). If anybody can shed some light on the matter, I'll be eternally grateful.
The problem is with the headless service name definition:
- name: K8S_SERVICE_NAME
  # value: borecast-rabbitmq-service.borecast-production.svc.cluster.local
  value: borecast-rabbitmq-service
which is a building block of the node name:
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME)"
whereas the proper node name should be the FQDN of the Pod (<statefulset name>-<ordinal index>.<headless_svc_name>.<namespace>.svc.cluster.local):
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
Therefore you ended up with the node name
borecast-rabbitmq-1.borecast-rabbitmq-service
instead of:
borecast-rabbitmq-1.borecast-rabbitmq-service.borecast-production.svc.cluster.local
Look up the FQDN of the Pods created by the borecast-rabbitmq StatefulSet (in other words, the SRV records of the Pods) with the nslookup utility from inside your cluster, as explained here, to see what form RABBITMQ_NODENAME is expected to have.
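Note that the manifest in the question does not define MY_POD_NAMESPACE. A minimal sketch of the relevant env entries, assuming the namespace is injected through the Downward API, could look like this. The variable name MY_POD_NAMESPACE is taken from the node name above; everything else comes from the question.

```yaml
# Sketch of the container's env list; a $(VAR) reference is only expanded
# when VAR is defined earlier in the same list.
env:
- name: MY_POD_NAME
  valueFrom:
    fieldRef:
      fieldPath: metadata.name
- name: MY_POD_NAMESPACE        # assumption: supplied via the Downward API
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace
- name: K8S_SERVICE_NAME
  value: borecast-rabbitmq-service
- name: RABBITMQ_USE_LONGNAME
  value: "true"
- name: RABBITMQ_NODENAME
  value: "rabbit@$(MY_POD_NAME).$(K8S_SERVICE_NAME).$(MY_POD_NAMESPACE).svc.cluster.local"
```

K8S_HOSTNAME_SUFFIX would presumably need the same namespace and svc.cluster.local suffix so that peer discovery builds matching host names.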
Try exposing port 4369 on the headless service; see the port access section of
https://www.rabbitmq.com/clustering.html
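As a sketch, the headless Service from the question with the extra ports could look like this. Port 4369 is the epmd port named above, and 25672 is RabbitMQ's default inter-node communication port; the port names epmd and cluster are hypothetical labels chosen for illustration.

```yaml
# Sketch: headless Service from the question, extended with clustering ports.
apiVersion: v1
kind: Service
metadata:
  name: borecast-rabbitmq-service
  namespace: borecast-production
  labels:
    app: borecast-rabbitmq
spec:
  clusterIP: None
  ports:
  - port: 5672
    name: amqp
  - port: 4369
    name: epmd          # peer discovery daemon used by RabbitMQ nodes
  - port: 25672
    name: cluster       # default inter-node and CLI tool communication port
  selector:
    app: borecast-rabbitmq
```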
I had the same issue, and it came down to:
1. Deleting all the RabbitMQ resources, including the PVCs created by the StatefulSet.
2. Reinstalling everything from the manifests.