mkdir ${HOME}/environment/grafana
cat << EoF > ${HOME}/environment/grafana/grafana.yaml
datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-server.prometheus.svc.cluster.local
      access: proxy
      isDefault: true
EoF
I get this error when I try to deploy Grafana to my EKS cluster.
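For context, the datasource file above is consumed by the Grafana Helm chart; the install I'm running looks roughly like this (chart flags follow the EKS workshop style, and the admin password is a placeholder):

helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace grafana --create-namespace \
  --set adminPassword='CHANGE_ME' \
  --set service.type=LoadBalancer \
  --values ${HOME}/environment/grafana/grafana.yaml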
I am using the AWS Secrets Store CSI provider to sync secrets from AWS Secrets Manager into Kubernetes/EKS.
The SecretProviderClass is:
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: test-provider
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: mysecret
        objectType: secretsmanager
        jmesPath:
          - path: APP_ENV
            objectAlias: APP_ENV
          - path: APP_DEBUG
            objectAlias: APP_DEBUG
And the Pod mounting these secrets is:
apiVersion: v1
kind: Pod
metadata:
  name: secret-pod
spec:
  restartPolicy: Never
  serviceAccountName: my-account
  terminationGracePeriodSeconds: 2
  containers:
    - name: dotfile-test-container
      image: registry.k8s.io/busybox
      volumeMounts:
        - name: secret-volume
          readOnly: true
          mountPath: "/mnt/secret-volume"
  volumes:
    - name: secret-volume
      csi:
        driver: secrets-store.csi.k8s.io
        readOnly: true
        volumeAttributes:
          secretProviderClass: test-provider
The secret exists in the Secret Provider:
{
"APP_ENV": "staging",
"APP_DEBUG": false
}
(this is an example, I am aware I do not need to store these particular variables as secrets)
But when I create the resources, the Pod fails to run with
Warning  FailedMount  96s (x10 over 5m47s)  kubelet  MountVolume.SetUp failed for volume "secret-volume" : rpc error: code = Unknown desc = failed to mount secrets store objects for pod pace/secret-dotfiles-pod,
err: rpc error: code = Unknown desc = Failed to fetch secret from all regions: mysecret
Turns out the error message is very misleading. The problem in my case was the type of the APP_DEBUG value: changing it from a boolean to a string fixed the problem, and now the pod starts correctly.
{
"APP_ENV": "staging",
"APP_DEBUG": "false"
}
Seems like a bug in the provider to me.
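For anyone hitting the same thing, the quickest way to rewrite the secret so every value is a string (secret name as above) is a plain put-secret-value call:

aws secretsmanager put-secret-value \
  --secret-id mysecret \
  --secret-string '{"APP_ENV": "staging", "APP_DEBUG": "false"}'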
I used the official procedure from AWS and this one to enable logging.
Here are the YAML files I've applied:
---
kind: Namespace
apiVersion: v1
metadata:
  name: aws-observability
  labels:
    aws-observability: enabled
---
kind: ConfigMap
apiVersion: v1
metadata:
  name: aws-logging
  namespace: aws-observability
data:
  flb_log_cw: "true"
  output.conf: |
    [OUTPUT]
        Name cloudwatch_logs
        Match *
        region us-east-1
        log_group_name fluent-bit-cloudwatch
        log_stream_prefix from-fluent-bit-
        auto_create_group true
        log_key log
  parsers.conf: |
    [PARSER]
        Name crio
        Format Regex
        Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>P|F) (?<log>.*)$
        Time_Key time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z
  filters.conf: |
    [FILTER]
        Name parser
        Match *
        Key_name log
        Parser crio
Inside the pod I can see that logging was enabled:
apiVersion: v1
kind: Pod
metadata:
  annotations:
    CapacityProvisioned: 2vCPU 4GB
    Logging: LoggingEnabled
    kubectl.kubernetes.io/restartedAt: "2023-01-17T19:31:20+01:00"
    kubernetes.io/psp: eks.privileged
  creationTimestamp: "2023-01-17T18:31:28Z"
Logs exist inside the container:
kubectl logs dev-768647846c-hbmv7 -n dev-fargate
But in AWS CloudWatch the log groups are not created, not even for fluent-bit itself.
From the pod CLI I can create log groups in CloudWatch, so the permissions are fine.
I also tried the cloudwatch plugin instead of cloudwatch_logs, but no luck.
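For completeness, this is how I was checking whether the log group had been created (region and group name as configured above):

aws logs describe-log-groups \
  --region us-east-1 \
  --log-group-name-prefix fluent-bit-cloudwatch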
I've solved my issue.
The tricky thing is that the IAM policy must be attached to the default pod execution role, which is created automatically with the namespace; it has no relation to the service account or to a custom pod execution role.
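Roughly, the fix looked like this (cluster and profile names are placeholders; I'm showing a managed policy for brevity, while the AWS Fargate logging docs use a custom policy with logs:CreateLogGroup, logs:CreateLogStream, logs:DescribeLogStreams and logs:PutLogEvents):

# Find the pod execution role the Fargate profile actually uses
aws eks describe-fargate-profile \
  --cluster-name <cluster-name> \
  --fargate-profile-name <fargate-profile> \
  --query 'fargateProfile.podExecutionRoleArn'

# Attach the CloudWatch Logs permissions to that role
aws iam attach-role-policy \
  --role-name <pod-execution-role-name> \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy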
I read all the AWS articles and followed them one by one, but none of them worked. Let me briefly summarize my situation. I created the EKS setup with Terraform: 1 VPC, 3 public subnets, 3 private subnets, 3 security groups, 1 NAT gateway (on a public subnet), and 2 autoscaled worker node groups. I checked all the infrastructure created by Terraform and there is no problem with it.
My main problem is that after the installation I can't see the nodes, and the nodes do not join the cluster. I applied the steps below but they didn't work. What should I do? By the way, please don't tag my question as a duplicate; I checked all the similar questions on Stack Overflow. My steps look correct but do not work.
kubectl get nodes
No resources found
Before checking the nodes with the command above, I first ran the command below to set up my kubeconfig.
aws eks update-kubeconfig --name eks-DS7h --region us-east-1
Here is my kubeconfig:
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJfgzsfhadfzasdfrzsd.........
    server: https://0F97E579A.gr7.us-east-1.eks.amazonaws.com
  name: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
contexts:
- context:
    cluster: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
    user: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
  name: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
current-context: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
kind: Config
preferences: {}
users:
- name: arn:aws:eks:us-east-1:545153234644:cluster/eks-DS7h
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - --region
      - us-east-1
      - eks
      - get-token
      - --cluster-name
      - eks-DS7h
      command: aws
After this I checked the nodes again, but I still get "No resources found". Then I tried to edit aws-auth. Before the edit, I checked my user in the terminal where I ran all the Terraform steps.
aws sts get-caller-identity
{
"UserId": "ASDFGSDFGDGSDGDFHSFDSDC",
"Account": "545153234644",
"Arn": "arn:aws:iam::545153234644:user/white"
}
I took my user info and added it to the empty mapUsers section in aws-auth, but I'm still getting "No resources found".
kubectl get cm -n kube-system aws-auth
apiVersion: v1
data:
  mapAccounts: |
    []
  mapRoles: |
    - "groups":
      - "system:bootstrappers"
      - "system:nodes"
      - "system:masters"
      "rolearn": "arn:aws:iam::545153234644:role/eks-DS7h22060508195731770000000e"
      "username": "system:node:{{EC2PrivateDNSName}}"
  mapUsers: "- \"userarn\": \"arn:aws:iam::545153234644:user/white\"\n \"username\":
    \"white\"\n \"groups\":\n - \"system:masters\"\n - \"system:nodes\" \n"
kind: ConfigMap
metadata:
  creationTimestamp: "2022-06-05T08:20:02Z"
  labels:
    app.kubernetes.io/managed-by: Terraform
    terraform.io/module: terraform-aws-modules.eks.aws
  name: aws-auth
  namespace: kube-system
  resourceVersion: "4976"
  uid: b12341-33ff-4f78-af0a-758f88
Also, when I check the EKS cluster in the dashboard I see the warning below. I don't know whether it's relevant or not, but I want to share it too in case it helps.
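One more data point I can collect if it helps: the node group health as reported by EKS (assuming managed node groups; the node group name is whatever Terraform generated):

aws eks list-nodegroups --cluster-name eks-DS7h --region us-east-1
aws eks describe-nodegroup \
  --cluster-name eks-DS7h \
  --nodegroup-name <nodegroup-name> \
  --region us-east-1 \
  --query 'nodegroup.health.issues'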
A few months ago I integrated DataDog into my Kubernetes cluster by using a DaemonSet configuration. Since then I've been getting congestion alerts with the following message:
Please tune the hot-shots settings
https://github.com/brightcove/hot-shots#errors
From attempting to follow the docs with my limited orchestration/DevOps knowledge, what I gather is that I need to add the following to my DaemonSet config:
spec:
  ...
  securityContext:
    sysctls:
      - name: net.unix.max_dgram_qlen
        value: "1024"
      - name: net.core.wmem_max
        value: "4194304"
I attempted to add that configuration piece to one of the auto-deployed DataDog pods directly, just to try it out (instead of adding it to the DaemonSet and risking bringing all the agents down), but it hangs indefinitely and doesn't save the configuration.
That hot-shots documentation also mentions that the above sysctl configuration requires unsafe sysctls to be enabled in the nodes that contain the pods:
kubelet --allowed-unsafe-sysctls \
'net.unix.max_dgram_qlen, net.core.wmem_max'
The cluster I am working with is fully deployed with EKS via the AWS console (I have little knowledge of how it is configured). The above seems to be intended for a manually deployed and managed cluster.
Why is the configuration I am attempting to apply to a single DataDog agent pod not saving/applying? Is it because the pod is managed by a DaemonSet, or because the node doesn't have the proper unsafe sysctls allowed? Something else?
If I do need to enable the suggested unsafe sysctls on all nodes of my cluster, how do I go about it given that the cluster is fully deployed and managed by Amazon EKS?
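For reference, this is how I looked for the sysctl-related rejection on one of the agent pods (namespace and pod name are placeholders for my setup):

kubectl -n datadog describe pod <datadog-agent-pod> | grep -iE 'sysctl|forbidden'
kubectl -n datadog get events --sort-by=.lastTimestamp | tail -n 20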
So we managed to achieve this using a custom launch template with our managed node group and then passing in a custom bootstrap script. This does mean, however, that you need to supply the AMI ID yourself, and you lose the alerts in the console when it is outdated. In Terraform this looks like:
resource "aws_eks_node_group" "group" {
...
launch_template {
id = aws_launch_template.nodes.id
version = aws_launch_template.nodes.latest_version
}
...
}
data "template_file" "bootstrap" {
template = file("${path.module}/files/bootstrap.tpl")
vars = {
cluster_name = aws_eks_cluster.cluster.name
cluster_auth_base64 = aws_eks_cluster.cluster.certificate_authority.0.data
endpoint = aws_eks_cluster.cluster.endpoint
}
}
data "aws_ami" "eks_node" {
owners = ["602401143452"]
most_recent = true
filter {
name = "name"
values = ["amazon-eks-node-1.21-v20211008"]
}
}
resource "aws_launch_template" "nodes" {
...
image_id = data.aws_ami.eks_node.id
user_data = base64encode(data.template_file.bootstrap.rendered)
...
}
Then the bootstrap.tpl file referenced above looks like this:
#!/bin/bash
set -o xtrace
systemctl stop kubelet
/etc/eks/bootstrap.sh '${cluster_name}' \
--b64-cluster-ca '${cluster_auth_base64}' \
--apiserver-endpoint '${endpoint}' \
--kubelet-extra-args '"--allowed-unsafe-sysctls=net.unix.max_dgram_qlen"'
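Once the node group has rolled onto the new launch template version, a quick sanity check (not part of the setup itself) is to confirm on a node, e.g. via an SSM session, that kubelet picked up the extra flag:

ps -ef | grep kubelet | grep -- '--allowed-unsafe-sysctls' | grep -v grep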
The next step is to set up the PodSecurityPolicy, ClusterRole and RoleBinding in your cluster so you can use the securityContext as you described above and then pods in that namespace will be able to run without a SysctlForbidden message.
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: sysctl
spec:
  allowPrivilegeEscalation: false
  allowedUnsafeSysctls:
    - net.unix.max_dgram_qlen
  defaultAllowPrivilegeEscalation: false
  fsGroup:
    rule: RunAsAny
  runAsUser:
    rule: RunAsAny
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: RunAsAny
  volumes:
    - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: allow-sysctl
rules:
  - apiGroups:
      - policy
    resourceNames:
      - sysctl
    resources:
      - podsecuritypolicies
    verbs:
      - '*'
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-sysctl
  namespace: app-namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: allow-sysctl
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: system:serviceaccounts:app-namespace
If you're using the DataDog Helm chart, you can set the following values to update the securityContext of the agent, but you will have to update the chart's PSP manually to set allowedUnsafeSysctls:
datadog:
  securityContext:
    sysctls:
      - name: net.unix.max_dgram_qlen
        value: "512"
I created a Deployment Manager template (Python) to create a zonal GKE cluster (a v1beta1 feature). When I run gcloud deployment-manager deployments create <deploymentname> --config <config.yaml>, the GKE cluster is created as expected.
I used type: gcp-types/container-v1beta1:projects.zones.clusters in my Python template.
However, when I run the delete command, i.e. gcloud deployment-manager deployments delete <deploymentname>, I get the following error. It says that the field "name" could not be found, even though I did specify name in my config.yaml file:
Error in Operation [operation-1536152440470-5751f5c88f9f3-5ca3a167-d12a593d]: errors:
- code: RESOURCE_ERROR
location: /deployments/test-project-gke-xhqgxn6pkd/resources/test-gkecluster-xhqgxn6pkd
message: "{"ResourceType":"gcp-types/container-v1beta1:projects.zones.clusters"
,"ResourceErrorCode":"400","ResourceErrorMessage":{"code":400,"message"
:"Invalid JSON payload received. Unknown name "name": Cannot bind query
parameter. Field 'name' could not be found in request message.","status"
:"INVALID_ARGUMENT","details":[{"#type":"type.googleapis.com/google.rpc.BadRequest"
,"fieldViolations":[{"description":"Invalid JSON payload received. Unknown
name "name": Cannot bind query parameter. Field 'name' could not be found
in request message."}]}],"statusMessage":"Bad Request","requestPath"
:"https://container.googleapis.com/v1beta1/projects/test-project/zones/us-east1-b/clusters/"
,"httpMethod":"GET"}}"
Here's the sample config.yaml
imports:
  - path: templates/gke/gke.py
    name: gke.py

resources:
  - name: ${CLUSTER_NAME}
    type: gke.py
    properties:
      zone: ${ZONE}
      cluster:
        name: ${CLUSTER_NAME}
        description: test gke cluster
        network: ${NETWORK_NAME}
        subnetwork: ${SUBNET_NAME}
        initialClusterVersion: ${CLUSTER_VERSION}
        nodePools:
          - name: ${NODEPOOL_NAME}
            initialNodeCount: ${NODE_COUNT}
            config:
              machineType: ${MACHINE_TYPE}
              diskSizeGb: 100
              imageType: cos
              oauthScopes:
                - https://www.googleapis.com/auth/compute
                - https://www.googleapis.com/auth/devstorage.read_only
                - https://www.googleapis.com/auth/logging.write
                - https://www.googleapis.com/auth/monitoring
              localSsdCount: ${LOCALSSD_COUNT}
Any ideas what I'm missing here?
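In case it's useful, the only extra information I could pull so far comes from Deployment Manager itself (deployment name is a placeholder):

gcloud deployment-manager resources list --deployment <deploymentname>
gcloud deployment-manager deployments describe <deploymentname>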