NFS connection timing out on EKS - amazon-web-services

I have an NFS Helm chart. It is one of the charts for an application that has 5 more sub-charts. Two of the charts need shared storage, for which I am using NFS. On GCP, when I provide the NFS service name in the PV, it works:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {{ include "nfs.name" . }}
spec:
  capacity:
    storage: {{ .Values.persistence.nfsVolumes.size }}
  accessModes:
    - {{ .Values.persistence.nfsVolumes.accessModes }}
  mountOptions:
    - nfsvers=4.1
  nfs:
    server: nfs.default.svc.cluster.local # nfs is the Service name from {{ include "nfs.name" . }}
    path: "/opt/shared-shibboleth-idp"
But the same doesn't work on AWS EKS. The error there, on AWS EKS, is a connection timeout, so it can't mount the volume.
When I change the server to
server: a4eab2d4aef2311e9a2880227e884517-1524131093.us-west-2.elb.amazonaws.com
I still get a connection timeout.
All the mounts are okay, since everything works well on GCP.
What am I doing wrong?
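For reference, one way to take DNS out of the picture (a sketch, not a confirmed fix): the NFS mount is performed by the node's kubelet rather than by a pod, and nodes do not necessarily resolve cluster-internal DNS names, so the PV could be pointed at the NFS Service's ClusterIP instead. The IP below is a placeholder:
  nfs:
    server: 10.100.200.30 # placeholder: the ClusterIP of the nfs Service (kubectl get svc nfs)
    path: "/opt/shared-shibboleth-idp"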

Related

Redis deployed in AWS - Connection time out from localhost SpringBoot app

A small question regarding Redis deployed in AWS (not AWS ElastiCache) and an issue connecting to it.
Here is the setup of the Redis deployed in AWS (pasting only the Kubernetes StatefulSet and Service):
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 3
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      initContainers:
        - name: config
          image: redis:7.0.5-alpine
          command: [ "sh", "-c" ]
          args:
            - |
              cp /tmp/redis/redis.conf /etc/redis/redis.conf
              echo "finding master..."
              MASTER_FDQN=`hostname -f | sed -e 's/redis-[0-9]\./redis-0./'`
              if [ "$(redis-cli -h sentinel -p 5000 ping)" != "PONG" ]; then
                echo "master not found, defaulting to redis-0"
                if [ "$(hostname)" = "redis-0" ]; then
                  echo "this is redis-0, not updating config..."
                else
                  echo "updating redis.conf..."
                  echo "slaveof $MASTER_FDQN 6379" >> /etc/redis/redis.conf
                fi
              else
                echo "sentinel found, finding master"
                MASTER="$(redis-cli -h sentinel -p 5000 sentinel get-master-addr-by-name mymaster | grep -E '(^redis-\d{1,})|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})')"
                echo "master found : $MASTER, updating redis.conf"
                echo "slaveof $MASTER 6379" >> /etc/redis/redis.conf
              fi
          volumeMounts:
            - name: redis-config
              mountPath: /etc/redis/
            - name: config
              mountPath: /tmp/redis/
      containers:
        - name: redis
          image: redis:7.0.5-alpine
          command: ["redis-server"]
          args: ["/etc/redis/redis.conf"]
          ports:
            - containerPort: 6379
              name: redis
          volumeMounts:
            - name: data
              mountPath: /data
            - name: redis-config
              mountPath: /etc/redis/
      volumes:
        - name: redis-config
          emptyDir: {}
        - name: config
          configMap:
            name: redis-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [ "ReadWriteOnce" ]
        storageClassName: nfs-1
        resources:
          requests:
            storage: 50Mi
---
apiVersion: v1
kind: Service
metadata:
  name: redis
spec:
  ports:
    - port: 6379
      targetPort: 6379
      name: redis
  selector:
    app: redis
  type: LoadBalancer
The pods are healthy; I can exec into them and perform operations fine. Here is the output of get all:
NAME          READY   STATUS    RESTARTS   AGE
pod/redis-0   1/1     Running   0          22h
pod/redis-1   1/1     Running   0          22h
pod/redis-2   1/1     Running   0          22h

NAME            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/redis   LoadBalancer   192.168.45.55   10.51.5.2     6379:30315/TCP   26h

NAME                     READY   AGE
statefulset.apps/redis   3/3     22h
Here is the describe output for the service:
Name:                     redis
Namespace:                Namespace
Labels:                   <none>
Annotations:              <none>
Selector:                 app=redis
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       192.168.22.33
IPs:                      192.168.22.33
LoadBalancer Ingress:     10.51.5.2
Port:                     redis  6379/TCP
TargetPort:               6379/TCP
NodePort:                 redis  30315/TCP
Endpoints:                192.xxx:6379,192.xxx:6379,192.xxx:6379
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason        Age                From                Message
  ----    ------        ----               ----                -------
  Normal  IPAllocated   68s                metallb-controller  Assigned IP ["10.51.5.2"]
  Normal  nodeAssigned  58s (x5 over 66s)  metallb-speaker     announcing from node "someaddress.com" with protocol "bgp"
  Normal  nodeAssigned  58s (x5 over 66s)  metallb-speaker     announcing from node "someaddress.com" with protocol "bgp"
I then try to connect to it, i.e. insert some data, with a very straightforward Spring Boot application. The application has no business logic; it just tries to insert data.
Here are the relevant parts:
@Configuration
public class RedisConfiguration {

    @Bean
    public ReactiveRedisConnectionFactory reactiveRedisConnectionFactory() {
        return new LettuceConnectionFactory("10.51.5.2", 30315);
    }
}

@Repository
public class RedisRepository {

    private final ReactiveRedisOperations<String, String> reactiveRedisOperations;

    public RedisRepository(ReactiveRedisOperations<String, String> reactiveRedisOperations) {
        this.reactiveRedisOperations = reactiveRedisOperations;
    }

    public Mono<RedisPojo> save(RedisPojo redisPojo) {
        return reactiveRedisOperations.opsForValue()
                .set(redisPojo.getInput(), redisPojo.getOutput())
                .map(__ -> redisPojo);
    }
}
Each time I try to write data, I get this exception:
2022-12-02T20:20:08.015+08:00 ERROR 1184 --- [ctor-http-nio-3] a.w.r.e.AbstractErrorWebExceptionHandler : [8f16a752-1] 500 Server Error for HTTP POST "/save"
org.springframework.data.redis.RedisConnectionFailureException: Unable to connect to Redis
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.translateException(LettuceConnectionFactory.java:1602) ~[spring-data-redis-3.0.0.jar:3.0.0]
    Suppressed: reactor.core.publisher.FluxOnAssembly$OnAssemblyException:
Error has been observed at the following site(s):
    *__checkpoint ⇢ Handler com.redis.controller.RedisController#test(RedisRequest) [DispatcherHandler]
    *__checkpoint ⇢ HTTP POST "/save" [ExceptionHandlingWebHandler]
Original Stack Trace:
    at org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory$ExceptionTranslatingConnectionProvider.translateException(LettuceConnectionFactory.java:1602) ~[spring-data-redis-3.0.0.jar:3.0.0]
Caused by: io.lettuce.core.RedisConnectionException: Unable to connect to 10.51.5.2/<unresolved>:30315
    at io.lettuce.core.RedisConnectionException.create(RedisConnectionException.java:78) ~[lettuce-core-6.2.1.RELEASE.jar:6.2.1.RELEASE]
    at io.lettuce.core.RedisConnectionException.create(RedisConnectionException.java:56) ~[lettuce-core-6.2.1.RELEASE.jar:6.2.1.RELEASE]
    at io.lettuce.core.AbstractRedisClient.getConnection(AbstractRedisClient.java:350) ~[lettuce-core-6.2.1.RELEASE.jar:6.2.1.RELEASE]
    at io.lettuce.core.RedisClient.connect(RedisClient.java:216) ~[lettuce-core-6.2.1.RELEASE.jar:6.2.1.RELEASE]
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: /10.51.5.2:30315
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261) ~[netty-transport-4.1.85.Final.jar:4.1.85.Final]
    at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) ~[netty-common-4.1.85.Final.jar:4.1.85.Final]
This is particularly puzzling, because I am quite sure the code of the Spring Boot app is working. When I change the IP in return new LettuceConnectionFactory("10.51.5.2", 30315); to
- a regular Redis on my laptop ("localhost", 6379),
- a dockerized Redis on my laptop,
- a dockerized Redis on prem,
all of them work fine.
Therefore, I am quite puzzled about what I did wrong in the setup of this Redis in AWS.
What should I do in order to connect to it properly?
May I get some help please?
Thank you
By default, Redis binds itself to the IP addresses 127.0.0.1 and ::1 and does not accept connections on non-local interfaces. Chances are high that this is your main issue, and you may want to review your redis.conf file to bind Redis to the interface you need, or to the generic * -::*, as explained in the comments of the config file itself.
With that being said, Redis also does not accept connections on non-local interfaces if the default user has no password - a security layer named protected mode. Thus you should either give your default user a password or disable protected mode in your redis.conf file.
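For reference, the corresponding redis.conf directives might look like this (a sketch; pick one of the last two options):
bind * -::*                     # accept connections on all interfaces
protected-mode no               # either disable protected mode ...
# requirepass <your-password>   # ... or set a password for the default user instead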
Not sure if this applies to your case but, as a side note, I would suggest always avoiding exposing Redis to the Internet.
You are mixing two things.
To enable this service for pods in different namespaces, you do not need an external load balancer; you can just use the redis.namespace-name:6379 DNS name and it will just work. Such a DNS name exists for every service you create (but it works only inside Kubernetes).
Kubernetes will make sure that your traffic is routed to the proper pods (assuming there is more than one).
If you want to expose Redis from outside of Kubernetes, then you need to make sure there is connectivity from the outside, and then you need a network load balancer that forwards traffic to your Kubernetes service (in your case the node port, so you need an NLB with eks-worker-nodes:30315 as the targets).
If your worker nodes have public IPs and their SecurityGroups allow connecting to them directly, you could try to connect to a worker node's IP directly, just to test things out (without the LB).
And regardless of your setup, you can always create a proxy via kubectl:
kubectl port-forward -n redisNS svc/redis 6379:6379
and connect from the Spring Boot app to localhost:6379.
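For example, a sketch of the connection factory from the question, pointed at the in-cluster DNS name (redis.redisNS here stands for the hypothetical <service>.<namespace> name; with the port-forward above you would use "localhost", 6379 instead):
@Bean
public ReactiveRedisConnectionFactory reactiveRedisConnectionFactory() {
    // Service port 6379, not the NodePort 30315
    return new LettuceConnectionFactory("redis.redisNS", 6379);
}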
How do you want to connect from the app to Redis in your final setup?

AWS EFS CSI Driver Reattachable Persistent Volumes

I'm currently running AWS EFS CSI driver v1.37 on EKS v1.20. The idea is to deploy a statefulset application which can persist its volumes after undeploying, and then reattach them for subsequent deployments.
The initial process considered can be seen here - Kube AWS EFS CSI Driver. However, the volumes do not reattach.
AWS Support have indicated that perhaps the best approach would be to use static provisioning, whereby the EFS access points are created up front and assigned via the persistent volume templates, similar to:
{{- $name := include "fullname" . -}}
{{- $labels := include "labels" . -}}
{{- range $k, $v := .Values.persistentVolume }}
{{- if $v.enabled }}
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: {{ $v.metadata.name }}-{{ $name }}
  labels:
    name: "{{ $v.metadata.name }}-{{ $name }}"
    {{- $labels | nindent 4 }}
spec:
  capacity:
    storage: {{ $v.spec.capacity.storage | quote }}
  volumeMode: Filesystem
  accessModes:
    {{- toYaml $v.spec.accessModes | nindent 4 }}
  persistentVolumeReclaimPolicy: {{ $v.spec.persistentVolumeReclaimPolicy }}
  storageClassName: {{ $v.spec.storageClassName }}
  csi:
    driver: efs.csi.aws.com
    volumeHandle: {{ $v.spec.csi.volumeHandle }}
    volumeAttributes:
      encryptInTransit: "true"
{{- end }}
{{- end }}
The key variable to note above is:
{{ $v.spec.csi.volumeHandle }}
whereby the EFS ID and AP ID can be combined.
Has anyone tried this or something similar in order to establish persistent data volumes that can be reattached to?
The answer is yes.
When running a statefulset, the trick is to swap out the volume claim template for a persistent volume claim.
The subPath is based on the pod name inside the volume mounts:
- name: data
  mountPath: /var/rabbitmq
  subPath: $(MY_POD_NAME)
And in turn, mount the persistent volume claim inside the volumes:
- name: data
  persistentVolumeClaim:
    claimName: data-rabbitmq
The persistent volume claim is then tied back to the persistent volume by setting this inside the persistent volume claim:
volumeName: <pv-name>
Both the persistent volume and the persistent volume claim have their storage classes set like so:
storageClassName: "\"\""
The persistent volume sets both the EFS ID and EFS AP ID like so:
volumeHandle: fs-123::fsap-456
NB: the EFS AP is created up front via Terraform, not via the AWS EFS CSI driver.
And if sharing a single EFS cluster across multiple EKS clusters, the remaining piece of magic is to ensure the base path inside the storage class is unique for all volumes across all applications. This is set inside the storage class like so:
basePath: "/green_infra/queuing/rabbitmq_data"
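Putting those pieces together, here is a minimal sketch of the PV/PVC pairing (names and sizes are placeholders; the volumeHandle is the fs-id::access-point-id pair from above):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-rabbitmq-pv
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  storageClassName: ""
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-123::fsap-456 # EFS ID :: access point ID, created up front
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-rabbitmq
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: data-rabbitmq-pv # ties the claim to the PV above
  resources:
    requests:
      storage: 1Gi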
Happy DevOps :~)

Prometheus Alertmanager not printing {{ $labels.instance }} value

We have multiple AWS accounts, network access is configured between two of them, and service discovery is working with node-exporter. I have a Prometheus configuration with some rules configured for the Docker containers, and I have now added a rule, similar to the existing ones, to check whether the same container has mistakenly been launched in another AWS account; the rule is below. For the existing rules, {{ $labels.instance }} is printed in the alert emails, but not for the new rule which I have written.
Scrape config for labels:
- job_name: 'aws-conatiners'
  scheme: http
  ec2_sd_configs:
    - region: {{region}}
      port: 8181
  relabel_configs:
    - source_labels: [__meta_ec2_tag_Name]
      target_label: instance
The new rule which I have created to check if more than one container is running:
# Alert to check if more than one instance is running for the backendapi service
- alert: multiple_instances_are_running
  expr: sum(container_last_seen{name=~"backendapi"}) > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "More than one Instance (instance {{ $labels.instance }}) is running"
    description: "More than one Instance (instance {{ $labels.instance }}) is running for 5 minutes."
Can someone please check and help me get the instance name printed in the alert emails?
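For reference, one likely cause (a hint, not a confirmed diagnosis): aggregation operators such as sum() drop all labels unless they are preserved with a by clause, so {{ $labels.instance }} has nothing to print for this rule. A variant of the expression that keeps the label might look like:
- alert: multiple_instances_are_running
  expr: sum by (instance) (container_last_seen{name=~"backendapi"}) > 1
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "More than one Instance (instance {{ $labels.instance }}) is running"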

Kubernetes - force restarting on specific memory usage

Our server runs on Kubernetes with auto-scaling, and we use New Relic for observability, but we face some issues:
1. We need to restart pods when memory usage reaches 1G. It automatically restarts when it reaches 1.2G, but everything becomes slow.
2. We need to terminate pods when there are no requests to the server.
My configuration:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}
  labels:
    app: {{ .Release.Name }}
spec:
  revisionHistoryLimit: 2
  replicas: {{ .Values.replicas }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}
    spec:
      containers:
        - name: {{ .Release.Name }}
          image: "{{ .Values.imageRepository }}:{{ .Values.tag }}"
          env:
            {{- include "api.env" . | nindent 12 }}
          resources:
            limits:
              memory: {{ .Values.memoryLimit }}
              cpu: {{ .Values.cpuLimit }}
            requests:
              memory: {{ .Values.memoryRequest }}
              cpu: {{ .Values.cpuRequest }}
      imagePullSecrets:
        - name: {{ .Values.imagePullSecret }}
      {{- if .Values.tolerations }}
      tolerations:
{{ toYaml .Values.tolerations | indent 8 }}
      {{- end }}
      {{- if .Values.nodeSelector }}
      nodeSelector:
{{ toYaml .Values.nodeSelector | indent 8 }}
      {{- end }}
My values file:
memoryLimit: "2Gi"
cpuLimit: "1.0"
memoryRequest: "1.0Gi"
cpuRequest: "0.75"
That's what I am trying to achieve.
If you want to be sure your pod/deployment won't consume more than 1.0Gi of memory, then setting that memory limit will do the job just fine.
Once you set that limit and your container exceeds it, it becomes a potential candidate for termination. If it continues to consume memory beyond its limit, the container will be terminated. If a terminated container can be restarted, the kubelet restarts it, as with any other type of runtime container failure.
For more reading, please visit the section Exceeding a container's memory limit.
Moving on, if you wish to scale your deployment based on requests, you would need custom metrics to be provided by an external adapter such as Prometheus. The Horizontal Pod Autoscaler natively provides scaling based only on CPU and memory (using the metrics from the metrics server).
The adapter documentation provides a walkthrough of how to configure it with the Kubernetes API and HPA. A list of other adapters can be found here.
Then you can scale your deployment based on the http_requests metric as shown here, or on requests-per-second as described here; a sketch follows.
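For illustration, here is a sketch of what such an HPA could look like once the adapter exposes an http_requests metric (names and thresholds are placeholders, not taken from your setup):
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api # hypothetical deployment name
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests
        target:
          type: AverageValue
          averageValue: 500m # i.e. 0.5 requests/second per pod on average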

How to serve static files in Django application running inside Kubernetes

I have a small application built in Django. It serves as a frontend and is being installed in one of our K8S clusters.
I'm using Helm to deploy the charts, and I fail to serve the static files of Django correctly.
I've searched in multiple places, but I couldn't find anything that fixes my problem.
That's my ingress file:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: orion-toolbelt
  namespace: {{ .Values.global.namespace }}
  annotations:
    # ingress.kubernetes.io/secure-backends: "false"
    # nginx.ingress.kubernetes.io/secure-backends: "false"
    ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/rewrite-target: /
    ingress.kubernetes.io/force-ssl-redirect: "false"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "false"
    ingress.kubernetes.io/ssl-redirect: "false"
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    ingress.kubernetes.io/ingress.allow-http: "true"
    nginx.ingress.kubernetes.io/ingress.allow-http: "true"
    nginx.ingress.kubernetes.io/proxy-body-size: 500m
spec:
  rules:
    - http:
        paths:
          - path: /orion-toolbelt
            backend:
              serviceName: orion-toolbelt
              servicePort: {{ .Values.service.port }}
The static file location in Django is kept at the default, e.g.
STATIC_URL = "/static"
The user ends up unable to access the static files that way.
What should I do next?
Attached is the error:
HTML-static_files-error
-- EDIT: 5/8/19 --
The pod's deployment.yaml looks like the following:
apiVersion: apps/v1
kind: StatefulSet
metadata:
  namespace: {{ .Values.global.namespace }}
  name: orion-toolbelt
  labels:
    app: orion-toolbelt
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orion-toolbelt
  template:
    metadata:
      labels:
        app: orion-toolbelt
    spec:
      containers:
        - name: orion-toolbelt
          image: {{ .Values.global.repository.imagerepo }}/orion-toolbelt:10.4-SNAPSHOT-15
          ports:
            - containerPort: {{ .Values.service.port }}
          env:
            - name: "USERNAME"
              valueFrom:
                secretKeyRef:
                  key: username
                  name: {{ .Values.global.secretname }}
            - name: "PASSWORD"
              valueFrom:
                secretKeyRef:
                  key: password
                  name: {{ .Values.global.secretname }}
            - name: "MASTER_IP"
              valueFrom:
                secretKeyRef:
                  key: master_ip
                  name: {{ .Values.global.secretname }}
          imagePullPolicy: {{ .Values.global.pullPolicy }}
      imagePullSecrets:
        - name: {{ .Values.global.secretname }}
EDIT2: 20/8/19 - adding service.yaml
apiVersion: v1
kind: Service
metadata:
  namespace: {{ .Values.global.namespace }}
  name: orion-toolbelt
spec:
  selector:
    app: orion-toolbelt
  ports:
    - protocol: TCP
      port: {{ .Values.service.port }}
      targetPort: {{ .Values.service.port }}
You should simply include the /static directory within the container and adjust the path to it in the application.
Otherwise, if it must be /static, or you don't want to keep the static files in the container, or you want other containers to access the volume, you should think about mounting a Kubernetes volume to your Deployment/StatefulSet, for example:
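Here is a minimal sketch of such a mount (the volume and claim names are placeholders):
containers:
  - name: orion-toolbelt
    volumeMounts:
      - name: static-files
        mountPath: /static # where the app expects its static files
volumes:
  - name: static-files
    persistentVolumeClaim:
      claimName: orion-toolbelt-static # hypothetical claim holding the static assets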
Edit:
You can test whether this path exists in your Kubernetes pod this way:
kubectl get po <- this command gives you the name of your pod
kubectl exec -it <name of pod> sh <- this command lets you execute commands in the container shell
There you can test whether your path exists. If it does, it is the fault of your application; if it does not, you added it incorrectly in the Docker image.
You can also add a path to your Kubernetes pod without specifying it in the Docker container. Check this link for details.
As described by community member Marcin Ginszt: according to the information supplied in the post, it is difficult to guess where the problem is in your Django app config/settings.
Please refer to Managing static files (e.g. images, JavaScript, CSS)
NOTE:
Serving the files - STATIC_URL = '/static/'
In addition to these configuration steps, you’ll also need to actually serve the static files.
During development, if you use django.contrib.staticfiles, this will be done automatically by runserver when DEBUG is set to True (see django.contrib.staticfiles.views.serve()).
This method is grossly inefficient and probably insecure, so it is unsuitable for production.
See Deploying static files for proper strategies to serve static files in production environments.
Django doesn’t serve files itself; it leaves that job to whichever Web server you choose.
We recommend using a separate Web server – i.e., one that’s not also running Django – for serving media. Here are some good choices:
Nginx
A stripped-down version of Apache
Here you can find an example of how to serve static files using the collectstatic command.
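One common deployment pattern, sketched below under the assumption that the application image contains manage.py and that STATIC_ROOT points at the shared directory, is to run collectstatic into a shared volume and let an nginx sidecar serve it:
initContainers:
  - name: collectstatic
    image: {{ .Values.global.repository.imagerepo }}/orion-toolbelt:10.4-SNAPSHOT-15
    command: ["python", "manage.py", "collectstatic", "--noinput"]
    volumeMounts:
      - name: static-files
        mountPath: /app/static # assumed STATIC_ROOT
containers:
  - name: nginx
    image: nginx:stable
    volumeMounts:
      - name: static-files
        mountPath: /usr/share/nginx/html/static # served at /static by nginx
volumes:
  - name: static-files
    emptyDir: {}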
Please let me know if it helped.