Kernel error when trying to start an NFS server in a container - google-cloud-platform

I was trying to run through the NFS example in the Kubernetes codebase on Container Engine, but I couldn't get the shares to mount. Turns out every time the nfs-server pod is launched, the kernel is throwing an error:
Apr 27 00:11:06 k8s-cluster-6-node-1 kernel: [60165.482242] ------------[ cut here ]------------
Apr 27 00:11:06 k8s-cluster-6-node-1 kernel: [60165.483060] WARNING: CPU: 0 PID: 7160 at /build/linux-50mAO0/linux-3.16.7-ckt4/fs/nfsd/nfs4recover.c:1195 nfsd4_umh_cltrack_init+0x4a/0x60 nfsd
Full output here: http://pastebin.com/qLzCFpAa
Any thoughts on how to solve this?

The NFS example doesn't work because GKE (by default) doesn't support running privileged containers, such as the nfs-server. I just tested this with a v0.16.0 cluster and kubectl v0.15.0 (the current gcloud default) and got a nice error message when I tried to start the nfs-server pod:
$ kubectl create -f nfs-server-pod.yaml
Error: Pod "nfs-server" is invalid: spec.containers[0].privileged: forbidden 'true'

Related

Does GKE support container checkpoints

Does Google Kubernetes Engine support creating a checkpoint for a container running on a given host? Kubernetes issue #3949 says that Kubernetes is not going to support it anytime soon. Is it possible to do it from the host where the containers are running?
GKE allows a maintainer to SSH into the compute node for troubleshooting. The host has a utility, ctr, for interacting with the containerd socket (containerd is the default container engine). The ctr utility has options to perform a checkpoint in two different ways:
user@gke-cluster-1-default-pool-ef0e509f-p9lw ~ $ ctr containers checkpoint <container_id> test2
ctr: image "test2": already exists
user@gke-cluster-1-default-pool-ef0e509f-p9lw ~ $ ctr containers checkpoint <container_id> test3
user@gke-cluster-1-default-pool-ef0e509f-p9lw ~ $ echo $?
0
The utility detects an already existing checkpoint name and otherwise exits without an error, but during a checkpoint containerd logs this error:
Aug 15 20:29:11 gke-cluster-1-default-pool-ef0e509f-p9lw containerd[1441]: time="2021-08-15T20:29:11.892718841Z" level=info msg="ImageCreate event &ImageCreate{Name:test2,Labels:map[string]string{},XXX_unrecognized:[],}"
Aug 15 20:29:11 gke-cluster-1-default-pool-ef0e509f-p9lw containerd[1441]: time="2021-08-15T20:29:11.893607606Z" level=error msg="Failed to handle event &ImageCreate{Name:test2,Labels:map[string]string{},XXX_unrecognized:[],} for test2" error="get image id: unexpected media type application/vnd.containerd.container.checkpoint.runtime.options+proto for sha256:d8183a03f8f3429623e6aa55d13c70d1bfc282fe5c3d6562180fdc55c7614589: not found"
The other way just reports that criu is not present in the PATH, which I verified; checkpoints usually rely on the criu utility.
Is the feature supported (perhaps through a configuration that needs to be enabled), and how can I verify it?
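The PATH check mentioned above can be reproduced on the node with a small shell helper. This is only a minimal sketch: has_tool and report_tool are hypothetical helper names, and finding criu would show that the binary is installed, not that GKE or containerd supports checkpointing.

```shell
#!/bin/sh
# has_tool NAME: succeed if the shell can resolve NAME as a command
# (a PATH lookup, or a builtin/function of the same name).
# Hypothetical helper, not part of ctr or containerd.
has_tool() {
    command -v "$1" >/dev/null 2>&1
}

# report_tool NAME: print a one-line status for NAME.
report_tool() {
    if has_tool "$1"; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found in PATH"
    fi
}

report_tool criu
report_tool ctr
```

Running it on the node prints one status line per tool, which makes it easy to compare nodes or node images.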

CockroachDB on AWS EKS cluster - [n?] no stores bootstrapped

I am attempting to deploy CockroachDB:v2.1.6 to a new AWS EKS cluster. Everything is deployed successfully; statefulset, services, pv's & pvc's are created. The AWS EBS volumes are created successfully too.
The issue is the pods never get to a READY state.
pod/cockroachdb-0 0/1 Running 0 14m
pod/cockroachdb-1 0/1 Running 0 14m
pod/cockroachdb-2 0/1 Running 0 14m
If I 'describe' the pods I get the following:
Normal Pulled 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Container image "cockroachdb/cockroach:v2.1.6" already present on machine
Normal Created 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Created container cockroachdb
Normal Started 46s kubelet, ip-10-5-109-70.eu-central-1.compute.internal Started container cockroachdb
Warning Unhealthy 1s (x8 over 36s) kubelet, ip-10-5-109-70.eu-central-1.compute.internal Readiness probe failed: HTTP probe failed with statuscode: 503
If I examine the logs of a pod I see this:
I200409 11:45:18.073666 14 server/server.go:1403 [n?] no stores bootstrapped and --join flag specified, awaiting init command.
W200409 11:45:18.076826 87 vendor/google.golang.org/grpc/clientconn.go:1293 grpc: addrConn.createTransport failed to connect to {cockroachdb-0.cockroachdb:26257 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host". Reconnecting...
W200409 11:45:18.076942 21 gossip/client.go:123 [n?] failed to start gossip client to cockroachdb-0.cockroachdb:26257: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: lookup cockroachdb-0.cockroachdb on 172.20.0.10:53: no such host"
I came across this comment from the CockroachDB forum (https://forum.cockroachlabs.com/t/http-probe-failed-with-statuscode-503/2043/6)
Both the cockroach_out.log and cockroach_output1.log files you sent me (corresponding to mycockroach-cockroachdb-0 and mycockroach-cockroachdb-2) print out no stores bootstrapped during startup and prefix all their log lines with n?, indicating that they haven’t been allocated a node ID. I’d say that they may have never been properly initialized as part of the cluster.
I have deleted everything including pv's, pvc's & AWS EBS volumes through the kubectl delete command and reapplied with the same issue.
Any thoughts would be very much appreciated. Thank you
I was not aware that you had to initialize the CockroachDB cluster after creating it. I did the following to resolve my issue:
kubectl exec -it cockroachdb-0 -n <namespace> -- /bin/sh
/cockroach/cockroach init
See here for more details - https://www.cockroachlabs.com/docs/v19.2/cockroach-init.html
After this the pods started running correctly.
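The fix above can be sketched as a small script that only runs the init step when a pod's log shows the "awaiting init command" message. This is a sketch under assumptions: needs_init is a hypothetical helper, the pod name cockroachdb-0 is taken from the output above, the cluster is assumed to be in the current namespace, and depending on how the cluster is secured the init command may need additional flags that are not shown here.

```shell
#!/bin/sh
# needs_init: read CockroachDB log text on stdin and succeed if a node
# reports that it is still waiting for `cockroach init`.
# Hypothetical helper; the matched text comes from the log line quoted above.
needs_init() {
    grep -q 'no stores bootstrapped and --join flag specified, awaiting init command'
}

# Usage (commented out so the sketch has no side effects when sourced):
# if kubectl logs cockroachdb-0 | needs_init; then
#     kubectl exec cockroachdb-0 -- /cockroach/cockroach init
# fi
```

The init step only needs to run once, against a single node; the other pods join through the --join addresses once that node has been initialized.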

Docker-compose blkio: device_write_iops not working for AWS/EBS instance

I am trying to limit iops on a particular container in my docker-compose stack. To do this I am using the following config:
blkio_config:
  device_write_iops:
    - path: "/dev/xvda1"
      rate: 20
  device_read_iops:
    - path: "/dev/xvda1"
      rate: 20
I cannot provide the rest of the file for security reasons; however, the problem is isolated to this block. I confirmed that this is the correct path for my EBS volume using the df -h command.
When I then run docker-compose up -d I get the following error:
Recreating e1c25c41b612_drone ... error
ERROR: for e1c25c41b612_drone Cannot start service drone: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:396: setting cgroup config for procHooks process caused \\\"failed to write 202:1 20 to blkio.throttle.read_iops_device: write /sys/fs/cgroup/blkio/docker/a674e86d50111afa576d5fd4e16a131070c100b7db3ac22f95986904a47ae82a/blkio.throttle.read_iops_device: invalid argument\\\"\"": unknown
ERROR: for drone Cannot start service drone: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:430: container init caused \"process_linux.go:396: setting cgroup config for procHooks process caused \\\"failed to write 202:1 20 to blkio.throttle.read_iops_device: write /sys/fs/cgroup/blkio/docker/a674e86d50111afa576d5fd4e16a131070c100b7db3ac22f95986904a47ae82a/blkio.throttle.read_iops_device: invalid argument\\\"\"": unknown
The IOPS limit on my EBS volume is 120, so I tested a variety of different values, to no avail.
Any help is massively appreciated.

Can't attach gdbserver to process through kubectl

It looks like I have some sort of permissions problem with kubectl. I have a Docker image that contains a server with a native dynamic library, plus gdbserver. When I debug the Docker container running on my local machine, everything works. I use the following workflow:
1. start gdb
2. target remote | docker exec -i CONTAINER gdbserver - --attach PID
3. set sysroot /path/to/local/binary
4. Good to go!
But when I try the same operation with kubectl, I get the following error:
Cannot attach to lwp 7: Operation not permitted (1)
Exiting
Remote connection closed
The only difference is step 2:
target remote | kubectl exec -i POD -- gdbserver - --attach PID
I think you might need to add the ptrace capability and a seccomp profile in your YAML file. With plain Docker, the equivalent flag is:
--cap-add sys_ptrace
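For a Kubernetes pod, the Docker flag above corresponds to adding SYS_PTRACE under the container's securityContext. Here is a minimal sketch, written as a shell heredoc so the manifest can be generated and applied in one step. The pod name gdb-debug, the image example/server-with-gdbserver, and the file name are placeholders, not values from the original post, and whether a seccomp change is also needed depends on the cluster.

```shell
#!/bin/sh
# Write a pod manifest that grants the container the SYS_PTRACE capability,
# which the answer above suggests gdbserver needs in order to attach.
# Pod name, image, and file name are placeholders.
cat > gdb-debug-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gdb-debug
spec:
  containers:
    - name: server
      image: example/server-with-gdbserver
      securityContext:
        capabilities:
          add: ["SYS_PTRACE"]
EOF

# Apply it with: kubectl apply -f gdb-debug-pod.yaml
```

After the pod is recreated with this setting, the kubectl exec variant of step 2 can be retried unchanged.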

Confd error: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

While debugging I realised that confd doesn't pick up the keys and my journal looks like this:
Sep 18 18:31:50 ip-10-171-54-76.ec2.internal docker[24891]: [nginx] waiting for confd to refresh nginx.conf
Sep 18 18:31:56 ip-10-171-54-76.ec2.internal docker[24891]: 2014-09-18T18:31:56Z 9122c7a54edc confd[9572]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I use nsenter to log in to the running container to run some experiments for debugging purposes. I ran this command
confd -onetime -node 172.17.42.1:4001 -config-file /etc/confd/conf.d/nginx.toml
and received the same error as above:
confd[12894]: ERROR 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
I am totally clueless at this point. I am using EC2 with the stable version of CoreOS and I am sure that etcd is running on the host. Also, I can ping the host from inside the container successfully.
Any ideas on what's wrong?
Assistance will be much appreciated.
This error indicates that your etcd cluster isn't operating correctly, so confd has nothing to watch. It has probably lost quorum. The logs (journalctl -u etcd) should indicate what happened.
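Since ping only exercises ICMP, a TCP-level probe of the etcd client port can help separate a network problem from an unhealthy etcd. This is a minimal sketch, assuming bash and timeout are available in the container; peer_host, peer_port, and peer_reachable are hypothetical helpers, and the address is the -node value used above.

```shell
#!/bin/sh
# Split a HOST:PORT peer string into its two parts.
peer_host() { echo "${1%:*}"; }
peer_port() { echo "${1##*:}"; }

# peer_reachable HOST:PORT: succeed if a TCP connection can be opened.
# Relies on bash's /dev/tcp redirection and coreutils timeout.
# Hypothetical helper, not part of confd or etcd.
peer_reachable() {
    timeout 2 bash -c "exec 3<>/dev/tcp/$(peer_host "$1")/$(peer_port "$1")" 2>/dev/null
}

# Usage (commented out; run the first line from inside the container):
# peer_reachable 172.17.42.1:4001 && echo "port open" || echo "port closed"
# journalctl -u etcd -n 50 --no-pager   # on the host, as suggested above
```

If the port is open but confd still reports error 501, the etcd journal on the host is the next place to look.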