How does Kubernetes kubelet resource reservation work - amazon-web-services

I recently tried to bring up a Kubernetes cluster in AWS with kops. But when the worker node (Ubuntu 20.04) started, a docker load process on it kept getting OOM-killed even though the node has plenty of memory (~14 GiB). I tracked the issue down to having set the kubelet's memory reservation too small (--kube-reserved=memory=100Mi...).
So now I have two questions related to the following paragraph in the documentation:
kube-reserved is meant to capture resource reservation for kubernetes system daemons like the kubelet, container runtime, node problem detector, etc.
https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/#kube-reserved
First, I interpreted the "reservation" as "the amount of memory guaranteed", similar to a pod's .spec.resources.requests.memory. However, it seems the flag acts as a limit as well? Does this mean Kubernetes intends to manage the Kubernetes system daemons along the lines of the "Guaranteed" QoS class, where requests equal limits?
Also, my container runtime, docker, does not seem to be in the /kube-reserved cgroup; instead, it is in /system.slice:
$ systemctl status $(pgrep dockerd) | grep CGroup
CGroup: /system.slice/docker.service
So why is it getting limited by /kube-reserved? It is not even the kubelet talking to docker through the CRI; it is just my manual docker load command.

kube-reserved is a way to protect the Kubernetes system daemons (which include the kubelet) from running out of memory should the pods consume too much. How is this achieved? The pods are limited by default to an "allocatable" value, equal to the memory capacity of the node minus several flag values defined in the URL you posted, one of which is kube-reserved. Here's what this looks like for a 7-GiB DS2_v2 node in AKS (sketched below, since the original image doesn't reproduce here):
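The arithmetic, roughly, with the kube-reserved figure used later in this answer; the system-reserved and eviction values are illustrative placeholders, not AKS's exact defaults:

Allocatable = node capacity - kube-reserved - system-reserved - eviction threshold
            = 7168 MiB - 1638 MiB (kube-reserved) - 0 MiB (system-reserved, unset here) - 750 MiB (eviction threshold, placeholder)
            = 4780 MiB available for pod requests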
But it's not always the Kubernetes system daemons that have to be protected from either pods or even OS components consuming too much memory. It can very well be the Kubernetes system daemons that could consume too much memory and start affecting the pods or other OS components. To protect against this scenario, there's an additional flag defined:
To optionally enforce kube-reserved on kubernetes system daemons,
specify the parent control group for kube daemons as the value for
--kube-reserved-cgroup kubelet flag.
With this additional flag in place, should the aggregated memory use of the Kubernetes system daemons exceed the cgroup limit, the OOM killer will step in and terminate one of their processes. Applied to the example above, with the --kube-reserved-cgroup flag specified, the Kubernetes system daemons are prevented from going over 1,638 MiB.
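To make that concrete, a hedged sketch of the relevant kubelet flags (not the questioner's exact kops configuration; the cgroup name simply mirrors the /kube-reserved one from the question):

# Reserve ~1.6 GiB for the k8s daemons and enforce it on the cgroup that holds them
kubelet \
  --kube-reserved=cpu=100m,memory=1638Mi \
  --kube-reserved-cgroup=/kube-reserved \
  --runtime-cgroups=/kube-reserved \
  --kubelet-cgroups=/kube-reserved \
  --enforce-node-allocatable=pods,kube-reserved

The limit only applies to processes that actually live under the cgroup named by --kube-reserved-cgroup, so where dockerd ends up (a runtime cgroup vs. /system.slice) determines whether it is constrained by that reservation.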

Related

What is the recommended EC2 instance for Istio bookinfo sample application?

I have an EKS cluster on AWS with Istio installed. The first time I installed Istio, I used one m3.large EC2 instance and some Istio services stayed pending; the ingress-gateway pod's status was showing Pending.
I described the pod and saw an "Insufficient cpu" error. I increased the EC2 instance to m5.large and every pod started running.
We are on staging and this is not live yet, yet we are already spending almost 3 times our initial cost.
Can someone please recommend an EC2 instance that can comfortably get Istio up and running? Let's take the bookinfo sample application as the reference.
Type     Reason            Age                   From               Message
----     ------            ----                  ----               -------
Warning  FailedScheduling  2m33s (x60 over 12m)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.
It seems provisioning 2 m5.large instances works perfectly, but this is incurring more cost. Each m5.large costs 0.107 USD/hour, which is about 77 USD/month.
Having two m5.large instances will incur even more cost just to run 15 pods (5 of them custom).
Non-terminated Pods: (15 in total)
The deployment is made up of a different number of components. Some of
them, as pilot, have a large impact in terms of memory and CPU, so it
is recommended to have around 8GB of memory and 4 CPUs free in your
cluster. Obviously, all components have requested resources defined,
so if you don’t have enough capacity you will see pods not starting.
You are currently using an m5.large, whose spec is:
m5.large     2 vCPU    8 GiB memory    EBS-only
so on the basis of the above requirement, you need at least:
m5.xlarge    4 vCPU    16 GiB memory   EBS-only
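If you want to sanity-check candidate instance types from the CLI before committing, something like this works (the instance list is just an example; the query fields are standard describe-instance-types output):

# vCPUs and memory (MiB) for a few candidate instance types
aws ec2 describe-instance-types \
  --instance-types m5.large m5.xlarge c5.xlarge c5.2xlarge \
  --query 'InstanceTypes[].[InstanceType,VCpuInfo.DefaultVCpus,MemoryInfo.SizeInMiB]' \
  --output table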
If your application needs heavy computation, then you may try a compute-optimized instance.
Compute optimized instances are ideal for compute-bound applications
that benefit from high-performance processors. They are well suited
for the following applications:
Batch processing workloads
Media transcoding
High-performance web servers
High-performance computing (HPC)
Scientific modeling
Dedicated gaming servers and ad serving engines
Machine learning inference and other compute-intensive applications
The compute-optimized-instances documentation and the recommendations for deploying Istio on AWS and Azure might help you:
https://aws.amazon.com/blogs/opensource/getting-started-istio-eks/
If you look at the AWS instance types listing, an m5.large instance is pretty small: it only has 2 CPU cores. On the other hand, if you look at the kubectl get pods --all-namespaces listing, you can see there are quite a few pods involved in running the core Kubernetes system (and several of those are replicated on each node in a multi-node installation).
If 2 cores isn't enough, you can try picking larger instance sizes; if 2x m5.large works, then 1x m5.xlarge gives you the same total resources at the same cost and is a little easier to schedule on to. If you're just running demo applications like this, then the "c" family has half the memory (2 GiB per core) and is slightly cheaper, so you might try a c5.xlarge or c5.2xlarge.
For medium-sized workloads, I'd suggest figuring out your total cluster requirements (based on either pods' resource requests or actual statistics from a tool like Prometheus); dividing that across some number of worker nodes, such that losing one won't be a significant problem (maybe 7 or 9); then selecting the instance size that fits that. It will be easier to run on fewer, larger nodes than more, smaller nodes (there are more places to fit that one pod that requires 8 GB of RAM).
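To get those totals, the usual starting points are the scheduler's own per-node accounting and the pods' declared requests (a sketch; the grep window size is arbitrary):

# Requested vs. allocatable CPU/memory, per node
kubectl describe nodes | grep -A 8 "Allocated resources"

# Dump every pod's requests if you want to sum them yourself
kubectl get pods --all-namespaces \
  -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\t"}{.spec.containers[*].resources.requests}{"\n"}{end}'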
(I routinely need to allocate 4-8 GB of memory for desktop environments like Docker Desktop for Mac or kind and still find it cramped; CPU isn't usually my limitation but I could easily believe that 2 cores and 8 GiB of RAM isn't enough.)
(And yes, AWS is pretty expensive for personal projects without an obvious revenue stream attached to them. You could get that m5.large instance for about $500/year if you were willing to pay that amount up front but that can still be a lot of money to just play around with things.)
TL;DR: for many deployments the default requests in Istio are extremely greedy. You need to override these with your own values.yaml (assuming you're using Helm) and monitor how much resource Istio is actually using. Using bigger and bigger instance types is a bad solution (unless you really do consume the default requests, or you like spraying money against a wall).
The problem is that Istio, when using the default profiles, makes some very large Requests. This means that even if your actual utilisation is low, Kubernetes will refuse to schedule many of the Istio control plane components unless that requested capacity is free.
[I'm assuming you're familiar with Kubernetes requests. If not, these are declarations in the pod yaml that say "this pod needs x cpu and y memory to run comfortably". The Kubernetes pod scheduler will then ensure that the pod is scheduled to a node that has sufficient resource. The problem is, many people stick their finger in the air and put massive values in "to be sure". But this means that huge chunks of your available resource are being wasted if the pods don't actually need that much to be comfortable.]
In addition, each sidecar makes a sizeable Request as well, piling on the pressure.
This will be why you're seeing pods stuck in pending.
I'm not 100% convinced that the default requests set by the Istio team are actually that reasonable [edit: for bookinfo, they're certainly not. I suspect the defaults are set for multi-thousand-node estates]. I would recommend that, before boosting your instance sizes (and therefore your costs), you look into reducing the requests made by the Istio control and data plane.
If you then find your Istio components are being evicted often, then you've gone too far.
Example: using the supplied Helm values.yaml file here, we have for each sidecar:
requests:
  cpu: 100m
  memory: 128Mi
(Lines 155-157).
More worryingly, the default memory request for Pilot is 2 GiB! That means you're going to be giving away a massive chunk (or maybe the whole) of a node. And that's just for Pilot - the same story is true for Galley, Citadel, Telemetry, etc, etc, etc.
You need to monitor a running cluster to see whether these values can be reduced. For example, I have a reasonably busy cluster (way more complicated than the wretched bookinfo), and the metrics server is telling me Pilot's CPU usage is 8 millicores(!) and its memory 62Mi. So if I'd blindly stuck with the defaults, which most people do, I'd be wasting nearly 2 GiB of memory and half a CPU.
See my output here; I stress this is from a long-running, production-standard cluster:
[ec2-user@ip-172-31-33-8 ~]$ kubectl top pod -n istio-system
NAME                                      CPU(cores)   MEMORY(bytes)
istio-citadel-546575dc4b-crnlj            1m           14Mi
istio-galley-6679f66459-4rlrk             19m          17Mi
istio-ingressgateway-b9f65784b-k64th      1m           22Mi
istio-pilot-67bfb94df4-j7vld              8m           62Mi
istio-policy-598b768ddc-cvs2b             5m           39Mi
istio-sidecar-injector-578bc4cc74-n5v6w   11m          7Mi
istio-telemetry-cd6fddc4b-lt8rl           27m          57Mi
prometheus-6ccfbc78c-w4dd6                25m          497Mi
A more readable guide to the defaults is here. Run through the requests for the whole of the control plane and add up the required CPU and memory. It's a lot of resource.
This is hard work, but you need to sit down and work out what each component really needs, set up your own values.yaml and generate your own yaml for Istio (along the lines of the sketch below). The demo yamls provided by Istio are not reasonable, especially for Mickey Mouse apps like bookinfo, which should be taken out the back and put out of its misery. Bear in mind Istio was originally developed alongside massive multi-thousand-node clusters.
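A minimal sketch of the kind of override I mean, assuming you start from the Helm chart shipped in the Istio release archive (the key paths like pilot.resources and global.proxy.resources are assumptions to verify against the values.yaml of your chart version, and the numbers are placeholders you would derive from kubectl top):

# Write a trimmed values file after measuring real usage
cat > my-values.yaml <<'EOF'
pilot:
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
global:
  proxy:
    resources:
      requests:
        cpu: 50m
        memory: 64Mi
EOF

# Render the chart with the overrides layered on top of the defaults (Helm 2 syntax)
helm template install/kubernetes/helm/istio \
  --name istio --namespace istio-system \
  --values my-values.yaml > istio-custom.yaml
kubectl apply -f istio-custom.yaml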

How can I scale CloudFoundry applications "down" without the risk of restarting all of them?

This is a question regarding the Swisscom Application Cloud.
I have implemented a strategy to restart already deployed CloudFoundry applications without using cf restart APP_NAME. I wanted to be able to:
restart running applications without needing access to the app manifest, and
avoid them suffering any downtime.
The general concept looks like this:
1. cf scale APP_NAME -i 2
(increase the instance count of the app from 1 to 2, then wait for all app instances to be running)
2. cf restart-app-instance APP_NAME 0
(restart the "old" app instance, then wait for all app instances to be running again)
3. cf scale APP_NAME -i 1
(decrease the instance count of the app back from 2 to 1)
This generally works and usually does what I expect it to do. The problem I am having occurs at Stage (3), where sometimes instead of just scaling the instance count back, CloudFoundry will also restart all (remaining) instances.
I do not understand:
Why does this happen only sometimes (all remaining instances restarting when scaling down)?
Shouldn't CloudFoundry keep the remaining instances up and running?
If cf scale is not able to keep perfectly healthy running app instances alive, when is it useful?
Please Note:
I am well aware of the Bluegreen / Autopilot plugins for zero-down-time deployment of applications in CloudFoundry and I am actually using them for our deployments from our build server, but they require me to provide a manifest (and additional credentials), which in this case I don't have access to (unless I can somehow extract it from a running app via cf create-app-manifest?).
Update 1:
Looking at the plugins again I found bg-restage, which apparently does approximately what I want, but I have no idea how reliable that one is.
Update 2:
I have concluded that it's probably an obscure issue (or bug) in CloudFoundry and that cf scale gives no guarantee that existing instances remain running. As pointed out above, I have since realised that it is indeed very much possible to generate the app manifest on the fly (cf create-app-manifest), and even though I couldn't use the bg-restage plugin without errors, I reverted back to the blue-green-deploy plugin, which I can now hand a freshly generated manifest to, avoiding this whole cf scale exercise.
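For anyone else landing here, the manifest-on-the-fly approach looks roughly like this (a sketch; cf create-app-manifest's -p flag writes the manifest to a path, while the -f manifest flag of blue-green-deploy is an assumption to check against your plugin version):

cf create-app-manifest APP_NAME -p /tmp/APP_NAME-manifest.yml
cf blue-green-deploy APP_NAME -f /tmp/APP_NAME-manifest.yml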
Comment Questions:
Why do you have the need to restart instances of your application?
We are caching some values from persistent storage on start-up. This restart happens when changes to that data are detected.
Can you share some information about the health check?
We are using all types of health checks, depending on which app is to be restarted (http, process and port). I have observed this issue only for apps with the http health check. I also have a http-endpoint defined for the health check.
Are you trying to alter the memory with cf scale as well?
No, I am trying to keep all app configuration the same during this process.
When you have two running instances, the command
cf scale <APP> -i 1
will kill instance #1 and instance #0 will not be affected.

Running multiple app instances on a single container in PCF

We have an internal installation of PCF.
A developer wants to push a stateless (obeys 12-factor rules) Node.js app which will spawn other worker processes, i.e. leverage Node.js clustering as per https://nodejs.org/api/cluster.html. Hence there would be multiple processes running in each container. Are there any known issues with this from a PCF perspective? I appreciate it violates the rule/suggestion of one app process per container, but that is just a suggestion :) All info welcome.
Regards
John
When running an application on Cloud Foundry that spawns child processes, the number one thing you need to watch out for is memory consumption. You set a memory limit when you push your application which is for the entire container. That includes the parent process, whatever child processes are spawned and a little overhead (currently init process, sshd process & health check).
Why is this a problem? Most buildpacks make the assumption that only one process will be running and that it will consume as much memory as possible while staying under the defined memory limit. They attempt to configure the software which is running your application to do this. When you spawn child processes, this breaks the buildpack's assumptions and can create scenarios where your application will exceed the defined memory limit. When this happens, even by one byte, the process will be killed and restarted.
If you're concerned with scaling your application, you should not try to spin off child processes in one extra large container. Instead, let the platform help you and scale up the number of application instances. The platform can easily do this and by using multiple smaller containers you can scale just as well. In fact, if you already have a 12-factor app, it should be well positioned to work in this manner.
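In practice that just means keeping one Node.js process per container and letting cf scale do the horizontal part (a minimal sketch; the app name, memory size and start command are made up):

# One process per container, modest per-instance memory limit
cf push my-node-app -m 512M -i 1 -c "node server.js"

# Scale out by adding instances instead of forking inside the container
cf scale my-node-app -i 4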
Good luck!

Simplest way to provide memory, disk and CPU isolation without downloading images

I am familiar with Docker, Rkt and LXD, but if I did not have the ability to install all these tools, what would be the basic mechanisms to provide isolation of CPU, memory and Disk for a particular process?
CPU - I want to say that only one of the two sockets is usable by this process
Memory - I don't want this process to use more than 10GB memory
Disk - I don't want the process to use more than 100GB of disk, and it should not have visibility (ls should not list them) of files that were not created by it
I think installing Docker, rkt and what-not is a very heavyweight solution for the basic thing I am trying to accomplish.
Is cgroups the underlying API I should tap into to get what I need? If so, is there a good book to learn about cgroups?
I am running on EC2 - RHEL and Ubuntu both.
See the man page for cgroups(7) for an introduction; the full documentation of the cgroup interface is maintained in the Linux kernel:
cgroup v1
cgroup v2
On top of that, on a distribution with systemd and the cgroup v2 interface, cgroup features should be used via systemd and not directly. See also the man page for systemd.resource-control.
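For example, on a systemd-based host you can put a process under resource control without touching the cgroup filesystem directly (a sketch; MemoryMax requires the cgroup v2/unified hierarchy, older systemd versions use MemoryLimit instead, and AllowedCPUs needs a fairly recent systemd):

# Transient scope with a 10G memory cap, pinned to CPUs 0-7 (one socket; check lscpu)
systemd-run --scope -p MemoryMax=10G -p AllowedCPUs=0-7 ./my-batch-job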
For distribution specific information, see:
RHEL 6 Resource Management Guide
RHEL 7 Resource Management Guide
Quick answers to your questions
I want to say that only 1 socket of the two is usable by this process
This can be done via the cpuset controller of cgroup v1 (on both RHEL 6 and RHEL 7); see the sketch after these answers.
I don't want this process to use more than 10GB memory
See the memory controller of the cgroup v1 interface or MemoryLimit in the systemd resource-control interface.
I don't want the process to use more than 100GB of disk
This is outside cgroups' area of control; use disk quotas instead.
have visibility (ls should not list it) of files that are not created by this process
This is also outside cgroups functionality; use filesystem access rights, filesystem (mount) namespaces, or the PrivateTmp systemd service option, depending on your use case.
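Putting the cgroup parts together, a raw cgroup v1 sketch (assumes the v1 hierarchy mounted under /sys/fs/cgroup as on RHEL 6/7, that CPUs 0-7 and NUMA node 0 correspond to one socket per lscpu, and an XFS filesystem with project quotas already set up for the disk part):

# Confine the process to one socket's CPUs and memory node
mkdir /sys/fs/cgroup/cpuset/mygroup
echo 0-7 > /sys/fs/cgroup/cpuset/mygroup/cpuset.cpus
echo 0 > /sys/fs/cgroup/cpuset/mygroup/cpuset.mems
echo "$PID" > /sys/fs/cgroup/cpuset/mygroup/tasks

# Cap its memory at 10 GB
mkdir /sys/fs/cgroup/memory/mygroup
echo 10G > /sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes
echo "$PID" > /sys/fs/cgroup/memory/mygroup/tasks

# Disk space is outside cgroups; e.g. an XFS project quota of 100 GB
xfs_quota -x -c 'limit -p bhard=100g myproject' /data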

Changes to ignite cluster membership unexplainable

I am running a 12-node JVM Ignite cluster. Each JVM runs on its own VMware node. I am using ZooKeeper to keep these Ignite nodes in sync using TCP discovery, and I have been seeing a lot of node failures in the ZooKeeper logs.
Although the Java processes are running, I don't know why some Ignite nodes leave the cluster with "node failed" kinds of errors. VMware uses vMotion to do something they call "migration"; I am assuming that is some kind of filesystem sync process between VMware nodes.
I am also seeing pretty frequent "dumping pending object" and "Failed to wait for partition map exchange" kinds of messages in the JVM logs for Ignite.
My env setup is as follows:
Apache Ignite 1.9.0
RHEL 7.2 (Maipo) runs on each of the 12 nodes
Oracle JDK 1.8
Zookeeper 3.4.9
Please let me know your thoughts.
TIA
There are generally two possible reasons:
Memory issues. For example, if a node goes into a long GC pause, it can become unresponsive and therefore be removed from the topology. For more details read here: https://apacheignite.readme.io/docs/jvm-and-system-tuning (see the JVM tuning sketch after this list).
Network connectivity issues. Check if the network between your VMs is stable. You may also want to try increasing the failure detection timeout: https://apacheignite.readme.io/docs/cluster-config#failure-detection-timeout
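On the GC side, a sketch of the kind of JVM options people typically use to reduce and diagnose long pauses on Ignite data nodes (generic G1 and GC-logging flags for JDK 8, not values taken from the Ignite documentation; the heap size, log path and config path are placeholders, and it assumes you start nodes with ignite.sh, which picks options up from JVM_OPTS):

# Fixed heap, G1 collector, and GC logging so long pauses are visible in the logs
export JVM_OPTS="-Xms8g -Xmx8g -XX:+UseG1GC -XX:+AlwaysPreTouch \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/ignite-gc.log"
./ignite.sh config/my-ignite-config.xml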
VM Migrations sometimes involve suspending the VM. If the VM is suspended, it won't have a clean way to communicate with the rest of the cluster and will appear down.