We have a DaemonSet and we want to make it HA (it is not our DaemonSet). Is the following also applicable to HA for a DaemonSet?
affinity (anti-affinity)
tolerations
PDB (PodDisruptionBudget)
We have 3 worker nodes on each cluster.
I did this in the past for a Deployment, but I am not sure what also applies to a DaemonSet. This is not our app, but we need to make sure it is HA because it is a critical app.
update
Does it make sense to add the following to the DaemonSet? Let's say I have 3 worker nodes and I want it to be scheduled only on the foo worker nodes.
spec:
  tolerations:
  - effect: NoSchedule
    key: WorkGroup
    operator: Equal
    value: foo
  - effect: NoExecute
    key: WorkGroup
    operator: Equal
    value: foo
  nodeSelector:
    workpcloud.io/group: foo
You have asked two, somewhat unrelated questions.
is the following also applicable to HA for a DaemonSet?
affinity (anti-affinity)
tolerations
PDB (PodDisruptionBudget)
A DaemonSet (generally) runs on a policy of "one pod per node" -- you CAN'T make it HA (for example, by using autoscaling), and you will (assuming you use defaults) have as many replicas of the DaemonSet as you have nodes, unless you explicitly specify which nodes you want to run the DaemonSet pods on, using things like nodeSelector and/or tolerations, in which case you will have fewer pods. The documentation page linked above gives more details and has some examples.
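As for your update: yes, combining tolerations with a nodeSelector is the usual way to restrict a DaemonSet to a subset of nodes. A minimal sketch, reusing the label and taint keys from the question (the name and image are placeholders; adjust everything to your environment):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: critical-agent              # placeholder name
spec:
  selector:
    matchLabels:
      app: critical-agent
  template:
    metadata:
      labels:
        app: critical-agent
    spec:
      nodeSelector:
        workpcloud.io/group: foo    # only nodes carrying this label are eligible
      tolerations:                  # ...and the matching taints are tolerated
      - key: WorkGroup
        operator: Equal
        value: foo
        effect: NoSchedule
      - key: WorkGroup
        operator: Equal
        value: foo
        effect: NoExecute
      containers:
      - name: agent
        image: example.org/critical-agent:latest   # placeholder image

With this in place you get exactly one pod per matching node, so "HA" for a DaemonSet really means "how many labelled nodes do I have".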
this is not our app but we need to make sure it is HA as it is a critical app
Are you asking how to make your critical app HA? I'm going to assume you are.
If the app is as critical as you say, then a few starter recommendations:
Make sure you have at least 3 replicas (4 is a good starter number)
Add tolerations if you must schedule those pods on a node pool that has taints
Use node selectors as needed (e.g. for regions or zones, but only if necessary due to something like disks being present in those zones)
Use affinity to group or spread your replicas. I would definitely recommend using a spread so that if one node goes down, the other replicas are still up
Use a pod priority to indicate to the cluster that your pods are more important than other pods (beware this may cause issues if you set it too high)
Set up notifications to something like PagerDuty, OpsGenie, etc, so you (or your ops team) are notified if the app goes down. If the app is critical, then you'll want to know it's down ASAP.
Set up PodDisruptionBudgets and Horizontal Pod Autoscalers to ensure an agreed number of pods is always up (a minimal sketch follows below).
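A minimal sketch pulling a few of these points together (all names, labels, and numbers are placeholders, not the real app's manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: critical-app                      # placeholder name
spec:
  replicas: 3                             # at least 3 replicas
  selector:
    matchLabels:
      app: critical-app
  template:
    metadata:
      labels:
        app: critical-app
    spec:
      topologySpreadConstraints:          # spread replicas across nodes
      - maxSkew: 1
        topologyKey: kubernetes.io/hostname
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: critical-app
      containers:
      - name: app
        image: example.org/critical-app:1.0   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: 2                         # voluntary disruptions may take down at most one pod
  selector:
    matchLabels:
      app: critical-app

The spread constraint keeps replicas on different nodes where possible, and the PDB stops voluntary disruptions (e.g. node drains) from taking more than one pod down at a time.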
You cannot control the number of replicas in a DaemonSet, as a DaemonSet runs one pod per node.
You need to change the object to either a Deployment or a StatefulSet to manage the replica count, and use a nodeSelector to control which nodes it is deployed on.
Related
This must sound like a real noob question. I have a cluster-autoscaler and a cluster overprovisioner set up in my k8s cluster (via Helm). I want to see the autoscaler and overprovisioner actually kick in, but I am not able to find any leads on how to accomplish this.
Does anyone have any ideas?
You can create a Deployment that runs a container with a CPU-intensive task. Set it initially to a small number of replicas (perhaps < 10) and start increasing the number of replicas with:
kubectl scale deployment/your-deployment --replicas=11
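If you need a ready-made CPU-intensive workload, a throwaway Deployment along these lines should do (the image, command, and resource numbers are placeholders; the busy loop just burns CPU, and the CPU request is what makes the scheduler run out of capacity):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cpu-burner
spec:
  replicas: 3                      # start small, then kubectl scale it up
  selector:
    matchLabels:
      app: cpu-burner
  template:
    metadata:
      labels:
        app: cpu-burner
    spec:
      containers:
      - name: burner
        image: busybox
        command: ["sh", "-c", "while true; do :; done"]   # busy loop to consume CPU
        resources:
          requests:
            cpu: "500m"            # sizeable request so nodes fill up quickly

Once the scaled-up replicas no longer fit on the existing nodes, the overprovisioner's low-priority placeholder pods should be preempted first, and the Cluster Autoscaler should then add nodes for the remaining pending pods.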
Edit:
How can you tell that the Cluster Autoscaler has kicked in?
There are three ways you can determine what the CA is doing: by watching the CA pods' logs, by checking the content of the kube-system/cluster-autoscaler-status ConfigMap, or via Events.
We have a Flask application that is served via gunicorn, using the eventlet worker. We're deploying the application in a kubernetes pod, with the idea of scaling the number of pods depending on workload.
The recommended setting for the number of gunicorn workers is 2-4 x $NUM_CPUS (see docs). I've previously deployed services on dedicated physical hardware where such calculations made sense. On a 4-core machine, having 16 workers sounded OK, and we eventually bumped it to 32 workers.
Does this calculation still apply in a Kubernetes pod using an async worker, particularly as:
There could be multiple pods on a single node.
The same service will be run in multiple pods.
How should I set the number of gunicorn workers?
Set it to -w 1 and let kubernetes handle the scaling via pods?
Set it to 2-4 x $NUM_CPU on the kubernetes nodes. On one pod or multiple?
Something else entirely?
Update
We decided to go with the 1st option, which is our current approach. Set the number of gunicorn workers to 1, and scale horizontally by increasing the number of pods. Otherwise there will be too many moving parts, and we won't be leveraging Kubernetes to its full potential.
For better visibility, here is the final solution chosen by the original author of this question (as of 2019):
Set the number of gunicorn workers to 1 (-w 1), and scale horizontally by increasing the number of pods (using the Kubernetes HPA).
Considering that this might not remain applicable in the near future, given the fast growth of workload-related features in the Kubernetes platform (for example, some Kubernetes distributions already offer, besides HPA, Vertical Pod Autoscaling (VPA) and Multidimensional Pod Autoscaling (MPA)), I propose to continue this thread in the form of a community wiki post.
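A minimal sketch of that approach (the image, port, and resource values are placeholders; the essential parts are -w 1 and letting the pod count do the scaling):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flask-app                  # placeholder name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: flask-app
  template:
    metadata:
      labels:
        app: flask-app
    spec:
      containers:
      - name: web
        image: example.org/flask-app:latest   # placeholder image
        # one gunicorn worker per pod, eventlet worker class as in the question
        command: ["gunicorn", "-w", "1", "-k", "eventlet", "-b", "0.0.0.0:8000", "app:app"]
        ports:
        - containerPort: 8000
        resources:
          requests:                # sized for a single worker
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"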
I'm not a developer, and this does not seem like a simple task, but for your consideration, please follow the best practices for better performance by optimizing your Gunicorn config.
In addition, Kubernetes offers different mechanisms to scale your deployment, like HPA based on CPU utilization (see also: How is Python scaling with Gunicorn and Kubernetes?).
You can also use resource requests and limits for Pods and Containers.
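For example, a CPU-based HPA for such a gunicorn Deployment might look like this (the Deployment name and thresholds are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: flask-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: flask-app                # placeholder Deployment name
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale out when average CPU exceeds 70% of requests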
As per Gunicorn documentation
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
Update:
Depending on your approach, you can choose a different solution (Deployment, DaemonSet); all of the above can be achieved in Kubernetes by assigning CPU resources to containers and Pods accordingly (see Assigning CPU Resources to Containers and Pods).
Using a Deployment with resources (limits, requests) gives you the possibility to scale your app into multiple pods on a single node based on your hardware limits, but depending on your "app load" it may not be a good enough solution.
CPU requests and limits are associated with Containers, but it is useful to think of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
Note:
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to:
e.g. 1 GCP Core.
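To illustrate the quoted rule, a Pod with two containers requesting 250m and 750m CPU requests 1 CPU in total (placeholder names and images):

apiVersion: v1
kind: Pod
metadata:
  name: sum-example
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "250m"
  - name: sidecar
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "750m"                # 250m + 750m = 1 CPU requested for the whole Pod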
As mentioned in the post, the second approach (scaling your app across multiple nodes) is also a good choice. In this case you can consider using e.g. a StatefulSet or a Deployment. In addition, on GKE, using the cluster autoscaler you can achieve a more extensible solution: when you try to create new Pods that don't have enough capacity to run inside the cluster, the cluster autoscaler automatically adds additional resources.
On the other hand, you can consider using other solutions like Cerebral, which gives you the possibility to create user-defined policies for increasing or decreasing the size of node pools inside your cluster.
GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, GKE can delete the node.
Please keep in mind that the question is very general and there is no one good answer for this topic. You should consider all pros and cons based on your requirements, load, activity, capacity, costs ...
Hope this helps.
We have configured Kubernetes cluster on EC2 machines in our AWS account using kops tool (https://github.com/kubernetes/kops) and based on AWS posts (https://aws.amazon.com/blogs/compute/kubernetes-clusters-aws-kops/) as well as other resources.
We want to set up a K8s cluster of masters and slaves such that:
It will automatically resize (both masters as well as nodes/slaves) based on system load.
Runs in Multi-AZ mode i.e. at least one master and one slave in every AZ (availability zone) in the same region for e.g. us-east-1a, us-east-1b, us-east-1c and so on.
We tried to configure the cluster in the following ways to achieve the above.
Created a K8s cluster on AWS EC2 machines using kops with the following configuration: node count=3, master count=3, zones=us-east-1c,us-east-1b,us-east-1a. We observed that a K8s cluster was created with 3 master and 3 slave nodes, with one master and one slave in each of the 3 AZs.
Then we tried to resize the Nodes/slaves in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). We set the node_asg_min to 3 and node_asg_max to 5. When we increased the workload on the slaves such that auto scale policy was triggered, we saw that additional (after the default 3 created during setup) slave nodes were spawned, and they did join the cluster in various AZ’s. This worked as expected. There is no question here.
We also wanted to set up the cluster such that the number of masters increases based on system load. Is there some way to achieve this? We tried a couple of approaches and results are shared below:
A) We were not sure if the cluster-auto scaler helps here, but nevertheless tried to resize the Masters in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). This is useful while creating a new cluster but was not useful to resize the number of masters in an existing cluster. We did not find a parameter to specify node_asg_min, node_asg_max for Master the way it is present for slave Nodes. Is there some way to achieve this?
B) We increased the MIN count from 1 to 3 in the ASG (auto-scaling group) associated with one of the three IGs (instance groups), one per master. We found that new instances were created. However, they did not join the master cluster. Is there some way to achieve this?
Could you please point us to steps, resources on how to do this correctly so that we could configure the number of masters to automatically resize based on system load and is in Multi-AZ mode?
Kind regards,
Shashi
There is no need to scale Master nodes.
Master components provide the cluster’s control plane. Master components make global decisions about the cluster (for example, scheduling), and detecting and responding to cluster events (starting up a new pod when a replication controller’s ‘replicas’ field is unsatisfied).
Master components can be run on any machine in the cluster. However, for simplicity, set up scripts typically start all master components on the same machine, and do not run user containers on this machine. See Building High-Availability Clusters for an example multi-master-VM setup.
Master node consists of the following components:
kube-apiserver
Component on the master that exposes the Kubernetes API. It is the front-end for the Kubernetes control plane.
etcd
Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.
kube-scheduler
Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on.
kube-controller-manager
Component on the master that runs controllers.
cloud-controller-manager
runs controllers that interact with the underlying cloud providers. The cloud-controller-manager binary is an alpha feature introduced in Kubernetes release 1.6.
For more detailed explanation please read the Kubernetes Components docs.
Also, if you are thinking about HA, you can read about Creating Highly Available Clusters with kubeadm.
I think your assumption is that, similar to Kubernetes nodes, masters divide the work between each other. That is not the case, because the main task of the masters is to maintain consensus with each other. This is done with etcd, which is a distributed key-value store. Maintaining such a store is easy for 1 machine but gets harder the more machines you add.
The advantage of adding masters is being able to survive more master failures, at the cost of having to make all masters fatter (more CPU/RAM, etc.) so that they perform well enough.
I am currently using kops to create AWS EC2 clusters. But it does not seem to have an option to specify 'spot' instances.
Does anybody know how to create instances of type 'spot' with kops or with kubernetes?
From the docs
https://github.com/kubernetes/kops/blob/master/docs/instance_groups.md#converting-an-instance-group-to-use-spot-instances
Follow the normal procedure for reconfiguring an InstanceGroup, but
set the maxPrice property to your bid. For example, "0.10" represents
a spot-price bid of $0.10 (10 cents) per hour.
So after kops create cluster but before kops update cluster --yes, run kops edit ig nodes --name $NAME and set maxPrice to your max bid.
metadata:
  creationTimestamp: "2016-07-10T15:47:14Z"
  name: nodes
spec:
  machineType: t2.medium
  maxPrice: "0.01"
  maxSize: 3
  minSize: 3
  role: Node
It appears that gardener/machine-controller-manager could be taught about Spot instances fairly easily, and there is an existing issue to do just such a thing. I can't recall off-hand if that is the Node Controller Manager I recalled seeing, or if it is merely a Node Controller Manager and thus there may be other implementations of that idea which already include spot support.
That makes a presumption that you actually meant spot for the workers, and not for the whole cluster. If you mean the whole cluster, then you may be much, much happier with something like kubespray and use that to lay a functioning cluster on top of existing machines. Just bear in mind that while kubernetes certainly is resilient to "damage," including the loss of a master, an etcd member, and without question the loss of a Node, it might frown if a huge portion of its machines vanish at once. In other words: using spot could mean that you spend more programmer/devops/glucose triaging spot disappearance, or you have to so vastly overprovision replicas that it starts to eat into the savings from spot in the first place.
Consider a StatefulSet (Cassandra, using the official K8S example) across 3 availability zones:
cassandra-0 -> zone a
cassandra-1 -> zone b
cassandra-2 -> zone c
Each Cassandra pod uses an EBS volume. So there is automatically an affinity. For instance, cassandra-0 cannot move to "zone-b" because its volume is in "zone-a". All good.
If some Kubernetes nodes/workers fail, they will be replaced. The pods will start again on the new node and have their EBS volume re-attached, looking like nothing happened.
Now, if the entire AZ "zone-a" goes down and is unavailable for some time (meaning cassandra-0 cannot start anymore due to the affinity for its EBS volume in the same zone), you are left with:
cassandra-1 -> zone b
cassandra-2 -> zone c
Kubernetes will never be able to start cassandra-0 for as long as "zone-a" is unavailable. That's all good because cassandra-1 and cassandra-2 can serve requests.
Now if, on top of that, another K8S node goes down or you have set up auto-scaling of your infrastructure, you could end up with cassandra-1 or cassandra-2 needing to move to another K8S node.
It shouldn't be a problem.
However, from my testing, K8S will not do that because the pod cassandra-0 is offline. It will never self-heal cassandra-1 or cassandra-2 (or any cassandra-X) because it wants cassandra-0 back first. And cassandra-0 cannot start because its volume is in a zone which is down and not recovering.
So if you use a StatefulSet + VolumeClaim across zones
AND you experience an entire AZ failure
AND you experience an EC2 failure in another AZ or have auto-scaling of your infrastructure
=> then you will lose all your Cassandra pods until zone-a is back online
This seems like a dangerous situation. Is there a way for a StatefulSet to not care about the order and still self-heal, or start more pods (cassandra-3, 4, 5, X)?
Starting with Kubernetes 1.7 you can tell Kubernetes to relax the StatefulSet ordering guarantees using the podManagementPolicy option (documentation). By setting that option to Parallel, Kubernetes will no longer guarantee any ordering when starting or stopping pods and will start pods in parallel. This can have an impact on your service discovery, but should resolve the issue you're talking about.
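A minimal sketch of where that option lives in the manifest (everything apart from podManagementPolicy is a placeholder):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra
spec:
  serviceName: cassandra
  replicas: 3
  podManagementPolicy: Parallel    # start/stop pods in parallel instead of ordinally
  selector:
    matchLabels:
      app: cassandra
  template:
    metadata:
      labels:
        app: cassandra
    spec:
      containers:
      - name: cassandra
        image: cassandra:3.11      # placeholder version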
Two options:
Option 1: use podManagementPolicy and set it to Parallel.
Pod-1 and pod-2 will crash a few times until the seed node (pod-0) is available. This happens when creating the StatefulSet for the first time.
Also note that the Cassandra documentation used to recommend NOT creating multiple nodes in parallel, but it seems recent updates have made this no longer true: multiple nodes can be added to the cluster at the same time.
Issue found: if using 2 seed nodes, you will get a split-brain scenario. Each seed node will be created at the same time and create 2 separate logical Cassandra clusters.
Option 1b: use podManagementPolicy set to Parallel, together with an init container.
Same as option 1 but use an initContainer https://kubernetes.io/docs/concepts/workloads/pods/init-containers/.
The init container is a short-lived container whose role is to check that the seed node is available before starting the actual container. This is not required if we are happy for the pod to crash until the seed node is available again.
The problem is that the init container will always run, which is not required: we only want to ensure the Cassandra cluster was well formed the first time it was created; after that it does not matter.
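A sketch of such an init container, to go under the StatefulSet's pod template spec (the seed host name, namespace, and image are assumptions; it simply blocks until DNS for the seed resolves):

# fragment of spec.template.spec in the StatefulSet
initContainers:
- name: wait-for-seed
  image: busybox                   # placeholder image
  command:
  - sh
  - -c
  # loop until the seed's DNS record exists, i.e. the seed pod is up and registered
  - until nslookup cassandra-0.cassandra.default.svc.cluster.local; do echo waiting for seed; sleep 5; done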
Option 2: create 3 different StatefulSets.
One StatefulSet per AZ/rack. Each StatefulSet has constraints so it can run only on nodes in the specific AZ. I've also got 3 storage classes (again constrained to a particular zone) to make sure the StatefulSet does not provision EBS in the wrong zone (StatefulSets do not handle that dynamically yet).
In each StatefulSet I've got a Cassandra seed node (defined via the environment variable CASSANDRA_SEEDS, which populates SEED_PROVIDER at run time). That makes 3 seeds, which is plenty.
My setup can survive a complete zone outage thanks to replication-factor=3
Tips:
The list of seed nodes contains all 3 nodes, separated by commas:
"cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local, cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
Wait until the first seed (cassandra-a-0) is ready before creating the other 2 StatefulSets; otherwise you get a split-brain. This is only an issue when you create the cluster. After that, you can lose one or two seed nodes without impact, as the third one is aware of all the others.
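A condensed sketch of one of the three per-zone StatefulSets (the zone label and value, storage class name, image, and sizes are placeholders; the seed list and CASSANDRA_SEEDS usage follow the answer above):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cassandra-a                # repeat as cassandra-b / cassandra-c for the other zones
spec:
  serviceName: cassandra
  replicas: 1
  selector:
    matchLabels:
      app: cassandra
      zone: a
  template:
    metadata:
      labels:
        app: cassandra
        zone: a
    spec:
      nodeSelector:
        topology.kubernetes.io/zone: us-east-1a    # placeholder; use your cluster's zone label
      containers:
      - name: cassandra
        image: cassandra:3.11                      # placeholder version
        env:
        - name: CASSANDRA_SEEDS
          value: "cassandra-a-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-b-0.cassandra.MYNAMESPACE.svc.cluster.local,cassandra-c-0.cassandra.MYNAMESPACE.svc.cluster.local"
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ebs-zone-a                 # zone-constrained storage class (placeholder name)
      resources:
        requests:
          storage: 100Gi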
I think that if you can control the deployment of each pod (cassandra-0, cassandra-1, cassandra-2 with three different yaml deployment files), you can use podAffinity set to a specific zone for each pod.
Once a node in a zone fails and the pod running on that node has to be rescheduled, the affinity will force Kubernetes to deploy the pod on a different node in the same zone; if no nodes are available in that zone, Kubernetes will keep that pod down indefinitely.
For example, you may create a Kubernetes cluster with three different managedNodeGroups, one for each zone (label "zone": "a", "b", "c" for each group), with at least two nodes per group, and use podAffinity.
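A minimal sketch of pinning one such pod to zone "a". Note this uses node affinity on the zone label rather than pod affinity, which is the usual way to express "must run on a node in zone a" (the label key/value follows the managedNodeGroup labelling above; the image is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: cassandra-0
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: zone
            operator: In
            values: ["a"]          # only nodes labelled zone=a are eligible
  containers:
  - name: cassandra
    image: cassandra:3.11          # placeholder version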
Note: Do not use x1.32xlarge machines for the nodes :-)