k8s high availability configuration edge cases for prod

k8s high availability configuration edge cases for prod - amazon-web-services

we have an app in production which need to be highly available (100%),so we did the following:
We configure 3 instance as HA but then the node died
We configure anti-affinity (to run on differents nodes) but some update done on the nodes and we were unavailable(evicted) for some min.
Now we consider to add pod disruption Budget
https://kubernetes.io/docs/concepts/workloads/pods/disruptions/
My question are:
How the affinity works with pod disruption Budget, could be any collusion ? or this is redundant configs ?
is there any other configuration which I need to add to make sure that my pods run always (as much as possible )

How the affinity works with pod disruption Budget, could be any collusion ? or this is redundant configs ?
Affinity and Anti-affinity is about where your Pod is scheduled, e.g. so that two replicas of the same app is not scheduled to the same node. Pod Disruption Budgets is about to increase availability when using voluntary disruption e.g. maintenance. They are both related to making better availability for your app - but not related to eachother.
Is there any other configuration which I need to add to make sure that my pods run always (as much as possible)
Things will fail. What you need to do is to embrace distributed systems and make all your workload a distributed system, e.g. with multiple instances to remove single point of failure. This is done differently for stateless (e.g. Deployment) and stateful (e.g. StatefulSet) workload. What's important for you is that your app is available at much as possible, but individual instances (e.g. Pods) can fail, almost without that any user notice it.
We configure 3 instance as HA but then the node died
Things will always fail. E.g. a physical node may crash. You need to design your apps so that it can tolerate some failures.
If you use a cloud provider, you should use regional clusters that uses three independent Availability Zones and you need to spread your workload so that it runs in more than one Availability Zone - in this way, your app can tolerate that a whole Availability Zone is down without affecting your users.

Related

Cluster nodes only used by internal pods

We are using GKE to host our apps with Anthos, our default node pool ils set to autoscale but I noticed that out of 5 running pods, only 2 are hosting our actual services.
All the others are running internal services like this:
The issue with that is that there's not enough room for running our own services. I guess these are vital for the cluster otherwise the cluster would autoscale and the nodes would get removed.
What would be the best approach to solve this issue? I thought of upgrading the nodes machine type to allow more resources per node and have more room within them and thus have less running nodes, but I wanted to make sure I was not simply missing something on how GKE works.
I've been now digging for quite some time but it seems that would be my only option.

GKE itself requires several add-on resources which are deployed as part of your cluster. You can fine tune the resource usage of some of the GKE add-ons for smaller clusters. Additionally, Anthos each Anthos capability you enable typically deploys a set of controllers as well. GKE and Anthos try to minimize the compute resources used by these services / controllers, but you do need to account for them when calculating the right size(s) for your nodes. A good rule of thumb is to assume that system services/controllers will use ~1 vCPU when using GKE/Anthos (it's typically lower than that, but it makes things easier). So if your workloads all request >=1 vCPU, you'll likely need to use nodes that have a minimum of 4 vCPUs. You'll also want to enable the cluster autoscaler for your node pools if you don't want to pre-provision everything.
A better option would be to use node auto-provisioning as in this case you don't need to create/manage your own node pools as GKE will automatically add/remove nodes / node pools based on the resources requested by your deployments.

In GKE, how to minimize connect time with Load balancer

In GKE, for cost-saving, I usually put the node number to zero. When I autoscale nodes(or say add) and run the pods. It takes more than 6-7 mins to connect to Loadbalancer and up the URL. That's why health checks in the waiting state. Is there any way to reduce the time? Thanks

If Cloud Functions is not an option, you might want to look at Cloud Run (which supports containers and scales to zero) or GKE Autopilot (which does not scale to zero, but you can scale down to low resource and it will autoscale up and down as needed)

In short not really. Spinning up time of nodes is not easily controlled, basically it is the time that will take for the VM to be allocated, turned on, boot the OS and do some other stuff related to Kubernetes (like configuration, adding to node pool, etc) this takes time! In addition to Pods spinning up time which depends on the Docker image (size/dependencies etc).
Scaling down your application to zero nodes is not very recommended. It is always recommended to have some nodes up (don’t you have other apps running on the GKE cluster? Kubernetes clusters are recommended to have at least 3 nodes running).
Have you considered using Cloud Functions? Is it possible in your case? This the the best option I know of for a quick scale up and zero scale down.
And in general you can keep some kind of “ping” to the function to keep it “hot” for a relatively cheap price.
If none of the options above is possible (id say keeping your node pool with at least 3 nodes operating, is best as it is takes time for the Kubernetes control plan to boot). I suggest starting with reducing the spinning up time of your Pods by improving the Docker image - reducing its size etc.
Here are some articles on how to reduce Docker image size
https://phoenixnap.com/kb/docker-image-size
https://www.ardanlabs.com/blog/2020/02/docker-images-part1-reducing-image-size.html
After that I will experiment with different machine types for node to check which one is spinning the fastest - could be an interesting thing to do in any case
Here is an interesting comparison on VM spinning up times
https://www.google.com/amp/s/blog.cloud66.com/part-2-comparing-the-speed-of-vm-creation-and-ssh-access-on-aws-digitalocean-linode-vexxhost-google-cloud-rackspace-packet-cloud-a-and-microsoft-azure/amp/

Do we have anything similar to Azure "Availability Set" in GCP and AWS

Context :
We are prototyping a multi cloud deployment of our application (based on micro services).
For balancing between high availability and co location we used "Availability Sets" feature in Azure. Which kind off ensures that Azure platform/service upgrades doesn't happen in two distinct sets simultaneously.
Availability sets Azure
Scenario :
I couldn't find anything similar in Google Cloud Platform and AWS. So in this case we have to go with separate "Zones" for high availability.
One argument in favor of Availability sets ( theoretically) are they are kind of more closer that Zones as the former is inside an data center.
Do we have anything close to "availability sets" in GCP and AWS. Please share your thoughts.

Regarding GCP, there are several solutions for high-availability. In general it is recommended to Design Robust Systems prone to failures and Building scalable and resilient applications.
By designing robust systems you are insuring that your VMs are available in case of single instance failure, reboot of the instance or if there is an issue with the zone.
What looks most similar to Availability Sets is Managed Instance Groups.
The managed instance group auto-updater allows you to deploy new versions of software to instances in your MIG, supporting different rollout scenarios (rolling updates, canary updates). You can control the speed and scope of deployment as well as the level of disruption to your service.
Also you can use Regional Persistent Disk that replicates data across zones (datacenters).

It sounds like Placement Groups may be an equivalent feature in AWS. There are a few different configurations where you can ask AWS to cluster your instances very close to maximize network I/O performance or spread your instances across hardware to reduce correlated failures.
Cluster – packs instances close together inside an Availability Zone. This strategy enables workloads to achieve the low-latency network performance necessary for tightly-coupled node-to-node communication that is typical of HPC applications.
Partition – spreads your instances across logical partitions such that groups of instances in one partition do not share the underlying hardware with groups of instances in different partitions. This strategy is typically used by large distributed and replicated workloads, such as Hadoop, Cassandra, and Kafka.
Spread – strictly places a small group of instances across distinct underlying hardware to reduce correlated failures.
I can't speak for Google Cloud as I am not aware of a similar feature but I am also not nearly as familiar with their offerings.
Hope that helps.

Flask application scaling on Kubernetes and Gunicorn

We have a Flask application that is served via gunicorn, using the eventlet worker. We're deploying the application in a kubernetes pod, with the idea of scaling the number of pods depending on workload.
The recommended settings for the number of workers in gunicorn is 2 - 4 x $NUM_CPUS. See docs. I've previously deployed services on dedicated physical hardware where such calculations made sense. On a 4 core machine, having 16 workers sounds OK and we eventually bumped it to 32 workers.
Does this calculation still apply in a kubernetes pod using an async worker particularly as:
There could be multiple pods on a single node.
The same service will be run in multiple pods.
How should I set the number of gunicorn workers?
Set it to -w 1 and let kubernetes handle the scaling via pods?
Set it to 2-4 x $NUM_CPU on the kubernetes nodes. On one pod or multiple?
Something else entirely?
Update
We decided to go with the 1st option, which is our current approach. Set the number of gunicorn works to 1, and scale horizontally by increasing the number of pods. Otherwise there will be too many moving parts plus we won't be leveraging Kubernetes to its full potential.

For better visibility of the final solution chosen by original author of this question as of 2019 year
Set the number of gunicorn works to 1 (-w 1), and scale horizontally
by increasing the number of pods (using Kubernetes HPA).
and the fact it might be not applicable in the close future, taking into account fast growth of workload related features in Kubernetes platform, e.g. some distributions of Kubernetes propose beside HPA, Vertical Pod Autoscaling (VPA) and Multidimensional Pod autoscaling (MPA) too, so I propose to continue this thread in form of community wiki post.

I'am not developer and it seems not simple task, but for your considerations please follow bests practices for Better performance by optimizing Gunicorn config.
In addition in kubernetes there are different mechanisms in order to scale your deployment like HPA due to CPU utilization and (How is Python scaling with Gunicorn and Kubernetes?)
You can use also Resource requests and limits of Pod and Container.
As per Gunicorn documentation
DO NOT scale the number of workers to the number of clients you expect to have. Gunicorn should only need 4-12 worker processes to handle hundreds or thousands of requests per second.
Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
# update:
Depending on your approach you can choose different solution (deployment, daemonset) all above statements you can achieve in kubernetes by handling according Assigning CPU Resources to Containers and Pods
Using deployment with resources (limits,requests) give you possibility to resize your app into multiple pods on a single node based on your hardware limits but depending on your "app load" it can not be good enough solution.
CPU requests and limits are associated with Containers, but it is useful to think of a Pod as having a CPU request and limit. The CPU request for a Pod is the sum of the CPU requests for all the Containers in the Pod. Likewise, the CPU limit for a Pod is the sum of the CPU limits for all the Containers in the Pod.
Note:
The CPU resource is measured in CPU units. One CPU, in Kubernetes, is equivalent to:
f.e. 1 GCP Core.
As mentioned in the post the second approach (scaling your app into multiple nodes) it's also good choice. In this case you can cosnider using f.e. Statefulset or deployment in addition on GKE using "cluster austoscaler" you can achieve more extendable solution when you try to create new pods that don't have enough capacity to run inside the cluster. In this case cluster autoscaler automatically add additional resources.
On the other hand you can consider using different other solutions like Cerebral it gives you the possibility to create user-defined policies in order to increasing or decreasing the size of pools of nodes inside your cluster.
GKE's cluster autoscaler automatically resizes clusters based on the demands of the workloads you want to run. With autoscaling enabled, GKE automatically adds a new node to your cluster if you've created new Pods that don't have enough capacity to run; conversely, if a node in your cluster is underutilized and its Pods can be run on other nodes, GKE can delete the node.
Please keep in mind that the question is very general and there is no one good answer for this topic. You should consider all prons and cons based on your requirements, load, activity, capacity, costs ...
Hope this help.

Confusion about instances used inside a Amazon Ec2 Container Service

When a Ec2 Container Engine cluster is created, it creates a Compute Engine managed instance group to manage the created instances. These instances are from Ec2 service, which means, they are Virtual machines.
But we know that containers represent a new way to deploy containers based on operating-system-level virtualization rather than hardware virtualization
like VMs that are heavyweight and non-portable, isn't a contradiction? correct me if I'm wrong.
We use containers because they are extremely fast (either in boot time or tasks execution) compared to VMs, and they save a lot of space storage. So if we have one node(vm) that can supports 4 containers max, our clients can rapidly lunch 4 containers, but beyond this number, Ec2 autoscaler will need to lunch a new node(vm) to support upcoming containers, which incurs some tasks delay.
Is it impossible to launch containers over physical machines?
And what do you recommend for running critical time execution tasks?

I believe you are working under an erroneous assumption that ECS scales the virtual machines ("container instances" -- the instances where containers will run) directly with task demand.
If that were true, you would have a point, because the cluster would be sluggish and unresponsive any time insufficient container instance resources were not immediately available.
ECS doesn't do that, the presence of the Auto Scaling Group notwithstanding.
Depending on the Amazon EC2 instance types you use in your clusters, and quantity of container instances you have in a cluster, your tasks have a limited amount of resources that they can use when they are run. ECS monitors the resources available in the cluster to work with the schedulers to place tasks. If your cluster runs low on any of these resources, such as memory, you will eventually be unable to launch more tasks until you add more container instances, reduce the number of desired tasks in a service, or stop some of the running tasks in your cluster to free up the constrained resource. (emphasis added)
http://docs.aws.amazon.com/AmazonECS/latest/developerguide/cloudwatch_alarm_autoscaling.html
So, no... it doesn't launch the new tasks slowly when you are out of capacity. It doesn't launch them at all.
But don't get ahead of me.
The link above explains, with examples, how scaling of the virtual machines (container instances) is designed to actually work.
Of course, you don't have to make them adaptively scalable at all. You can go with your physical server model (note: I say physical server model -- meaning a fixed, inelastic pool of resources, on always-running virtual machines, since virtual machines is what EC2 provides), and just choose how many instances you wait to have running at all times, essentially emulating physical servers. If you wanted, say, 8 container instances, the "auto scaling group" would maintain exactly 8 at all times, creating replacements if, say, one of them experienced a hardware failure. That "auto" accomplishment would be maintaining the status quo. And, of course, in this configuration, you could manually reconfigure from 8 to, say, 12 and the "auto" accomplishment would be that you'd automatically get 4 new ones to add to the existing 8.
But the idea of how the service is ideally used is that your group of virtual machines scales up and down by rules you devise, to anticipate the resources needed by future tasks -- or a future lack of tasks.
In the example given, memory reservation is the trigger:
When the memory reservation of your cluster rises above 75% (meaning that only 25% of the memory in your cluster is available to for new tasks to reserve), the alarm triggers the Auto Scaling group to add another instance and provide more resources for your tasks and services.
It triggers the addition of more container instances so that you always have whatever you have determined to be the appropriate threshold of surplus capacity already online by the time you need it.
Of course, memory is just one resource, and 75% is just an arbitrary threshold chosen for the example.
Auto Scaling Groups can scale on a variety of triggers -- the phrase of the moon, the price trends in the stock market, whatever is appropriate to anticipating your desired amount of surplus capacity and can be quantified and monitored can be used... but this service does not scale itself directly by the actual attempt to launch a new task when the task can't be launched due to insufficient resources.
Herein lies the flaw in your original argument.
Why virtual machines? Simply enough, because when you destroy a virtual machine because the capacity is not expected to be needed, you stop paying for it.
In this light, perhaps you'll agree that this is not a weakness, it's a strength. Physical servers never stop costing you when you are not using them.
You don't need to pay anything at all for capacity you will not be needing with VMs -- you only have to pay for the capacity you're using plus the amount you need to keep immediately available to handle anticipated demand.
You can have as much idle surplus immediately ready as you are willing to pay for, or you can maximize savings by allowing as little surplus capacity as you are comfortable with being able to access without delay.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js