Security patches for Kubernetes Nodes

I have access to a kops-built Kubernetes cluster on AWS EC2 instances. I would like to make sure that all available security patches from the corresponding package manager are applied. Unfortunately, after searching for hours, I am unable to find any clue on how this should be done. Looking at the user data of the launch configurations, I did not find a line that invokes the package manager. Therefore I am not sure whether a simple node restart will do the trick, and I also want to make sure that new nodes come up with current packages.
How do I apply security patches to new nodes of a Kubernetes cluster, and how do I make sure that all nodes are and stay up to date?

You might want to explore https://github.com/weaveworks/kured
Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
Watches for the presence of a reboot sentinel e.g. /var/run/reboot-required
Utilises a lock in the API server to ensure only one node reboots at a time
Optionally defers reboots in the presence of active Prometheus alerts or selected pods
Cordons & drains worker nodes before reboot, uncordoning them after
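
The sentinel check described above can be sketched in a few lines. This is only an illustration of the idea, not kured's actual code: the function names are made up for this example, and it assumes something else (for example the OS's unattended upgrade mechanism) installs the packages and creates the sentinel file, since kured itself only handles the reboot.

```python
import os
import tempfile

# Sentinel path taken from the description above; Debian/Ubuntu style
# package tooling creates it when an installed update needs a reboot.
DEFAULT_SENTINEL = "/var/run/reboot-required"


def reboot_required(sentinel_path: str = DEFAULT_SENTINEL) -> bool:
    """Return True if the reboot sentinel file exists (hypothetical helper)."""
    return os.path.exists(sentinel_path)


def demo() -> tuple[bool, bool]:
    """Run the check against a temporary directory instead of /var/run."""
    with tempfile.TemporaryDirectory() as tmp:
        sentinel = os.path.join(tmp, "reboot-required")
        before = reboot_required(sentinel)
        open(sentinel, "w").close()  # simulate the package manager's marker
        after = reboot_required(sentinel)
    return before, after


if __name__ == "__main__":
    print(demo())
```

A real daemon would additionally take the cluster-wide lock, then cordon and drain the node before rebooting, as listed above.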

Related

Cluster nodes only used by internal pods

We are using GKE to host our apps with Anthos. Our default node pool is set to autoscale, but I noticed that out of 5 running nodes, only 2 are hosting our actual services.
All the others are running internal services like this:
The issue with that is that there's not enough room for running our own services. I guess these are vital for the cluster otherwise the cluster would autoscale and the nodes would get removed.
What would be the best approach to solve this issue? I thought of upgrading the nodes' machine type to allow more resources per node, so that there is more room on each node and fewer nodes are running, but I wanted to make sure I was not simply missing something about how GKE works.
I've been now digging for quite some time but it seems that would be my only option.

GKE itself requires several add-on resources which are deployed as part of your cluster. You can fine-tune the resource usage of some of the GKE add-ons for smaller clusters. Additionally, each Anthos capability you enable typically deploys a set of controllers as well. GKE and Anthos try to minimize the compute resources used by these services/controllers, but you do need to account for them when calculating the right size(s) for your nodes. A good rule of thumb is to assume that system services/controllers will use ~1 vCPU per node when using GKE/Anthos (it's typically lower than that, but it makes things easier). So if your workloads all request >=1 vCPU, you'll likely need to use nodes that have a minimum of 4 vCPUs. You'll also want to enable the cluster autoscaler for your node pools if you don't want to pre-provision everything.
A better option would be to use node auto-provisioning as in this case you don't need to create/manage your own node pools as GKE will automatically add/remove nodes / node pools based on the resources requested by your deployments.
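
The sizing rule of thumb above can be turned into a small calculation. This is a rough sketch only: the function names are made up for this example, the ~1 vCPU system reserve is the assumption from the answer, and it ignores memory and the exact allocatable values GKE reports.

```python
import math


def workload_pods_per_node(node_vcpus: float, pod_request_vcpus: float,
                           system_reserved_vcpus: float = 1.0) -> int:
    """Pods that fit on one node after subtracting the assumed system reserve."""
    usable = node_vcpus - system_reserved_vcpus
    if usable <= 0 or pod_request_vcpus <= 0:
        return 0
    return math.floor(usable / pod_request_vcpus)


def nodes_needed(pod_count: int, node_vcpus: float, pod_request_vcpus: float,
                 system_reserved_vcpus: float = 1.0) -> int:
    """Nodes required to place pod_count pods under the same assumption."""
    per_node = workload_pods_per_node(node_vcpus, pod_request_vcpus,
                                      system_reserved_vcpus)
    if per_node == 0:
        raise ValueError("node too small to fit even one workload pod")
    return math.ceil(pod_count / per_node)


if __name__ == "__main__":
    for vcpus in (2, 4, 8):
        print(vcpus, "vCPU node ->", workload_pods_per_node(vcpus, 1.0), "pods")
```

Under this assumption a 2 vCPU node has room for a single 1 vCPU pod, while a 4 vCPU node has room for three, which is why larger nodes waste proportionally less on system services.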

How to resize K8s cluster with kops, cluster-autoscaler to dynamically increase Masters

We have configured Kubernetes cluster on EC2 machines in our AWS account using kops tool (https://github.com/kubernetes/kops) and based on AWS posts (https://aws.amazon.com/blogs/compute/kubernetes-clusters-aws-kops/) as well as other resources.
We want to setup a K8s cluster of master and slaves such that:
It will automatically resize (both masters as well as nodes/slaves) based on system load.
Runs in Multi-AZ mode, i.e. at least one master and one slave in every AZ (availability zone) in the same region, e.g. us-east-1a, us-east-1b, us-east-1c and so on.
We tried to configure the cluster in the following ways to achieve the above.
Created a K8s cluster on AWS EC2 machines using kops with the following configuration: node count=3, master count=3, zones=us-east-1c, us-east-1b, us-east-1a. We observed that a K8s cluster was created with 3 master and 3 slave nodes, with one master and one slave in each of the 3 AZs.
Then we tried to resize the Nodes/slaves in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). We set the node_asg_min to 3 and node_asg_max to 5. When we increased the workload on the slaves such that auto scale policy was triggered, we saw that additional (after the default 3 created during setup) slave nodes were spawned, and they did join the cluster in various AZ’s. This worked as expected. There is no question here.
We also wanted to set up the cluster such that the number of masters increases based on system load. Is there some way to achieve this? We tried a couple of approaches and results are shared below:
A) We were not sure if the cluster-auto scaler helps here, but nevertheless tried to resize the Masters in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). This is useful while creating a new cluster but was not useful to resize the number of masters in an existing cluster. We did not find a parameter to specify node_asg_min, node_asg_max for Master the way it is present for slave Nodes. Is there some way to achieve this?
B) We increased the MIN count from 1 to 3 in the ASG (auto-scaling group) associated with one of the three IGs (instance groups), one per master. We found that new instances were created. However, they did not join the master cluster. Is there some way to achieve this?
Could you please point us to steps, resources on how to do this correctly so that we could configure the number of masters to automatically resize based on system load and is in Multi-AZ mode?
Kind regards,
Shashi

There is no need to scale Master nodes.
Master components provide the cluster’s control plane. Master components make global decisions about the cluster (for example, scheduling), and detect and respond to cluster events (for example, starting up a new pod when a replication controller’s ‘replicas’ field is unsatisfied).
Master components can be run on any machine in the cluster. However, for simplicity, set up scripts typically start all master components on the same machine, and do not run user containers on this machine. See Building High-Availability Clusters for an example multi-master-VM setup.
Master node consists of the following components:
kube-apiserver
Component on the master that exposes the Kubernetes API. It is the front-end for the Kubernetes control plane.
etcd
Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.
kube-scheduler
Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on.
kube-controller-manager
Component on the master that runs controllers.
cloud-controller-manager
runs controllers that interact with the underlying cloud providers. The cloud-controller-manager binary is an alpha feature introduced in Kubernetes release 1.6.
For more detailed explanation please read the Kubernetes Components docs.
Also, if you are thinking about HA, you can read about Creating Highly Available Clusters with kubeadm.

I think your assumption is that, similar to Kubernetes nodes, masters divide the work between each other. That is not the case, because the main task of the masters is to reach consensus with each other. This is done with etcd, which is a distributed key-value store. Maintaining such a store is easy with 1 machine but gets harder the more machines you add.
The advantage of adding masters is being able to survive more master failures, at the cost of having to make all masters fatter (more CPU/RAM) so that they perform well enough.
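
The trade-off can be made concrete with the majority rule that etcd's consensus relies on: a write needs agreement from more than half of the members. A small sketch of that arithmetic (the function names are just for this example):

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can be lost while a majority is still available."""
    return members - quorum(members)


if __name__ == "__main__":
    print("members quorum tolerated")
    for n in range(1, 8):
        print(f"{n:7d} {quorum(n):6d} {failures_tolerated(n):9d}")
```

Going from 3 to 4 masters does not buy any extra failure tolerance (both survive one failure), while every write still has to be acknowledged by a larger majority. This is why master counts are usually odd and fixed, rather than scaled with load.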

How Does Container Optimized OS Handle Security Updates?

If there is a security patch for Google's Container Optimized OS itself, how does the update get applied?
Google's information on the subject is vague
https://cloud.google.com/container-optimized-os/docs/concepts/security#automatic_updates
Google claims the updates are automatic, but how?
Do I have to set a config option to update automatically?
Does the node need to have access to the internet, where is the update coming from? Or is Google Cloud smart enough to let Container Optimized OS update itself when it is in a private VPC?

Do I have to set a config option to update automatically?
The automatic update behavior of Compute Engine (GCE) Container-Optimized OS (COS) VMs (i.e. instances you created directly from GCE) is controlled via the "cos-update-strategy" GCE metadata key. See the documentation for details.
The current documented default behavior is: "If not set all updates from the current channel are automatically downloaded and installed."
The download will happen in the background, and the update will take effect when the VM reboots.
Does the node need to have access to the internet, where is the update coming from? Or is Google Cloud smart enough to let Container Optimized OS update itself when it is in a private VPC?
Yes, the VM needs access to the internet. If you disable all egress network traffic, COS VMs won't be able to update themselves.

When operated as part of Kubernetes Engine, the auto-upgrade functionality of Container Optimized OS (cos) is disabled. Updates to cos are applied by upgrading the image version of the nodes using the GKE upgrade functionality – upgrade the master, followed by the node pool, or use the GKE auto-upgrade features.
The guidance on upgrading a Kubernetes Engine cluster describes the upgrade process used for manual and automatic upgrades: https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster.
In summary, the following process is followed:
Nodes have scheduling disabled (so they will not be considered for scheduling new pods admitted to the cluster).
Pods assigned to the node under upgrade are drained. They may be recreated elsewhere if attached to a replication controller or equivalent manager which reschedules a replacement, and there is cluster capacity to schedule the replacement on another node.
The node's Compute Engine instance is upgraded with the new cos image, using the same name.
The node is started, re-added to the cluster, and scheduling is re-enabled. (Except under some conditions, most pods will not automatically move back.)
This process is repeated for subsequent nodes in the cluster.
When you run an upgrade, Kubernetes Engine stops scheduling, drains, and deletes all of the cluster's nodes and their Pods one at a time. Replacement nodes are recreated with the same name as their predecessors. Each node must be recreated successfully for the upgrade to complete. When the new nodes register with the master, Kubernetes Engine marks the nodes as schedulable.
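
The per-node sequence summarized above can be illustrated with a toy simulation. This is purely a model of the ordering (cordon, drain, recreate, uncordon); it does not call any GKE API, and the Node class, the least-loaded placement rule and the image names are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    image: str
    schedulable: bool = True
    pods: list[str] = field(default_factory=list)


def rolling_upgrade(nodes: list[Node], new_image: str) -> list[str]:
    """Upgrade nodes one at a time and return a log of the steps taken."""
    log = []
    for node in nodes:
        node.schedulable = False                    # 1. disable scheduling
        log.append(f"cordon {node.name}")
        targets = [n for n in nodes if n.schedulable]
        for pod in node.pods:                       # 2. drain the pods
            if not targets:
                log.append(f"{pod} stays pending (no capacity)")
                continue
            target = min(targets, key=lambda n: len(n.pods))
            target.pods.append(pod)
            log.append(f"reschedule {pod}: {node.name} -> {target.name}")
        node.pods = []
        node.image = new_image                      # 3. recreate, same name
        log.append(f"recreate {node.name} with {new_image}")
        node.schedulable = True                     # 4. re-enable scheduling
        log.append(f"uncordon {node.name}")
    return log


if __name__ == "__main__":
    cluster = [Node("node-a", "cos-old", pods=["web-1"]),
               Node("node-b", "cos-old", pods=["web-2"])]
    for line in rolling_upgrade(cluster, "cos-new"):
        print(line)
    print([(n.name, n.pods) for n in cluster])
```

In the two-node run, both pods end up on node-a and node-b stays empty after its upgrade, which matches the remark that pods do not automatically move back.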

What is the exact behavior of the is_leader flag on Amazon Beanstalk?

What is the exact behavior of the 'is_leader' flag on an Amazon Beanstalk deployment? I could not find any exhaustive documentation on it. More specifically:
What is the value on a single instance environment?
Does Amazon properly reset the is_leader value when nodes are added/removed from an environment, either manually or via auto-scaling?
If that flag is automatically reset by Amazon for a node, does Amazon take care of restarting the instance to make sure that flag is taken into account by the application?

is_leader is a tag applied by the AWS deployment process to the first created instance. As you mentioned, documentation on is_leader is very scarce; here's what I was able to find:
The idea of a leader only exists during the execution of a deployment
in an environment update. After deployment has executed, there isn't a
concept of a leader anymore, though you could determine which instance
had been the leader if needed for debugging purposes.
The answers to your questions:
What is the value on a single instance environment?
is_leader is not applicable to a single-instance environment, so the tag is not set.
Does Amazon properly reset the is_leader value when nodes are
added/removed from an environment, either manually or via
auto-scaling?
The leader node is not immune from being removed from an environment. If it is removed, there is no "leader" re-assignment. There are ways to prevent it from being shut down by Auto Scaling; see Configure Instance Termination Policy for Your Auto Scaling Group.
If that flag is automatically reset by Amazon for a node, does Amazon
take care of restarting the instance to make sure that flag is taken
into account by the application?
The flag is not reset. Once the leader node is gone from the environment, the tag will only reappear on the rebuild.
Sources:
Leaderless Beanstalk instances (or detecting the leader)
AWS Elastic Beanstalk, running a cronjob

Best way to manage code changes for application in Amazon EC2 with Auto Scaling

I have multiple instances running behind Load balancer with Auto Scaling in AWS.
Now, if I have to push some code changes to these instances and any new instances that might launch because of auto scaling policy, what's the best way to do this?
The only way I am aware of is, to create a new AMI with latest code, modify the auto scaling policy to use this new AMI and then terminate the existing instances. But this might involve a longer downtime and I am not sure whether the whole process can be automated.
Any pointers in this direction will be highly appreciated.

The way I do my code changes is to have a master server on which I edit the code. All the slave servers that scale then rsync via ssh from a cron job to bring all the files up to date. All the servers sync every 30 minutes, plus or minus a few random seconds, to keep them from accessing the master at the exact same second. (Note: I leave the master off the load balancer so users always have the same code being sent to them.) Similarly, when I decide to publish my code changes, I do an rsync from my test server to my master server.
Using this approach, you merely have to put the sync command in the start-up script, and you don't have to worry about what the code state was on the slave image, as it will be up to date after it boots.
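
A minimal sketch of the jittered sync described above, assuming a 30-second jitter window (the answer only says "a few random seconds") and hypothetical host and path names; the rsync flags are the usual archive/compress/delete options and may need adjusting:

```python
import random

BASE_INTERVAL_S = 30 * 60  # "every 30 minutes" from the answer
MAX_JITTER_S = 30          # assumed spread for "a few random seconds"


def next_sync_delay(rng: random.Random, base: float = BASE_INTERVAL_S,
                    jitter: float = MAX_JITTER_S) -> float:
    """Seconds to wait before the next sync: base plus or minus a random offset."""
    return base + rng.uniform(-jitter, jitter)


def rsync_command(master_host: str, src: str, dest: str) -> list[str]:
    """Build the rsync-over-ssh command that pulls code from the master."""
    return ["rsync", "-az", "--delete", "-e", "ssh",
            f"{master_host}:{src}", dest]


if __name__ == "__main__":
    rng = random.Random()
    # Hypothetical host and paths, for illustration only.
    print(rsync_command("master.example.internal", "/var/www/", "/var/www/"))
    print(f"next sync in {next_sync_delay(rng):.1f}s")
```

The command list could be passed to subprocess.run once at boot and then again after each delay; building it separately keeps the sketch runnable without a master server.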
EDIT:
We have stopped using this method now and started using the new service AWS CodeDeploy which is made for this exact purpose:
http://aws.amazon.com/codedeploy/
Hope this helps.

We configure our Launch Configuration to use a "clean" off-the-shelf AMI - we use these: http://aws.amazon.com/amazon-linux-ami/
One of the features of these AMIs is CloudInit - https://help.ubuntu.com/community/CloudInit
This feature enables us to deliver to the newly spawned plain vanilla EC2 instance some data. Specifically, we give the instance a script to run.
The script (in a nutshell) does the following:
Upgrades itself (to make sure all security patches and bug fixes are applied).
Installs Git and Puppet.
Clones a Git repo from Github.
Applies a puppet script (which is part of the repo) to configure itself. Puppet installs the rest of the needed software modules.
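
The four steps above could be assembled into a user-data script roughly as follows. This is a sketch under assumptions, not the author's actual script: the repository URL, checkout directory and manifest path are hypothetical, and the yum package names are assumed to be available on the chosen AMI.

```python
import posixpath
import shlex


def build_user_data(repo_url: str, manifest_path: str,
                    checkout_dir: str = "/opt/app") -> str:
    """Return a bash user-data script for cloud-init to run on first boot."""
    manifest = posixpath.join(checkout_dir, manifest_path)
    lines = [
        "#!/bin/bash",
        "set -euo pipefail",
        "yum update -y",                   # 1. apply pending security patches
        "yum install -y git puppet",       # 2. install Git and Puppet (assumed package names)
        f"git clone {shlex.quote(repo_url)} {shlex.quote(checkout_dir)}",  # 3. fetch the repo
        f"puppet apply {shlex.quote(manifest)}",                           # 4. configure the host
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical repository and manifest, for illustration only.
    print(build_user_data("https://github.com/example/infra.git",
                          "manifests/site.pp"))
```

Because the update runs on every boot of a fresh instance, new nodes come up patched without rebuilding the AMI, at the cost of a slower start.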
It does take longer than booting from a pre-configured AMI, but we skip the process of actually making these AMIs every time we update the software (a couple of times a week) and the servers are always "clean" - no manual patches, all software modules are up to date etc.
Now, to upgrade the software, we use a local boto script.
The script kills the servers running the old code one by one. The Auto Scaling mechanism launches new (and upgraded) servers.
Make sure to use as-terminate-instance-in-auto-scaling-group because using ec2-terminate-instance will cause the ELB to continue to send traffic to the shutting-down instance, until it fails the health check.
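
A sketch of such a one-by-one replacement loop follows. It is not the author's boto script: it assumes a client object that exposes terminate_instance_in_auto_scaling_group with the InstanceId and ShouldDecrementDesiredCapacity parameters, as the boto3 Auto Scaling client does, and it uses a recording stand-in so the example runs without AWS credentials. The fixed pause is also an assumption; in practice you would wait until the replacement instance is healthy.

```python
import time


class RecordingClient:
    """Stand-in for an Auto Scaling client; records calls instead of making them."""

    def __init__(self):
        self.calls = []

    def terminate_instance_in_auto_scaling_group(self, InstanceId,
                                                 ShouldDecrementDesiredCapacity):
        self.calls.append((InstanceId, ShouldDecrementDesiredCapacity))


def replace_instances(asg_client, instance_ids, pause_s: float = 0.0):
    """Terminate instances one at a time so the group launches replacements."""
    terminated = []
    for instance_id in instance_ids:
        asg_client.terminate_instance_in_auto_scaling_group(
            InstanceId=instance_id,
            # Keep desired capacity unchanged so a new instance is launched.
            ShouldDecrementDesiredCapacity=False,
        )
        terminated.append(instance_id)
        time.sleep(pause_s)  # crude wait; a health check would be safer
    return terminated


if __name__ == "__main__":
    client = RecordingClient()
    replace_instances(client, ["i-aaaa1111", "i-bbbb2222"])  # hypothetical IDs
    print(client.calls)
```

With boto3 installed and credentials configured, passing boto3.client("autoscaling") instead of RecordingClient would issue the real calls.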
Interesting related blog post: http://blog.codento.com/2012/02/hello-ec2-part-1-bootstrapping-instances-with-cloud-init-git-and-puppet/

It appears you can manually double the Auto Scaling group size; it will create EC2 instances using the AMI from the current Launch Configuration. If you then decrease the Auto Scaling group back to its previous size, the old instances will be killed and only instances created from the new AMI will survive.
