How Does Container Optimized OS Handle Security Updates? - google-cloud-platform

If there is a security patch for Google's Container Optimized OS itself, how does the update get applied?
Google's information on the subject is vague
https://cloud.google.com/container-optimized-os/docs/concepts/security#automatic_updates
Google claims the updates are automatic, but how?
Do I have to set a config option to update automatically?
Does the node need access to the internet, and where does the update come from? Or is Google Cloud smart enough to let Container-Optimized OS update itself when it is in a private VPC?

Do I have to set a config option to update automatically?
The automatic update behavior for Compute Engine (GCE) Container-Optimized OS (COS) VMs (i.e. those instances you created directly from GCE) is controlled via the "cos-update-strategy" GCE metadata key. See the documentation here.
The current documented default behavior is: "If not set all updates from the current channel are automatically downloaded and installed."
The download will happen in the background, and the update will take effect when the VM reboots.
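As a hedged illustration (the instance name is a placeholder, and "update_disabled" is the documented opt-out value as far as I know), the metadata key can be set or cleared with gcloud:
# Opt the VM out of automatic COS updates:
gcloud compute instances add-metadata my-cos-vm \
    --metadata cos-update-strategy=update_disabled
# Remove the key to return to the default (automatic updates from the current channel):
gcloud compute instances remove-metadata my-cos-vm \
    --keys cos-update-strategy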
Does the node need access to the internet, and where does the update come from? Or is Google Cloud smart enough to let Container-Optimized OS update itself when it is in a private VPC?
Yes, the VM needs access to the internet. If you disable all egress network traffic, COS VMs won't be able to update themselves.

When operated as part of Kubernetes Engine, the auto-upgrade functionality of Container-Optimized OS (COS) is disabled. Updates to COS are applied by upgrading the image version of the nodes using the GKE upgrade functionality: upgrade the master, followed by the node pool, or use the GKE auto-upgrade features.
The guidance on upgrading a Kubernetes Engine cluster describes the upgrade process used for manual and automatic upgrades: https://cloud.google.com/kubernetes-engine/docs/how-to/upgrading-a-cluster.
In summary, the following process is followed:
Nodes have scheduling disabled (so they will not be considered for scheduling new pods admitted to the cluster).
Pods assigned to the node under upgrade are drained. They may be recreated elsewhere if attached to a replication controller or equivalent manager which reschedules a replacement, and there is cluster capacity to schedule the replacement on another node.
The node's Compute Engine instance is upgraded with the new COS image, using the same name.
The node is started, re-added to the cluster, and scheduling is re-enabled. (Except under certain conditions, pods will not automatically move back to the node.)
This process is repeated for subsequent nodes in the cluster.
When you run an upgrade, Kubernetes Engine stops scheduling, drains, and deletes all of the cluster's nodes and their Pods one at a time. Replacement nodes are recreated with the same name as their predecessors. Each node must be recreated successfully for the upgrade to complete. When the new nodes register with the master, Kubernetes Engine marks the nodes as schedulable.
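For reference, a hedged sketch of triggering these upgrades manually with gcloud (the cluster name, node pool name, and zone are placeholders):
# Upgrade the control plane first:
gcloud container clusters upgrade my-cluster --master --zone us-central1-a
# Then upgrade the node pool, which recreates nodes one at a time as described above:
gcloud container clusters upgrade my-cluster --node-pool default-pool --zone us-central1-a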

Related

How do I disable auto upgrades for gke masters (not just node pools)

A little confused about this. The Google docs say auto upgrade is enabled for masters and node pools, but I can only find how to disable auto upgrade on node pools.
How do I disable auto upgrade on masters too?
Google documentation states:
Note: Cluster control planes are always upgraded on a regular basis, regardless of whether your cluster is enrolled in a release channel or not.
So you can disable or enable auto-update for node pools but masters (control plane) has auto-updates always enabled.
Another quote from here:
Regardless of whether your cluster is enrolled in a release channel or not, cluster control planes are always upgraded on a regular basis.
The docs do a poor job of explaining this.
To disable auto-node upgrades you disable auto upgrades on the node pool. To disable auto master upgrades you set the release channel to not specified.
Using Terraform it's configured here:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool#auto_upgrade
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#release_channel
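If you are not using Terraform, the release channel can also be set from gcloud; a hedged sketch (the cluster name and zone are placeholders, and flag support may vary by gcloud version):
# Unenroll the cluster from release channels (static version), which stops
# opting the control plane into channel-driven upgrades:
gcloud container clusters update my-cluster \
    --zone us-central1-a \
    --release-channel None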
The GKE team performs automatic upgrades of your cluster control plane on your behalf, and this is managed by Google in order to ensure security patches and functionality. As per the documentation, you can manually initiate a control plane upgrade to a version newer than the default.
Also, you can disable node auto-upgrade for an existing node pool. Keep in mind that if you do so, you have to make sure that the cluster's nodes run a version compatible with the cluster's version.
gcloud container node-pools update node-pool-name --cluster cluster-name \
--zone compute-zone --no-enable-autoupgrade
Also, this article explains further how automatic and manual upgrades work on Google Kubernetes Engine (GKE) clusters.
There is no way to stop GKE control plane auto-upgrades entirely, but you can pause control plane auto-upgrades for some period by using a GKE maintenance exclusion window.
Doc: https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions
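A hedged sketch of adding such an exclusion with gcloud (the cluster name, zone, exclusion name, dates, and scope are placeholders; consult the doc above for the exact flags and allowed scope values):
gcloud container clusters update my-cluster \
    --zone us-central1-a \
    --add-maintenance-exclusion-name freeze-upgrades \
    --add-maintenance-exclusion-start 2023-01-01T00:00:00Z \
    --add-maintenance-exclusion-end 2023-01-30T00:00:00Z \
    --add-maintenance-exclusion-scope no_upgrades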
No, you can't disable master auto-upgrades even if you set the release channel to not specified (static version). See https://cloud.google.com/kubernetes-engine/docs/release-schedule#schedule-for-static-no-channel
Note: The control planes of the clusters on static versions are upgraded automatically on or after the dates specified in the Auto Upgrade column of the Stable release channel schedule. When choosing a version older than the default version, you can use maintenance exclusions to prevent a cluster from being automatically upgraded until its end of life date.

Security patches for Kubernetes Nodes

I have access to a kops-built Kubernetes cluster on AWS EC2 instances. I would like to make sure that all available security patches from the corresponding package manager are applied. Unfortunately, after searching the whole internet for hours, I am unable to find any clue on how this should be done. Taking a look into the user data of the launch configurations, I did not find a line for the package manager, therefore I am not sure whether a simple node restart will do the trick. I also want to make sure that new nodes come up with current packages.
How do I apply security patches to new nodes of a Kubernetes cluster, and how do I make sure that all nodes are and stay up to date?
You might want to explore https://github.com/weaveworks/kured
Kured (KUbernetes REboot Daemon) is a Kubernetes daemonset that performs safe automatic node reboots when the need to do so is indicated by the package management system of the underlying OS.
Watches for the presence of a reboot sentinel e.g. /var/run/reboot-required
Utilises a lock in the API server to ensure only one node reboots at a time
Optionally defers reboots in the presence of active Prometheus alerts or selected pods
Cordons & drains worker nodes before reboot, uncordoning them after
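For context, a minimal sketch of the mechanism on a Debian/Ubuntu-based node image, assuming unattended-upgrades (or an equivalent) applies patches in the background; the sentinel path below is the standard Debian/Ubuntu one:
# An upgraded kernel or library that requires a reboot leaves a sentinel file:
cat /var/run/reboot-required
# Kured watches for that file, then cordons, drains, and reboots the node,
# holding an API-server lock so only one node reboots at a time.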

How to resize K8s cluster with kops, cluster-autoscaler to dynamically increase Masters

We have configured Kubernetes cluster on EC2 machines in our AWS account using kops tool (https://github.com/kubernetes/kops) and based on AWS posts (https://aws.amazon.com/blogs/compute/kubernetes-clusters-aws-kops/) as well as other resources.
We want to setup a K8s cluster of master and slaves such that:
It will automatically resize (both masters as well as nodes/slaves) based on system load.
Runs in Multi-AZ mode i.e. at least one master and one slave in every AZ (availability zone) in the same region for e.g. us-east-1a, us-east-1b, us-east-1c and so on.
We tried to configure the cluster in the following ways to achieve the above.
Created a K8s cluster on AWS EC2 machines using kops with the following configuration: node count=3, master count=3, zones=us-east-1c, us-east-1b, us-east-1a. We observed that a K8s cluster was created with 3 master and 3 slave nodes, with one master and one slave in each of the 3 AZs.
Then we tried to resize the Nodes/slaves in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). We set the node_asg_min to 3 and node_asg_max to 5. When we increased the workload on the slaves such that auto scale policy was triggered, we saw that additional (after the default 3 created during setup) slave nodes were spawned, and they did join the cluster in various AZ’s. This worked as expected. There is no question here.
We also wanted to set up the cluster such that the number of masters increases based on system load. Is there some way to achieve this? We tried a couple of approaches and results are shared below:
A) We were not sure if the cluster-auto scaler helps here, but nevertheless tried to resize the Masters in the cluster using (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/examples/cluster-autoscaler-run-on-master.yaml). This is useful while creating a new cluster but was not useful to resize the number of masters in an existing cluster. We did not find a parameter to specify node_asg_min, node_asg_max for Master the way it is present for slave Nodes. Is there some way to achieve this?
B) We increased the MIN count from 1 to 3 in the ASG (auto-scaling group) associated with one of the three IGs (instance groups, one per master). We found that new instances were created; however, they did not join the master cluster. Is there some way to achieve this?
Could you please point us to steps, resources on how to do this correctly so that we could configure the number of masters to automatically resize based on system load and is in Multi-AZ mode?
Kind regards,
Shashi
There is no need to scale Master nodes.
Master components provide the cluster’s control plane. Master components make global decisions about the cluster (for example, scheduling) and detect and respond to cluster events (such as starting up a new pod when a replication controller’s ‘replicas’ field is unsatisfied).
Master components can be run on any machine in the cluster. However, for simplicity, set up scripts typically start all master components on the same machine, and do not run user containers on this machine. See Building High-Availability Clusters for an example multi-master-VM setup.
Master node consists of the following components:
kube-apiserver
Component on the master that exposes the Kubernetes API. It is the front-end for the Kubernetes control plane.
etcd
Consistent and highly-available key value store used as Kubernetes’ backing store for all cluster data.
kube-scheduler
Component on the master that watches newly created pods that have no node assigned, and selects a node for them to run on.
kube-controller-manager
Component on the master that runs controllers.
cloud-controller-manager
Component on the master that runs controllers that interact with the underlying cloud providers. The cloud-controller-manager binary is an alpha feature introduced in Kubernetes release 1.6.
For more detailed explanation please read the Kubernetes Components docs.
Also if You are thinking about HA, you can read about Creating Highly Available Clusters with kubeadm
I think your assumption is that, similar to Kubernetes nodes, masters divide the work between each other. That is not the case, because the main task of the masters is to maintain consensus with each other. This is done with etcd, which is a distributed key-value store. Maintaining such a store is easy for one machine but gets harder the more machines you add.
The advantage of adding masters is being able to survive more master failures, at the cost of having to make all masters fatter (more CPU/RAM and so on) so that they perform well enough.
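As a rough illustration of why adding masters does not spread load the way adding nodes does (the etcdctl commands below assume shell access to a master and omit the certificate and endpoint flags):
# etcd accepts writes only while a quorum of floor(n/2)+1 members is healthy:
#   3 masters -> quorum 2 -> tolerates 1 failure
#   5 masters -> quorum 3 -> tolerates 2 failures
# An even member count adds cost without adding fault tolerance.
ETCDCTL_API=3 etcdctl member list
ETCDCTL_API=3 etcdctl endpoint health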

Can I upgrade Elasticache Redis Engine Version without downtime?

I cannot find any information in the AWS documentation that modifying the Redis engine version will or will not cause downtime. It does not explain how the upgrade occurs other than it's performed in the maintenance window.
Is it safe to upgrade a production ElastiCache Redis instance via the AWS console without loss of data or downtime?
Note: The client library we use is compatible with all versions of Redis so the application should not notice the upgrade.
Changing a cache engine version is a disruptive process which clears all cache data in the cluster. **
I don't know the requirements of your particular application. But if you can't lose your data and you need to do a major version upgrade, it would be advisable to migrate to a new cluster rather than upgrading the current setup.
** http://docs.aws.amazon.com/AmazonElastiCache/latest/UserGuide/VersionManagement.html
I am not sure if the answers are still relevant given that the question was asked nearly 7 years ago, but there are a few things.
Changing the node type or engine version is a Modify action, and your data remains intact on your ElastiCache clusters. I believe there was a doc that described how ElastiCache modifications take place (if I find it, I will link it here).
Essentially, ElastiCache launches a new node on the backend with the modifications you've made and copies your data to it. Suppose the modification you make is a change in the engine version from 5.x to 6.x:
ElastiCache will launch new Redis nodes on the backend with engine 6.x.
As the node comes up, ElastiCache will read keys from the 5.x node and copy the data to 6.x.
When the copy is complete, ElastiCache will make a change in the DNS records for your cluster's endpoints.
So there will be some downtime depending on your application's DNS cache TTL configuration. For example, if your application holds the DNS cache for 300 seconds, it can take up to 300 seconds to refresh the DNS cache on your application/client machine, and during that time the application might show some errors.
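A hedged way to see how long clients may cache the old record (the endpoint name below is a placeholder for your cluster's endpoint):
dig +noall +answer my-redis-group.xxxxxx.use1.cache.amazonaws.com
# The second field on each answer line is the remaining TTL in seconds, i.e.
# roughly how long a resolver may keep returning the old node's address.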
From the ElastiCache side, I do not think they provide any official SLA for this. But this doc[1] mentions it will only take a "few seconds" for this DNS change to propagate (depending on engine versions).
Still, you can always take a snapshot of your cluster as a backup. If anything goes south, you can use snapshot to launch a new cluster with the same data.
Also, one more thing: AWS will never upgrade your engine version by itself. The maintenance window for your ElastiCache cluster is for security patches and small optimizations of the engine. These do not affect the engine version.
Cheers!
[1] https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/AutoFailover.html
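To act on the snapshot suggestion above, a minimal sketch with the AWS CLI (the replication group id, snapshot name, and target engine version are placeholders; verify the exact parameters against the current CLI reference):
# Take a backup first:
aws elasticache create-snapshot \
    --replication-group-id my-redis-group \
    --snapshot-name pre-upgrade-backup
# Then request the engine-version change:
aws elasticache modify-replication-group \
    --replication-group-id my-redis-group \
    --engine-version 6.x \
    --apply-immediately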
As mentioned by Will above, the AWS answer has changed, and in theory you can do it without downtime. See:
https://docs.aws.amazon.com/AmazonElastiCache/latest/red-ug/VersionManagement.html
The key points are in terms of downtime and impact on existing use:
The Amazon ElastiCache for Redis engine upgrade process is designed to make a best effort to retain your existing data and requires successful Redis replication.
...
For single Redis clusters and clusters with Multi-AZ disabled, we recommend that sufficient memory be made available to Redis as described in Ensuring That You Have Enough Memory to Create a Redis Snapshot. In these cases, the primary is unavailable to service requests during the upgrade process.
...
For Redis clusters with Multi-AZ enabled, we also recommend that you schedule engine upgrades during periods of low incoming write traffic. When upgrading to Redis 5.0.5 or above, the primary cluster continues to be available to service requests during the upgrade process. When upgrading to Redis 5.0.4 or below, you may notice a brief interruption of a few seconds associated with the DNS update.
There are no guarantees here, so you will have to make your own decision about the risk of losing data if the upgrade fails.
It depends on your current version:
When upgrading to Redis 5.0.6 or above, the primary cluster continues to be available to service requests during the upgrade process (source).
Starting with Redis engine version 5.0.5, you can upgrade your cluster version with minimal downtime. The cluster is available for reads during the entire upgrade and is available for writes for most of the upgrade duration, except during the failover operation, which lasts a few seconds (source); the cluster stays available for reads during engine upgrades, and writes are interrupted for less than a second with version 5.0.5 (source).
You can also upgrade your ElastiCache clusters with versions earlier than 5.0.5. The process involved is the same but may incur longer failover time during DNS propagation (30s-1m) (source, source).
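Since the behaviour above depends on the version you are upgrading from, it is worth checking the current engine version first; a hedged sketch (the cache cluster id is a placeholder):
aws elasticache describe-cache-clusters \
    --cache-cluster-id my-redis-node \
    --query "CacheClusters[].EngineVersion"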

What is the exact behavior of the is_leader flag on Amazon Beanstalk?

What is the exact behavior of the 'is_leader' flag on an Amazon Beanstalk deployment? I could not find any exhaustive documentation on it. More specifically:
What is the value on a single instance environment?
Does Amazon properly reset the is_leader value when nodes are added/removed from an environment, either manually or via auto-scaling?
If that flag is automatically reset by Amazon for a node, does Amazon take care of restarting the instance to make sure that flag is taken into account by the application?
is_leader is a tag applied by the AWS deployment process to the first created instance. As you mentioned, the documentation on is_leader is very scarce; here's what I was able to find:
The idea of a leader only exists during the execution of a deployment in an environment update. After deployment has executed, there isn't a concept of a leader anymore, though you could determine which instance had been the leader if needed for debugging purposes.
The answers to your questions:
What is the value on a single instance environment?
is_leader is not applicable to a single-instance environment, so the tag is not set.
Does Amazon properly reset the is_leader value when nodes are added/removed from an environment, either manually or via auto-scaling?
The leader node is not immune from being removed from an environment. If it is removed, there is no "leader" re-assignment. There are ways to prevent it from being shut down by Auto Scaling: Configure Instance Termination Policy for Your Auto Scaling Group.
If that flag is automatically reset by Amazon for a node, does Amazon take care of restarting the instance to make sure that flag is taken into account by the application?
The flag is not reset. Once the leader node is gone from the environment, the tag will only reappear when the environment is rebuilt.
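For debugging, and assuming (as stated above) that is_leader ends up as an EC2 tag on the instance, a hedged sketch of locating the tagged instance with the AWS CLI (the tag key is an assumption taken from the answer above and may differ in your environment):
# List instances that carry a tag with the assumed key "is_leader":
aws ec2 describe-instances \
    --filters "Name=tag-key,Values=is_leader" \
    --query "Reservations[].Instances[].InstanceId"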
Sources:
Leaderless Beanstalk instances (or detecting the leader)
AWS Elastic Beanstalk, running a cronjob