I'm currently testing out Google Cloud for a home project. I only need the node to run during a certain time slot, but when I switch the node off it automatically switches itself back on. I'm not sure if I'm missing something, as I did not enable autoscaling, and it's a General Purpose e2-small instance.
Kubernetes nodes are managed by a node pool, which you may have created during cluster creation if you are using GKE.
The node pool maintains the desired node count, so when you stop a node manually, GKE will either create a new node or start the existing one back up.
If you are on GKE and want to scale down to zero, you can reduce the node count of the node pool from the GKE console.
Check your node pool: https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools#console_1
Resize your node pool from here: https://cloud.google.com/kubernetes-engine/docs/how-to/node-pools#resizing_a_node_pool
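If you prefer the CLI, the same resize to zero can be done with gcloud; the cluster, pool and zone names below are placeholders:

# Scale the node pool down to zero nodes
gcloud container clusters resize my-cluster \
    --node-pool my-pool \
    --num-nodes 0 \
    --zone us-central1-a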
I have a Kubernetes cluster set up and I want to stop it so it doesn't generate additional costs, but keep my deployments and configurations saved so that it will work when I start it again. I tried disabling autoscaling and resizing the node pool, but I get the error INVALID_ARGUMENT: Autopilot clusters do not support mutating node pools.
With GKE (Autopilot or not) you pay for two things:
The control plane, fully managed by Google
The workers: node pools on standard GKE, the running pods on GKE Autopilot.
In both cases, you can't stop the control plane; you don't manage it. The only solution is to delete the cluster.
In both cases, you can scale your pods/node pools to 0 and therefore remove the worker cost.
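On a standard cluster that means resizing the node pool to zero; on Autopilot you only control the pods, so a rough sketch (the namespace is just an example) is to scale your workloads down to zero replicas:

# Scale every Deployment in a namespace to zero replicas so no pods are billed
# (example namespace "default"; repeat for other namespaces as needed)
kubectl scale deployment --all --replicas=0 -n default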
That being said, in your case you have no other solution than deleting your Autopilot control plane and saving your configuration in config files (the YAML manifests). The next time you want your Autopilot cluster, create a new one, load your config, and that's all.
For persistent data, you have to save it outside the cluster (on GCS, for instance) and reload it as well. That's the boring part.
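A rough sketch of that delete/recreate cycle, assuming a cluster named my-autopilot in us-central1 (both names are placeholders, and the export only covers the listed resource types):

# Export the configuration you want to keep (extend the resource list as needed)
kubectl get deployments,services,configmaps,secrets --all-namespaces -o yaml > cluster-backup.yaml
# Delete the Autopilot cluster to stop the billing
gcloud container clusters delete my-autopilot --region us-central1
# Later: recreate the cluster and reload the saved configuration
gcloud container clusters create-auto my-autopilot --region us-central1
kubectl apply -f cluster-backup.yaml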
Note: you have 1 cluster free per billing account
We are using GKE with Anthos to host our apps. Our default node pool is set to autoscale, but I noticed that out of 5 running pods, only 2 are hosting our actual services.
All the others are running internal cluster services.
The issue is that there isn't enough room left for running our own services. I guess these are vital for the cluster, otherwise it would autoscale and the nodes would get removed.
What would be the best approach to solve this issue? I thought of upgrading the nodes' machine type to allow more resources per node, and thus have more room within them and fewer running nodes, but I wanted to make sure I was not simply missing something about how GKE works.
I've been digging for quite some time now, but it seems that would be my only option.
GKE itself requires several add-on resources which are deployed as part of your cluster. You can fine-tune the resource usage of some of the GKE add-ons for smaller clusters. Additionally, each Anthos capability you enable typically deploys a set of controllers as well. GKE and Anthos try to minimize the compute resources used by these services/controllers, but you do need to account for them when calculating the right size(s) for your nodes. A good rule of thumb is to assume that system services/controllers will use ~1 vCPU when using GKE/Anthos (it's typically lower than that, but it makes things easier). So if your workloads all request >= 1 vCPU, you'll likely need to use nodes that have a minimum of 4 vCPUs. You'll also want to enable the cluster autoscaler for your node pools if you don't want to pre-provision everything.
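Enabling the autoscaler on an existing node pool can be done with gcloud; a sketch with placeholder names and limits:

# Turn on the cluster autoscaler for an existing node pool
# (my-cluster, my-pool, the zone and the node limits are placeholders)
gcloud container clusters update my-cluster \
    --enable-autoscaling \
    --node-pool my-pool \
    --min-nodes 1 \
    --max-nodes 5 \
    --zone us-central1-a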
A better option would be to use node auto-provisioning: in that case you don't need to create or manage your own node pools, as GKE will automatically add and remove nodes and node pools based on the resources requested by your deployments.
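A sketch of enabling node auto-provisioning on an existing cluster; the CPU/memory ceilings are arbitrary examples you would tune to your workloads:

# Let GKE create and delete node pools based on the resources your pods request
gcloud container clusters update my-cluster \
    --enable-autoprovisioning \
    --min-cpu 1 --max-cpu 32 \
    --min-memory 1 --max-memory 128 \
    --zone us-central1-a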
How do I disable GKE cluster node maintenance auto-repair using Terraform? It seems I can't stop the nodes or change these settings from the GCP console, so I guess I'll have to try it with Terraform, even if it recreates the cluster.
How does the maintenance happen? I think it migrates all the pods to the secondary node and then restarts the first node, correct? But what if there aren't enough resources available on the secondary node to handle all the pods from the primary node? Will GCP create a new node? For example: the primary node has around 110 pods and the secondary node has 110 pods. How does the maintenance happen if the nodes need to be restarted?
You can disable node auto-repair by running the following command in the GCP shell:
gcloud container node-pools update <pool-name> \
    --cluster <cluster-name> \
    --zone <compute-zone> \
    --no-enable-autorepair
You will find how to do it using the GCP console in this link as well.
If you are still facing issues and want to disable node auto-repair using Terraform, you have to set it in the management argument of the node pool resource. You can find further details in Terraform's documentation.
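A minimal sketch of that management block on a google_container_node_pool resource; the resource, pool and cluster names are hypothetical:

# Hypothetical node pool; the management block is the relevant part
resource "google_container_node_pool" "primary_nodes" {
  name       = "my-node-pool"
  cluster    = google_container_cluster.primary.name
  location   = "us-central1-a"
  node_count = 2

  management {
    auto_repair  = false  # disables node auto-repair
    auto_upgrade = false  # optional: also disables auto-upgrade
  }
}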
Here you can also find how the node repair process works:
If GKE detects that a node requires repair, the node is drained and re-created. GKE waits one hour for the drain to complete. If the drain doesn't complete, the node is shut down and a new node is created.
If multiple nodes require repair, GKE might repair nodes in parallel. GKE balances the number of repairs depending on the size of the cluster and the number of broken nodes. GKE will repair more nodes in parallel on a larger cluster, but fewer nodes as the number of unhealthy nodes grows.
If you disable node auto-repair at any time during the repair process, in-progress repairs are not cancelled and continue for any node currently under repair.
I use Rancher to create an EC2 cluster on AWS, and I get stuck at "Waiting to register with Kubernetes" every time, as shown in the figure below.
You can see the error message "Cluster must have at least one etcd plane host: failed to connect to the following etcd host(s)" on the Nodes page of the Rancher UI. Does anyone know how to solve it?
This is the screenshot with the error
Follow the installation guide carefully.
When you are ready to create the cluster, you have to add a node with the etcd role.
Each node role (i.e. etcd, Control Plane, and Worker) should be assigned to a distinct node pool. Although it is possible to assign multiple node roles to a node pool, this should not be done for production clusters.
The recommended setup is to have a node pool with the etcd node role and a count of three, a node pool with the Control Plane node role and a count of at least two, and a node pool with the Worker node role and a count of at least two.
Only after that does Rancher set up the cluster.
You can check the exact error (either DNS or certificate related) by logging into the host nodes and looking at the container logs (docker logs).
Download the keys and SSH to the nodes to see more concrete error messages.
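A rough debugging sketch, assuming the usual RKE container names and the key pair used by the EC2 node template (both are assumptions; adjust to your setup):

# SSH to the stuck node with the downloaded key
ssh -i downloaded-key.pem ubuntu@<node-public-ip>
# List the Rancher/RKE containers and inspect the failing one
docker ps -a
docker logs etcd 2>&1 | tail -n 50      # etcd container (name assumed)
docker logs kubelet 2>&1 | tail -n 50   # kubelet, if the node got that far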
I'm making a dashboard in GCP Stackdriver. We have an autoscaling node pool in which the pods I'm interested in monitoring run. What I'm wondering is: how do I monitor the number of nodes that are currently running in the pool?
I've had a look at log-based metrics, but I can't find anywhere in the logs where it actually says how many nodes are currently running.
There's no metric for the number of nodes per se, but we can get something similar with the sum of Total cores grouped by cluster_name.
Monitoring the Instance Group/instance_group_size metric will show the exact number of VMs/nodes in your node pool. You can filter on the instance group name associated with your node pool.
In Metrics Explorer, use Instance Group as the resource type and Instance group size as the metric, then filter on the instance group that backs your GKE node pool.
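As a quick cross-check from the CLI (not part of the dashboard itself), listing the instance groups that back GKE node pools shows their current instance counts; the gke- name prefix is the usual convention, not a guarantee:

# Instance groups backing GKE node pools are typically named gke-<cluster>-<pool>-<hash>
gcloud compute instance-groups list --filter="name ~ ^gke-"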
If you are using the new Stackdriver, it will show all the running nodes in the infrastructure view inside the Kubernetes Engine menu.
If you click on a node it will show all the namespaces it contains, and if you click further it will show all the running pods as well.
For more details, you can check this out: https://medium.com/google-cloud/new-stackdriver-monitoring-for-kubernetes-part-1-a296fa164694