Nodes are not joining the cluster in AWS EKS - amazon-web-services

I have launched a cluster using AWS EKS successfully and applied aws-auth, but the nodes are not joining the cluster. I checked the log messages on a node and found this:
Dec 4 08:09:02 ip-10-0-8-187 kubelet: E1204 08:09:02.760634 3542 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:461: Failed to list *v1.Node: Unauthorized
Dec 4 08:09:03 ip-10-0-8-187 kubelet: W1204 08:09:03.296102 3542 cni.go:171] Unable to update cni config: No networks found in /etc/cni/net.d
Dec 4 08:09:03 ip-10-0-8-187 kubelet: E1204 08:09:03.296217 3542 kubelet.go:2130] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
Dec 4 08:09:03 ip-10-0-8-187 kubelet: E1204 08:09:03.459361 3542 reflector.go:205] k8s.io/kubernetes/pkg/kubelet/kubelet.go:452: Failed to list *v1.Service: Unauthorized
I am not sure what this means. I have attached EKS full access to the instance node roles.

If you are using Terraform, or modifying tags and name variables, make sure the cluster name in the tags matches exactly!
A node must be "owned" by a specific cluster, and nodes will only join the cluster they are tagged for. I overlooked this, and there isn't a lot of documentation to go on when using Terraform, so make sure the variables match. This is the node tag that names the parent cluster to join:
tag {
  key                 = "kubernetes.io/cluster/${var.eks_cluster_name}-${terraform.workspace}"
  value               = "owned"
  propagate_at_launch = true
}
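This tag block typically sits on the workers' aws_autoscaling_group resource; propagate_at_launch = true is what copies the tag onto every launched instance, and the interpolated key must resolve to exactly the name of the EKS cluster the nodes should join.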

If you have followed the AWS getting-started guide, there is an easy way to connect all the worker nodes and join them to the EKS cluster.
Link : https://docs.aws.amazon.com/eks/latest/userguide/getting-started.html
My guess is that you forgot to edit the aws-auth ConfigMap with the instance role ARN.
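For reference, a minimal aws-auth ConfigMap looks roughly like the following (the role ARN below is a placeholder; use the ARN of the instance role attached to your worker nodes), applied with kubectl apply -f aws-auth-cm.yaml as in the guide above:
# Sketch of the aws-auth ConfigMap that lets worker nodes register.
# The rolearn is a placeholder for your node instance role ARN.
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::111122223333:role/eks-node-instance-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
Until this mapping exists for the role the kubelets run under, their API requests are rejected as Unauthorized, which matches the log lines above.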

Related

How to ensure the node and volume availability zones (AZs) within an AWS EKS Cluster remain consistent during node group rolling upgrades?

I'm having trouble ensuring my pods re-connect to their PVs after an AWS EKS node group rolling upgrade. The issue is that the node itself moves from AZ us-west-2b to us-west-2c, but the PVs remain in us-west-2b.
The label on the node is topology.kubernetes.io/zone=us-west-2c and the label on the PV remains topology.kubernetes.io/zone=us-west-2b, so the volume affinity check warning shows up on the spinning pods after the upgrade finishes:
0/1 nodes are available: 1 node(s) had volume node affinity conflict.
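For context, an EBS-backed PV pins itself to its zone through node affinity roughly like this (a trimmed sketch with placeholder names and IDs), which is why the pod can only land on a node in us-west-2b:
# Sketch of the zone pinning an EBS-backed PersistentVolume typically carries;
# the pod can only run on a node whose zone label matches this value.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0a1b2c3d4e5f          # placeholder name
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  awsElasticBlockStore:
    volumeID: vol-0a1b2c3d4e5f    # placeholder EBS volume ID
    fsType: ext4
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: topology.kubernetes.io/zone
              operator: In
              values:
                - us-west-2b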
Per the AWS upgrade docs:
When upgrading the nodes in a managed node group, the upgraded nodes are launched in the same Availability Zone as those that are being upgraded.
But that doesn't seem to be the case. Is there a way I can always enforce the creation of nodes into the same AZ they were in prior to the upgrade?
Note: this is a 1-node AWS EKS Cluster (with a max set to 3), though I don't think that should matter.
Yes, this is possible; you need to enforce the AZ for the node group when you create it. With eksctl you can do it from the CLI:
eksctl create cluster --name=cluster --zones=eu-central-2a,eu-central-2b --node-zones=eu-central-2a
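The same zone pinning can also be expressed in an eksctl config file; a rough sketch (cluster name, node group name, and zones are placeholders) would be:
# Sketch of an eksctl ClusterConfig that keeps the worker nodes in one AZ.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: cluster                 # placeholder cluster name
  region: eu-central-2
availabilityZones: ["eu-central-2a", "eu-central-2b"]
nodeGroups:
  - name: workers               # placeholder node group name
    availabilityZones: ["eu-central-2a"]   # the zone where the PV lives
which you would then pass to eksctl create cluster -f cluster.yaml.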
When using terraform:
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "= 14.0.0"
cluster_version = "1.17"
cluster_name = "cluster-in-one-az"
subnets = ["subnet-a", "subnet-b", "subnet-c"]
worker_groups = [
{
instance_type = "m5.xlarge"
asg_max_size = 5
subnets = ["subnet-a"]
}
]
Here subnet-a belongs to the desired availability zone, the one where your PV was created.

GKE cluster-autoscaler cannot scale up node pool based on node affinity

Prerequisites:
GKE 1.14.x or 1.15.x latest stable
Labeled node pools, created by Deployment Manager
An application that requires a persistent volume in RWO mode
Each application deployment is different, should run at the same time as the others, and with 1 pod per node.
Each pod has no replicas and should support rolling updates (via Helm).
Design:
A Deployment Manager template for the cluster and node pools,
node pools are labeled, and each node gets the same label (after initial creation),
each new app is deployed into a new namespace, which allows it to have a unique service address,
each new release can be a 'new install' or an 'update of an existing one', based on the node label (node labels can be changed with kubectl during install or update of the app).
Problem:
This works normally if the cluster is created from the browser console interface. If the cluster was created by GCP Deployment Manager, the error is (tested on the nginx template from the k8s docs with node affinity, even without a volume attached):
Warning FailedScheduling 17s (x2 over 17s) default-scheduler 0/2 nodes are available: 2 node(s) didn't match node selector.
Normal NotTriggerScaleUp 14s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match node selector
What is the problem? Does Deployment Manager create bad labels?
The affinity used:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node/nodeisbusy
              operator: NotIn
              values:
                - busy
GCP gives two ways to restrict deployments to a node pool or a set of nodes:
Taints & Tolerations
Node Affinity
Below I explain approach #1: a combination of nodeSelector and tolerations to restrict deployments while keeping auto-scaling working.
Here is an example:
Let us say a cluster cluster-x is available.
Let us say it contains two node pools:
project-a-node-pool - Configured to autoscale from 1 to 2 nodes.
project-b-node-pool - Configured to autoscale from 1 to 3 nodes.
Node Pool Labels
Each of the nodes in project-a-node-pool carries the following label by default:
cloud.google.com/gke-nodepool: project-a-node-pool
Each of the nodes in project-b-node-pool carries the following label by default:
cloud.google.com/gke-nodepool: project-b-node-pool
Node Pool Taints
Add taints to each of the node pools. Example commands:
gcloud container node-pools create project-a-node-pool --cluster cluster-x \
    --node-taints project=a:NoExecute
gcloud container node-pools create project-b-node-pool --cluster cluster-x \
    --node-taints project=b:NoExecute
Snapshot of Taints configured for project-a-node-pool
Deployment Tolerations
Add the tolerations matching the taint to the deployment YAML file:
tolerations:
  - key: "project"
    operator: "Equal"
    value: "a"          # use "b" for project-b-node-pool
    effect: "NoExecute"
Test with deployments
Try new deployments and check whether each one is scheduled according to its taint/toleration pair. Deployments with toleration value a should go to project-a-node-pool, and deployments with toleration value b should go to project-b-node-pool.
Once the memory/CPU requests in either node pool reach its capacity, newer deployments should trigger auto-scaling within that node pool.

user data for managed node group

How can I edit the user data of the managed node group that is part of my EKS cluster?
I tried to create a new version of the Launch Template that EKS created, but I get an error under the cluster's "Health issues": Ec2LaunchTemplateVersionMismatch.
I want the nodes in the managed node group to mount the EFS automatically, without doing it manually on each instance, because of the autoscaling.

How to prevent my EC2 instances from automatically rebooting every time one has stopped?

UPDATED
Following the AWS Instance Scheduler, I've been able to set up a scheduler that starts and stops the instances at the beginning and end of the day.
However, the instances keep being terminated and reinstalled.
I have an Amazon Elastic Kubernetes Service (EKS) cluster, and I discovered the following log in CloudWatch:
2019-11-21 - 13:05:30.251 - INFO : Handler SchedulerRequestHandler scheduling request for service(s) rds, account(s) 612681954602, region(s) eu-central-1 at 2019-11-21 13:05:30.251936
2019-11-21 - 13:05:30.433 - INFO : Running RDS scheduler for account 612681954602 in region(s) eu-central-1
2019-11-21 - 13:05:31.128 - INFO : Fetching rds Instances for account 612681954602 in region eu-central-1
2019-11-21 - 13:05:31.553 - INFO : Number of fetched rds Instances is 2, number of schedulable resources is 0
2019-11-21 - 13:05:31.553 - INFO : Scheduler result {'612681954602': {'started': {}, 'stopped': {}}}
I don't know if it is my EKS that keeps rebooting my instances, but I really would love to keep them stopped until the next day.
How can I prevent my EC2 instances from automatically rebooting every time one has stopped? Or, even better, how can I deactivate my EKS stack automatically?
Update:
I discovered that EKS has a Cluster Autoscaler. Maybe this could be where the problem lies?
https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html
An EKS node group creates an Auto Scaling group to manage the worker nodes. You specify the minimum, maximum, and desired number of worker nodes, and once any instance is stopped, the Auto Scaling group creates a new instance to match the desired size.
See the following doc for details:
https://docs.aws.amazon.com/eks/latest/userguide/launch-workers.html
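The practical upshot is that stopping the EC2 instances directly will never stick; the node group's minimum and desired sizes have to be lowered instead. As a rough sketch, assuming the workers were created as an eksctl-managed node group (cluster name, group name, and region are placeholders), the sizes could be set to zero outside working hours:
# Sketch: size the worker group to zero so the Auto Scaling group terminates
# the nodes rather than replacing stopped ones. Names are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: my-cluster            # placeholder cluster name
  region: eu-central-1
managedNodeGroups:
  - name: workers             # placeholder node group name
    minSize: 0                # allow the group to scale to zero
    desiredCapacity: 0        # no running workers outside working hours
    maxSize: 3
For self-managed worker nodes, lowering the Auto Scaling group's minimum and desired capacity to 0 directly (console or CLI) has the same effect.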

Spark Cluster on EC2 - "ssh-ready" state prompts for a password

I am trying to create a Spark cluster on EC2 with the following command
(I am referring to the Apache documentation):
./spark-ec2 --key-pair=spark-cluster --identity-file=/Users/abc/spark-cluster.pem --slaves=3 --region=us-west-1 --zone=us-west-1c --vpc-id=vpc-2e44594 --subnet-id=subnet-18447841 --spark-version=1.6.1 launch spark-cluster
Once I run the above command, the master and slaves get created, but when the process reaches the 'ssh-ready' state it keeps waiting for a password.
Below is the trace. I have referred to the official Apache documentation and many other documents/videos, and none of them asked for a password. I am not sure whether I am missing something; any pointer on this issue is much appreciated.
Creating security group spark-cluster-master
Creating security group spark-cluster-slaves
Searching for existing cluster spark-cluster in region us-west-1...
Spark AMI: ami-1a250d3e
Launching instances...
Launched 3 slaves in us-west-1c, regid = r-32249df4
Launched master in us-west-1c, regid = r-5r426bar
Waiting for AWS to propagate instance metadata...
Waiting for cluster to enter 'ssh-ready' state..........Password:
I modified the spark-ec2.py script to include the proxy and enabled the AWS NAT to allow the outbound calls.