eksctl: Updating node definitions via cluster config file not working

I am using eksctl to create our EKS cluster.
For the first run it works fine, but when I want to update the cluster config later, it doesn't work.
I have a cluster-config file, but any changes made to it are not reflected by the update/upgrade commands.
What am I missing?
Cluster.yaml:
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: supplier-service
  region: eu-central-1
vpc:
  subnets:
    public:
      eu-central-1a: {id: subnet-1}
      eu-central-1b: {id: subnet-2}
      eu-central-1c: {id: subnet-2}
nodeGroups:
  - name: ng-1
    instanceType: t2.medium
    desiredCapacity: 3
    ssh:
      allow: true
    securityGroups:
      withShared: true
      withLocal: true
      attachIDs: ['sg-1', 'sg-2']
    iam:
      withAddonPolicies:
        autoScaler: true
Now, if in the future I would like to change the instance type or the desired capacity, I have to destroy the entire cluster and recreate it, which becomes quite cumbersome.
How can I do in-place upgrades of clusters created by eksctl? Thank you.

I'm looking into the exact same issue as yours.
After a lot of searching, I found that it is not yet possible to upgrade an existing node group in place in EKS.
First, eksctl update has been deprecated. When I executed eksctl update --help, it gave a warning like this:
DEPRECATED: use 'upgrade cluster' instead. Upgrade control plane to the next version.
Second, as mentioned in this GitHub issue and the eksctl documentation, as of now eksctl upgrade nodegroup is used only for upgrading the Kubernetes version of a managed node group.
So unfortunately, you'll have to create a new node group to apply your changes, migrate your workload / switch your traffic to the new node group, and decommission the old one. In your case, it's not necessary to nuke the entire cluster and recreate it.
If you're looking for a seamless upgrade/migration with minimal or zero downtime, I suggest you try a managed node group, whose graceful draining of workloads looks promising:
Node updates and terminations gracefully drain nodes to ensure that your applications stay available.
Note: in your config file above, if you specify nodeGroups rather than managedNodeGroups, an unmanaged node group will be provisioned.
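As an illustration, a minimal managed variant of that node group could look like the sketch below (values carried over from the question's config; managed-ng-1 is a hypothetical name):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: supplier-service
  region: eu-central-1
managedNodeGroups:
  - name: managed-ng-1        # hypothetical replacement for the unmanaged ng-1
    instanceType: t2.medium
    desiredCapacity: 3        # managed node groups drain gracefully on updates/terminations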
However, don't lose hope. There is an active issue in the eksctl GitHub repository asking for an eksctl apply option. At this stage it has not been released yet, but it would be really nice if this came true.

To upgrade the cluster using eksctl, there are three steps (command sketches follow below):
Upgrade the control plane version
Upgrade coredns, kube-proxy and aws-node
Upgrade the worker nodes
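For reference, the first two steps roughly map to these eksctl commands (cluster name taken from the config in the question; treat this as a sketch and check the flags against your eksctl version):

eksctl upgrade cluster --name=supplier-service --approve            # control plane
eksctl utils update-coredns --cluster=supplier-service --approve    # default add-ons
eksctl utils update-kube-proxy --cluster=supplier-service --approve
eksctl utils update-aws-node --cluster=supplier-service --approve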
If you just want to update the nodegroup and keep the same configuration, you can simply change the nodegroup name, e.g. append -v2 to it. [0]
If you want to change the node group configuration, such as the instance type, you need to create a new node group: eksctl create nodegroup --config-file=dev-cluster.yaml [1] A replacement flow is sketched after the references.
[0] https://eksctl.io/usage/cluster-upgrade/#updating-multiple-nodegroups-with-config-file
[1] https://eksctl.io/usage/managing-nodegroups/#creating-a-nodegroup-from-a-config-file
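Putting [0] and [1] together, a replacement flow for an unmanaged nodegroup might look like this sketch: rename ng-1 in the config file first (e.g. to ng-1-v2) and change its instance type there, then:

eksctl create nodegroup --config-file=dev-cluster.yaml                            # creates the renamed nodegroup
eksctl delete nodegroup --config-file=dev-cluster.yaml --only-missing             # dry run: lists nodegroups no longer in the config
eksctl delete nodegroup --config-file=dev-cluster.yaml --only-missing --approve   # drains and deletes the old nodegroup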

Related

How do I create an EKS cluster with nodes via CDK?

I'm able to deploy a Kubernetes Fargate cluster via CDK on my desired VPC:
const vpc = ec2.Vpc.fromLookup(this, 'vpc', {
  vpcId: 'vpc-abcdefg'
})

const cluster = new eks.FargateCluster(this, 'sample-eks', {
  version: eks.KubernetesVersion.V1_21,
  vpc,
})

cluster.addNodegroupCapacity('node-group-capacity', {
  minSize: 2,
  maxSize: 2,
})
However, there are no nodes within this cluster:
$ kubectl config get-clusters
NAME
minikube
arn:aws:eks:us-east-1:<account_number>:cluster/<cluster_name>
$ kubectl get nodes
No resources found
I'm very confused as to why this is happening, as I thought the addNodegroupCapacity method is supposed to add nodes to the cluster. I think I can add nodes after the fact via eksctl, but I was wondering if it's possible to deploy with nodes via CDK.
My mistake was not adding a role/user with sufficient permissions to the aws-auth ConfigMap. This meant that the cluster did not have proper permissions to create nodes. The following fixed my issue:
const role = iam.Role.fromRoleName(this, 'admin-role', '<my-admin-role>');
cluster.awsAuth.addRoleMapping(role, { groups: [ 'system:masters' ]});
The <my-admin-role> argument is the name of the role that I assume when I log in to AWS. I found it by running aws sts get-caller-identity, which returns a JSON doc that provides your assumed role's ARN. For me it was arn:aws:sts::<account-number>:assumed-role/<my-admin-role>/<my-username>.
This also resolved another issue, as I was not able to interact with the cluster via kubectl. I would get the following error message: error: You must be logged in to the server (Unauthorized). Adding my assumed role to the aws-auth ConfigMap gave me permission to access the cluster via my terminal.
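If it helps, the mapping that ends up in the cluster can be double-checked with plain kubectl (nothing CDK-specific here):

kubectl -n kube-system get configmap aws-auth -o yaml   # shows the role/user mappings that were written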
I'm not sure why I haven't seen this bit of configuration in the tutorials I've used; I'd appreciate any comments that could help explain it.

GKE cluster-autoscaler cannot scale up node pool based on node affinity

Prerequisites:
GKE 1.14.x or 1.15.x, latest stable
Labeled node pools, created by Deployment Manager
An application which requires a persistent volume in RWO mode
Each application deployment is different, should run at the same time as the others, and in a one-pod-per-node arrangement
Each pod has no replicas and should support rolling updates (via Helm).
Design:
A Deployment Manager template for the cluster and node pools,
node pools are labeled, each node has the same label (after initial creation),
each new app is deployed into a new namespace, which allows it to have a unique service address,
each new release can be a 'new install' or an 'update of an existing one', based on the node label (node labels can be changed by kubectl during install or update of the app).
Problem:
This works normally if the cluster is created from the browser console. If the cluster was created by GCP Deployment Manager, the error is (tested with the nginx template from the k8s docs with node affinity, even without a volume attached):
Warning FailedScheduling 17s (x2 over 17s) default-scheduler 0/2 nodes are available: 2 node(s) didn't match node selector.
Normal NotTriggerScaleUp 14s cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 2 node(s) didn't match node selector
What is the problem? Does Deployment Manager create bad labels?
Affinity used:
# affinity:
#   nodeAffinity:
#     requiredDuringSchedulingIgnoredDuringExecution:
#       nodeSelectorTerms:
#       - matchExpressions:
#         - key: node/nodeisbusy
#           operator: NotIn
#           values:
#           - busy
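One quick way to check whether Deployment Manager actually put the expected label on the nodes (label key taken from the affinity above):

kubectl get nodes --show-labels          # all labels on every node
kubectl get nodes -L node/nodeisbusy     # adds a column with this label's value per node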
GCP gives two ways to restrict deployments to a node pool or a set of nodes:
Taints & Tolerations
Node Affinity
I am explaining approach #1 below: a combination of nodeSelector and tolerations to restrict deployments while keeping auto-scaling working.
Here is an example:
Let us say a cluster cluster-x is available.
Let us say it contains two node pools
project-a-node-pool - Configured to autoscale from 1 to 2 nodes.
project-b-node-pool - Configured to autoscale from 1 to 3 nodes.
Node Pool Labels
Each of the nodes in project-a-node-pool carries the following label, which is configured by default:
cloud.google.com/gke-nodepool: project-a-node-pool
Each of the nodes in project-b-node-pool carries the following label, which is configured by default:
cloud.google.com/gke-nodepool: project-b-node-pool
Node Pool Taints
Add taints to each of the node pools. Example commands:
gcloud container node-pools create project-a-node-pool --cluster cluster-x \
  --node-taints project=a:NoExecute
gcloud container node-pools create project-b-node-pool --cluster cluster-x \
  --node-taints project=b:NoExecute
[Screenshot: taints configured for project-a-node-pool]
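Since the screenshot is not reproduced here, the same information can be checked from the command line (names from the example above; look for the Taints line / taints field in the output):

kubectl describe nodes | grep -i taints
gcloud container node-pools describe project-a-node-pool --cluster cluster-x    # may also need --zone or --region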
Deployment Tolerations
Add the tolerations matching the taint to the deployment YAML file:
tolerations:
- key: "project"
  operator: "Equal"
  value: "a"          # or "b" for project-b-node-pool
  effect: "NoExecute"
Test with deployments
Try new deployments and check whether each deployment lands on the expected node pool according to its taint/toleration pair. Deployments with toleration value a should go to project-a-node-pool; deployments with toleration value b should go to project-b-node-pool.
Once sufficient memory/CPU requests are reached in either node pool, newer deployments should trigger auto-scaling within that node pool.
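To make the combination concrete, here is a minimal Deployment sketch; the name project-a-app and the nginx image are placeholders, while the label and toleration values come from the example above:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: project-a-app                  # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels: {app: project-a-app}
  template:
    metadata:
      labels: {app: project-a-app}
    spec:
      nodeSelector:
        cloud.google.com/gke-nodepool: project-a-node-pool   # default GKE node pool label
      tolerations:
      - key: "project"
        operator: "Equal"
        value: "a"                     # "b" for project-b-node-pool
        effect: "NoExecute"
      containers:
      - name: app
        image: nginx                   # placeholder image
        resources:
          requests: {cpu: "250m", memory: "256Mi"}   # requests drive the autoscaler's fit check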

AWS EKS kubectl - No resources found in default namespace

Trying to set up an EKS cluster.
An error occurred (AccessDeniedException) when calling the DescribeCluster operation: Account xxx is not authorized to use this service. This error came from the CLI; on the console I was able to create the cluster and everything else successfully.
I am logged in as the root user (its just my personal account).
It says Account, so it sounds like it's not a user/permissions issue?
Do I have to enable my account for this service? I don't see any such option.
Also, if I log in as a user (rather than root), will I be able to see everything that was earlier created as root? I have now created a user and assigned admin and eks* permissions. I checked: when I sign in as the user, I can see everything.
The AWS CLI was set up with root credentials (I think), so do I have to go back and undo/fix all this and just use this user?
Update 1
I redid/restarted everything, including the user and aws configure, just to make sure, but the issue still did not get resolved.
There is an option to create the kubeconfig file manually; that finally worked.
And I was able to run kubectl get svc:
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.100.0.1   <none>        443/TCP   48m
KUBECONFIG: I had set the KUBECONFIG environment variable:
$env:KUBECONFIG="C:\Users\sbaha\.kube\config-EKS-nginixClstr"
$Env:KUBECONFIG
C:\Users\sbaha\.kube\config-EKS-nginixClstr
kubectl config get-contexts
CURRENT   NAME   CLUSTER      AUTHINFO   NAMESPACE
*         aws    kubernetes   aws
kubectl config current-context
aws
My understanding is that I should see both the aws and my EKS-nginixClstr contexts, but I only see aws. Is this (also) an issue?
The next step is to create and add worker nodes. I updated the node ARN correctly in the .yaml file and ran: kubectl apply -f ~\.kube\aws-auth-cm.yaml
It returned configmap/aws-auth configured, so this perhaps worked.
But next it fails:
kubectl get nodes returns No resources found in default namespace.
On the AWS console the node group shows Create Completed. Also, on the CLI kubectl get nodes --watch does not even return.
So this has to be debugged next (it never ends).
aws-auth-cm.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::arn:aws:iam::xxxxx:role/Nginix-NodeGrpClstr-NodeInstanceRole-1SK61JHT0JE4
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
This problem was related to not having the correct version of eksctl: it must be at least 0.7.0. The documentation states this and I knew it, but initially whatever I did could not get beyond 0.6.0. The way to get it is to configure your AWS CLI with a region that supports EKS. Once you have 0.7.0, this issue gets resolved.
Overall, to make EKS work you must use the same user on both the console and the CLI, work in a region that supports EKS, and have the correct eksctl version (at least 0.7.0).
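For anyone following along, a few commands that are useful when verifying this setup (the region comes from the cluster ARN in the question; the cluster name is a placeholder):

eksctl version                                                       # must be at least 0.7.0 here
aws configure set region us-east-1                                   # pick a region that supports EKS
aws eks update-kubeconfig --name <cluster_name> --region us-east-1   # writes/updates the kubeconfig
kubectl get nodes --watch                                            # nodes should appear once they join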

Attempting to share cluster access results in creating new cluster

I have deleted the config file I used when experimenting with Kubernetes on my AWS (using this tutorial) and replaced it with another dev's config file from when they set up Kubernetes on a shared AWS (using this). When I run kubectl config view I see the following above the users section:
- cluster:
    certificate-authority-data: REDACTED
    server: <removed>
  name: aws_kubernetes
contexts:
- context:
    cluster: aws_kubernetes
    user: aws_kubernetes
  name: aws_kubernetes
current-context: aws_kubernetes
This leads me to believe that my config should be pointing at our shared AWS cluster, but whenever I run cluster/kube-up.sh it creates a new GCE cluster, so I'm thinking I'm using the wrong command to spin up the cluster on AWS.
Am I using the wrong command, missing a flag, etc.? Additionally, I'm thinking kube-up.sh creates a new cluster instead of reusing a previously instantiated one.
If you are sharing the cluster, you shouldn't need to run kube-up.sh (that script only needs to be run once to initially create a cluster). Once a cluster exists, you can use the standard kubectl commands to interact with it. Try starting with kubectl get nodes to verify that your configuration file has valid credentials and you see the expected AWS nodes printed in the output.
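For example, with the shared kubeconfig shown above, something like this should be enough to start working against the existing cluster, with no kube-up.sh involved:

kubectl config use-context aws_kubernetes   # context name from the shared kubeconfig
kubectl get nodes                           # should list the shared cluster's AWS nodes
kubectl get pods --all-namespaces           # sanity check that workloads are visible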

Unreliable discovery for elasticsearch nodes on ec2

I'm using elasticsearch (0.90.x) with the cloud-aws plugin. Sometimes nodes running on different machines aren't able to discover each other ("waited for 30s and no initial state was set by the discovery"). I've set "discovery.ec2.ping_timeout" to "15s", but this doesn't seem to help. Are there other settings that might make a difference?
discovery:
  type: ec2
  ec2:
    ping_timeout: 15s
Not sure if you are aware of this blog post: http://www.elasticsearch.org/tutorials/elasticsearch-on-ec2/. It explains the plugin settings in depth.
Adding a cluster name, like so,
cluster.name: your_cluster_name
discovery:
  type: ec2
  ...
might help.
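Putting the question's settings and the suggested cluster name together, a minimal elasticsearch.yml for this setup might look like the following sketch (your_cluster_name is a placeholder and must be identical on every node):

cluster.name: your_cluster_name
discovery:
  type: ec2
  ec2:
    ping_timeout: 15s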