How to trigger auto-scaling for EKS pods? - amazon-web-services

Context
I am running an application (Apache Airflow) on EKS that spins up new workers to fulfill new tasks. Every worker is required to spin up a new pod. I am afraid of running out of memory and/or CPU when several workers are being spawned. My objective is to trigger auto-scaling.
What I have tried
I am using Terraform for provisioning (also happy to have answers that are not in Terraform, which I can conceptually translate into Terraform code).
I have set up a Fargate profile like:
# Create EKS Fargate profile
resource "aws_eks_fargate_profile" "airflow" {
  cluster_name           = module.eks_cluster.cluster_id
  fargate_profile_name   = "${var.project_name}-fargate-${var.env_name}"
  pod_execution_role_arn = aws_iam_role.fargate_iam_role.arn
  subnet_ids             = var.private_subnet_ids

  selector {
    namespace = "fargate"
  }

  tags = {
    Terraform   = "true"
    Project     = var.project_name
    Environment = var.env_name
  }
}
My policy for auto scaling the nodes:
# Create IAM Policy for node autoscaling
resource "aws_iam_policy" "node_autoscaling_pol" {
  name   = "${var.project_name}-node-autoscaling-${var.env_name}"
  policy = data.aws_iam_policy_document.node_autoscaling_pol_doc.json
}

# Create autoscaling policy
data "aws_iam_policy_document" "node_autoscaling_pol_doc" {
  statement {
    actions = [
      "autoscaling:DescribeAutoScalingGroups",
      "autoscaling:DescribeAutoScalingInstances",
      "autoscaling:DescribeLaunchConfigurations",
      "autoscaling:DescribeTags",
      "autoscaling:SetDesiredCapacity",
      "autoscaling:TerminateInstanceInAutoScalingGroup",
      "ec2:DescribeLaunchTemplateVersions"
    ]
    effect    = "Allow"
    resources = ["*"]
  }
}
And finally the cluster module itself (just a snippet for brevity):
# Create EKS Cluster
module "eks_cluster" {
  cluster_name = "${var.project_name}-${var.env_name}"

  # Assigning worker groups
  worker_groups = [
    {
      instance_type = var.nodes_instance_type_1
      asg_max_size  = 1
      name          = "${var.project_name}-${var.env_name}"
    }
  ]
}
Question
Is increasing asg_max_size sufficient for auto-scaling? I have a feeling that I need to set something along the lines of "when memory exceeds X, do Y", but I am not sure.
I don't have much experience with advanced monitoring/metrics tools, so a somewhat simple solution that does basic auto-scaling would be the best fit for my needs = )

This is handled by a tool called cluster-autoscaler. You can find the EKS guide for it at https://docs.aws.amazon.com/eks/latest/userguide/cluster-autoscaler.html or the project itself at https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
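For a concrete starting point, here is a rough sketch (not a drop-in config) of how the worker group from the snippet above could be tagged so cluster-autoscaler's auto-discovery can find and scale it. The exact shape of worker_groups and its tags list depends on the version of the EKS module you are using; the tag keys follow the cluster-autoscaler auto-discovery convention.
# Sketch only: asg_max_size alone does not trigger scaling. cluster-autoscaler
# raises the ASG's desired capacity (up to asg_max_size) when pods cannot be
# scheduled, and scales back down when nodes are underutilized.
module "eks_cluster" {
  cluster_name = "${var.project_name}-${var.env_name}"

  worker_groups = [
    {
      instance_type = var.nodes_instance_type_1
      asg_min_size  = 1
      asg_max_size  = 5
      name          = "${var.project_name}-${var.env_name}"

      # Tags used by cluster-autoscaler's --node-group-auto-discovery flag
      tags = [
        {
          key                 = "k8s.io/cluster-autoscaler/enabled"
          value               = "true"
          propagate_at_launch = true
        },
        {
          key                 = "k8s.io/cluster-autoscaler/${var.project_name}-${var.env_name}"
          value               = "owned"
          propagate_at_launch = true
        }
      ]
    }
  ]
}
You still need to deploy cluster-autoscaler itself into the cluster (for example via its Helm chart) and point its auto-discovery at those tag keys, as described in the EKS guide linked above; the actions in your node_autoscaling_pol_doc look like the standard permissions it needs.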

Related

How to set AWS EKS nodes to use gp3

I'm trying to set my EKS nodes to use gp3 as the volume type. They are using the default gp2, but I would like to change it to gp3. I'm using Terraform to build the infrastructure with the aws_eks_cluster resource (I'm not using the "eks" module). Here is a simple snippet:
resource "aws_eks_cluster" "cluster" {
name = var.name
role_arn = aws_iam_role.cluster.arn
version = var.k8s_version
}
resource "aws_eks_node_group" "cluster" {
capacity_type = var.node_capacity_type
cluster_name = aws_eks_cluster.cluster.name
disk_size = random_id.node_group.keepers.node_disk
instance_types = split(",", random_id.node_group.keepers.node_type)
node_group_name = "${var.name}-${local.availability_zones[count.index]}-${random_id.node_group.hex}"
node_role_arn = random_id.node_group.keepers.role_arn
subnet_ids = [var.private ? aws_subnet.private[count.index].id : aws_subnet.public[count.index].id]
version = var.k8s_version
}
I tried to set up the kubernetes_storage_class resource, but that only changes the volumes used by the pods (PV/PVC). I would like to change the nodes' volumes to gp3.
I couldn't find how to do that in the documentation or on GitHub. Was anyone able to do this?
Thanks.
You can try to set up your own launch template and then reference it in the aws_eks_node_group launch_template argument.
A launch template allows you to configure the disk type. AWS provides a guide on how to write a launch template correctly.
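For illustration, a minimal sketch of that approach for a managed node group (resource and variable names such as aws_iam_role.node and the subnets are placeholders, not from the question). Note that disk_size and remote_access cannot be set on aws_eks_node_group once a launch template is supplied.
resource "aws_launch_template" "node" {
  name_prefix = "${var.name}-node-"

  # Root volume for the worker nodes; /dev/xvda is the root device name
  # for the Amazon Linux 2 EKS-optimized AMI.
  block_device_mappings {
    device_name = "/dev/xvda"
    ebs {
      volume_size = 50
      volume_type = "gp3"
    }
  }
}

resource "aws_eks_node_group" "cluster" {
  cluster_name    = aws_eks_cluster.cluster.name
  node_group_name = "${var.name}-gp3"
  node_role_arn   = aws_iam_role.node.arn      # placeholder role
  subnet_ids      = aws_subnet.private[*].id   # placeholder subnets
  instance_types  = ["t3.medium"]
  version         = var.k8s_version

  # The launch template replaces disk_size (and remote_access) here.
  launch_template {
    id      = aws_launch_template.node.id
    version = aws_launch_template.node.latest_version
  }

  scaling_config {
    desired_size = 2
    max_size     = 2
    min_size     = 2
  }
}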

Why EKS can't issue certificate to kubelet after nodepool creation?

When I create an EKS cluster with a single node pool using Terraform, I'm facing a kubelet certificate problem, i.e. the CSRs are stuck in the Pending state like this:
NAME        AGE     SIGNERNAME                      REQUESTOR          REQUESTEDDURATION   CONDITION
csr-8qmz5   4m57s   kubernetes.io/kubelet-serving   kubernetes-admin   <none>              Pending
csr-mq9rx   5m      kubernetes.io/kubelet-serving   kubernetes-admin   <none>              Pending
As we can see, the REQUESTOR here is kubernetes-admin, and I'm really not sure why.
My Terraform code for the cluster itself:
resource "aws_eks_cluster" "eks" {
name = var.eks_cluster_name
role_arn = var.eks_role_arn
version = var.k8s_version
vpc_config {
endpoint_private_access = "true"
endpoint_public_access = "true"
subnet_ids = var.eks_public_network_ids
security_group_ids = var.eks_security_group_ids
}
kubernetes_network_config {
ip_family = "ipv4"
service_ipv4_cidr = "10.100.0.0/16"
}
}
Terraform code for nodegroup:
resource "aws_eks_node_group" "aks-NG" {
depends_on = [aws_ec2_tag.eks-subnet-cluster-tag, aws_key_pair.eks-deployer]
cluster_name = aws_eks_cluster.eks.name
node_group_name = "aks-dev-NG"
ami_type = "AL2_x86_64"
node_role_arn = var.eks_role_arn
subnet_ids = var.eks_public_network_ids
capacity_type = "ON_DEMAND"
instance_types = var.eks_nodepool_instance_types
disk_size = "50"
scaling_config {
desired_size = 2
max_size = 2
min_size = 2
}
tags = {
Name = "${var.eks_cluster_name}-node"
"kubernetes.io/cluster/${var.eks_cluster_name}" = "owned"
}
remote_access {
ec2_ssh_key = "eks-deployer-key"
}
}
Per my understanding, it's a very basic configuration.
Now, when I create the cluster and node group via the AWS Management Console with exactly the SAME parameters, i.e. the cluster and node group IAM roles are the same as for Terraform, everything is fine:
NAME        AGE     SIGNERNAME                      REQUESTOR                                     REQUESTEDDURATION   CONDITION
csr-86qtg   6m20s   kubernetes.io/kubelet-serving   system:node:ip-172-31-201-140.ec2.internal    <none>              Approved,Issued
csr-np42b   6m43s   kubernetes.io/kubelet-serving   system:node:ip-172-31-200-199.ec2.internal    <none>              Approved,Issued
But here the certificate requestor is the node itself (per my understanding). So I would like to know: what is the problem here? Why is the requestor different in this case, what is the difference between creating these resources from the AWS Management Console and using Terraform, and how do I manage this issue? Please help.
UPD.
I found that this problem appears when I create the cluster using Terraform via an assumed role created for Terraform.
When I create the cluster using Terraform with regular IAM user credentials with the same permission set, everything is fine.
It doesn't give any answer regarding the root cause, but still, it's something to consider.
Right now it seems like a weird EKS bug.
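For reference, "via an assumed role" above means an AWS provider configuration along these lines (the role ARN is hypothetical), as opposed to plain IAM user credentials:
provider "aws" {
  region = "us-east-1"

  # Terraform authenticates as this assumed role rather than as the IAM user,
  # so EKS sees the role as the identity that created the cluster.
  assume_role {
    role_arn     = "arn:aws:iam::123456789012:role/terraform-deployer"
    session_name = "terraform"
  }
}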

EC2 instance created using terraform with autoscaling group not added to ECS cluster

TL;DR: Does my EC2 instance need an IAM role to be added to my ECS cluster? If so, how do I set that?
I have an EC2 instance created using an autoscaling group (ASG definition here). I also have an ECS cluster, whose name is set on the spawned instances via user_data. I've confirmed that /etc/ecs/ecs.config on the running instance looks correct:
ECS_CLUSTER=my_cluster
However, the instance never appears in the cluster, so the service task doesn't run. There are tons of questions on SO about this, and I've been through them all. The instances are in a public subnet and have access to the internet. The error in ecs-agent.log is:
Error getting ECS instance credentials from default chain: NoCredentialProviders: no valid providers in chain.
So I am guessing that the problem is that the instance has no IAM role associated with it. But I confess that I am a bit confused about all the various "roles" and "services" involved. Does this look like a problem?
If that's it, where do I set this? I'm using Cloud Posse modules. The docs say I shouldn't set a service_role_arn on a service task if I'm using "awsvpc" as the networking mode, but I am not sure whether I should be using a different mode for this setup (multiple containers running as tasks on a single EC2 instance). Also, there are several other roles I could configure here. The ECS service task looks like this:
module "ecs_alb_service_task" {
source = "cloudposse/ecs-alb-service-task/aws"
# Cloud Posse recommends pinning every module to a specific version
version = "0.62.0"
container_definition_json = jsonencode([for k, def in module.flask_container_def : def.json_map_object])
name = "myapp-web"
security_group_ids = [module.sg.id]
ecs_cluster_arn = aws_ecs_cluster.default.arn
task_exec_role_arn = [aws_iam_role.ec2_task_execution_role.arn]
launch_type = "EC2"
alb_security_group = module.sg.name
vpc_id = module.vpc.vpc_id
subnet_ids = module.subnets.public_subnet_ids
network_mode = "awsvpc"
desired_count = 1
task_memory = (512 * 3)
task_cpu = 1024
deployment_controller_type = "ECS"
enable_all_egress_rule = false
health_check_grace_period_seconds = 10
deployment_minimum_healthy_percent = 50
deployment_maximum_percent = 200
ecs_load_balancers = [{
container_name = "web"
container_port = 80
elb_name = null
target_group_arn = module.alb.default_target_group_arn
}]
}
And here's the policy for the ec2_task_execution_role:
data "aws_iam_policy_document" "ec2_task_execution_role" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["ecs-tasks.amazonaws.com"]
}
}
}
Update: Here is the rest of the declaration of the task execution role:
resource "aws_iam_role" "ec2_task_execution_role" {
name = "${var.project_name}_ec2_task_execution_role"
assume_role_policy = data.aws_iam_policy_document.ec2_task_execution_role.json
tags = {
Name = "${var.project_name}_ec2_task_execution_role"
Project = var.project_name
}
}
resource "aws_iam_role_policy_attachment" "ec2_task_execution_role" {
role = aws_iam_role.ec2_task_execution_role.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
# Create a policy for the EC2 role to use Session Manager
resource "aws_iam_role_policy" "ec2_role_policy" {
name = "${var.project_name}_ec2_role_policy"
role = aws_iam_role.ec2_task_execution_role.id
policy = jsonencode({
"Version" : "2012-10-17",
"Statement" : [
{
"Effect" : "Allow",
"Action" : [
"ssm:DescribeParameters",
"ssm:GetParametersByPath",
"ssm:GetParameters",
"ssm:GetParameter"
],
"Resource" : "*"
}
]
})
}
Update 2: The EC2 instances are created by the Auto-Scaling Group, see here for my code. The ECS cluster is just this:
# Create the ECS cluster
resource "aws_ecs_cluster" "default" {
  name = "${var.project_name}_cluster"

  tags = {
    Name    = "${var.project_name}_cluster"
    Project = var.project_name
  }
}
I was expecting there to be something like instance_role in the ec2-autoscaling-group module, but there isn't.
You need to set the EC2 instance profile (IAM instance role) via the iam_instance_profile_name setting in the module "autoscale_group".
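As a sketch of what that could look like (role and resource names here are illustrative, and the module inputs other than iam_instance_profile_name are elided): create a role that EC2 itself can assume, attach the AWS-managed AmazonEC2ContainerServiceforEC2Role policy so the ECS agent can register the instance, wrap it in an instance profile, and hand the profile name to the ASG module.
# Instance role assumed by EC2 (service ec2.amazonaws.com), distinct from the
# task execution role above, which is assumed by ecs-tasks.amazonaws.com.
data "aws_iam_policy_document" "ec2_instance_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ec2.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ec2_instance_role" {
  name               = "${var.project_name}_ec2_instance_role"
  assume_role_policy = data.aws_iam_policy_document.ec2_instance_role.json
}

# Lets the ECS agent on the instance register with the cluster.
resource "aws_iam_role_policy_attachment" "ecs_agent" {
  role       = aws_iam_role.ec2_instance_role.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role"
}

resource "aws_iam_instance_profile" "ecs_instances" {
  name = "${var.project_name}_ecs_instances"
  role = aws_iam_role.ec2_instance_role.name
}

module "autoscale_group" {
  source = "cloudposse/ec2-autoscale-group/aws"
  # ... existing launch template / ASG inputs ...

  # Attaches the instance profile (and therefore the role) to spawned instances.
  iam_instance_profile_name = aws_iam_instance_profile.ecs_instances.name
}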

How do I launch a Beanstalk environment with HealthChecks as "EC2 and ELB" and health_check_grace_time as 1500 using terraform?

I have started learning about Terraform recently and wanted to create an environment using the above-stated settings. When I run the code below I get two resources deployed: one is the Beanstalk environment and the other is an Auto Scaling group (ASG). The ASG has the desired settings but is not linked with the Beanstalk environment, hence I am trying to connect these two.
(I copy the Beanstalk ID from the Tags section, then head over to ASG under EC2, search for the same, and look at the health check section.)
resource "aws_autoscaling_group" "example" {
launch_configuration = aws_launch_configuration.as_conf.id
min_size = 2
max_size = 10
availability_zones = [ "us-east-1a" ]
health_check_type = "ELB"
health_check_grace_period = 1500
tag {
key = "Name"
value = "terraform-asg-example"
propagate_at_launch = true
}
}
provider "aws" {
region = "us-east-1"
}
resource "aws_elastic_beanstalk_application" "application" {
name = "Test-app"
}
resource "aws_elastic_beanstalk_environment" "environment" {
name = "Test-app"
application = aws_elastic_beanstalk_application.application.name
solution_stack_name = "64bit Windows Server Core 2019 v2.5.6 running IIS 10.0"
setting {
namespace = "aws:autoscaling:launchconfiguration"
name = "IamInstanceProfile"
value = "aws-elasticbeanstalk-ec2-role"
}
setting {
namespace = "aws:autscaling"
}
}
resource "aws_launch_configuration" "as_conf" {
name = "web_config_shivanshu"
image_id = "ami-2757f631"
instance_type = "t2.micro"
lifecycle {
create_before_destroy = true
}
}
You do not create an ASG or launch configuration/template outside of the Elastic Beanstalk environment and join them together, because there are config options that are not available that way. For example, gp3 SSD is available as part of a launch template but not available as part of Elastic Beanstalk yet.
What you want to do is remove the resources
resource "aws_launch_configuration" "as_conf"
resource "aws_autoscaling_group" "example"
and then utilise the setting {} block a lot more within resource "aws_elastic_beanstalk_environment" "environment".
Here is a list of all the settings you can describe in the setting block: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options-general.html
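To illustrate the pattern with options that do appear in that list (the values here are only examples), ASG sizing is expressed as setting blocks on the environment itself rather than as a separate aws_autoscaling_group:
resource "aws_elastic_beanstalk_environment" "environment" {
  name                = "Test-app"
  application         = aws_elastic_beanstalk_application.application.name
  solution_stack_name = "64bit Windows Server Core 2019 v2.5.6 running IIS 10.0"

  setting {
    namespace = "aws:autoscaling:launchconfiguration"
    name      = "IamInstanceProfile"
    value     = "aws-elasticbeanstalk-ec2-role"
  }

  # ASG sizing handled by Beanstalk instead of a separate aws_autoscaling_group
  setting {
    namespace = "aws:autoscaling:asg"
    name      = "MinSize"
    value     = "2"
  }

  setting {
    namespace = "aws:autoscaling:asg"
    name      = "MaxSize"
    value     = "10"
  }
}
As far as I can tell, the ASG health check type and grace period themselves are not exposed as options in that list, which is what the import-based workaround in the next answer works around.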
So I figured out how we can change the Auto Scaling group (ASG) of the Beanstalk environment we have created using Terraform. First of all, create the Beanstalk environment according to your settings; we use the setting block in the Beanstalk resource and namespaces to configure it according to our needs.
Step-1
Create a Beanstalk environment using Terraform:
resource "aws_elastic_beanstalk_environment" "test"
{ ...
...
}
Step-2
After you have created the Beanstalk environment, create an autoscaling resource skeleton; the ASG associated with the Beanstalk environment will be handled by Terraform under this resource block. Import it using the ID of the ASG, which you can get from terraform plan/show:
terraform import aws_autoscaling_group.<Name that you give> asg-id
Step-3
After you have done that, change the Beanstalk environment according to your needs.
Then make sure you have added these tags, because sometimes I have noticed that the mapping of this ASG to the Beanstalk environment is lost:
tag {
  key                 = "elasticbeanstalk:environment-id"
  propagate_at_launch = true
  value               = aws_elastic_beanstalk_environment.<Name of your beanstalk>.id
}

tag {
  key                 = "elasticbeanstalk:environment-name"
  propagate_at_launch = true
  value               = aws_elastic_beanstalk_environment.<Name of your beanstalk>.name
}
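Putting steps 2 and 3 together, a sketch of the imported resource with the health check settings from the question might look like this (the resource name beanstalk_asg is just whatever you used in terraform import, and the remaining arguments must be kept in line with what Beanstalk actually created):
resource "aws_autoscaling_group" "beanstalk_asg" {
  # Values below must match the ASG that Beanstalk created, except for the
  # health check settings you are deliberately changing.
  min_size                  = 2
  max_size                  = 10
  health_check_type         = "ELB"   # ELB health checks in addition to the default EC2 checks
  health_check_grace_period = 1500

  tag {
    key                 = "elasticbeanstalk:environment-id"
    propagate_at_launch = true
    value               = aws_elastic_beanstalk_environment.test.id
  }

  tag {
    key                 = "elasticbeanstalk:environment-name"
    propagate_at_launch = true
    value               = aws_elastic_beanstalk_environment.test.name
  }
}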

Terraform: Add cloudwatch alarm to elastic load balancer generated through elastic beanstalk

I'm setting up Terraform config for an Elastic Beanstalk app that sets up an ELB. We want to have a cloudwatch alarm that triggers when the ELB gets too many 5XX errors. I'm trying to pass the ELB ARNs from the EB environment but it fails with the message:
value of 'count' cannot be computed
I know that's a common issue with Terraform, e.g. https://github.com/hashicorp/terraform/issues/10857, but I can't really figure out a workaround. We're trying to make this ELB CloudWatch alarm module generic, so I can't really hardcode the number of ELBs.
Here's the code I'm using:
locals {
  elb_count = "${length(var.load_balancer_arns)}"
}

resource aws_cloudwatch_metric_alarm "panic_time" {
  count               = "${local.elb_count}"
  alarm_name          = "${var.application_name}-panic-time"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = "1"
  metric_name         = "HTTPCode_ELB_5XX_Count"
  namespace           = "AWS/ELB"
  period              = "60"
  statistic           = "Sum"
  threshold           = "${var.max_5xx_errors}"

  dimensions {
    LoadBalancer = "${element(var.load_balancer_arns, count.index)}"
  }

  alarm_description = "SNS if we start getting a lot of 500 errors"
  alarm_actions     = ["${aws_sns_topic.panic_time.arn}"]
}

resource aws_sns_topic "panic_time" {
  name = "${var.application_name}-panic-time"
}

resource aws_sns_topic_policy "panic_time" {
  arn    = "${aws_sns_topic.panic_time.arn}"
  policy = "${data.aws_iam_policy_document.panic_time_sns.json}"
}

data aws_iam_policy_document "panic_time_sns" {
  statement {
    actions = [
      "SNS:Publish",
    ]

    resources = [
      "${aws_sns_topic.panic_time.arn}",
    ]

    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
  }
}
I'm passing in the load balancers from the environment in main.tf:
load_balancer_arns = "${module.environment.load_balancers}"
(The load_balancers output looks like this:)
output load_balancers {
  value = "${aws_elastic_beanstalk_environment.main.load_balancers}"
}
I didn't get how many ELBs you have per environment.
I used the following workaround for one ELB per environment:
resource "aws_elastic_beanstalk_environment" "some-environment" {
count = "${length(var.environments)}"
...
}
resource "aws_cloudwatch_metric_alarm" "alarm-unhealthy-host-count" {
count = "${length(var.environments)}"
...
dimensions {
LoadBalancerName = "${element(aws_elastic_beanstalk_environment.some-environment.*.load_balancers[count.index], 0)}"
}
}
It is funny, but I can pass any value to element(..., val), and every time I get the single correct ELB name for each environment.