I'm new to AWS and I'm trying to provision an ECS cluster with a capacity provider via Terraform. My plan currently applies without errors, and I can see that the capacity provider creates my instances, but those instances are not being registered with the cluster, even though the provider shows up on the cluster's edit page in the web console.
Here is my config for the cluster:
resource "aws_ecs_cluster" "cluster" {
name = "main"
depends_on = [
null_resource.iam_wait
]
}
data "aws_ami" "amazon_linux_2" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-ecs-hvm-*-x86_64-ebs"]
}
}
resource "aws_launch_configuration" "cluster" {
name = "cluster-${aws_ecs_cluster.cluster.name}"
image_id = data.aws_ami.amazon_linux_2.image_id
instance_type = "t2.small"
security_groups = [module.vpc.default_security_group_id]
iam_instance_profile = aws_iam_instance_profile.cluster.name
}
resource "aws_autoscaling_group" "cluster" {
name = aws_ecs_cluster.cluster.name
launch_configuration = aws_launch_configuration.cluster.name
vpc_zone_identifier = module.vpc.private_subnets
min_size = 3
max_size = 3
desired_capacity = 3
tag {
key = "ClusterName"
value = aws_ecs_cluster.cluster.name
propagate_at_launch = true
}
tag {
key = "AmazonECSManaged"
value = ""
propagate_at_launch = true
}
}
resource "aws_ecs_capacity_provider" "cluster" {
name = aws_ecs_cluster.cluster.name
auto_scaling_group_provider {
auto_scaling_group_arn = aws_autoscaling_group.cluster.arn
managed_scaling {
status = "ENABLED"
maximum_scaling_step_size = 1
minimum_scaling_step_size = 1
target_capacity = 3
}
}
}
resource "aws_ecs_cluster_capacity_providers" "cluster" {
cluster_name = aws_ecs_cluster.cluster.name
capacity_providers = [aws_ecs_capacity_provider.cluster.name]
default_capacity_provider_strategy {
base = 1
weight = 100
capacity_provider = aws_ecs_capacity_provider.cluster.name
}
}
The instance profile role has this policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeTags",
        "ecs:CreateCluster",
        "ecs:DeregisterContainerInstance",
        "ecs:DiscoverPollEndpoint",
        "ecs:Poll",
        "ecs:RegisterContainerInstance",
        "ecs:StartTelemetrySession",
        "ecs:Submit*",
        "ecr:GetAuthorizationToken",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:BatchCheckLayerAvailability",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    }
  ]
}
I've read that this can happen if the instances don't have the proper roles, but as far as I can tell the roles are set up correctly, and I can't find any visible permission errors.
Another strange thing I've seen: if another cluster named "default" exists, the instances register themselves with that cluster instead, even though the capacity provider is still attached to my cluster.
Figured it out! I just had to set user_data in my launch configuration, as shown below.
resource "aws_launch_configuration" "cluster" {
  name                 = "cluster-${aws_ecs_cluster.cluster.name}"
  image_id             = data.aws_ami.amazon_linux_2.image_id
  instance_type        = "t2.small"
  security_groups      = [module.vpc.default_security_group_id]
  iam_instance_profile = aws_iam_instance_profile.cluster.name
  user_data            = "#!/bin/bash\necho ECS_CLUSTER=${aws_ecs_cluster.cluster.name} >> /etc/ecs/ecs.config"
}
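The same script can also be written as an indented heredoc instead of an escaped one-liner, which is easier to read and extend. A minimal sketch of that variant (same resource as above, behavior unchanged):
resource "aws_launch_configuration" "cluster" {
  name                 = "cluster-${aws_ecs_cluster.cluster.name}"
  image_id             = data.aws_ami.amazon_linux_2.image_id
  instance_type        = "t2.small"
  security_groups      = [module.vpc.default_security_group_id]
  iam_instance_profile = aws_iam_instance_profile.cluster.name

  # Without ECS_CLUSTER in /etc/ecs/ecs.config the ECS agent registers the
  # instance with the cluster named "default", which matches the behavior
  # described above.
  user_data = <<-EOF
    #!/bin/bash
    echo "ECS_CLUSTER=${aws_ecs_cluster.cluster.name}" >> /etc/ecs/ecs.config
  EOF
}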
Related
I checked all the similar questions on Stack Overflow but couldn't find a decent answer for this issue. The main problem is that when I apply my Terraform, the instances come up and run successfully and I can see the node group under EKS, but I can't see any nodes in my EKS cluster. I followed this AWS article and applied the steps below, but it didn't work. The article also mentions aws-auth and user data. Should I add those as well, and how? (Do I need user data if I already use the optimized AMI?)
In summary, my main problems are:
my instances are running with the same name
my instances do not join the EKS cluster
Steps applied from the AWS article:
I added the AWS optimized AMI, but it doesn't solve my problem:
/aws/service/eks/optimized-ami/1.22/amazon-linux-2/recommended/image_id (Update: during installation the node group fails, probably because this image is not suitable for t2.micro)
I set the following VPC parameters, as the article says:
enable_dns_support = true
enable_dns_hostnames = true
I set the tags for my worker nodes
key = "kubernetes.io/cluster/${var.cluster_name}"
value = "owned"
I specified the user data in the launch template. Below you can see my userdata.sh file, which I call from the launch template.
There are no nodes :(
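On the aws-auth question: a managed node group (aws_eks_node_group, as used below) adds its node role to the aws-auth ConfigMap automatically, so you normally only have to manage aws-auth yourself for self-managed nodes or for extra IAM users/roles. If you do need it, a minimal sketch using the Terraform kubernetes provider (this assumes that provider is configured against this cluster; it is not part of the config below):
resource "kubernetes_config_map" "aws_auth" {
  metadata {
    name      = "aws-auth"
    namespace = "kube-system"
  }

  data = {
    # Map the worker node role so the kubelets are allowed to join the cluster.
    mapRoles = yamlencode([
      {
        rolearn  = aws_iam_role.eks_nodes.arn
        username = "system:node:{{EC2PrivateDNSName}}"
        groups   = ["system:bootstrappers", "system:nodes"]
      }
    ])
  }
}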
node_grp.tf: Here is my EKS worker node Terraform file with policies.
resource "aws_iam_role" "eks_nodes" {
name = "eks-node-group"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
POLICY
}
resource "aws_iam_role_policy" "node_autoscaling" {
name = "${var.base_name}-node_autoscaling_policy"
role = aws_iam_role.eks_nodes.name
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"autoscaling:DescribeAutoScalingGroups",
"autoscaling:DescribeAutoScalingInstances",
"autoscaling:SetDesiredCapacity",
"autoscaling:TerminateInstanceInAutoScalingGroup",
"autoscaling:DescribeTags"
],
"Resource": "*"
}
]
}
EOF
}
resource "aws_iam_role_policy_attachment" "AmazonEKSWorkerNodePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.eks_nodes.name
}
resource "aws_iam_role_policy_attachment" "AmazonEKS_CNI_Policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.eks_nodes.name
}
resource "aws_iam_role_policy_attachment" "AmazonEC2ContainerRegistryReadOnly" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.eks_nodes.name
}
resource "aws_eks_node_group" "node" {
cluster_name = var.cluster_name
node_group_name = "${var.base_name}-node-group"
node_role_arn = aws_iam_role.eks_nodes.arn
subnet_ids = var.private_subnet_ids
scaling_config {
desired_size = var.desired_nodes
max_size = var.max_nodes
min_size = var.min_nodes
}
launch_template {
name = aws_launch_template.node_group_template.name
version = aws_launch_template.node_group_template.latest_version
}
depends_on = [
aws_iam_role_policy_attachment.AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.AmazonEKS_CNI_Policy,
aws_iam_role_policy_attachment.AmazonEC2ContainerRegistryReadOnly,
]
}
resource "aws_launch_template" "node_group_template" {
name = "${var.cluster_name}_node_group"
instance_type = var.instance_type
user_data = base64encode(templatefile("${path.module}/userdata.sh", { API_SERVER_URL = var.cluster_endpoint, B64_CLUSTER_CA = var.ca_certificate, CLUSTER_NAME = var.cluster_name } ))
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_size = var.disk_size
}
}
tag_specifications {
resource_type = "instance"
tags = {
"Instance Name" = "${var.cluster_name}-node"
Name = "${var.cluster_name}-node"
key = "kubernetes.io/cluster/${var.cluster_name}"
value = "owned"
}
}
}
cluster.tf: my main EKS cluster file
resource "aws_iam_role" "eks_cluster" {
name = var.cluster_name
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "eks.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "AmazonEKSClusterPolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.eks_cluster.name
}
resource "aws_iam_role_policy_attachment" "AmazonEKSServicePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
role = aws_iam_role.eks_cluster.name
}
resource "aws_eks_cluster" "eks_cluster" {
name = var.cluster_name
role_arn = aws_iam_role.eks_cluster.arn
enabled_cluster_log_types = ["api", "audit", "authenticator","controllerManager","scheduler"]
vpc_config {
security_group_ids = [var.security_group_id]
subnet_ids = flatten([ var.private_subnet_ids, var.public_subnet_ids ])
endpoint_private_access = false
endpoint_public_access = true
}
depends_on = [
aws_iam_role_policy_attachment.AmazonEKSClusterPolicy,
aws_iam_role_policy_attachment.AmazonEKSServicePolicy
]
}
resource "aws_iam_openid_connect_provider" "oidc_provider" {
client_id_list = ["sts.amazonaws.com"]
thumbprint_list = var.trusted_ca_thumbprints
url = aws_eks_cluster.eks_cluster.identity[0].oidc[0].issuer
}
user-data.sh: my user data file, called from the launch template
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="
--==MYBOUNDARY==
Content-Type: text/x-shellscript; charset="us-ascii"
#!/bin/bash
set -ex
/etc/eks/bootstrap.sh ${CLUSTER_NAME} --b64-cluster-ca ${B64_CLUSTER_CA} --apiserver-endpoint ${API_SERVER_URL}
--==MYBOUNDARY==--
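On the user data question: if the launch template does not set image_id (as above), EKS supplies the EKS-optimized AMI for the cluster version and merges its own bootstrap user data, so the MIME part runs in addition to it; if you pin image_id yourself, EKS does not merge anything and your user data must run bootstrap.sh. A sketch of pinning the optimized AMI via the SSM parameter mentioned earlier (a variant of the launch template above, not an addition to it):
data "aws_ssm_parameter" "eks_ami" {
  name = "/aws/service/eks/optimized-ami/1.22/amazon-linux-2/recommended/image_id"
}

resource "aws_launch_template" "node_group_template" {
  name          = "${var.cluster_name}_node_group"
  image_id      = data.aws_ssm_parameter.eks_ami.value
  instance_type = var.instance_type

  # With an explicit image_id, the MIME user data above is responsible for
  # running /etc/eks/bootstrap.sh; EKS will not inject its own bootstrap.
  user_data = base64encode(templatefile("${path.module}/userdata.sh", {
    API_SERVER_URL = var.cluster_endpoint,
    B64_CLUSTER_CA = var.ca_certificate,
    CLUSTER_NAME   = var.cluster_name
  }))
}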
While I am trying to deploy EKS via Terraform, I am facing an error with node-group creation.
I am getting the following error:
Error: error waiting for EKS Node Group (Self-Hosted-Runner:Self-Hosted-Runner-default-node-group) to create:
unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'.
last error: 1 error occurred:i-04db15f25be4212fb, i-07bd88adabaa103c0, i-0915982ac0f217fe4:
NodeCreationFailure: Instances failed to join the kubernetes cluster.
with module.eks.aws_eks_node_group.eks-node-group,
│ on ../../modules/aws/eks/eks-node-group.tf line 1, in resource "aws_eks_node_group" "eks-node-group":
│ 1: resource "aws_eks_node_group" "eks-node-group" {
EKS
# EKS Cluster Resources
resource "aws_eks_cluster" "eks" {
name = var.cluster-name
version = var.k8s-version
role_arn = aws_iam_role.cluster.arn
vpc_config {
security_group_ids = [var.security_group]
subnet_ids = var.private_subnets
}
enabled_cluster_log_types = var.eks-cw-logging
depends_on = [
aws_iam_role_policy_attachment.cluster-AmazonEKSClusterPolicy,
aws_iam_role_policy_attachment.cluster-AmazonEKSServicePolicy,
]
}
EKS-NODE-GROUP
resource "aws_eks_node_group" "eks-node-group" {
cluster_name = var.cluster-name
node_group_name = "${var.cluster-name}-default-node-group"
node_role_arn = aws_iam_role.node.arn
subnet_ids = var.private_subnets
capacity_type = "SPOT"
node_group_name_prefix = null #"Creates a unique name beginning with the specified prefix. Conflicts with node_group_name"
scaling_config {
desired_size = var.desired-capacity
max_size = var.max-size
min_size = var.min-size
}
update_config {
max_unavailable = 1
}
instance_types = [var.node-instance-type]
# Ensure that IAM Role permissions are created before and deleted after EKS Node Group handling.
# Otherwise, EKS will not be able to properly delete EC2 Instances and Elastic Network Interfaces.
depends_on = [
aws_eks_cluster.eks,
aws_iam_role_policy_attachment.node-AmazonEKSWorkerNodePolicy,
aws_iam_role_policy_attachment.node-AmazonEKS_CNI_Policy
]
tags = {
Name = "${var.cluster-name}-default-node-group"
}
}
IAM
# IAM
# CLUSTER
resource "aws_iam_role" "cluster" {
name = "${var.cluster-name}-eks-cluster-role"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "eks.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "cluster-AmazonEKSClusterPolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSClusterPolicy"
role = aws_iam_role.cluster.name
}
resource "aws_iam_role_policy_attachment" "cluster-AmazonEKSServicePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSServicePolicy"
role = aws_iam_role.cluster.name
}
# NODES
resource "aws_iam_role" "node" {
name = "${var.cluster-name}-eks-node-role"
assume_role_policy = <<POLICY
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
POLICY
}
resource "aws_iam_role_policy_attachment" "node-AmazonEKSWorkerNodePolicy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy"
role = aws_iam_role.node.name
}
resource "aws_iam_role_policy_attachment" "node-AmazonEKS_CNI_Policy" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy"
role = aws_iam_role.node.name
}
resource "aws_iam_role_policy_attachment" "node-AmazonEC2ContainerRegistryReadOnly" {
policy_arn = "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly"
role = aws_iam_role.node.name
}
resource "aws_iam_instance_profile" "node" {
name = "${var.cluster-name}-eks-node-instance-profile"
role = aws_iam_role.node.name
}
Security Group
# Create Security Group
resource "aws_security_group" "cluster" {
name = "terraform_cluster"
description = "AWS security group for terraform"
vpc_id = aws_vpc.vpc1.id
# Input
ingress {
from_port = "1"
to_port = "65365"
protocol = "TCP"
cidr_blocks = [var.address_allowed, var.vpc1_cidr_block]
}
# Output
egress {
from_port = 0 # any port
to_port = 0 # any port
protocol = "-1" # any protocol
cidr_blocks = ["0.0.0.0/0"] # any destination
}
# ICMP Ping
ingress {
from_port = -1
to_port = -1
protocol = "icmp"
cidr_blocks = [var.address_allowed, var.vpc1_cidr_block]
}
tags = merge(
{
Name = "onboarding-sg",
},
var.tags,
)
}
VPC
# Create VPC
resource "aws_vpc" "vpc1" {
cidr_block = var.vpc1_cidr_block
instance_tenancy = "default"
enable_dns_support = true
enable_dns_hostnames = true
tags = merge(
{
Name = "onboarding-vpc",
},
var.tags,
)
}
# Subnet Public
resource "aws_subnet" "subnet_public1" {
vpc_id = aws_vpc.vpc1.id
cidr_block = var.subnet_public1_cidr_block[0]
map_public_ip_on_launch = "true" #it makes this a public subnet
availability_zone = data.aws_availability_zones.available.names[0]
tags = merge(
{
Name = "onboarding-public-sub",
"kubernetes.io/role/elb" = "1"
},
var.tags,
)
}
# Subnet Private
resource "aws_subnet" "subnet_private1" {
for_each = { for idx, cidr_block in var.subnet_private1_cidr_block: cidr_block => idx}
vpc_id = aws_vpc.vpc1.id
cidr_block = each.key
map_public_ip_on_launch = "false" // it makes this a private subnet
availability_zone = data.aws_availability_zones.available.names[each.value]
tags = merge(
{
Name = "onboarding-private-sub",
"kubernetes.io/role/internal-elb" = "1",
"kubernetes.io/cluster/${var.cluster-name}" = "owned"
},
var.tags,
)
}
tfvars
#General vars
region = "eu-west-1"
#Bucket vars
bucket = "tf-state"
tag_name = "test"
tag_environment = "Dev"
acl = "private"
versioning_enabled = "Enabled"
# Network EKS vars
aws_public_key_path = "~/.ssh/id_rsa.pub"
aws_key_name = "aws-k8s"
address_allowed = "/32" # Office public IP Address
vpc1_cidr_block = "10.0.0.0/16"
subnet_public1_cidr_block = ["10.0.128.0/20", "10.0.144.0/20", "10.0.160.0/20"]
subnet_private1_cidr_block = ["10.0.0.0/19", "10.0.32.0/19", "10.0.64.0/19"]
tags = {
Scost = "testing",
Terraform = "true",
Environment = "testing"
}
#EKS
cluster-name = "Self-Hosted-Runner"
k8s-version = "1.21"
node-instance-type = "t3.medium"
desired-capacity = "3"
max-size = "7"
min-size = "1"
# db-subnet-cidr = ["10.0.192.0/21", "10.0.200.0/21", "10.0.208.0/21"]
eks-cw-logging = ["api", "audit", "authenticator", "controllerManager", "scheduler"]
ec2-key-public-key = ""
"issues" : [ {
"code" : "NodeCreationFailure",
"message" : "Instances failed to join the kubernetes cluster",
What do you think I misconfigured?
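One thing worth checking that is not shown in the snippets above: NodeCreationFailure usually means the nodes cannot reach the EKS API or pull images, and for nodes in private subnets that requires a route to a NAT gateway (or VPC endpoints for EKS, ECR and S3). A minimal sketch of the NAT route, with hypothetical resource names, in case it is missing:
resource "aws_eip" "nat" {
  vpc = true # on AWS provider v5+ use domain = "vpc" instead
}

resource "aws_nat_gateway" "nat" {
  allocation_id = aws_eip.nat.id
  subnet_id     = aws_subnet.subnet_public1.id
}

resource "aws_route_table" "private" {
  vpc_id = aws_vpc.vpc1.id

  route {
    cidr_block     = "0.0.0.0/0"
    nat_gateway_id = aws_nat_gateway.nat.id
  }
}

# Associate every private subnet (created with for_each above) with the table.
resource "aws_route_table_association" "private" {
  for_each       = aws_subnet.subnet_private1
  subnet_id      = each.value.id
  route_table_id = aws_route_table.private.id
}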
Given the following terraform.tf file:
provider "aws" {
profile = "default"
region = "us-east-1"
}
locals {
vpc_name = "some-vpc-name"
dev_vpn_source = "*.*.*.*/32" # Instead of * I have a CIDR block of our VPN here
}
resource "aws_vpc" "vpc" {
cidr_block = "10.0.0.0/16"
enable_dns_hostnames = true
tags = {
Name: local.vpc_name
}
}
resource "aws_subnet" "a" {
cidr_block = "10.0.0.0/17"
vpc_id = aws_vpc.vpc.id
tags = {
Name: "${local.vpc_name}-a"
}
}
resource "aws_subnet" "b" {
cidr_block = "10.0.128.0/17"
vpc_id = aws_vpc.vpc.id
tags = {
Name: "${local.vpc_name}-b"
}
}
resource "aws_security_group" "ssh" {
name = "${local.vpc_name}-ssh"
vpc_id = aws_vpc.vpc.id
tags = {
Name: "${local.vpc_name}-ssh"
}
}
resource "aws_security_group_rule" "ingress-ssh" {
from_port = 22
protocol = "ssh"
security_group_id = aws_security_group.ssh.id
to_port = 22
type = "ingress"
cidr_blocks = [local.dev_vpn_source]
description = "SSH access for developer"
}
resource "aws_security_group" "outbound" {
name = "${local.vpc_name}-outbound"
vpc_id = aws_vpc.vpc.id
tags = {
Name: "${local.vpc_name}-outbound"
}
}
resource "aws_security_group_rule" "egress" {
from_port = 0
protocol = "all"
security_group_id = aws_security_group.outbound.id
to_port = 65535
type = "egress"
cidr_blocks = ["0.0.0.0/0"]
description = "All outbound allowed"
}
module "ecs-clusters" {
source = "./ecs-clusters/"
subnets = [aws_subnet.a, aws_subnet.b]
vpc_name = local.vpc_name
security_groups = [aws_security_group.ssh, aws_security_group.outbound]
}
And the following ecs-clusters/ecs-cluster.tf file:
variable "vpc_name" {
type = string
}
variable "subnets" {
type = list(object({
id: string
}))
}
variable "security_groups" {
type = list(object({
id: string
}))
}
data "aws_ami" "amazon_linux_ecs" {
most_recent = true
owners = ["amazon"]
filter {
name = "name"
values = ["amzn2-ami-ecs*"]
}
}
resource "aws_iam_instance_profile" "ecs-launch-profile" {
name = "${var.vpc_name}-ecs"
role = "ecsInstanceRole"
}
resource "aws_launch_template" "ecs" {
name = "${var.vpc_name}-ecs"
image_id = data.aws_ami.amazon_linux_ecs.id
instance_type = "r5.4xlarge"
key_name = "some-ssh-key-name"
iam_instance_profile {
name = "${var.vpc_name}-ecs"
}
block_device_mappings {
device_name = "/dev/xvda"
ebs {
volume_type = "gp3"
volume_size = 1024
delete_on_termination = false
}
}
network_interfaces {
associate_public_ip_address = true
subnet_id = var.subnets[0].id
security_groups = var.security_groups[*].id
}
update_default_version = true
}
resource "aws_autoscaling_group" "ecs-autoscaling_group" {
name = "${var.vpc_name}-ecs"
vpc_zone_identifier = [for subnet in var.subnets: subnet.id]
desired_capacity = 1
max_size = 1
min_size = 1
protect_from_scale_in = true
launch_template {
id = aws_launch_template.ecs.id
version = aws_launch_template.ecs.latest_version
}
tag {
key = "Name"
propagate_at_launch = true
value = "${var.vpc_name}-ecs"
}
depends_on = [aws_launch_template.ecs]
}
resource "aws_ecs_capacity_provider" "ecs-capacity-provider" {
name = var.vpc_name
auto_scaling_group_provider {
auto_scaling_group_arn = aws_autoscaling_group.ecs-autoscaling_group.arn
managed_termination_protection = "ENABLED"
managed_scaling {
maximum_scaling_step_size = 1
minimum_scaling_step_size = 1
status = "ENABLED"
target_capacity = 1
}
}
depends_on = [aws_autoscaling_group.ecs-autoscaling_group]
}
resource "aws_ecs_cluster" "ecs-cluster" {
name = var.vpc_name
capacity_providers = [aws_ecs_capacity_provider.ecs-capacity-provider.name]
depends_on = [aws_ecs_capacity_provider.ecs-capacity-provider]
}
resource "aws_iam_role" "ecs-execution" {
name = "${var.vpc_name}-ecs-execution"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "ecs-tasks.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_role" "ecs" {
name = "${var.vpc_name}-ecs"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "ecs-tasks.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_role_policy_attachment" "execution-role" {
role = aws_iam_role.ecs-execution.name
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
resource "aws_iam_role_policy_attachment" "role" {
role = aws_iam_role.ecs.name
policy_arn = "arn:aws:iam::aws:policy/AmazonS3FullAccess"
}
I'm facing two problems:
I can't SSH into the EC2 instance created by the autoscaling group, despite using the same SSH key and VPN that I use to access other EC2 instances. My VPN client config includes a route to the target machine via the VPN gateway.
I can't run a task on the ECS cluster. The task gets stuck in provisioning status and then fails with "Unable to run task". The task is configured to use 1 GB of RAM and 1 vCPU.
What am I doing wrong?
Based on the comments.
There were two issues with the original setup:
Lack of connectivity to the ECS and ECR services, which was solved by enabling internet access in the VPC. It is also possible to use VPC interface endpoints for ECS, ECR and S3 if internet access is not desired.
Container instances did not register with ECS. This was fixed by using user_data to bootstrap ECS instances so that they can register with the ECS cluster.
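For the second fix, note that a launch template (unlike a launch configuration) expects user_data to be base64-encoded. A trimmed sketch of aws_launch_template.ecs from the question with the user data added (other arguments from the original are omitted for brevity; the cluster name is taken from var.vpc_name, which the cluster resource also uses, to avoid a dependency cycle through the capacity provider and autoscaling group):
resource "aws_launch_template" "ecs" {
  name          = "${var.vpc_name}-ecs"
  image_id      = data.aws_ami.amazon_linux_ecs.id
  instance_type = "r5.4xlarge"

  iam_instance_profile {
    name = "${var.vpc_name}-ecs"
  }

  # Launch templates require base64-encoded user data; this writes the cluster
  # name into /etc/ecs/ecs.config so the ECS agent registers with it.
  user_data = base64encode(<<-EOF
    #!/bin/bash
    echo "ECS_CLUSTER=${var.vpc_name}" >> /etc/ecs/ecs.config
  EOF
  )
}
If the instances sit in subnets without internet access, the interface endpoints mentioned above are the ones named com.amazonaws.<region>.ecs, ecs-agent and ecs-telemetry for ECS, ecr.api and ecr.dkr for ECR, plus an S3 gateway endpoint for image layers.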
Roles:
resource "aws_iam_role" "ecs-ec2-role" {
name = "${var.app_name}-ecs-ec2-role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": [
"ecs.amazonaws.com",
"ecs-tasks.amazonaws.com"
]
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_instance_profile" "ecs-ec2-role" {
name = "${var.app_name}-ecs-ec2-role"
role = "${aws_iam_role.ecs-ec2-role.name}"
}
resource "aws_iam_role_policy" "ecs-ec2-role-policy" {
name = "${var.app_name}-ecs-ec2-role-policy"
role = "${aws_iam_role.ecs-ec2-role.id}"
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:Poll",
"ecs:RegisterContainerInstance",
"ecs:StartTelemetrySession",
"ecs:Submit*",
"ecs:StartTask",
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogStreams"
],
"Resource": [
"arn:aws:logs:*:*:*"
]
}
]
}
EOF
}
# ecs service role
resource "aws_iam_role" "ecs-service-role" {
name = "${var.app_name}-ecs-service-role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": [
"ecs.amazonaws.com"
]
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_iam_role_policy_attachment" "ecs-service-attach" {
role = "${aws_iam_role.ecs-service-role.name}"
policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceRole"
}
data "aws_iam_policy_document" "aws_secrets_policy" {
version = "2012-10-17"
statement {
sid = ""
effect = "Allow"
actions = ["secretsmanager:GetSecretValue"]
resources = [
var.aws_secrets
]
}
}
resource "aws_iam_policy" "aws_secrets_policy" {
name = "aws_secrets_policy"
policy = "${data.aws_iam_policy_document.aws_secrets_policy.json}"
}
resource "aws_iam_role_policy_attachment" "aws_secrets_policy" {
role = aws_iam_role.ecs-ec2-role.name
policy_arn = aws_iam_policy.aws_secrets_policy.arn
}
ECS:
resource "aws_ecs_cluster" "main" {
name = "${var.app_name}-cluster"
}
data "template_file" "app" {
template = file("./templates/ecs/app.json.tpl")
vars = {
app_name = var.app_name
app_image = var.app_image
app_host = var.app_host
endpoint_protocol = var.endpoint_protocol
app_port = var.app_port
container_cpu = var.container_cpu
container_memory = var.container_memory
aws_region = var.aws_region
aws_secrets = var.aws_secrets
}
}
resource "aws_ecs_task_definition" "app" {
family = "${var.app_name}-task"
execution_role_arn = aws_iam_role.ecs-ec2-role.arn
cpu = var.container_cpu
memory = var.container_memory
container_definitions = data.template_file.app.rendered
}
resource "aws_ecs_service" "main" {
name = "${var.app_name}-service"
cluster = aws_ecs_cluster.main.id
task_definition = aws_ecs_task_definition.app.arn
desired_count = var.app_count
iam_role = aws_iam_role.ecs-service-role.arn
depends_on = [aws_iam_role_policy_attachment.ecs-service-attach]
load_balancer {
target_group_arn = aws_lb_target_group.app.id
container_name = var.app_name
container_port = var.app_port
}
}
Autoscaling:
data "aws_ami" "latest_ecs" {
most_recent = true
filter {
name = "name"
values = ["*amazon-ecs-optimized"]
}
filter {
name = "virtualization-type"
values = ["hvm"]
}
owners = ["591542846629"] # AWS
}
resource "aws_launch_configuration" "ecs-launch-configuration" {
// name = "${var.app_name}-launch-configuration"
image_id = data.aws_ami.latest_ecs.id
instance_type = var.instance_type
iam_instance_profile = aws_iam_instance_profile.ecs-ec2-role.id
security_groups = [aws_security_group.ecs_tasks.id]
root_block_device {
volume_type = "standard"
volume_size = 100
delete_on_termination = true
}
lifecycle {
create_before_destroy = true
}
associate_public_ip_address = "false"
key_name = "backend-dev"
#
# register the cluster name with ecs-agent which will in turn coord
# with the AWS api about the cluster
#
user_data = data.template_file.autoscaling_user_data.rendered
}
data "template_file" "autoscaling_user_data" {
template = file("./templates/ecs/autoscaling_user_data.tpl")
vars = {
ecs_cluster = aws_ecs_cluster.main.name
}
}
#
# need an ASG so we can easily add more ecs host nodes as necessary
#
resource "aws_autoscaling_group" "ecs-autoscaling-group" {
name = "${var.app_name}-autoscaling-group"
max_size = "4"
min_size = "2"
health_check_grace_period = 300
desired_capacity = "2"
vpc_zone_identifier = [aws_subnet.private[0].id, aws_subnet.private[1].id]
launch_configuration = aws_launch_configuration.ecs-launch-configuration.name
health_check_type = "ELB"
tag {
key = "Name"
value = var.app_name
propagate_at_launch = true
}
}
resource "aws_autoscaling_policy" "demo-cluster" {
name = "${var.app_name}-ecs-autoscaling-polycy"
policy_type = "TargetTrackingScaling"
estimated_instance_warmup = "90"
adjustment_type = "ChangeInCapacity"
autoscaling_group_name = aws_autoscaling_group.ecs-autoscaling-group.name
target_tracking_configuration {
predefined_metric_specification {
predefined_metric_type = "ASGAverageCPUUtilization"
}
target_value = 40.0
}
}
The cluster name was added to the instances successfully via user data:
$ cat /etc/ecs/ecs.config
ECS_CLUSTER=mercure-cluster
But I'm getting an error:
service mercure-service was unable to place a task because no
container instance met all of its requirements. Reason: No Container
Instances were found in your cluster.
ecs-agent.log:
$ grep 'WARN\|ERROR' ecs-agent.log.2019-10-24-10
2019-10-24T10:36:45Z [WARN] Error getting valid credentials (AKID ): NoCredentialProviders: no valid providers in chain. Deprecated.
2019-10-24T10:36:45Z [ERROR] Unable to register as a container instance with ECS: NoCredentialProviders: no valid providers in chain. Deprecated.
2019-10-24T10:36:45Z [ERROR] Error registering: NoCredentialProviders: no valid providers in chain. Deprecated.
ecs-init.log:
$ grep 'WARN\|ERROR' ecs-init.log
2019-10-24T10:36:45Z [WARN] ECS Agent failed to start, retrying in 547.77941ms
2019-10-24T10:36:46Z [WARN] ECS Agent failed to start, retrying in 1.082153551s
2019-10-24T10:36:50Z [WARN] ECS Agent failed to start, retrying in 2.066145821s
2019-10-24T10:36:55Z [WARN] ECS Agent failed to start, retrying in 4.235010051s
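The NoCredentialProviders errors above typically mean the ECS agent cannot obtain credentials from the instance profile. In the ecs-ec2-role shown earlier, the trust policy only allows ecs.amazonaws.com and ecs-tasks.amazonaws.com to assume the role, but instance profile credentials are handed out to ec2.amazonaws.com, so that principal has to be trusted too. A sketch of the trust policy the instance role needs (same resource, only assume_role_policy changed):
resource "aws_iam_role" "ecs-ec2-role" {
  name = "${var.app_name}-ecs-ec2-role"

  # ec2.amazonaws.com must be trusted for instance profile credentials to work;
  # keep ecs-tasks.amazonaws.com only if this role is also used as a task
  # execution role.
  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": [
          "ec2.amazonaws.com",
          "ecs-tasks.amazonaws.com"
        ]
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF
}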
Terraform Version
v0.11.3
Affected Resources
aws_ecs_service
aws_ecs_task_definition
aws_alb
aws_alb_target_group
aws_alb_listener
Error
I'm setting up an ECS cluster with, currently, one service. I had several issues getting the service up without it breaking, but now my service can't seem to keep a container running.
service phoenix-web (instance i-079707fc669361a81) (port 80) is unhealthy in target-group tgqaphoenix-web due to (reason Request timed out)
Related?
Once my resources are up, I can't seem to find a public DNS name on any instance or on the VPC gateway.
main.tf for my ECS Service module:
data "template_file" "ecs_task_definition_config" {
template = "${file("config/ecs-task.json")}"
}
resource "aws_ecs_task_definition" "phoenix-web" {
lifecycle {
create_before_destroy = true
}
family = "nginx-phoenix-task"
container_definitions = "${data.template_file.ecs_task_definition_config.rendered}"
}
resource "aws_security_group" "main" {
vpc_id = "${var.vpc_id}"
tags {
Name = "sg${var.name}LoadBalancer"
Project = "${var.name}"
Environment = "${var.environment}"
}
}
resource "aws_security_group_rule" "app_lb_https_ingress" {
type = "ingress"
from_port = 80
to_port = 80
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
security_group_id = "${aws_security_group.main.id}"
}
resource "aws_alb" "main" {
security_groups = ["${aws_security_group.main.id}"]
subnets = ["${var.public_subnet_ids}"]
name = "alb-${var.environment}-${var.name}"
access_logs {
bucket = "${var.access_log_bucket}"
prefix = "${var.access_log_prefix}"
}
tags {
Name = "alb-${var.environment}-${var.name}"
Project = "${var.name}"
Environment = "${var.environment}"
}
}
resource "aws_alb_target_group" "main" {
name = "tg${var.environment}${var.name}"
health_check {
healthy_threshold = "3"
interval = "30"
protocol = "HTTP"
timeout = "3"
path = "/healthz"
unhealthy_threshold = "2"
}
port = "80"
protocol = "HTTP"
vpc_id = "${var.vpc_id}"
tags {
Name = "tg${var.environment}${var.name}"
Project = "${var.name}"
Environment = "${var.environment}"
}
depends_on = ["aws_alb.main"]
}
resource "aws_alb_listener" "https" {
load_balancer_arn = "${aws_alb.main.id}"
port = "80"
protocol = "HTTP"
default_action {
target_group_arn = "${aws_alb_target_group.main.id}"
type = "forward"
}
}
resource "aws_ecs_service" "service" {
lifecycle {
create_before_destroy = true
}
name = "${var.name}"
cluster = "${var.environment}"
task_definition = "${aws_ecs_task_definition.phoenix-web.id}"
desired_count = "${var.desired_count}"
deployment_minimum_healthy_percent = "${var.deployment_min_healthy_percent}"
deployment_maximum_percent = "${var.deployment_max_percent}"
iam_role = "${aws_iam_role.ecs-role.id}"
load_balancer {
target_group_arn = "${aws_alb_target_group.main.id}"
container_name = "phoenix-web"
container_port = "80"
}
depends_on = ["aws_iam_role.ecs-role", "null_resource.alb_exists"]
}
resource "aws_iam_role_policy" "ecs-policy" {
name = "ecs-policy"
role = "${aws_iam_role.ecs-role.id}"
policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecs:CreateCluster",
"ecs:DeregisterContainerInstance",
"ecs:DiscoverPollEndpoint",
"ecs:Poll",
"ecs:RegisterContainerInstance",
"ecs:StartTelemetrySession",
"ecs:Submit*",
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"ec2:AuthorizeSecurityGroupIngress",
"ec2:Describe*",
"elasticloadbalancing:DeregisterInstancesFromLoadBalancer",
"elasticloadbalancing:Describe*",
"elasticloadbalancing:RegisterInstancesWithLoadBalancer",
"elasticloadbalancing:RegisterTargets",
"elasticloadbalancing:DeregisterTargets"
],
"Resource": "*"
}
]
}
EOF
depends_on = ["aws_iam_role.ecs-role"]
}
resource "aws_iam_role" "ecs-role" {
name = "ecs-role"
assume_role_policy = <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Principal": {
"Service": "ecs.amazonaws.com"
},
"Effect": "Allow",
"Sid": ""
}
]
}
EOF
}
resource "aws_appautoscaling_target" "main" {
service_namespace = "ecs"
resource_id = "service/${var.environment}/${var.name}"
scalable_dimension = "ecs:service:DesiredCount"
role_arn = "${aws_iam_role.ecs-role.arn}"
min_capacity = "${var.min_count}"
max_capacity = "${var.max_count}"
depends_on = [
"aws_ecs_service.service",
]
}
resource "null_resource" "alb_exists" {
triggers {
alb_name = "${aws_alb_target_group.main.id}"
}
}
main.tf for my ECS cluster module
module "s3-log-storage" {
source = "cloudposse/s3-log-storage/aws"
version = "0.1.3"
# insert the 3 required variables here
namespace = "mmt-ecs"
stage = "${var.environment}"
name = "logs-bucket"
policy = <<POLICY
{
"Id": "Policy1519319575520",
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1519319570434",
"Action": [
"s3:PutObject",
"s3:PutObjectAcl",
"s3:PutObjectTagging",
"s3:PutObjectVersionAcl",
"s3:PutObjectVersionTagging"
],
"Effect": "Allow",
"Resource": "arn:aws:s3:::mmt-ecs-qa-logs-bucket/*",
"Principal": "*"
}
]
}
POLICY
}
module "network" {
source = "../network"
environment = "${var.environment}"
vpc_cidr = "${var.vpc_cidr}"
public_subnet_cidrs = "${var.public_subnet_cidrs}"
private_subnet_cidrs = "${var.private_subnet_cidrs}"
availability_zones = "${var.availability_zones}"
depends_id = ""
}
module "ecs_instances" {
source = "../ecs_instances"
environment = "${var.environment}"
cluster = "${var.cluster}"
instance_group = "${var.instance_group}"
private_subnet_ids = "${module.network.private_subnet_ids}"
aws_ami = "${var.ecs_aws_ami}"
instance_type = "${var.instance_type}"
max_size = "${var.max_size}"
min_size = "${var.min_size}"
desired_capacity = "${var.desired_capacity}"
vpc_id = "${module.network.vpc_id}"
iam_instance_profile_id = "${aws_iam_instance_profile.ecs.id}"
key_name = "${var.key_name}"
load_balancers = "${var.load_balancers}"
depends_id = "${module.network.depends_id}"
custom_userdata = "${var.custom_userdata}"
cloudwatch_prefix = "${var.cloudwatch_prefix}"
}
module "web-phoenix-service" {
source = "../services/web-phoenix"
environment = "${var.environment}"
vpc_id = "${module.network.vpc_id}"
public_subnet_ids = "${module.network.public_subnet_ids}"
name = "phoenix-web"
deployment_max_percent = "200"
deployment_min_healthy_percent = "100"
max_count = "2"
min_count = "1"
desired_count = "1"
ecs_service_role_name = "${aws_iam_instance_profile.ecs.id}"
access_log_bucket = "${module.s3-log-storage.bucket_id}"
access_log_prefix = "ALB"
}
resource "aws_ecs_cluster" "cluster" {
name = "${var.cluster}"
}
It seems the application health check (/healthz) is failing. You can start debugging like this:
1) Spin up a container locally and check whether it is working. Per your health check info above, you should be able to access the application at http://someip:port/healthz
If this works:
2) Are you exposing port 80 when building the Docker image? Check the Dockerfile.
3) If the above two steps seem okay, try accessing your application using the ECS instance IP as soon as the task is running:
http://ecsinstanceip:port/healthz
4) If 3 also works, try increasing the health check timeout period so that the application gets more time to pass its health check.
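For step 4, a sketch of what a more generous timeout could look like on the aws_alb_target_group from the question (values are illustrative; syntax kept in the question's 0.11 style):
resource "aws_alb_target_group" "main" {
  name     = "tg${var.environment}${var.name}"
  port     = "80"
  protocol = "HTTP"
  vpc_id   = "${var.vpc_id}"

  health_check {
    healthy_threshold   = "3"
    interval            = "30"
    protocol            = "HTTP"
    path                = "/healthz"
    timeout             = "10"  # was 3; gives the app more time to respond
    unhealthy_threshold = "2"
  }
}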
Clue 1
Make sure that the ECS container instances' security group accepts ports 1024-65535 from inside the VPC (don't open them to the outside world).
Clue 2
In the task definition, specify the portMappings like this:
"portMappings": [
{
"hostPort": 0,
"protocol": "tcp",
"containerPort": 80
}
],
Note here:
containerPort is what you expose from your container, where your app is listening with its health check.
hostPort is the port you bind for forwarding on the host. Leave it at 0 and it will be automatically assigned by ECS; that's why you need to open 1024-65535 on the SG. This is what lets you run the same task definition multiple times on the same instance (scale horizontally).
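A sketch of Clue 1 in Terraform, in the question's 0.11 style (aws_security_group.ecs_hosts is a hypothetical security group attached to the container instances; aws_security_group.main is the ALB security group from the question):
resource "aws_security_group_rule" "ecs_dynamic_ports_from_alb" {
  type                     = "ingress"
  from_port                = 1024
  to_port                  = 65535
  protocol                 = "tcp"
  security_group_id        = "${aws_security_group.ecs_hosts.id}"
  source_security_group_id = "${aws_security_group.main.id}"
  description              = "ALB to ECS dynamic host ports"
}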