ResourceInitializationError with Fargate ECS deployment

I'm fairly new to AWS. I am trying to deploy a Docker container to ECS, but it fails with the following error:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-east-1.amazonaws.com/": dial tcp 52.46.146.144:443: i/o timeout
This was working perfectly fine until I tried to add a load balancer, at which point this error began occurring. I must have changed something, but I'm not sure what.
The ECS task runs in a public subnet.
The security group allows inbound/outbound traffic on all ports and IPs (0.0.0.0/0).
The VPC has an internet gateway.
Clearly something is wrong with my config, but I'm not sure what. Google and other Stack Overflow posts haven't helped so far.
Terraform ECS file:
resource "aws_ecs_cluster" "solmines-ecs-cluster" {
name = "solmines-ecs-cluster"
}
resource "aws_ecs_service" "solmines-ecs-service" {
name = "solmines"
cluster = aws_ecs_cluster.solmines-ecs-cluster.id
task_definition = aws_ecs_task_definition.solmines-ecs-task-definition.arn
launch_type = "FARGATE"
desired_count = 1
network_configuration {
security_groups = [aws_security_group.solmines-ecs.id]
subnets = ["${aws_subnet.solmines-public-subnet1.id}", "${aws_subnet.solmines-public-subnet2.id}"]
assign_public_ip = true
}
load_balancer {
target_group_arn = aws_lb_target_group.solmines-lb-tg.arn
container_name = "solmines-api"
container_port = 80
}
depends_on = [aws_lb_listener.solmines-lb-listener]
}
resource "aws_ecs_task_definition" "solmines-ecs-task-definition" {
family = "solmines-ecs-task-definition"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
memory = "1024"
cpu = "512"
execution_role_arn = "${aws_iam_role.solmines-ecs-role.arn}"
container_definitions = <<EOF
[
{
"name": "solmines-api",
"image": "${aws_ecr_repository.solmines-ecr-repository.repository_url}:latest",
"memory": 1024,
"cpu": 512,
"essential": true,
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
]
}
]
EOF
}
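For completeness: without ECR VPC endpoints, the tasks reach ECR through the internet gateway, so both public subnets need a route table that sends 0.0.0.0/0 to it. My wiring for that looks roughly like this (the VPC and IGW resource names are placeholders, not my exact config):

resource "aws_route_table" "solmines-public" {
  vpc_id = aws_vpc.solmines.id # placeholder name

  # Default route to the internet gateway so Fargate tasks can reach ECR
  route {
    cidr_block = "0.0.0.0/0"
    gateway_id = aws_internet_gateway.solmines.id # placeholder name
  }
}

resource "aws_route_table_association" "solmines-public-subnet1" {
  subnet_id      = aws_subnet.solmines-public-subnet1.id
  route_table_id = aws_route_table.solmines-public.id
}

resource "aws_route_table_association" "solmines-public-subnet2" {
  subnet_id      = aws_subnet.solmines-public-subnet2.id
  route_table_id = aws_route_table.solmines-public.id
}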

Related

Switching to Windows Containers causing a "CannotPullContainerError... failed to resolve ref <image> ... Forbidden" despite working for Linux

I have a Windows container image stored within a private Artifactory repository, and I would like to deploy it to AWS Fargate. Unfortunately, I am getting the error:
CannotPullContainerError: inspect image has been retried 1 time(s):
failed to resolve ref
"my.local.artifactory.com:port/repo/project/branch:image#sha256:digest":
failed to do request: Head
"https://my.local.artifactory.com:port/v2/repo/project/branch/manifests/sha256:digest":
Forbidden
This happens whenever my ECS service attempts to spin up a new task.
We have existing Linux applications running in AWS Fargate which also pull (successfully) from our Artifactory repo; however, this will be our first Windows container deployment.
Using Terraform, I've been able to show that it is the switch to Windows that changes something, somewhere, to cause this. The error can be reproduced by switching our ecs_task_definition resource from:
resource "aws_ecs_task_definition" "ecs_task_definition" {
cpu = 1024
family = var.app_aws_name
container_definitions = jsonencode([
{
name = var.app_aws_name
image = "my.local.artifactory.com:port/repo/**LINUX_PROJECT**/branch:image#sha256:digest"
cpu = 1024
memory = 2048
essential = true
environment = [
{
name = "ASPNETCORE_ENVIRONMENT"
value = var.aspnetcore_environment_value
}
]
portMappings = [
{
containerPort = 80
hostPort = 80
protocol = "tcp"
},
{
containerPort = 443
hostPort = 443
protocol = "tcp"
}
],
logConfiguration = {
logDriver = "awslogs"
options = {
awslogs-create-group = "true"
awslogs-group = "/ecs/${var.app_name_lower}"
awslogs-region = var.region
awslogs-stream-prefix = var.app_aws_name
}
}
}
])
memory = 2048
network_mode = "awsvpc"
requires_compatibilities = [
"FARGATE"
]
task_role_arn = aws_iam_role.ecs_execution_role.arn
execution_role_arn = aws_iam_role.ecs_execution_role.arn
}
to:
resource "aws_ecs_task_definition" "ecs_task_definition" {
cpu = 1024
family = var.app_aws_name
container_definitions = jsonencode([
{
name = var.app_aws_name
image = "my.local.artifactory.com:port/repo/**WINDOWS_PROJECT**/branch:image#sha256:digest"
cpu = 1024
memory = 2048
essential = true
environment = [
{
name = "ASPNETCORE_ENVIRONMENT"
value = var.aspnetcore_environment_value
}
]
portMappings = [
{
containerPort = 80
hostPort = 80
protocol = "tcp"
},
{
containerPort = 443
hostPort = 443
protocol = "tcp"
}
],
logConfiguration = {
logDriver = "awslogs"
options = {
awslogs-create-group = "true"
awslogs-group = "/ecs/${var.app_name_lower}"
awslogs-region = var.region
awslogs-stream-prefix = var.app_aws_name
}
}
}
])
memory = 2048
network_mode = "awsvpc"
requires_compatibilities = [
"FARGATE"
]
**runtime_platform {
operating_system_family = "WINDOWS_SERVER_2019_CORE"
cpu_architecture = "X86_64"
}**
task_role_arn = aws_iam_role.ecs_execution_role.arn
execution_role_arn = aws_iam_role.ecs_execution_role.arn
}
Keeping all other Terraform resources the same, the first works successfully; the latter results in the error.
Here's what I have tried:
1. Triple, quadruple checked that the Windows image does actually exist in Artifactory.
2. Pulled the image stored within Artifactory, pushed it to ECR, and had the task definition pull the image from there. This worked successfully, leading me to believe there is nothing wrong with the image itself or any missing Windows AWS configuration.
3. Ensured that the Windows image is set up in Artifactory to allow anonymous read access, in exactly the same way as our Linux images.
4. Attempted to use AWS Secrets Manager to authenticate to Artifactory with an account anyway (unsuccessful; see the sketch after this list).
5. Attempted to use a non-ECR "mcr.microsoft.com/windows/servercore/iis:windowsservercore-ltsc2019" image; this allowed the task to run successfully.
6. Checked our Artifactory logs to see if any pull requests are actually making it there - no pull requests for that image have been logged, which would lead me to believe it's a network-based infrastructure issue rather than permissions; however, the pull works fine for Linux containers with the security groups, VPC, and subnets otherwise the same!
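(For point 4, this is roughly how the Secrets Manager attempt was wired into the container definition; the secret ARN below is a placeholder, and the execution role was also given secretsmanager:GetSecretValue on it:)

container_definitions = jsonencode([
  {
    name  = var.app_aws_name
    image = "my.local.artifactory.com:port/repo/WINDOWS_PROJECT/branch:image#sha256:digest"

    # Private registry auth: ECS reads {"username": ..., "password": ...}
    # from this Secrets Manager secret when pulling the image.
    repositoryCredentials = {
      credentialsParameter = "arn:aws:secretsmanager:REGION:ACCOUNT_ID:secret:artifactory-pull-user" # placeholder
    }

    # (remaining container settings as in the task definitions above)
  }
])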
Due to point 6 I believe this to be network related, but for the life of me I cannot figure out what changes between Windows and Linux containers that would cause this. The pull still happens on port 443 and still comes from the same VPC/subnet, so I don't see how the firewall could be blocking it, and the security group is unchanged, so again I don't see how that could be the issue.
So my question is: what actually changes between Linux and Windows task definitions that could be causing this?
...
Or am I missing something and following a red herring?
If there's any other information you'd like, please ask and I'll add it here. I've tried not to bloat this too much.
Cheers

CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim: OCI runtime create failed:

Here's the full error message:
CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "/": permission denied: unknown
I have an application that I built a Docker image for, and it was working fine on Lambda. The image is on ECR. I deleted my Lambda function, then created a Docker container in ECS from that image using Fargate.
Here is the main.tf file in my Terraform ECS module that I used to create this task:
resource "aws_ecs_cluster" "cluster" {
name = "python-cloud-cluster"
}
resource "aws_ecs_service" "ecs-service" {
name = "python-cloud-project"
cluster = aws_ecs_cluster.cluster.id
task_definition = aws_ecs_task_definition.pcp-ecs-task-definition.arn
launch_type = "FARGATE"
network_configuration {
subnets = var.service_subnets
security_groups = var.pcp_service_sg
assign_public_ip = true
}
desired_count = 1
}
resource "aws_ecs_task_definition" "pcp-ecs-task-definition" {
family = "ecs-task-definition-pcp"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
memory = "1024"
cpu = "512"
task_role_arn = var.task_role_arn
execution_role_arn = var.task_role_arn
container_definitions = <<EOF
[
{
"name": "pcp-container",
"image": "775362094965.dkr.ecr.us-west-2.amazonaws.com/weather-project:latest",
"memory": 1024,
"cpu": 512,
"essential": true,
"entryPoint": ["/"],
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
]
}
]
EOF
}
I found a base template online and altered it to fit my needs. I just realized the entry point is set to ["/"] in the task definition, which was the default from the template I used. What should I be setting it to? Or is this error caused by a different issue?
entryPoint is optional, and you don't have to specify it if you don't know what it should be.
In your case it is "/", which is incorrect. It should be some executable (e.g. /bin/bash), and what it should be depends on your container and what the container does. But again, it's optional.
You have to check the documentation of your weather-project container and see what exactly it does and how to use it.
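For illustration, a minimal sketch of the same container definition with the bogus entryPoint simply removed, so that the ENTRYPOINT/CMD baked into the image takes effect (assuming the image defines one):

container_definitions = <<EOF
[
  {
    "name": "pcp-container",
    "image": "775362094965.dkr.ecr.us-west-2.amazonaws.com/weather-project:latest",
    "memory": 1024,
    "cpu": 512,
    "essential": true,
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 80
      }
    ]
  }
]
EOF

With no "entryPoint" key at all, ECS runs whatever the image's Dockerfile specifies.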

Terraform Launch Type Fargate for Windows container Error: You do not have authorization to access the specified platform

Description
Terraform: for launch type Fargate with a Windows container, I get the error below after running terraform apply:
Error: error creating app-name service: error waiting for ECS service (app-name) creation: AccessDeniedException: You do not have authorization to access the specified platform.
Terraform CLI and Terraform AWS Provider versions used:
User-Agent: APN/1.0 HashiCorp/1.0 Terraform/0.12.31 (+https://www.terraform.io) terraform-provider-aws/3.70.0 (+https://registry.terraform.io/providers/hashicorp/aws) aws-sdk-go/1.42.23 (go1.16; linux; amd64)
Affected Resource(s): aws_ecs_service
Terraform Configuration Files
resource "aws_ecs_task_definition" "app_task" {
family = "${var.tags["environment"]}-app"
container_definitions = data.template_file.app_task_definition.rendered
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
task_role_arn = aws_iam_role.ecs_role.arn
execution_role_arn = aws_iam_role.ecs_role.arn
memory = var.fargate_memory
cpu = var.fargate_cpu
runtime_platform {
operating_system_family = "WINDOWS_SERVER_2019_CORE"
cpu_architecture = "X86_64"
}
depends_on = [null_resource.confd_cluster_values]
}
resource "aws_ecs_service" "app" {
name = "${var.tags["environment"]}-app"
cluster = data.terraform_remote_state.fargate_cluster.outputs.cluster.id
task_definition = aws_ecs_task_definition.app_task.arn
desired_count = var.ecs_app_desired_count
health_check_grace_period_seconds = 2147483647
deployment_minimum_healthy_percent = 0
deployment_maximum_percent = 100
launch_type = "FARGATE"
enable_execute_command = true
network_configuration {
security_groups = [data.terraform_remote_state.fargate_cluster.outputs.cluster_security_group]
subnets = data.aws_subnet_ids.private.ids
}
load_balancer {
target_group_arn = aws_alb_target_group.app.arn
container_name = var.alb_target_container_name
container_port = 8097
}
lifecycle {
ignore_changes = [desired_count]
}
depends_on = [aws_ecs_task_definition.app_task]
}
Debug Output
-----------------------------------------------------: timestamp=2022-01-01T16:30:06.055+0530
2022-01-01T16:30:06.055+0530 [INFO] plugin.terraform-provider-aws_v3.70.0_x5: 2022/01/01 16:30:06 [DEBUG] [aws-sdk-go] {"__type":"AccessDeniedException","message":"You do not have authorization to access the specified platform."}: timestamp=2022-01-01T16:30:06.055+0530
2022-01-01T16:30:06.055+0530 [INFO] plugin.terraform-provider-aws_v3.70.0_x5: 2022/01/01 16:30:06 [DEBUG] [aws-sdk-go] DEBUG: Validate Response ecs/CreateService failed, attempt 0/25, error AccessDeniedException: You do not have authorization to access the specified platform.: timestamp=2022-01-01T16:30:06.055+0530
The issue is not due to your TF code, but due to the IAM permissions you use to run it. You have to verify your permissions. You may also be limited at the AWS Organization level if your account is part of a group of accounts.
After reading https://aws.amazon.com/blogs/containers/running-windows-containers-with-amazon-ecs-on-aws-fargate/ I came to know that the Amazon ECS Exec feature is unsupported in Fargate for Windows tasks, which is why the error occurred.
Disabling the setting below in aws_ecs_service resolved the issue:
enable_execute_command = true
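In other words, the service from the configuration above deploys once that argument is removed or set to false; a minimal sketch of the relevant portion:

resource "aws_ecs_service" "app" {
  name            = "${var.tags["environment"]}-app"
  cluster         = data.terraform_remote_state.fargate_cluster.outputs.cluster.id
  task_definition = aws_ecs_task_definition.app_task.arn
  launch_type     = "FARGATE"

  # ECS Exec is not supported for Windows tasks on Fargate, so leave it off
  enable_execute_command = false

  # (remaining arguments unchanged from the configuration above)
}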
It would be helpful if Terraform could show users an appropriate message saying the feature is not available for Windows, instead of throwing the error "You do not have authorization to access the specified platform."

AWS Kinesis Firehose unable to index data into AWS Elasticsearch

I am trying to send data from Amazon Kinesis Data Firehose to Amazon Elasticsearch Service, but it's logging an error saying 503 Service Unavailable. However, I can reach the Elasticsearch endpoint (https://vpc-XXX.<region>.es.amazonaws.com) and make queries on it. I also went through "How can I prevent HTTP 503 Service Unavailable errors in Amazon Elasticsearch Service?" and can confirm my setup has enough resources.
Here's the error I get in my S3 backup bucket that holds the failed logs:
{
  "attemptsMade": 8,
  "arrivalTimestamp": 1599748282943,
  "errorCode": "ES.ServiceException",
  "errorMessage": "Error received from Elasticsearch cluster. <html><body><h1>503 Service Unavailable</h1>\nNo server is available to handle this request.\n</body></html>",
  "attemptEndingTimestamp": 1599748643460,
  "rawData": "eyJ0aWNrZXJfc3ltYm9sIjoiQUxZIiwic2VjdG9yIjoiRU5FUkdZIiwiY2hhbmdlIjotNi4zNSwicHJpY2UiOjg4LjgzfQ==",
  "subsequenceNumber": 0,
  "esDocumentId": "49610662085822146490768158474738345331794592496281976834.0",
  "esIndexName": "prometheus-2020-09",
  "esTypeName": ""
}
Does anyone have any ideas on how to fix this and get the data indexed into Elasticsearch?
Turns out my issue was selecting the wrong security group.
I was using the same security group (I named it elasticsearch-${domain_name}) that was attached to the Elasticsearch domain (which allowed TCP ingress/egress on port 443 to/from the firehose_es security group). I should have selected the firehose_es security group instead.
As requested in the comment, here's the Terraform configuration for the firehose_es SG.
resource "aws_security_group" "firehose_es" {
name = "firehose_es"
description = "Firehose to send logs to Elasticsearch"
vpc_id = module.networking.aws_vpc_id
}
resource "aws_security_group_rule" "firehose_es_https_ingress" {
type = "ingress"
from_port = 443
to_port = 443
protocol = "tcp"
security_group_id = aws_security_group.firehose_es.id
cidr_blocks = ["10.0.0.0/8"]
}
resource "aws_security_group_rule" "firehose_es_https_egress" {
type = "egress"
from_port = 443
to_port = 443
protocol = "tcp"
security_group_id = aws_security_group.firehose_es.id
source_security_group_id = aws_security_group.elasticsearch.id
}
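For context, the matching ingress rule on the Elasticsearch side (the one described above that admits Firehose traffic on port 443) looks roughly like this; the aws_security_group.elasticsearch name is taken from the egress rule above:

resource "aws_security_group_rule" "elasticsearch_https_ingress" {
  type                     = "ingress"
  from_port                = 443
  to_port                  = 443
  protocol                 = "tcp"
  security_group_id        = aws_security_group.elasticsearch.id
  source_security_group_id = aws_security_group.firehose_es.id
}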
Another thing I fixed prior to asking this question (which may be why some of you are reaching it) was to use the right role and attach the right policy to it. Here's my role (as Terraform config):
// https://docs.aws.amazon.com/firehose/latest/dev/controlling-access.html
data "aws_iam_policy_document" "firehose_es_policy_specific" {
statement {
actions = [
"s3:AbortMultipartUpload",
"s3:GetBucketLocation",
"s3:GetObject",
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:PutObject"
]
resources = [
aws_s3_bucket.firehose.arn,
"${aws_s3_bucket.firehose.arn}/*"
]
}
statement {
actions = [
"es:DescribeElasticsearchDomain",
"es:DescribeElasticsearchDomains",
"es:DescribeElasticsearchDomainConfig",
"es:ESHttpPost",
"es:ESHttpPut"
]
resources = [
var.elasticsearch_domain_arn,
"${var.elasticsearch_domain_arn}/*",
]
}
statement {
actions = [
"es:ESHttpGet"
]
resources = [
"${var.elasticsearch_domain_arn}/_all/_settings",
"${var.elasticsearch_domain_arn}/_cluster/stats",
"${var.elasticsearch_domain_arn}/${var.name_prefix}${var.name}_${var.app}*/_mapping/type-name",
"${var.elasticsearch_domain_arn}/_nodes",
"${var.elasticsearch_domain_arn}/_nodes/stats",
"${var.elasticsearch_domain_arn}/_nodes/*/stats",
"${var.elasticsearch_domain_arn}/_stats",
"${var.elasticsearch_domain_arn}/${var.name_prefix}${var.name}_${var.app}*/_stats"
]
}
statement {
actions = [
"ec2:DescribeVpcs",
"ec2:DescribeVpcAttribute",
"ec2:DescribeSubnets",
"ec2:DescribeSecurityGroups",
"ec2:DescribeNetworkInterfaces",
"ec2:CreateNetworkInterface",
"ec2:CreateNetworkInterfacePermission",
"ec2:DeleteNetworkInterface",
]
resources = [
"*"
]
}
}
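For completeness, the policy document above is attached to the aws_iam_role.firehose_es role that the delivery stream below references; a minimal sketch of that role and its attachment, with the standard firehose.amazonaws.com trust policy (resource names here are assumptions):

data "aws_iam_policy_document" "firehose_assume_role" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["firehose.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "firehose_es" {
  name               = "firehose_es"
  assume_role_policy = data.aws_iam_policy_document.firehose_assume_role.json
}

resource "aws_iam_role_policy" "firehose_es" {
  name   = "firehose_es"
  role   = aws_iam_role.firehose_es.id
  policy = data.aws_iam_policy_document.firehose_es_policy_specific.json
}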
resource "aws_kinesis_firehose_delivery_stream" "ecs" {
name = "${var.name_prefix}${var.name}_${var.app}"
destination = "elasticsearch"
s3_configuration {
role_arn = aws_iam_role.firehose_es.arn
bucket_arn = aws_s3_bucket.firehose.arn
buffer_interval = 60
compression_format = "GZIP"
}
elasticsearch_configuration {
domain_arn = var.elasticsearch_domain_arn
role_arn = aws_iam_role.firehose_es.arn
# If Firehose cannot deliver to Elasticsearch, logs are sent to S3
s3_backup_mode = "FailedDocumentsOnly"
buffering_interval = 60
buffering_size = 5
index_name = "${var.name_prefix}${var.name}_${var.app}"
index_rotation_period = "OneMonth"
vpc_config {
subnet_ids = var.elasticsearch_subnet_ids
security_group_ids = [var.firehose_security_group_id]
role_arn = aws_iam_role.firehose_es.arn
}
}
}
I was able to figure out my mistake after reading through the Controlling Access with Amazon Kinesis Data Firehose article again.

Attach Auto-Scaling Policy to ECS service from CLI

I have a service running on ECS deployed with Fargate. I am using ecs-cli compose to launch this service. Here is the command I currently use:
ecs-cli compose service up --cluster my_cluster --launch-type FARGATE
I also have an ecs-params.yml to configure this service. Here is the content:
version: 1
task_definition:
  task_execution_role: ecsTaskExecutionRole
  task_role_arn: arn:aws:iam::XXXXXX:role/MyExecutionRole
  ecs_network_mode: awsvpc
  task_size:
    mem_limit: 2GB
    cpu_limit: 1024
run_params:
  network_configuration:
    awsvpc_configuration:
      subnets:
        - "subnet-XXXXXXXXXXXXXXXXX"
        - "subnet-XXXXXXXXXXXXXXXXX"
      security_groups:
        - "sg-XXXXXXXXXXXXXX"
      assign_public_ip: ENABLED
Once the service is created, I have to log into the AWS console and attach an auto-scaling policy through the AWS GUI. Is there an easier way to attach an auto-scaling policy, either through the CLI or in my YAML configuration?
While you can use the AWS CLI itself (see application-autoscaling in the docs), I think it is much better for the entire operation to be performed as one deployment, and for that there are tools such as Terraform.
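If you do want the pure-CLI route, it is two calls: register the service as a scalable target, then attach a policy to it. Roughly (the cluster/service names and the target-tracking policy body are placeholders):

aws application-autoscaling register-scalable-target \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my_cluster/my_service \
  --min-capacity 1 \
  --max-capacity 4

aws application-autoscaling put-scaling-policy \
  --service-namespace ecs \
  --scalable-dimension ecs:service:DesiredCount \
  --resource-id service/my_cluster/my_service \
  --policy-name my-cpu-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 75.0,
    "PredefinedMetricSpecification": { "PredefinedMetricType": "ECSServiceAverageCPUUtilization" }
  }'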
You can use the terraform-ecs module written by arminc on GitHub, or you can do it yourself! Here's a quick (and really dirty) example for the entire cluster, but you can also just grab the autoscaling part and use that if you don't want to have the entire deployment in one place:
provider "aws" {
region = "us-east-1" # insert your own region
profile = "insert aw cli profile, should be located in ~/.aws/credentials file"
# you can also use your aws credentials instead
# access_key = "insert_access_key"
# secret_key = "insert_secret_key"
}
resource "aws_ecs_cluster" "cluster" {
name = "my-cluster"
}
resource "aws_ecs_service" "service" {
name = "my-service"
cluster = "${aws_ecs_cluster.cluster.id}"
task_definition = "${aws_ecs_task_definition.task_definition.family}:${aws_ecs_task_definition.task_definition.revision}"
network_configuration {
# These can also be created with Terraform and applied dynamically instead of hard-coded
# look it up in the Docs
security_groups = ["SG_IDS"]
subnets = ["SUBNET_IDS"] # can also be created with Terraform
assign_public_ip = true
}
}
resource "aws_ecs_task_definition" "task_definition" {
family = "my-service"
execution_role_arn = "ecsTaskExecutionRole"
task_role_arn = "INSERT_ARN"
network_mode = "awsvpc"
container_definitions = <<DEFINITION
[
{
"name": "my_service"
"cpu": 1024,
"environment": [{
"name": "exaple_ENV_VAR",
"value": "EXAMPLE_VALUE"
}],
"essential": true,
"image": "INSERT IMAGE URL",
"memory": 2048,
"networkMode": "awsvpc"
}
]
DEFINITION
}
#
# Application AutoScaling resources
#
resource "aws_appautoscaling_target" "main" {
service_namespace = "ecs"
resource_id = "service/${var.cluster_name}/${aws_ecs_service.service.name}"
scalable_dimension = "ecs:service:DesiredCount"
# Insert Min and Max capacity here
min_capacity = "1"
max_capacity = "4"
depends_on = [
"aws_ecs_service.main",
]
}
resource "aws_appautoscaling_policy" "up" {
name = "scaling_policy-${aws_ecs_service.service.name}-up"
service_namespace = "ecs"
resource_id = "service/${aws_ecs_cluster.cluster.name}/${aws_ecs_service.service.name}"
scalable_dimension = "ecs:service:DesiredCount"
step_scaling_policy_configuration {
adjustment_type = "ChangeInCapacity"
cooldown = "60" # In seconds
metric_aggregation_type = "Average"
step_adjustment {
metric_interval_lower_bound = 0
scaling_adjustment = 1 # you can also use negative numbers for scaling down
}
}
depends_on = [
"aws_appautoscaling_target.main",
]
}
resource "aws_appautoscaling_policy" "down" {
name = "scaling_policy-${aws_ecs_service.service.name}-down"
service_namespace = "ecs"
resource_id = "service/${aws_ecs_cluster.cluster.name}/${aws_ecs_service.service.name}"
scalable_dimension = "ecs:service:DesiredCount"
step_scaling_policy_configuration {
adjustment_type = "ChangeInCapacity"
cooldown = "60" # In seconds
metric_aggregation_type = "Average"
step_adjustment {
metric_interval_upper_bound = 0
scaling_adjustment = -1 # scale down example
}
}
depends_on = [
"aws_appautoscaling_target.main",
]
}