Cannot use memoryReservation in Terraform ECS provider

I am running tests where it is not desirable for my containers to have hard memory limits, because I am programmatically swapping the VMs for bigger-sized ones and need the containers to pick up the extra CPU and memory automatically.
I want to explore memoryReservation, since it is a soft limit and will allow the containers to scale up as long as the VM is not low on memory.
Unfortunately, this parameter does not seem to work in the task definition. Any ideas?
Task definition:
resource "aws_ecs_task_definition" "quorum" {
family = "quorum-${var.consensus_mechanism}-${var.tx_privacy_engine}-${var.network_name}"
container_definitions = "${replace(element(compact(local.container_definitions), 0), "/\"(true|false|[0-9]+)\"/", "$1")}"
requires_compatibilities = ["${var.ecs_mode}"]
# cpu = "4096"
# memory = "81920"
memoryReservation = "8192"
network_mode = "${var.ecs_network_mode}"
task_role_arn = "${aws_iam_role.ecs_task.arn}"
execution_role_arn = "${aws_iam_role.ecs_task.arn}"
volume {
name = "${local.shared_volume_name}"
}
volume {
name = "docker_socket"
host_path = "/var/run/docker.sock"
}
}
Error:
[FINAL] Summary execution:
Wrote summarry output to: .mjolnir//output.log
2 errors occurred:
* aws_ecs_service.quorum: 5 errors occurred:
* aws_ecs_service.quorum[3]: Resource 'aws_ecs_task_definition.quorum' not found for variable 'aws_ecs_task_definition.quorum.revision'
* aws_ecs_service.quorum[0]: Resource 'aws_ecs_task_definition.quorum' not found for variable 'aws_ecs_task_definition.quorum.revision'
* aws_ecs_service.quorum[4]: Resource 'aws_ecs_task_definition.quorum' not found for variable 'aws_ecs_task_definition.quorum.revision'
* aws_ecs_service.quorum[2]: Resource 'aws_ecs_task_definition.quorum' not found for variable 'aws_ecs_task_definition.quorum.revision'
* aws_ecs_service.quorum[1]: Resource 'aws_ecs_task_definition.quorum' not found for variable 'aws_ecs_task_definition.quorum.revision'
* output._status: Resource 'aws_ecs_task_definition.quorum' not found for variable 'aws_ecs_task_definition.quorum.revision'
Restoring env variables.
Error occured: 4
I would be deeply appreciative of any pointers.

ECS task definitions are made up of one or more container definitions plus some extra parameters that can set hard limits for the whole task, as well as things like placement constraints and networking configuration.
To set the soft memory limit that a container is allowed to use in ECS, rather than the hard limit, you need to set memoryReservation in the container definition rather than on the task definition.
The code in your question doesn't show how you are defining the container definitions in your local, but a basic example of setting soft memory limits in an ECS task would look something like this:
resource "aws_ecs_task_definition" "service" {
family = "service"
container_definitions = <<EOF
[
{
"name": "first",
"image": "service-first",
"cpu": 10,
"memoryReservation": 512,
"essential": true,
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
]
},
{
"name": "second",
"image": "service-second",
"cpu": 10,
"memoryReservation": 256,
"essential": true,
"portMappings": [
{
"containerPort": 443,
"hostPort": 443
}
]
}
]
EOF
}
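Applied to the resource in the question, that means removing the memoryReservation argument from aws_ecs_task_definition entirely (the resource has no such argument, which is likely why the dependent services report the task definition as not found) and setting memoryReservation per container inside local.container_definitions instead. A sketch, with everything else unchanged from the question:

resource "aws_ecs_task_definition" "quorum" {
  family = "quorum-${var.consensus_mechanism}-${var.tx_privacy_engine}-${var.network_name}"

  # memoryReservation is set per container inside local.container_definitions,
  # not on the task definition
  container_definitions    = "${replace(element(compact(local.container_definitions), 0), "/\"(true|false|[0-9]+)\"/", "$1")}"
  requires_compatibilities = ["${var.ecs_mode}"]
  network_mode             = "${var.ecs_network_mode}"
  task_role_arn            = "${aws_iam_role.ecs_task.arn}"
  execution_role_arn       = "${aws_iam_role.ecs_task.arn}"

  volume {
    name = "${local.shared_volume_name}"
  }

  volume {
    name      = "docker_socket"
    host_path = "/var/run/docker.sock"
  }
}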

Related

AWS ECS error "Container.name contains invalid characters"

I am setting up an ECS task using Terraform and am encountering an error. The error is “Error: failed creating ECS Task Definition (web-2048-task): ClientException: Container.name contains invalid characters.”
Here is my task code:
resource "aws_ecs_task_definition" "aws-ecs-task" {
family = "${var.app_name}-task"
container_definitions = file("templates/task.json")
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
memory = "2048" #2 GB
cpu = "1024" #1 vCPU
execution_role_arn = aws_iam_role.ecsTaskExecutionRole.arn
task_role_arn = aws_iam_role.ecsTaskExecutionRole.arn
tags = {
Name = "${var.app_name}-ecs-td"
Environment = var.app_environment
}
}
Here is the json task definition code:
[
  {
    "container_name": "${var.app_name}",
    "name": "${var.app_name}",
    "image": "${data.aws_ecr_repository.aws-ecr.repository_url}:latest",
    "essential": true
  },
  {
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 80
      }
    ],
    "cpu": 1024,
    "memory": 2048,
    "networkMode": "awsvpc"
  }
]
I tried replacing the ${var.app_name} in the JSON with the name of the app, which is 'web-2048', but then the error changes to “Container.name should not be null or empty.” When I change the app name back to a variable, I get the original error again.
I checked the Terraform Registry for aws_ecs_task_definition but didn't see any info regarding container names. Same for the task definition section of the AWS developer guide, but I didn't find anything that helped.
Can I get some guidance on this?
I managed to reproduce the error. It happens because the variables are not replaced with values in the task.json file, since you are using the file built-in function:
file reads the contents of a file at the given path and returns them as a string.
Because of that, "name": "${var.app_name}" is read as a literal string. Given that, and as per the documentation [1]:
Up to 255 letters (uppercase and lowercase), numbers, hyphens, and underscores are allowed
The characters ., $ and {} are not allowed, which is why you are getting the ClientException: Container.name contains invalid characters. error.
To fix this, I suggest using the templatefile built-in function [2]. It requires some changes to the code:
resource "aws_ecs_task_definition" "aws-ecs-task" {
family = "${var.app_name}-task"
container_definitions = templatefile("templates/task.json", {
app_name = var.app_name
repository_url = data.aws_ecr_repository.aws-ecr.repository_url
})
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
memory = "2048" #2 GB
cpu = "1024" #1 vCPU
execution_role_arn = aws_iam_role.ecsTaskExecutionRole.arn
task_role_arn = aws_iam_role.ecsTaskExecutionRole.arn
tags = {
Name = "${var.app_name}-ecs-td"
Environment = var.app_environment
}
}
This will also require slight modifications in the task.json file:
[
  {
    "name": "${app_name}",
    "image": "${repository_url}:latest",
    "essential": true
  },
  {
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 80
      }
    ],
    "cpu": 1024,
    "memory": 2048,
    "networkMode": "awsvpc"
  }
]
What templatefile does is replace the placeholder variables you pass to it (app_name and repository_url) inside the JSON file with the values you provide. Additionally, you might consider renaming the template file to something like task.json.tftpl. The call to the templatefile function would then have to be adjusted to:
container_definitions = templatefile("templates/task.json.tftpl", {
  app_name       = var.app_name
  repository_url = data.aws_ecr_repository.aws-ecr.repository_url
})
[1] https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html
[2] https://developer.hashicorp.com/terraform/language/functions/templatefile
Have you checked whether the value substituted into your container's name attribute contains only valid characters and does not include a '.'? The specific error you are getting appears to be because a '.' character has ended up in the name attribute. From the above docs:
Up to 255 letters (uppercase and lowercase), numbers, hyphens, and underscores are allowed
If that doesn't resolve the 'Container.name should not be null or empty' issue, it could have another cause: some people have mistakenly used the key-value pair format of an environment variable entry for the container name, so double check that the key really is "name":
{ "name": "xxx", "value": "yyy" }

ResourceInitializationError with Fargate ECS deployment

I'm fairly new to AWS. I am trying to deploy a docker container to ECS but it fails with the following error:
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 3 time(s): RequestError: send request failed caused by: Post "https://api.ecr.us-east-1.amazonaws.com/": dial tcp 52.46.146.144:443: i/o timeout
This was working perfectly fine until I tried to add a load balancer, at which point this error began occurring. I must have changed something, but I'm not sure what.
The ECS instance is in a public subnet
The security group has in/out access on all ports/ips (0.0.0.0/0)
The VPC has an internet gateway
Clearly something is wrong with my config, but I'm not sure what. Google and other Stack Overflow posts haven't helped so far.
Terraform ECS file:
resource "aws_ecs_cluster" "solmines-ecs-cluster" {
name = "solmines-ecs-cluster"
}
resource "aws_ecs_service" "solmines-ecs-service" {
name = "solmines"
cluster = aws_ecs_cluster.solmines-ecs-cluster.id
task_definition = aws_ecs_task_definition.solmines-ecs-task-definition.arn
launch_type = "FARGATE"
desired_count = 1
network_configuration {
security_groups = [aws_security_group.solmines-ecs.id]
subnets = ["${aws_subnet.solmines-public-subnet1.id}", "${aws_subnet.solmines-public-subnet2.id}"]
assign_public_ip = true
}
load_balancer {
target_group_arn = aws_lb_target_group.solmines-lb-tg.arn
container_name = "solmines-api"
container_port = 80
}
depends_on = [aws_lb_listener.solmines-lb-listener]
}
resource "aws_ecs_task_definition" "solmines-ecs-task-definition" {
family = "solmines-ecs-task-definition"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
memory = "1024"
cpu = "512"
execution_role_arn = "${aws_iam_role.solmines-ecs-role.arn}"
container_definitions = <<EOF
[
{
"name": "solmines-api",
"image": "${aws_ecr_repository.solmines-ecr-repository.repository_url}:latest",
"memory": 1024,
"cpu": 512,
"essential": true,
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
]
}
]
EOF
}

AWS ECS does not start new tasks

AWS ECS cluster services do not start new tasks.
Already checked:
ECS EC2 instances are registered and active, full CPU and memory are available, and the ECS agent is connected.
There are no events in the ECS service "Events" tab: nothing about registering, starting or stopping, no errors, it's just empty.
The registered EC2 instances are set up correctly; in another cluster the same AMI works perfectly.
The task definition is correct; it worked a day before and no changes have been made since.
Checked that the service role contains all relevant policies.
Querying ECS with the AWS CLI (aws ecs describe-services --services my-service --cluster my-cluster) shows that the deployment rollout is constantly IN_PROGRESS and stays like this.
Full response with configuration is here (I've substituted real names and IDs):
{
  "serviceArn": "arn:aws:ecs:eu-central-1:my-account-id:service/my-cluster/my-service",
  "serviceName": "my-service",
  "clusterArn": "arn:aws:ecs:eu-central-1:my-account-id:cluster/my-cluster",
  "loadBalancers": [
    {
      "targetGroupArn": "arn:aws:elasticloadbalancing:eu-central-1:my-account-id:targetgroup/my-service-lb/load-balancer-id",
      "containerName": "my-service",
      "containerPort": 8065
    }
  ],
  "serviceRegistries": [
    {
      "registryArn": "arn:aws:servicediscovery:eu-central-1:my-account-id:service/srv-srv_id",
      "containerName": "my-service",
      "containerPort": 8065
    }
  ],
  "status": "ACTIVE",
  "desiredCount": 1,
  "runningCount": 0,
  "pendingCount": 0,
  "launchType": "EC2",
  "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
  "deploymentConfiguration": {
    "deploymentCircuitBreaker": {
      "enable": false,
      "rollback": false
    },
    "maximumPercent": 200,
    "minimumHealthyPercent": 100
  },
  "deployments": [
    {
      "id": "ecs-svc/deployment_id",
      "status": "PRIMARY",
      "taskDefinition": "arn:aws:ecs:eu-central-1:my-account-id:task-definition/my-service:76",
      "desiredCount": 1,
      "pendingCount": 0,
      "runningCount": 0,
      "failedTasks": 0,
      "createdAt": "2022-06-28T09:15:08.241000+02:00",
      "updatedAt": "2022-06-28T09:15:08.241000+02:00",
      "launchType": "EC2",
      "rolloutState": "IN_PROGRESS",
      "rolloutStateReason": "ECS deployment ecs-svc/deployment_id in progress."
    }
  ],
  "roleArn": "arn:aws:iam::my-account-id:role/aws-service-role/ecs.amazonaws.com/AWSServiceRoleForECS",
  "events": [],
  "createdAt": "2022-06-28T09:15:08.241000+02:00",
  "placementConstraints": [],
  "placementStrategy": [
    {
      "type": "spread",
      "field": "attribute:ecs.availability-zone"
    }
  ],
  "healthCheckGracePeriodSeconds": 120,
  "schedulingStrategy": "REPLICA",
  "createdBy": "arn:aws:iam::my-account-id:role/my-role",
  "enableECSManagedTags": false,
  "propagateTags": "NONE",
  "enableExecuteCommand": false
}
The ECS service and the service discovery entry are created using Terraform, and the service definition is:
resource "aws_service_discovery_service" "ecs_discovery_service" {
name = var.service_name
dns_config {
namespace_id = var.service_discovery_hosted_zone_id
dns_records {
ttl = 10
type = "SRV"
}
}
health_check_custom_config {
failure_threshold = 1
}
}
resource "aws_ecs_service" "ecs_service" {
name = var.service_name
cluster = var.ecs_cluster_id
task_definition = var.task_definition_arn
desired_count = var.desired_count
deployment_minimum_healthy_percent = 100
deployment_maximum_percent = 200
health_check_grace_period_seconds = var.health_check_grace_period_seconds
target_group_arn = aws_lb_target_group.target_group.arn
container_name = var.service_name
container_port = var.service_container_port
ordered_placement_strategy {
type = "spread"
field = "attribute:ecs.availability-zone"
}
service_registries {
registry_arn = aws_service_discovery_service.ecs_discovery_service.arn
container_name = var.service_name
container_port = var.service_container_port
}
}
This code used to work fine, and without any changes to the infrastructure code, after destroying and re-applying it, ECS does not start any new tasks.
I could narrow the problem down to service discovery: if I remove the service_registries section, the tasks start as normal.
Removing the service discovery works around the issue, but it is not a proper solution and I don't understand what the root cause of the problem is.
Again, the service role has the permissions for service discovery:
"servicediscovery:DeregisterInstance",
"servicediscovery:Get*",
"servicediscovery:List*",
"servicediscovery:RegisterInstance",
"servicediscovery:UpdateInstanceCustomHealthStatus"
I can't find any way to trace this strange behaviour and want to ask you for help:
Could you give me any hints on what or where I could check? I've gone through multiple troubleshooting guides, but all of them rely on events in the ECS service and I don't have any there; everything else I had in mind has already been checked.
Maybe you know why service discovery could block ECS from starting new tasks? I thought ECS adds an SRV record to the registry when it starts the container and the container is healthy, but I could not see that any containers had been started at all.
I would be very thankful for any hints and let me know if you need any details.
Have a nice day and best regards.

CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim: OCI runtime create failed:

Here's the full error message:
CannotStartContainerError: ResourceInitializationError: failed to create new container runtime task: failed to create shim: OCI runtime create failed: container_linux.go:380: starting container process caused: exec: "/": permission denied: unknown Entry point
I have an application that I created a Docker image for and had working fine on Lambda. The image is in ECR. I deleted my Lambda function and created a Docker container in ECS from that image, using Fargate.
Here is the main.tf file in my Terraform ECS module that I used to create this task:
resource "aws_ecs_cluster" "cluster" {
name = "python-cloud-cluster"
}
resource "aws_ecs_service" "ecs-service" {
name = "python-cloud-project"
cluster = aws_ecs_cluster.cluster.id
task_definition = aws_ecs_task_definition.pcp-ecs-task-definition.arn
launch_type = "FARGATE"
network_configuration {
subnets = var.service_subnets
security_groups = var.pcp_service_sg
assign_public_ip = true
}
desired_count = 1
}
resource "aws_ecs_task_definition" "pcp-ecs-task-definition" {
family = "ecs-task-definition-pcp"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
memory = "1024"
cpu = "512"
task_role_arn = var.task_role_arn
execution_role_arn = var.task_role_arn
container_definitions = <<EOF
[
{
"name": "pcp-container",
"image": "775362094965.dkr.ecr.us-west-2.amazonaws.com/weather-project:latest",
"memory": 1024,
"cpu": 512,
"essential": true,
"entryPoint": ["/"],
"portMappings": [
{
"containerPort": 80,
"hostPort": 80
}
]
}
]
EOF
}
I found a base template online and altered it to fit my needs. I just realized the entry point is set to ["/"] in the task definition, which was the default from the template I used. What should I be setting it to? Or is this error caused by a different issue?
entryPoint is optional; you don't have to specify it if you don't know what it should be.
In your case it is /, which is incorrect. It should be an executable (e.g. /bin/bash), and the right value depends on your container and what the container does. But again, it's optional.
You have to check the documentation of your weather-project container to see what exactly it does and how to use it.
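If the weather-project image's Dockerfile already defines a suitable ENTRYPOINT or CMD (an assumption, since only the image name is shown in the question), the simplest fix is usually to drop the entryPoint field from the container definition so the image's own entry point takes effect, e.g.:

[
  {
    "name": "pcp-container",
    "image": "775362094965.dkr.ecr.us-west-2.amazonaws.com/weather-project:latest",
    "memory": 1024,
    "cpu": 512,
    "essential": true,
    "portMappings": [
      {
        "containerPort": 80,
        "hostPort": 80
      }
    ]
  }
]

Alternatively, set entryPoint to the actual executable the image is supposed to run.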

awsvpc: Network Configuration is not valid for the given networkMode of this task definition

My task definition:
resource "aws_ecs_task_definition" "datadog" {
family = "${var.environment}-datadog-agent-task"
task_role_arn = "arn:aws:iam::xxxxxxxx:role/datadog-role"
container_definitions = <<EOF
[
{
"name": "${var.environment}-${var.datadog-identifier}",
"network_mode" : "awsvpc",
"image": "datadog/agent:latest",
"portMappings": [
{
...
My service definition:
resource "aws_ecs_service" "datadog" {
name = "${var.environment}-${var.datadog-identifier}-datadog-ecs-service"
cluster = "${var.cluster}"
task_definition = "${aws_ecs_task_definition.datadog.arn}"
network_configuration {
subnets = flatten(["${var.private_subnet_ids}"])
}
# This allows running one for every instance
scheduling_strategy = "DAEMON"
}
I get the following error -
InvalidParameterException: Network Configuration is not valid for the given networkMode of this task definition
Is there something I am missing here? Looking at the Terraform docs and GitHub issues this should have worked. Is it related to running Datadog as a daemon?
You need to set the aws_ecs_task_definition's network_mode to awsvpc if you are defining the network_configuration of the service that uses that task definition.
This is mentioned in the documentation for the network_configuration parameter of the aws_ecs_service resource:
network_configuration - (Optional) The network configuration for the
service. This parameter is required for task definitions that use the
awsvpc network mode to receive their own Elastic Network Interface,
and it is not supported for other network modes.
In your case you've added the network_mode parameter to the container definition instead of the task definition (a task is a collection of one or more containers that are grouped together to share some resources). The container definition schema doesn't allow a network_mode parameter.
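A minimal sketch of the corrected resources, trimmed to the fields shown in the question (the truncated portMappings are omitted), with network_mode moved onto the task definition:

resource "aws_ecs_task_definition" "datadog" {
  family        = "${var.environment}-datadog-agent-task"
  task_role_arn = "arn:aws:iam::xxxxxxxx:role/datadog-role"

  # network_mode belongs on the task definition, not inside the container definition
  network_mode = "awsvpc"

  container_definitions = <<EOF
[
  {
    "name": "${var.environment}-${var.datadog-identifier}",
    "image": "datadog/agent:latest"
  }
]
EOF
}

resource "aws_ecs_service" "datadog" {
  name            = "${var.environment}-${var.datadog-identifier}-datadog-ecs-service"
  cluster         = "${var.cluster}"
  task_definition = "${aws_ecs_task_definition.datadog.arn}"

  # network_configuration is only valid when the task definition uses awsvpc
  network_configuration {
    subnets = flatten(["${var.private_subnet_ids}"])
  }

  scheduling_strategy = "DAEMON"
}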