AWS CloudFormation creates task definition with no container definition

The following CloudFormation script creates a task definition but does not seem to create the container definition correctly. Can anyone tell me why?
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Test stack for troubleshooting task creation",
  "Parameters": {
    "TaskFamily": {
      "Description": "The task family to associate the task definition with.",
      "Type": "String",
      "Default": "Dm-Testing"
    }
  },
  "Resources": {
    "TaskDefinition": {
      "Type": "AWS::ECS::TaskDefinition",
      "Properties": {
        "Family": {
          "Ref": "TaskFamily"
        },
        "RequiresCompatibilities": ["EC2"],
        "ContainerDefinitions": [
          {
            "Name": "sample-app",
            "Image": "nginx",
            "Memory": 200,
            "Cpu": 10,
            "Essential": true,
            "Environment": [
              {
                "Name": "SOME_ENV_VARIABLE",
                "Value": "SOME_VALUE"
              }
            ]
          }
        ]
      }
    }
  }
}
When I view the created task definition, no container is listed in the builder view in the AWS console.
The container information does appear, however, under the JSON tab of the task definition.
The result is that, when the task is run in a cluster, the image does run, but without the environment variables applied. In addition, CloudFormation does not report any errors when creating the stack or when running the created task.
Finally, this CloudFormation script is a cut-down example of the 'real' script, which has started exhibiting the same issue. That script had been working fine for around a year and, as far as I can see, there were no changes to it between it working and it breaking.
I would greatly appreciate any thoughts or suggestions on this because my face is beginning to hurt from smashing it against this particular wall.

It turns out this was a bug in CloudFormation that only occurred when creating a task definition using a script through the AWS console. Amazon has now resolved it.

Related

ECS / EC2 auto scaling doesn't deal with two tasks one after the other

I'm currently at my wits' end trying to figure this out.
We have a Step Functions pipeline that runs tasks on a mixture of Fargate and EC2 ECS instances. They are all in the same cluster.
If we run a task that requires EC2 and we want to run another task afterwards that also uses EC2, we have to put a 20-minute Wait state in between for the second task to run successfully.
It doesn't seem to reuse the existing EC2 instances, or to scale out any further, when we run the second task; it gives the error RESOURCE:MEMORY. I would expect it to scale up more EC2 instances to match the demand, or to use the existing EC2 instances to run the tasks.
The ECS cluster has a capacity provider with managed scaling on, managed termination protection on, and a target capacity of 100%.
The ASG has a minimum capacity of 0 and a maximum capacity of 8, with managed scaling on.
The instance type is r5.4xlarge.
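For reference, that capacity provider setup corresponds to roughly the following CloudFormation resource. This is only an orientation sketch based on the settings described above; the resource name and the ASG ARN are placeholders, not values from the question.
{
  "EcsCapacityProvider": {
    "Type": "AWS::ECS::CapacityProvider",
    "Properties": {
      "AutoScalingGroupProvider": {
        "AutoScalingGroupArn": "arn:aws:autoscaling:REGION:ACCOUNT_ID:autoScalingGroup:...",
        "ManagedScaling": {
          "Status": "ENABLED",
          "TargetCapacity": 100
        },
        "ManagedTerminationProtection": "ENABLED"
      }
    }
  }
}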
Example step function that recreates the problem:
{
  "StartAt": "Set up variables",
  "States": {
    "Set up variables": {
      "Type": "Pass",
      "Next": "Map1",
      "Result": [1, 2, 3],
      "ResultPath": "$.input"
    },
    "Map1": {
      "Type": "Map",
      "Next": "Map2",
      "ItemsPath": "$.input",
      "ResultPath": null,
      "Iterator": {
        "StartAt": "Inner1",
        "States": {
          "Inner1": {
            "ResultPath": null,
            "Type": "Task",
            "TimeoutSeconds": 2000,
            "End": true,
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
              "Cluster": "arn:aws:ecs:CLUSTER_ID",
              "TaskDefinition": "processing-task",
              "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                  "Subnets": ["subnet-111"]
                }
              },
              "Overrides": {
                "Memory": "110000",
                "Cpu": "4096",
                "ContainerOverrides": [
                  {
                    "Command": ["sh", "-c", "sleep 600"],
                    "Name": "processing-task"
                  }
                ]
              }
            }
          }
        }
      }
    },
    "Map2": {
      "Type": "Map",
      "End": true,
      "ItemsPath": "$.input",
      "Iterator": {
        "StartAt": "Inner2",
        "States": {
          "Inner2": {
            "ResultPath": null,
            "Type": "Task",
            "TimeoutSeconds": 2000,
            "End": true,
            "Resource": "arn:aws:states:::ecs:runTask.sync",
            "Parameters": {
              "Cluster": "arn:aws:ecs:CLUSTER_ID",
              "TaskDefinition": "processing-task",
              "NetworkConfiguration": {
                "AwsvpcConfiguration": {
                  "Subnets": ["subnet-111"]
                }
              },
              "Overrides": {
                "Memory": "110000",
                "Cpu": "4096",
                "ContainerOverrides": [
                  {
                    "Command": ["sh", "-c", "sleep 600"],
                    "Name": "processing-task"
                  }
                ]
              }
            }
          }
        }
      }
    }
  }
}
What I've tried so far:
I've tried changing the cooldown period for the EC2 instances, with a small amount of success: it now scales up faster, but we still have to wait before running more tasks, just for a shorter time.
Please let me know whether what we want is possible and, if it is, how to do it.
Thank you.
I very recently ran into a similar scenario with a capacity provider. Bursts of concurrent task placements via ECS run-task (invoked from a Lambda) were not returning task information in the response. Despite this, a task was queued in the PROVISIONING state on the cluster, where it would sit for some time and then eventually fail to start with the error RESOURCE:MEMORY.
Speculation: the problem seems to be related to the capacity provider's refresh interval for CapacityProviderReservation: https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/.
CapacityProviderReservation needs to change in order for your cluster to scale out (or in) based on its alarm, but bursts of task placements that exceed your total current capacity don't always seem to satisfy this requirement.
We were able to overcome this failure to place tasks by exponentially backing off and retrying the call to ECS run-task whenever the response contains an empty tasks[] collection. This has had only a minor impact on our task placement throughput, and we haven't seen the problem recur since.
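The workaround above retries from a Lambda; if the tasks are launched directly from Step Functions, as in the question, a comparable exponential backoff can be declared on the task state itself with a Retry policy. Below is a minimal sketch of what that could look like for the Inner1 state (the Overrides from the original state are omitted for brevity, and the error names, interval, attempt count and backoff rate are assumptions, not values from the answer):
"Inner1": {
  "Type": "Task",
  "Resource": "arn:aws:states:::ecs:runTask.sync",
  "TimeoutSeconds": 2000,
  "ResultPath": null,
  "Parameters": {
    "Cluster": "arn:aws:ecs:CLUSTER_ID",
    "TaskDefinition": "processing-task",
    "NetworkConfiguration": {
      "AwsvpcConfiguration": {
        "Subnets": ["subnet-111"]
      }
    }
  },
  "Retry": [
    {
      "ErrorEquals": ["ECS.AmazonECSException", "States.TaskFailed"],
      "IntervalSeconds": 60,
      "MaxAttempts": 5,
      "BackoffRate": 2.0
    }
  ],
  "End": true
}
The idea is the same as in the answer: give the CapacityProviderReservation alarm time to trigger a scale-out between attempts, rather than failing the whole execution on the first RESOURCE:MEMORY rejection.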

Running a public image from AWS ECR in ECS Cluster

I have successfully pushed my 3 Docker images to ECR.
Configured an ECS cluster.
Created 3 task definitions for those 3 images stored in respective ECR repositories.
Now, I want to run a public image of Redis on the same cluster as a different task. I tried creating a task definition for it using the following URL: public.ecr.aws/ubuntu/redis:latest
But as soon as I run it as a new task I get the following error:
Essential container in task exited
Any specific reason for this error or am I doing something wrong?
OK, so the Redis image needs you to either set a password (which I recommend as well) or explicitly allow connections to Redis without a password.
To configure a password or to disable password authentication, you need to set environment variables for the image. You can read the image's documentation under the heading Configuration.
Luckily, this is easy in ECS: you specify the environment variable in the task definition. So either:
{
  "family": "",
  "containerDefinitions": [
    {
      "name": "",
      "image": "",
      ...
      "environment": [
        {
          "name": "ALLOW_EMPTY_PASSWORD",
          "value": "yes"
        }
      ],
      ...
    }
  ],
  ...
}
or for a password:
{
  "family": "",
  "containerDefinitions": [
    {
      "name": "",
      "image": "",
      ...
      "environment": [
        {
          "name": "REDIS_PASSWORD",
          "value": "your_password"
        }
      ],
      ...
    }
  ],
  ...
}
For more granular configuration you should read the documentation of the Redis Docker image mentioned above.
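Putting the pieces together for the image in the question, a complete task definition could look roughly like the following. The family name, memory value and port mapping are illustrative assumptions, not values from the question.
{
  "family": "redis-task",
  "containerDefinitions": [
    {
      "name": "redis",
      "image": "public.ecr.aws/ubuntu/redis:latest",
      "essential": true,
      "memory": 256,
      "portMappings": [
        {
          "containerPort": 6379,
          "protocol": "tcp"
        }
      ],
      "environment": [
        {
          "name": "ALLOW_EMPTY_PASSWORD",
          "value": "yes"
        }
      ]
    }
  ]
}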

AWS Cloudwatch (EventBridge) Event Rule for AWS Batch with Environment Variables

I have created a CloudWatch Events (EventBridge) rule that triggers an AWS Batch job, and I want to specify an environment variable and parameters. I'm trying to do so with the following configured input (Constant [JSON text]), but when the job is submitted, the environment variables I'm trying to set on the job are not included. The parameters are working as expected.
{
  "ContainerProperties": {
    "Environment": [
      {
        "Name": "MY_ENV_VAR",
        "Value": "MyVal"
      }
    ]
  },
  "Parameters": {
    "one": "1",
    "two": "2",
    "three": "3"
  }
}
As I was typing out the question, I thought to look at the SubmitJob API to see what I was doing wrong (I had been referencing the CloudFormation templates for the job definition, per my thought process above). In case it helps others: I found that I needed to use ContainerOverrides rather than ContainerProperties to have it work properly.
{
  "ContainerOverrides": {
    "Environment": [
      {
        "Name": "MY_ENV_VAR",
        "Value": "NorthAmerica"
      }
    ]
  },
  "Parameters": {
    "one": "1",
    "two": "2",
    "three": "3"
  }
}
The preceding solution DIDN'T work for me. The correct answer can be found here:
https://aws.amazon.com/premiumsupport/knowledge-center/batch-parameters-trigger-cloudwatch/
I was only able to pass parameters to the job like so:
{
  "Parameters": {
    "customers": "tgc,localhost"
  }
}
I wasn't able to get environment variables to work and didn't try ContainerOverrides.
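For completeness, here is roughly how the same constant input is wired up when the rule is defined in CloudFormation rather than through the console. This is a sketch only; the resource name, ARNs, job name and schedule are placeholders, and whether the ContainerOverrides in the Input are honored is exactly the point on which the two answers above disagree.
{
  "BatchTriggerRule": {
    "Type": "AWS::Events::Rule",
    "Properties": {
      "ScheduleExpression": "rate(1 day)",
      "State": "ENABLED",
      "Targets": [
        {
          "Id": "my-batch-job-target",
          "Arn": "JOB_QUEUE_ARN",
          "RoleArn": "EVENTS_INVOKE_BATCH_ROLE_ARN",
          "BatchParameters": {
            "JobDefinition": "JOB_DEFINITION_ARN",
            "JobName": "my-job"
          },
          "Input": "{\"ContainerOverrides\": {\"Environment\": [{\"Name\": \"MY_ENV_VAR\", \"Value\": \"MyVal\"}]}, \"Parameters\": {\"one\": \"1\", \"two\": \"2\", \"three\": \"3\"}}"
        }
      ]
    }
  }
}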

AWS Step cannot correctly invoke AWS Batch job with complex parameters

I have an existing AWS Step Functions orchestration that executes an AWS Batch job via Lambdas. However, AWS has recently added the ability to directly invoke other services, like AWS Batch, from a state. I am keen to use this new functionality but cannot get it working.
https://docs.aws.amazon.com/step-functions/latest/dg/connectors-batch.html
Here is the new task state that I want to use to invoke Batch:
"File Copy": {
"Type": "Task",
"Resource": "arn:aws:states:::batch:submitJob.sync",
"Parameters": {
"JobName": "MyBatchJob",
"JobQueue": "MySecondaryQueue",
"ContainerOverrides.$": "$.lts_job_container_overrides",
"JobDefinition.$": "$.lts_job_job_definition",
},
"Next": "Upload Start"
}
Note that I am using the .$ / JSONPath syntax so that these parameters can be passed dynamically through the steps.
When given the following input:
"lts_job_container_overrides": {
"environment": [
{
"name": "MY_ENV_VARIABLE",
"value": "XYZ"
},
],
"command": [
"/app/file_copy.py"
]
},
"lts_job_job_definition": "MyBatchJobDefinition"
I expected the environment and command values to be passed through to the corresponding parameter (ContainerOverrides) in AWS Batch. Instead, Step Functions appears to promote them to top-level parameters and then complains that they are not valid.
{
"error": "States.Runtime",
"cause": "An error occurred while executing the state 'File Copy'
(entered at the event id #29). The Parameters
'{\"ContainerOverrides\":{\"environment\":
[{\"name\":\"MY_ENV_VARIALBE\",\"value\":\"XYZ\"}],\"command\":
[\"/app/file_copy.py\"]},\"JobDefinition\":\"MyBatchJobDefinition\"}'
could not be used to start the Task: [The field 'environment' is not
supported by Step Functions, The field 'command' is not supported by
Step Functions]"
}
How can I stop Step Functions from attempting to interpret the values I am trying to pass through to AWS Batch?
I have tried taking JSONPath out of the mix and just specifying the ContainerOverrides statically (even though long term that won't be a solution). But even then I encounter issues.
"ContainerOverrides": {
"environment": [
{
"name": "RUN_ID",
"value": "xyz"
}
],
"command": "/app/file_copy.py"
}
In this case, Step Functions itself rejects the definition on load.
Invalid State Machine Definition: 'SCHEMA_VALIDATION_FAILED: The field
'environment' is not supported by Step Functions at /States/File
Copy/Parameters, SCHEMA_VALIDATION_FAILED: The field 'command' is not
supported by Step Functions at /States/File Copy/Parameters'
So it appears that ContainerOverrides is problematic, full stop? Have I misunderstood how it is intended to be used in this scenario?
The above issue has been resolved (as per the answer below) in the AWS Batch documentation; the following note has been added by AWS:
Note
Parameters in Step Functions are expressed in PascalCase, even when the native service API is camelCase.
This should work; I've tested it and it seems to be working fine for me. Environment (and its object keys) and Command should start with a capital letter.
{
  "StartAt": "AWS Batch: Manage a job",
  "States": {
    "AWS Batch: Manage a job": {
      "Type": "Task",
      "Resource": "arn:aws:states:::batch:submitJob.sync",
      "Parameters": {
        "JobName": "test",
        "JobDefinition": "jobdef",
        "JobQueue": "testq",
        "ContainerOverrides": {
          "Command": ["/app/file_copy.py"],
          "Environment": [
            {
              "Name": "MY_ENV_VARIABLE",
              "Value": "XYZ"
            }
          ]
        }
      },
      "End": true
    }
  }
}

EMR cluster created with CloudFormation not shown

I have added an EMR cluster to a stack. After updating the stack successfully (CloudFormation), I can see the master and slave nodes in the EC2 console and I can SSH into the master node. But the AWS console does not show the new cluster. Even aws emr list-clusters doesn't show it. I have triple-checked the region and I am certain I'm looking at the right one.
Relevant CloudFormation JSON:
"Spark01EmrCluster": {
"Type": "AWS::EMR::Cluster",
"Properties": {
"Name": "Spark01EmrCluster",
"Applications": [
{
"Name": "Spark"
},
{
"Name": "Ganglia"
},
{
"Name": "Zeppelin"
}
],
"Instances": {
"Ec2KeyName": {"Ref": "KeyName"},
"Ec2SubnetId": {"Ref": "PublicSubnetId"},
"MasterInstanceGroup": {
"InstanceCount": 1,
"InstanceType": "m4.large",
"Name": "Master"
},
"CoreInstanceGroup": {
"InstanceCount": 1,
"InstanceType": "m4.large",
"Name": "Core"
}
},
"Configurations": [
{
"Classification": "spark-env",
"Configurations": [
{
"Classification": "export",
"ConfigurationProperties": {
"PYSPARK_PYTHON": "/usr/bin/python3"
}
}
]
}
],
"BootstrapActions": [
{
"Name": "InstallPipPackages",
"ScriptBootstrapAction": {
"Path": "[S3 PATH]"
}
}
],
"JobFlowRole": {"Ref": "Spark01InstanceProfile"},
"ServiceRole": "MyStackEmrDefaultRole",
"ReleaseLabel": "emr-5.13.0"
}
}
The reason is the missing VisibleToAllUsers property, which defaults to false. Since I'm using AWS Vault (i.e. authenticating via the STS AssumeRole API), I'm effectively a different user every time, so I couldn't see the cluster. I couldn't update the stack to add VisibleToAllUsers either, as I was getting Job flow ID does not exist.
The solution was to log in as the root user and fix things from there (I had to delete the cluster manually, but removing it from the stack template JSON and updating the stack would probably have worked if I hadn't already messed things up).
I then added the cluster back to the template (with VisibleToAllUsers set to true) and updated the stack as usual (via AWS Vault).
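In template terms the fix is just that one extra property on the cluster resource. An abbreviated sketch of the resource from the question with the property added (the Applications, Instances, Configurations and BootstrapActions sections are unchanged and omitted here):
"Spark01EmrCluster": {
  "Type": "AWS::EMR::Cluster",
  "Properties": {
    "Name": "Spark01EmrCluster",
    "VisibleToAllUsers": true,
    "JobFlowRole": {"Ref": "Spark01InstanceProfile"},
    "ServiceRole": "MyStackEmrDefaultRole",
    "ReleaseLabel": "emr-5.13.0"
  }
}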