AWS Batch Job stuck RUNNABLE when Launch template is configured

I have configured a Step Function with AWS Batch jobs. The configuration works well, but I need to customize the starting instance. For this purpose I use the Launch Template service and built a simple (empty) configuration based on the instance type used in the AWS Batch configuration. When the Compute Environment is built with the Launch Template, the Batch job is stuck in the RUNNABLE state. When I run the AWS Batch job without the Launch Template, everything works OK. Launching an instance from the template also works OK. Could anyone advise what is wrong or missing? Below are the definitions of the whole stack's elements.
Launch Template definition
Compute environment details Overview
Compute environment name senet-cluster-r5ad-2xlarge-v3-4
Compute environment ARN arn:aws:batch:eu-central-1:xxxxxxxxxxx:compute-environment/senet-cluster-r5ad-2xlarge-v3-4
ECS Cluster name arn:aws:ecs:eu-central-1:xxxxxxxxxxxx:cluster/senet-cluster-r5ad-2xlarge-v3-4_Batch_3323aafe-d7a4-3cfe-91e5-c1079ee9d02e
Type MANAGED
Status VALID
State ENABLED
Service role arn:aws:iam::xxxxxxxxxxx:role/service-role/AWSBatchServiceRole
Compute resources
Minimum vCPUs 0
Desired vCPUs 0
Maximum vCPUs 25
Instance types r5ad.2xlarge
Allocation strategy BEST_FIT
Launch template lt-023ebdcd5df6073df
Launch template version $Default
Instance role arn:aws:iam::xxxxxxxxxxx:instance-profile/ecsInstanceRole
Spot fleet role
EC2 Keypair senet-test-keys
AMI id ami-0b418580298265d5c
vpcId vpc-0917ea63
Subnets subnet-49332034, subnet-8902a7e3, subnet-9de503d1
Security groups sg-cdbbd9af, sg-047ea19daf36aa269
AWS Batch Job Definition
{
"jobDefinitionName": "senet-cluster-job-def-3",
"jobDefinitionArn": "arn:aws:batch:eu-central-1:xxxxxxxxxxxxxx:job-definition/senet-cluster-job-def-3:9",
"revision": 9,
"status": "ACTIVE",
"type": "container",
"parameters": {},
"containerProperties": {
"image": "xxxxxxxxxxx.dkr.ecr.eu-central-1.amazonaws.com/senet/batch-process:latest",
"vcpus": 4,
"memory": 60000,
"command": [],
"jobRoleArn": "arn:aws:iam::xxxxxxxxxxxxx:role/AWSS3BatchFullAccess-senet",
"volumes": [],
"environment": [
{
"name": "BATCH_FILE_S3_URL",
"value": "s3://senet-batch/senet_jobs.sh"
},
{
"name": "AWS_DEFAULT_REGION",
"value": "eu-central-1"
},
{
"name": "BATCH_FILE_TYPE",
"value": "script"
}
],
"mountPoints": [],
"ulimits": [],
"user": "root",
"resourceRequirements": [],
"linuxParameters": {
"devices": []
}
}
}

For those of you who have the same problem, here is the solution that worked for me. It took me a few days to figure out.
The default AWS AMI snapshots need at least 30 GB of storage. When you do not have a launch template, CloudFormation uses the correct storage size.
In my case, I had defined only 8 GB of storage in my launch template, and with that launch template in use the jobs were stuck in RUNNABLE.
Simply change the storage in your launch template to 30 GB or more and it should work.
Also, do not forget that IamInstanceProfile and SecurityGroupIds are required in the launch template for the job to start.
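To make the checklist above concrete, here is a minimal Python sketch that inspects a launch template data dictionary before you attach it to a compute environment. The key names (BlockDeviceMappings, Ebs.VolumeSize, IamInstanceProfile, SecurityGroupIds) follow the EC2 launch template data structure; the helper name launch_template_problems, the /dev/xvda device name and the sample values are assumptions for illustration, and the 30 GiB threshold is simply the figure reported in this answer.

```python
MIN_ROOT_VOLUME_GIB = 30  # threshold reported in the answer, not an official constant


def launch_template_problems(data, min_gib=MIN_ROOT_VOLUME_GIB):
    """Return a list of human-readable problems found in launch template data."""
    problems = []
    # The answer reports that both of these must be present in the template.
    for key in ("IamInstanceProfile", "SecurityGroupIds"):
        if not data.get(key):
            problems.append(f"missing {key}")
    # Flag any EBS volume that is smaller than the threshold.
    for mapping in data.get("BlockDeviceMappings", []):
        size = mapping.get("Ebs", {}).get("VolumeSize")
        if size is not None and size < min_gib:
            device = mapping.get("DeviceName")
            problems.append(f"{device}: {size} GiB is below {min_gib} GiB")
    return problems


# Hypothetical template resembling the failing case (8 GiB root volume).
too_small = {
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 8}},
    ],
}

# Hypothetical corrected template.
fixed = {
    "IamInstanceProfile": {"Name": "ecsInstanceRole"},
    "SecurityGroupIds": ["sg-cdbbd9af"],
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 30, "VolumeType": "gp2"}},
    ],
}

print(launch_template_problems(too_small))
print(launch_template_problems(fixed))
```

The check is purely local and does not call AWS; it only helps catch the two mistakes described here before a job ends up waiting in RUNNABLE.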

Related

How to share EFS among different ECS tasks and hosted in different instances

Currently, the tasks we have defined use bind_mount to share the EFS persistent data among containers in a single task; let's say taskA saves to /efs/cache/taskA.
We would like to know whether there is any way to share taskA's EFS data with the taskB containers in ECS, so that taskB can access data from taskA through a bind_mount in taskB.
Can we use bind_mount in ECS to achieve this, or is there an alternative? Thanks.
The taskB definition looks like:
{
  "containerDefinitions": [
    {
      "mountPoints": [
        {
          "readOnly": null,
          "containerPath": "/efs/cache/taskA",
          "sourceVolume": "efs_cache_taskA"
        },
        ...
      ],
      ...
    }
  ],
  "volumes": [
    {
      "fsxWindowsFileServerVolumeConfiguration": null,
      "efsVolumeConfiguration": null,
      "name": "efs_cache_taskA",
      "host": {
        "sourcePath": "/efs/cache/taskA"
      },
      "dockerVolumeConfiguration": null
    },
    ...
  ]
}

You no longer need to mount EFS on EC2 and then use bind mounts. ECS now supports a native integration with EFS (on both EC2 and Fargate) that lets you configure the tasks to mount the same file system (or Access Point) without having to configure the EC2 instances at all. See this blog post series for more info.
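As a rough illustration of the native integration, the Python sketch below builds task definition fragments in which taskA and taskB reference the same EFS file system through efsVolumeConfiguration instead of a host bind mount. The file system ID, the rootDirectory path, the container names and the helper functions are hypothetical; the fragments omit required fields such as the image and are not complete task definitions.

```python
import json


def efs_volume(name, file_system_id, root_directory="/"):
    """Task-level volume backed directly by EFS (no host bind mount)."""
    return {
        "name": name,
        "efsVolumeConfiguration": {
            "fileSystemId": file_system_id,
            "rootDirectory": root_directory,
        },
    }


def container(name, volume_name, container_path, read_only):
    """Container definition fragment that mounts the named volume."""
    return {
        "name": name,
        "mountPoints": [
            {
                "sourceVolume": volume_name,
                "containerPath": container_path,
                "readOnly": read_only,
            }
        ],
    }


FILE_SYSTEM_ID = "fs-12345678"  # hypothetical EFS file system ID
shared_volume = efs_volume("efs_cache_taskA", FILE_SYSTEM_ID, "/cache/taskA")

# taskA writes to the shared directory.
task_a = {
    "family": "taskA",
    "containerDefinitions": [
        container("writer", "efs_cache_taskA", "/efs/cache/taskA", False)
    ],
    "volumes": [shared_volume],
}

# taskB mounts the same EFS directory read-only.
task_b = {
    "family": "taskB",
    "containerDefinitions": [
        container("reader", "efs_cache_taskA", "/efs/cache/taskA", True)
    ],
    "volumes": [shared_volume],
}

print(json.dumps(task_b, indent=2))
```

Because both tasks point at the same fileSystemId and rootDirectory, they see the same files regardless of which instance (or Fargate host) each task lands on.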

Streaming Cloudwatch Logs to Amazon ES

I'm using Fargate to deploy my application. To log the container logs, I'm using awslogs as the log-driver. Now I want to ship my logs to Amazon ES service. While going through the docs for shipping, I encountered a note that mentions
Streaming large amounts of CloudWatch Logs data to other
destinations might result in high usage charges.
I want to understand what I will be billed for while shipping the logs to ELK. How do they define large amounts?
Will I be billed for
a) CloudWatch?
b) Log driver?
c) Lambda function? Does every log line trigger a Lambda function?
Lastly, is there any way to lower the cost further?

Personally, I would look at running Fluentd or Fluent Bit in another container alongside your application: https://docs.fluentbit.io/manual/pipeline/outputs/elasticsearch
You can then send your logs directly to ES without any CloudWatch costs.
EDIT
Here's the final solution, in case someone is looking for a cheaper option.
Run Fluentd/Fluent Bit in another container alongside your application.
Using the GitHub config, I was able to forward the logs to ES with the config below.
{
"family": "workflow",
"cpu": "256",
"memory": "512",
"containerDefinitions": [
{
"name": "log_router",
"image": "docker.io/amazon/aws-for-fluent-bit:latest",
"essential": true,
"firelensConfiguration": {
"type": "fluentbit",
"options":{
"enable-ecs-log-metadata":"true"
}
},
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-create-group": "true",
"awslogs-group": "your_log_group",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"memoryReservation": 50
},
{
"name": "ContainerName",
"image": "YourImage",
"cpu": 0,
"memoryReservation": 128,
"portMappings": [
{
"containerPort": 5005,
"protocol": "tcp"
}
],
"essential": true,
"command": [
"YOUR COMMAND"
],
"environment": [],
"logConfiguration": {
"logDriver": "awsfirelens",
"secretOptions": [],
"options": {
"Name": "es",
"Host": "YOUR_ES_DOMAIN_URL",
"Port": "443",
"tls": "On",
"Index": "INDEX_NAME",
"Type": "TYPE"
}
},
"resourceRequirements": []
}
]
}
The log_router container collects the logs and ships them to ES. For more info, refer to Custom Log Routing.
Please note that the log_router container is required in the case of Fargate, but not with ECS.
This is the cheapest solution I know of that does not involve CloudWatch, Lambda, or Kinesis.

Like every resource, AWS charges for use and for maintenance. Therefore, the charges will be for executing the Lambda function and for storing the data in CloudWatch. The reason they mention that "Streaming large amounts of CloudWatch Logs data to other destinations might result in high usage charges" is that it takes time for the Lambda function to process the logs and insert them into ES. When you stream a large number of logs, the Lambda function runs for longer.
Lambda function? Does every log line trigger a Lambda function?
Yes, when streaming from CloudWatch to ES is enabled, every log inserted into CloudWatch triggers the Lambda function.
Is there still a possibility to lower the cost more?
The only way to lower the cost (when using this implementation) is to write your own Lambda function that is triggered every X seconds or minutes and inserts the logs into ES.
As far as I can tell, the cost difference will be negligible.
More information:
Lambda code.
How this works behind the scenes.
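If you do write your own function that inserts logs in batches, the core of it is turning a list of log events into one bulk request body. The Python sketch below shows that step only, using the newline-delimited format of the Elasticsearch bulk API. The function name build_bulk_body, the index name, and the assumed event fields (id, timestamp, message) are illustrative; sending the body and scheduling the function are left out.

```python
import json


def build_bulk_body(index, log_events):
    """Build a newline-delimited bulk request body from log events.

    Each event is assumed to be a dict with "id", "timestamp" and "message".
    """
    lines = []
    for event in log_events:
        # Action line: index this document under the event's own id.
        lines.append(json.dumps({"index": {"_index": index, "_id": event["id"]}}))
        # Source line: the document itself.
        lines.append(
            json.dumps({"@timestamp": event["timestamp"], "message": event["message"]})
        )
    if not lines:
        return ""
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Hypothetical sample events.
events = [
    {"id": "1", "timestamp": 1600000000000, "message": "job started"},
    {"id": "2", "timestamp": 1600000001000, "message": "job finished"},
]

body = build_bulk_body("app-logs", events)
print(body)
```

One request per batch replaces one request per event, which is where any saving would come from.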

Mounting an elastic file system to AWS Batch Compute Environment

I'm trying to get my elastic file system (EFS) to be mounted in my docker container so it can be used with AWS batch. Here is what I did:
Create a new AMI that is optimized for Elastic Container Services (ECS). I followed this guide here to make sure it had ECS on it. I also put the mount into /etc/fstab file and verified that my EFS was being mounted (/mnt/efs) after reboot.
Tested an EC2 instance with my new AMI and verified I could pull the docker container and pass it my mount point via
docker run --volume /mnt/efs:/home/efs -it mycontainer:latest
Interactively running the docker image shows me my data inside efs
Set up a new compute environment with my new AMI that mounts EFS on boot.
Create a job definition file:
{
"jobDefinitionName": "MyJobDEF",
"jobDefinitionArn": "arn:aws:batch:us-west-2:#######:job-definition/Submit:8",
"revision": 8,
"status": "ACTIVE",
"type": "container",
"parameters": {},
"retryStrategy": {
"attempts": 1
},
"containerProperties": {
"image": "########.ecr.us-west-2.amazonaws.com/mycontainer",
"vcpus": 1,
"memory": 100,
"command": [
"ls",
"/home/efs"
],
"volumes": [
{
"host": {
"sourcePath": "/mnt/efs"
},
"name": "EFS"
}
],
"environment": [],
"mountPoints": [
{
"containerPath": "/home/efs",
"readOnly": false,
"sourceVolume": "EFS"
}
],
"ulimits": []
}
}
Run the job and view the log.
While it does not say "no file /home/efs found", it does not list anything in my EFS, which is populated. I'm interpreting this as the container mounting an empty EFS directory. What am I doing wrong? Is my AMI not mounting the EFS in the compute environment?

I covered this in a recent blog post
https://medium.com/arupcitymodelling/lab-note-002-efs-as-a-persistence-layer-for-aws-batch-fcc3d3aabe90
You need to set up a launch template for your batch instances, and you need to make sure that your subnets/security groups are configured properly.
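As a sketch of what such a launch template can carry, the Python snippet below builds user data that mounts EFS at boot. AWS Batch expects launch template user data in MIME multi-part archive format, and EC2 expects it base64-encoded. The file system ID, the mount directory, and the shell commands (which assume an Amazon Linux based AMI where amazon-efs-utils can be installed with yum) are assumptions for illustration, not taken from the blog post.

```python
import base64
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_user_data(file_system_id, mount_dir="/mnt/efs"):
    """Return MIME multi-part user data containing one EFS mount script."""
    script = "\n".join(
        [
            "#!/bin/bash",
            "yum install -y amazon-efs-utils",  # assumes an Amazon Linux AMI
            f"mkdir -p {mount_dir}",
            f"mount -t efs {file_system_id}:/ {mount_dir}",
        ]
    ) + "\n"
    archive = MIMEMultipart("mixed")
    # text/x-shellscript parts are executed by cloud-init at boot.
    archive.attach(MIMEText(script, "x-shellscript"))
    return archive.as_string()


user_data = build_user_data("fs-12345678")  # hypothetical file system ID

# Launch template UserData must be base64-encoded text.
encoded = base64.b64encode(user_data.encode("utf-8")).decode("ascii")
print(user_data)
```

The instances' security groups also have to allow NFS traffic to the EFS mount targets, which is the subnet and security group point made above.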

AWS ECS Service for Wordpress

I created a service for wordpress on AWS ECS with the following container definitions
{
"containerDefinitions": [
{
"name": "wordpress",
"links": [
"mysql"
],
"image": "wordpress",
"essential": true,
"portMappings": [
{
"containerPort": 0,
"hostPort": 80
}
],
"memory": 250,
"cpu": 10
},
{
"environment": [
{
"name": "MYSQL_ROOT_PASSWORD",
"value": "password"
}
],
"name": "mysql",
"image": "mysql",
"cpu": 10,
"memory": 250,
"essential": true
}
],
"family": "wordpress"
}
I then went to the public IP and completed the WordPress installation. I also added a few posts.
But now, when I update the service to use an updated task definition (updated mysql container image)
"image": "mysql:latest"
I lose all the posts and data I created, and WordPress prompts me to install again.
What am I doing wrong?
I also tried to use host volumes, but to no avail: it creates a bind mount and a Docker-managed volume (I ran docker inspect on the container).
So, every time I update the task, it resets WordPress.

If your container needs access to the original data each time it
starts, you require a file system that your containers can connect to
regardless of which instance they’re running on. That’s where EFS
comes in.
EFS allows you to persist data onto a durable shared file system that
all of the ECS container instances in the ECS cluster can use.
Step-by-step Instructions to Setup an AWS ECS Cluster
Using Data Volumes in Tasks
Using Amazon EFS to Persist Data from Amazon ECS Containers

Where are the volumes located when using ECS and Fargate?

I have the following setup (I've stripped out the non-important fields):
{
"ECSTask": {
"Type": "AWS::ECS::TaskDefinition",
"Properties": {
"ContainerDefinitions": [
{
"Name": "mysql",
"Image": "mysql",
"MountPoints": [{"SourceVolume": "mysql", "ContainerPath": "/var/lib/mysql"}]
}
],
"RequiresCompatibilities": ["FARGATE"],
"Volumes": [{"Name": "mysql"}]
}
}
}
It seems to work (the container does start properly), but I'm not quite sure where exactly is this volume being saved. I assumed it would be an EBS volume, but I don't see it there. I guess it's internal to my task - but in that case - how do I access it? How can I control its limits (min/max size etc)? How can I create a backup for this volume?
Thanks.

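Fargate does not support persistent volumes defined this way. Any volumes created and attached to Fargate tasks like this are ephemeral and cannot be initialized from an external source or backed up, sadly. Note that Fargate platform version 1.4.0 and later can mount Amazon EFS file systems, which do persist beyond the task.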
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_data_volumes.html
