AWS ECS tasks are being killed by OOM without leaving any trace

I have an ECS cluster where I place a container that runs as a daemon to monitor all other processes. However, I'm seeing these containers being killed by OOM from time to time without leaving a trace; I just happened to spot one of them being killed. This is causing some log duplication, but I wonder if there is a way to trace these restarts, because when I look at the ECS cluster events there is no information about these tasks being restarted at all.
I know Kubernetes better, so here is an analogy: when this happens on Kubernetes, you see a RESTARTS counter when you list all pods (kubectl get pods). Is there any way to find this information for AWS ECS tasks? I'm struggling to find it in the documentation.
I identified the tasks, and also the status of each task, to gain more information, but I'm unable to find any hint that the process was restarted or killed before.
This is a task detail example:
- attachments: []
  attributes:
  - name: ecs.cpu-architecture
    value: x86_64
  availabilityZone: us-east-2c
  clusterArn: arn:aws:ecs:us-west-2:99999999999:cluster/dev
  connectivity: CONNECTED
  connectivityAt: '2023-01-24T23:03:23.315000-05:00'
  containerInstanceArn: arn:aws:ecs:us-east-2:99999999999:container-instance/dev/eb8875fhfghghghfjyjk88c8f96433b8
  containers:
  - containerArn: arn:aws:ecs:us-east-2:99999999999:container/dev/05d4a402ee274a3ca90a86e46292a63a/e54af51f-2420-47ab-bff6-dcd4f976ad2e
    cpu: '500'
    healthStatus: HEALTHY
    image: public.ecr.aws/datadog/agent:7.36.1
    lastStatus: RUNNING
    memory: '750'
    name: datadog-agent
    networkBindings:
    - bindIP: 0.0.0.0
      containerPort: 8125
      hostPort: 8125
      protocol: udp
    - bindIP: 0.0.0.0
      containerPort: 8126
      hostPort: 8126
      protocol: tcp
    networkInterfaces: []
    runtimeId: 75559b7327258d69fe61cac2dfe58b12d292bdb7b3a720c457231ee9e3e4190a
    taskArn: arn:aws:ecs:us-east-2:99999999999:task/dev/05d4a402ee274a3ca90a86e46292a63a
  cpu: '500'
  createdAt: '2023-01-24T23:03:22.841000-05:00'
  desiredStatus: RUNNING
  enableExecuteCommand: false
  group: service:datadog-agent
  healthStatus: HEALTHY
  lastStatus: RUNNING
  launchType: EC2
  memory: '750'
  overrides:
    containerOverrides:
    - name: datadog-agent
    inferenceAcceleratorOverrides: []
  pullStartedAt: '2023-01-24T23:03:25.471000-05:00'
  pullStoppedAt: '2023-01-24T23:03:39.790000-05:00'
  startedAt: '2023-01-24T23:03:47.514000-05:00'
  startedBy: ecs-svc/1726924224402147943
  tags: []
  taskArn: arn:aws:ecs:us-west-2:99999999999:task/dev/05d4a402ee274a3ca90a86e46292a63a
  taskDefinitionArn: arn:aws:ecs:us-west-2:99999999999:task-definition/datadog-agent-task:5
  version: 2

I don't think ECS tracks or exposes a restart counter for tasks. If you want to be notified of tasks restarting, you can create an EventBridge subscription.

You can use ECS events with EventBridge and add an action, such as logging, whenever such an event happens.
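For example, a rule matching task state changes for stopped tasks could look roughly like the following sketch (the rule name, SNS topic, and cluster ARN are placeholders modeled on the dev cluster above):
# Create a rule that fires whenever a task in the dev cluster reaches STOPPED.
aws events put-rule \
  --name ecs-dev-stopped-tasks \
  --event-pattern '{
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
      "clusterArn": ["arn:aws:ecs:us-east-2:99999999999:cluster/dev"],
      "lastStatus": ["STOPPED"]
    }
  }'
# Route matching events to a target, e.g. an SNS topic, so the stoppedReason is recorded somewhere.
aws events put-targets --rule ecs-dev-stopped-tasks \
  --targets 'Id=notify,Arn=arn:aws:sns:us-east-2:99999999999:ecs-task-stopped'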

So, after a lot of debugging with the little information AWS provides for this use case, I ended up with the following process to find the answer:
List all task IDs of a given service with the AWS CLI, using the --desired-status STOPPED flag, and dump the output to a JSON file:
aws ecs list-tasks --cluster dev --service-name datadog-agent \
  --desired-status STOPPED --output json > ecs_tasks.json
Using jq and the AWS CLI, describe all previously found task IDs to get further information on each one of them:
aws ecs describe-tasks --cluster dev --tasks $(jq -j '.taskArns[] | (.|" ",.)' ./ecs_tasks.json) \
  --output yaml > ecs_tasks_describe.log
I could have come up with a script to group and summarize the information but, since I only had to look at about 20 stopped tasks, I ended up dumping the information in YAML format for convenience. I found two key properties in the output:
* For each task object, there is a stoppedReason property explaining why it was stopped. In my case it told me nothing more than that a container within the task exited (it doesn't include the exit code, which would help):
stoppedReason: Essential container in task exited
* For each task object, there is an array of container objects under the containers property. There you'll sometimes find a reason property which explains a bit more about why the container stopped:
reason: 'OutOfMemoryError: Container killed due to memory usage'
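To pull out just the OOM-killed tasks from that output, a jq filter along these lines can help (a sketch, not part of my original process, and it assumes JSON output instead of YAML):
aws ecs describe-tasks --cluster dev \
  --tasks $(jq -j '.taskArns[] | (.|" ",.)' ./ecs_tasks.json) \
  --output json |
  jq '.tasks[]
      | select(any(.containers[]; (.reason // "") | test("OutOfMemory")))
      | {taskArn, stoppedAt, stoppedReason, containers: [.containers[] | {name, exitCode, reason}]}'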
Note: this gives you stopped tasks for a given service for at least the last hour. In my case it gave me about 8 hours' worth, but AWS documentation only promises 1 hour: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/stopped-task-errors.html
Stopped tasks only appear in the Amazon ECS console, AWS CLI, and AWS SDKs for at least 1 hour after the task stops. After that, the details of the stopped task expire and aren't available in Amazon ECS.

Related

ECS Fargate forever with "rolloutState" in "IN_PROGRESS" with no stopped task on AWS Console

I'm using ECS with Fargate. I have a service running and it is working OK. But after I update the task definition (new deploy), the console (ECS -> Clusters -> Tasks tab) shows that my current task is INACTIVE, which is normal, but it doesn't show any new task being created, nor any stopped task, even after an hour. It is as if ECS is not trying to run my new definition.
If I use the awscli to find information about my service:
aws ecs describe-services --cluster cluster-xxxxxxx --services service-svc-xxxxxxx --region us-east-1
It has two deployments. The first one is fine; it is the running deployment. The most recent one shows:
"desiredCount": 1,
"pendingCount": 0,
"runningCount": 0,
"failedTasks": 7,
...
"rolloutState": "IN_PROGRESS",
"rolloutStateReason": "ECS deployment ecs-svc/XXXXXXXXXXXXXXXXXX in progress."
Again, there is nothing on the ECS console that points to failed tasks. It is as if the task is failing at such an early stage that it's not even logging anything.
I tried looking at CloudTrail events, but there is nothing there about failed tasks. On CloudWatch, the logs for Container Insights (/aws/ecs/containerinsights/cluster-xxxxxxx/performance) also don't mention failing tasks.
How can I troubleshoot this situation?
It turns out that if the account/organization is new, AWS applies quotas that seem not to be documented anywhere. ECS was not authorized to start more than two tasks.
There is a trick I found in this post: I had to create an EC2 instance in the same region where I was trying to use ECS. Shortly after I created the EC2 instance, I received an email from AWS and everything behaved normally again. https://forums.aws.amazon.com/thread.jspa?threadID=340770
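In a similar situation it is also worth dumping the service's own event log before digging further; something like the following sketch (reusing the placeholder names from the describe-services command above) prints the most recent messages:
aws ecs describe-services --cluster cluster-xxxxxxx --services service-svc-xxxxxxx \
  --region us-east-1 --query 'services[0].events[:10].[createdAt,message]' --output text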

Running AWS ECS Task Attached (Not Detached)

Is there an easy way to run an ECS task attached, or to follow the logs only while the container is running (i.e., detach after displaying all of the associated logs)?
Using the AWS CLI (1.17.0) and ecs-cli (1.21.0), I have gotten decently close with the following two commands:
aws ecs run-task --cluster "mycluster" --task-definition testhelloworldjob --launch-type FARGATE --network-configuration etc.etc.etc.
ecs-cli logs --task-id {TASK_ID_HERE_FROM_OUTPUT_OF_PREVIOUS_COMMAND} --follow
I currently have two issues with the above approach:
There is a race condition: the logs are not available while the task is in a pre-running state. Instead of ecs-cli logs waiting for the logs to exist, an error is thrown immediately.
Even after waiting for the task to be in a running state and issuing ecs-cli logs, the command refuses to detach even AFTER the task has finished and is in a post-running status.
For the first issue I could poll until the task is past the activating/pending status before calling logs. For the second issue I could draft some kind of threaded call that polls and stops following the log once the container in question is no longer running... But there has to be an easier way?
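Something like the sketch below is what I mean by polling (it leans on the aws ecs wait helpers; the network configuration is elided exactly as above), but it feels like a workaround:
# Start the task and capture its ARN.
TASK_ARN=$(aws ecs run-task --cluster "mycluster" --task-definition testhelloworldjob \
  --launch-type FARGATE --network-configuration etc.etc.etc. \
  --query 'tasks[0].taskArn' --output text)
# Wait until the task is RUNNING so the log stream exists before following it.
aws ecs wait tasks-running --cluster "mycluster" --tasks "$TASK_ARN"
# Follow the logs in the background and stop following once the task exits.
ecs-cli logs --task-id "${TASK_ARN##*/}" --follow &
LOGS_PID=$!
aws ecs wait tasks-stopped --cluster "mycluster" --tasks "$TASK_ARN"
kill "$LOGS_PID"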
To clarify I am coming from numerous other container orchestration tools/technologies that seemingly supported this very seamlessly. Here are some examples of tools and their associated commands that would yield me my intended results:
Docker CLI:
docker run hello-world
Docker-Compose Yaml:
docker-compose up
K8 Kubectl Yaml:
kubectl apply -f ./hello-k8.yaml && kubectl logs --follow hello-world
I think ecs-cli is the best option available at the moment.
Apart from that, you can change the log driver of the AWS ECS task to syslog and then watch the log file from the terminal after SSHing into the EC2 container instance on which it is running.
Another option is to SSH into the EC2 container instance on which it was running before, run that AWS ECS task's container yourself using docker run, and, once the testing is done, stop and remove that container and start the task again via AWS ECS.
Note: You can use AWS SSM Session Manager to avoid using an EC2 key pair and adding an inbound rule for SSH.
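For the syslog suggestion, the corresponding task-definition fragment would be something like this sketch (the container name is hypothetical; note this applies to EC2-backed tasks, since Fargate only supports a few log drivers such as awslogs):
"containerDefinitions": [
  {
    "name": "hello-world",
    "logConfiguration": {
      "logDriver": "syslog"
    }
  }
]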

Bash, Conda, Docker, and Ray: What startup commands should be given to Ray to properly source the bash profile in a docker container at runtime?

I'm trying to use Ray and Docker to launch jobs programmatically on EC2. I want to use conda in my Docker container for package management. I've figured out how to build the container such that if I run
docker run -i -t my_container:my_tag /bin/bash I can launch my jobs in the container locally. The problem is that when I add Ray into the picture to launch the jobs remotely, Ray fails with errors like these:
start: ray: command not found
Cluster: my-cluster
Checking AWS environment settings
AWS config
IAM Profile: ray-head-v1
EC2 Key pair (head & workers): [redacted]
VPC Subnets (head & workers): [redacted]
EC2 Security groups (head & workers): [redacted]
EC2 AMI (head & workers): [redacted]
No head node found. Launching a new cluster. Confirm [y/N]: y [automatic, due to --yes]
Acquiring an up-to-date head node
Launched 1 nodes [subnet_id=[redacted]]
Launched instance i-067e250cc8591da86 [state=pending, info=pending]
Launched a new head node
Fetching the new head node
<1/1> Setting up head node
Prepared bootstrap config
New status: waiting-for-ssh
[1/6] Waiting for SSH to become available
Running `uptime` as a test.
Waiting for IP
Not yet available, retrying in 10 seconds
Not yet available, retrying in 10 seconds
Not yet available, retrying in 10 seconds
Received: 3.21.104.163
SSH still not available SSH command failed., retrying in 5 seconds.
SSH still not available SSH command failed., retrying in 5 seconds.
Success.
Updating cluster configuration. [hash=1e011279ffec6f94b2bff4ebf536e6966be5c79a]
New status: syncing-files
[3/6] Processing file mounts
[4/6] No worker file mounts to sync
New status: setting-up
[3/6] No initialization commands to run.
[4/6] No setup commands to run.
[6/6] Starting the Ray runtime
New status: update-failed
!!!
SSH command failed.
!!!
Failed to setup head node.
At this point I've reached the limit of what I understand about how Ray and Docker interact. I assume the problem is that head_start_ray_commands gets passed to docker run somehow. Since Docker uses the sh shell to run commands, the bash profile isn't getting sourced properly, so packages like conda and ray aren't working. That explains why there's nothing wrong with the container when I launch a bash shell in interactive mode in a local container instance. I've tried adding /bin/bash --login at the beginning of head_start_ray_commands but that only seems to cause the whole program to freeze.
What is the right way to get Ray to source the bash profile before executing commands? If that isn't possible, is there a better way to do this? For reference, here's my current ray config:
init:
  address: null
  remote: {}
cluster:
  cluster_name: my-cluster
  min_workers: 0
  max_workers: 2
  initial_workers: 0
  autoscaling_mode: default
  target_utilization_fraction: 0.8
  idle_timeout_minutes: 5
  docker:
    image: [redacted]
    container_name: 'my-container'
    pull_before_run: true
    run_options: ["--gpus 'all'"]
  provider:
    type: aws
    region: us-east-2
    availability_zone: us-east-2a,us-east-2b
    cache_stopped_nodes: false
    key_pair:
      key_name: [redacted]
  auth:
    ssh_user: ubuntu
  head_node:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  worker_nodes:
    IamInstanceProfile:
      Arn: [redacted]
    InstanceType: p2.xlarge
    ImageId: ami-08e16447bd5caf26a
  file_mounts: {}
  initialization_commands: []
  setup_commands: []
  head_setup_commands: []
  worker_setup_commands: []
  head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --object-manager-port=8076
    --autoscaling-config=~/ray_bootstrap_config.yaml
  worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
Edit
The simplest fix seems to be just avoiding conda altogether in favor of venv.
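If conda has to stay, one workaround sometimes used is to bypass the login profile entirely and invoke the environment explicitly in the start commands, roughly like this sketch (/opt/conda and my_env are hypothetical; adjust them to wherever conda lives in your image):
head_start_ray_commands:
- /opt/conda/bin/conda run -n my_env ray stop
- ulimit -n 65536; /opt/conda/bin/conda run -n my_env ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml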

Where can I specify Volumes in AWS FARGATE ECS

I have the data below with me:
minio:
  image: minio/minio:latest
  #ports:
  #  - '9000:9000'
  volumes:
    - ./data/storage:/data
  environment:
    MINIO_ACCESS_KEY: minio
    MINIO_SECRET_KEY: minio123
  command: server /data
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
    interval: 30s
    timeout: 20s
    retries: 3
  restart: always
I want to manually create a task definition in Fargate ECS and then add containers to it. [No coding.]
Where can I specify the volumes shown above for the containers?
To answer your query specific to volumes, you have to specify the volumes in the task definition that is used to run a task in AWS Fargate. You can have a look at this documentation. It also lists the limitations when it comes to storage in AWS Fargate. AWS Fargate does not support any kind of persistent storage except EFS, which was launched recently.
If your use case allows EFS, check out this blog post, which demonstrates that Amazon Elastic Container Service and AWS Fargate now support Amazon Elastic File System.
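For illustration, the relevant pieces of such a task definition might look like the sketch below (fs-12345678 is a placeholder file system ID), mirroring the ./data/storage:/data mount from the compose file; in the Fargate task definition console this corresponds to adding the volume under Volumes and the mount point in the container definition:
"volumes": [
  {
    "name": "minio-data",
    "efsVolumeConfiguration": {
      "fileSystemId": "fs-12345678",
      "rootDirectory": "/"
    }
  }
],
"containerDefinitions": [
  {
    "name": "minio",
    "image": "minio/minio:latest",
    "command": ["server", "/data"],
    "mountPoints": [
      {
        "sourceVolume": "minio-data",
        "containerPath": "/data"
      }
    ]
  }
]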

ecs-cli compose service up doesn't terminate

I want to build a script to deploy a Docker container to ECS.
This is the command I am using.
ecs-cli compose --file src/main/docker/docker-compose-export.yml -p export service up
It works about 60% of the time. The other 40% of the time the command stalls.
This is the compose file:
version: '2'
services:
  export:
    image: 1234567890lalala.dkr.ecr.eu-central-1.amazonaws.com/export:${VERSION}
    cpu_shares: 200
    mem_limit: 100000000
I have already uploaded the image to the ECR registry.
This is the log I am getting:
WARN[0000] Skipping unsupported YAML option... option name=networks
WARN[0000] Skipping unsupported YAML option for service... option name=networks service name=export
INFO[0000] Using ECS task definition TaskDefinition="ecscompose-export:3"
INFO[0000] Updated the ECS service with a new task definition. Old containers will be stopped automatically, and replaced with new ones desiredCount=1 serviceName=ecscompose-service-export taskDefinition="ecscompose-export:3"
INFO[0000] Describe ECS Service status desiredCount=1 runningCount=1 serviceName=ecscompose-service-export
INFO[0030] Describe ECS Service status desiredCount=1 runningCount=1 serviceName=ecscompose-service-export
INFO[0061] Describe ECS Service status desiredCount=1 runningCount=1 serviceName=ecscompose-service-export
INFO[0091] Describe ECS Service status desiredCount=1 runningCount=1 serviceName=ecscompose-service-export
The running count goes to 2 and then back to 1 (which is expected). But then it does not stop as it does when everything works; it keeps checking the status for a while and finally just stalls.
The service on the cluster is in a good state. The new Docker image is running and everything is fine. It's just that the command doesn't stop.
Does anyone have an idea how to fix this? Are there maybe other commands I could use to achieve the same result in a more reliable fashion?
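Not from the original thread, but one pattern worth trying (a sketch; <cluster> is a placeholder, and it assumes your ecs-cli version supports --timeout) is to let ecs-cli kick off the update without waiting and then rely on the AWS CLI's services-stable waiter, which performs a bounded number of checks, to decide when the deployment has finished:
# Start the service update but don't let ecs-cli poll for completion.
ecs-cli compose --file src/main/docker/docker-compose-export.yml -p export service up --timeout 0
# Wait until the service has reached a steady state.
aws ecs wait services-stable --cluster <cluster> --services ecscompose-service-export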