One of my ECS fargate tasks is stopping and restarting in what seems to be a somewhat random fashion. I started the task in Dec 2019 and it has stopped/restarted three times since then. I've found that the task stopped and restarted from its 'Events' log (image below) but there's no info provided as to why it stopped..
So what I've tried to do to date to debug this is
Checked the 'Stopped' tasks inside the cluster for info as to why it might have stopped. No luck here as it appears 'Stopped' tasks are only held there for a short period of time.
Checked CloudWatch logs for any log messages that could be pertinent to this issue, nothing found
Checked CloudTrail event logs for any event pertinent to this issue, nothing found
Confirmed the memory and CPU utilisation is sufficient for the task, in fact the task never reaches 30% of it's limits
Read multiple AWS threads about similar issues where solutions mainly seem to be connected to using an ELB which I'm not..
Any have any further debugging device or ideas what might be going on here?
I ran into the same issue and found this from aws
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-maintenance.html
When AWS determines that a security or infrastructure update is needed
for an Amazon ECS task hosted on AWS Fargate, the tasks need to be
stopped and new tasks launched to replace them.
Also a github post on storing stopped tasks info in cloudwatch logs:
https://github.com/aws/amazon-ecs-agent/issues/368
Related
I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering them from the target group and draining all http connections to the old tasks before eventually stopping them. However, lately ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This is resulting in 8-12 seconds of API down time for us while new http requests continue to be routed to the now-stopped tasks that are still in the target group. This happens now whether we trigger the service update via the CLI or the console - same behaviour. Shown here are a screenshot showing a sample sequence of Events from ECS demonstrating the issue as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments, but ultimately this turned out to be because of the high startup overhead on the new versions of the tasks spinning up briefly pegging the CPU of the instances in the cluster to 100% when there was also sufficient taffic happening during the deployment, thus causing some requests to get dropped.
A good-enough for now solution was to adjust our minimum healthy deployment percentage up to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces the deployments to "slow down", only launching 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads out the high task startup overhead to two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
I have ECS container running some tasks. The server running inside the task may take 1~10 minutes to complete one request.
I am using SQS for task queuing. When certain amount tasks exceeds it scale-up the ECS tasks. And it scale-down when task in queue go below certain numbers.
However, as there is no LifeCycleHook feature for ECS task, during the time of scale-down the ECS tasks are shut down while the processing is still running. And it's not possible to delay the task termination due to the lack of LifeCycleHook.
According to our specification, we can't use the timeout feature, as we don't know earlier how much time it will take to finish the job.
Please suggest how to solve the problem.
There is no general solution to this problem, especially if you don't want to use timeout. In fact there is long lasting, still open, github issue dedicated to this:
[ECS] [request]: Control which containers are terminated on scale in
You could somehow control this through running your services on EC2 (EC2 scale-in protection), not Fargate. So either you have to re-architect your solution, or manually scale-out and in your service.
I have a Spring Boot web application running on AWS ECS service on Fargate with a desired count of 1. It's configured with a LB in front for SSL termination and healthchecks.
Each night via #scheduled I run a batch job that does some recalculations. At various points either during or shortly after that job runs my task is killed and a new one is spun up. During the task running I notice a few things:
CPU on the service (via cloud watch) spikes to above 60%
My health checks from the load balancer still respond in a good amount of time
There are no errors in my spring boot logs
In the ECS service events I see service sname-app-lb deregistered 1 targets in target-group ecs-sname-app-lb
I'm trying to figure out how to tell exactly why the task is being killed. Any tips on how to debug / fix this would be greatly appreciated.
So, i have have similar experience in the past. This is what you need to do:
1. Make sure you are streaming the application logs to the cloudwatch using the awslogs driver in the task definition (if you are not doing it already).
2. Put a delay in the app as a catch/handler wherever it can fail. This delay will make sure that the application logs are sent to cw logs the event of an exception, and thus prevent an abrupt exit of the task.
I initially thought as a fargate issue, but the above really help understand the underlying issue. All the best.
If you are running your Spring application inside Docker in AWS Fargate, if it hits the memory limit, your application could get killed.
More information: https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
My biggest frustration with ECS is that its not observable.
I deploy my service, my tasks go into "pending" and I cross my fingers.
Sometimes I get useful error messages in the console, sometimes they just hang out in "pending" indefinitely. I see no events being generated and have no idea what it's trying to do, or where it is stuck.
I can restart the ECS service or other hacks I've had to do before, but at this point I'd like to see what's actually happening when a task is in "pending". Are there logs anywhere for this?
It won't tell you until it stops.
Click "Stopped" next to "Running" and you should find all the previously pending tasks that already failed. It should show a reason for these failures.
Click on that task which is in pending state. It will show you the
single log status of the container and If you have attached the
cloud watch logs It will show you the full logs about why it is
pending?.
Check the docker container logs by SSHing to the ec2 container instance of the task
you are running.
Just in the past hour, our AWS CodeDeploy deployments have started hanging before even having a Start Time when looking at the Deployment Details page. The Status stays at In Progress indefinitely. We have not changed any of our deployment lifecycle details, so that leads me to believe that this is either some kind of CodeDeploy outage, or some kind of fluke that I'm not sure how to reset (Stopping the deployment and starting another ends up in the same place).
Has anyone else experienced this problem? Any ideas how to fix it?
Check the host agent on your instances. It's possible that it stopped running.
it looks like there is currently degraded performance on the Virginia region EC2 API, I'm also having issues with CodeDeploy not working and I assumed it may be from the increase in error rates on DescribeInstances in that region. AWS status page
I had this happen and I kept restarting the pipeline to no avail. I finally pushed a minor change and it magically started working again.