AWS ECS - Task stuck running an inactive task definition

AWS ECS - Task stuck running an inactive task definition - amazon-web-services

Very often when we update a task we got the old version of the task still running marked as in an inactive state. The only way to kill the old version is by stopping the container manually. According to the AWS team, this is happening because we still have connections attached to this old task. But how can we handle this behavior on services that have constant connections? I guess any kind of thread lock would cause similar behavior.
Any suggestions?

AWS waits for your target group's "deregistration delay" before hard-killing any open connections. Default is 300 seconds. Maybe try lowering the value for deregistration delay and see if the "constant" connections you're referring to are forcefully killed more quickly.

Related

Occasional failure on Amazon ECS with different error messages when starting task

We have a service running that orchestrates starting Fargate ECS tasks on messages from a RabbitMQ-queue. Sometimes tasks weirdly fail to start.
Info:
It starts a task somewhere between every other minute and every ten minutes.
It uses a set amount of task definitions. It re-uses the task definitions.
It consistently uses the same subnet in the same VPC.
The problem:
The vast majority of tasks starts fine. Say 98%. Sometimes tasks fail to start, and I get error messages. The error messages are not always the same, but they seem to be network-related.
Error messages I have gotten the last 36 hours:
'Timeout waiting for network interface provisioning to complete.'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message'
'CannotPullContainerError: ref pull has been retried 5 time(s): failed to resolve reference <image that exists in repository>: failed to do request: Head https:<account-id>.dkr.ecr.eu-west-1.amazonaws.com/v2/k1-d...'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: context deadline exceeded'
Thoughts:
It looks to me like there is a network-connectivity error of some sort.
The result of my Googling tells me that at least some of the errors can arise from having wrongly configured VPC or route-tables.
This is not the case here, I assume, since starting the exact same task with the exact same task definition in the same subnet works fine most of the time.
The ENI problem could maybe arise from me running out of ENI:s (?) on an EC2-instance, but since these tasks are started through Fargate I feel like that should not be the problem.
It seems like at least the network provisioning error can sometimes be an AWS issue.
Questions:
Why is this happening? Is it me or AWS?
Depending on the answer to the first question, is there something I can do to avoid this?
If there is nothing I can do, is there something I can do to mitigate it while it's happening? Should I simply just retry starting the task and hope that solves it?
Thanks very much in advance, I have been chasing this problem for months and feel like I am at least closing in on it, but this is as far as I can get on my own, I fear.

It is possible that tasks may fail to start due to a certain amount of reasons. Some of them are transient and are more "AWS" some others are more structural of your configuration and are more "you". For example the network time out is often due to a network misconfiguration where the task ENI does not have a proper route to the registry (e.g. Docker Hub). In all other cases it is possible that it's a transient one-off issue of the Fargate internals.
These problems may be transparent to you OR you may need to take action depending on how you use Fargate. For example, if you use Fargate tasks as part of an ECS service or an EKS deployment, the ECS/EKS routines will make sure they retry to instantiate the task to meet the service/deployment target configuration.
If you are launching the Fargate task using a one-off RunTask API call (i.e. not part of an orchestrator control loop that can monitor its failure) then it depends how you are calling that API. If you are calling it from tools such as AWS Step Functions, AWS Batch and possibly others, they all have retry mechanisms so if a task fails to launch they are smart enough to re-launch it.
However, if you are launching the task from an imperative line of code (or CLI command etc) then it's on your code to make sure the task has been launched properly and that you don't need to re-launch it upon an error message.

No CloudWatch logs for ECS task with reason "Essential container in task exited"

A task is running for a few seconds before terminating, I don't know why, and it's not pushing any logs.
I'm using the "awslogs" driver and the log group exists in CloudWatch.
The "Logs" tab is empty. The log-stream is created in CW but it's devoid of actual log events. There are also no results under Insights for that stream.
The task role has the permissions mentioned at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_cloudwatch_logs.html .
Any idea what the deal is with the logs?

The command wasn't valid nor was it comma-separated. It was terminating too early in the workflow to log anything, but yet after any other deployment issue would be identified. So, it was looking like it was successful but in reality wasn't yet even running. Interestingly, it would still take around a minute to terminate, so maybe this includes the overhead of pulling the image.

Timestamps indicate that task started and exited after some seconds. awslogs will send logs if container has been successfully started, so, in this case it may not be helping. You can follow step 6 of documentation to diagnose. Specifically, if you have a container that has stopped, expand the container and inspect the Status reason row to see what caused the task state to change. In most cases, that will lead you to actual cause

When I get 'services has reached steady state', in Amazon ECS does it means some tasks had stopped?

Does this means that my service tasks are stopping or it's ok to get these log messages?

actually opposite this. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments.

No it doesn't mean that any of your tasks had stopped. If a task stops you will see an event that clearly states so and will include a link to the specific task that was stopped. For example you will get something like this "service xxx has stopped 1 running tasks: task xxx."
If no tasks have been created or stopped in the last six hours the ECS console will duplicate the last event message to let you know that everything works as expected.
From the ECS docs:
"To ensure that this event view is helpful, we only show the 100 most recent events and duplicate event messages are omitted until either the cause is resolved or six hours passes. If the cause is not resolved within six hours, you will receive another service event message for that cause."
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html

Check this thread here on the aws forums. https://forums.aws.amazon.com/thread.jspa?threadID=182793
This sounds like normal behavior. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments. Are you seeing any issues?

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?

Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

AWS AutoScaling, downscale - wait for processes termination

I want to use AWS AutoScaling to scaledown a group of instances when SQS queue is short.
These instances do some heavy work that sometimes requires 5-10 minutes to complete. And I want this work to be completed before the instance termination.
I know a lot of people should have faced the same problem. Is it possible on EC2 to handle the AWS termination request and complete all my running processes before the instance is actually terminated? What is the best approach to this?

You could also use Lifecycle hooks. You would need a way to control a specific worker remotely, because AWS will select a particular instance to put in Terminating:Wait state and you need to manage that instance. You would want to take the following actions:
instruct the worker process running on the instance to not accept any more work.
wait for the worker to finish the work it already is handling
call the complete-lifecycle action.
AWS will take care of the rest for you.
ps. if you are using celery to power your workers then you can remotely ask a worker to shutdown gracefully. It won't shutdown unless it finishes with the tasks it had started executing.

Assuming you are using linux, you can create a pre-baked AMI that you use in your Launch Config attached to your Auto Scaling Group.
In the AMI you can put a script under /etc/init.d say /etc/init.d/servicesdown. This script would execute anything that you need to shutdown which would be scripts under /usr/share/services for example.
Here's kind like the gist:
servicesdown
It would always get executed when doing a graceful shutdown.
Then say on Ubuntu/Debian you would do something like this to add it to your shutdown sequence:
/usr/sbin/update-rc.d servicesdown stop 25 0 1 6 .
On CentOS/RedHat you can use the chkconfig command to add it to the right shutdown runlevel.

I stumbled onto this problem because I didn't want to terminate an instance that was doing work. Thought I'd share my findings here. There are two ways to look at this though :
I need to terminate a worker, but I only want to terminate one that's not working
I need to terminate a SPECIFIC worker and I want that specific worker to wait until it's done with the work.
If you're goal is #1, Amazon's new "Instance Protection" looks like it was designed to resolve this.
See the below link for an example, they give this code snippet as an example:
https://aws.amazon.com/blogs/aws/new-instance-protection-for-auto-scaling/
while (true)
{
SetInstanceProtection(False);
Work = GetNextWorkUnit();
SetInstanceProtection(True);
ProcessWorkUnit(Work);
SetInstanceProtection(False);
}
I haven't tested this myself, but I see API calls related to setting the protection, so it appears that this could be integrated into the EC2 Worker App code-base and then when Scaling In, instances shouldn't be terminated if they are protected (currently working).
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/autoscaling/AmazonAutoScaling.html

As far as I know currently there is no option to terminate instance while gracefully shutdown and let process to complete work.
I suggest you to look at http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-configure-healthcheck.html.
We implemented it for resque workers while we are moving instance to unhealthy state and than downsizing AS. There is a script which checking constantly health state on each instance. Once instance moved to unhealthy state it stops all services gracefully and sending terminate signal to ec2.
Hope it helps you.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js