I want to use AWS AutoScaling to scaledown a group of instances when SQS queue is short.
These instances do some heavy work that sometimes requires 5-10 minutes to complete. And I want this work to be completed before the instance termination.
I know a lot of people should have faced the same problem. Is it possible on EC2 to handle the AWS termination request and complete all my running processes before the instance is actually terminated? What is the best approach to this?
You could also use Lifecycle hooks. You would need a way to control a specific worker remotely, because AWS will select a particular instance to put in Terminating:Wait state and you need to manage that instance. You would want to take the following actions:
instruct the worker process running on the instance to not accept any more work.
wait for the worker to finish the work it already is handling
call the complete-lifecycle action.
AWS will take care of the rest for you.
ps. if you are using celery to power your workers then you can remotely ask a worker to shutdown gracefully. It won't shutdown unless it finishes with the tasks it had started executing.
Assuming you are using linux, you can create a pre-baked AMI that you use in your Launch Config attached to your Auto Scaling Group.
In the AMI you can put a script under /etc/init.d say /etc/init.d/servicesdown. This script would execute anything that you need to shutdown which would be scripts under /usr/share/services for example.
Here's kind like the gist:
servicesdown
It would always get executed when doing a graceful shutdown.
Then say on Ubuntu/Debian you would do something like this to add it to your shutdown sequence:
/usr/sbin/update-rc.d servicesdown stop 25 0 1 6 .
On CentOS/RedHat you can use the chkconfig command to add it to the right shutdown runlevel.
I stumbled onto this problem because I didn't want to terminate an instance that was doing work. Thought I'd share my findings here. There are two ways to look at this though :
I need to terminate a worker, but I only want to terminate one that's not working
I need to terminate a SPECIFIC worker and I want that specific worker to wait until it's done with the work.
If you're goal is #1, Amazon's new "Instance Protection" looks like it was designed to resolve this.
See the below link for an example, they give this code snippet as an example:
https://aws.amazon.com/blogs/aws/new-instance-protection-for-auto-scaling/
while (true)
{
SetInstanceProtection(False);
Work = GetNextWorkUnit();
SetInstanceProtection(True);
ProcessWorkUnit(Work);
SetInstanceProtection(False);
}
I haven't tested this myself, but I see API calls related to setting the protection, so it appears that this could be integrated into the EC2 Worker App code-base and then when Scaling In, instances shouldn't be terminated if they are protected (currently working).
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/autoscaling/AmazonAutoScaling.html
As far as I know currently there is no option to terminate instance while gracefully shutdown and let process to complete work.
I suggest you to look at http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-configure-healthcheck.html.
We implemented it for resque workers while we are moving instance to unhealthy state and than downsizing AS. There is a script which checking constantly health state on each instance. Once instance moved to unhealthy state it stops all services gracefully and sending terminate signal to ec2.
Hope it helps you.
Related
We have a service running that orchestrates starting Fargate ECS tasks on messages from a RabbitMQ-queue. Sometimes tasks weirdly fail to start.
Info:
It starts a task somewhere between every other minute and every ten minutes.
It uses a set amount of task definitions. It re-uses the task definitions.
It consistently uses the same subnet in the same VPC.
The problem:
The vast majority of tasks starts fine. Say 98%. Sometimes tasks fail to start, and I get error messages. The error messages are not always the same, but they seem to be network-related.
Error messages I have gotten the last 36 hours:
'Timeout waiting for network interface provisioning to complete.'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message'
'CannotPullContainerError: ref pull has been retried 5 time(s): failed to resolve reference <image that exists in repository>: failed to do request: Head https:<account-id>.dkr.ecr.eu-west-1.amazonaws.com/v2/k1-d...'
'ResourceInitializationError: failed to configure ENI: failed to setup regular eni: context deadline exceeded'
Thoughts:
It looks to me like there is a network-connectivity error of some sort.
The result of my Googling tells me that at least some of the errors can arise from having wrongly configured VPC or route-tables.
This is not the case here, I assume, since starting the exact same task with the exact same task definition in the same subnet works fine most of the time.
The ENI problem could maybe arise from me running out of ENI:s (?) on an EC2-instance, but since these tasks are started through Fargate I feel like that should not be the problem.
It seems like at least the network provisioning error can sometimes be an AWS issue.
Questions:
Why is this happening? Is it me or AWS?
Depending on the answer to the first question, is there something I can do to avoid this?
If there is nothing I can do, is there something I can do to mitigate it while it's happening? Should I simply just retry starting the task and hope that solves it?
Thanks very much in advance, I have been chasing this problem for months and feel like I am at least closing in on it, but this is as far as I can get on my own, I fear.
It is possible that tasks may fail to start due to a certain amount of reasons. Some of them are transient and are more "AWS" some others are more structural of your configuration and are more "you". For example the network time out is often due to a network misconfiguration where the task ENI does not have a proper route to the registry (e.g. Docker Hub). In all other cases it is possible that it's a transient one-off issue of the Fargate internals.
These problems may be transparent to you OR you may need to take action depending on how you use Fargate. For example, if you use Fargate tasks as part of an ECS service or an EKS deployment, the ECS/EKS routines will make sure they retry to instantiate the task to meet the service/deployment target configuration.
If you are launching the Fargate task using a one-off RunTask API call (i.e. not part of an orchestrator control loop that can monitor its failure) then it depends how you are calling that API. If you are calling it from tools such as AWS Step Functions, AWS Batch and possibly others, they all have retry mechanisms so if a task fails to launch they are smart enough to re-launch it.
However, if you are launching the task from an imperative line of code (or CLI command etc) then it's on your code to make sure the task has been launched properly and that you don't need to re-launch it upon an error message.
Very often when we update a task we got the old version of the task still running marked as in an inactive state. The only way to kill the old version is by stopping the container manually. According to the AWS team, this is happening because we still have connections attached to this old task. But how can we handle this behavior on services that have constant connections? I guess any kind of thread lock would cause similar behavior.
Any suggestions?
AWS waits for your target group's "deregistration delay" before hard-killing any open connections. Default is 300 seconds. Maybe try lowering the value for deregistration delay and see if the "constant" connections you're referring to are forcefully killed more quickly.
Does this means that my service tasks are stopping or it's ok to get these log messages?
actually opposite this. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments.
No it doesn't mean that any of your tasks had stopped. If a task stops you will see an event that clearly states so and will include a link to the specific task that was stopped. For example you will get something like this "service xxx has stopped 1 running tasks: task xxx."
If no tasks have been created or stopped in the last six hours the ECS console will duplicate the last event message to let you know that everything works as expected.
From the ECS docs:
"To ensure that this event view is helpful, we only show the 100 most recent events and duplicate event messages are omitted until either the cause is resolved or six hours passes. If the cause is not resolved within six hours, you will receive another service event message for that cause."
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html
Check this thread here on the aws forums. https://forums.aws.amazon.com/thread.jspa?threadID=182793
This sounds like normal behavior. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments. Are you seeing any issues?
In our beta stack, we have a single EC2 instance listening to a tasklist. Sometimes another developer in the team start's his own instance for testing purposes and forget to turn it off. This creates problems for the next developer who tries to start an activity only for it to be taken up by the last developer's machine. Is there a way to get the hostnames of all activity workers listening to a particular tasklist ?
It is not currently possible to get a list of pollers waiting on a task list through the SWF API. The workaround is to look at the identity field on the ActivityExecutionStarted event after it was picked up by the wrong worker.
One way to avoid this issue is always use a task list name that is specific to a machine or developer to avoid collisions.
I am facing issue with aws autoscaling + sidekiq busy jobs. I have 4 instances are running in auto scale group and multiple Sidekiq processes are running on these instances. All instances using same redis.
If one of my instance is terminating at that time the jobs in his busy queue pushed into failed state. It should enqueued this jobs.
I have added one script in /etc/rc0.d folder which will kill sidekiq processes at the time of instace termination, but still my jobs are going in the failure state.
I tried with TERM and USR1 signal for sidkeiq process termination but same thing happened with this signals.
I have used sidkeiq pro also enabled the reliable fetch.
Does anyone know, how to achieve the jobs in the busy queue should go into enqueued not in faliure state while killig sidekiq process manually or gracefully ?
I ran into the same issue myself and had to fix it by fully plumbing it into the daemon management, in my case chkconfig. I'm not sure what your EC2 setup is like, but I found that only scripts that chkconfig knew about were being executed.
If you don't already, you may also need to add a sleep to the script to ensure Sidekiq has enough time to fully exit.