Stop Sidekiq gracefully when terminating an Auto Scaling EC2 instance - ruby-on-rails-4

I am facing an issue with AWS Auto Scaling and Sidekiq's busy jobs. I have 4 instances running in an Auto Scaling group, with multiple Sidekiq processes running on each instance. All instances use the same Redis.
When one of my instances is terminated, the jobs in its busy queue are pushed into the failed state. They should be re-enqueued instead.
I added a script in the /etc/rc0.d folder that kills the Sidekiq processes at instance termination time, but my jobs still end up in the failed state.
I tried terminating the Sidekiq processes with both the TERM and USR1 signals, but the same thing happened with both.
I am using Sidekiq Pro and have enabled reliable fetch.
Does anyone know how to make the jobs in the busy queue go back to the enqueued state, rather than the failed state, when killing a Sidekiq process manually or gracefully?

I ran into the same issue myself and had to fix it by fully plumbing it into the daemon management, in my case chkconfig. I'm not sure what your EC2 setup is like, but I found that only scripts that chkconfig knew about were being executed.
If you don't already, you may also need to add a sleep to the script to ensure Sidekiq has enough time to fully exit.
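For reference, here's a minimal sketch of such an init script (the PID file path, sleep duration, and chkconfig header are assumptions — adapt them to your deployment):

    #!/bin/bash
    # /etc/init.d/sidekiq_shutdown
    # chkconfig: 2345 99 01
    # description: Gracefully stops Sidekiq on shutdown so busy jobs return to Redis.

    # Assumed paths -- adjust to your deployment.
    PIDFILE=/var/run/sidekiq.pid
    LOCKFILE=/var/lock/subsys/sidekiq_shutdown

    case "$1" in
      start)
        # On RHEL-family sysvinit, the subsys lock file marks the service
        # as "running", which is what makes the stop action fire at shutdown.
        touch "$LOCKFILE"
        ;;
      stop)
        if [ -f "$PIDFILE" ]; then
          kill -TERM "$(cat "$PIDFILE")"
          # Give Sidekiq time to finish or re-enqueue busy jobs
          # before the instance is torn down.
          sleep 30
        fi
        rm -f "$LOCKFILE"
        ;;
    esac

Register it with chkconfig --add sidekiq_shutdown and start it once (service sidekiq_shutdown start) so the stop action actually runs at shutdown.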

Related

ECS: choosing idle tasks during scale-in & deployment checkpointing for batch processing

I am a beginner with ECS. I have an ECS service on top of EC2 that does asynchronous processing of jobs that are hours long.
Problem 1: During an auto scale-in event, instead of choosing an idle task, a busy task is abruptly interrupted.
Best solution I could think of for problem 1: The feature requested in [ECS] [request]: Control which containers are terminated on scale in. This would mark tasks as protected, so idle tasks rather than busy ones could be selected for termination.
Problem 2: During a deployment, while the old set of tasks is being replaced with the new set, busy tasks are abruptly terminated.
Solution 1 for problem 2: Increase the ECS_CONTAINER_STOP_TIMEOUT value to the highest job latency (some hours).
Problem with solution 1: Deployments become long and could even wait unnecessarily when tasks are idle.
Solution 2 for problem 2: Do everything in solution 1 and use the SIGTERM signal. SIGTERM is sent to notify tasks to clean up before SIGKILL terminates them. Our task catches SIGTERM, and we add custom logic to know whether the task is idle, terminating it if so. If it is not idle, we wait until it becomes idle so that no busy task is abruptly terminated (see the sketch after this question).
Problem with solution 2: We still have the issue of deployments waiting on tasks to finish, making urgent deployments impossible.
Solution 3 for problem 2: Do everything in solution 2, but instead of waiting for busy tasks to become idle, add custom logic to save the jobs' progress and resume them when the deployment ends.
The issue now is that the solution for problem 1 will make the deployment wait for busy tasks to finish instead of saving progress and resuming jobs (solution 3 for problem 2). Is it possible to do checkpointing during deployment and also choose idle tasks during auto scale-in?
Thanks
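To make solution 2 concrete, here is a rough sketch of a SIGTERM-draining task entrypoint (everything in it, including is_idle and process_next_job_chunk, is a hypothetical stand-in for your own logic):

    #!/bin/bash
    # Hypothetical ECS task entrypoint illustrating solution 2.
    DRAINING=0
    trap 'DRAINING=1' TERM    # ECS sends SIGTERM before SIGKILL on stop

    is_idle() {
      # Stand-in: return success when no job is in flight.
      [ ! -f /tmp/job.in_progress ]
    }

    process_next_job_chunk() {
      # Stand-in: do one bounded slice of batch work here.
      sleep 5
    }

    while true; do
      if [ "$DRAINING" -eq 1 ] && is_idle; then
        echo "draining and idle -- exiting cleanly before SIGKILL"
        exit 0
      fi
      process_next_job_chunk
    done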

AWS ECS - Task stuck running an inactive task definition

Very often when we update a task, the old version of the task keeps running, marked as inactive. The only way to kill the old version is to stop the container manually. According to the AWS team, this happens because we still have connections attached to the old task. But how can we handle this behavior for services that have constant connections? I guess any kind of thread lock would cause similar behavior.
Any suggestions?
AWS waits for your target group's "deregistration delay" before hard-killing any open connections. Default is 300 seconds. Maybe try lowering the value for deregistration delay and see if the "constant" connections you're referring to are forcefully killed more quickly.
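For example (a sketch; the target group ARN is a placeholder):

    # Lower the deregistration delay from the 300s default to 60s.
    aws elbv2 modify-target-group-attributes \
      --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef \
      --attributes Key=deregistration_delay.timeout_seconds,Value=60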

AWS SWF Simple Workflow - Best Way to Keep Activity Worker Scripts Running?

The maximum amount of time the pollForActivityTask method stays open polling for requests is 60 seconds. I am currently scheduling a cron job every minute to call my activity worker file so that my activity worker machine is constantly polling for jobs.
Is this the correct way to have continuous queue coverage?
The way the Java Flow SDK does it: you create an ActivityWorker and give it a task list, domain, activity implementations, and a few other settings. You set both setPollThreadCount and setTaskExecutorSize. The polling threads long-poll and then hand work over to the executor threads to avoid blocking further polling. You call start on the ActivityWorker to boot it up, and when you want to shut the workers down, you call one of the shutdown methods (usually best to call shutdownAndAwaitTermination).
Essentially your workers are long lived and need to deal with a few factors:
New versions of Activities
Various tasklists
Scaling independently on tasklist, activity implementations, workflow workers, host sizes, etc.
Handle error cases and deal with polling
Handle shutdowns (in case of deployments and new versions)
I ended up using a solution where another script file is called by a cron job every minute. This file checks whether an activity worker is already running in the background (if so, I assume a workflow execution is already being processed on the current server).
If no activity worker is there, then the previous long poll has completed and we launch the activity worker script again. If an activity worker is already present, then the previous poll found a workflow execution and started processing, so we refrain from launching another activity worker.
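A sketch of that cron-driven check (the worker path and log file are stand-ins for your actual activity worker file):

    #!/bin/bash
    # /usr/local/bin/check_worker.sh -- run from cron every minute:
    #   * * * * * /usr/local/bin/check_worker.sh
    # If an activity worker is still alive, a long poll found work and is
    # processing it, so do nothing; otherwise start a fresh worker.
    if pgrep -f /usr/local/bin/activity_worker > /dev/null; then
      exit 0
    fi
    nohup /usr/local/bin/activity_worker >> /var/log/activity_worker.log 2>&1 &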

Workflow handling on Camunda engine restart

Scenario: A few jobs are currently running. If a cluster reboot happens in the middle of job execution, I should be able to observe the process instances continuing execution with the proper state after the reboot.
Will Camunda take care of preserving the process instance state via some checkpoints, and resume automatically from where it halted?
If the execution has reached at least one asynchronous continuation (e.g. check the "async after" property, or the one on the start event), then the process instance has been persisted to the database and a job has been scheduled. Any crash would cause the in-flight transaction to not commit and to roll back. The job executor will then restart processing from the last commit point when it detects a due job.

AWS AutoScaling, downscale - wait for processes termination

I want to use AWS Auto Scaling to scale down a group of instances when the SQS queue is short.
These instances do some heavy work that sometimes takes 5-10 minutes to complete, and I want this work to be completed before the instance is terminated.
I know a lot of people must have faced the same problem. Is it possible on EC2 to handle the AWS termination request and complete all my running processes before the instance is actually terminated? What is the best approach to this?
You could also use Lifecycle hooks. You would need a way to control a specific worker remotely, because AWS will select a particular instance to put in Terminating:Wait state and you need to manage that instance. You would want to take the following actions:
instruct the worker process running on the instance to not accept any more work.
wait for the worker to finish the work it already is handling
call the complete-lifecycle action.
AWS will take care of the rest for you.
ps. if you are using Celery to power your workers, you can remotely ask a worker to shut down gracefully. It won't shut down until it finishes the tasks it had already started executing.
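For the final step, the completion call from the instance could look like this (a sketch; the hook and group names are placeholders):

    #!/bin/bash
    # After the worker has drained, let Auto Scaling finish the termination.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    aws autoscaling complete-lifecycle-action \
      --lifecycle-hook-name graceful-termination-hook \
      --auto-scaling-group-name my-worker-asg \
      --lifecycle-action-result CONTINUE \
      --instance-id "$INSTANCE_ID"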
Assuming you are using Linux, you can create a pre-baked AMI that you use in the Launch Config attached to your Auto Scaling Group.
In the AMI you can put a script under /etc/init.d, say /etc/init.d/servicesdown. This script would execute anything you need to shut down, which would be scripts under /usr/share/services for example.
Here's kind of the gist of servicesdown:
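Since the gist itself isn't reproduced here, below is a minimal sketch of what such a script might contain (an assumption, not the author's actual gist; the /usr/share/services path follows the answer's example):

    #!/bin/bash
    ### BEGIN INIT INFO
    # Provides:          servicesdown
    # Required-Stop:     $all
    # Default-Start:
    # Default-Stop:      0 1 6
    # Short-Description: Run service shutdown scripts before halt/reboot.
    ### END INIT INFO

    case "$1" in
      stop)
        # Run every executable shutdown script under /usr/share/services.
        for script in /usr/share/services/*; do
          [ -x "$script" ] && "$script"
        done
        ;;
    esac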
It would always get executed when doing a graceful shutdown.
Then say on Ubuntu/Debian you would do something like this to add it to your shutdown sequence:
/usr/sbin/update-rc.d servicesdown stop 25 0 1 6 .
On CentOS/RedHat you can use the chkconfig command to add it to the right shutdown runlevel.
I stumbled onto this problem because I didn't want to terminate an instance that was doing work. Thought I'd share my findings here. There are two ways to look at this, though:
I need to terminate a worker, but I only want to terminate one that's not working
I need to terminate a SPECIFIC worker and I want that specific worker to wait until it's done with the work.
If your goal is #1, Amazon's new "Instance Protection" looks like it was designed to resolve this.
See the link below; they give this code snippet as an example:
https://aws.amazon.com/blogs/aws/new-instance-protection-for-auto-scaling/
while (true)
{
    SetInstanceProtection(False);
    Work = GetNextWorkUnit();
    SetInstanceProtection(True);
    ProcessWorkUnit(Work);
    SetInstanceProtection(False);
}
I haven't tested this myself, but I see API calls related to setting the protection, so it appears that this could be integrated into the EC2 worker app's code base; then, when scaling in, instances that are protected (currently working) shouldn't be terminated.
http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/autoscaling/AmazonAutoScaling.html
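As a concrete (and, in keeping with the above, untested) shell rendering of that pseudocode — set-instance-protection is the real CLI call, while the ASG name and the two work functions are placeholders:

    #!/bin/bash
    # Toggle scale-in protection around each unit of work.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
    ASG=my-worker-asg    # placeholder

    set_protection() {
      aws autoscaling set-instance-protection \
        --instance-ids "$INSTANCE_ID" \
        --auto-scaling-group-name "$ASG" \
        "$1"    # --protected-from-scale-in or --no-protected-from-scale-in
    }

    get_next_work_unit() { sleep 10; echo "unit"; }    # stub: poll your queue
    process_work_unit()  { echo "processing $1"; }     # stub: real work here

    while true; do
      set_protection --no-protected-from-scale-in    # safe to terminate while idle
      work=$(get_next_work_unit)
      set_protection --protected-from-scale-in       # don't terminate mid-work
      process_work_unit "$work"
      set_protection --no-protected-from-scale-in
    done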
As far as I know, there is currently no option to gracefully shut down the instance and let the process complete its work as part of termination.
I suggest you look at http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-configure-healthcheck.html.
We implemented this for Resque workers by moving the instance to an unhealthy state and then downsizing the Auto Scaling group. There is a script that constantly checks the health state on each instance. Once an instance is moved to the unhealthy state, it stops all services gracefully and sends a terminate signal to EC2.
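For reference, the "move to unhealthy" step can be done with the real set-instance-health call (a sketch, run from the monitoring script once workers have drained):

    #!/bin/bash
    # Mark this instance unhealthy so the Auto Scaling group
    # terminates and replaces it via its health check.
    INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

    aws autoscaling set-instance-health \
      --instance-id "$INSTANCE_ID" \
      --health-status Unhealthy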
Hope it helps you.