Capistrano - mark instances as pending during deployment to an ALB target group

I am deploying a Rails application to an autoscaled environment using custom tasks in my deploy files (basically, I use the Ruby AWS SDK to select instances whose tags match my production environment and deploy to those instances).
Those instances are registered in target groups, and an Application Load Balancer (ELBv2) distributes my app's traffic to those TGs.
During my Capistrano deployments, the deploy:restart task restarts the server (I am using Phusion Passenger) so that it picks up the new application. Since restarting can take quite long (up to 1 min), I have added a custom restart wait of 60 seconds to ensure my servers are restarted one by one, so that my service stays continuously available.
However, the one thing that is missing, and that makes the above delay useless, is that during this time my ALB keeps sending requests to those instances, because they are not marked as "unhealthy" or "pending" in my target groups.
I have seen some libraries like https://github.com/thattommyhall/capistrano-elb, but unfortunately they are quite outdated and not made to work with ALBs and TGs.
One last piece of info: my Capistrano deploy task actually deploys to several machines matching different roles:
API servers (front-facing, behind the ALB + TGs as described above)
Workers and schedulers (these are not behind any ALB, so no special precautions must be taken)
So my (sub-)questions are:
Is it possible to manually flag an instance behind a TG as "pending"? If not, would a deregister followed by an immediate register achieve the same thing?
How can I, from a Capistrano task, do the above for the instances in the :api role, assuming the instances are all in the AWS cloud, with an IAM role, under one target group? (It would actually be useful to get some tricks for supporting several TGs for the same instance.)

I'm currently setting up autoscaling with lifecycle hooks and will probably get to this later, but otherwise a possible solution (that I haven't validated yet) would be to (see the sketch after this list):
1. Deregister the target (CLI).
2. (Optionally) wait for the target to be deregistered (CLI).
3. Resume the passenger:restart command.
4. (Optionally) register the target again (CLI).
5. Wait for the target to be in service (CLI).
6. Proceed with the same hooks on the next instance.
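For what it's worth, there is no API to manually flag a registered target as "pending"; deregistering the target (which moves it into the draining state) is the standard way to take it out of rotation. Below is a minimal, unvalidated sketch of the steps above as a Capistrano task, using the aws-sdk-elasticloadbalancingv2 gem with instance-profile credentials. The :target_group_arns setting, the instance_id host property, and the passenger-config invocation are assumptions to adapt to your setup; iterating over a list of ARNs also covers the several-TGs-per-instance case.

    # config/deploy.rb - hypothetical sketch, not validated.
    # Each API server is declared with its EC2 instance id as a host property, e.g.:
    #   server 'api1.example.com', roles: %w[api], instance_id: 'i-0abc1234567890def'
    require 'aws-sdk-elasticloadbalancingv2'

    namespace :deploy do
      desc 'Drain each :api instance from its TG(s), restart Passenger, re-register'
      task :rolling_restart do
        elb = Aws::ElasticLoadBalancingV2::Client.new(region: fetch(:aws_region))
        tg_arns = Array(fetch(:target_group_arns)) # one or several TGs per instance

        on roles(:api), in: :sequence do |host|
          target = [{ id: host.properties.instance_id }]

          # Take the instance out of rotation and wait for connection draining
          tg_arns.each { |arn| elb.deregister_targets(target_group_arn: arn, targets: target) }
          tg_arns.each { |arn| elb.wait_until(:target_deregistered, target_group_arn: arn, targets: target) }

          # Restart Passenger while the ALB is not routing traffic here
          # (use however you normally restart; shown via passenger-config)
          execute :'passenger-config', 'restart-app', current_path

          # Put the instance back and wait until health checks mark it healthy
          tg_arns.each { |arn| elb.register_targets(target_group_arn: arn, targets: target) }
          tg_arns.each { |arn| elb.wait_until(:target_in_service, target_group_arn: arn, targets: target) }
        end
      end
    end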
The speed of execution would depend on the server restart time and health-check responsiveness. A better solution, if there are enough servers in production, might be to skip the wait times and ensure you always have a "window" of x servers online.
I.e., suppose you have 5 servers that take 30 sec to restart; you could deregister-restart-register a server every 15 sec to make sure there are always at least 2 servers up at any time (assuming the health checks are frequent enough to mark an instance as healthy within 15 sec).
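If you take the windowed route, the batching built into SSHKit (which Capistrano uses under the hood) is worth knowing about; a hypothetical variant of the loop from the sketch above, with illustrative limit and wait values:

    # Hypothetical variant: process servers one per batch, pausing 15 seconds
    # between batches. Note that SSHKit's wait applies between batches (after
    # the previous batch completes), so at most `limit` servers are ever out
    # of rotation at once - a more conservative window than overlapping restarts.
    on roles(:api), in: :groups, limit: 1, wait: 15 do |host|
      # deregister -> restart -> register as in the sketch above,
      # skipping the optional wait steps
    end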

Related

AWS ECS does not drain connections or remove tasks from Target Group before stopping them

I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering the old tasks from the target group and draining all HTTP connections to them before eventually stopping them. However, lately ECS is going straight to stopping the old tasks before draining connections or removing them from the target group. This results in 8-12 seconds of API downtime for us while new HTTP requests continue to be routed to the now-stopped tasks that are still in the target group. This happens whether we trigger the service update via the CLI or the console - same behaviour. Shown here is a screenshot of a sample sequence of Events from ECS demonstrating the issue, as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments. Ultimately this turned out to be caused by the high startup overhead of the new task versions: spinning them up briefly pegged the CPU of the instances in the cluster at 100% when there was also sufficient traffic during the deployment, causing some requests to get dropped.
A good-enough-for-now solution was to adjust our minimum healthy deployment percentage up to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces deployments to "slow down", only launching 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads the high task startup overhead out into two smaller CPU spikes rather than one large one, and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
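For anyone who would rather make that change from a script than the console, a hedged sketch with the aws-sdk-ecs gem (the cluster and service names are placeholders):

    require 'aws-sdk-ecs'

    # Slow the deployment down: never go below 100% healthy tasks, and launch
    # at most 50% extra new tasks at a time (150% maximum).
    ecs = Aws::ECS::Client.new
    ecs.update_service(
      cluster: 'my-cluster',   # placeholder
      service: 'my-service',   # placeholder
      deployment_configuration: {
        minimum_healthy_percent: 100,
        maximum_percent: 150
      }
    )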

AWS ECS Fargate: Application Load Balancer Returns 503 with a Weird Pattern

(NOTE: I've modified my original post since I've now collected a little more data.)
Something just started happening today after several weeks of no issues and I can't think of anything I changed that would've caused this.
I have a Spring Boot app living behind an NGINX proxy (all Dockerized), all of which is in an AWS ECS Fargate cluster.
After deployment, I'm noticing that, randomly (as in, sometimes this doesn't happen), a call to the services being served up by Spring Boot will 503 (behind the NGINX proxy). It seems to do this on every second deployment, with a subsequent deployment fixing the matter, i.e. calls to the server will succeed for a while (maybe a few seconds; maybe a few minutes), but then stop.
I looked at "HealthyHostCount" and noticed that when I get the 503s, my main target group says it has no registered/healthy hosts in either AZ. I'm not sure what would cause the TG to deregister a target, especially since a subsequent deployment seems to "fix" the issue.
Any insight/help would be greatly appreciated.
Thanks in advance.
UPDATE
It seems to happen right after I "terminate original task set" in the Blue/Green deployment from CodeDeploy. I'm wondering if it's an AZ issue, i.e. whether I haven't specified enough tasks to run in both.
I think it likely that your services are failing the health check on the target group.
When the target group determines that an existing target is unhealthy, it will deregister it, which causes ECS to launch a replacement task. In the meantime you'll typically see gateway errors like this.
On the ECS page, click your service, then click the Events tab... if I am correct, you'd see messages there about why the task was stopped, like 'unhealthy' or 'health check timeout'.
The causes can be myriad: an overloaded cluster (probably not the case with Fargate), not enough resources, a runaway request that eats all the memory or CPU.
Your case smells like a slow-startup issue. ECS web services have a 'Health Check Grace Period', that is, a time to wait before health checks start. If your container takes too long to start when first launched, the first couple of health checks may fail and cause the tasks to 'cycle'. A new deployment may be slower if your images are particularly large, because the ECS host has to download the image. Fargate, in general, is a bit slower than EC2 hosts because of overhead in how it sets up networking. If slow startup is your problem, you can try increasing this grace period, but you should also investigate how to speed up your container startup time (decrease image size, other NGINX tweaks to speed initialization, etc.).
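If the grace period turns out to be the culprit, it can be raised without recreating the service; a hedged sketch using the aws-sdk-ecs gem (the names are placeholders):

    require 'aws-sdk-ecs'

    # Give slow-starting containers five minutes before load balancer health
    # checks count against them, then roll the tasks to pick up the change.
    ecs = Aws::ECS::Client.new
    ecs.update_service(
      cluster: 'my-cluster',                  # placeholder
      service: 'my-service',                  # placeholder
      health_check_grace_period_seconds: 300,
      force_new_deployment: true
    )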
It turns out it was an issue with how I'd configured one of my target groups and/or the ALB, although I couldn't determine exactly what the issue was, since I re-created the TGs and the service from scratch.
If I had to guess, I'd guess my TG wasn't configured with the right port, or the ALB had a bad rule/listener.
If anyone else has this issue after creating a new environment for an existing app, make sure you configure your security groups. My environment was timing out (and had bad health) because my web servers were not covered by the database's inbound security-group rule on AWS.
You can tell whether this is your issue by checking whether URLs in your app that don't touch those web services/databases still respond.

CloudFormation, CodeDeploy, ELB & Auto-Scaling Group

I am trying to build a stack with an ELB, an Auto-Scaling Group and a Pipeline (with CodeBuild and CodeDeploy).
I can't understand how it is supposed to work:
the Auto-Scaling group starts two instances and waits X minutes before starting to check the instances' state
the CodeDeploy application deployment group waits for the Auto-Scaling group to be created and ready
the pipeline takes about 10 minutes to start deploying the application
My issue is that when I create the stack, there seems to be a circular dependency: the AG requires an application from CodeDeploy, and CodeDeploy requires a stabilized AG. To be clear, by the time the application is ready to deploy, my Auto-Scaling group is already terminating instances and starting new ones, so CodeDeploy ends up trying to deploy to instances that are already terminated or terminating.
I don't really want to set HealthCheckGracePeriod and PauseTime to ~10-15 minutes... that is way too long.
Are there any best practices for CloudFormation + ELB + AG + CodeDeploy via a Pipeline?
What should be the steps to achieve that?
Thank you!
This stopping/starting of the instances is most probably linked to the deployment type: in-place vs. blue/green.
I have tried both in my setup, and I will try to summarize how they work.
Let's say that for this example you have an Autoscaling group which, at the time of deploying the application, has 2 running instances, and the deployment configuration is OneAtATime. Traffic is controlled by the Elastic Load Balancer. Then:
In-place deployment:
CodeDeploy gets notified of a new revision available.
It tells the ELB to stop directing traffic to the 1st instance.
Once traffic to that instance is stopped, it starts the deployment process: stop the application, download the bundle, etc.
If the deployment is successful (the validate-service hook returned 0), it tells the ELB to resume traffic to that instance.
At this point, 1 instance is running the old code and 1 is running the new code.
Right after that, the ELB stops traffic to the 2nd instance, and CodeDeploy repeats the deployment process there.
Important note:
With the ELB enabled, the time it takes to block traffic to an instance before deployment, and the time it takes to allow traffic after it, depend directly on your health check: time = healthy threshold * interval. For example, a healthy threshold of 5 checks at a 30-second interval means roughly 150 seconds each way.
Blue/green deployment:
CodeDeploy gets notified of a new revision available.
It copies your Autoscaling Group: the same configuration of the group (including scaling policies, scheduled actions, etc.) and the same number of instances (using the same AMI as your original AG) as there were at the start of the deployment - in our case, 2.
At this point, there is no traffic going to the new AG.
CodeDeploy performs all the usual installation steps on one machine.
If successful, it deploys to the second machine too.
It directs traffic from the instances in your old AG to the new AG.
Once traffic is completely re-routed, it deletes the old AG and terminates all its instances (after a period specified in the deployment settings; this option is only available if you select blue/green).
Now ELB is serving only the new AG.
From experience:
Blue/green deployment is a bit slower, since you need to wait for the machines to boot up, but you get a much safer and more fail-proof deployment.
In general I would stick with blue/green, with the load balancer enabled and the deployment configuration AllAtOnce (if it fails, customers won't be affected, since the instances won't be receiving traffic; and it will be 2x as fast, since you deploy in parallel rather than sequentially).
If your health checks and validate-service hooks are thorough enough, you can probably delete the original AG with minimal waiting time (5 minutes at the time of writing this post).
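To kick off such a deployment from a script rather than the console, a hypothetical sketch with the aws-sdk-codedeploy gem (the application, deployment group, and bucket names are assumptions, and the deployment group itself must already be configured for blue/green):

    require 'aws-sdk-codedeploy'

    # Trigger a deployment of a revision stored in S3, using the AllAtOnce
    # configuration recommended above.
    cd = Aws::CodeDeploy::Client.new
    cd.create_deployment(
      application_name: 'my-app',                   # assumed
      deployment_group_name: 'my-blue-green-group', # assumed
      deployment_config_name: 'CodeDeployDefault.AllAtOnce',
      revision: {
        revision_type: 'S3',
        s3_location: {
          bucket: 'my-deploy-bucket',               # assumed
          key: 'my-app.zip',
          bundle_type: 'zip'
        }
      }
    )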

How to finish long-running task on Tomcat when AWS AutoScaling is terminating the EC2 instance?

I have an application deployed to Tomcat 8, hosted in an Elastic Beanstalk environment with auto-scaling enabled. The application runs long-running jobs that must be allowed to finish, with all changes committed to a database.
The problem is that AWS might kill any EC2 instance during scale-in, and then some jobs might not finish as expected. By default, AWS waits just 30 seconds and then kills the Tomcat process.
I've already changed the /etc/tomcat8/tomcat8.conf file: I set the SHUTDOWN_WAIT parameter to 3600 (60 by default). But it didn't fix the issue: the whole instance is killed after 20-25 minutes.
Then I've tried to configure lifecycle hook via .ebextensions file (as it's explained here). But I couldn't approve that the lifecycle hook really postpones termination of the instance (still waiting for an answer from AWS support about that).
So the question is: do you know any "legal" ways to postpone or cancel instance termination when the autoscaling group scales in?
I want to have something like this (see the sketch after this list):
AWS starts to scale in the autoscaling group
the autoscaling group sends a shutdown signal to the EC2 instance
the EC2 instance starts to stop all active processes
the Tomcat process receives a signal to shut down, but waits until the active job is finished
the application commits the job result (this might take even 60 minutes)
the Tomcat process is terminated
the EC2 instance is terminated
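For reference, the lifecycle-hook approach mentioned above would hinge on a script like the following, run on the instance when the termination hook fires. This is a hypothetical sketch only: the hook and group names and the jobs_finished? check are assumptions, and the hook's heartbeat timeout must exceed the sleep interval.

    require 'aws-sdk-autoscaling'
    require 'net/http'

    # Placeholder: replace with a real check against your application/database.
    def jobs_finished?
      true
    end

    asg = Aws::AutoScaling::Client.new
    # IMDSv1 lookup of this instance's id; adapt for IMDSv2 if required.
    instance_id = Net::HTTP.get(URI('http://169.254.169.254/latest/meta-data/instance-id'))

    # Keep the instance in the Terminating:Wait state by heartbeating until
    # the long-running job has committed its results.
    until jobs_finished?
      asg.record_lifecycle_action_heartbeat(
        auto_scaling_group_name: 'my-asg',        # assumed
        lifecycle_hook_name: 'graceful-shutdown', # assumed
        instance_id: instance_id
      )
      sleep 300
    end

    # Allow the termination to proceed.
    asg.complete_lifecycle_action(
      auto_scaling_group_name: 'my-asg',
      lifecycle_hook_name: 'graceful-shutdown',
      lifecycle_action_result: 'CONTINUE',
      instance_id: instance_id
    )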
Elastic Beanstalk consists of two parts: API (web server tier) and Worker. The API tier is auto-scaled, so its instances can go down at any time; the Worker tier is meant for things that run longer. You can communicate between them with SQS. That is how it was designed.
You can of course tweak the system. It is platform as a service, so you can force the auto scaling group not to scale down by setting the minimum number of instances equal to the maximum. You can also turn off the health check, which can otherwise kill an instance... But that is hacking, and it can kick back later.

Elastic Beanstalk rolling update timeout not honored

I am trying to achieve zero-downtime redeploys on AWS Elastic Beanstalk.
I basically have two instances in my environment, coupled with Jenkins for CI (using Tomcat).
What I am trying to achieve is each time I trigger a redeploy from Jenkins, only one instance of the environment is redeployed, then have a timeout to allow the new instance to load the application and then redeploy the second instance.
In order to achieve that timeout, I am setting both the "Pause time" and "Command timeout" options, but unfortunately it is as though this limit is not honored.
The first instance is redeployed, but after around 1 minute the second instance is redeployed too, regardless of the timeout value I set.
Has anyone achieved this? Any insights on how to do it?
"Pause Time" relates to environmental configuration made to instances. "Command timeouts" relates to commands executed to building the environment (for example if you've customised the container). Neither have anything to do with rolling application updates or zero downtime deployments. The documentation around this stuff is confusing and fragmented.
For zero downtime application deployments, AWS EB give you two options:
Batch application updates
Running two environments and cutting over
Option 1 feels like a lot less work, but in my testing it hasn't been truly zero-downtime. There is a HARDCODED timeout after which traffic will be routed to instances after 1 minute, regardless of whether the load balancer health check passes or not.
But if you still want to go ahead, then running two instances and setting a batch size of 50% or 1 should give you what you want.
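A hedged sketch of setting that batch size programmatically with the aws-sdk-elasticbeanstalk gem (the environment name is an assumption):

    require 'aws-sdk-elasticbeanstalk'

    # Configure rolling application deployments in batches of 50% of instances.
    eb = Aws::ElasticBeanstalk::Client.new
    eb.update_environment(
      environment_name: 'my-env', # assumed
      option_settings: [
        { namespace: 'aws:elasticbeanstalk:command',
          option_name: 'DeploymentPolicy', value: 'Rolling' },
        { namespace: 'aws:elasticbeanstalk:command',
          option_name: 'BatchSizeType', value: 'Percentage' },
        { namespace: 'aws:elasticbeanstalk:command',
          option_name: 'BatchSize', value: '50' }
      ]
    )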