Elastic Beanstalk rolling update timeout not honored

I am trying to achieve zero-downtime redeploys on AWS Elastic Beanstalk.
I basically have two instances in my environment coupled with Jenkins for CI (using Tomcat).
What I am trying to achieve is that each time I trigger a redeploy from Jenkins, only one instance of the environment is redeployed, then a timeout gives the new instance time to load the application, and only then is the second instance redeployed.
To achieve that timeout I am setting both the "Pause time" and "Command timeout" values, but unfortunately it's as though this limit is not honored.
The first instance is redeployed, but after around 1 minute the second instance is redeployed regardless of the timeout value I set.
Has anyone achieved this? Any insights on how to do it?

"Pause Time" relates to environmental configuration made to instances. "Command timeouts" relates to commands executed to building the environment (for example if you've customised the container). Neither have anything to do with rolling application updates or zero downtime deployments. The documentation around this stuff is confusing and fragmented.
For zero downtime application deployments, AWS EB give you two options:
Batch application updates
Running two environments and cutting over
Option 1 feels like a lot less work but in my testing hasn't been truly zero downtime. There is a HARDCODED timeout value where traffic will be routed to instances after 1 minute regardless of whether the load balancer healthcheck passes or not.
But if you still want to go ahead then running two instances and setting a batch size of 50% or 1 should give you want you want.
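If you go with option 1, the deployment policy and batch size live in the aws:elasticbeanstalk:command option namespace. Here's a minimal Python/boto3 sketch of setting them; the environment name is a placeholder:

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Deploy to one batch of instances at a time; with two instances,
# a 50% batch size redeploys them one by one.
eb.update_environment(
    EnvironmentName="my-env",  # placeholder
    OptionSettings=[
        {"Namespace": "aws:elasticbeanstalk:command",
         "OptionName": "DeploymentPolicy", "Value": "Rolling"},
        {"Namespace": "aws:elasticbeanstalk:command",
         "OptionName": "BatchSizeType", "Value": "Percentage"},
        {"Namespace": "aws:elasticbeanstalk:command",
         "OptionName": "BatchSize", "Value": "50"},
    ],
)
```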

Related

AWS ECS does not drain connections or remove tasks from Target Group before stopping them

I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly, deregistering the old tasks from the target group and draining all HTTP connections to them before eventually stopping them. However, lately ECS goes straight to stopping the old tasks before draining connections or removing them from the target group. This results in 8-12 seconds of API downtime for us while new HTTP requests continue to be routed to the now-stopped tasks that are still in the target group. This happens whether we trigger the service update via the CLI or the console - same behaviour. Shown here are a screenshot of a sample sequence of Events from ECS demonstrating the issue, as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments. Ultimately this turned out to be because the high startup overhead of the new task versions spinning up briefly pegged the CPU of the instances in the cluster at 100% when there was also sufficient traffic during the deployment, causing some requests to get dropped.
A good-enough-for-now solution was to raise our minimum healthy deployment percentage to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces deployments to "slow down", launching only 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads the high task startup overhead across two smaller CPU spikes rather than one large one and has so far successfully prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
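For reference, those two knobs live in the service's deployment configuration. A hedged Python/boto3 sketch with placeholder cluster and service names:

```python
import boto3

ecs = boto3.client("ecs")

# Slow the rollout: keep all old tasks until replacements are stable, and
# allow only 50% extra capacity (150% total) at any point during a deploy.
ecs.update_service(
    cluster="my-cluster",   # placeholder
    service="my-service",   # placeholder
    deploymentConfiguration={
        "minimumHealthyPercent": 100,
        "maximumPercent": 150,
    },
)
```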

AWS ECS Fargate: Application Load Balancer Returns 503 with a Weird Pattern

(NOTE: I've modified my original post since I've now collected a little more data.)
Something just started happening today after several weeks of no issues and I can't think of anything I changed that would've caused this.
I have a Spring Boot app living behind an NGINX proxy (all Dockerized), all of which is in an AWS ECS Fargate cluster.
After a deployment, I'm noticing that--randomly (as in, sometimes this doesn't happen)--a call to the services being served up by Spring Boot will 503 (behind the NGINX proxy). It seems to do this on every second deployment, with a subsequent deployment fixing the matter, i.e. calls to the server will succeed for a while (maybe a few seconds; maybe a few minutes), but then stop.
I looked at "HealthyHostCount" and noticed that when I get the 503, my main target group says it has no registered/healthy hosts in either AZ. I'm not sure what would cause the TG to deregister a target, especially since a subsequent deployment seems to "fix" the issue.
Any insight/help would be greatly appreciated.
Thanks in advance.
UPDATE
It looks as though it happens right after I "terminate original task set" in the Blue/Green deployment from CodeDeploy. I'm wondering if it's an AZ issue, i.e. whether I haven't specified enough tasks to run on both AZs.
I think it's likely that your services are failing the health check on the TargetGroup.
When the TargetGroup determines that an existing target is unhealthy, it will unregister it, which will cause ECS to then launch a task to replace it. In the meantime you'll typically see Gateway errors like this.
On the ECS page, click your service, then click on the Events tab... if I am correct, here you'd see messages about why the task was stopped like 'unhealthy' or 'health check timeout'.
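If you'd rather check from a script than the console, the same event log is available via the API. A small Python/boto3 sketch with placeholder cluster and service names:

```python
import boto3

ecs = boto3.client("ecs")

# The service's event list is the same feed shown in the console's Events tab.
service = ecs.describe_services(
    cluster="my-cluster",     # placeholder
    services=["my-service"],  # placeholder
)["services"][0]

for event in service["events"][:10]:  # newest events come first
    print(event["createdAt"], event["message"])
```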
The causes can be myriad: an overloaded cluster (probably not the case with Fargate), not enough resources, a runaway request that eats all the memory or CPU.
Your case smells like a slow-startup issue. ECS web services have a 'Health Check Grace Period'... that is, a time to wait before health checks start. If your container takes too long to start when first launched, the first couple of health checks may fail and cause the tasks to 'cycle'. A new deployment may be slower if your images are particularly large, because the ECS host has to download the image. Fargate, in general, is a bit slower than EC2 hosts because of overhead in how it sets up networking. If slow startup is your problem, you can try increasing this grace period (see the sketch below), but you should also investigate how to speed up your container startup time (decrease image size, other nginx tweaks to speed up initialization, etc).
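Raising the grace period is a one-call change on the service. A hedged Python/boto3 sketch; the names and the 120-second value are placeholders:

```python
import boto3

ecs = boto3.client("ecs")

# Give new tasks 120 seconds after launch before ALB health checks count,
# so a slow container startup doesn't get the task killed and cycled.
ecs.update_service(
    cluster="my-cluster",   # placeholder
    service="my-service",   # placeholder
    healthCheckGracePeriodSeconds=120,
)
```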
It turns out it was an issue with how I'd configured one of my target groups and/or the ALB, although I couldn't determine what the exact issue was, as I re-created the TGs and the service from scratch.
If I had to guess, I'd guess my TG wasn't configured with the right port, or the ALB had a bad rule/listener.
If anyone else hits this issue after creating a new environment for an existing app, make sure you configure your security groups. My environment was timing out (and had bad health) because my web instances did not have access to the database: the inbound rule on AWS was missing.
You can tell whether this is your issue by checking if you are able to reach URLs in your app that don't touch those web services/databases.
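As a sketch of that kind of fix, the database's inbound rule just needs to reference the web tier's security group. Python/boto3, with placeholder group IDs and MySQL's port 3306 assumed:

```python
import boto3

ec2 = boto3.client("ec2")

DB_SG = "sg-0aaaaaaaaaaaaaaaa"   # placeholder: the database's security group
WEB_SG = "sg-0bbbbbbbbbbbbbbbb"  # placeholder: the web instances' security group

# Allow the web tier to reach the database on port 3306 (adjust per DB engine).
ec2.authorize_security_group_ingress(
    GroupId=DB_SG,
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 3306,
        "ToPort": 3306,
        "UserIdGroupPairs": [{"GroupId": WEB_SG}],
    }],
)
```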

Capistrano - mark instances as pending during deployment to an ALB target group

I am deploying a Rails application to an autoscaled environment using custom tasks in my deploy files (basically I am using the Ruby aws sdk to select instances by tags matching my production environment and deploying to those instances)
Those instances are registered under target groups, and traffic is distributed to those TGs from an Application Load Balancer (ELBv2).
During my Capistrano deployments, the deploy:restart task asks to restart the server (I am using Phusion Passenger) to pick up the new application. Since restarting can be quite long (up to 1 min), I have added a custom restart wait option of 60 seconds to ensure my servers are restarted one by one, so as to ensure continuous availability of my service.
However, the one thing that is missing, and that makes the above delay useless, is that during this time my ALB keeps sending requests to those instances, because they are never marked as "unhealthy" or "pending" in my target groups.
I have seen some libraries like https://github.com/thattommyhall/capistrano-elb, but unfortunately they are quite outdated and not made to work with ALBs and TGs.
One last piece of info: my Capistrano deploy task actually deploys to several machines matching different roles:
API servers (front-facing, behind the ALB+TG as described above)
Workers and schedulers (these are not behind any ALB, so no special precautions must be taken)
So my (sub-)question(s) is (are):
Is it possible to manually flag an instance behind a TG as "pending"? If not, would a deregister followed by an immediate register achieve the same thing?
How can I, from a Capistrano task, do the above for the instances of the :api role, assuming the instances are all in the AWS cloud, with an IAM role, under one target group? (Actually, it would be useful if I could get some tricks to support several TGs for the same instance.)
I'm currently setting up autoscaling with lifecycle hooks and will probably get to this later, but otherwise a possible solution (that I haven't validated yet) would be to (see the sketch after these steps):
deregister the target (CLI).
(optionally) wait for the target to be deregistered (CLI).
resume the passenger:restart command.
(optionally) register the target again (CLI).
wait for the target to be in service (CLI).
proceed with the same hooks on the next instance.
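Here is a hedged Python/boto3 sketch of those steps; the question uses the Ruby aws-sdk, where the same calls exist on Aws::ElasticLoadBalancingV2::Client, and the target group ARN and instance ID below are placeholders:

```python
import boto3

elbv2 = boto3.client("elbv2")
TG_ARN = "arn:aws:elasticloadbalancing:..."  # placeholder target group ARN
target = {"Id": "i-0123456789abcdef0"}       # placeholder instance ID

# 1-2. Deregister the target and wait for connection draining to finish.
elbv2.deregister_targets(TargetGroupArn=TG_ARN, Targets=[target])
elbv2.get_waiter("target_deregistered").wait(TargetGroupArn=TG_ARN, Targets=[target])

# 3. Restart Passenger on the instance here (e.g. the Capistrano restart task).

# 4-5. Register the target again and wait until it passes health checks.
elbv2.register_targets(TargetGroupArn=TG_ARN, Targets=[target])
elbv2.get_waiter("target_in_service").wait(TargetGroupArn=TG_ARN, Targets=[target])
```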
The speed of execution would depend on the server restart time and health check frequency. Maybe a better solution, if there are enough servers in production, would be to skip the wait times and ensure you always have a "window" of x servers online.
i.e. suppose you have 5 servers that take 30 sec to restart; you could deregister-restart-register a server every 15 sec to make sure there are always 2 servers up at any time (assuming the health checks are frequent enough to mark an instance as healthy within 15 sec).

How do I set the instance count of an Elastic Beanstalk environment to 0?

In other words I would like to temporarily turn off an environment (and its associated billing costs) but not delete it entirely.
It seems if I set the [Configuration > Web Tier > Scaling > Minimum instance count] to 0 along with the related "Maximum instance count", AWS rejects those settings as invalid. Ditto for questionable values like 0.1.
Any ideas for temporarily taking an Elastic Beanstalk environment out of service?
You can achieve this by defining time periods in your environment.
From your Elastic Beanstalk environment's dashboard, navigate through: Configuration -> Capacity -> Time-based scaling -> Scheduled actions.
You may choose to use multiple scheduled actions to start and stop your environment.
i.e. One to start your environment and another to stop it, or scale out during high load periods and back in after.
In the background, Elastic Beanstalk will terminate and recreate the EC2 instances. This means you'll lose any data on the instances' local storage.
I have used this to turn my environments off when they are no longer needed. For example, I have a workload that crawls looking for new content, and I pause this crawling after it has completed. I also use this feature to spin down my non-production environments outside of working hours.
I prefer this method to others as it is supported by both single instance and load-balanced environments.
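Scheduled actions can also be set programmatically through the aws:autoscaling:scheduledaction option namespace. A hedged Python/boto3 sketch; the environment name, action names, sizes, and cron expressions below are all placeholders:

```python
import boto3

eb = boto3.client("elasticbeanstalk")

def action(name, option, value):
    # Each scheduled action is addressed by its ResourceName.
    return {"Namespace": "aws:autoscaling:scheduledaction",
            "ResourceName": name, "OptionName": option, "Value": value}

eb.update_environment(
    EnvironmentName="my-env",  # placeholder
    OptionSettings=[
        # Scale to zero every evening...
        action("ScaleDown", "MinSize", "0"),
        action("ScaleDown", "MaxSize", "0"),
        action("ScaleDown", "Recurrence", "0 19 * * *"),  # 19:00 UTC daily
        # ...and back up every morning.
        action("ScaleUp", "MinSize", "1"),
        action("ScaleUp", "MaxSize", "2"),
        action("ScaleUp", "Recurrence", "0 7 * * *"),     # 07:00 UTC daily
    ],
)
```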
How about eb scale 0 using the Elastic Beanstalk CLI tools?
I don't believe this is possible. In any case, you would still have a load balancer in front, for which you pay, so you're not really reducing the costs for this environment.
It makes more sense to delete the environment completely and create a script that will deploy a version from your application versions to a new environment.
I now use the ParkMyCloud service to suspend Elastic Beanstalk environments (and also RDS database instances) that I'm not using, both manually and in scheduled fashion.
Although ParkMyCloud is non-free, it's saved me a lot more money than I've spent using it. It's worked quite well over the last few years so I'm comfortable recommending it.

Elastic Beanstalk Environment stuck on grey health

My AWS Elastic Beanstalk Environment is stuck in Health: Grey.
My application is working, I can access it fine. However, I am unable to change configuration or deploy new versions because I get a message saying that
Environment named ______ is in an invalid state for this operation. Must be Ready.
If I run eb health on my console, I get the following output:
Status: Ready  Health: Grey
And
ELB State: InService
Is there anything I can try to revive my environment? I have contacted AWS Support, but they are really slow. Another option I can think of is terminating the environment and creating a new one, but I would really prefer to avoid that.
EB can be fairly tough to troubleshoot even when you have full access to EB, the instances, the ELBs, etc... never mind trying to proxy this through SO.
I'd do the following:
Bring up a new environment under the same application
When it comes up green, use the EB application "Swap" functionality to swap the environments
More details on this process are here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/using-features.CNAMESwap.html
This performs a DNS switch, so you should have no downtime. You'll still have the old environment running if you want to troubleshoot it later with your friendly AWS support staff.
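If you'd rather script the swap, it's a single API call. A minimal Python/boto3 sketch with placeholder environment names:

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Swap the CNAMEs of the stuck environment and the freshly created one;
# traffic follows the CNAME, so this is effectively a DNS cutover.
eb.swap_environment_cnames(
    SourceEnvironmentName="my-env-old",       # placeholder
    DestinationEnvironmentName="my-env-new",  # placeholder
)
```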
The only negatives are:
You'll continue paying for both environment stacks while you keep the old one around to troubleshoot.
The DNS side is a little tough, as you can't guarantee clients respect the short timeouts EB DNS entries have. They should, but someone may decide to keep using a locally cached version. As with anything relying on trusting client-side features, it's a bit out of your control.
If you deploy an RDS DB via EB, you can't use the swap, as the DB is tied to the environment (NEVER deploy an RDS DB in a production EB environment via EB!).
I know the question's already been answered, but I think the cause of the problem is worth covering, rather than just recommending a complete rebuild of OP's environment.
Elastic Beanstalk has 4 different colors - green, yellow, red, and grey. However, each color can mean multiple different things that vary wildly. Here are the potential statuses behind the grey color:
Grey (Suspended) - Your application has had such severe health issues that Elastic Beanstalk is no longer monitoring it
Grey (Unknown) - The health agent has not reported enough data on an instance yet
Grey (Pending) - An operation is in progress on an instance within the command timeout (for example bootstrapping the environment)
Notice the incredible disparity between "Pending" and "Suspended". In Pending, it just needs a little more time, or perhaps you can shut down a single resource and have it respawn. In Suspended, all monitoring is shot, and you ought to rebuild the environment ASAP. That's a big difference in impact to customers while you work through the solution.
Baked into Beanstalk are the vanilla colors. To get the additional statuses, you have to enable Enhanced Health Monitoring. You can do it in a couple of minutes, and the cost is nominal.
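Enhanced health reporting is a one-option change on the environment. A hedged Python/boto3 sketch, with a placeholder environment name:

```python
import boto3

eb = boto3.client("elasticbeanstalk")

# Switch health reporting from "basic" to "enhanced" so the detailed
# statuses (Suspended, Unknown, Pending, ...) are reported.
eb.update_environment(
    EnvironmentName="my-env",  # placeholder
    OptionSettings=[{
        "Namespace": "aws:elasticbeanstalk:healthreporting:system",
        "OptionName": "SystemType",
        "Value": "enhanced",
    }],
)
```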
To read more about the statuses and common problems with Beanstalk, I'd recommend a blog my colleague wrote: Health Monitoring in AWS Beanstalk
A more difficult situation occurs when the environment state is unknown and even the Abort current operation option doesn't work. To solve this I had to apply these steps:
copy the instance id associated with Beanstalk environment
find the instance using that instance id in EC2 dashboard
stop the EC2 instance
start the EC2 instance
The environment should return to its previous state after the instance restarts; a scripted version of these steps is sketched below.
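A hedged Python/boto3 sketch of that stop/start cycle, with a placeholder instance ID:

```python
import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: the environment's instance

# Stop the instance and wait until it is fully stopped before starting it again.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```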