AWS ECS Spring Boot Task killed and restarted on background work - amazon-web-services

I have a Spring Boot web application running as an AWS ECS service on Fargate with a desired count of 1. It's configured with an LB in front for SSL termination and health checks.
Each night, via @Scheduled, I run a batch job that does some recalculations. At various points either during or shortly after that job runs, my task is killed and a new one is spun up. While the task is running I notice a few things:
CPU on the service (via CloudWatch) spikes above 60%
My health checks from the load balancer still respond in a good amount of time
There are no errors in my spring boot logs
In the ECS service events I see `service sname-app-lb deregistered 1 targets in target-group ecs-sname-app-lb`
I'm trying to figure out how to tell exactly why the task is being killed. Any tips on how to debug / fix this would be greatly appreciated.

So, I have had a similar experience in the past. This is what you need to do:
1. Make sure you are streaming the application logs to CloudWatch using the awslogs log driver in the task definition (if you are not doing so already).
2. Add a delay in the app's catch/exception handlers wherever it can fail. This delay makes sure the application logs are sent to CloudWatch Logs in the event of an exception, and thus prevents an abrupt exit of the task from losing them.
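For step 1, a minimal sketch of the container's logConfiguration in the task definition (the log group name and region are placeholders, not taken from the question):

```json
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/sname-app",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "ecs"
  }
}
```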
I initially thought it was a Fargate issue, but the above really helped me understand the underlying problem. All the best.

If you are running your Spring application inside Docker on AWS Fargate and it hits the memory limit, your application could get killed.
More information: https://developers.redhat.com/blog/2017/03/14/java-inside-docker/
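If the memory limit does turn out to be the cause, one common mitigation (a sketch, assuming a container-aware JVM, i.e. JDK 8u191+/10+) is to cap the heap relative to the container limit rather than relying on defaults, e.g. in the Dockerfile:

```dockerfile
# Sketch: use at most 75% of the container's memory limit for heap,
# and exit (rather than hang) on OutOfMemoryError so ECS records a clean failure.
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", "-XX:+ExitOnOutOfMemoryError", "-jar", "/app.jar"]
```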

Related

AWS ECS does not drain connections or remove tasks from Target Group before stopping them

I've been experiencing this with my ECS service for a few months now. Previously, when we would update the service with a new task definition, it would perform the rolling update correctly: deregistering the old tasks from the target group and draining all HTTP connections to them before eventually stopping them. However, lately ECS goes straight to stopping the old tasks before draining connections or removing them from the target group. This results in 8-12 seconds of API downtime for us while new HTTP requests continue to be routed to the now-stopped tasks that are still in the target group. This happens whether we trigger the service update via the CLI or the console - same behaviour. Shown here is a screenshot of a sample sequence of events from ECS demonstrating the issue, as well as the corresponding ECS agent logs for the same instance.
Of particular note when reviewing these ECS agent logs against the sequence of events is that the logs do not have an entry at 21:04:50 when the task was stopped. This feels like a clue to me, but I'm not sure where to go from here with it. Has anyone experienced something like this, or have any insights as to why the tasks wouldn't drain and be removed from the target group before being stopped?
For reference, the service is behind an AWS application load balancer. Happy to provide additional details if someone thinks of what else may be relevant
It turns out that ECS changed the timing of when the events would be logged in the UI in the screenshot. In fact, the targets were actually being drained before being stopped. The "stopped n running task(s)" message is now logged at the beginning of the task shutdown lifecycle steps (before deregistration) instead of at the end (after deregistration) like it used to.
That said, we were still getting brief downtime spikes on our service at the load balancer level during deployments. Ultimately this turned out to be the high startup overhead of the new task versions: spinning them up briefly pegged the CPU of the instances in the cluster at 100% when there was also sufficient traffic during the deployment, causing some requests to get dropped.
A good-enough-for-now solution was to raise our minimum healthy deployment percentage to 100% and set the maximum deployment percentage to 150% (as opposed to the old 200% setting), which forces deployments to "slow down", launching only 50% of the intended new tasks at a time and waiting until they are stable before launching the rest. This spreads the high task-startup overhead across two smaller CPU spikes rather than one large one, and has so far prevented any more downtime during deployments. We'll also be looking into reducing the startup overhead itself. Figured I'd update this in case it helps anyone else out there.
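If it helps anyone, the same deployment settings can be applied from the CLI (the cluster and service names below are placeholders):

```shell
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration "minimumHealthyPercent=100,maximumPercent=150"
```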

AWS ECS fargate task stopping and restarting somewhat randomly

One of my ECS Fargate tasks is stopping and restarting in what seems to be a somewhat random fashion. I started the task in Dec 2019 and it has stopped/restarted three times since then. I've found that the task stopped and restarted from its 'Events' log (image below), but there's no info provided as to why it stopped.
So what I've tried to date to debug this is:
Checked the 'Stopped' tasks inside the cluster for info as to why it might have stopped. No luck here as it appears 'Stopped' tasks are only held there for a short period of time.
Checked CloudWatch logs for any log messages that could be pertinent to this issue, nothing found
Checked CloudTrail event logs for any event pertinent to this issue, nothing found
Confirmed the memory and CPU utilisation is sufficient for the task; in fact the task never reaches 30% of its limits
Read multiple AWS threads about similar issues where the solutions mainly seem to involve an ELB, which I'm not using.
Anyone have any further debugging advice or ideas about what might be going on here?
I ran into the same issue and found this from AWS:
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-maintenance.html
When AWS determines that a security or infrastructure update is needed for an Amazon ECS task hosted on AWS Fargate, the tasks need to be stopped and new tasks launched to replace them.
Also, a GitHub issue on storing stopped-task info in CloudWatch Logs:
https://github.com/aws/amazon-ecs-agent/issues/368
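Two sketches that may help here (the cluster name, rule name, and task ARN are placeholders): pull the stoppedReason while ECS still retains the stopped task record (roughly an hour), and set up an EventBridge rule so stopped-task events survive past that window:

```shell
# Why was a recently stopped task stopped?
aws ecs list-tasks --cluster my-cluster --desired-status STOPPED
aws ecs describe-tasks --cluster my-cluster --tasks <task-arn> \
  --query "tasks[].{reason:stoppedReason,code:stopCode}"

# Capture future STOPPED events (attach a CloudWatch Logs group as the rule's target).
aws events put-rule --name ecs-stopped-tasks \
  --event-pattern '{"source":["aws.ecs"],"detail-type":["ECS Task State Change"],"detail":{"lastStatus":["STOPPED"]}}'
```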

AWS ECS Fargate: Application Load Balancer Returns 503 with a Weird Pattern

(NOTE: I've modified my original post since I've now collected a little more data.)
Something just started happening today after several weeks of no issues and I can't think of anything I changed that would've caused this.
I have a Spring Boot app living behind an NGINX proxy (all Dockerized), all of which is in an AWS ECS Fargate cluster.
After deployment, I'm noticing that--randomly (as in, sometimes this doesn't happen)--a call to the services served up by Spring Boot will return a 503 (behind the NGINX proxy). It seems to do this on every second deployment, with a subsequent deployment fixing the matter, i.e. calls to the server will succeed for a while (maybe a few seconds; maybe a few minutes), but then stop.
I looked at the "HealthyHostCount" metric and noticed that when I get the 503, my main target group says it has no registered/healthy hosts in either AZ. I'm not sure what would cause the TG to deregister a target, especially since a subsequent deployment seems to "fix" the issue.
Any insight/help would be greatly appreciated.
Thanks in advance.
UPDATE
It looks as though it seems to happen right after I "terminate original task set" from the Blue/Green deployment from CodeDeploy. I'm wondering if it's an AZ issue, i.e. I haven't specified enough tasks to run on them both.
I think it's likely your services are failing the health check on the TargetGroup.
When the TargetGroup determines that an existing target is unhealthy, it will unregister it, which will cause ECS to then launch a task to replace it. In the meantime you'll typically see Gateway errors like this.
On the ECS page, click your service, then click the Events tab... if I am correct, you'd see messages here about why the task was stopped, like 'unhealthy' or 'health check timeout'.
The cause can be myriad. An overloaded cluster (probably not the case with Fargate), not enough resources, a runaway request that eats all the memory or CPU.
Your case smells like a slow-startup issue. ECS web services have a 'Health Check Grace Period' - that is, a time to wait before health checks start. If your container takes too long to get started when first launched, the first couple of health checks may fail and cause the tasks to 'cycle'. A new deployment may be slower if your images are particularly large, because the ECS host has to download the image. Fargate, in general, is a bit slower than EC2 hosts because of overhead in how it sets up networking. If slow startup is your problem, you can try increasing this grace period, but you should also investigate how to speed up your container startup (decrease image size, other NGINX tweaks to speed initialization, etc.).
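Raising the grace period can also be done from the CLI; a sketch (the names and the 120-second value are placeholders to adjust for your startup time):

```shell
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --health-check-grace-period-seconds 120
```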
It turns out it was an issue with how I'd configured one of my target groups and/or the ALB, although what the exact issue was, I couldn't determine as I re-created the TGs and the service from scratch.
If I had to guess, I'd guess my TG wasn't configured with the right port, or the ALB had a bad rule/listener.
If anyone else has this issue after creating a new environment for an existing app, make sure you configure your security groups. My environment was timing out (and showing bad health) because my web instances did not have access to the database - the inbound rule for them was missing on AWS.
You can tell if this is your issue if you are able to reach URLs in your app that do not connect to the same web services/databases.

How to finish long-running task on Tomcat when AWS AutoScaling is terminating the EC2 instance?

I have an application that is deployed to Tomcat 8 that is hosted on ElasticBeanstalk environment with enabled auto-scaling. In the application I have long-running jobs that must be finished and all changes must be committed to a database.
The problem is that AWS might kill any EC2 instance during scale-in, and then some jobs might not finish as expected. By default, AWS waits just 30 seconds and then kills the Tomcat process.
I've already changed the /etc/tomcat8/tomcat8.conf file: set the parameter SHUTDOWN_WAIT to 3600 (60 by default). But it didn't fix the issue - the whole instance is killed after 20-25 minutes.
Then I've tried to configure lifecycle hook via .ebextensions file (as it's explained here). But I couldn't approve that the lifecycle hook really postpones termination of the instance (still waiting for an answer from AWS support about that).
So the question is: do you know any "legal" ways to postpone or cancel instance termination when the autoscaling group scales in?
I want to have something like that:
AWS starts to scale in the autoscaling group
autoscaling group sends shutdown signal to the EC2 instance
EC2 instance starts to stop all active processes
Tomcat process receives a signal to shutdown, but waits until the active job is finished
the application commits the job result (it might take even 60 minutes)
Tomcat process is terminated
the EC2 instance is terminated
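For what it's worth, the lifecycle-hook route mentioned above, sketched directly against the Auto Scaling group (all names are placeholders), pauses termination until the instance reports it is done, up to the heartbeat timeout:

```shell
# Pause instance termination for up to an hour while jobs drain.
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name wait-for-jobs \
  --auto-scaling-group-name my-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 3600

# When the job finishes, the instance (or a script on it) signals completion:
aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name wait-for-jobs \
  --auto-scaling-group-name my-asg \
  --lifecycle-action-result CONTINUE \
  --instance-id i-0123456789abcdef0
```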
Elastic Beanstalk consists of two parts - the API (web server) tier and the Worker tier. The API tier is auto-scaled, so its instances can go down. The Worker tier is meant for things that run longer. You can communicate between them with SQS. That is how they designed it.
For sure you can tweak the system. It is a platform as a service, so you can force the Auto Scaling group not to scale in - by setting min instances equal to max. You can also turn off the health check - that can also kill an instance... But that is hacking, and the latter can kick back.

Why every time Elastic Beanstalk issues a command to its instance it always timed out?

I have a PHP application deployed to Amazon Elastic Beanstalk. But I've noticed a problem: every time I push my code changes via git aws.push to Elastic Beanstalk, the deployed application doesn't pick up the changes. I checked the events log on my application's Beanstalk environment and noticed that every time Beanstalk issues:
Deploying new version to instance(s)
it's always followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own):
[i-d5xxxxx]
The same thing happens when I try to request snapshot logs. The Beanstalk issues:
requestEnvironmentInfo is starting
then after a few minutes it's again followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [i-d5xxxxx].
I had this problem a few times. It seems to affect only particular instances. So it can be solved by terminating the EC2 instance (done via the EC2 page on the Management Console). Thereafter, Elastic Beanstalk will detect that there are 0 healthy instances and automatically launch a new one.
If this is a production environment, you have only 1 instance, and you want minimal downtime:
1. Configure minimum instances to 2; Beanstalk will launch another instance for you.
2. Terminate the problematic instance via the EC2 tab; Beanstalk will launch a replacement because the minimum instance count is 2.
3. Configure minimum instances back to 1; Beanstalk will remove one of your two instances.
By default, Elastic Beanstalk "throws a timeout exception" after 8 minutes (480 seconds, defined in settings) if your commands did not complete in time.
You can set a higher timeout, up to 30 minutes (1800 seconds):
{
  "Namespace": "aws:elasticbeanstalk:command",
  "OptionName": "Timeout",
  "Value": "1800"
}
Read here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options.html
Had the same issue here (single t1.micro instance).
Did solve the problem by rebooting the EC2 instance via the EC2 page on the Management Console (and not from EB page).
Beanstalk deployment (and other features like Get Logs) works by sending SQS commands to instances. An SQS client is deployed to the instances and checks the queue about every 20 seconds (see /var/log/cfn-hup.log):
2018-05-30 10:42:38,605 [DEBUG] Receiving messages for queue https://sqs.us-east-2.amazonaws.com/124386531466/93b60687a33e19...
If the SQS client crashes or has network problems on t1/t2 instances, it will not be able to receive commands from Beanstalk, and the deployment will time out. Rebooting the instance restarts the SQS client so it can receive commands again.
An easier way to fix the SQS client is to restart the cfn-hup service:
sudo service cfn-hup restart
In the case of deployment, an alternative to shutting down the EC2 instances and waiting for Elastic Beanstalk to react, or messing about with minimum and maximum instances, is to simply perform a Rebuild environment on the target environment.
If a previous deployment failed due to timeout then the new version will still be registered against the environment, but due to the timeout it will not appear to be operational (in my experience the instance appears to still be running the old version).
Rebuilding the environment seems to reset things with the new version being used.
Obviously there's the downside with that of a period of downtime.
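For reference, the rebuild can also be triggered from the CLI (the environment name is a placeholder):

```shell
aws elasticbeanstalk rebuild-environment --environment-name my-env
```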
I think the correct way to deal with this is to figure out the cause of the timeout by doing what this answer suggests.
chongzixin's answer is what needs to be done if you need this fixed ASAP before investigating the reason for a timeout.
However, if you do need to increase timeout, see the following:
Add configuration files to your source code in a folder named .ebextensions and deploy it in your application source bundle.
Example:
option_settings:
  "aws:elasticbeanstalk:command":
    Timeout: 2400
*"Timeout" represents the length of time before timing out, in seconds.
Reference: https://serverfault.com/a/747800/496353
"Restart App Server(s)" from the "Actions" menu in the Elastic Beanstalk management dashboard, followed by eb deploy, fixes it for me.
After two days of checking random issues, I restarted both EC2 instances one after the other to make sure there was no downtime. The site worked fine, but after a while the website started throwing 504 errors.
When I checked the HTTP server, nginx was down and an "out of HDD space" error was being thrown. I increased the disk size, Elastic Beanstalk created new instances, and the issue was fixed.
For me, the problem was my VPC security group rules. According to the docs, you need to allow outbound traffic on port 123 for NTP to work. I had the port closed, so the clock was drifting, and the EC2 instances became unresponsive to commands from the Elastic Beanstalk environment - taking forever to deploy (only to time out), failing to get logs, etc.
Thank you @Logan Pickup for the hint in your comment.
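If your setup matches this, a sketch of the missing egress rule (the security group ID is a placeholder; NTP uses UDP port 123):

```shell
aws ec2 authorize-security-group-egress \
  --group-id sg-0123456789abcdef0 \
  --protocol udp \
  --port 123 \
  --cidr 0.0.0.0/0
```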