AWS Elastic Beanstalk restarts Docker if CPU is at 100% for an extended time - amazon-web-services

We have Elastic Beanstalk set to load balancing. When our app consumes 100% CPU for an extended time (e.g. after some downtime, when we receive tons of webhooks), the load balancer restarts Docker inside the instance. Our app takes approximately 2 minutes to start, so we can never recover from downtime.
Is there any way to extend this restart period, or even disable it?
Scaling on a CPU threshold is not an option for us, as our app consumes lots of CPU during higher load.

This seems like a case of a failed health check.
You can go to your EC2 Dashboard => Load Balancers.
Check the Load Balancer that targets your EB environment; under the Health Check tab, you can see and edit the threshold of failed ping requests to your instance before it is considered unhealthy and terminated.
More information on health checks here and here
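If you prefer to script the change, here is a minimal boto3 sketch, assuming a classic ELB (what Elastic Beanstalk provisioned at the time) and a hypothetical load balancer name; substitute your own values:

```python
import boto3

elb = boto3.client('elb')

# 'awseb-my-env' is a hypothetical ELB name; look yours up in the EC2 console.
elb.configure_health_check(
    LoadBalancerName='awseb-my-env',
    HealthCheck={
        'Target': 'HTTP:80/',      # what the ELB pings
        'Interval': 30,            # seconds between pings
        'Timeout': 5,              # seconds before a ping counts as failed
        'UnhealthyThreshold': 10,  # raise this to tolerate longer CPU spikes
        'HealthyThreshold': 3,
    },
)
```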

Upgrading the instance from small to medium actually solved my problem. It seems the app could not handle this amount of load with the limited resources of the small instance type.

Related

Why does deploying a Fargate service behind a load balancer take more time than the service on its own?

I'm building some basic architecture with backend and frontend Docker containers.
I started by deploying a backend container: I defined a task with AWS Fargate and configured its own service. Deployments took about 2 minutes, but now that I have a Network Load Balancer with a target group pointing at my container, a deploy takes about 6-10 minutes. Is that normal?
The deploy time itself doesn't bother me, but while I'm studying this technology it feels a little slow (compared with, for example, a Kubernetes cluster). My plan is to build a bigger architecture with backend, frontend, DB and auto-scalers, just for learning.
Without your configuration (CloudFormation template) or some screenshots it's hard to say exactly what is causing this, but I can point you to some areas that could cause longer deployments (see the sketch after this list):
You possibly have the default health check interval set to 30 seconds. This means the load balancer waits 30 seconds before carrying out another check on the TCP port.
You can reduce this to 10 seconds to speed up health checks.
You possibly have the default healthy threshold set to 3. Coupled with the default check interval, this adds around 90 seconds to stabilise a container and return a healthy status.
You may be deploying a single container without Cross-Zone load balancing enabled, while using the round-robin routing algorithm.
This setup can fail a health check, and then the threshold count starts again; you need 3 passes in a row to return a healthy status.
Lastly, because Fargate has no image caching, the pull time is a bit longer than on ECS backed by EC2 instances, which can also add to the deployment time.
With all the above configured correctly you can usually get back a few minutes on a Network Load Balancer target group deployment, but they are usually a bit slower than Application Load Balancer deployments.
If your app does not require a Network Load Balancer, it's recommended to use an Application Load Balancer. An Application Load Balancer goes much deeper: it can determine availability based not only on a successful HTTP GET of a particular page but also on verifying that the content is as expected based on the input parameters.
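For reference, tightening those target group settings can be scripted. A minimal boto3 sketch, with a hypothetical target group ARN (substitute your own); note that setting both threshold counts equal keeps the call valid for TCP health checks:

```python
import boto3

elbv2 = boto3.client('elbv2')

# Hypothetical target group ARN; substitute your own.
elbv2.modify_target_group(
    TargetGroupArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123',
    HealthCheckIntervalSeconds=10,  # down from the 30-second default
    HealthyThresholdCount=2,        # down from the default of 3
    UnhealthyThresholdCount=2,
)
```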

Fargate deployment restarting multiple times before it comes online

I have an ECS service deployed into Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I see the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.
But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This is repeated multiple times (the number actually varies; today it took 7 hours for the rolling update to finish).
There are no changes to the infra after the deployment. The security group allows traffic within the VPC. The NLB and the ECS service are deployed into the same VPC and the same subnet.
The Fargate health check fails for the task with the same Docker image N times, but after that it starts working.
The target group's healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port and the interval is 30. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task came up, I tried opening a security group rule for the VPN and accessing the task IP directly. I can reach the microservice directly via the task IP.
So why is the NLB health check failing?
I had the exact same issue.
I simulated it with different images (Go, Python), as I suspected CPU/memory utilization overhead, which turned out to be false.
One mitigation is changing the Fargate deployment parameter Minimum healthy percent to 50% (it was 100% before, which seemed to cause the issue).
After the change the failures became seldom, but they still occurred.
The real solution is still unknown; it seems to be something related to the NLB configuration in Fargate.
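For what it's worth, that deployment parameter can be changed without touching the console. A minimal boto3 sketch, assuming hypothetical cluster and service names:

```python
import boto3

ecs = boto3.client('ecs')

# 'my-cluster' and 'my-service' are hypothetical names; substitute your own.
ecs.update_service(
    cluster='my-cluster',
    service='my-service',
    deploymentConfiguration={
        'minimumHealthyPercent': 50,  # was 100, which appeared to trigger the issue
        'maximumPercent': 200,
    },
)
```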

AWS CodeDeploy BlockTraffic/AllowTraffic durations

I've been using AWS CodeDeploy to push our applications live, but it always takes ages in the BlockTraffic and AllowTraffic steps. Currently, I have an Application Load Balancer (ALB) with three EC2 nodes initially (behind an auto-scaling group). So, if I do a CodeDeploy OneAtATime, the whole process takes up to 25 minutes.
The load balancer I'm using had connection draining set to 300s. I thought that was the reason for the drag, but I disabled connection draining and got the same results. I then enabled connection draining with a 5-second timeout and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the status of the instance to be healthy ("inService") behind the load balancer. This health check is done by ALB and depends on the health check configuration.
So I tried setting low timeouts and thresholds for the health check settings. Even those changes didn't reduce the deployment time much.
Can someone direct me to a proper solution to speed up the process?
The issue is the deregistration of instances from the AWS target group. You want to update the deregistration_delay.timeout_seconds attribute on the target group; by default it's 300s, which is 5 minutes. The docs can be found here.
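A minimal boto3 sketch for lowering that attribute (the target group ARN is a placeholder; substitute your own):

```python
import boto3

elbv2 = boto3.client('elbv2')

elbv2.modify_target_group_attributes(
    TargetGroupArn='arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123',
    Attributes=[
        # Default is 300 seconds; a lower value shortens BlockTraffic/AllowTraffic.
        {'Key': 'deregistration_delay.timeout_seconds', 'Value': '30'},
    ],
)
```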

Elastic Beanstalk reports 5xx errors even though instances are in perfect health

I need to set up an API application for gathering event data to be used in a recommendation engine. This is my setup:
Elastic Beanstalk env with a load balancer and autoscaling group.
I have 2x t2.medium instances running behind a load balancer.
The Elastic Beanstalk configuration is 64-bit Amazon Linux 2016.03 v2.1.1 running Tomcat 8 Java 8.
Additionally, I have 8x t2.micro instances that I use for high-load testing of the API, sending thousands of requests/sec to be handled by the API.
I'm using Locust (http://locust.io/) as my load testing tool.
Each t2.micro instance run by Locust can send up to about 500 req/sec.
Everything works fine while the req/sec stay below 1000, maybe 1200. Once over that, my load balancer reports that some of the instances behind it are reporting 5xx errors (attached). I've also tried with 4 instances behind the load balancer, and although things start out well with up to 3000 req/sec, soon after, the EB health tool and Locust both report 503s and 504s, while all of the instances are in perfect health according to the actual numbers in the EB Health Overview, showing only 10%-20% CPU utilization.
Is there something I'm missing in configuring the env? It seems like no matter how many machines I have behind the load balancer, the env handles no more than 1000-2000 requests per second.
EDIT:
Now I know for sure that it's the ELB causing the problems, not the instances.
I ran a load test with 10 simulated users. Each user sends about 1 req/sec, and the load increases by 10 users/sec up to 4000 users, which should equal about 4000 req/sec. Still, it doesn't seem to like any request rate over 3.5k req/sec (attachment1).
As you can see from attachment2, the 4 instances behind the load balancer are in perfect health, but I still keep getting 503 errors. It's the load balancer itself causing problems. Look how SurgeQueueLength and SpilloverCount increase rapidly at some point (attachment3). I'm trying to figure out why.
I also completely removed the load balancer and tested with just one instance alone. It can handle up to about 3k req/sec (attachment4 and attachment5), so it's definitely the load balancer.
Maybe I'm missing some crucial limit that load balancers have by default, like the surge queue size of 1024? What is a normal handling rate for one load balancer? Should I be adding more load balancers? Could it be related to availability zones, with ELB listeners in one zone trying to route to instances in a different zone?
UPDATE:
Cross zone load balancing is enabled
UPDATE:
Maybe this helps more:
The message says that "9.8% of the requests to the ELB are failing with HTTP 5xx (6 minutes ago)". This does not mean that your instances are returning HTTP 5xx responses; the requests are failing at the ELB itself. This can happen when your backend instances are at capacity (e.g. their connections are saturated and they are rejecting new connections from the ELB).
Your requests are spilling over at the ELB. They never make it to the instances. If they were failing at the EC2 instances, the cause would be different and the data for the environment would match the data for the instances.
Also note that the cause says this was the state "6 minutes ago". Elastic Beanstalk uses multiple data sources: one is the data coming from the instances, which provides the requests per second and HTTP status codes in the table shown; another is the CloudWatch metrics for your ELB. Since CloudWatch metrics for the ELB have 1-minute granularity, this data is slightly delayed, and the cause tells you how old the information is.
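If you want to confirm the spillover from a script rather than the console, a minimal boto3 sketch along these lines should work (the load balancer name is a placeholder):

```python
import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# 'awseb-my-env' is a hypothetical classic ELB name; substitute your own.
for metric, stat in [('SurgeQueueLength', 'Maximum'), ('SpilloverCount', 'Sum')]:
    resp = cloudwatch.get_metric_statistics(
        Namespace='AWS/ELB',
        MetricName=metric,
        Dimensions=[{'Name': 'LoadBalancerName', 'Value': 'awseb-my-env'}],
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=60,
        Statistics=[stat],
    )
    # Print one data point per minute, oldest first.
    for point in sorted(resp['Datapoints'], key=lambda p: p['Timestamp']):
        print(metric, point['Timestamp'], point[stat])
```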

Grace Period? - AWS EC2 Container Service and Elastic Load Balancers

When an elastic load balancer (ELB) is associated with an auto-scaling group, it is possible to specify a grace period during which new EC2 instances will not be terminated even if they are marked as unhealthy by the ELB. Is it possible to specify a similar grace period, during which new ECS tasks will not be killed and restarted by their associated ECS service, even if the ECS instance on which a task is running has been marked unhealthy by the ELB?
Update:
In our current use case, the Docker container being run as an ECS task contains a JBoss instance that loads a number of caches on startup. These caches can take several minutes to load. However, the ECS service registers the container instance with the ELB as soon as the container has started. This means that traffic can be routed to the new container before it is ready to accept it. We could increase the health check interval and the "healthy/unhealthy thresholds" on the ELB to prevent the ELB from routing traffic to the instance and the ECS service from restarting the container until the caches have been loaded. However, increasing the health check interval and thresholds is not desirable, because if an instance is marked as unhealthy after the caches have been loaded, the ECS service should restart the container as soon as possible (which necessitates a shorter health check interval and smaller thresholds).
Thus, is it possible to apply a grace period during which traffic will not be routed to a new container by the ELB and the ECS service will not restart the container (even if it fails the health checks)? Or failing that, are there any suggestions regarding a solution for our use case?
In case anyone else finds themselves here via Google: in the linked support thread it is noted that this has since been added to AWS as healthCheckGracePeriodSeconds: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_CreateService.html#ECS-CreateService-request-healthCheckGracePeriodSeconds
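A minimal boto3 sketch of that parameter in use; all names and ARNs below are hypothetical, and Fargate-specific parameters are omitted for brevity:

```python
import boto3

ecs = boto3.client('ecs')

ecs.create_service(
    cluster='my-cluster',        # hypothetical cluster name
    serviceName='my-service',
    taskDefinition='my-task:1',
    desiredCount=2,
    loadBalancers=[{
        'targetGroupArn': 'arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abc123',
        'containerName': 'app',
        'containerPort': 8080,
    }],
    # Ignore ELB health check results for 5 minutes after a task starts,
    # giving a slow-starting app (e.g. one loading caches) time to come up.
    healthCheckGracePeriodSeconds=300,
    # For Fargate you would also need launchType and networkConfiguration.
)
```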
After a discussion with the support team, it turns out that ECS cannot support our current use case.
There is a workaround that solves one of the issues we are facing: create a separate, essential health-check container in the same ECS task as the actual application container. The purpose of the health-check container is to monitor the application container and determine when the application has started completely. If it detects that the application has failed to start, it exits, causing the ECS service to cycle the task. The ELB is then configured to perform its health checks against the health-check container, which always reports that it is up via the relevant port. This workaround prevents the ECS service from cycling the ECS task due to failed health checks. A rough sketch of such a sidecar follows.
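Purely as an illustration of the idea (not the poster's actual container), a minimal Python sidecar might look like this; the app URL, port, and deadline are all assumptions:

```python
# health_sidecar.py - illustrative sketch only; URL, port and deadline are assumptions.
import socket
import sys
import threading
import time
import urllib.request

APP_URL = 'http://localhost:8080/health'  # hypothetical app startup endpoint
HEALTH_PORT = 9000                        # port the ELB health check targets
STARTUP_DEADLINE = 600                    # seconds to allow the caches to load

def serve_health_checks():
    """Always answer the ELB's TCP health checks so the task is not cycled."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(('0.0.0.0', HEALTH_PORT))
    server.listen(5)
    while True:
        conn, _ = server.accept()
        conn.close()

def app_started():
    try:
        with urllib.request.urlopen(APP_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Report healthy to the ELB from the start, in a background thread.
threading.Thread(target=serve_health_checks, daemon=True).start()

# Meanwhile, watch the application container; if it never comes up, exit
# non-zero so ECS marks this essential container failed and cycles the task.
deadline = time.time() + STARTUP_DEADLINE
while not app_started():
    if time.time() > deadline:
        sys.exit(1)
    time.sleep(5)

# App is up; keep the sidecar (and its health check listener) alive.
threading.Event().wait()
```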
However, the ELB will begin routing traffic to the application container immediately. It will do so even if the application container is not yet ready to receive traffic (for example, because it is still waiting for a cache to load). Currently, there is no way to delay the ELB from sending traffic to the application container, as the ECS service provides no support for a grace period. We have managed to work around this issue by delivering messages to our application containers via SQS and only having them pull from the queue once their caches are fully loaded. However, we have future use cases (such as serving web requests) where this is not a feasible option. To this end, I intend to raise a feature request for the grace period.
As an aside, both Kubernetes (http://kubernetes.io/v1.0/docs/user-guide/walkthrough/k8s201.html#application-health-checking) and Marathon (https://mesosphere.github.io/marathon/docs/health-checks.html) already support this option for health checking, if someone reading this is happy not to use a managed service.
Use the env var ECS_CONTAINER_STOP_TIMEOUT.
See https://github.com/aws/amazon-ecs-agent/issues/126