How can I delay AWS ALB health check until all services have been started on the newly created EC2 instance by autoscaling?
My current health check points to a login page on the app server, but some services are not fully up when the health check starts returning success. Is there a way to add a 2-minute delay before the load balancer starts the health check, giving the newly created instance time to load all services?
I don't think there is a direct way to do this. You can configure the health check interval and the healthy threshold to meet your requirements.
For example, you can set the interval to 30s and the healthy threshold to 5, so that the target is only considered healthy (and starts receiving traffic) after 30s x 5 = 150s.
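As a minimal boto3 sketch of that workaround (the target group ARN below is a placeholder, not from the question):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Stretch the health check so roughly interval x healthy-threshold seconds
# pass before a newly registered target is marked healthy.
elbv2.modify_target_group(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-app/1234567890abcdef",
    HealthCheckIntervalSeconds=30,   # probe every 30 seconds
    HealthyThresholdCount=5,         # require 5 consecutive successes
)
# 30s x 5 = 150s before the target starts receiving traffic
```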
Related
On AWS, I created an Auto Scaling group with a scaling policy that adds a new instance when the Application Load Balancer's Average Request Count Per Target goes above 5.
The metric tracks the number of HTTP requests sent through the Load Balancer to the target group.
The ASG is set to min 1, max 10 and desired 1.
I sent 200 requests to the load balancer and recorded, in a database, the IP of the instance that served each request. I found that most of the requests were sent to the same instance, some of them received a 504 Gateway Timeout, and a few received no response at all.
The ASG does launch new instances, but only after the requests have already been sent, so the new instances receive nothing from the load balancer.
I think the reason is that CloudWatch only reports the average request count per target at one-minute (or longer) granularity, and launching a new instance takes longer than the request timeout.
Q: Is there a way to keep the requests in a queue, or increase their timeout, until the new instances exist, and then distribute those requests across all instances instead of losing them?
Q: If a user sends many requests at the same time, I want the ASG to start scaling immediately and those requests to be distributed uniformly across the instances, keeping a specific average number of requests per instance.
The solution was to use Amazon Simple Queue Service (SQS). We forwarded the messages from API Gateway to a queue. Then a CloudWatch alarm was used to launch ECS Fargate tasks when the queue size was greater than 1, and those tasks read messages from the queue and processed them. When the queue was empty, another alarm set the number of tasks in the ECS service to 0.
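A rough boto3 sketch of the kind of "queue has messages" alarm described above (not the poster's exact setup; the queue name, alarm name, and alarm action ARN are placeholders, and the action would typically scale out the ECS service that drains the queue):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when there are visible messages waiting in the SQS queue.
cloudwatch.put_metric_alarm(
    AlarmName="queue-has-messages",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "my-work-queue"}],
    Statistic="Maximum",
    Period=60,                                  # evaluate once per minute
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanThreshold",  # fires when queue size > 1
    AlarmActions=["arn:aws:sns:...:scale-out-workers"],  # placeholder action
)
```

A second alarm on the same metric with a "less than or equal to 0" comparison could drive the scale-to-zero step mentioned above.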
I have an ECS service deployed on Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I am seeing the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.
But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This repeats multiple times (the number varies; today it took 7 hours for the rolling update to finish).
There are no infrastructure changes alongside the deployment. The security group allows traffic within the VPC, and the NLB and the ECS service are deployed into the same VPC and subnet.
The health check fails for the task N times with the same Docker image, but after that it starts passing.
The target group's healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port, and the interval is 30 seconds. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task comes up, I opened a security group rule for the VPN and tried accessing the task IP directly; I can reach the microservice that way.
So why is the NLB health check failing?
I had the exact same issue.
I simulated it with different images (Go, Python) because I suspected CPU/memory utilization overhead, but that turned out not to be the cause.
One mitigation is changing the Fargate deployment parameter Minimum healthy percent to 50% (it was 100% before, which seemed to cause the issue).
After the change, the failures became rare, but they still occurred.
The real root cause is still unknown; it seems to be something related to the NLB configuration in Fargate.
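A hedged boto3 sketch of that mitigation, lowering the deployment's minimum healthy percent so the rolling update can proceed with fewer healthy tasks (cluster and service names are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Relax the deployment configuration on the service.
ecs.update_service(
    cluster="my-cluster",
    service="my-service",
    deploymentConfiguration={
        "minimumHealthyPercent": 50,  # was 100, which seemed to trigger the issue
        "maximumPercent": 200,
    },
)
```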
I deployed my service on ECS Fargate containers using the rolling deployment method. There is an ALB associated with the task. During deployment, it deploys a new container and marks the current one as Inactive, then destroys the current one after 300 seconds, which is defined in the ALB's Deregistration delay field.
What I don't understand is: how many health checks does the ALB send to the new instance? When both the old and new tasks are running under the service, do both of them respond to requests from the load balancer, or only the new task? And if there is one unhealthy response, will the ALB roll back to the previous one?
I've been using AWS CodeDeploy to push our applications live, but it always takes ages doing the BlockTraffic and AllowTraffic steps. Currently, I have an Application Load Balancer (ALB) with three EC2 nodes initially (behind an Auto Scaling group). So if I do a CodeDeploy OneAtATime deployment, the whole process takes up to 25 minutes.
The load balancer I'm using had connection draining set to 300s, and I thought that was the reason for the slowness. However, I disabled connection draining and got the same results; I then re-enabled it with the timeout set to 5 seconds and still got the same results.
Further, I found out that CodeDeploy depends on the ALB health check settings. According to the AWS documentation:
After an instance is bound to the ALB, CodeDeploy waits for the status of the instance to be healthy ("inService") behind the load balancer. This health check is done by ALB and depends on the health check configuration.
So I tried setting low timeouts and thresholds in the health check settings, but even those changes didn't reduce the deployment time much.
Can someone direct me to a proper solution to speed up the process?
The issue is the de-registration of instances from the AWS target group. You want to lower the deregistration_delay.timeout_seconds attribute on the target group; by default it is 300 seconds, which is 5 minutes. The attribute is documented in the target group attributes docs.
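A minimal boto3 sketch of that change, lowering the deregistration delay so the BlockTraffic step doesn't wait the default 300 seconds (the target group ARN and the 30-second value are placeholders):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Lower the deregistration (connection draining) delay on the target group.
elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-app/1234567890abcdef",
    Attributes=[
        {"Key": "deregistration_delay.timeout_seconds", "Value": "30"},  # default is 300
    ],
)
```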
I am using an AWS ELB to report the status of my instances to an Auto Scaling group, so that a non-functional instance is terminated and replaced by a new one. The ELB is configured to ping TCP:3000 every 60 seconds, with a 10-second timeout counting as a health check failure. The unhealthy threshold is 5 consecutive failed checks.
However, the ELB always reports my instances as healthy and InService, even though I periodically come across an instance that is timing out and have to terminate it manually and launch a new one, with the ELB reporting it as InService the whole time.
Why does this happen?
After investigating a little, I found the problem:
I was trying to assess the health of the app through an API call to a web app running on the instance, waiting for the response to time out to declare the instance faulty. I needed to use HTTP as the health check protocol, calling port 3000 with a custom path through the load balancer, instead of TCP.
Note: the API needs to return a status code of 200 for the load balancer to consider the instance healthy. It now works perfectly.
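A sketch of that fix, assuming a Classic ELB as in the question: switch the health check from TCP:3000 to an HTTP check on port 3000 with a custom path that must return 200 (the load balancer name and /health path are placeholders):

```python
import boto3

elb = boto3.client("elb")  # Classic Load Balancer API

# Replace the TCP ping with an HTTP health check; only a 200 response counts as healthy.
elb.configure_health_check(
    LoadBalancerName="my-classic-elb",
    HealthCheck={
        "Target": "HTTP:3000/health",  # hypothetical health endpoint on the app
        "Interval": 60,
        "Timeout": 10,
        "UnhealthyThreshold": 5,
        "HealthyThreshold": 2,
    },
)
```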