AWS ELB - will it retry request if node fails? - amazon-web-services

I have an ELB and 3 nodes behind it.
Can someone please explain what ELB will do in these scenarios:
Client Request -> ELB -> Node1 fails in the middle of the request (ELB timeout)
Client Request -> ELB -> Node1 times out (server timeout, and the health check hasn't kicked in yet)
Particularly I'm wondering if ELB retries the request to another node?
I made a test and it doesn't seem to, but maybe there's a setting that I've missed.
Thanks,

This may have changed over time, but these days ELBs do retry requests that abort:
Either because of an Idle Timeout (60s by default);
Or because the instance went unhealthy due to failing health checks, with Connection Draining disabled (default is enabled)
However, this holds only if you haven't sent any response bytes yet. If you have sent incomplete headers, you will get a 408 Request Timeout. If you have sent full headers (but didn't send a body or got halfway through the body) the client will get the incomplete response as-is.
The experiments I've performed were all with a single HTTP request per connection. If you use Keep-Alive connections, the behavior might be different.

The AWS Elastic Load Balancing service uses Health Checks to identify healthy/unhealthy Amazon EC2 instances. If an instance is marked as Unhealthy, then no new traffic will be sent to that server. Once it is identified as Healthy, traffic will once again be sent to the instance.
If a request is sent to an instance and no response is received (either because the app fails or a timeout is triggered), the request will not be resent to that server or sent to another one. It would be up to the originator (e.g. a user or an app) to resend the request.
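If the originator has to resend the request itself, a client-side retry wrapper is one way to do it. The following is only a minimal sketch: `send` is a hypothetical callable that performs one HTTP request and returns the status code, and the set of methods treated as safe to repeat, the retryable status codes, and the backoff values are all assumptions of this example, not anything ELB prescribes.

```python
import time
from typing import Callable

# Methods assumed safe to repeat in this sketch (POST is deliberately excluded).
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}

# Gateway-style statuses this sketch treats as "the backend failed, try again".
RETRYABLE_STATUSES = {502, 503, 504}


def send_with_retry(
    send: Callable[[], int],
    method: str,
    max_attempts: int = 3,
    backoff_seconds: float = 0.5,
) -> int:
    """Call send() until it succeeds, fails non-retryably, or attempts run out.

    send is a hypothetical callable: it performs one request and returns the
    HTTP status code, raising TimeoutError or ConnectionError on failure.
    """
    # Non-idempotent requests get exactly one attempt.
    attempts = max(1, max_attempts) if method.upper() in IDEMPOTENT_METHODS else 1
    for attempt in range(1, attempts + 1):
        try:
            status = send()
            if status not in RETRYABLE_STATUSES or attempt == attempts:
                return status
        except (TimeoutError, ConnectionError):
            if attempt == attempts:
                raise
        # Exponential backoff before the next attempt.
        time.sleep(backoff_seconds * 2 ** (attempt - 1))
    raise RuntimeError("retry loop ended without a result")
```

Restricting retries to idempotent methods matters here: if a node died halfway through a POST, the client cannot know whether the write already happened.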

Related

How to monitor fargate ECS web app timeouts in CloudWatch?

I have a simple setup: Fargate ECS cluster with ALB, running web API.
I want to monitor (and ring alarms) the number of requests timed out for my web app. The only metric close to that I found in CloudWatch is AWS/ApplicationELB -> TargetResponseTime
But, it seems like requests that timed out from the ALB point of view are not recorded there at all.
How do you monitor ALB timeouts?
This answer only covers requests that time out from the ALB's point of view.
It is confusing because no metric has "timeout" in its name.
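When the client does not send data before the idle timeout expires, the ALB generates an HTTP 408 error code and internally increments HTTPCode_ELB_4XX_Count. When the target does not respond in time, the ALB generates an HTTP 504 instead, which is counted in HTTPCode_ELB_5XX_Count.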
From the Docs
The load balancer sends the HTTP code to the client, saves the request to the access log, and increments the HTTPCode_ELB_4XX_Count or HTTPCode_ELB_5XX_Count metric.
In my view you can set up a CloudWatch alarm to monitor the HTTPCode_ELB_4XX_Count metric and initiate an action (such as sending a notification to an email address) if the metric goes outside what you consider an acceptable range.
More details about the HTTPCode_ELB_4XX_Count -> https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html
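For illustration, here is a rough sketch of such an alarm using boto3's `put_metric_alarm`. The alarm name, threshold, period, load balancer dimension value, and SNS topic ARN are all placeholders I made up; only the namespace and metric names come from the text above. The builder is kept separate from the API call so it can be inspected without AWS credentials.

```python
def build_timeout_alarm(
    load_balancer_dimension: str,
    sns_topic_arn: str,
    metric_name: str = "HTTPCode_ELB_4XX_Count",
    threshold: int = 10,
) -> dict:
    """Build keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": "alb-" + metric_name + "-high",  # hypothetical naming scheme
        "Namespace": "AWS/ApplicationELB",
        "MetricName": metric_name,
        # The LoadBalancer dimension looks like "app/<name>/<id>".
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer_dimension}],
        "Statistic": "Sum",
        "Period": 60,              # seconds per datapoint (assumed)
        "EvaluationPeriods": 5,    # consecutive periods to evaluate (assumed)
        "Threshold": threshold,    # assumed acceptable count per period
        "ComparisonOperator": "GreaterThanThreshold",
        # Count metrics are typically absent when nothing happened,
        # so missing data is treated as healthy here.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Send the alarm definition to CloudWatch (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder works without boto3 installed

    boto3.client("cloudwatch").put_metric_alarm(**params)
```

Passing `metric_name="HTTPCode_ELB_5XX_Count"` gives the same alarm for the 5XX counter mentioned in the quoted docs, which may be the more relevant one if the load balancer answers 504 when the target is slow.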

I am unable to make Post API call in ALB

I have created an API that has two endpoints. I containerized that API and deployed that into the ECS Fargate container behind the Application Load Balancer.
End Points.
Get = Return the status of the API
Post = Insert data into the RDS.
api/v1/healthcheck is working
api/v1/insertRecord is not working => 502 bad Gateway
The problem I am running into is that I can get the health check response, but the POST API call fails with a 502 Bad Gateway error.
Target Group
My target group is pointed at the healthcheck endpoint so my ECS service stays up. Can someone please tell me where I am making a mistake?
The 502 (Bad Gateway) status code indicates that the server, while acting as a gateway or proxy, received an invalid response from an inbound server it accessed while attempting to fulfill the request. In other words, if the service returns an invalid or malformed response, the gateway returns a 502 instead of passing that nonsensical information on to the client.
Possible causes: taken from
Check the protocol and port number used in the REST call.
The load balancer received a TCP RST from the target when attempting to establish a connection.
The load balancer received an unexpected response from the target, such as "ICMP Destination unreachable (Host unreachable)", when attempting to establish a connection.
The target response is malformed or contains HTTP headers that are not valid.
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target.
You can enable CloudWatch logs for further debugging.

Google Load Balancer retrying the request on web server after timeout expires

We are using Google Load Balancer with a Tomcat server. We have set a specific timeout on the load balancer in the Cloud Console settings. Whenever a request takes longer than the timeout, GLB returns 502, which is expected.
Here the problem is -
Whenever a request takes longer than the given time, Tomcat receives the same request again exactly when the timeout expires, e.g. with a 30-second timeout we get the same request on Tomcat exactly 30 seconds later.
In the browser, the response time for the 502 is exactly twice the timeout. (It might be network turnaround time, but why is it always exactly double?)
I assume you are referring to HTTP(S) load balancing. In this scenario, a reverse proxy is sitting in front of your application, handling requests and forwarding them to your backends. This proxy (a GFE) will retry as documented:
HTTP(S) load balancing retries failed GET requests in certain circumstances, such as when the response timeout is exhausted. It does not retry failed POST requests. Retries are limited to two attempts. Retried requests only generate one log entry for the final response. Refer to Logging for more information.
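To make the arithmetic concrete, here is a toy model of the retry behavior quoted above. It is not how the GFE is implemented. It assumes "two attempts" means two attempts in total, that every attempt takes the same time on the backend, and that the proxy answers 502 when it gives up, as reported in the question. `simulate_proxy` is a name invented for this sketch.

```python
from typing import Tuple


def simulate_proxy(
    method: str,
    backend_seconds: float,
    timeout_seconds: float,
    max_attempts: int = 2,
) -> Tuple[int, float, int]:
    """Return (status, seconds the client waits, requests the backend sees)."""
    # Per the quoted docs, only GET is retried; POST gets a single attempt.
    attempts = max_attempts if method.upper() == "GET" else 1
    waited = 0.0
    for attempt in range(1, attempts + 1):
        if backend_seconds <= timeout_seconds:
            # Backend answered in time on this attempt.
            return 200, waited + backend_seconds, attempt
        # Proxy gives up on this attempt after the full timeout.
        waited += timeout_seconds
    return 502, waited, attempts
```

With a 30-second timeout and a backend that needs 45 seconds, `simulate_proxy("GET", 45, 30)` returns `(502, 60.0, 2)`: the backend sees the request twice and the client waits exactly double the timeout, which matches what the question describes. The same call with `"POST"` returns `(502, 30.0, 1)`.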

Why would AWS ELB (Elastic Load Balancer) sometimes return 504 (gateway timeout) right away?

ELB occasionally returns 504 to our clients right away (in under 1 second).
The problem is that it's totally random: when we repeat the request right away, it works as it should.
Does anyone have the same issue, or any idea about this?
Does this answer your question?
Troubleshooting Elastic Load Balancing: HTTP Errors
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause: The application takes longer to respond than the configured idle timeout.
Solution: Monitor the HTTPCode_ELB_5XX and Latency CloudWatch metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your back-end capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete.
Or this:
504 gateway timeout LB and EC2
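If you do enable access logs, a short script can pull out the 504 entries for review. This is a sketch that assumes the Classic Load Balancer access log layout (space-separated fields with the request and user agent in double quotes). The two sample lines are made up for illustration, and the field names in `FIELDS` are labels chosen for this example.

```python
import shlex
from typing import Dict, Iterable, List

# Field order assumed from the Classic Load Balancer access log format.
FIELDS = [
    "timestamp", "elb", "client", "backend",
    "request_processing_time", "backend_processing_time",
    "response_processing_time", "elb_status_code", "backend_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol",
]

# Made-up sample entries: one normal response and one 504.
SAMPLE_LINES = [
    '2015-05-13T23:39:43.945958Z my-loadbalancer 192.168.131.39:2817 '
    '10.0.0.1:80 0.000073 0.001048 0.000057 200 200 0 29 '
    '"GET http://www.example.com:80/ HTTP/1.1" "curl/7.38.0" - -',
    '2015-05-13T23:40:01.123456Z my-loadbalancer 192.168.131.39:2818 '
    '10.0.0.1:80 0.000080 -1 -1 504 0 0 0 '
    '"GET http://www.example.com:80/slow HTTP/1.1" "curl/7.38.0" - -',
]


def parse_line(line: str) -> Dict[str, str]:
    """Split one log line into named fields, honoring quoted strings."""
    return dict(zip(FIELDS, shlex.split(line)))


def find_status(lines: Iterable[str], status: str = "504") -> List[Dict[str, str]]:
    """Return the parsed entries whose load balancer status code matches."""
    return [e for e in map(parse_line, lines) if e.get("elb_status_code") == status]


if __name__ == "__main__":
    for entry in find_status(SAMPLE_LINES):
        print(entry["timestamp"], entry["backend"], entry["request"])
```

For 504s that come back in under a second, comparing the timestamps and the backend field of the matching entries is probably the quickest way to see whether the requests cluster on one instance.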

408 timeouts on my Amazon ELB

We are seeing a lot of 408 timeouts on our ELB access logs. Have come across this thread https://serverfault.com/questions/485063/getting-408-errors-on-our-logs-with-no-request-or-user-agent
and also https://forums.aws.amazon.com/thread.jspa?messageID=307846
These are just two sample threads I found but others suggest the same solutions with no joy.
We have set the web server timeout to be less than, equal to, and greater than the ELB idle timeout, all with the same result: our logs are polluted with these 408s. A bigger problem, though, is that they also throw off the average latency of our ELB, which is what we use to trigger our auto scaler.
We use Tomcat on our back-end instances. No logs appear on Tomcat to indicate a request was received, but the ELB still shows the requests as having timed out.
In our ELB access logs there is no back-end IP given for the 408s, so in my opinion the requests never got to an instance at all, but Amazon disagree :(.
Has anyone had this problem and found a reliable solution for it?
Following the suggestion of milsonspt in the linked thread, I added a virtual host to my server that listens on a different port instead of 80, so all health checks are executed on that host (replace CUSTOM_PORT with any port you want to use for the ELB health check).
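# Dedicated port for ELB health checks (replace CUSTOM_PORT)
Listen CUSTOM_PORT
<VirtualHost *:CUSTOM_PORT>
    # Write health check hits to their own log, rotated at 10M
    CustomLog "|/usr/sbin/rotatelogs /var/log/httpd/access_log_elb_health_check_rotated_%Y-%m-%d-%H_%M_%S 10M" combined
</VirtualHost>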
Make sure that the ELB does NOT have a listener on that port.
That configuration removed the 408 errors and logs all health checks to a separate file, so you get an uncluttered regular access log and a dedicated log for health checks.
This can happen when the ELB is waiting on the client for a complete request. If a partial request comes in with incomplete headers, the ELB just waits. It does nothing with the partial request headers and eventually responds with a 408 (request timeout) when the idle timeout expires on the TCP connection.
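As a rough illustration of what "incomplete headers" means, an HTTP/1.1 header section is only complete once the blank line (CRLF CRLF) has arrived. The toy model below is not ELB code. It just encodes the behavior described in this answer, with the 60-second default idle timeout mentioned earlier in the thread, and the function names are made up.

```python
def headers_complete(received: bytes) -> bool:
    """True once the blank line terminating the HTTP header section has arrived."""
    return b"\r\n\r\n" in received


def elb_action(received: bytes, idle_seconds: float, idle_timeout: float = 60.0) -> str:
    """Toy model: what the load balancer does with the bytes received so far."""
    if headers_complete(received):
        return "forward to backend"
    if idle_seconds >= idle_timeout:
        # Nothing was ever forwarded, so no backend appears in the access log.
        return "respond 408"
    return "keep waiting"


# A request whose final blank line never arrives, and the complete version.
PARTIAL_REQUEST = b"GET / HTTP/1.1\r\nHost: example.com\r\n"
FULL_REQUEST = PARTIAL_REQUEST + b"\r\n"
```

Under this model a client (or a health checker, or a browser's speculative connection) that opens a connection and sends nothing, or only part of the headers, ends up as a 408 without ever reaching a backend instance.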