How to monitor Fargate ECS web app timeouts in CloudWatch? - amazon-web-services

I have a simple setup: Fargate ECS cluster with ALB, running web API.
I want to monitor (and raise alarms on) the number of requests that time out for my web app. The only metric close to that I have found in CloudWatch is AWS/ApplicationELB -> TargetResponseTime
But it seems that requests which time out from the ALB's point of view are not recorded there at all.
How do you monitor ALB timeouts?

This answer covers timeouts only from the ALB's point of view.
It is confusing because there is no metric that is specifically named for, or contains, the word "timeout".
An ALB timeout generates an HTTP 408 error code, for which the ALB internally increments the HTTPCode_ELB_4XX_Count metric.
From the Docs
The load balancer sends the HTTP code to the client, saves the request to the access log, and increments the HTTPCode_ELB_4XX_Count or HTTPCode_ELB_5XX_Count metric.
In my view you can set up a CloudWatch alarm to monitor the HTTPCode_ELB_4XX_Count metric and initiate an action (such as sending a notification to an email address) if the metric goes outside what you consider an acceptable range.
More details about HTTPCode_ELB_4XX_Count: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html
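As a concrete starting point, here is a minimal boto3 sketch of such an alarm. The namespace and metric name come from the docs above; the load balancer dimension value, threshold, and SNS topic ARN are placeholders you would replace with your own.

```python
# Sketch: alarm on ALB-generated 4XX responses (which include 408 timeouts).
# Placeholder values: region, load balancer ARN suffix, threshold, SNS topic.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="alb-elb-4xx-spike",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_4XX_Count",
    Dimensions=[
        # Dimension value is the last part of the ALB ARN, e.g. "app/my-alb/1234567890abcdef"
        {"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"},
    ],
    Statistic="Sum",
    Period=300,                       # evaluate 5-minute windows
    EvaluationPeriods=1,
    Threshold=10,                     # alarm if more than 10 ELB-generated 4XX in 5 minutes
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no datapoints simply means no 4XX responses
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```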

Related

Notify all EC2 instances running in ASG

I have a microservice application with multiple instances running in an ASG. All of these instances maintain some internal state. The application exposes Actuator endpoints to refresh its state. I also have some applications running on-prem. The scenario is: on some event, I want to call the Actuator endpoints of the applications running in AWS so they refresh their state. The problem is, if I call the load-balanced URL, the call goes to only one instance. So I'm considering the solutions below.
Use SQS and let the on-prem app publish and the AWS app consume the message. But here too, only one instance will receive the message.
Use SNS, but the listeners are HTTP/S based, so the URL would remain the same and I think only one instance would receive the message. (AFAIK)
Any other solution? Please suggest.
Thanks
Use SNS but listeners are http/s based so URL would remain same so I think only one instance would receive the message. (AFAIK)
When using SNS each server would subscribe to the SNS topic, and when each server subscribes it would provide SNS with its direct HTTP(s) URL (not the load balancer URL). When SNS receives a message it would send it to each server that is currently subscribed. I'm not sure SNS will submit the request to the actuator endpoint in the correct format that your application needs though.
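To illustrate that "each server subscribes itself" idea, here is a hypothetical sketch: on startup, an instance discovers its own private address and subscribes its refresh URL to an SNS topic. The topic ARN, port, and endpoint path are assumptions, and SNS wraps messages in its own JSON envelope, so your app (or a small shim in front of it) would still have to confirm the subscription and unwrap notifications before hitting the Actuator endpoint.

```python
# Sketch: an instance subscribing its own (non-load-balanced) URL to an SNS topic.
import boto3
import urllib.request

# EC2 instance metadata gives the instance its own private IP (IMDSv1 shown for brevity)
private_ip = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/local-ipv4", timeout=2
).read().decode()

sns = boto3.client("sns", region_name="us-east-1")
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:refresh-state",   # placeholder topic
    Protocol="http",                                               # or "https"
    Endpoint=f"http://{private_ip}:8080/sns-refresh",              # hypothetical shim endpoint
)
```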
There are likely several solutions you could consider, including ones that won't require a code change, such as establishing a VPN connection between your on-premise applications and the VPC that contains your ASGs, which would allow you to invoke each machine's refresh endpoint by its unique private IP address.
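A hypothetical sketch of that direct-call approach: enumerate every instance in the ASG, resolve its private IP, and call each refresh endpoint in turn. It assumes network reachability (e.g. over the VPN) and that the ASG name, port, and path match your setup.

```python
# Sketch: call the refresh endpoint on every instance in an ASG by private IP.
import boto3
import urllib.request

asg = boto3.client("autoscaling", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

group = asg.describe_auto_scaling_groups(AutoScalingGroupNames=["my-app-asg"])
instance_ids = [i["InstanceId"]
                for g in group["AutoScalingGroups"]
                for i in g["Instances"]]

if instance_ids:
    reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
    for reservation in reservations:
        for instance in reservation["Instances"]:
            ip = instance.get("PrivateIpAddress")
            if ip:
                # Spring Boot's refresh actuator expects a POST
                req = urllib.request.Request(
                    f"http://{ip}:8080/actuator/refresh", method="POST", data=b""
                )
                urllib.request.urlopen(req, timeout=5)
```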
However, more simply, if you're using an AWS Classic ELB or ALB, then repeated calls to the load balancer URL should hit each machine running your application, provided enough calls to the refresh endpoint are made.
This may not meet your use case though, say if you must strictly limit refresh calls to one per endpoint; you'd have to experiment with your software and the load balancer's round-robin behavior.

How can I set up a Cloudwatch alarm for HTTP 4XX/5XX on an ECS service/task?

I'm trying to set up Cloudwatch alarms to monitor my application running in Amazon ECS. This web application runs in Docker containers, configured as an ECS service behind an application load balancer and inside an autoscaling group that can step up/down the number of running tasks.
I've been looking through the different namespaces and metrics that are available in CloudWatch but am not seeing quite what I'm looking for. If my application starts throwing a high number of HTTP 5XX errors, I want to know about it. Likewise, if my application were to throw a high number of HTTP 4XX errors, I want to know about that as well.
I see that there are metrics such as HTTPCode_ELB_4XX_Count and HTTPCode_ELB_5XX_Count on the load balancer, but this is not the same as application monitoring. The documentation for those specific metrics even states "This count does not include any response codes generated by the targets."
Which (if any) metrics will monitor the HTTP codes generated by the targets, in the context of an ECS service or task?
If you're using an Application Load Balancer for your application, it's very simple (a scripted equivalent is sketched after these steps):
Go to the EC2 dashboard.
Open the target group that is attached to your Docker containers.
Select the Monitoring tab.
Create an alarm there.
Select the 4XX or 5XX count metric.
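A hypothetical boto3 equivalent of those console steps: alarm on HTTPCode_Target_5XX_Count, which counts responses generated by your targets (the containers) rather than by the ALB itself. The target group and load balancer dimension values are placeholder ARN suffixes, and the threshold is an example.

```python
# Sketch: alarm on target-generated 5XX responses for an ECS service's target group.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="ecs-target-5xx",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",   # use HTTPCode_Target_4XX_Count for 4XX
    Dimensions=[
        {"Name": "TargetGroup", "Value": "targetgroup/my-ecs-tg/abcdef1234567890"},
        {"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"},
    ],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=20,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],
)
```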

How to get latency metric from AWS CloudWatch Application ELB?

Is there any way to get latency from AWS/ApplicationELB namespace? I know it is available in the AWS/ELB namespace, but I need it for AWS/ApplicationELB, as this is what I use.
The Latency metric on the Classic ELB is comparable to the TargetResponseTime metric on the ALB.
ELB Latency definition: (source)
The total time elapsed, in seconds, from the time the load balancer
sent the request to a registered instance until the instance started
to send the response headers.
ALB TargetResponseTime definition: (source)
The time elapsed, in seconds, after the request leaves the load
balancer until a response from the target is received. This is
equivalent to the target_processing_time field in the access logs.
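For completeness, a minimal boto3 sketch of pulling TargetResponseTime from the AWS/ApplicationELB namespace might look like this; the load balancer dimension value is a placeholder ARN suffix, and the statistic and window are just examples.

```python
# Sketch: query average TargetResponseTime for the last hour in 5-minute buckets.
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ApplicationELB",
    MetricName="TargetResponseTime",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],   # percentiles such as p95/p99 go in ExtendedStatistics instead
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 3), "seconds")
```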
Further Reading
AWS Documentation - CloudWatch Metrics for Your Application Load Balancer

AWS CloudWatch Web Server Metrics

I have a few EC2 instances with NGINX installed using both ports 80 and 443. The instances are serving different applications so I'm not using an ELB.
I would like to create a CloudWatch alarm to make sure port 80 is always returning 200 HTTP status code. I realize there are several commercial solutions for this such as New Relic, etc, but this is the task I have at hand at the moment.
None of the EC2 metrics look to be able to accomplish this, and I cannot use any ELB metrics since I have no ELB.
What's the best way to resolve this?
You can definitely do this manually: send a request yourself and publish the result as a custom metric directly to CloudWatch, then monitor that metric.
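A minimal sketch of that manual approach, assuming a probe script running somewhere with network access to the instance: it requests the app on port 80 and publishes a 0/1 availability datapoint to a custom namespace. The namespace, metric name, dimension, and URL are all placeholders.

```python
# Sketch: probe an HTTP endpoint and publish the result as a custom CloudWatch metric.
import boto3
import urllib.request

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def probe(url="http://my-app.example.com/"):
    try:
        status = urllib.request.urlopen(url, timeout=5).getcode()
        healthy = 1 if status == 200 else 0
    except Exception:
        # connection errors and non-2xx responses both count as unhealthy
        healthy = 0
    cloudwatch.put_metric_data(
        Namespace="Custom/WebServer",
        MetricData=[{
            "MetricName": "HttpStatus200",
            "Dimensions": [{"Name": "Host", "Value": "web-1"}],
            "Value": healthy,            # 1 = returned 200, 0 = anything else
            "Unit": "Count",
        }],
    )

probe()
```

You could then alarm on that custom metric, for example when its Minimum drops below 1 over a few consecutive periods.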
Or you could look into Route53 health checks. You might get away with just configuring a health check there if you are already using Route53:
http://docs.aws.amazon.com/Route53/latest/DeveloperGuide/dns-failover.html
Create a Route53 Health Check. Supported protocols are TCP, HTTP, and HTTPS.
The HTTP/S protocols support matching the response payload against a user-defined string, so you can react not only to connectivity problems but also to unexpected content being returned to users.
For more advanced monitoring, enable latency metrics, which collect TTFB (time to first byte) and SSL handshake times.
You can then create alarms to get alerts when one of your apps becomes inaccessible.
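Here is a hypothetical sketch of creating such a health check with boto3; the domain, path, and search string are placeholders. The check's HealthCheckStatus metric then appears in the AWS/Route53 namespace, where you can put an alarm on it.

```python
# Sketch: Route53 HTTP health check with response-body string matching.
import boto3
import uuid

route53 = boto3.client("route53")

route53.create_health_check(
    CallerReference=str(uuid.uuid4()),   # idempotency token for this request
    HealthCheckConfig={
        "Type": "HTTP_STR_MATCH",        # use plain "HTTP" if you don't need string matching
        "FullyQualifiedDomainName": "my-app.example.com",
        "Port": 80,
        "ResourcePath": "/",
        "SearchString": "Welcome",       # check also fails if this string is missing
        "RequestInterval": 30,
        "FailureThreshold": 3,
        "MeasureLatency": True,          # enables the TTFB / handshake latency metrics
    },
)
```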

Why would AWS ELB (Elastic Load Balancer) sometimes return 504 (gateway timeout) right away?

The ELB occasionally returns 504 to our clients right away (in under 1 second).
The problem is, it's totally random; when we repeat the request right away, it works as it should.
Does anyone have the same issue, or any ideas on this?
Does this answer your question:
Troubleshooting Elastic Load Balancing: HTTP Errors
HTTP 504: Gateway Timeout
Description: Indicates that the load balancer closed a connection because a request did not complete within the idle timeout period.
Cause: The application takes longer to respond than the configured idle timeout.
Solution: Monitor the HTTPCode_ELB_5XX and Latency CloudWatch metrics. If there is an increase in these metrics, it could be due to the application not responding within the idle timeout period. For details about the requests that are timing out, enable access logs on the load balancer and review the 504 response codes in the logs that are generated by Elastic Load Balancing. If necessary, you can increase your back-end capacity or increase the configured idle timeout so that lengthy operations (such as uploading a large file) can complete.
Or this:
504 gateway timeout LB and EC2
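If the idle timeout does turn out to be the cause, a hypothetical boto3 sketch of the "increase the configured idle timeout" remedy from the quoted Classic ELB troubleshooting guide could look like this; the load balancer name and timeout value are placeholders, and for an ALB you would instead set the idle_timeout.timeout_seconds attribute through the elbv2 API.

```python
# Sketch: raise the idle timeout on a Classic ELB so long-running requests can finish.
import boto3

elb = boto3.client("elb", region_name="us-east-1")

elb.modify_load_balancer_attributes(
    LoadBalancerName="my-classic-elb",
    LoadBalancerAttributes={
        "ConnectionSettings": {"IdleTimeout": 120}   # seconds; the default is 60
    },
)
```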