What is the reason for the AWS health status becoming RED?

I've deployed an application to AWS Elastic Beanstalk.
After starting the application, it runs well. But after 5 minutes (I set the health check to run every 5 minutes), it fails. When I access the URL, I get an HTTP 503 error back.
From the event info, I only see that the health status went from YELLOW to GREEN.
But how can I get detailed info and what can I do about this error?
BTW: what I don't understand is whether the RED health status causes the application to fail to start, or whether something else failed first, causing the application to fail and the health status to then become RED?

Elastic Load Balancing has a health check daemon that checks the path you've provided for a 200-range HTTP status.
If there is a problem with the application, if it's not returning a 2xx status code, or if you've misconfigured the health check URL, the status will go RED.
Two things you can do to see what's going on:
Hit the hostname of an individual instance in your web browser, particularly the health check path (see the sketch below). Are you seeing what you expected?
SSH into the instance and check the logs in /var/log and /opt/elasticbeanstalk/var/log. Are there any errors that you can find?
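For the first check, a quick probe of the health check path might look like the sketch below; the hostname and path are placeholders for your instance's public DNS name and the URL configured in the health check:

# Probe an instance's health check path directly and print the result.
# The hostname and path are placeholders -- substitute your own.
import urllib.error
import urllib.request

url = "http://ec2-203-0-113-10.compute-1.amazonaws.com/health"  # hypothetical

try:
    with urllib.request.urlopen(url, timeout=5) as resp:
        print(resp.status, resp.read()[:200])  # a healthy app returns 2xx
except urllib.error.HTTPError as e:
    print("HTTP error:", e.code)  # e.g. the 503 you are seeing
except OSError as e:
    print("Connection failed:", e)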
Without knowing more about your application, stack or container type, that's the best I can do.
I hope this helps! :)

Related

Airflow web-server produces temporary 502 errors in Cloud Composer

I'm encountering 502 errors on the Airflow (2.0.2) UI hosted in Cloud Composer (1.17.0).
Error: Server Error The server encountered a temporary error and could not complete your request.
Please try again in 30 seconds.
They last for a few minutes and happen several times a day; once the error is gone, everything works fine.
At the moment of the errors:
there is a gap in the logs, after which the logs resume with messages about starting gunicorn:
[1133] [INFO] Starting gunicorn 19.10.0
there is a spike in the web server's resource usage
I didn't spot any other suspicious activity in other parts of the system (workers, scheduler, DB)
I think this is the result of an OOM error, because we have DAGs with a big number of tasks (2k).
But I'd like to be sure, and I haven't found a way to connect to the App Engine VM in the tenant project (where the Airflow web server is hosted by default) to get additional logs.
Does anyone know a way to get additional logs from the Airflow web server VMs, or have any other ideas?
The Cloud Composer documentation has a Troubleshooting DAGs section. It shows how to check individual workers' logs, and it even mentions OOM issues (direct link).
The troubleshooting section is generally well documented, so you should be able to find a lot of useful information there. You can also use Cloud Monitoring and Cloud Logging to monitor Composer, but I am not sure if this will be valuable in this use case (reference).
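If what you mainly need is the web server logs themselves, one option is to pull them from Cloud Logging with the Python client. This is only a sketch under assumptions: the project ID is a placeholder, and it assumes Composer writes the web server logs under an "airflow-webserver" log name on the cloud_composer_environment resource (adjust the filter for your environment):

# Read recent Airflow web server log entries from Cloud Logging.
# Assumes the google-cloud-logging client library is installed and that the
# Composer environment exposes web server logs as "airflow-webserver".
from datetime import datetime, timedelta, timezone
from google.cloud import logging

client = logging.Client(project="my-gcp-project")  # hypothetical project ID

# Look at the last few hours around one of the 502 windows.
since = (datetime.now(timezone.utc) - timedelta(hours=6)).isoformat()
log_filter = (
    'resource.type="cloud_composer_environment" '
    'AND log_name:"airflow-webserver" '
    f'AND timestamp>="{since}"'
)

for entry in client.list_entries(filter_=log_filter, order_by=logging.DESCENDING):
    print(entry.timestamp, entry.payload)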

AWS Application Load balancer - HTTPCode_ELB_4XX_Count vs HTTPCode_Target_4XX_Count?

What is the actual difference between HTTPCode_ELB_4XX_Count and HTTPCode_Target_4XX_Count?
I understand that HTTPCode_ELB_4XX_Count is based on the HTTP code returned by the ELB and HTTPCode_Target_4XX_Count is based on the HTTP code returned by the target, but I'm still not able to understand how to analyze and troubleshoot the count for each metric.
Does this count depend on health check failures?
Based on the doc: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html
These are two metrics used to help diagnose where the fault lies for any 4XX errors:
HTTPCode_ELB_4XX_Count - the number of 4XX errors generated and returned by the load balancer itself.
HTTPCode_Target_4XX_Count - the number of 4XX errors returned by the targets, i.e. your application servers.
To get a gauge of what can trigger the ELB-generated errors, take a look at this page for a list of HTTP statuses mapped to the cause of the ELB error.
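To see which side is actually producing the errors, you can read both metrics side by side from CloudWatch. Below is a sketch with boto3; the region and the LoadBalancer dimension value (the trailing portion of your ALB's ARN) are placeholders:

# Compare load-balancer-generated vs target-generated 4XX counts over the
# last hour. The LoadBalancer dimension value is a placeholder, e.g.
# "app/my-alb/50dc6c495c0c9188" taken from your ALB's ARN.
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
dimensions = [{"Name": "LoadBalancer", "Value": "app/my-alb/50dc6c495c0c9188"}]

for metric in ("HTTPCode_ELB_4XX_Count", "HTTPCode_Target_4XX_Count"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=datetime.utcnow() - timedelta(hours=1),
        EndTime=datetime.utcnow(),
        Period=300,
        Statistics=["Sum"],
    )
    print(metric, sum(dp["Sum"] for dp in stats["Datapoints"]))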

aws ec2: Recently I got a high rate of "1/2 checks passed"; how can I find what's wrong with my instances?

I have 4 t3.micro instances running. They run fine at the beginning, but after a few days, 2 of them got the "1/2 checks passed" issue. I have to stop and start the instances to get them running again. Then, after another few days, 3 of them got the "1/2 checks passed" issue, which stopping and starting the instances solved again.
The failure rate is really high. How can I check what's wrong with the instances?
As per the AWS documentation, an instance status check may fail for the following reasons:
Failed system status checks
Incorrect networking or startup configuration
Exhausted memory
Corrupted file system
Incompatible kernel
You can view the status checks in the console, on the Status checks tab.
To resolve the issue, go through the link below, which describes troubleshooting instances with failed status checks:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/TroubleshootingInstances.html#InitialSteps
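You can also query the two checks programmatically to see which half of "1/2 checks passed" is failing. A sketch with boto3; the region and instance IDs are placeholders:

# Show system vs instance status checks for a set of instances.
# The instance ID and region are placeholders -- use your own.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_status(
    InstanceIds=["i-0123456789abcdef0"],
    IncludeAllInstances=True,  # also report instances that are not running
)

for status in resp["InstanceStatuses"]:
    print(
        status["InstanceId"],
        "system:", status["SystemStatus"]["Status"],      # AWS-side check
        "instance:", status["InstanceStatus"]["Status"],  # OS/network check
    )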

setting up custom HTTP Health Checks on GCP

I cannot figure out how to use a custom HTTP endpoint for health checks. Maybe I missed something, or GCP doesn't offer it yet.
The Elasticsearch health check page describes various ways to check the ES cluster.
I was looking at the GCP health checks interface, and it doesn't let us add a URL endpoint, nor does it let us define a parser for the health check to match against a "green" cluster status.
What I was able to do is wire in port 9200 and use a config like:
port: 9200, timeout: 5s, check interval: 60s, unhealthy threshold: 2 attempts
But this is not the way to go for an ES cluster, as the cluster may respond while being in a yellow/red state.
An easier way, without parsing the output, would be to add a timeout check like:
GET /_cluster/health?wait_for_status=yellow&timeout=50s
Note: this will wait up to 50 seconds for the cluster to reach yellow status (if it reaches green or yellow before the 50 seconds elapse, it returns at that point).
Any suggestions?
GCP health checks are simple and use the HTTP status code to determine if the check passes (200) - https://cloud.google.com/compute/docs/load-balancing/health-checks
What you can do is implement a simple HTTP service that queries ES's health check endpoint, parses the output, and decides whether to return status code 200 or something else.
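A minimal sketch of such a shim, assuming Elasticsearch is reachable on localhost:9200 and that green/yellow should count as healthy (both are assumptions to adjust):

# Tiny HTTP shim: translates the Elasticsearch cluster status into an HTTP
# status code that a plain GCP HTTP health check can understand.
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

ES_HEALTH_URL = "http://localhost:9200/_cluster/health"  # assumed ES endpoint

class HealthShim(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            with urllib.request.urlopen(ES_HEALTH_URL, timeout=5) as resp:
                status = json.load(resp).get("status")
        except OSError:
            status = None
        # Treat green/yellow as healthy; red or unreachable as unhealthy.
        self.send_response(200 if status in ("green", "yellow") else 503)
        self.end_headers()
        self.wfile.write(json.dumps({"es_status": status}).encode())

if __name__ == "__main__":
    # Point the GCP health check at this port (e.g. 8080) instead of 9200.
    HTTPServer(("0.0.0.0", 8080), HealthShim).serve_forever()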

Health check in Cloud Foundry

Does anyone know how I can tell my Cloud Foundry instance to monitor my health endpoint, so that when the endpoint reports that the app's health is not status: UP, the app is restarted?
The cf CLI 6.24.0 (released Feb 2017) exposed this type of health checking.
In your app manifest, use:
applications:
- name: myapp
  health-check-type: http
  health-check-http-endpoint: /admin/health
Your app needs to return a 200 status code from that path, or an error code when it's not status UP.
You can also use the cf set-health-check command to configure it on existing apps.
Check out this documentation for more details on the different health check types.
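The app just needs to answer that path with a 200 when it considers itself healthy and a non-2xx otherwise. A minimal sketch (Flask is used here purely for illustration; the readiness check is a placeholder for whatever "UP" means in your app):

# Hypothetical /admin/health endpoint matching the manifest above.
import os

from flask import Flask

app = Flask(__name__)

def app_is_up():
    return True  # placeholder: check DB connectivity, queues, etc.

@app.route("/admin/health")
def health():
    # 200 tells Cloud Foundry the instance is healthy; a non-2xx response
    # marks it unhealthy so the platform can restart it.
    return ("OK", 200) if app_is_up() else ("DOWN", 503)

if __name__ == "__main__":
    # Cloud Foundry tells the app which port to bind via the PORT env var.
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))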
If an app instance dies, Cloud Foundry, by default, will spin up a new instance and try to start it. That resiliency is built into Cloud Foundry.
Actuators are REST endpoints auto-injected into your (Spring Boot) app that allow you to see the app's status and health at runtime.
https://spring.io/guides/gs/actuator-service/
Try Actuators out.
I don't believe that custom URL health checking is available today in CF. If your application instance is no longer healthy and you want to restart it, you can call System.exit(1) and CF will restart it for you.
I've heard rumors of custom health checks possibly coming in the future with the CC V3 api and Diego.
The way to do a health check in PCF:
cf set-health-check APP-NAME <HEALTH-CHECK-TYPE> --endpoint <CUSTOM-HTTP-ENDPOINT>
HEALTH-CHECK-TYPE = process | port | http (ideally http for web apps)
CUSTOM-HTTP-ENDPOINT = /health
Reference: https://docs.cloudfoundry.org/devguide/deploy-apps/healthchecks.html