GCP Kubernetes - Health Check Fails in Load Balancer with NEG backends - google-cloud-platform

Here is what exists and works OK:
A Kubernetes cluster in Google Cloud with 8 deployed workloads - basically GraphQL microservices.
Each of the workloads has a service that exposes port 80 via a NEG (Network Endpoint Group), so each workload has its ClusterIP in the 10.12.0.0/20 network. All of the services live in a custom namespace "microservices".
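To make the setup concrete, here is a minimal sketch of what one such Service might look like, assuming the standalone NEG annotation is used; the service name, selector, NEG name and targetPort are placeholders, not taken from the original setup:
kubectl apply -n microservices -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: graphql-service-a            # placeholder name
  annotations:
    # Ask GKE to create a standalone NEG named "svc-a-neg" for port 80
    cloud.google.com/neg: '{"exposed_ports": {"80": {"name": "svc-a-neg"}}}'
spec:
  type: ClusterIP
  selector:
    app: graphql-service-a           # placeholder selector
  ports:
  - port: 80                         # the port exposed via the NEG
    targetPort: 8080                 # placeholder container port
EOF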
One of the workloads (API gateway) is exposed to the Internet via Global HTTP(S) Load Balancer. Its purpose is to handle all requests and route them to the right microservice.
Now, I needed to expose all of the workloads to the outside world so they can be reached individually without going through the gateway.
For this, I have created:
a Global Load Balancer, with backends added (which refer to the NEGs), routing configured (the URL path determines which workload a request goes to), and an external IP
a Health Check that is used by the Load Balancer for each of the backends
a firewall rule that allows traffic on TCP port 80 from the Google health check ranges 35.191.0.0/16 and 130.211.0.0/22 to all hosts in the network.
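For reference, a rough gcloud sketch of the pieces described above, with one backend as an example; all resource names, the zone and the balancing-mode values are placeholders, not taken from the original setup:
gcloud compute health-checks create http microservice-hc --port=80 --request-path=/
gcloud compute backend-services create svc-a-backend \
    --global --protocol=HTTP --health-checks=microservice-hc
gcloud compute backend-services add-backend svc-a-backend \
    --global \
    --network-endpoint-group=svc-a-neg \
    --network-endpoint-group-zone=europe-west1-b \
    --balancing-mode=RATE --max-rate-per-endpoint=100
gcloud compute firewall-rules create allow-gclb-health-checks \
    --network=default \
    --source-ranges=130.211.0.0/22,35.191.0.0/16 \
    --allow=tcp:80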
The problem: the Health Check fails and thus the load balancer does not work - it returns error 502.
What I checked:
logs show that the firewall rule allows traffic
logs for the Health Check show only the changes I make to it and no other activity, so I do not know what happens inside
connected via SSH to the VM which hosts the Kubernetes node and checked that the ClusterIPs (10.12.xx.xx) of each workload return HTTP status 200
connected via SSH to a VM created for test purposes; from this VM I cannot reach any of the ClusterIPs (10.12.xx.xx)
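Roughly, the last two checks looked like this (the ClusterIP below is a placeholder for one of the 10.12.xx.xx addresses):
CLUSTER_IP=10.12.0.21   # placeholder ClusterIP of one workload
# From the GKE node itself: returns 200
curl -s -o /dev/null -w '%{http_code}\n' http://$CLUSTER_IP/
# From the separate test VM: the same request never reaches the workload
curl -v --max-time 5 http://$CLUSTER_IP/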
It seems that for some reason traffic from the Health Check or my test VM does not get to the destination. What did I miss?

Related

How to add Cloud CDN to GCP VM? Always no load balancer available

I have a running Web server on Google Cloud. It's a Debian VM serving a few sites with low-ish traffic, but I don't like Cloudflare. So, Cloud CDN it is.
I created a load balancer with static IP.
I did all the steps from the guides I've found. But when it comes time to add an origin to Cloud CDN, no load balancer is available because it's "unhealthy", as seen by hovering over the yellow triangle on the LB status page: "1 backend service is unhealthy".
At this point, the only option is to choose Create a Load Balancer.
I've created several load balancers with different attributes, thinking that might be it, but no luck. They all get the "1 backend service is unhealthy" tag, and thus are unavailable.
---Edit below---
During LB creation, I don't see anything that would tell the LB about the VM, except in the certificate step (see below). Nowhere does it ask for any field that would point to the VM.
I created another LB just now, and here are those settings. It finishes, then it's marked unhealthy.
Type: HTTP(S) Load Balancing
Internet facing or internal only? From Internet to my VMs
(my VM is not listed in backend services, so I create one... is this the problem?)
Create backend service
Backend type: Instance group
Port numbers: 80,443
Enable Cloud CDN: checked
Health check: create new: https, check /
Simple host and path rule: checked
New Frontend IP and port
Protocol: HTTPS
IP: v4, static reserved and issued
Port: 443
Certificate: Create New: Create Google-managed certificate, mydomain.com and www.mydomain.com
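For reference, a rough gcloud sketch of what that wizard creates, assuming the Debian VM has first been put into an unmanaged instance group; all names and the zone are placeholders:
# Put the existing VM into an unmanaged instance group so a backend service can use it
gcloud compute instance-groups unmanaged create web-ig --zone=us-central1-a
gcloud compute instance-groups unmanaged add-instances web-ig \
    --zone=us-central1-a --instances=my-debian-vm
# Health check and CDN-enabled backend service pointing at that instance group
gcloud compute health-checks create https web-hc --port=443 --request-path=/
gcloud compute backend-services create web-backend \
    --global --protocol=HTTPS --health-checks=web-hc --enable-cdn
gcloud compute backend-services add-backend web-backend \
    --global --instance-group=web-ig --instance-group-zone=us-central1-a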
The load balancer's unhealthy state could mean that your LB's health check probe is unable to reach your backend service (your Debian VM in this case).
If your backend service looks good now, I think there is a problem with your firewall configuration.
Check whether your firewall rules allow the health check probes' IP address ranges.
Refer to the document below for more detailed information.
Required firewall rule
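As a quick sketch of that check, assuming the default network; the rule name and ports are placeholders:
# Do any existing rules allow the Google health check ranges?
gcloud compute firewall-rules list
# If not, create one that lets the probes reach the VM
gcloud compute firewall-rules create allow-gclb-health-checks \
    --network=default \
    --source-ranges=130.211.0.0/22,35.191.0.0/16 \
    --allow=tcp:80,tcp:443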

AWS: Using the application load balancer with ECS - Requests not reaching tasks

When I send requests using the ALB's DNS host, the listener's path, and the web service's endpoint path, I don't get a response within the expected timeframe. I've verified the services themselves by successfully sending requests directly to each of the tasks using their public IP addresses; they return successful responses.
For example:
The ALB's DNS entry: http://myapp-alb-11111111.us-west-1.elb.amazonaws.com
The web app, "abc", listens on port 80 for requests on "/api/health".
The web app is using "abc-svc/*" as the path in the listener.
The web app was assigned a public IP address of 10.88.77.66.
Sending a GET request to 'http://10.88.77.66/api/health' is successful.
Sending a GET request to 'http://myapp-alb-11111111.us-west-1.elb.amazonaws.com/abc-svc/api/health' does not return within several minutes, which is not expected behavior.
I've looked through the logs, but cannot find anything that is amiss. I'd appreciate any ideas or suggestions...
AWS CONFIGURATION
I have three docker images that are running in ECS. Each image is assigned to a separate service. Each service has a single task. Port 80 is open in the security group from the Internet to the ALB. Port 80 is open from the ALB to each task. The ALB's listener for port 80 is using path-based routing. There is a separate, unique path for each service. Each task contains a docker linux, Spring Boot 2, web service. Each web service's router has a "/api/health" route that expects a GET request with no parameters and returns a simple string. We are not using HTTPS or SSL at this time.
Thank you for your time and interest.
Mike
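As a sketch of the path-based listener rules described in the configuration above; the ARNs are placeholders and the path pattern is the one from the question:
LISTENER_ARN="arn:aws:elasticloadbalancing:us-west-1:123456789012:listener/app/myapp-alb/PLACEHOLDER"   # placeholder
TG_ARN="arn:aws:elasticloadbalancing:us-west-1:123456789012:targetgroup/abc-svc/PLACEHOLDER"            # placeholder
aws elbv2 create-rule \
    --listener-arn "$LISTENER_ARN" \
    --priority 10 \
    --conditions Field=path-pattern,Values='/abc-svc/*' \
    --actions Type=forward,TargetGroupArn="$TG_ARN"
# Note: the ALB forwards the original path unchanged, so a request to
# /abc-svc/api/health reaches the task as /abc-svc/api/health, not /api/health.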
There could be different reasons for this, but here are some common issues you can debug:
Check the health check for each target group under the LB's target groups; if a target group is unhealthy, the LB will never route traffic to it (a quick CLI check is sketched after this list)
Verify the target port is correct
Verify the target group is properly associated with the LB and is not showing as unused
Verify the LB security group
Check the response from the LB: a gateway timeout means the target is not reachable; service unavailable means the service is probably restarting
Check the service's event logs to see whether the service is in a steady state; if not, it means the tasks are restarting again and again
Check the service's deployment logs; if you see an unhealthy target group message, update the target group's health check path or expected status code
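A minimal sketch of that first check with the AWS CLI; the target group ARN is a placeholder:
TG_ARN="arn:aws:elasticloadbalancing:us-west-1:123456789012:targetgroup/abc-svc/PLACEHOLDER"   # placeholder
# Shows each registered target, its port, and why it is (un)healthy
aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --query 'TargetHealthDescriptions[].{id:Target.Id,port:Target.Port,state:TargetHealth.State,reason:TargetHealth.Reason}'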

aws ECS, ECS instance is not registered to ALB target group

I created an ECS service; it runs 1 ECS instance and I can see the instance is registered as a target of the load balancer.
Now I trigger the Auto Scaling Group (by just incrementing the desired instance count) to launch a new instance.
The instance is launched and added to the ECS cluster. (I can see it on ECS instances tab)
But the instance is not added to the ALB target. (I expect to see 2 instances in the following image, but I only see 1)
I can edit the Auto Scaling Group's target group like the following.
Then I see the following.
But the health check fails. It seems port 80 is not reachable.
Although I have port 80 open to the public in the security group for the instance. (Also, the instance created by the ECS service uses dynamic port mapping, but the instance created by the ASG does not.)
So the Auto Scaling Group can launch a new instance, but my load balancer never sends traffic to it.
I did try https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-unhealthy-checks-ecs/?nc1=h_ls and it shows I can connect from the host to the docker container with something like curl -v http://${IPADDR}/health.
So it must be the case that there's something wrong with host port 80 (the load balancer can't connect to it).
But the security group setting cannot be wrong either, because the working instance and this non-working instance use the same SG.
Edit
Because I used dynamic mapping, my webserver is running on some random port.
As you can see, the instance started by the ECS service has registered itself in the target group with a random port.
However, the instance started by the ASG has registered itself in the target group with port 80.
The instance will not be added to the target group if it's not healthy, so you need to fix the health check first.
On your first instance the mapped port is 32769, so I assume that if this is the same target group and the same application, the port on the new instance should be 32769 as well.
When you curl the IP endpoint (curl -I -v http://${IPADDR}/health), is the HTTP status code 200? If it is 200, the target should be healthy; if it is not 200, either fix the backend's status code or update the health check's expected HTTP status code.
I assume that you are also running ECS on both instances, so ECS creates a target group for each ECS service. Are you running some mix of services that requires a separate target group on the Auto Scaling group? If you are using dynamic ports, set the health check port to the traffic port.
Now, if we look at the official documentation on dynamic port mapping and the possible causes of an HTTP 502 Bad Gateway:
Dynamic port mapping is a feature of container instance in Amazon Elastic Container Service (Amazon ECS)
Dynamic port mapping with an Application Load Balancer makes it easier
to run multiple tasks on the same Amazon ECS service on an Amazon ECS
cluster.
With the Classic Load Balancer, you must statically map port numbers
on a container instance. The Classic Load Balancer does not allow you
to run multiple copies of a task on the same instance because the
ports conflict. An Application Load Balancer uses dynamic port mapping
so that you can run multiple tasks from a single service on the same
container instance.
The target group you created yourself will not work with dynamic ports; you have to bind the target group to the ECS service.
dynamic-port-mapping-ecs
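As a sketch, on the EC2 launch type a task requests dynamic port mapping by setting hostPort to 0 (or omitting it) in its port mappings; the family name and image below are placeholders:
aws ecs register-task-definition \
    --family web-dynamic-port \
    --container-definitions '[{
        "name": "web",
        "image": "myorg/web:latest",
        "memory": 256,
        "essential": true,
        "portMappings": [{"containerPort": 80, "hostPort": 0}]
    }]'
The ECS service then registers each running task in the target group with whatever host port was assigned (e.g. 32769), which is why a manually registered instance on port 80 does not line up with it.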
HTTP 502: Bad Gateway
Possible causes:
The load balancer received a TCP RST from the target when attempting to establish a connection.
The load balancer received an unexpected response from the target, such as "ICMP Destination unreachable (Host unreachable)", when attempting to establish a connection. Check whether traffic is allowed from the load balancer subnets to the targets on the target port.
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.
The target response is malformed or contains HTTP headers that are not valid.
The load balancer encountered an SSL handshake error or SSL handshake timeout (10 seconds) when connecting to a target.
The deregistration delay period elapsed for a request being handled by a target that was deregistered. Increase the delay period so that lengthy operations can complete.
http-502-issues
It seems you know the root cause, which is that port 80 is failing the health check and that's why the instance is never added to the ALB. Here is what you can try.
First, check that your service is listening on port 80 on the new host. You can use a tool like netcat:
nc -v localhost 80
Once you know that the service is listening, the recommended way to allow your ALB to connect to your host is to add a security group inbound rule on your instance that allows traffic from your ALB's security group on port 80.
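A sketch of that inbound rule via the CLI; both security group IDs are placeholders (--group-id is the instance's SG, --source-group is the ALB's SG):
aws ec2 authorize-security-group-ingress \
    --group-id sg-0aaa1111bbbb2222c \
    --protocol tcp --port 80 \
    --source-group sg-0ddd3333eeee4444f
# With dynamic port mapping, open the container instance's ephemeral port range
# (e.g. 32768-61000) to the ALB security group instead of port 80.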

Exposing Istio Ingress Gateway as NodePort to GKE and run health check

I'm running the Istio Ingress Gateway in a GKE cluster. The Service runs as a NodePort. I'd like to connect it to a Google backend service. However, we need a health check that must run against Istio. Do you know if Istio exposes any HTTP endpoint to run a health check and verify its status?
Per this installation guide, "Istio requires no changes to the application itself. Note that the application must use HTTP/1.1 or HTTP/2.0 protocol for all its HTTP traffic because the Envoy proxy doesn't support HTTP/1.0: it relies on headers that aren't present in HTTP/1.0 for routing."
The healthcheck doesn't necessarily run against Istio itself, but against the whole stack behind the IP addresses you configured for the load balancer backend service. It simply requires a 200 response on / when invoked with no host name.
You can configure this by installing a small service like httpbin as the default path for your gateway.
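A minimal sketch of that idea, assuming the httpbin sample is already deployed in the cluster and the stock istio-ingressgateway selector is used; the resource names are placeholders:
kubectl apply -f - <<'EOF'
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: default-gateway            # placeholder name
spec:
  selector:
    istio: ingressgateway          # the stock ingress gateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"                          # accept requests with any (or no) host name
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: healthcheck-default        # placeholder name
spec:
  hosts:
  - "*"
  gateways:
  - default-gateway
  http:
  - match:
    - uri:
        exact: /                   # the GCP health check probes "/"
    route:
    - destination:
        host: httpbin
        port:
          number: 8000             # httpbin sample's service port
EOF
With this in place, a GET / with no host header gets a 200 from httpbin, which satisfies the backend service's health check.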
You might also consider changing your Service to a LoadBalancer type, annotated to be internal to your network (no public IP). This will generate a Backend Service, complete with healthcheck, which you can borrow for your other load balancer. This method has worked for me with nesting load balancers (to migrate load) but not for a proxy like Google's IAP.

How to use unique health check port on an Application Load Balancer (Container Service) on AWS?

I have a domain that needs to be routed to both an Application Load Balancer and an EC2 instance depending on the URL path. The Application Load Balancer has a limit of 10 rules per ALB, and I need more.
So to workaround this limit of 10 URLs I would like to setup a request pipeline as follows:
ALB for domain.com -> Docker container with HAProxy with routing rules/reverse proxy -> routes to another ALB or EC2-instance
The setup is fine; I'm having problems with setting up HAProxy and its health check. I would like the ALB to health check on a different port than the traffic port. In HAProxy I can simply set up multiple frontends, one for the routing (port 80) and one for the health check (port 60000). But if I enter port 60000 in the ALB's target group, I can't deploy another service due to the dynamic port mapping.
Any ideas how to solve this? I'd rather not expose the health check on port 80 since that is reachable from the public internet, but if that's the only solution it's fine (but how would I do it?).
I ended up using monitor-uri as the health check. It's not ideal since it's exposed on port 80, but no secret info is shown there anyway.
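For reference, the monitor-uri approach might look roughly like this in the HAProxy frontend; the URI and backend name are placeholders:
cat >> /etc/haproxy/haproxy.cfg <<'EOF'
frontend http-in
    bind *:80
    # HAProxy answers these requests itself with 200 OK and never forwards them
    monitor-uri /haproxy-health
    default_backend routing_rules
EOF
The ALB target group's health check path then points at /haproxy-health on the traffic port.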