502 Bad Gateway error for nginx deployments when Istio automatic injection is enabled on the namespace - istio

https://github.com/istio/istio/issues/41233
I have a namespace 'eshop-bg-poc' in my Kubernetes cluster, and I have enabled Istio automatic injection on that namespace. I run one nginx deployment behind the service 'eshop-bgpoc-nginx', with a minimum of 3 replicas and a maximum of 5. However, when I redeploy nginx, I receive 502 Bad Gateway responses from nginx for a few seconds.
I tested the same scenario with Istio automatic injection disabled in the same namespace: redeploying nginx worked fine and I did not see any 502 Bad Gateway responses.
Note: I haven't applied any VirtualService or DestinationRule for nginx.
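For reference, namespace-level automatic injection as described above is normally turned on by labelling the namespace; a minimal sketch using the namespace name from the question:

kubectl label namespace eshop-bg-poc istio-injection=enabled
kubectl get namespace eshop-bg-poc --show-labels    # should list istio-injection=enabled

New pods created in the namespace after this point get the Envoy sidecar injected; existing pods only pick it up on the next restart or redeploy.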

Related

Getting 5xx errors with AWS Application Load Balancer - target group fluctuating between healthy and unhealthy

My web application on AWS EC2 + load balancer sometimes shows 500 errors. How do I know if the error is on the server side or the application side?
I am using a Route 53 domain and SSL on my URL. I set the ALB to redirect requests on port 80 to 443 and to forward requests on port 443 to the target group (the EC2). However, the target group sometimes returns a 5xx error code when handling requests. Please see the screenshots for the ALB metrics and configuration.
Screenshots: Target Group Metrics, Target Group Configuration, Load Balancer Metrics, Load Balancer Listeners, EC2 Metrics.
Right now the web application is running unsteadily; sometimes it returns a 502 or a 503 Service Unavailable (it looks like a connection timeout).
I have set the ALB idle timeout to 4000 seconds (see the ALB configuration screenshot).
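Incidentally, the idle timeout mentioned above can also be set from the CLI; a minimal sketch, where the load balancer ARN is a placeholder:

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:loadbalancer/app/my-alb/EXAMPLE \
  --attributes Key=idle_timeout.timeout_seconds,Value=4000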
The application uses Nuxt.js + PHP 7.0 + MySQL + Apache 2.4.54.
I have set the Apache prefork MPM MaxRequestWorkers (MaxClients) to 1000, which should be enough to handle the application's requests.
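For context, a prefork configuration along those lines would look roughly like the sketch below; only the 1000 value comes from the question, the remaining numbers are illustrative assumptions:

# Apache 2.4 prefork MPM, sketch only
<IfModule mpm_prefork_module>
    ServerLimit          1000
    MaxRequestWorkers    1000    # called MaxClients before Apache 2.4
    StartServers            5
    MinSpareServers         5
    MaxSpareServers        10
</IfModule>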
The EC2 instance is a t2.large; the CPU and memory look sufficient to handle the processing.
It seems that if I request the IP address directly instead of the domain, the number of 5xx errors drops significantly (but they still occur).
I also have a WordPress application hosted on this EC2 instance under a subdomain (CNAME). I have never encountered any 5xx errors on that subdomain, which makes me suspect the errors are in my application code rather than on the server side.
Is the 5xx error from my application or from the server?
I also tried to add another EC2 instance to the target group, to see whether having at least one healthy instance would help handle the requests. However, the application uses a third-party API with a strict IP whitelist policy, and from my research the Elastic IP I got from AWS cannot be attached to two different EC2 instances.
First of all, if your application is prone to stutters, increase the health check retries and timeouts; that addresses your initial problem of flapping target health.
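A minimal sketch of loosening those thresholds with the AWS CLI; the target group ARN is a placeholder and the numbers are only illustrative:

aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:REGION:ACCOUNT:targetgroup/my-targets/EXAMPLE \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 3 \
  --unhealthy-threshold-count 5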
From what I can see in your screenshots, most of your 5xx responses come from either the server or the application (you obviously know better which one is the culprit, since you have access to their logs).
To answer your question about 5xx errors coming from the LB itself: this happens right after the LB kicks out an unhealthy instance. If there is none to replace it (which shouldn't be the case, because you are supposed to have an ASG when you enable evaluation of target health for the LB), the LB can't produce a meaningful response and falls back to a 5xx.
This should be enough information for you to make adjustments and investigate the logs.

AWS EKS - Non-Istio-mesh pod-to-pod connection issue after installing Istio 1.13.0

In Kubernetes (AWS EKS) v1.20 I have a default namespace with two pods behind a Service of type LoadBalancer (CLB). Requesting the URI through the CLB worked fine and routed to either of the pods as required.
After installing Istio 1.13.0, with the istio-injection=enabled label set on a different namespace, communication between the non-Istio pods (no sidecar injection) no longer seems to work.
What I mean by "doesn't work" (the three scenarios below always worked without Istio):
1. curl requests sent to https://default-nspods/apicall always worked with the non-Istio pods, i.e. the CLB always forwarded requests to the two pods as required.
2. A curl request from inside pod1 to pod2's IP worked, and vice versa.
3. A curl request to pod2's URI from pod1's node (Node1) worked, and vice versa.
After the installation, scenarios 2 and 3 no longer work. The CLB also has trouble reaching the NodePort of the pods at times.
I've checked istioctl proxy-config endpoints on the deployments where sidecar injection is enabled, and the output doesn't show any of the non-mesh service/pod details.
Istio Version: 1.13.0
Ingress Gateway: Enabled (Loadbalancer mode)
Egress Gateway: Disabled
No add-ons like Kiali or Prometheus
Istio Operator-based installation with modified YAML values
Single-cluster installation, i.e. ISTIO_MESH_ROUTER_MODE='standard'
The Istio pods, Envoy sidecars, and proxy-config output don't show any errors.
I'm kind of stuck; please let me know if I need to check kube-proxy, iptables, or somewhere else.
I've uninstalled Istio using "istioctl x uninstall --purge" and re-installed it, but the non-mesh pods now seem broken whether Istio is installed or not.
The Istio pods and the pods in the injection-enabled namespace don't have issues.
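For reference, the checks mentioned above translate roughly into commands like the following; pod, namespace, and service names are placeholders:

# Envoy's view of endpoints, from a sidecar-injected pod
istioctl proxy-config endpoints <injected-pod> -n <injected-namespace>

# kube-proxy pods and logs on the affected nodes
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system logs <kube-proxy-pod>

# confirm the non-mesh service still resolves to the pod IPs, then check iptables on the node
kubectl get endpoints <non-mesh-service> -n default
sudo iptables-save | grep <non-mesh-service>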

HTTP ERROR 408 - After setting up Kubernetes, along with AWS ELB and NGINX Ingress

I am finding this issue extremely hard to debug. I have a Kubernetes cluster set up with Services for the pods; they are connected to the NGINX ingress, which sits behind an AWS Classic ELB, which in turn is attached to the Route 53 DNS record my domain name points at. All of that works fine, but I am facing an issue where my domains do not behave the way I would like them to.
My domains in the NGINX ingress rules are connected to a service that serves an 'alive' page when hit via the domain, but when I do that I get this page instead.
Please help me figure out what to do to resolve this quickly, thanks in advance!
Talk to you soon
When you run web servers behind an ELB, you should know that they generate a lot of 408 responses because of the ELB health checks.
Possible solutions:
1. Set RequestReadTimeout header=0 body=0
This disables the 408 responses if a request times out.
2. Disable logging for the ELB IP addresses with:
SetEnvIf Remote_Addr "10\.0\.0\.5" exclude_from_log
CustomLog logs/access_log common env=!exclude_from_log
3. Set up a different port for the ELB health check.
4. Adjust your request timeout to be higher than 60 seconds.
5. Make sure that the idle timeout configured on the Elastic Load Balancer is slightly lower than the idle timeout configured for the Apache httpd running on each of the instances (see the sketch after the links below).
Take a look: amazon-aws-http-408, haproxy-elb, 408-http-elb.
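A minimal httpd.conf sketch combining points 4 and 5, assuming an ELB idle timeout of 60 seconds (the values are illustrative, not taken from the question):

# keep Apache's timeouts above the ELB idle timeout of 60s
Timeout 120
KeepAlive On
KeepAliveTimeout 65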

AWS/EKS: Getting frequent 504 gateway timeout errors from ALB

I'm using EKS to deploy a service, with ingress running on top of alb-ingress-controller.
All in all I have about 10 replicas of a single pod, with a single Service of type NodePort that forwards traffic to them. The replicas run on 10 nodes, created with eksctl, and spread across 3 availability zones.
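For context, the Service layer described here would look roughly like the sketch below; the name, label, and ports are placeholders rather than values from the question:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: my-app            # placeholder name
spec:
  type: NodePort
  selector:
    app: my-app           # placeholder pod label
  ports:
    - port: 80
      targetPort: 8080    # placeholder container port
EOF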
The problem I'm seeing is very strange - inside the cluster, all the logs show that requests are being handled in less than 1s, mostly around 20-50 millis. I know this because I used linkerd to show the percentiles of request latencies, as well as the app logs themselves. However, the ALB logs/monitoring tell a very different story. I see a relatively high request latency (often approaching 20s or more), and often also 504 errors returned from the ELB (sometimes 2-3 every 5 minutes).
When trying to read the access logs for the ALB, I noticed that the 504 lines look like this:
https 2019-12-10T14:56:54.514487Z app/1d061b91-XXXXX-3e15/19297d73543adb87 52.207.101.56:41266 192.168.32.189:30246 0.000 -1 -1 504 - 747 308 "GET XXXXXXXX" "-" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2 arn:aws:elasticloadbalancing:eu-west-1:750977848747:targetgroup/1d061b91-358e2837024de757a3d/e59bbbdb58407de3 "Root=1-5defb1fa-cbcdd248dd043b5bf1221ad8" "XXXX" "XXXX" 1 2019-12-10T14:55:54.514000Z "forward" "-" "-" "192.168.32.189:30246" "-"
Here the request processing time is 0 and the target processing time is -1, indicating the request never made it to the backend and the response was returned immediately.
I tried to play with the backend HTTP keep-alive timeout (currently 75s) and with the ALB idle timeout (currently 60s), but nothing seems to change this behavior much.
If anyone can point me to how to proceed and investigate this, or what the cause can be, I'd appreciate it very much.
We faced a similar issue with the EKS and ALB combination. If the target processing time shows -1, there is a chance that the request queue is full on the target side, so the ALB immediately drops the request.
Try an ab benchmark that skips the ALB and sends requests directly to the service or to the private IP address. Doing this will help you identify where the problem is.
For us, 1 out of 10 requests failed when we sent traffic via the ALB. We saw no failures when we sent requests directly to the service.
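A minimal ab invocation for that comparison could look like this; the URLs, request count, and concurrency are placeholders:

# hit the service or pod directly, bypassing the ALB
ab -n 1000 -c 50 http://<service-or-pod-ip>:8080/
# repeat against the ALB DNS name and compare the failed-request counts
ab -n 1000 -c 50 https://<alb-dns-name>/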
AWS recommends using an NLB over the ALB; the NLB has more advantages and is better suited to Kubernetes. There is a blog post that explains this: Using a Network Load Balancer with the NGINX Ingress Controller on Amazon EKS.
We changed to an NLB and are no longer getting 5xx errors.

Exposing Istio Ingress Gateway as NodePort on GKE and running a health check

I'm running the Istio Ingress Gateway in a GKE cluster. The Service is of type NodePort. I'd like to connect it to a Google backend service; however, we need a health check that runs against Istio. Do you know whether Istio exposes any HTTP endpoint to run a health check and verify its status?
Per this installation guide, "Istio requires no changes to the application itself. Note that the application must use HTTP/1.1 or HTTP/2.0 protocol for all its HTTP traffic because the Envoy proxy doesn't support HTTP/1.0: it relies on headers that aren't present in HTTP/1.0 for routing."
The healthcheck doesn't necessarily run against Istio itself, but against the whole stack behind the IP addresses you configured for the load balancer backend service. It simply requires a 200 response on / when invoked with no host name.
You can configure this by installing a small service like httpbin as the default path for your gateway.
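A minimal sketch of that idea, assuming Istio's httpbin sample is deployed in the gateway's namespace (the resource names are placeholders; the selector is the default ingress gateway label):

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: health-gateway              # placeholder name
spec:
  selector:
    istio: ingressgateway           # default Istio ingress gateway selector
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "*"                       # accept requests with no host name
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: health-default              # placeholder name
spec:
  hosts:
    - "*"
  gateways:
    - health-gateway
  http:
    - match:
        - uri:
            exact: /                # the path the GCP health check probes
      route:
        - destination:
            host: httpbin
            port:
              number: 8000          # port of the httpbin sample Service
EOF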
You might also consider changing your Service to a LoadBalancer type, annotated to be internal to your network (no public IP). This will generate a Backend Service, complete with healthcheck, which you can borrow for your other load balancer. This method has worked for me with nesting load balancers (to migrate load) but not for a proxy like Google's IAP.