Istio 504 when ALB routes to an EKS node other than the one the Istio gateway is located on - istio

Bug Description
I'm using EKS (1.23) and ALB. ALB is terminating TLS with certs provided by ACM.
Using Terraform, I installed the following Helm charts in the EKS cluster:
istio-base
istiod
gateway
all at version 1.15.0.
Other things configured on the cluster:
aws_security_group_rules, both ingress and egress, on EKS nodes for ports 15000-15090
required k8s namespaces
required k8s ingress configuring ALB via alb-controller
required ACM certificates for ALB
required Route53 DNS entries
All of these are quite common, so I do not think there is anything unusual there; I have the same setup in multiple places without Istio.
I also added an httpbin Service and Deployment, plus the related Gateway and VirtualService (sketched below).
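For reference, a minimal sketch of what that Gateway and VirtualService look like (the hostname is the one used below; the names, namespace label selector and port numbers are assumptions, not copied from the real manifests):

apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: httpbin-gateway
  namespace: first
spec:
  selector:
    istio: internet-facing        # assumption: must match the labels on the ingress-gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP              # TLS is terminated at the ALB, so plain HTTP inside the cluster
    hosts:
    - "httpbin.somedomain.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: httpbin
  namespace: first
spec:
  hosts:
  - "httpbin.somedomain.com"
  gateways:
  - httpbin-gateway
  http:
  - route:
    - destination:
        host: httpbin
        port:
          number: 8000            # assumption: port of the httpbin Service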
In the ingress I have 2 paths configured (besides the ssl-redirect directive for the ALB):
/healthz/ready points to status-port
and / points to http2
The ingress-gateway Service is of type NodePort, as required for this kind of setup; a sketch of the ingress follows below.
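A rough sketch of that ingress (the annotations, certificate ARN and backend service name are assumptions; the real resource is generated via the alb-controller setup described above):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: istio-alb
  namespace: istio-ingress-internet-facing
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: instance            # NodePort targets, hence the ALB round-robins over nodes
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:eu-west-1:111111111111:certificate/example   # placeholder
    alb.ingress.kubernetes.io/healthcheck-path: /healthz/ready
spec:
  rules:
  - http:
      paths:
      - path: /healthz/ready
        pathType: Prefix
        backend:
          service:
            name: internet-facing          # the istio ingress-gateway Service
            port:
              name: status-port
      - path: /
        pathType: Prefix
        backend:
          service:
            name: internet-facing
            port:
              name: http2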
(Important) There are 2 nodes in the cluster.
AWS console Target Group details page shows that 2/2 targets are healthy.
So...
When I open https://httpbin.somedomain.com, every second request gets a 504 Gateway Timeout. When I open https://httpbin.somedomain.com/healthz/ready, I get 200 every time. When I increase the number of nodes in the cluster to 3, the 504 occurs for 2 out of 3 requests.
It's quite clear to me that it's related to the ALB round-robining over the machines... but why? status-port always returns 200.
Version
$ istioctl version
client version: 1.15.0
control plane version: 1.15.0
data plane version: 1.15.0 (3 proxies)
$ kubectl version --short
Client Version: v1.23.2
Server Version: v1.23.7-eks-4721010
$ helm version --short
v3.8.0+gd141386
Additional Information
$ istioctl bug-report
Target cluster context: v2-xxx
Running with the following config:
istio-namespace: istio-system
full-secrets: false
timeout (mins): 30
include: { }
exclude: { Namespaces: kube-node-lease,kube-public,kube-system,local-path-storage }
end-time: 2022-09-27 17:29:26.34498 +0200 CEST
Cluster endpoint: https://yyy.yl4.eu-west-1.eks.amazonaws.com
CLI version:
version.BuildInfo{Version:"1.15.0", GitRevision:"e3364ab424b70ca8ee1ca76cb0b3afb73476aaac", GolangVersion:"go1.19", BuildStatus:"Clean", GitTag:"1.15.0"}
The following Istio control plane revisions/versions were found in the cluster:
Revision default:
&version.MeshInfo{
{
Component: "pilot",
Info: version.BuildInfo{Version:"1.15.0", GitRevision:"e3364ab424b70ca8ee1ca76cb0b3afb73476aaac", GolangVersion:"go1.19", BuildStatus:"Clean", GitTag:"1.15.0"},
},
}
The following proxy revisions/versions were found in the cluster:
Revision default: Versions {1.15.0}
Fetching proxy logs for the following containers:
argocd//argo-cd-argocd-application-controller-0/application-controller
argocd/argo-cd-argocd-applicationset-controller/argo-cd-argocd-applicationset-controller-9dddcffbf-zrcgl/applicationset-controller
argocd/argo-cd-argocd-dex-server/argo-cd-argocd-dex-server-75c975ccb7-xmd82/dex-server
argocd/argo-cd-argocd-notifications-controller/argo-cd-argocd-notifications-controller-5854964cbf-z8nlr/notifications-controller
argocd/argo-cd-argocd-redis/argo-cd-argocd-redis-664b98cfd7-lndsf/argo-cd-argocd-redis
argocd/argo-cd-argocd-repo-server/argo-cd-argocd-repo-server-75f49f7ccf-xsblh/repo-server
argocd/argo-cd-argocd-server/argo-cd-argocd-server-6599d8d846-dqr6s/server
first/httpbin/httpbin-7bffdcffd-2klzj/httpbin
first/httpbin/httpbin-7bffdcffd-2klzj/istio-proxy
...
istio-ingress-internal/internal/internal-554ddcb684-kr52c/istio-proxy
istio-ingress-internet-facing/internet-facing/internet-facing-555fd48d8d-2tx74/istio-proxy
istio-system/istiod/istiod-86cd5997bb-r6797/discovery
...
Fetching Istio control plane information from cluster.
Running istio analyze on all namespaces and report as below:
Analysis Report:
Info [IST0102] (Namespace argocd) The namespace is not enabled for Istio injection. Run 'kubectl label namespace argocd istio-injection=enabled' to enable it, or 'kubectl label namespace argocd istio-injection=disabled' to explicitly mark it as not needing injection.
Info [IST0102] (Namespace default) The namespace is not enabled for Istio injection. Run 'kubectl label namespace default istio-injection=enabled' to enable it, or 'kubectl label namespace default istio-injection=disabled' to explicitly mark it as not needing injection.
Info [IST0118] (Service argocd/argo-cd-argocd-applicationset-controller) Port name webhook (port: 7000, targetPort: webhook) doesn't follow the naming convention of Istio port.
...
Creating an archive at /Users/zzz/bug-report.tar.gz.
Cleaning up temporary files in /var/folders/l4/82mt4l7x4r5dzp1j4ppxqqzm0000gn/T/bug-report.
Done.
Original issue here

I solved it by allowing port 80 between the machines in the EKS node group. To be honest, I do not understand why it helps.
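My best guess: the ingress-gateway Service is a NodePort Service with the default externalTrafficPolicy: Cluster, so when the ALB sends a request to a node that does not host the gateway pod, kube-proxy forwards it to the gateway pod on the other node over the target port (80), and that inter-node hop is exactly what the security group was blocking. The health check never failed because the status port (15021) falls inside the 15000-15090 range that was already open. A rough sketch of the relevant part of the gateway Service (port numbers are the defaults of the gateway Helm chart as I understand them; the nodePort values are made up):

apiVersion: v1
kind: Service
metadata:
  name: internet-facing
  namespace: istio-ingress-internet-facing
spec:
  type: NodePort
  # externalTrafficPolicy defaults to Cluster, so any node may proxy to the gateway pod on another node
  ports:
  - name: status-port
    port: 15021
    targetPort: 15021   # inside the already-open 15000-15090 range, so health checks always pass
    nodePort: 31021     # made up
  - name: http2
    port: 80
    targetPort: 80      # inter-node traffic on this port was blocked until the extra rule was added
    nodePort: 30080     # made up
  - name: https
    port: 443
    targetPort: 443
    nodePort: 30443     # made up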

Related

Target health check fails - AWS Network Load Balancer

NOTE: I tried to include screenshots, but Stack Overflow does not allow me to add images with preview, so I included them as links.
I deployed a web app on AWS using kOps.
I have two nodes and set up a Network Load Balancer.
The target group of the NLB has two nodes (each node is an instance made from the same template).
Load balancer actually seems to be working after checking ingress-nginx-controller logs.
The requests are being distributed over the pods correctly, and I can access the service via the ingress external address.
But when I go to AWS Console / Target Group, one of the two nodes is marked as unhealthy, and I am concerned about that.
Nodes are running correctly.
I tried to execute sh into nginx-controller and tried curl to both nodes with their internal IP address.
For the healthy node, I get nginx response and for the unhealthy node, it times out.
I do not know why nginx is running on one of the nodes but not on the other.
Could anybody let me know the possible reasons?
I had exactly the same problem before, and this should be documented somewhere by AWS or Kubernetes. The answer below is copied from AWS Premium Support.
Short description
The NGINX Ingress Controller sets the spec.externalTrafficPolicy option to Local to preserve the client IP. Also, requests aren't routed to unhealthy worker nodes. The following troubleshooting assumes that you don't need to maintain the cluster IP address or preserve the client IP address.
Resolution
If you check the ingress controller service you will see the External Traffic Policy field set to Local.
$ kubectl -n ingress-nginx describe svc ingress-nginx-controller
Output:
Name: ingress-nginx-controller
Namespace: ingress-nginx
...
External Traffic Policy: Local
...
This Local setting drops packets that are sent to Kubernetes nodes that aren't running instances of the NGINX Ingress Controller. Assign NGINX pods (from the Kubernetes website) to the nodes that you want to schedule the NGINX Ingress Controller on.
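To keep Local, a sketch of the relevant fragment of the ingress-nginx-controller Deployment, assuming a hypothetical node label role=ingress that you first apply to the chosen nodes (kubectl label node <node-name> role=ingress):

spec:
  template:
    spec:
      nodeSelector:
        role: ingress   # hypothetical label; controller pods are only scheduled on nodes carrying it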
Alternatively, update the spec.externalTrafficPolicy option to Cluster:
$ kubectl -n ingress-nginx patch service ingress-nginx-controller -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
Output:
service/ingress-nginx-controller patched
By default, NodePort services perform source address translation (from the Kubernetes website). For NGINX, this means that the source IP of an HTTP request is always the IP address of the Kubernetes node that received the request. If you set the value of the externalTrafficPolicy field in the ingress-nginx service specification to Cluster, then you can't maintain the source IP address.

Using existing load balancer for K8S service

I have a simple app that I need to deploy in K8S (running on AWS EKS) and expose it to the outside world.
I know that I can add a service with the type LoadBalancer and voilà, K8S will create an AWS ALB for me.
spec:
  type: LoadBalancer
However, the issue is that it will create a new LB.
The main reason why this is an issue for me is that I am trying to separate infrastructure creation/upgrades from software deployment/upgrades. All of my infrastructure will be managed by Terraform, and all of my software will be defined via K8S YAML files (maybe Helm in the future).
And the creation of a load balancer (infrastructure) breaks this model.
Two questions:
Do I understand correctly that you can't change this behavior (create vs. use existing)?
I read multiple articles about K8S and all of them lead me into the direction of Ingress + Ingress Controller. Is this the way to solve this problem?
I am hesitant to go in this direction. There are tons of steps to get it working, and it will take time for me to figure out how to retrofit it into my Terraform and K8S YAML files.
Short answer: you can only change it to "NodePort" and couple it to the existing LB manually by registering the EKS nodes with the right exposed port, like this:
spec:
  type: NodePort
  externalTrafficPolicy: Cluster
  ports:
  - name: http
    port: 80
    protocol: TCP
    targetPort: http
    nodePort: 30080
But attaching it natively is not supported by the AWS Kubernetes controller yet, and supporting such behavior may not be a priority, because of configuration: controllers get their configuration from Kubernetes ConfigMaps or special CustomResourceDefinitions (CRDs), which would conflict with any manual configuration on the already-existing LB and might lead to wiping existing configs that are not tracked in the configuration source.
Q: Expose directly, or via an ingress overlay?
Note: Use an ingress (NGINX or AWS ALB) if you have more than one service to expose or you need to add controls on the exposed APIs; see the sketch below.
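As an illustration of that last point, a sketch of a single ALB Ingress fronting two hypothetical services (all names and ports are made up):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: apps
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: instance
spec:
  rules:
  - http:
      paths:
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: api            # hypothetical backend service
            port:
              number: 80
      - path: /
        pathType: Prefix
        backend:
          service:
            name: web            # hypothetical backend service
            port:
              number: 80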

Getting External-DNS to work with Ingress Objects in Kops AWS 1.7 K8s Cluster

I'm trying to figure out how to get this setup to work:
I am using Kube 1.7 (no RBAC) spun up from kops in AWS
I have a single nginx ingress controller for my entire cluster that is using a LoadBalancer service in the kube-system namespace, installed via Helm
I have cert-manager set up in kube-system, installed via Helm and using ClusterIssuers
I have external-dns set up in kube-system, installed via Helm
I have multiple applications, one per namespace, with associated Ingress objects in each namespace.
I am annotating the Ingresses with the appropriate annotations for both cert-manager (certmanager.k8s.io/cluster-issuer: letsencrypt-prod) and external-dns (dns.alpha.kubernetes.io/external: app.contoso.com)
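For example, one of those Ingress objects looks roughly like this (the app name, namespace, secret and service names are placeholders; the annotations are the ones mentioned above, and the extensions/v1beta1 API matches the Kubernetes 1.7 era):

apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: app
  namespace: app                                         # placeholder
  annotations:
    kubernetes.io/ingress.class: nginx
    certmanager.k8s.io/cluster-issuer: letsencrypt-prod
    dns.alpha.kubernetes.io/external: app.contoso.com
spec:
  tls:
  - hosts:
    - app.contoso.com
    secretName: app-tls                                  # placeholder, populated by cert-manager
  rules:
  - host: app.contoso.com
    http:
      paths:
      - path: /
        backend:
          serviceName: app                               # placeholder
          servicePort: 80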
In this scenario, cert-manager is reacting appropriately to the Ingress object (modifying it to complete the ACME challenge), but external-dns is not doing anything (logs are saying all hostnames are up to date). If I manually add a Route53 record for the ELB associated with the LB service, everything works as expected. Inspecting the Ingress object, I see that the status block looks like so:
status:
  loadBalancer:
    ingress:
    - {}
which I suppose is why external-dns isn't reacting? How do I get this to work? Per the documentation
More troubleshooting information (pod definitions, ingress definitions, controller logs, etc.) can be found here: https://gist.github.com/DWSR/f6d596850346223393bec23b289c9731
I solved this myself. The nginx ingress controller has a --publish-service command-line argument which causes it to update the status fields on the Ingress objects, which in turn causes external-dns to create the appropriate DNS records. When installing via Helm, simply set .Values.controller.publishService.enabled to true and this will take effect, for example:
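A values.yaml fragment along these lines should do it with the stable nginx-ingress chart (a sketch; check the chart's configuration reference linked below):

controller:
  publishService:
    enabled: true   # the controller writes the LB address into the Ingress status, which external-dns then picks up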
Sources:
https://github.com/kubernetes/ingress-nginx/blob/master/docs/user-guide/cli-arguments.md
https://github.com/kubernetes/charts/tree/master/stable/nginx-ingress#configuration

How to configure an AWS Elastic IP to point to an OpenShift Origin running pod?

We have set up OpenShift Origin on AWS using this handy guide. Our eventual
hope is to have some pods running REST or similar services that we can access
for development purposes. Thus, we don't need DNS or anything like that at this
point, just a public IP with open ports that points to one of our running pods.
Our first proof of concept is trying to get a jenkins (or even just httpd!) pod
that's running inside OpenShift to be exposed via an allocated Elastic IP.
I'm not a network engineer by any stretch, but I was able to successfully get
an Elastic IP connected to one of my OpenShift "worker" instances, which I
tested by sshing to the public IP allocated to the Elastic IP. At this point
we're struggling to figure out how to make a pod visible via that allocated
Elastic IP, however. We've tried a kubernetes LoadBalancer service, a kubernetes
Ingress, and configuring an AWS Network Load Balancer, all without being able to
successfully connect to 18.2XX.YYY.ZZZ:8080 (my public IP).
The most promising attempt was using oc port-forward, which seemed to get at least
part of the way there but frustratingly hangs without returning:
$ oc port-forward --loglevel=7 jenkins-2-c1hq2 8080 -n my-project
I0222 19:20:47.708145 73184 loader.go:354] Config loaded from file /home/username/.kube/config
I0222 19:20:47.708979 73184 round_trippers.go:383] GET https://ec2-18-2AA-BBB-CCC.us-east-2.compute.amazonaws.com:8443/api/v1/namespaces/my-project/pods/jenkins-2-c1hq2
....
I0222 19:20:47.758306 73184 round_trippers.go:390] Request Headers:
I0222 19:20:47.758311 73184 round_trippers.go:393] X-Stream-Protocol-Version: portforward.k8s.io
I0222 19:20:47.758316 73184 round_trippers.go:393] User-Agent: oc/v1.6.1+5115d708d7 (linux/amd64) kubernetes/fff65cf
I0222 19:20:47.758321 73184 round_trippers.go:393] Authorization: Bearer Pqg7xP_sawaeqB2ub17MyuWyFnwdFZC5Ny1f122iKh8
I0222 19:20:47.800941 73184 round_trippers.go:408] Response Status: 101 Switching Protocols in 42 milliseconds
I0222 19:20:47.800963 73184 round_trippers.go:408] Response Status: 101 Switching Protocols in 42 milliseconds
Forwarding from 127.0.0.1:8080 -> 8080
Forwarding from [::1]:8080 -> 8080
( oc port-forward hangs at this point and never returns)
We've found a lot of information about how to get this working under GKE, but
nothing that's really helpful for getting this working for OpenShift Origin on
AWS. Any ideas?
Update:
We realized that sysdig.com's blog post on deploying OpenShift Origin on AWS was missing some key AWS setup information, so based on OpenShift Origin's Configuring AWS page, we set the following env variables and re-ran the Ansible playbook:
$ export AWS_ACCESS_KEY_ID='AKIASTUFF'
$ export AWS_SECRET_ACCESS_KEY='STUFF'
$ export ec2_vpc_subnet='my_vpc_subnet'
$ ansible-playbook -c paramiko -i hosts openshift-ansible/playbooks/byo/config.yml --key-file ~/.ssh/my-aws-stack
I think this gets us closer, but creating a load-balancer service now gives us an always-pending IP:
$ oc get services
NAME CLUSTER-IP EXTERNAL-IP PORT(S) AGE
jenkins-lb 172.30.XX.YYY <pending> 8080:31338/TCP 12h
The section on AWS Applying Configuration Changes seems to imply I need to use AWS Instance IDs rather than hostnames to identify my nodes, but I tried this and OpenShift Origin fails to start if I use that method. Still at a loss.
It may not satisfy the "Elastic IP" part, but how about using the AWS cloud provider ELB to expose the IP/port of the pod via a Service with the LoadBalancer option?
Make sure to configure the AWS cloud provider for the cluster (see References below).
Create a svc to the pod(s) with type LoadBalancer.
For instance, to expose a Dashboard via an AWS ELB:
kind: Service
apiVersion: v1
metadata:
  labels:
    k8s-app: kubernetes-dashboard
  name: kubernetes-dashboard
  namespace: kube-system
spec:
  type: LoadBalancer   # <-----
  ports:
  - port: 443
    targetPort: 8443
  selector:
    k8s-app: kubernetes-dashboard
Then the svc will be exposed as an ELB, and the pod can be accessed via the ELB public DNS name, e.g. a53e5811bf08011e7bae306bb783bb15-953748093.us-west-1.elb.amazonaws.com.
$ kubectl (oc) get svc kubernetes-dashboard -n kube-system -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
kubernetes-dashboard LoadBalancer 10.100.96.203 a53e5811bf08011e7bae306bb783bb15-953748093.us-west-1.elb.amazonaws.com 443:31636/TCP 16m k8s-app=kubernetes-dashboard
References
K8S AWS Cloud Provider Notes
Reference Architecture OpenShift Container Platform on Amazon Web Services
DEPLOYING OPENSHIFT CONTAINER PLATFORM 3.5 ON AMAZON WEB SERVICES
Configuring for AWS
Check this guide out: https://github.com/dwmkerr/terraform-aws-openshift
It has some significant advantages vs. the one you are referring to in your post. Additionally, it has a clear Terraform spec that you can modify to use an Elastic IP (I haven't tried this myself, but it should work).
Another way to "lock" your access to the installation is to change how the public URL is assigned to the master instance in the Terraform script, e.g. point it at a domain that you own (the default script sets it to an external IP-based value with "xip.io" appended, which works great for testing). Then set up a basic ALB that forwards HTTPS 443 and 8443 to the master instance that the install creates (you can do this manually after the install is completed; you also need a second dummy subnet, and you have to dummy-up the health check as well) and link the ALB to your domain via Route53. You can even use free Route53 wildcard certs with this approach.

Kubernetes Multinode CoreOS guide doesn't create ELBs in AWS

The CoreOS Multinode Cluster guide appears to have a problem. When I create a cluster and configure connectivity, everything appears fine -- however, I'm unable to create an ELB by exposing a service:
$ kubectl expose rc my-nginx --port 80 --type=LoadBalancer
service "my-nginx" exposed
$ kubectl describe services
Name: my-nginx
Namespace: temp
Labels: run=my-nginx
Selector: run=my-nginx
Type: LoadBalancer
IP: 10.100.6.247
Port: <unnamed> 80/TCP
NodePort: <unnamed> 32224/TCP
Endpoints: 10.244.37.2:80,10.244.73.2:80
Session Affinity: None
No events.
The IP line that says 10.100.6.247 looks promising, but no ELB is actually created in my account. I can otherwise interact with the cluster just fine, so it seems bizarre. A "kubectl get services" listing is similar -- it shows the private IP (same as above) but the EXTERNAL_IP column is empty.
Ultimately, my goal is a solution that allows me to easily configure my VPC (ie. private subnets with NAT instances) and if I can get this working, it'd be easy enough to drop into CloudFormation since it's based on user-data. The official method of kube-up doesn't leave room for VPC-level customization in a repeatable way.
Unfortunately, that getting-started guide isn't nearly as up to date as the kube-up implementation. For instance, I don't see a --cloud-provider=aws flag anywhere, and the kubernetes-controller-manager would need that in order to know to call the AWS APIs.
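For reference, a sketch of what enabling that could look like if the controller manager runs as a hyperkube-style static pod, as the CoreOS guides of that era typically did (the image tag and the flags other than --cloud-provider are assumptions; the kubelet and API server typically need the same flag):

apiVersion: v1
kind: Pod
metadata:
  name: kube-controller-manager
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: kube-controller-manager
    image: quay.io/coreos/hyperkube:v1.1.2_coreos.0   # assumption: whatever hyperkube image the guide pins
    command:
    - /hyperkube
    - controller-manager
    - --master=http://127.0.0.1:8080
    - --cloud-provider=aws   # tells the controller manager to call the AWS APIs (e.g. to create ELBs for LoadBalancer services)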
You may want to check out the official CoreOS on AWS guide:
https://coreos.com/kubernetes/docs/latest/kubernetes-on-aws.html
If you hit a deadend or find a problem, I recommend asking in the AWS Special Interest Group forum:
https://groups.google.com/forum/#!forum/kubernetes-sig-aws