Istio: failed calling admission webhook Address is not allowed

I am getting the following error while creating a gateway for the sample Bookinfo application:
Internal error occurred: failed calling admission webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: Address is not allowed
I have created an EKS PoC cluster in my dev AWS account using two node groups (each with two instances), one with t2.medium and the other with t2.large instances, across two /26 subnets, with the default VPC CNI provided by EKS.
But as the cluster grew, with multiple services running, I started running out of pod IPs (as per the docs, the default VPC CNI assigns pods IP addresses from the VPC subnets, just like EC2 instances).
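(For context: the VPC CNI caps pods per node at ENIs × (IPs per ENI − 1) + 2, which works out to 3 × 5 + 2 = 17 pods for a t2.medium and 3 × 11 + 2 = 35 for a t2.large, while each /26 subnet offers only 59 usable addresses after the 5 that AWS reserves, so the pool runs out quickly.)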
To work around this, I followed the post below to switch the networking from the default CNI to Weave, which resolved the IP availability issue:
https://medium.com/codeops/installing-weave-cni-on-aws-eks-51c2e6b7abc8
Now, after reconfiguring the networking from the VPC CNI to Weave, I have started getting the error in the subject line for my service mesh configured with Istio.
There are a couple of services running inside the mesh, and I have also integrated Kiali, Prometheus, and Jaeger with it.
I tried looking at GitHub (https://github.com/istio/istio/issues/9998) and the docs (https://istio.io/docs/ops/setup/validation/), but could not find a proper answer.
Let me know if anyone has faced this issue and has a partial or full solution.

This appears to be related to the switch from the AWS CNI to Weave. The AWS CNI uses the IP range of your VPC, while Weave uses its own address range for pods, so there may be leftover iptables rules from the AWS CNI, for example.
Internal error occurred: failed calling admission webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: Address is not allowed
The message above implies that whatever address istio-galley.istio-system.svc resolves to inside your K8s cluster is not considered a valid IP address. So I would also check what that name resolves to (it may be related to CoreDNS).
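One way to check, as a sketch (the busybox image and the throwaway pod name are just examples):
kubectl run dns-test --rm -it --image=busybox:1.28 --restart=Never -- nslookup istio-galley.istio-system.svc.cluster.local
kubectl -n istio-system get svc istio-galley    # compare against the ClusterIP Kubernetes reports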
You can also try the following steps (a shell sketch follows the list);
Basically, (quoted)
kubectl delete ds aws-node -n kube-system
delete /etc/cni/net.d/10-aws.conflist on each of the nodes
edit the instance security group to allow TCP on port 6783 and UDP on ports 6783 and 6784
flush iptables, nat, mangle, filter
restart kube-proxy pods
apply weave-net daemonset
delete existing pods so they get recreated in the Weave pod CIDR's address space.
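Roughly, as shell (a sketch assuming default EKS labels and the Weave install URL from its docs; run the node-local steps on every node and double-check each command against your setup):
kubectl -n kube-system delete ds aws-node
sudo rm /etc/cni/net.d/10-aws.conflist        # on each node
sudo iptables -t nat -F; sudo iptables -t mangle -F; sudo iptables -F    # flush nat, mangle, filter
kubectl -n kube-system delete pod -l k8s-app=kube-proxy    # kube-proxy pods get recreated
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
kubectl delete pod --all -n default           # repeat per application namespace so pods land in Weave's CIDR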
Furthermore, you can try reinstalling everything from the beginning using Weave.
Hope it helps!

"ERR_EMPTY_RESPONSE" - ShinyApp hosted over AWS (EC2 / EKS / ShinyProxy) does not work

Update #2:
I have checked the health status of my instances within the Auto Scaling group - there the instances are listed as "healthy" (screenshot added).
I followed this troubleshooting tutorial from AWS - without success:
Solution: Use the ELB health check for your Auto Scaling group. When you use the ELB health check, Auto Scaling determines the health status of your instances by checking the results of both the instance status check and the ELB health check. For more information, see Adding health checks to your Auto Scaling group in the Amazon EC2 Auto Scaling User Guide.
Update #1:
I found out that the two node instances are "OutOfService" (as seen in the screenshots below) because they are failing the health check from the load balancer - could this be the problem? And how do I solve it?
Thanks!
I am currently on the home stretch to host my ShinyApp on AWS.
To make the hosting scalable, I decided to use AWS - more precisely an EKS cluster.
For the creation I followed this tutorial: https://github.com/z0ph/ShinyProxyOnEKS
So far everything worked, except for the last step: "When accessing the load balancer address and port, the login interface of ShinyProxy can be displayed normally."
The load balancer gives me the following error message as soon as I try to call it with the corresponding port: ERR_EMPTY_RESPONSE.
I have to admit that I am currently a bit lost and lack a starting point for where the error could be.
I was already able to host the Shiny sample application in the cluster (step 3.2 in the tutorial), so it must be somehow due to shinyproxy, kubernetes proxy or the loadbalancer itself.
I link you to the following information below:
Overview EC2 Instances (Workspace + Cluster Nodes)
Overview Loadbalancer
Overview Repositories
Dockerfile ShinyProxy (source: https://github.com/openanalytics/shinyproxy-config-examples/tree/master/03-containerized-kubernetes)
Dockerfile Kubernetes Proxy (source: https://github.com/openanalytics/shinyproxy-config-examples/tree/master/03-containerized-kubernetes - Fork)
Dockerfile ShinyApp (sample application)
I have blacked out some of the information to be on the safe side - if any of it is important, please let me know.
If you need anything else I haven't thought of, just give me a hint!
And please excuse the confusing question and formatting - I just don't know how to word / present it better. Sorry!
Many thanks and best regards
The following files are 1:1 from the tutorial:
application.yaml (shinyproxy)
sp-authorization.yaml
sp-deployment.yaml
sp-service.yaml
Health-Status in the AutoScaling-Group
Unfortunately, there is a known issue in AWS: "externalTrafficPolicy: Local with Type: LoadBalancer AWS NLB health checks failing" (kubernetes/kubernetes issue #80579).
The issue was closed with "Closing this for now since it's a known issue".
As per k8s manual:
.spec.externalTrafficPolicy - denotes if this Service desires to route external traffic to node-local or cluster-wide endpoints. There are two available options: Cluster (default) and Local. Cluster obscures the client source IP and may cause a second hop to another node, but should have good overall load-spreading. Local preserves the client source IP and avoids a second hop for LoadBalancer and NodePort type Services, but risks potentially imbalanced traffic spreading.
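If preserving the client source IP is not a requirement for ShinyProxy, a quick mitigation is to switch the policy to Cluster; a sketch, assuming the Service from sp-service.yaml is named sp-service in the default namespace (adjust both):
kubectl get svc sp-service -o jsonpath='{.spec.externalTrafficPolicy}'
kubectl patch svc sp-service -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'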
Alternatively, you may try to fix the Local policy as described in this answer.
Upd:
This is actually a known limitation where the AWS cloud provider does not allow for --hostname-override, see #54482 for more details.
Upd 2: There is a workaround via patching kube-proxy:
As per AWS KB
A Network Load Balancer with externalTrafficPolicy set to Local (from the Kubernetes website), with a custom Amazon VPC DNS in the DHCP options set. To resolve this issue, patch kube-proxy with the hostname override flag.
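A sketch of that patch on a stock EKS kube-proxy DaemonSet (inspect yours first, as the flags differ between versions):
kubectl -n kube-system edit ds kube-proxy
# in the kube-proxy container spec, add an env var sourced from the node name
# and pass it via --hostname-override:
#   env:
#   - name: NODE_NAME
#     valueFrom:
#       fieldRef:
#         fieldPath: spec.nodeName
#   command:
#   - kube-proxy
#   - --hostname-override=$(NODE_NAME)
#   (keep the existing flags)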

metrics-server:v0.4.2 cannot scrape metrics inside AWS kubernetes cluster environment (cannot validate certificate, doesn't contain any IP SANs)

Situation:
The metrics-server deployment image is: k8s.gcr.io/metrics-server/metrics-server:v0.4.2
I have used kops tool to deploy a kubernetes cluster into one AWS account.
The error, and the reason why it is failing, fetched by:
kubectl -n kube-system logs metrics-server-bcc948649-dsnd6
unable to fully scrape metrics: [unable to fully scrape metrics from node ip-10-33-47-106.eu-central-1.compute.internal: unable to fetch metrics from node ip-10-33-47-106.eu-central-1.compute.internal: Get "https://10.33.47.106:10250/stats/summary?only_cpu_and_memory=true": x509: cannot validate certificate for 10.33.47.106 because it doesn't contain any IP SANs, unable to fully scrape metrics from node ip-10-33-50-109.eu-central-1.compute.internal: unable to fetch metrics from node ip-10-33-50-109.eu-central-1.compute.internal: Get "https://10.33.50.109:10250/stats/summary?only_cpu_and_memory=true": x509: cannot validate certificate for 10.33.50.109 because it doesn't contain any IP SANs]
I can solve this easily by modifying the metrics-server deployment template and adding the argument --kubelet-insecure-tls to the container args, but that does not seem like a production solution.
What I want to ask and learn here is: how can I resolve this the proper way, without losing security?
Kubelet certificates created by kOps contain only the node hostname among their SANs, while the metrics server deployed with the default manifest tries to use node private IPs for scraping. Changing the kubelet-preferred-address-types argument resolves this issue:
- --kubelet-preferred-address-types=Hostname
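The same change as a sketch (container args vary slightly between metrics-server versions, so check yours first):
kubectl -n kube-system edit deployment metrics-server
# in the container args, replace the existing --kubelet-preferred-address-types
# entry (or add one) so that it reads:
#   - --kubelet-preferred-address-types=Hostname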

Can't reach a pod from an eks node when using security group for pods

I am testing to use security group for pods following the tutorial in here: https://docs.aws.amazon.com/eks/latest/userguide/security-groups-for-pods.html
I was able to deploy it successfully: I can see my pods annotated with vpc.amazonaws.com/pod-eni:[eni etc], and I can confirm in the AWS console that a new ENI is created with the same private IP as the pod, with the selected security group attached to the created ENI.
For testing purposes, I have set this security group to accept all traffic, which means all my pods can reach each other on any port. I can also confirm that DNS resolution works from any pod, as I can reach services outside AWS (e.g. curl google / facebook). My only problem is that I can't reach a pod from the node it is running on (on any port). The weird part is that I can reach the pod from any other node on which it doesn't live (so in a 3-node EKS cluster, if pod "pod-A" runs on node1, I can reach "pod-A" from node2 and node3, but not from node1).
This is a problem because the kubelet on that node fails all the HTTP liveness/readiness checks, so my StatefulSet never comes up (I assume this would also be a problem for Deployments, although I haven't tried).
As I said, the security groups for pods are deployed successfully, but I am having a hard time understanding why I can't reach a pod from the same node, even though I have set All Traffic on that security group.
eks version: eks.3
kubernetes version: 1.17
cni: amazon-k8s-cni-init:v1.7.4
amazon-k8s-cni:v1.7.4
I was never going to figure that out on my own, so I asked in the AWS CNI repo as well, and they answered it:
https://github.com/aws/amazon-vpc-cni-k8s/issues/1260
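For future readers: the thread boils down to kubelet probes originating from the node's primary ENI and being dropped before they reach pods behind branch ENIs. The mitigation discussed there is disabling TCP early demux on the CNI init container; a sketch, to be verified against the linked issue and your CNI version:
kubectl -n kube-system patch daemonset aws-node -p '{"spec":{"template":{"spec":{"initContainers":[{"name":"aws-vpc-cni-init","env":[{"name":"DISABLE_TCP_EARLY_DEMUX","value":"true"}]}]}}}}'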

Why is communication from GKE to a private ip in GCP not working?

I have what I think is a reasonably straightforward setup in Google Cloud - a GKE cluster, a Cloud SQL instance, and a "Click-To-Deploy" Kafka VM instance.
All of the resources are in the same VPC, with firewall rules to allow all traffic to the internal VPC CIDR blocks.
The pods in the GKE cluster have no problem accessing the Cloud SQL instance via its private IP address. But they can't seem to access the Kafka instance via its private IP address:
# kafkacat -L -b 10.1.100.2
% ERROR: Failed to acquire metadata: Local: Broker transport failure
I've launched another VM manually into the VPC, and it has no problem connecting to the Kafka instance:
# kafkacat -L -b 10.1.100.2
Metadata for all topics (from broker -1: 10.1.100.2:9092/bootstrap):
1 brokers:
broker 0 at ....us-east1-b.c.....internal:9092
1 topics:
topic "notifications" with 1 partitions:
partition 0, leader 0, replicas: 0, isrs: 0
I can't seem to see any real difference in the networking between the containers in GKE and the manually launched VM, especially since both can access the Cloud SQL instance at 10.10.0.3.
Where do I go looking for what's blocking the connection?
I have seen that the error is related to the network.
However, if you are using GKE on the same VPC network, make sure the Internal Load Balancer is configured properly. Also, this product/feature is in BETA, which means it is not yet guaranteed to work as expected. Another suggestion is to make sure you are not using any policy that may block the connection. I found the following article in the community that may help you solve it.
This gave me what I needed: https://serverfault.com/a/924317
The networking rules in GCP still seem wonky to me, coming from a long time working with AWS. I had rules that allowed anything in the VPC CIDR blocks to contact anything else in those same blocks, but that wasn't enough. Explicitly adding the worker nodes' subnet as a source for a new rule opened it up.
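Depending on the cluster's masquerade settings, pod traffic can leave the node either with the pod IP (from the cluster's secondary range) or SNAT'd to the node IP, so a rule covering both the node subnet and the pod range is the safe option. A sketch with gcloud (the rule name, network name, and ranges are placeholders):
gcloud compute firewall-rules create allow-gke-to-kafka \
  --network=my-vpc \
  --source-ranges=<node-subnet-cidr>,<pod-range-cidr> \
  --allow=tcp:9092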

The ECS service with awsvpc cannot start due to ENI issues

I have got one service, on an ECS cluster with 2 instances of t3.small.
I cannot start the ECS task. The task has 2 containers (NGINX and PHP-FPM). NGINX exposes port 80 and PHP-FPM exposes ports 9000, 9001, 9002.
The error I can see:
dev-cluster/ecs-agents i-12345678901234567 2019-09-15T13:20:48Z [ERROR] Task engine [arn:aws:ecs:us-east-1:123456789012:task/ea1d6e4b-ff9f-4e0a-b77a-1698721faa5c]: unable to configure pause container namespace: cni setup: invoke bridge plugin failed: bridge ipam ADD: failed to execute plugin: ecs-ipam: getIPV4AddressFromDB commands: failed to get available ip from the db: getAvailableIP ipstore: failed to find available ip addresses in the subnet
ECS agent: 1.29.
Do you know how I can figure out what is wrong?
Here is logs snippet: https://pastebin.com/my620Kip
Task definition: https://pastebin.com/C5khX9Zy
UPDATE: My observations
Edited because my post below was deleted...
I recreated the cluster, and the problem disappeared.
Then I removed the application image from ECR and saw this error in the AWS web console:
CannotPullContainerError: Error response from daemon: manifest for 123456789123.dkr.ecr.us-east-1.amazonaws.com/application123:development-716b4e55dd3235f6548d645af9e463e744d3785f not found
Then I waited a few hours until the original issue happened again.
Then I restarted an instance manually with systemctl reboot and the problem disappeared again, but only for the restarted instance.
This issue appears when there are hundreds of awsvpc tasks on the cluster that cannot start.
I think this is a bug in the ECS agent: when we try to create too many containers that require an ENI, it tries to use all the free IPs in the subnet (255). I think that after restarting/recreating the EC2 instance some cache is cleared and the problem is solved.
Here is a similar solution I found today: https://github.com/aws/amazon-ecs-cni-plugins/issues/93#issuecomment-502099642
What do you think about it?
I am open to suggestions.
This is probably just a wild guess, but could it be that you simply don't have enough ENIs?
ENIs are quite limited (depending on the instance type):
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-eni.html
For instance, a t3.medium has only 3 ENIs, one of which is used for the primary network interface, leaving just 2 for tasks (your t3.small instances have the same limit). So I can imagine ECS tasks failing to start due to insufficient ENIs.
As mitigation, try ENI trunking:
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/container-instance-eni.html
This will multiply available ENIs per instance.
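Enabling it is an account-level opt-in plus a re-registration of the container instances; a sketch, assuming your CLI principal has the required permissions (check the trunking guide for the supported instance types, since not every family qualifies):
aws ecs put-account-setting-default --name awsvpcTrunking --value enabled
aws ecs list-account-settings --effective-settings --name awsvpcTrunking   # confirm the setting
Existing instances need to be relaunched or re-registered before the extra trunk ENIs become available.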