GCE/GKE NAT gateway route kills ssh connection - google-cloud-platform

I'm trying to set up a NAT gateway for Kubernetes nodes on GKE/GCE.
I followed the instructions in the tutorial (https://cloud.google.com/vpc/docs/special-configurations chapter: "Configure an instance as a NAT gateway") and also tried the Terraform tutorial (https://github.com/GoogleCloudPlatform/terraform-google-nat-gateway).
But with both tutorials (even in newly created Google projects) I get the same two errors:
The NAT isn't working at all. Traffic still goes out directly from the nodes.
I can't SSH into my GKE nodes: the connection times out. I already tried setting up a firewall rule with priority 100 that allows all tcp:22 traffic.
As soon as I tag the GKE node instances so that the configured route applies to them, the SSH connection is no longer possible.

You've already found the solution to the first problem: tag the nodes with the correct tag, or manually create a route targeting the instance group that is managing your GKE nodes.
Regarding the SSH issue:
This is answered under "Caveats" in the README for the NAT Gateway for GKE example in the terraform tutorial repo you linked (reproduced here to comply with StackOverflow rules).
The web console mentioned below uses the same ssh mechanism as kubectl exec internally. The short version is that as of time of posting it's not possible to both route all egress traffic through a NAT gateway and use kubectl exec to interact with pods running on a cluster.
Update 2018-09-25:
There is a workaround available if you only need to route specific traffic through the NAT gateway, for example, if you have a third party whose service requires whitelisting your IP address in their firewall.
Note that this workaround requires strong alerting and monitoring on your part as things will break if your vendor's public IP changes.
If you specify a strict destination IP range when creating your Route in GCP then only traffic bound for those addresses will be routed through the NAT Gateway. In our case we have several routes defined in our VPC network routing table, one for each of our vendor's public IP addresses.
In this case the various kubectl commands including exec and logs will continue to work as expected.
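To illustrate why a strict destination range leaves other traffic alone, here is a minimal sketch of route selection, assuming the most specific matching destination wins and priority only breaks ties. The route names, priorities, and the vendor address (taken from a documentation range) are hypothetical and not from the original setup.

```python
import ipaddress

# Hypothetical routing table: (destination range, next hop, priority).
# 203.0.113.10 stands in for a vendor's public IP; it is not a real vendor address.
ROUTES = [
    ("0.0.0.0/0", "default-internet-gateway", 1000),
    ("203.0.113.10/32", "nat-gateway", 800),
]


def pick_next_hop(destination, routes=ROUTES):
    """Return the next hop of the most specific route matching destination.

    Ties on prefix length are broken by the lowest priority value.
    """
    dest = ipaddress.ip_address(destination)
    matches = []
    for cidr, next_hop, priority in routes:
        network = ipaddress.ip_network(cidr)
        if dest in network:
            # Negate the prefix length so that min() prefers longer prefixes.
            matches.append((-network.prefixlen, priority, next_hop))
    if not matches:
        raise LookupError(f"no route to {destination}")
    return min(matches)[2]


print(pick_next_hop("203.0.113.10"))  # nat-gateway
print(pick_next_hop("8.8.8.8"))       # default-internet-gateway
```

Only the vendor address is sent to the NAT gateway; everything else, including the traffic kubectl exec relies on, keeps using the default route.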
A potential workaround is to use the command in the snippet below to connect to a node and use docker exec on the node to enter a container. This of course means you will need to first locate the node your pod is running on before jumping through the gateway onto the node and running docker exec.
Caveats
The web console SSH will no longer work, you have to jump through the NAT gateway machine to SSH into a GKE node:
eval ssh-agent $SHELL
ssh-add ~/.ssh/google_compute_engine
CLUSTER_NAME=dev
REGION=us-central1
gcloud compute ssh $(gcloud compute instances list --filter=name~nat-gateway-${REGION} --uri) --ssh-flag="-A" -- ssh $(gcloud compute instances list --filter=name~gke-${CLUSTER_NAME}- --limit=1 --format='value(name)') -o StrictHostKeyChecking=no
Source: https://github.com/GoogleCloudPlatform/terraform-google-nat-gateway/tree/master/examples/gke-nat-gateway

You can use kubeip to assign static external IP addresses to your GKE nodes:
https://blog.doit-intl.com/kubeip-automatically-assign-external-static-ips-to-your-gke-nodes-for-easier-whitelisting-without-2068eb9c14cd

Related

AWS App Runner service cannot access Internet when added to a VPC

I've set up an AWS App Runner service, which works fine. Currently for networking it's configured as public access, but I'd like to change this to a VPC so that I can connect the service to an RDS instance without having to open the database up to the world.
When I change the networking config to use my default security group, the service is unable to access the Internet. Cloning a git repo from Bitbucket brings up the error: ssh: Could not resolve hostname bitbucket.org: Try again
... and trying to run npm install brings up:
npm ERR! network request to https://registry.npmjs.org/gulp failed, reason: connect ETIMEDOUT 104.16.24.35:443
My security group has an outgoing rule allowing all traffic out to any destination. My RDS instance is in the same VPC/security group and I'm able to connect to this without issue (currently I've opened up port 3306 to the world). Everything else I've read from a bunch of Googling seems fine: route tables, internet gateways, firewall rules, etc.
Any help would be much appreciated!
Probably too late to be really helpful, but moving the App Runner service into a VPC sends all outgoing traffic to the VPC.
The two options given in the docs are:
Adding NAT gateways to each VPC
Setting up VPC endpoints
This is documented in the first bullet point of the "Considerations when selecting a subnet" section:
https://docs.aws.amazon.com/apprunner/latest/dg/network-vpc.html

AWS Loadbalancer is not accessible

I have a solution (AnzoGraph DB) deployed on my AWS Kubernetes cluster (EC2 instance), and it was working totally fine.
Suddenly this solution stopped and I could no longer access it via its DNS name.
I tested the solution deployed on my cluster using the kubectl port-forward command and the pods and services are working fine, so I assume the problem is with the AWS load balancer.
To access the application we need to go through this path:
Request -> DNS -> AWS Load Balancer -> Services -> Pods.
The load balancer is (classic) internal, so it's only accessible to me or the company over VPN.
Every time I try to access the DNS name, I get no response.
Any idea how I can fix it, or where exactly the issue is? How can I troubleshoot this issue and follow the traffic on AWS?
Thanks a lot for the help!
Sorry I missed your post earlier.
Let's start with a few questions...
You say you use k8s on AWS EC2. Do you actually use EKS, or do you run a different k8s stack?
Also, you mentioned that your (DB) client / your software reaches AnzoGraph DB by resolving the LB's DNS name.
I want to make sure that the software actually resolves the LB's DNS name every time. If you have a long-running service, AWS changes the IP address of the LB, and your software has cached the old IP, you will not be able to connect to the LB.
On the system where you run the software that accesses AnzoGraph DB (I assume CentOS 7):
make sure you have dig installed (yum install bind-utils)
dig {{ your DNS name of your LB }}
Is that actually the IP address your software is accessing?
Has the IP address of the client changed? Make sure the LB security group (SG) allows access
(https://docs.aws.amazon.com/elasticloadbalancing/latest/classic/elb-security-groups.html)
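The DNS check above can also be scripted. This is a minimal sketch, assuming port 443 and a hypothetical LB hostname and addresses; it compares the IP a client has cached with what the name resolves to right now. The `fake_resolver` is a stand-in so the example runs without network access; drop the `resolver` argument to use real DNS.

```python
import socket


def current_addresses(hostname, port=443, resolver=socket.getaddrinfo):
    """Resolve hostname now and return the set of IP addresses it maps to."""
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def is_stale(cached_ip, hostname, port=443, resolver=socket.getaddrinfo):
    """Return True if the cached IP is no longer among the resolved addresses."""
    return cached_ip not in current_addresses(hostname, port, resolver)


def fake_resolver(hostname, port, proto=0):
    # Offline stand-in for DNS that returns a getaddrinfo-shaped result.
    return [(socket.AF_INET, socket.SOCK_STREAM, proto, "", ("10.0.2.20", port))]


# Hypothetical LB name and addresses, for illustration only.
print(is_stale("10.0.1.15", "internal-lb.example.com", resolver=fake_resolver))  # True
print(is_stale("10.0.2.20", "internal-lb.example.com", resolver=fake_resolver))  # False
```

If the address your software is connecting to is reported as stale, the client is most likely holding on to an old LB IP.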
I assume you access the AnzoGraph DB frontend pod via 443?
As you write
"I tested the solution deployed on my cluster using kubectl port-forward command and they are working fine (the pods and services)"
we probably do not have to look at pod logs.
(If that were not the case, the LB would obviously block traffic as well.)
So I agree that the most likely issue is (bad) DNS caching, or the classic LB's SG rejecting a different source IP.
Also, for completeness, please tell us more about your environment:
AnzoGraph DB image
EKS/k8s version
helm chart / AnzoGraph operator used.
Best - Frank

Why is communication from GKE to a private ip in GCP not working?

I have what I think is a reasonably straightforward setup in Google Cloud - A GKE cluster, a Cloud SQL instance, and a "Click-To-Deploy" Kafka VM instance.
All of the resources are in the same VPC, with firewall rules to allow all traffic to the internal VPC CIDR blocks.
The pods in the GKE cluster have no problem accessing the Cloud SQL instance via its private IP address. But they can't seem to access the Kafka instance via its private IP address:
# kafkacat -L -b 10.1.100.2
% ERROR: Failed to acquire metadata: Local: Broker transport failure
I've launched another VM manually into the VPC, and it has no problem connecting to the Kafka instance:
# kafkacat -L -b 10.1.100.2
Metadata for all topics (from broker -1: 10.1.100.2:9092/bootstrap):
1 brokers:
broker 0 at ....us-east1-b.c.....internal:9092
1 topics:
topic "notifications" with 1 partitions:
partition 0, leader 0, replicas: 0, isrs: 0
I can't seem to see any real difference in the networking between the containers in GKE and the manually launched VM, especially since both can access the Cloud SQL instance at 10.10.0.3.
Where do I go looking for what's blocking the connection?
I can see that the error is network-related.
However, if you are using GKE on the same VPC network, make sure the Internal Load Balancer is configured properly. Note also that this feature is in beta, which means it is not yet guaranteed to work as expected. Another suggestion is to make sure you are not using any policy that might block the connection. I found the following article in the community that may help you solve it
This gave me what I needed: https://serverfault.com/a/924317
The networking rules in GCP still seem wonky to me coming from a long time working with AWS. I had rules that allowed anything in the VPC CIDR blocks to contact anything else in those same CIDR blocks, but that wasn't enough. Explicitly adding the worker nodes subnet as a source for a new rule opened it up.
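A rough way to picture the fix: a firewall rule only admits a packet whose source address falls inside one of the rule's source ranges. The sketch below uses hypothetical CIDR blocks and a hypothetical GKE source address; only 10.1.100.2 and 10.10.0.3 appear in the question, and the real ranges may differ.

```python
import ipaddress


def allowed_by_rule(source_ip, source_ranges):
    """Return True if source_ip falls inside any of the rule's source ranges."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in source_ranges)


# Hypothetical values, chosen only to illustrate the mismatch.
original_rule = ["10.1.0.0/16", "10.10.0.0/16"]  # ranges the first rule covered
worker_subnet = "10.20.0.0/20"                   # GKE worker nodes subnet
gke_source = "10.20.3.7"                         # address the Kafka VM sees

print(allowed_by_rule(gke_source, original_rule))                    # False
print(allowed_by_rule(gke_source, original_rule + [worker_subnet]))  # True
```

Comparing the source address that actually reaches the Kafka VM against the rule's source ranges shows quickly whether a range is missing.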

Accessing RDS from within a Docker container not getting through security group?

I'm attempting to run a webserver that uses an RDS database, with the webserver running inside a Docker container on EC2.
I've set up the security groups so the EC2 host's role is allowed to access the RDS, and if I try to access it from the host machine directly everything works correctly.
However, when I run a simple container on the host and attempt to access the RDS, it gets blocked as if the security group weren't letting it through. After a bunch of trial and error it seemed that the container's requests indeed don't appear to come from the EC2 host, so the firewall says no.
I was able to work around this in the short run by setting --net=host on the Docker container, but this breaks a lot of great Docker networking functionality like being able to map ports (i.e., now I need to make sure by hand that each instance of the container listens on a different port).
Has anyone found a way around this? It seems like a pretty big limitation to running containers in AWS if you're actually using any AWS resources.
Yes, containers do hit the public IPs of RDS. But you do not need to tune low-level Docker options to allow your containers to talk to RDS. The ECS cluster and the RDS instance have to be in the same VPC and then access can be configured through security groups. The easiest way to do this is to:
Navigate to the RDS instances page
Select the DB instance and drill in to see details
Click on the security group id
Navigate over to the Inbound tab and choose Edit
And ensure there is a rule of type MySQL/Aurora with source Custom
When entering the custom source, just start typing in the name of the ECS cluster and the security group name will be auto-completed for you
This tutorial has screenshots that illustrate where to go.
Full disclosure: This tutorial features containers from Bitnami and I work for Bitnami. However the thoughts expressed here are my own and not the opinion of Bitnami.
Figured out what was happening, posting here in case it helps anyone else.
Requests from within the container were hitting the public ip of the RDS rather than the private (which is how the security groups work). It looks like the DNS inside the docker container was using the 8.8.8.8 google dns and that wouldn't do the AWS black magic of turning the rds endpoint into the private ip.
So for instance:
DOCKER_OPTS="--dns 10.0.0.2 -H tcp://127.0.0.1:4243 -H unix:///var/run/docker.sock -g /mnt/docker"
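To confirm the diagnosis from inside a container, you can check whether the RDS endpoint resolves to a private address. This is a minimal sketch: the hostname is a hypothetical placeholder, and the lambda resolvers stand in for DNS so the example runs offline. Omit the `resolver` argument to use the container's real DNS settings.

```python
import ipaddress
import socket

# The RFC 1918 private IPv4 ranges.
RFC1918 = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def is_rfc1918(address):
    """Return True if address lies in one of the RFC 1918 private ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918)


def endpoint_resolves_privately(hostname, resolver=socket.gethostbyname):
    """Resolve hostname and report whether the answer is a private address."""
    return is_rfc1918(resolver(hostname))


# Hypothetical endpoint name and answers, for illustration only.
print(endpoint_resolves_privately("db.example.internal", resolver=lambda _h: "10.0.0.25"))    # True
print(endpoint_resolves_privately("db.example.internal", resolver=lambda _h: "54.12.34.56"))  # False
```

A False result from inside the container, while the host gets True, points at the container's DNS settings rather than at the security group itself.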
The inbound rule for the RDS should be set to the private IP of the EC2 instance rather than the public IPv4.
As @adamneilson mentions, setting the Docker options is your best bet. Here is how to discover your Amazon DNS server on the VPC. Also, the section "Enabling Docker Debug Output" in the Amazon EC2 Container Service Developer Guide troubleshooting chapter mentions where the Docker options file is.
Assuming you are running a VPC block of 10.0.0.0/24 the DNS would be 10.0.0.2.
For CentOS, Red Hat and Amazon:
sed -i -r 's/(^OPTIONS=\")/\1--dns 10.0.0.2 /g' /etc/sysconfig/docker
For Ubuntu and Debian:
sed -i -r 's/(^OPTIONS=\")/\1--dns 10.0.0.2 /g' /etc/default/docker
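The 10.0.0.2 value used above is the VPC network base address plus two. A small sketch of that arithmetic, assuming you pass the VPC's primary IPv4 CIDR block (the second CIDR below is just another example value):

```python
import ipaddress


def vpc_dns_server(vpc_cidr):
    """Return the VPC network base address plus two."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)


print(vpc_dns_server("10.0.0.0/24"))    # 10.0.0.2
print(vpc_dns_server("172.31.0.0/16"))  # 172.31.0.2
```

Use the result as the --dns value in the sed commands if your VPC does not use the 10.0.0.0/24 block.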
When I tried to connect to AWS RDS from inside a Docker container, I got an "Access denied for user 'username'@'xxx.xx.xxx.x' (using password: YES)" error.
To solve this issue, I did the following two things:
I created a new user and granted it privileges.
-- Run in a MySQL client. The *.* scope is an assumption; narrow it to your schema.
CREATE USER 'newuser'@'%' IDENTIFIED BY 'password';
GRANT ALL ON *.* TO 'newuser'@'%';
FLUSH PRIVILEGES;
I added the public DNS server 8.8.8.8 to the Docker container when starting it, so that the container can resolve the AWS RDS domain name to an IP address.
$ docker run --name backend-app --dns=8.8.8.8 -p 8000:8000 -d backend-app
Then I connected from inside the Docker container to AWS RDS successfully.
Note: I first tried the second step on its own, but it did not solve the connection problem. The connection only succeeded once I had applied both.

Public IP on service for AWS in Kubernetes fails

I started a cluster in AWS following the guides and then went on to follow the guestbook example. The problem I have is accessing it externally. I set the PublicIP to the EC2 public IP and then use that IP to access it in the browser on port 8000, as specified in the guide.
Nothing showed. To make sure it was actually the service that wasn't showing anything, I then removed the service and set a host port of 8000. When I went to the EC2 instance IP I could access it correctly. So it seems there is a problem with my setup. The one thing I can think of is that I am inside a VPC with an internet gateway. I didn't include any of the JSON files I used, because they are almost exactly the same as the guestbook example, with a few changes to allow my EC2 PublicIP and a few changes for the VPC.
On AWS you have to use your PRIVATE ip address with Kubernetes' services, since your instance is not aware of its public ip. The NAT-ing on amazon's side is done in such a way that your service will be accessible using this configuration.
Update: please note that the possibility to set the public IP of a service explicitly was removed in the v1 API, so this issue is not relevant anymore.
Please check the following documentation page for workarounds: https://kubernetes.io/docs/user-guide/services/