Unable to find the server at www.googleapis.com only within GCP - google-cloud-platform

I know there have been a few questions similar to this one, but in my case the issue only happens on GCP. We ran our services within AKS (Azure) for almost a year without a single occurrence. Right after we moved to GCP GKE, a few requests from our Python application started failing with the error: Unable to find the server at www.googleapis.com. Most requests work, so the failures seem random. I already tried increasing TCP timeouts and also the Minimum ports per VM instance setting in my Cloud NAT. We are running the services on GKE and have a Cloud NAT gateway set up for the network.
Is there any GCP-specific setting that could be causing the issue?
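For reference, this is roughly the command I used to raise the port setting (the NAT and router names here are placeholders):
gcloud compute routers nats update nat-gateway \
  --router=nat-router \
  --region=us-central1 \
  --min-ports-per-vm=512   # raised from the default of 64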

I figured out what the issue was. The kube-dns service was being scheduled onto nodes suffering from high memory pressure, causing kube-dns to be evicted and restarted. While it was down, some requests could not be resolved. To fix the issue I created a node pool exclusive to the kube-system services, then edited the kube-system deployments and set a nodeSelector so they always get scheduled onto safe nodes. After that, the issue ceased.
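If it helps, the nodeSelector can be applied with a patch along these lines (a sketch; system-pool is a placeholder for the dedicated node pool, and note that GKE's addon manager may revert edits to managed deployments, so verify the change sticks):
kubectl patch deployment kube-dns -n kube-system -p \
  '{"spec":{"template":{"spec":{"nodeSelector":{"cloud.google.com/gke-nodepool":"system-pool"}}}}}'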

Related

Service not responding to ping command

My service (myservice.com), which is hosted on EC2, is up and running. I can see the Java process running on the machine, but I am not able to reach the service from external machines. I tried the following:
dig +short myservice.com
ping myservice.com
dig resolves and gives me an IP address, but ping shows 100% packet loss. I am not able to reach the service.
Not sure where to look. Some help debugging would be appreciated.
EDIT:
I had an issue with the previous deployment due to which the service was not starting. I fixed that and tried to update, but the deployment was blocked by the ongoing deployment (which might take ~3 hrs to stabilize), so I tried enabling the Force deployment option from the console.
I also tried reducing the "Number of Tasks" count to 0 and reverting it back to 1 (reference: How do I deploy updated Docker images to Amazon ECS tasks?) to stop the ongoing deployment.
Can that be an issue?
You probably need to allow the ICMP protocol in the security group.
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules-reference.html#sg-rules-ping
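For example, a rule allowing ping can be added with something like this (a sketch; the group ID is a placeholder, and you will likely want a narrower source CIDR):
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol icmp \
  --port -1 \
  --cidr 0.0.0.0/0   # allow all ICMP types from anywhere; narrow this in practice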

Why are outbound SSH connections from Google CloudRun to EC2 instances unspeakably slow?

I have a Node API deployed to Google Cloud Run that is responsible for managing external servers (clean, new Amazon EC2 Linux VMs), including over SSH and SFTP. SSH and SFTP do eventually work, but the connections take 2-5 MINUTES to initiate. Sometimes they time out with handshake timeout errors.
The same service running on my laptop, connecting to the same external servers, has no issues and the connections are as fast as any normal SSH connection.
The deployment on CloudRun is pretty standard. I'm running it with a service account that permits access to secrets, etc. Plenty of memory allocated.
I have a VPC Connector set up, and have routed all traffic through the VPC connector, as per the instructions here: https://cloud.google.com/run/docs/configuring/static-outbound-ip
I also tried setting UseDNS no in /etc/ssh/sshd_config on the EC2 instance, as per some online suggestions about slow SSH logins, but that has not made a difference.
I have rebuilt and redeployed the project a few dozen times and all tests are on brand new EC2 instances.
I am attempting these connections using open source wrappers on the Node ssh2 library, node-ssh and ssh2-sftp-client.
Ideas?
Cloud Run only allocates CPU while an HTTP request is active.
You probably don't have an active request on Cloud Run during these connections, and outside of an active request the CPU is throttled, which explains the slow handshakes.
The best fit for this pipeline is Cloud Workflows together with regular Compute Engine instances.
You can set up a workflow that starts a Compute Engine instance for the task and stops it once the steps have finished.
I am the author of the article Run shell commands and orchestrate Compute Engine VMs with Cloud Workflows; it will guide you through the setup.
Executing the workflow can be triggered by Cloud Scheduler or by an HTTP ping.
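As a rough sketch of that start/stop pattern (the VM name, project and zone are placeholders, and the sleep step stands in for the real SSH/SFTP work):
cat > vm-pipeline.yaml <<'EOF'
main:
  steps:
    - start_vm:
        call: googleapis.compute.v1.instances.start   # Workflows Compute Engine connector
        args:
          instance: worker-vm
          project: my-project
          zone: us-central1-a
    - do_work:
        call: sys.sleep                                # placeholder for the actual job
        args:
          seconds: 300
    - stop_vm:
        call: googleapis.compute.v1.instances.stop
        args:
          instance: worker-vm
          project: my-project
          zone: us-central1-a
EOF
gcloud workflows deploy vm-pipeline --source=vm-pipeline.yaml --location=us-central1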

Fargate deployment restarting multiple times before it comes online

I have an ECS service deployed on Fargate.
It is attached to a Network Load Balancer. Rolling updates were working fine, but suddenly I am seeing the issue below.
When I update the service with a new task definition, Fargate starts the deployment and tries to start a new container. Since the service is attached to the NLB, the new task registers itself with the NLB target group.
But the NLB target group's health check fails, so Fargate kills the failed task and starts a new one. This repeats multiple times (the number actually varies; today it took 7 hours for the rolling update to finish).
There are no changes to the infrastructure after the deployment. The security group allows traffic within the VPC. The NLB and the ECS service are deployed in the same VPC and same subnet.
The Fargate health check fails for tasks with the same Docker image N times, but after that it starts passing.
The target group's healthy/unhealthy threshold is 3, the protocol is TCP, the port is traffic-port, and the interval is 30 seconds. In the microservice startup log I see this:
Started myapp in 44.174 seconds (JVM running for 45.734)
When the task comes up, I tried opening a security group rule for the VPN and accessing the task IP directly. I can reach the microservice directly via the task IP.
So why is the NLB health check failing?
I had the exact same issue.
I simulated it with different images (Go, Python) because I suspected CPU/memory utilization overhead, which turned out to be false.
A mitigation is changing the Fargate deployment parameter Minimum healthy percent to 50% (it was 100% before, which seemed to cause the issue).
After the change the failures became rare, but they still occurred.
The real cause is still unknown; it seems to be related to the NLB configuration in Fargate.
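For reference, that parameter can be changed with the CLI like this (cluster and service names are placeholders):
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --deployment-configuration minimumHealthyPercent=50,maximumPercent=200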

Istio: failed calling admission webhook Address is not allowed

I am getting the following error while creating a gateway for the sample bookinfo application:
Internal error occurred: failed calling admission webhook
"pilot.validation.istio.io": Post
https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s:
Address is not allowed
I created an EKS PoC cluster using two node groups (each with two instances), one with t2.medium and the other with t2.large instance types, in my dev AWS account, using two /26 subnets and the default VPC CNI provided by EKS.
But as the cluster grew, with multiple services running, I started running out of IPs (per the docs, the default VPC CNI driver assigns each pod an IP address from the VPC, much like an EC2 instance).
To avoid this, I followed the post below to change the networking from the default to Weave:
https://medium.com/codeops/installing-weave-cni-on-aws-eks-51c2e6b7abc8
That resolved the IP availability issue.
Now, after reconfiguring the network from the VPC CNI to Weave,
I have started getting the error in the subject line for my service mesh configured using Istio.
There are a couple of services running inside the mesh, and Kiali, Prometheus and Jaeger are integrated with it as well.
I had a look at GitHub (https://github.com/istio/istio/issues/9998) and the docs
(https://istio.io/docs/ops/setup/validation/), but could not find a proper answer.
Let me know if anyone has faced this issue and has a partial or full solution.
This appears to be related to the switch from the AWS CNI to Weave. The AWS CNI uses the IP range of your VPC, while Weave uses its own address range for pods, so there may be leftover iptables rules from the AWS CNI, for example.
Internal error occurred: failed calling admission webhook "pilot.validation.istio.io": Post https://istio-galley.istio-system.svc:443/admitpilot?timeout=30s: Address is not allowed
The message above implies that whatever istio-galley.istio-system.svc resolves to inside your K8s cluster is not an allowed IP address for the webhook, so I would also check what it resolves to. (It may be related to CoreDNS.)
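A quick way to check that from inside the cluster (a sketch using a throwaway busybox pod):
kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
  nslookup istio-galley.istio-system.svc.cluster.local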
You can also try the following steps;
Basically (quoted):
kubectl delete ds aws-node -n kube-system
delete /etc/cni/net.d/10-aws.conflist on each of the nodes
edit the instance security group to allow TCP 6783 and UDP 6783/6784 (the Weave Net ports)
flush the iptables nat, mangle and filter tables (see the sketch below)
restart the kube-proxy pods
apply the weave-net daemonset
delete the existing pods so they get recreated in Weave's pod CIDR address space.
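For the flush and restart steps, on each node that would look roughly like this (a sketch; flushing iptables drops existing rules, so run it with care):
sudo iptables -t nat -F      # flush the nat table
sudo iptables -t mangle -F   # flush the mangle table
sudo iptables -F             # flush the filter table
kubectl delete pods -n kube-system -l k8s-app=kube-proxy   # restart kube-proxy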
Furthermore, you can try reinstalling everything from scratch using Weave.
Hope it helps!

Unexpected latency issues AWS-API Gateway

I need help troubleshooting AWS API Gateway latency issues. We have the same configuration and even the same data in both environments, but we are facing high latency in non-prod. We are using an NLB and a VPC link for API Gateway. Please find the values below.
We copied the data from the dev Mongo to the test environment to make sure the same volume of data is present in both places. We hit /test/16 in both environments, but experience very high latency in dev compared to test.
Test:
Request: /test/16
Status: 200
Latency: 213 ms

Dev:
Request: /test/16
Status: 200
Latency: 4896 ms
Have you checked your VPC flow logs to see the flow paths for the requests? If not, I suggest starting there.
FYI, you can learn about VPC flow logs at https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html#working-with-flow-logs.
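Flow logs can be enabled on the VPC with something like this (a sketch; the VPC ID, log group and IAM role are placeholders):
aws ec2 create-flow-logs \
  --resource-type VPC \
  --resource-ids vpc-0123456789abcdef0 \
  --traffic-type ALL \
  --log-group-name vpc-flow-logs \
  --deliver-logs-permission-arn arn:aws:iam::123456789012:role/flow-logs-role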
What is behind the load balancer? Are you reaching anything by DNS name, or just by IP?
We had a similar problem at one point; looking at the monitoring of the load balancer (ELB), we found that the problem was in the downstreams.
The monitoring even showed that we got 504s from the load balancer.
In our case it was caused by DNS caching: the target instances had been replaced, and the DNS in some nginx instances on the network path to the target had not been updated.
The nginx instances had to be updated to resolve the target dynamically, since by default nginx only resolves the target on startup.
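The usual nginx pattern for that is to reference the upstream through a variable, which forces a runtime DNS lookup via the configured resolver (a sketch; the resolver IP and hostname are placeholders):
resolver 10.0.0.2 valid=10s;                       # re-resolve at most every 10s
set $backend "http://myservice.internal.example";  # upstream held in a variable
proxy_pass $backend;                               # variable forces per-request resolution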
Without knowing your architecture, however, it is hard to say what is causing your problem. Here is another DNS story, with some debugging examples: https://srvaroa.github.io/kubernetes/migration/latency/dns/java/aws/microservices/2019/10/22/kubernetes-added-a-0-to-my-latency.html 🍿
Good luck.