Kubernetes does not seem to update internal IP table after node restart - google-cloud-platform

We're currently experiencing an issue with our GKE cluster: client requests are being forwarded to pods whose IPs were previously assigned to other pods in the cluster. We can see this with the following query in Logs Explorer:
resource.type="http_load_balancer"
httpRequest.requestMethod="GET"
httpRequest.status=404
Snippet from one of the logs:
httpRequest: {
  latency: "0.017669s"
  referer: "https://asdf.com/"
  remoteIp: "5.57.50.217"
  requestMethod: "GET"
  requestSize: "34"
  requestUrl: "https://[asdf.com]/api/service2/[...]"
  responseSize: "13"
  serverIp: "10.19.160.16"
  status: 404
  userAgent: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
}
...where the requestUrl property indicates the incoming URL to the load balancer.
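The same filter can also be run from the CLI with gcloud, assuming the SDK is authenticated against the project (the limit and format flags are just illustrative):

gcloud logging read 'resource.type="http_load_balancer" AND httpRequest.requestMethod="GET" AND httpRequest.status=404' --limit=10 --format=json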
Then I search for the IP 10.19.160.16 to find out which pod the IP is assigned to:
c:\>kubectl get pods -o wide | findstr 10.19.160.16
service1-675bfc4f97-slq6g 1/1 Terminated 0 40h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
service2-574d69cf69-c7knp 0/1 Error 0 3d16h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
service3-6db4c97784-428pq 1/1 Running 0 16h 10.19.160.16 gke-namespace-te-namespace-te-153a9649-p2mg
So based on requestUrl, the request should have been sent to service2. Instead, we see it being sent to service3, because service3 has been assigned the IP that service2 used to have; in other words, the cluster seems to think that service2 is still holding on to the IP 10.19.160.16. The effect is that service3 returns status code 404 because it doesn't recognize the endpoint.
This behavior only stops when we manually delete the pods in a failed state (e.g. Error or Terminated) with the kubectl delete pod ... command.
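As a stopgap, the failed pods can be cleaned up in bulk rather than one by one; a minimal sketch, assuming the affected pods all sit in the Failed phase (this is the same manual workaround, just batched, and does not address the underlying issue):

kubectl delete pods --field-selector=status.phase=Failed --all-namespaces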
We suspect that this behavior started after we upgraded our cluster to v1.23, which required us to migrate from extensions/v1beta1 to networking.k8s.io/v1 as described in https://cloud.google.com/kubernetes-engine/docs/deprecations/apis-1-22.
Our test environment uses preemptible VMs, and while we're not 100% sure (but pretty close), it looks like the pods end up in the Error state after a node is preempted.
Why does the cluster still think that a dead pod has the IP it used to have? Why does the problem go away after deleting the failed pods? Shouldn't they have been cleaned up after a node preemption?

Gari Singh provided the answer in a comment.

Related

Suspicious network activity on Amazon EC2 instance

I have created an Amazon EC2 instance on which I am hosting a Flask server (the public IP of the server is known only to another server; it is not meant to be used by clients, only by another computer).
For some reason, I am seeing weird network activity.
From the logs:
162.142.125.10 - - [18/Apr/2022 19:45:39] "GET / HTTP/1.1" 200 -
118.123.105.85 - - [18/Apr/2022 20:06:30] "GET / HTTP/1.0" 200 -
198.235.24.20 - - [18/Apr/2022 22:37:16] "GET / HTTP/1.1" 200 -
128.14.209.250 - - [19/Apr/2022 01:24:07] "GET / HTTP/1.1" 200 -
128.14.209.250 - - [19/Apr/2022 01:24:15] code 400, message Bad request version ('À\x14À')
128.14.209.250 - - [19/Apr/2022 07:05:32] "▬♥☺ ±☺ ♥♥Ýfé$0±6nu♀¤♫ëe éSV∟É#☼ß↨♠\ VÀ◄ÀÀ‼À À¶À" HTTPStatus.BAD_REQUEST -
I have looked up all of these IPs and they are from across the globe.
Why am I getting these kinds of requests? What are they probably trying to achieve?
[EDIT]
162.142.125.10 -> https://about.censys.io/
118.123.105.85 -> ChinaNet Sichuan Province Network
198.235.24.20 -> Palo Alto Networks Inc
128.14.209.250 -> zl-dal-us-gp1-wk123.internet-census.org
As others have said, it's common for bots and (ethical?) hackers around the world to scan your machine if it's on a public network.
Your assumption that "the public IP of the server is known only to another server" simply isn't true.
If you want to achieve that, you should place your server inside a private VPC subnet and/or allow traffic only from that specific server via the Security Group configuration.
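For example, a minimal sketch of tightening the ingress rule with the AWS CLI; the security group ID, port, and peer address below are placeholders:

# Allow inbound traffic only from the one peer server that should reach the Flask app
# (and remove any existing 0.0.0.0/0 rule from the same security group afterwards).
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 5000 \
  --cidr 203.0.113.25/32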

Slack Bot deployed in Cloud Foundry returns 502 Bad Gateway errors

In Slack, I have set up an app with a slash command. The app works well when I use a local ngrok server.
However, when I deploy the app server to PCF, it is returning 502 errors:
[CELL/0] [OUT] Downloading droplet...
[CELL/SSHD/0] [OUT] Exit status 0
[APP/PROC/WEB/0] [OUT] Exit status 143
[CELL/0] [OUT] Cell e6cf018d-0bdd-41ca-8b70-bdc57f3080f1 destroying container for instance 28d594ba-c681-40dd-4514-99b6
[PROXY/0] [OUT] Exit status 137
[CELL/0] [OUT] Downloaded droplet (81.1M)
[CELL/0] [OUT] Cell e6cf018d-0bdd-41ca-8b70-bdc57f3080f1 successfully destroyed container for instance 28d594ba-c681-40dd-4514-99b6
[APP/PROC/WEB/0] [OUT] ⚡️ Bolt app is running! (development server)
[OUT] [APP ROUTE] - [2021-12-23T20:35:11.460507625Z] "POST /slack/events HTTP/1.1" 502 464 67 "-" "Slackbot 1.0 (+https://api.slack.com/robots)" "10.0.1.28:56002" "10.0.6.79:61006" x_forwarded_for:"3.91.15.163, 10.0.1.28" x_forwarded_proto:"https" vcap_request_id:"7fe6cea6-180a-4405-5e5e-6ba9d7b58a8f" response_time:0.003282 gorouter_time:0.000111 app_id:"f1ea0480-9c6c-42ac-a4b8-a5a4e8efe5f3" app_index:"0" instance_id:"f46918db-0b45-417c-7aac-bbf2" x_cf_routererror:"endpoint_failure (use of closed network connection)" x_b3_traceid:"31bf5c74ec6f92a20f0ecfca00e59007" x_b3_spanid:"31bf5c74ec6f92a20f0ecfca00e59007" x_b3_parentspanid:"-" b3:"31bf5c74ec6f92a20f0ecfca00e59007-31bf5c74ec6f92a20f0ecfca00e59007"
Besides endpoint_failure (use of closed network connection), I also see:
x_cf_routererror:"endpoint_failure (EOF (via idempotent request))"
x_cf_routererror:"endpoint_failure (EOF)"
In PCF, I created an https:// route for the app. This is the URL I put into my Slack App's "Redirect URLs" section as well as my Slash command URL.
In Slack, the URLs end with /slack/events
This configuration all works well locally, so I guess I missed a configuration point in PCF.
Manifest.yml:
applications:
- name: kafbot
  buildpacks:
  - https://github.com/starkandwayne/librdkafka-buildpack/releases/download/v1.8.2/librdkafka_buildpack-cached-cflinuxfs3-v1.8.2.zip
  - https://github.com/cloudfoundry/python-buildpack/releases/download/v1.7.48/python-buildpack-cflinuxfs3-v1.7.48.zip
  instances: 1
  disk_quota: 2G
  # health-check-type: process
  memory: 4G
  routes:
  - route: "kafbot.apps.prod.fake_org.cloud"
  env:
    KAFKA_BROKER: 10.32.17.182:9092,10.32.17.183:9092,10.32.17.184:9092,10.32.17.185:9092
    SLACK_BOT_TOKEN: ((slack_bot_token))
    SLACK_SIGNING_SECRET: ((slack_signing_key))
  command: python app.py
When x_cf_routererror says endpoint_failure it means that the application has not handled the request sent to it by Gorouter for some reason.
From there, you want to look at response_time. If the response time is high (typically the same value as the timeout, like 60s almost exactly), it means your application is not responding quickly enough. If the value is low, it could mean that there is a connection problem, like Gorouter tries to make a TCP connection and cannot.
Normally this shouldn't happen. The platform has a health check in place that makes sure the application is up and listening for requests; if it isn't, the application is not considered to have started correctly.
In this particular case, the manifest has health-check-type: process, which disables the standard port-based health check and uses a process-based health check instead. This allows the application to start up even if it's not listening on the expected port. Thus, when Gorouter sends a request to the application on that port, it cannot connect. Side note: typically you'd only use process-based health checks if your application does not listen for incoming requests at all.
The platform is going to pass in a $PORT environment variable (currently it is always 8080, but that could change in the future). You need to make sure your app listens on that port, and that it listens on 0.0.0.0, not localhost or 127.0.0.1.
This should ensure that Gorouter can deliver requests to your application on the agreed-upon port.
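A quick way to sanity-check that locally before pushing; the port value is just what the platform currently uses, and any HTTP response (even a 4xx) shows the port is reachable:

# Start the app the way the platform would, with an explicit port
PORT=8080 python app.py &
# Confirm the listener is bound to 0.0.0.0:8080, not 127.0.0.1:8080
ss -ltn | grep 8080
# Any HTTP response here (even 404/401) means the endpoint is reachable on that port
curl -i -X POST http://localhost:8080/slack/events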

Istio Circuit Breaker who trips it?

I am currently doing research on the service mesh Istio, version 1.6. The data plane (the Envoy proxies) is configured by the control plane.
When I configure a circuit breaker by creating a DestinationRule and the circuit breaker opens, is it the client-side sidecar proxy that returns the 503, or the server-side sidecar proxy?
Does the client-side sidecar proxy automatically route the request to another available instance of the service, or does it simply return the 503 to the application container?
Thanks in advance!
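For context, a circuit breaker of this kind is defined through the connectionPool and outlierDetection settings of a DestinationRule; a minimal sketch, with an illustrative host name and deliberately aggressive thresholds:

kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: circuit-service
spec:
  host: circuit-service
  trafficPolicy:
    connectionPool:
      http:
        http1MaxPendingRequests: 1
        maxRequestsPerConnection: 1
    outlierDetection:
      consecutive5xxErrors: 1
      interval: 1s
      baseEjectionTime: 3m
      maxEjectionPercent: 100
EOF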
You can inspect the log entries to figure out both ends of a connection that was stopped by the circuit breaker; the IP addresses of both sides of the connection are present in the log message from the istio-proxy container.
{
  insertId: "..."
  labels: {
    k8s-pod/app: "circuitbreaker-jdwa8424"
    k8s-pod/pod-template-hash: "..."
  }
  logName: ".../logs/stdout"
  receiveTimestamp: "2020-06-09T05:59:30.209882320Z"
  resource: {
    labels: {
      cluster_name: "..."
      container_name: "istio-proxy"
      location: "..."
      namespace_name: "circuit"
      pod_name: "circuit-service-a31cb334d-66qeq"
      project_id: "..."
    }
    type: "k8s_container"
  }
  severity: "INFO"
  textPayload: "[2020-06-09T05:59:27.854Z] UO 0 0 0 "-" - - 172.207.3.243:443 10.1.13.216:36774 "
  timestamp: "2020-06-09T05:59:28.071001549Z"
}
The message is coming from the istio-proxy container, which runs the Envoy instance that applied the CircuitBreaker policy to the request. It also contains the IP addresses of both the source and the destination of the interrupted connection.
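Entries like this can also be pulled straight from the sidecar; the pod name and namespace below are taken from the example above, and UO is Envoy's response flag for circuit-breaker overflow:

kubectl logs circuit-service-a31cb334d-66qeq -n circuit -c istio-proxy | grep ' UO '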
It will return 503. There is an option to configure retries; however, I did not test how retries interact with the CircuitBreaker, or whether a retry actually goes to a different pod if the previous one returned an error.
Also take a look at the most detailed explanation of CircuitBreaker I managed to find.
Hope it helps.

kube-dns fails to start

I'm trying to install k8s on AWS with kops. The k8s cluster fails to come online because the kube-dns service keeps crashing. Logs:
I0111 19:21:30.189260 1 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:10053 [rcache 0]
I0111 19:21:30.189278 1 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:10053 [rcache 0]
I0111 19:21:30.689342 1 dns.go:174] Waiting for services and endpoints to be initialized from apiserver...
...
E0111 19:22:00.189691 1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Endpoints: Get https://100.64.0.1:443/api/v1/endpoints?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
E0111 19:22:00.190016 1 reflector.go:199] k8s.io/dns/vendor/k8s.io/client-go/tools/cache/reflector.go:94: Failed to list *v1.Service: Get https://100.64.0.1:443/api/v1/services?resourceVersion=0: dial tcp 100.64.0.1:443: i/o timeout
All the other kube-system pods are in good order. The api-server log on the master doesn't list any errors. What might the problem be?
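One way to narrow this down is to check whether an arbitrary pod can reach the API server's service IP at all; a rough sketch, where the IP and port are taken from the error messages above and the curl image is just a convenient choice (any HTTP response, even 401/403, means the service IP is reachable, while a timeout reproduces the kube-dns symptom):

kubectl run apicheck -it --rm --restart=Never --image=curlimages/curl -- \
  curl -k -m 5 https://100.64.0.1:443/version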

ELB Keeps Inconsistently Failing Health Checks, but EC2 Status Checks OK

The servers are accessible normally. The health check hits / (the default page).
Under some load, the instances respond a little slower than the ELB likes, and it then takes them out of the load balancer.
Because the application itself doesn't fail, I cannot "reboot" the instances via EC2, and I can often access the web page / IP directly myself while an instance is "out of service".
This isn't a general failure or a misconfiguration: an instance can be up for 12-2400 hours and then randomly fail the health check three times in three hours, under medium-to-low load.
The health check is set to a 10s response timeout, 30s intervals, 5 failures to mark an instance unhealthy, and 2 successes to mark it healthy again.
Any ideas?
The health check requests in the logs look normal, and there's nothing in the error log. Here's a sample from the access log:
10.0.100.30 - - [25/Nov/2016:06:49:22 +0000] "GET /index.html HTTP/1.1" 200 11415 "-" "ELB-HealthChecker/1.0"
::1 - - [25/Nov/2016:06:49:26 +0000] "OPTIONS * HTTP/1.0" 200 126 "-" "Apache/2.4.20 (Ubuntu) (internal dummy connection)"
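For reference, the health check described above expressed as a classic-ELB CLI call; the load balancer name is a placeholder and the target path is taken from the access log:

aws elb configure-health-check \
  --load-balancer-name my-classic-elb \
  --health-check Target=HTTP:80/index.html,Interval=30,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2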