Lots of connection logs after opening ports on a k8s service - amazon-web-services

I'm using the AWS managed Kubernetes service (EKS), with Deployments and LoadBalancer Services.
(2 nodes, 1 LoadBalancer Service, 1 Deployment, 1 Pod, 1 ReplicaSet.)
However, when I add a port to the Service, a large number of connections keep hitting the port I opened.
The log looks like this:
[17:12:21.843] Client Connected [/192.168.179.222:28607]
[17:12:21.843] Client Disconnected [/192.168.179.222:28607]
[17:12:21.864] Client Connected [/192.168.179.222:16888]
[17:12:21.864] Client Disconnected [/192.168.179.222:16888]
[17:12:21.870] Client Connected [/192.168.79.91:58902]
[17:12:21.870] Client Disconnected [/192.168.79.91:58902]
[17:12:22.000] Client Connected [/192.168.179.222:52060]
[17:12:22.000] Client Disconnected [/192.168.179.222:52060]
[17:12:23.118] Client Connected [/192.168.79.91:14650]
[17:12:23.119] Client Disconnected [/192.168.79.91:14650]
192.168.179.222 and 192.168.79.91 are my nodes' IPs, and the logs come from the pods.
I thought it was the AWS load balancer's health check, but the health check interval is 30 seconds, so that doesn't add up.
Because of all these entries I cannot see my real transaction logs.
How can I get rid of those connections? What is causing these logs?
--- Edit
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-192-168-179-222.ap-northeast-2.compute.internal Ready <none> 11d v1.16.12-eks-904af05 192.168.179.222 ########## Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
ip-192-168-79-91.ap-northeast-2.compute.internal Ready <none> 11d v1.16.12-eks-904af05 192.168.79.91 ########## Amazon Linux 2 4.14.181-142.260.amzn2.x86_64 docker://19.3.6
This is my node info; I'm fairly sure the IPs in the logs are the node IPs.
I have several processes in my pod, and every one of them is flooded with these connection logs.

What you are seeing is the result of scanners on the internet restlessly trying to find vulnerable applications.
To fix that and to get cleaner logs you can:
Do IP whitelisting on the security group, so that only certain IPs can connect to your service (see the sketch below)
Install a WAF to filter the scanners out
Also, you may want to use structured logs, where your legitimate logs have a defined format that can easily be spotted and separated from the garbage logs created by the scanners
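A minimal sketch of the whitelisting idea, assuming a LoadBalancer Service named my-service and an allowed CIDR of 203.0.113.0/24 (both are placeholders); Kubernetes can push the restriction down to the load balancer's security group via loadBalancerSourceRanges:
# Restrict which source IPs may reach the LoadBalancer Service;
# on EKS this is applied as security group rules on the ELB.
kubectl patch service my-service \
  --patch '{"spec":{"loadBalancerSourceRanges":["203.0.113.0/24"]}}'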

Related

Reason for sudden inability to SSH into GCP VM instance

I was no longer able to SSH into a Google Cloud Compute Engine VM instance that previously showed no problems.
The error logs show the following
#type: "type.googleapis.com/google.protobuf.Struct" value: {
conditionNotMet: { userVisibleMessage: "Supplied fingerprint does not
match current metadata fingerprint."
Trying SSH through the console showed
Code: 4003 Reason: failed to connect to backend Please ensure that:
your user account has iap.tunnelInstances.accessViaIAP permission
VM has a firewall rule that allows TCP ingress traffic from the IP range XXX.0/20, port: 22
you can make a proper https connection to the IAP for TCP hostname: https://tunnel.cloudproxy.app You may be able to connect without using
the Cloud Identity-Aware Proxy.
The VM instance logs showed the following
Error watching metadata: Get
http://metadata.google.internal/computeMetadata/v1//?recursive=true&alt=json&wait_for_change=true&timeout_sec=60&last_etag=XXX:
net/http: request canceled (Client.Timeout exceeded while awaiting
headers)
After stopping and restarting the instance I was able to SSH again, but I would like to understand the reason for the problem in the first place.
The error message you received indicates that the metadata server's response timed out, which caused the connection to the Google Compute Engine VM instance to fail. This could be because the server was taking too long to respond or because there was a problem with the network. You can try to resolve this issue either by increasing the timeout value using this doc or by waiting for the instance to become healthy using the gcloud compute wait command.
The instance was unable to reach the metadata server, as suggested by the timeout error message. This could be a problem with the instance itself or with the network connection. A firewall or network configuration issue could have prevented the instance from connecting to the metadata server, or an issue with the underlying infrastructure could have made the instance temporarily unavailable.
To prevent this issue from happening again, you can increase the timeout value or use the gcloud compute wait command to wait for the instance to become healthy. It is also recommended that you regularly update the SSH key used to connect to the instance, and check that the instance can reach the metadata server by making an HTTPS request to the IAP for TCP hostname. Additionally, it is important to ensure that your user account has the "iap.tunnelInstances.accessViaIAP" permission, and that the VM has a firewall rule that allows TCP ingress traffic from the IP range XXX.0/20, port: 22.
If you are using a Windows VM, try the troubleshooting steps mentioned in this doc.
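As a quick check that the instance can reach the metadata server, something like the following can be run from inside the VM (metadata.google.internal is the standard GCE metadata endpoint; the timeout value is arbitrary):
# Verify the metadata server is reachable; a successful response returns the instance ID.
curl -s -m 10 -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/id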

Health Check keeps failing for ECS container

I am currently trying to deploy 2 ECS services on a single EC2 instance for test environment.
Here is what I have done so far:
Successfully created 2 Security Groups for Load Balancer and EC2 instance.
My EC2 Security Group
My ALB Security Group
Successfully created 2 different Task Definitions for my 2 applications, both Spring Boot applications. The first application runs on port 8080, and the Container Port in its Task Definition is also 8080. The second application runs on port 8081, and the Container Port in its Task Definition is also 8081.
Successfully created an ECS cluster with an Auto-Scaling Group as Capacity Provider. The cluster also recognizes the Container Instance created by the Auto-Scaling Group (I am using t2.micro since it is in the free tier). Attached the created Security Group to the EC2 instance.
My EC2 Security Group
Successfully created an ALB with 2 forward listeners, 8080 and 8081, each configured with a different Target Group for its service. Attached the created Security Group to the ALB.
Here is how the ECS behaves with my services:
I attempted to create 2 new services. The first service is mapped to port 8080 on the ALB; the second one is mapped to port 8081. Each of them has a different Target Group, but the Health Check configurations are the same.
Health Check Configuration for Service 1
Health Check Configuration for Service 2
The first service deployed pretty smoothly; the health check returned success on the first try.
However, for the second service I used the exact same configuration as the first one, just with a different listener port on the ALB and the application container running on a different port number as well (which I believe should not be a problem). The service attempted 10 times before failing the deployment, and I kept getting this repeated error message: service <service_name> instance <instance_id> port <port_number> is unhealthy in target-group <target_group_name> due to (reason Health checks failed).
This did not happen with my first service with the same configuration. The weird thing is that when I send a request to the ALB domain name on port 8081, the application behind the second service seems to work fine without any error. It is just that the failing health check keeps tearing my service down.
I went over a bunch of posts and nothing really helps with the current situation. Also, it is frustrating that I cannot dig up any further details other than the info in the image below.
Does anyone have any suggestion to resolve this problem? I would really appreciate it.
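(For reference, the AWS CLI can surface a bit more detail than the console screenshot; the target group names and ARN below are placeholders.)
# Compare the two target groups' health check settings side by side.
aws elbv2 describe-target-groups \
  --names my-tg-8080 my-tg-8081 \
  --query 'TargetGroups[].{Name:TargetGroupName,Port:Port,Path:HealthCheckPath,Matcher:Matcher.HttpCode}'
# Ask the failing target group why it considers the target unhealthy.
aws elbv2 describe-target-health \
  --target-group-arn <arn-of-my-tg-8081> \
  --query 'TargetHealthDescriptions[].{Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}'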

AWS Security Group connection tracking failing for responses with a body in ASP.NET Core app running in ECS + Fargate

In my application:
ASP.NET Core 3.1 with Kestrel
Running in AWS ECS + Fargate
Services run in a public subnet in the VPC
Tasks listen only in the port 80
Public Network Load Balancer with SSL termination
I want to set the Security Group to allow inbound connections from anywhere (0.0.0.0/0) to port 80, and disallow any outbound connection from inside the task (except, of course, to respond to the allowed requests).
As Security Groups are stateful, the connection tracking should allow the egress of the response to the requests.
In my case, this connection tracking only works for responses without a body (just headers). When the response has a body (in my case, a >1 MB file), they fail. If I only allow outbound TCP connections from port 80, they also fail. But if I allow outbound TCP connections for the full range of ports (0-65535), it works fine.
I guess this is because when ASP.NET Core + Kestrel writes the response body it initiates a new connection which is not recognized by the Security Group connection tracking.
Is there any way I can allow only responses to requests, and no other type of outbound connection initiated by the application?
So we're talking about something like this?
Client 11.11.11.11 ----> AWS NLB/ELB public 22.22.22.22 ----> AWS ECS network router or whatever (kubernetes) --------> ECS server instance running a server application 10.3.3.3:8080 (kubernetes pod)
Do you configure the security group on the AWS NLB or on the AWS ECS? (I guess both?)
Security groups should allow incoming traffic if you allow 0.0.0.0/0 port 80.
They are indeed stateful. They will allow the connection to proceed both ways after it is established (meaning the application can send a response).
However, firewall state is typically not kept for more than 60 seconds (not sure what technology AWS is using), so the connection can be "lost" if the server takes more than a minute to reply. Does the HTTP server take a while to generate the response? If it's a websocket or TCP server instead, does it spend whole minutes at times without sending or receiving any traffic?
The way I see it, we've got two stateful firewalls: the first with the NLB, the second with ECS.
ECS is an equivalent of Kubernetes; it must be doing a ton of iptables magic to distribute traffic and track connections. (For reference, regular Kubernetes works heavily with iptables, and iptables has a bunch of very important settings like connection durations and timeouts.)
The good news is: if it breaks when you only allow inbound 0.0.0.0:80, but works when you allow inbound 0.0.0.0:80 plus outbound 0.0.0.0:*, this is definitely an issue with the firewall dropping the connection, most likely due to losing state (or it's not stateful in the first place, but I'm pretty sure security groups are stateful).
The drop could happen on either of the two firewalls. I've never had an issue with a single bare NLB/ELB, so my guess is the problem is in the ECS or the interaction of the two together.
Unfortunately we can't debug that and we have very little information about how this works internally. Your only option will be to work with the AWS support to investigate.
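For reference, the two security group setups being compared look roughly like this in AWS CLI terms (the group ID is a placeholder):
# Inbound HTTP only -- reported to fail for responses with a large body.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 80 --cidr 0.0.0.0/0
# Adding outbound TCP on the full port range -- reported to make it work.
aws ec2 authorize-security-group-egress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 0-65535 --cidr 0.0.0.0/0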

AWS: Using the application load balancer with ECS - Requests not reaching tasks

When I send requests using the ALB's DNS name, the listener's path, and the web service's endpoint path, I don't get a response within the expected timeframe. I know the tasks themselves work because sending requests directly to each of them using their public IP addresses returns successful responses.
For example:
The ALB's DNS entry: http://myapp-alb-11111111.us-west-1.elb.amazonaws.com
The web app, "abc", listens on port 80 for requests on "/api/health".
The web app is using "abc-svc/*" as the path in the listener.
The web app was assigned a public ip address of 10.88.77.66.
Sending a GET request to 'http://10.88.77.66/api/health' is successful.
Sending a GET request to 'http://myapp-alb-11111111.us-west-1.elb.amazonaws.com/abc-svc/api/health' does not return within several minutes, which is not expected behavior.
I've looked through the logs, but cannot find anything that is amiss. I'd appreciate any ideas or suggestions...
AWS CONFIGURATION
I have three docker images that are running in ECS. Each image is assigned to a separate service. Each service has a single task. Port 80 is open in the security group from the Internet to the ALB. Port 80 is open from the ALB to each task. The ALB's listener for port 80 is using path-based routing. There is a separate, unique path for each service. Each task contains a docker Linux, Spring Boot 2, web service. Each web service's router has a "/api/health" route that expects a GET request with no parameters and returns a simple string. We are not using HTTPS or SSL at this time.
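(For reference, the path-based routing described above corresponds roughly to a listener rule like the following; the ARNs and priority are placeholders.)
# Forward /abc-svc/* on the port 80 listener to the "abc" service's target group.
aws elbv2 create-rule \
  --listener-arn <alb-port-80-listener-arn> \
  --priority 10 \
  --conditions Field=path-pattern,Values='/abc-svc/*' \
  --actions Type=forward,TargetGroupArn=<abc-svc-target-group-arn>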
Thank you for your time and interest.
Mike
There could be different reasons for this, but here are some common issues you can debug:
Check the health check for each target group under the LB's target groups; if it is unhealthy, the LB will never route traffic to it
Verify the target port is correct
Verify the target group is properly associated with the LB and is not showing as unused
Verify the LB security group
Check the response from the LB: if it is a gateway timeout, the target is not reachable; if it is service unavailable, the service is probably restarting
Check the service's event logs to see whether the service reached a steady state; if not, it means it is restarting again and again (see the sketch below)
Check the deployment logs of the service; if you see an unhealthy target group message, update the target group's health check path and expected status code
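A rough sketch of pulling the service events and deployment state from the CLI (cluster and service names are placeholders):
# Show the most recent service events and whether the deployment reached a steady state.
aws ecs describe-services \
  --cluster my-cluster \
  --services my-service \
  --query 'services[0].{Status:status,Events:events[:5].message,Deployments:deployments[].rolloutState}'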

aws ECS, ECS instance is not registered to ALB target group

I created an ECS service; it runs 1 ECS instance, and I can see the instance registered as a target of the load balancer.
Now I trigger the Auto Scaling Group (by just incrementing the desired instance count) to launch a new instance.
The instance is launched and added to the ECS cluster. (I can see it on ECS instances tab)
But the instance is not added to the ALB target group. (I expect to see 2 instances in the following image, but I only see 1.)
I can edit the AutoScalingGroup's target group like the following.
Then I see the following.
But the health check fails. It seems port 80 is not reachable,
although I have port 80 open to the public in the security group for the instance. (Also, the instance created by the ECS service uses dynamic port mapping, but the instance created by the Auto Scaling Group does not.)
So the AutoScalingGroup can launch a new instance, but my load balancer never sends traffic to it.
I did try https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-unhealthy-checks-ecs/?nc1=h_ls and it shows I can connect from the host to the docker container on port 80 with something like curl -v http://${IPADDR}/health.
So it must be that something is wrong with host port 80 (the load balancer can't connect to it).
But the security group setting is not wrong either, because the working instance and this non-working instance use the same SG.
Edit
Because I used dynamic port mapping, my webserver is running on some random port.
As you can see, the instance started by the ECS service has registered itself with the target group on a random port.
However, the instance started by the Auto Scaling Group has registered itself with the target group on port 80.
The instance will not be added to the target group if it's not healthy. So you need to fix the health check first.
From your first instance, your mapped port is 32769, so I assume that if this is the same target group and the same application, the port on the new instance should also be 32769.
When you curl the endpoint with curl -I -v http://${IPADDR}/health, is the HTTP status code 200? If it is 200, the target should be healthy; if it is not 200, either update the backend to return 200 or update the health check's expected HTTP status code.
I assume that you are also running ECS on both instances, so ECS registers targets in the target group for each ECS service. Are you running some mix of services such that you also need a target group on the Auto Scaling group? If you are using dynamic ports, set the health check port to the traffic port.
Now, let's look at the official documentation on dynamic port mapping and the possible causes of HTTP 502 Bad Gateway:
Dynamic port mapping is a feature of container instance in Amazon Elastic Container Service (Amazon ECS)
Dynamic port mapping with an Application Load Balancer makes it easier
to run multiple tasks on the same Amazon ECS service on an Amazon ECS
cluster.
With the Classic Load Balancer, you must statically map port numbers
on a container instance. The Classic Load Balancer does not allow you
to run multiple copies of a task on the same instance because the
ports conflict. An Application Load Balancer uses dynamic port mapping
so that you can run multiple tasks from a single service on the same
container instance.
The target group you created manually will not work with dynamic ports; you have to bind the target group to the ECS service.
dynamic-port-mapping-ecs
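A minimal sketch of binding the target group to the ECS service at creation time, assuming placeholder names and ARN:
# Let ECS (not the Auto Scaling group) register the dynamically mapped host
# ports into the target group by attaching it when the service is created.
aws ecs create-service \
  --cluster my-cluster \
  --service-name my-service \
  --task-definition my-task:1 \
  --desired-count 2 \
  --load-balancers targetGroupArn=<target-group-arn>,containerName=my-container,containerPort=8080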
HTTP 502: Bad Gateway
Possible causes:
The load balancer received a TCP RST from the target when attempting to establish a connection.
The load balancer received an unexpected response from the target, such as "ICMP Destination unreachable (Host unreachable)", when attempting to establish a connection. Check whether traffic is allowed from the load balancer subnets to the targets on the target port.
The target closed the connection with a TCP RST or a TCP FIN while the load balancer had an outstanding request to the target. Check whether the keep-alive duration of the target is shorter than the idle timeout value of the load balancer.
The target response is malformed or contains HTTP headers that are not valid.
The load balancer encountered an SSL handshake error or SSL handshake timeout (10 seconds) when connecting to a target.
The deregistration delay period elapsed for a request being handled by a target that was deregistered. Increase the delay period so that lengthy operations can complete.
http-502-issues
It seems you know the root cause, which is that port 80 is failing the health check and that's why the instance is never added to the ALB. Here is what you can try.
First, check that your service is listening on port 80 on the new host. You can use a command like netcat:
nc -v localhost 80
Once you know that the service is listening, the recommended way to allow your ALB to connect to your host is to add a security group inbound rule on your instance that allows traffic from your ALB's security group on port 80.
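A minimal sketch of that rule, assuming placeholder security group IDs for the instance and the ALB:
# Allow port 80 on the instance's security group only from the ALB's security group.
aws ec2 authorize-security-group-ingress \
  --group-id <instance-security-group-id> \
  --protocol tcp --port 80 \
  --source-group <alb-security-group-id>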