AWS ECS service tasks getting replaced with (reason Request timed out)

We have been running ECS as our container orchestration layer for more than 2 years, but there is one problem we have not been able to figure out the reason for. In a few of our (Node.js) services we have started observing errors in the ECS events such as:
service example-service (instance i-016b0a460d9974567) (port 1047) is unhealthy in target-group example-service due to (reason Request timed out)
This causes our dependent services to start experiencing 504 gateway timeouts, which impacts them in a big way.
1. Upgraded the Docker storage driver from devicemapper to overlay2.
2. Increased the resources for all ECS instances, including CPU, RAM and EBS storage, as we saw pressure in a few containers.
3. Increased the health check grace period for the service from 0 to 240 seconds (see the CLI sketch below).
4. Increased KeepAliveTimeout and SocketTimeout to 180 seconds.
5. Enabled awslogs on containers instead of stdout, but there was no unusual behavior.
6. Enabled ECS metadata at the container level and piped that information into our application logs. This helped us look at the logs for the problematic container only.
7. Enabled Container Insights for better container-level debugging.
Of these, the changes that helped the most were upgrading from devicemapper to the overlay2 storage driver and increasing the health check grace period.
The number of errors has come down dramatically with these two, but we still see this issue once in a while.
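For reference, the grace-period change in step 3 can also be applied from the CLI; a minimal sketch using the cluster and service names that appear in the logs below:

# Give newly started tasks 240 seconds before ELB health check results
# are acted on by the ECS scheduler.
aws ecs update-service \
  --cluster example-service-cluster \
  --service example-service \
  --health-check-grace-period-seconds 240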
We have looked at all the graphs for the instance and the container that went down; below are the logs for it:
ECS Container Insights logs for the victim container:
Query :
fields CpuUtilized, MemoryUtilized, @message
| filter Type = "Container" and EC2InstanceId = "i-016b0a460d9974567" and TaskId = "dac7a872-5536-482f-a2f8-d2234f9db6df"
Example log returned:
{
"Version":"0",
"Type":"Container",
"ContainerName":"example-service",
"TaskId":"dac7a872-5536-482f-a2f8-d2234f9db6df",
"TaskDefinitionFamily":"example-service",
"TaskDefinitionRevision":"2048",
"ContainerInstanceId":"74306e00-e32a-4287-a201-72084d3364f6",
"EC2InstanceId":"i-016b0a460d9974567",
"ServiceName":"example-service",
"ClusterName":"example-service-cluster",
"Timestamp":1569227760000,
"CpuUtilized":1024.144923245614,
"CpuReserved":1347.0,
"MemoryUtilized":871,
"MemoryReserved":1857,
"StorageReadBytes":0,
"StorageWriteBytes":577536,
"NetworkRxBytes":14441583,
"NetworkRxDropped":0,
"NetworkRxErrors":0,
"NetworkRxPackets":17324,
"NetworkTxBytes":6136916,
"NetworkTxDropped":0,
"NetworkTxErrors":0,
"NetworkTxPackets":16989
}
None of the logs showed CPU or memory utilization that was unusually high.
We stopped getting responses from the victim container at, say, t1; we got errors in dependent services at t1+2 minutes, and the container was taken away by ECS at t1+3 minutes.
Our health check configuration is below:
Protocol: HTTP
Path: /healthcheck
Port: traffic port
Healthy threshold: 10
Unhealthy threshold: 2
Timeout: 5 seconds
Interval: 10 seconds
Success codes: 200
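For reference, the same settings expressed as a single CLI call; a sketch only, with a placeholder target group ARN:

# Values mirror the health check configuration listed above.
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example-service/0123456789abcdef \
  --health-check-protocol HTTP \
  --health-check-path /healthcheck \
  --health-check-port traffic-port \
  --healthy-threshold-count 10 \
  --unhealthy-threshold-count 2 \
  --health-check-timeout-seconds 5 \
  --health-check-interval-seconds 10 \
  --matcher HttpCode=200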
Let me know if you need any more information; I will be happy to provide it. The configuration we are running is:
docker info
Containers: 11
Running: 11
Paused: 0
Stopped: 0
Images: 6
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.14.138-89.102.amzn1.x86_64
Operating System: Amazon Linux AMI 2018.03
OSType: linux
Architecture: x86_64
CPUs: 16
Total Memory: 30.41GiB
Name: ip-172-32-6-105
ID: IV65:3LKL:JESM:UFA4:X5RZ:M4NZ:O3BY:IZ2T:UDFW:XCGW:55PW:D7JH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
There should be some indication of resource contention, a service crash, or a genuine network failure to explain all this, but as mentioned, we found nothing that pointed to the cause.

Your steps 1 to 7 have almost nothing to do with the error.
service example-service (instance i-016b0a460d9974567) (port 1047) is
unhealthy in target-group example-service due to (reason Request timed
out)
The error is very clear: your ECS service is not reachable by the load balancer health check.
Target Group Unhealthy
When this is the case, go straight to checking the container security group, the port, the application status, and the health check status code.
Possible reasons:
There is no route for the path /healthcheck in the backend service.
The status code returned from /healthcheck is not 200.
The target port is invalid; configure it correctly. If the application runs on port 8080 or 3000, the target port should be 8080 or 3000.
The security group is not allowing traffic to the target.
The application is not running in the container.
These are the possible reasons when there is a timeout from the health check.
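A quick way to narrow these down is to ask the load balancer and the instance directly; a sketch, with a placeholder target group ARN and security group ID:

# 1. What does the load balancer think of each target, and why is it unhealthy?
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/example-service/0123456789abcdef

# 2. From the container instance, hit the health check on the dynamic host port the task was given.
curl -v http://localhost:1047/healthcheck

# 3. Check that the instance security group allows the load balancer on the host port range.
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions'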

I faced the same issue (reason: Request timed out).
I managed to solve it by updating my security group inbound rules.
There was no rule defined in the inbound rules, so I added a general allow-all IPv4 traffic rule for the time being, because I was in development at that time.
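For example, a development-only allow-all inbound rule can be added like this; a sketch with a placeholder security group ID:

# Opens all TCP ports to the world; acceptable only while developing.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp \
  --port 0-65535 \
  --cidr 0.0.0.0/0

In production, restrict the source to the load balancer's security group and the port range to the task's host ports instead.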

Related

network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container issue in EKS. Can anybody help, please?

When pods are scaled up through the HPA, the following error occurs and pod creation is not possible.
If I manually change the replicas of the deployment, the pods run normally.
It seems to be a CNI-related problem, and the same thing happens even if I install CNI 1.7.10 for the 1.20 cluster as an add-on.
200 IPs per subnet is sufficient, and the outbound security group is also open.
The issue does not occur when the number of pods is scaled via kubectl.
7s Warning FailedCreatePodSandBox pod/b4c-ms-test-develop-5f64db58f-bm2vc Failed to create pod sandbox: rpc error: code = Unknown desc = failed to set up sandbox container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1" network for pod "b4c-ms-test-develop-5f64db58f-bm2vc": networkPlugin cni failed to set up pod "b4c-ms-test-develop-5f64db58f-bm2vc_b4c-test" network: CNI failed to retrieve network namespace path: cannot find network namespace for the terminated container "7632e23d2f3db8f8b8c0335aaaa6afe1e52ad43cf293bfa6789aa14f5b665cf1"
Region: eu-west-1
Cluster Name: dev-pangaia-b4c-eks
For AWS VPC CNI issue, have you attached node logs?: No
For DNS issue, have you attached CoreDNS pod log?:
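The node/CNI logs the template asks about can be collected roughly like this; a sketch assuming the default aws-node DaemonSet labels and log locations:

# Pods and restart counts of the VPC CNI DaemonSet
kubectl -n kube-system get pods -l k8s-app=aws-node -o wide

# Recent logs from the CNI container
kubectl -n kube-system logs daemonset/aws-node -c aws-node --tail=100

# On the affected node, the plugin also writes to
#   /var/log/aws-routed-eni/ipamd.log
#   /var/log/aws-routed-eni/plugin.log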

Fargate task stops about 10s after it starts with no log output

My Fargate task keeps stopping after it starts and doesn't output any logs (the awslogs driver is selected).
The container does start up and stay running when I run it with Docker locally.
Docker-compose file:
version: '2'
services:
  asterisk:
    build: .
    container_name: asterisk
    restart: always
    ports:
      - 10000-10099:10000-10099/udp
      - 5060:5060/udp
Dockerfile:
FROM debian:10.7
RUN {stuff-that-works-is-here}
# Keep Asterisk running in the foreground
ENTRYPOINT ["asterisk", "-f"]
# SIP port
EXPOSE 5060:5060/udp
# RTP ports
EXPOSE 10000-10099:10000-10099/udp
My task execution role has full CloudWatch access for debugging.
Click on the ECS task, expand the container section, and the error should be shown there. I have attached a screenshot of it.
The AWS log driver alone is not enough.
Unfortunately, Fargate doesn't create the log group for you unless you tell it to.
See Creating a log group at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html
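A minimal sketch of the fix, with a placeholder log group name:

# Either create the log group yourself before the first run...
aws logs create-log-group --log-group-name /ecs/asterisk

# ...or set "awslogs-create-group": "true" under the container's
# logConfiguration options in the task definition so it is created automatically.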
I had a similar problem, and the cause was the Health Check.
ECS doesn't have a health check for UDP, so when you open a UDP port and deploy with Docker (docker compose), it creates a health check pointing to a TCP port; since there were no open TCP ports in that range, the container kept being reset by the health check.
I had to add a custom resource to the docker-compose file:
x-aws-cloudformation:
  Resources:
    AsteriskUDP5060TargetGroup:
      Type: "AWS::ElasticLoadBalancingV2::TargetGroup"
      Properties:
        HealthCheckProtocol: TCP
        HealthCheckPort: 8088
Basically, I have a health check for a UDP port pointing to a TCP port. It's a "hack" to bypass this problem when the deploy is made with Docker.

EKS: Unhealthy nodes in the kubernetes cluster

I'm getting an error when using Terraform to provision a node group on AWS EKS.
Error: error waiting for EKS Node Group (xxx) creation: NodeCreationFailure: Unhealthy nodes in the kubernetes cluster.
I went to the console and inspected the node. There is a message: “runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker network plugin is not ready: cni config uninitialized”.
I have 5 private subnets, which connect to the Internet via NAT.
Is someone able to give me some hint on how to debug this?
Here are some details on my env.
Kubernetes version: 1.18
Platform version: eks.3
AMI type: AL2_x86_64
AMI release version: 1.18.9-20201211
Instance types: m5.xlarge
There are three workloads set up in the cluster.
coredns, STATUS (2 Desired, 0 Available, 0 Ready)
aws-node STATUS (5 Desired, 5 Scheduled, 0 Available, 0 Ready)
kube-proxy STATUS (5 Desired, 5 Scheduled, 5 Available, 5 Ready)
Going into coredns, both pods are in Pending state, and the conditions show “Available=False, Deployment does not have minimum availability” and “Progress=False, ReplicaSet xxx has timed out progressing”.
Going into one of the aws-node pods, the status shows “Waiting - CrashLoopBackOff”.
Add a pod network add-on:
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/2140ac876ef134e0ed5af15c65e414cf26827915/Documentation/kube-flannel.yml

Fail to init aws cluster (kubeadm init) with the message "could not init cloud provider "aws": error finding instance ... timeout

The issue I have is that kubeadm will never fully initialize. The output:
...
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get "http://localhost:10248/healthz": dial tcp [::1]:10248: connect: connection refused.
...
and journalctl -xeu kubelet shows the following interesting info:
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: W1203 17:54:08.017925 14709 plugins.go:105] WARNING: aws built-in cloud provider is now deprecated. The AWS provider is deprecated. The AWS provider is deprecated and will be removed in a future release
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018044 14709 aws.go:1235] Building AWS cloudprovider
Dec 03 17:54:08 ip-10-83-62-10.ec2.internal kubelet[14709]: I1203 17:54:08.018112 14709 aws.go:1195] Zone not specified in configuration file; querying AWS metadata service
Dec 03 17:56:08 ip-10-83-62-10.ec2.internal kubelet[14709]: F1203 17:56:08.332951 14709 server.go:265] failed to run Kubelet: could not init cloud provider "aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances: \"RequestError: send request failed\\ncaused by: Post \\\"https://ec2.us-east-1.amazonaws.com/\\\": dial tcp 10.83.60.11:443: i/o timeout
The context is: it's a fully private AWS VPC. There is a proxy that is propagated to k8s manifests.
The kubeadm.yaml config is pretty innocent and looks like this:
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
apiServer:
  extraArgs:
    cloud-provider: aws
clusterName: cdspidr
controlPlaneEndpoint: ip-10-83-62-10.ec2.internal
controllerManager:
  extraArgs:
    cloud-provider: aws
    configure-cloud-routes: "false"
kubernetesVersion: stable
networking:
  dnsDomain: cluster.local
  podSubnet: 10.83.62.0/24
---
apiVersion: kubeadm.k8s.io/v1beta2
kind: InitConfiguration
nodeRegistration:
  name: ip-10-83-62-10.ec2.internal
  kubeletExtraArgs:
    cloud-provider: was
I'm looking for help to figure out a couple of things here:
Why does kubeadm use this address (https://ec2.us-east-1.amazonaws.com) to retrieve availability zones? It does not look correct. IMO, it should be something like http://169.254.169.254/latest/dynamic/instance-identity/document
Why does it fail? With the same proxy settings, a curl request from the terminal returns the web page.
To work around it, how can I specify the availability zone on my own in kubeadm.yaml or via a command-line option for kubeadm?
I would appreciate any help or thoughts.
You can create a VPC endpoint for accessing EC2 (service name: com.amazonaws.us-east-1.ec2); this will allow the kubelet to talk to EC2 without internet access and fetch the required info.
While creating the VPC endpoint, please make sure to enable the private DNS resolution option.
Also, from the error it looks like the kubelet is trying to fetch the instance itself, not just the availability zone ("aws": error finding instance i-03e00e9192370ca0d: "error listing AWS instances").
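A sketch of creating such an endpoint, with placeholder VPC, subnet, and security group IDs:

# Interface endpoint for the EC2 API in us-east-1, with private DNS so
# https://ec2.us-east-1.amazonaws.com resolves to the endpoint inside the VPC.
aws ec2 create-vpc-endpoint \
  --vpc-id vpc-0123456789abcdef0 \
  --vpc-endpoint-type Interface \
  --service-name com.amazonaws.us-east-1.ec2 \
  --subnet-ids subnet-0123456789abcdef0 \
  --security-group-ids sg-0123456789abcdef0 \
  --private-dns-enabled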

Filebeat Loadbalancing not working

I have set up a Logstash cluster in Google Cloud that sits behind a load balancer and uses autoscaling (when the load gets too high, new instances are started automatically).
Unfortunately, this does not work properly with Filebeat. Filebeat only hits those Logstash VMs that existed prior to starting Filebeat.
Example:
Let's assume I initially have these 3 Logstash hosts running:
Host1
Host2
Host3
When I start up Filebeat, it correctly distributes the messages to Host1, Host2 and Host3.
Now autoscaling kicks in and spins up 2 more instances, Host4 and Host5.
Unfortunately, Filebeat still only sends messages to Host1, Host2 and Host3. The new hosts, Host4 and Host5, are ignored.
When I now restart Filebeat, it sends messages to all 5 hosts!
So it seems Filebeat only sends messages to those hosts that were running when Filebeat started up.
My filebeat.yml looks like this:
filebeat.inputs:
- type: log
  paths:
  ...
...
output.logstash:
  hosts: ["logstash-loadbalancer:5044", "logstash-loadbalancer:5044"]
  worker: 1
  ttl: 2s
  loadbalance: true
I have added the same host (the load balancer) twice because I've read in the forums that otherwise Filebeat won't load-balance messages, and I can confirm that.
But load balancing still does not seem to work properly; e.g. the TTL does not seem to be respected, because Filebeat keeps using the same connections.
Is my configuration wrong? Bug in Filebeat?
Hope you have already resolved this problem. In case you haven't, you should set pipelining to 0 as below (ttl only works if pipelining is set to 0):
output.logstash:
  hosts: ["logstash-loadbalancer:5044", "logstash-loadbalancer:5044"]
  worker: 1
  ttl: 2s
  loadbalance: true
  pipelining: 0