Kubernetes: Cluster running but unresponsive to changes, cannot retrieve logs - amazon-web-services

I have an existing cluster running k8s version 1.12.8 on AWS EC2. The cluster contains several pods, some serving web traffic and others configured as scheduled CronJobs. The cluster has been running fine in its current configuration for at least 6 months, with CronJobs running every 5 minutes.
Recently, the CronJobs simply stopped. Viewing the pods via kubectl shows that all the scheduled CronJobs last ran at roughly the same time. Logs sent to AWS CloudWatch show no error output, and stop at the same time kubectl shows for the last run.
In trying to diagnose this issue I have found a broader pattern of the cluster being unresponsive to changes, e.g. I cannot retrieve logs or nodes via kubectl.
I deleted Pods in ReplicaSets and they never return. I've set autoscale values on ReplicaSets and nothing happens.
Investigation of the kubelet logs on the master instance revealed repeating errors, coinciding with the time the failure was first noticed:
I0805 03:17:54.597295 2730 kubelet.go:1928] SyncLoop (PLEG): "kube-scheduler-ip-x-x-x-x.z-west-y.compute.internal_kube-system(181xxyyzz)", event: &pleg.PodLifecycleEvent{ID:"181xxyyzz", Type:"ContainerDied", Data:"405ayyzzz"}
...
E0805 03:18:10.867737 2730 kubelet_node_status.go:378] Error updating node status, will retry: failed to patch status "{\"status\":{\"$setElementOrder/conditions\":[{\"type\":\"NetworkUnavailable\"},{\"type\":\"OutOfDisk\"},{\"type\":\"MemoryPressure\"},{\"type\":\"DiskPressure\"},{\"type\":\"PIDPressure\"},{\"type\":\"Ready\"}],"conditions\":[{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"OutOfDisk\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"MemoryPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"DiskPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"PIDPressure\"},{\"lastHeartbeatTime\":\"2020-08-05T03:18:00Z\",\"type\":\"Ready\"}]}}" for node "ip-172-20-60-88.eu-west-2.compute.internal": Patch https://127.0.0.1/api/v1/nodes/ip-x-x-x-x.z-west-y.compute.internal/status?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
...
E0805 03:18:20.869436 2730 kubelet_node_status.go:378] Error updating node status, will retry: error getting node "ip-172-20-60-88.eu-west-2.compute.internal": Get https://127.0.0.1/api/v1/nodes/ip-172-20-60-88.eu-west-2.compute.internal?timeout=10s: context deadline exceeded (Client.Timeout exceeded while awaiting headers)
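To pin down when the failure started, kubelet lines like the ones above can be scanned for the first error of a given kind. The following is a minimal sketch, assuming the log uses the header layout seen in these excerpts (severity letter, MMDD, time, PID, source file and line). The names `parse_klog` and `first_error` are made up for this illustration.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Header layout as seen in the excerpts above:
# <severity><MMDD> <HH:MM:SS.micros> <pid> <file>:<line>] <message>
KLOG_RE = re.compile(
    r"^(?P<sev>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+(?P<pid>\d+) "
    r"(?P<src>[^\s:]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


class KlogEntry(NamedTuple):
    severity: str
    month: int
    day: int
    time: str
    source: str
    message: str


def parse_klog(lines: Iterable[str]) -> Iterator[KlogEntry]:
    """Yield one entry per line that matches the header; skip the rest."""
    for raw in lines:
        m = KLOG_RE.match(raw.strip())
        if m is None:
            continue
        yield KlogEntry(
            severity=m["sev"],
            month=int(m["month"]),
            day=int(m["day"]),
            time=m["time"],
            source=m["src"],
            message=m["msg"],
        )


def first_error(lines: Iterable[str],
                needle: str = "Error updating node status") -> Optional[KlogEntry]:
    """Return the earliest E-level entry whose message contains needle."""
    for entry in parse_klog(lines):
        if entry.severity == "E" and needle in entry.message:
            return entry
    return None
```

Feeding it the kubelet log line by line (for example from `open(path)`) gives the time of the first failed status update, which can then be compared against the restart times of the control-plane containers.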
Running docker ps on the master node shows that both the k8s_kube-controller-manager_kube-controller-manager and k8s_kube-scheduler_kube-scheduler containers were started 6 days ago, whereas the other k8s containers have been up for 8+ months.
tl;dr
A container on my main node (likely kube-scheduler, kube-controller-manager or both) died. The containers have come back up but are unable to communicate with the existing nodes - this is preventing any scheduled CronJobs or new deployments from being satisfied.
How can I reconfigure kubelet and associated services on the master node to communicate with the worker nodes again?

From the docs on Troubleshoot Clusters:
Digging deeper into the cluster requires logging into the relevant machines. Here are the locations of the relevant log files. (note that on systemd-based systems, you may need to use journalctl instead)
Master Nodes
/var/log/kube-apiserver.log - API Server, responsible for serving the API
/var/log/kube-scheduler.log - Scheduler, responsible for making scheduling decisions
/var/log/kube-controller-manager.log - Controller that manages replication controllers
Worker Nodes
/var/log/kubelet.log - Kubelet, responsible for running containers on the node
/var/log/kube-proxy.log - Kube Proxy, responsible for service load balancing
Another way to get logs is to use docker ps to get containerid and then use docker logs containerid
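The file-or-journalctl choice described above can be sketched as a small helper that returns the command to run for a given component. This is only an illustration: the paths are the ones listed in the docs excerpt, while the assumption that the systemd unit is named after the component (for example `kubelet`) may not hold on every distribution, and `log_command` is a made-up name.

```python
import os
from typing import Callable, List

# Log locations as listed in the docs excerpt above.
LOG_FILES = {
    "kube-apiserver": "/var/log/kube-apiserver.log",
    "kube-scheduler": "/var/log/kube-scheduler.log",
    "kube-controller-manager": "/var/log/kube-controller-manager.log",
    "kubelet": "/var/log/kubelet.log",
    "kube-proxy": "/var/log/kube-proxy.log",
}


def log_command(component: str, lines: int = 200,
                exists: Callable[[str], bool] = os.path.exists) -> List[str]:
    """Build the command that shows the last `lines` log lines of a component."""
    path = LOG_FILES[component]
    if exists(path):
        return ["tail", "-n", str(lines), path]
    # Assumption: on systemd-based hosts the unit carries the component's name.
    return ["journalctl", "-u", component, "-n", str(lines), "--no-pager"]
```

The returned list can be passed to `subprocess.run` on the machine in question. For components that run as containers, `docker ps` followed by `docker logs` on the container ID remains the alternative.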
If you have a monitoring system set up using Prometheus and Grafana (which you should), you can check metrics such as CPU load on the API server pods.

Related

Airflow Scheduler - Ephemeral Storage - Evicted

I've been running into what should be a simple issue with my Airflow scheduler. Every couple of weeks, the scheduler pod is Evicted. When I run a describe on the pod, the reason given is: The node was low on resource: ephemeral-storage. Container scheduler was using 14386916Ki, which exceeds its request of 0.
The question is twofold. First, why is the scheduler using ephemeral-storage? And second, is it possible to add ephemeral-storage when running on EKS?
Thanks!
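For a sense of scale, the figure in the eviction message can be converted from Ki to Gi. This is a small sketch that assumes the message keeps the wording quoted above; `evicted_usage_gib` is a made-up helper name.

```python
import re

KI_PER_GI = 1024 * 1024  # 1 Gi = 1024 Mi = 1024 * 1024 Ki


def evicted_usage_gib(message: str) -> float:
    """Extract the 'was using <N>Ki' figure from an eviction message, in Gi."""
    match = re.search(r"was using (\d+)Ki", message)
    if match is None:
        raise ValueError("no 'was using <N>Ki' figure found in message")
    return int(match.group(1)) / KI_PER_GI
```

For the message above this comes to roughly 13.7 Gi written to the container's ephemeral storage, against a request of 0.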
I believe ephemeral storage is not an Airflow question but rather a matter of how your K8s cluster is configured.
Assuming we are talking about OpenShift's ephemeral storage:
https://docs.openshift.com/container-platform/4.9/storage/understanding-ephemeral-storage.html
This can be configured in your cluster, and it will make "/var/log" ephemeral.
I think the problem is that /var/log gets full, possibly with system logs (not from Airflow but from other processes running in the same container). I think a solution would be to have a job that cleans those logs periodically.
For example, we have this script that cleans up Airflow logs:
https://github.com/apache/airflow/blob/main/scripts/in_container/prod/clean-logs.sh
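The periodic cleanup idea can be sketched in a few lines. This is not the linked script, just an illustration of the same approach: delete files under a log directory whose modification time is older than a retention window. The function name and the 15-day default are arbitrary choices made for this example.

```python
import time
from pathlib import Path
from typing import List, Optional, Union


def clean_old_logs(directory: Union[str, Path], retention_days: int = 15,
                   now: Optional[float] = None) -> List[Path]:
    """Delete files under `directory` last modified before the retention window."""
    current = time.time() if now is None else now
    cutoff = current - retention_days * 86400
    removed = []
    # Materialise the listing first so deletions do not disturb the traversal.
    for path in list(Path(directory).rglob("*")):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Run from a CronJob or a sidecar loop, something like `clean_old_logs("/path/to/logs")` would keep the directory from growing without bound. The path here is a placeholder and should be replaced with wherever the logs actually accumulate.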

AWS ECS Fargate task stopping and restarting somewhat randomly

One of my ECS Fargate tasks is stopping and restarting in what seems to be a somewhat random fashion. I started the task in Dec 2019 and it has stopped/restarted three times since then. I found in its 'Events' log (image below) that the task stopped and restarted, but there's no info provided as to why it stopped.
So what I've tried to do to date to debug this is
Checked the 'Stopped' tasks inside the cluster for info as to why it might have stopped. No luck here as it appears 'Stopped' tasks are only held there for a short period of time.
Checked CloudWatch logs for any log messages that could be pertinent to this issue, nothing found
Checked CloudTrail event logs for any event pertinent to this issue, nothing found
Confirmed the memory and CPU allocation is sufficient for the task; in fact the task never reaches 30% of its limits
Read multiple AWS threads about similar issues where solutions mainly seem to be connected to using an ELB, which I'm not
Does anyone have any further debugging advice or ideas about what might be going on here?
I ran into the same issue and found this from AWS:
https://docs.aws.amazon.com/AmazonECS/latest/userguide/task-maintenance.html
When AWS determines that a security or infrastructure update is needed
for an Amazon ECS task hosted on AWS Fargate, the tasks need to be
stopped and new tasks launched to replace them.
There is also a GitHub issue on storing stopped-task info in CloudWatch Logs:
https://github.com/aws/amazon-ecs-agent/issues/368
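Since stopped tasks are only visible for a short time, it can help to capture their stop details while they are still listed. The sketch below assumes you already have a response dict in the shape returned by the ECS DescribeTasks API (for example from `aws ecs describe-tasks` or boto3's `describe_tasks`) and pulls out the fields that explain a stop. `summarize_stopped_tasks` is a made-up helper name.

```python
from typing import Any, Dict, List


def summarize_stopped_tasks(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect stop details for every STOPPED task in a DescribeTasks response."""
    summaries = []
    for task in response.get("tasks", []):
        if task.get("lastStatus") != "STOPPED":
            continue
        summaries.append({
            "taskArn": task.get("taskArn"),
            "stopCode": task.get("stopCode"),
            "stoppedReason": task.get("stoppedReason"),
            # Per-container exit codes often say more than the task-level reason.
            "containers": [
                {
                    "name": c.get("name"),
                    "exitCode": c.get("exitCode"),
                    "reason": c.get("reason"),
                }
                for c in task.get("containers", [])
            ],
        })
    return summaries
```

Logging this summary somewhere durable is one way to keep the reason around after the stopped task disappears from the console.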

AWS ECS Spring Boot Task killed and restarted on background work

I have a Spring Boot web application running on AWS ECS service on Fargate with a desired count of 1. It's configured with a LB in front for SSL termination and healthchecks.
Each night via @Scheduled I run a batch job that does some recalculations. At various points either during or shortly after that job runs, my task is killed and a new one is spun up. While the job is running I notice a few things:
CPU on the service (via cloud watch) spikes to above 60%
My health checks from the load balancer still respond in a good amount of time
There are no errors in my spring boot logs
In the ECS service events I see service sname-app-lb deregistered 1 targets in target-group ecs-sname-app-lb
I'm trying to figure out how to tell exactly why the task is being killed. Any tips on how to debug / fix this would be greatly appreciated.
I have had a similar experience in the past. This is what you need to do:
1. Make sure you are streaming the application logs to CloudWatch using the awslogs driver in the task definition (if you are not doing so already).
2. Put a delay in the app's catch block or handler wherever it can fail. This delay makes sure that the application logs are sent to CloudWatch Logs in the event of an exception, and thus prevents an abrupt exit of the task.
I initially thought it was a Fargate issue, but the above really helped me understand the underlying issue. All the best.
If you are running your Spring application inside Docker on AWS Fargate and it hits the memory limit, your application could get killed.
More information: https://developers.redhat.com/blog/2017/03/14/java-inside-docker/

Why do Kubernetes worker nodes become NodeNotReady?

Worker nodes were unexpectedly dropped from cluster by master, for unknown reason.
The cluster has the following setup:
AWS
Multi-az configured
Clustered masters, clusters (across AZs)
Flannel networking
Provisioned using CoreOS's kube-aws
An incident of unknown origin occurred, wherein during the span of seconds, all worker nodes were dropped from the master. The only relevant log entry that we could find was for kube-controller-manager:
I0217 14:19:11.432691 1 event.go:217] Event(api.ObjectReference{Kind:"Node", Namespace:"", Name:"ip-XX-XX-XX-XX.ec2.internal", UID:"XXX", APIVersion:"", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'NodeNotReady' Node ip-XX-XX-XX-XX.ec2.internal status is now: NodeNotReady
The nodes returned to "ready" approximately 10 minutes later.
We have yet to locate the cause of why the node transitioned to NodeNotReady.
We have so far looked through logs of various system components including:
flannel
kubelet
etcd
controller-manager
One potential noteworthy item is that the active master of the cluster currently resides in a different AZ from the nodes. This should be OK, but could be the source of network connectivity problems. That being said, we have seen no indication in logs / monitoring of inter-AZ connection problems.
Checking the kubelet logs, there was no clear logging event of the nodes changing their state to "not ready" or otherwise. Additionally, there was no clear indication of any fatal events either.
One item that could be noteworthy, is that all kubelets logged after the outage:
Error updating node status, will retry: error getting node "ip-XX-XX-XX-XX.ec2.internal": Get https://master/api/v1/nodes?fieldSelector=metadata.name%3Dip-XX-XX-XX-XX.ec2.internal&resourceVersion=0: read tcp 10.X.X.X:52534->10.Y.Y.Y:443: read: no route to host".
Again please note, these log messages were logged after the nodes had re-joined the cluster (there was a clear ~10min window between cluster collapse and nodes rejoining).

Why does every command Elastic Beanstalk issues to its instance time out?

I have a PHP application deployed to Amazon Elastic Beanstalk. I've noticed that every time I push my code changes to Elastic Beanstalk via git aws.push, the deployed application doesn't pick up the changes. I checked the events log on my application's Beanstalk environment and noticed that every time Beanstalk issues:
Deploying new version to instance(s)
it's always followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own):
[i-d5xxxxx]
The same thing happens when I try to request snapshot logs. The Beanstalk issues:
requestEnvironmentInfo is starting
then after a few minutes it's again followed by:
The following instances have not responded in the allowed command timeout time (they might still finish eventually on their own): [i-d5xxxxx].
I had this problem a few times. It seems to affect only particular instances. So it can be solved by terminating the EC2 instance (done via the EC2 page on the Management Console). Thereafter, Elastic Beanstalk will detect that there are 0 healthy instances and automatically launch a new one.
If this is a production environment with only 1 instance and you want minimal downtime:
Configure minimum instances to 2, and Beanstalk will launch another instance for you.
Terminate the problematic instance via the EC2 tab. Beanstalk will launch another instance for you because the minimum is 2.
Configure minimum instances back to 1, and Beanstalk will remove one of your two instances.
By default Elastic Beanstalk "throws a timeout exception" after 8 minutes (480 seconds defined in settings) if your commands do not complete in time.
You can set a higher timeout, up to 30 minutes (1800 seconds).
{
"Namespace": "aws:elasticbeanstalk:command",
"OptionName": "Timeout",
"Value": "1800"
}
Read here: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/command-options.html
I had the same issue here (single t1.micro instance).
I solved the problem by rebooting the EC2 instance via the EC2 page on the Management Console (and not from the EB page).
Beanstalk deployment (and other features like Get Logs) work by sending SQS commands to instances. An SQS client is deployed to the instances and checks the queue about every 20 seconds (see /var/log/cfn-hup.log):
2018-05-30 10:42:38,605 [DEBUG] Receiving messages for queue https://sqs.us-east-2.amazonaws.com/124386531466/93b60687a33e19...
If SQS Client crashes or has network problems on t1/t2 instances then it will not be able to receive commands from Beanstalk, and deployment would time out. Rebooting instance restarts SQS Client, and it can receive commands again.
An easier way to fix SQS Client is to restart cfn-hup service:
sudo service cfn-hup restart
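Before restarting anything, it may be worth confirming that the poller really has gone quiet. The sketch below assumes /var/log/cfn-hup.log lines start with a timestamp in the format shown above and treats the poller as stale when the latest "Receiving messages for queue" entry is older than a threshold. The function names and the two-minute default are illustrative choices, not anything Beanstalk defines.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional

POLL_MARKER = "Receiving messages for queue"
TIMESTAMP_LEN = len("2018-05-30 10:42:38,605")


def last_poll_time(lines: Iterable[str]) -> Optional[datetime]:
    """Return the newest timestamp among lines that record a queue poll."""
    latest = None
    for line in lines:
        if POLL_MARKER not in line:
            continue
        try:
            stamp = datetime.strptime(line[:TIMESTAMP_LEN], "%Y-%m-%d %H:%M:%S,%f")
        except ValueError:
            continue  # line does not start with the expected timestamp
        if latest is None or stamp > latest:
            latest = stamp
    return latest


def poller_is_stale(lines: Iterable[str], now: datetime,
                    max_gap: timedelta = timedelta(minutes=2)) -> bool:
    """True when no poll was logged, or the last one is older than max_gap."""
    latest = last_poll_time(lines)
    return latest is None or now - latest > max_gap
```

With polls expected roughly every 20 seconds, a gap of several minutes in the log is a reasonable hint that the client has stopped receiving commands. Note that `now` must be in the same timezone as the log timestamps.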
In the case of deployment, an alternative to shutting down the EC2 instances and waiting for Elastic Beanstalk to react, or messing about with minimum and maximum instances, is to simply perform a Rebuild environment on the target environment.
If a previous deployment failed due to timeout then the new version will still be registered against the environment, but due to the timeout it will not appear to be operational (in my experience the instance appears to still be running the old version).
Rebuilding the environment seems to reset things with the new version being used.
Obviously there's the downside with that of a period of downtime.
I think the correct way to deal with this is to figure out the cause of the timeout by doing what this answer suggests.
chongzixin's answer is what needs to be done if you need this fixed ASAP before investigating the reason for a timeout.
However, if you do need to increase timeout, see the following:
Add configuration files to your source code in a folder named .ebextensions and deploy it in your application source bundle.
Example:
option_settings:
  "aws:elasticbeanstalk:command":
    Timeout: 2400
*"value" represents the length of time before timeout in seconds.
Reference: https://serverfault.com/a/747800/496353
"Restart App Server(s)" from the "Actions" menu in Elastic Beanstalk management dashboard followed by eb deploy fixes it for me.
Visual cue for the first instruction
After two days of checking random issues, I restarted both EC2 instances one after another to make sure there was no downtime. The site worked fine, but after a while it started throwing 504 errors.
When I checked the HTTP server, nginx was down and an "Out of HDD space" error had been thrown. I increased the HDD size, Elastic Beanstalk created new instances, and the issue was fixed.
For me, the problem was my VPC security group rules. According to the docs, you need to allow outbound traffic on port 123 for NTP to work. I had the port closed, so the clock was drifting, and so the EC2 instances were becoming unresponsive to commands from the Elastic Beanstalk environment, taking forever to deploy (only to time out), failing to get logs, etc.
Thank you @Logan Pickup for the hint in your comment.
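To check whether an instance can reach NTP at all and how far its clock has drifted, a bare-bones SNTP query over UDP port 123 is enough. The sketch below is a rough illustration rather than a proper NTP client: it ignores network round-trip time, and the default host `pool.ntp.org` is an assumption, so substitute whichever time source your instances are meant to use.

```python
import socket
import struct
import time

NTP_UNIX_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request() -> bytes:
    """48-byte SNTP client request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet: bytes) -> float:
    """Read the server's transmit timestamp (bytes 40-47) as Unix time."""
    if len(packet) < 48:
        raise ValueError("SNTP packet must be at least 48 bytes")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def clock_offset(host: str = "pool.ntp.org", timeout: float = 5.0) -> float:
    """Approximate server-minus-local clock offset in seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_request(), (host, 123))
        packet, _ = sock.recvfrom(512)
    return parse_transmit_time(packet) - time.time()
```

If outbound port 123 is blocked by the security group, `clock_offset()` will most likely raise a timeout error instead of returning a value, which is itself a useful signal.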