I'm running into issues with our Kubernetes deployment: recently, one of the pods has been getting restarted frequently.
The service inside is written in C++, uses Google Logging, and should dump a stack trace on a crash (it does so when run locally).
Unfortunately, the only log message I was able to find related to the pod restart is from containerd, just saying "shim reaped".
Do I need to turn on some extra logging/monitoring to have the reasons for the restarts retained?
You can check the crashed pod's logs by running
$ kubectl logs -f <pod name> -n <namespace> --previous
The pod could have been terminated for a reason such as running out of memory. Use kubectl describe pod <podname>, which contains that information.
There should be output like this (the reason could also be something other than OOM):
Last State: Terminated
Reason: OOMKilled
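The same termination reason can also be pulled straight from the pod status; a minimal sketch, assuming a single container in the pod:
$ kubectl get pod <pod name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'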
Everything was OK and the nodes were fine for months, but suddenly some pods stopped with an error.
I tried deleting the pods and nodes, but the same issues remained.
Try the possible solutions below to resolve your issue:
Solution 1:
Check for a malformed character in your Dockerfile that could cause the container to crash.
The first thing to do when you encounter CreateContainerError is to check that you have a valid ENTRYPOINT in the Dockerfile used to build your container image. However, if you don't have access to the Dockerfile, you can configure your pod object by specifying a valid command in the command attribute of the object.
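A minimal sketch of what that looks like in the pod spec (the pod name, image, command, and args below are placeholders, not taken from the question):
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: my-app
    image: registry.example.com/my-image:latest
    command: ["/app/server"]   # overrides the image ENTRYPOINT
    args: ["--port=8080"]      # overrides the image CMD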
Another workaround is to not specify any workerConfig explicitly, which makes the workers inherit all configs from the master.
Refer to Troubleshooting the container runtime, the similar SO1 and SO2 threads, and this similar GitHub link for more information.
Solution 2:
The kubectl describe pod <podname> command provides detailed information about each of the pods that make up the Kubernetes infrastructure. With its help you can check the Events section for clues.
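For example, a scheduling failure typically shows up in the events like this (the message below is an illustrative scheduler event, not taken from the question):
$ kubectl describe pod <podname> -n <namespace>
...
Events:
  Warning  FailedScheduling  ...  0/3 nodes are available: 3 Insufficient cpu.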
If the events show Insufficient CPU, the solution is to either:
1) Upgrade the boot disk: if you are using a pd-standard disk, it's recommended to upgrade to pd-balanced or pd-ssd.
2) Increase the disk size.
3) Use a node pool with a machine type that has more CPU cores (see the sketch after this list).
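For option 3, a hedged sketch using gcloud (the pool name, machine type, and node count are placeholder assumptions):
$ gcloud container node-pools create high-cpu-pool \
    --cluster=<cluster name> --zone=<zone> \
    --machine-type=e2-standard-8 --num-nodes=3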
See Adjust worker, scheduler, triggerer and web server scale and performance parameters for more information.
If you still have the issue, you can then update the GKE version of your cluster by manually upgrading the control plane to one of the fixed versions.
Also check whether you have updated kubectl in the last year to use the new authentication plugin coming in GKE v1.26.
Solution 3:
If you have a pipeline on GitLab that deploys an image to a GKE cluster, check the version of the GitLab runner that handles the jobs of your pipeline.
It turns out that every image built through a GitLab runner running an old version causes this issue at container start. Deactivate those runners, keep only GitLab runners running the latest version in the pool, and replay all pipelines.
Also check whether the GitLab CI script uses an old Docker image like docker:19.03.5-dind; updating it to docker:dind helps Kubernetes start the pod again.
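A hedged sketch of the relevant .gitlab-ci.yml change (the job name, script, and $CI_REGISTRY_IMAGE usage are assumptions; only the image tag is the point here):
build-image:
  image: docker:dind            # was docker:19.03.5-dind
  services:
    - docker:dind
  script:
    - docker build -t $CI_REGISTRY_IMAGE:latest .
    - docker push $CI_REGISTRY_IMAGE:latest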
We are deploying an application in ECS, and it is exiting due to some error.
We need to log in to the container and check the logs; however, the container stops when the application exits after the error.
How can I keep the container running so that I can ssh into it?
I tried using tail -f /dev/null in the startup.sh script, which is run at container startup.
I need to run the startup.sh script to configure SSH, etc.
However, executing tail -f /dev/null at the end of the script does not seem to keep the container running.
Appreciate any advice on how to keep an ECS container running.
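For reference, a minimal sketch of the kind of startup.sh being described, assuming it is the image's ENTRYPOINT and that sshd is how SSH is set up (both are assumptions, not details from the question):
#!/bin/sh
# placeholder SSH setup -- whatever startup.sh actually configures goes here
/usr/sbin/sshd
# exec makes tail the container's main process, so the container keeps running
exec tail -f /dev/null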
I have an AWS EKS cluster (Kubernetes version 1.14) which runs a JupyterHub application.
One user's notebook server is returning a 500 error:
500 : Internal Server Error
Redirect loop detected. Notebook has JupyterHub version unknown (likely < 0.8), but the hub expects 0.9.6. Try installing JupyterHub==0.9.6 in the user environment if you continue to have problems.
You can try restarting your server from the homepage.
Only one user is experiencing this issue; others are not. When I do "kubectl get pod", this user's pod shows that it is in the "Terminating" state (it appears to be stuck in this state).
I was able to fix it, but I can't say this is the right approach. (I would have preferred to diagnose the root cause)
First, I tried deleting the pod with kubectl delete pod <pod_name> -- it did not work.
Second, I tried force-deleting the pod with kubectl delete pod <pod_name> --grace-period=0 --force -- it worked, but it turns out this only deletes the handle; the pod's resources are then orphaned on the cluster.
I checked the node status with kubectl get node and noticed one node was stuck in the NotReady state. I recycled this node -- it still did not work; the user's notebook server was still stuck and returning the 500 error.
Finally, I simply deleted the user's notebook server from the JupyterHub admin page. This fixed it...
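For future diagnosis, pods stuck in Terminating are often held back by finalizers or by a kubelet on a NotReady node that can no longer confirm the shutdown; a hedged check for the former (the field path is standard):
$ kubectl get pod <pod_name> -o jsonpath='{.metadata.finalizers}'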
The problem:
I have a managed Cloud Composer environment running under a 1.9.7-gke.6 Kubernetes cluster master.
I tried to upgrade it (as well as the default-pool nodes) to 1.10.7-gke.1, since an upgrade was available.
Since then, Airflow has been behaving erratically: tasks that were working properly are failing for no apparent reason. This makes Airflow unusable, since the scheduling becomes unreliable.
Here is an example of a task that runs every 15 minutes and for which the behavior is very visible right after the upgrade:
(screenshot: airflow_tree_view)
Hovering over a failing task only shows an Operator: null message (null_operator). Also, there are no logs at all for that task.
I have been able to reproduce the situation with another Composer environment in order to confirm that the upgrade is the cause of the problem.
What I have tried so far:
I assumed the upgrade might have screwed up either the scheduler or Celery (Cloud Composer defaults to the CeleryExecutor).
I tried restarting the scheduler with the following command:
kubectl get deployment airflow-scheduler -o yaml | kubectl replace --force -f -
I also tried to restart Celery from inside the workers, with
kubectl exec -it airflow-worker-799dc94759-7vck4 -- sudo celery multi restart 1
Celery restarts, but it doesn't fix the issue.
So I tried to restart Airflow completely, the same way I did with airflow-scheduler.
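A sketch of what that looked like, assuming the workers also run as a Deployment named airflow-worker (the name is inferred from the worker pod name above, so treat it as an assumption):
kubectl get deployment airflow-worker -o yaml | kubectl replace --force -f -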
None of these fixed the issue.
As a side note, I can't access Flower to monitor Celery when following this tutorial (Google Cloud - Connecting to Flower). Connecting to localhost:5555 stays in a 'waiting' state forever. I don't know if it is related.
Let me know if I'm missing something!
1.10.7-gke.2 is available now [1]. Can you further upgrade to 1.10.7-gke.2 to see if the issue persists?
[1] https://cloud.google.com/kubernetes-engine/release-notes
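If it helps, a hedged sketch of the corresponding upgrade command (cluster name and zone are placeholders):
gcloud container clusters upgrade <cluster name> --master --cluster-version=1.10.7-gke.2 --zone=<zone>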
I'm running Docker on a t2.micro AWS EC2 instance with Ubuntu.
I'm running several containers. One of my long-running containers (always the same one) has just disappeared for the third time now, after running for about 2-5 days. It is just gone, with no sign of a crash.
The machine has not been restarted (uptime says 15 days).
I do not use the --rm flag: docker run -d --name mycontainer myimage.
There is no exited zombie of this container when running docker ps -a.
There is no log, i.e. docker logs mycontainer does not find any container.
There is no log entry in journalctl -u docker.service within the time frame where the container disappeared. However, there are some other log entries regarding another container (let's call it othercontainer) which are occurring repeatedly about every 6 minutes (it's a cronjob, I don't know if that's relevant):
could not remove cluster networks: This node is not a swarm manager. Use
"docker swarm init" or "docker swarm join" to connect this node to swarm
and try again
Handler for GET /v1.24/networks/othercontainer_default returned error:
network othercontainer_default not found
Firewalld running: false
Even if there were, e.g., an out-of-memory issue or my application just exited, I would still have an exited Docker container zombie in the docker ps -a overview, probably with exit status 0 or != 0, right?
I also don't want to use --restart to restart the container automatically; I just want to see the exited container.
Where can I look for more details to trace the issue?
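One place that does record removals while the daemon is running is the Docker event stream; a hedged example (the flags are standard docker events options, and the container name matches the example above):
$ docker events --since 48h --filter 'container=mycontainer'
Note that this replays past events and then keeps streaming until you interrupt it.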
Versions:
OS: Ubuntu 16.04.2 LTS (Kernel: 4.4.0-1013-aws)
Docker: Docker version 17.03.1-ce, build c6d412e
Thanks to a hint to look at dmesg or the general journalctl, I think I finally found the issue.
Somehow, one of the cronjobs has been running docker system prune -f at its end every 5 minutes. This command basically removes everything that is unused and not running.
I didn't know about this command before, but this has to be how my exited containers got removed without me noticing.
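A quick way to confirm which cron entry is responsible (the paths are the usual Ubuntu cron locations, adjust as needed):
$ grep -rn "docker system prune" /etc/crontab /etc/cron.d /etc/cron.daily /var/spool/cron/crontabs 2>/dev/null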