How to troubleshoot long pod kill time for GKE?

When using helm upgrade --install, I occasionally run into timeouts. The error I get is:
UPGRADE FAILED
Error: timed out waiting for the condition
ROLLING BACK
If I look in the GKE cluster logs on GCP, I see that when this happens it's because this step takes an unusually long time to execute:
Killing container with id docker://{container-name}:Need to kill Pod
I've seen it range from a few seconds to 9 minutes. If I go into the log message's metadata to find the specific container and look at its logs, there is nothing in them suggesting a difference between it and a quickly killed container.
Any suggestions on how to keep troubleshooting this?

You could refer to this troubleshooting guide for general issues connected with Google Kubernetes Engine.
As mentioned there, you may need to use the 'Troubleshooting Application' guide to further debug the application pods or their controller objects.
I am assuming that you have already checked the logs (1) of the container in the affected pod, or described (2) the pod to look at the reason for termination, using the commands below. If not, try them as well to get more information.
1. kubectl logs POD_NAME -c CONTAINER_NAME -p
2. kubectl describe pods POD_NAME
Note: I saw a similar discussion thread on GitHub about helm upgrade failures. You can have a look there as well.
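As a complement to kubectl describe, the termination details can also be pulled out of the pod status directly. This is only a sketch: it assumes the pod still exists and that its containers have restarted at least once, and the function name last_termination is made up for illustration. The field path .status.containerStatuses[*].lastState.terminated is the standard pod status field.

```shell
# Print one line per container: name, last termination reason, exit code.
# Hypothetical helper; only the kubectl invocation itself is standard.
last_termination() {
  pod=$1
  kubectl get pod "$pod" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage (POD_NAME is a placeholder):
#   last_termination POD_NAME
```

Containers that have never been terminated leave the reason and exit code columns empty.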

Related

How to view details of a particular pod that has exited using kubectl

I wish to view the logs of a particular pod (where I know the specific pod name from another logging application my company uses) using kubectl to determine the reason why it has continually been exiting with exitCode 143. However, when I run kubectl get pods, I am unable to see the specific pod I am looking for and only the pods that are running normally are listed. Would anyone know how I can get the details (and thus view the logs) for a specific pod name, even when it's no longer running?
EDIT: I have run kubectl logs <podname> but I cannot seem to find anything related to sigterm/exitCode 143 in the log output - is there another command I should be using?

Try using this command
kubectl logs <podname> --previous
This will show you the logs of the previous run of the container in the pod, from before it crashed. It is a handy feature when you want to figure out why the pod crashed in the first place.
Within Kubernetes Explorer, the easiest way to get back to logs from former/previous pods may be to use the events tab. There you can see which pods shut down, with the timestamp along with a brief reason and message. Find the previous pod of interest, select it, then in the detail pane there is an option to view logs.
If you want to see details of a deleted pod:
Get a list of recently deleted pod names (up to 1 hour in the past, unless you changed the TTL for Kubernetes events) by running:
kubectl get event -o custom-columns=NAME:.metadata.name | cut -d "." -f1
You can then investigate further issues within your logging pipeline if you have one in place.
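To see what the cut step does, here is the same idea applied to made-up sample output instead of a live cluster. Event names for a pod have the form <pod-name>.<hex suffix>, so keeping the text before the first dot recovers the pod name. The sample names are invented, and the header skip and de-duplication are additions that are not part of the original command.

```shell
# Reduce `kubectl get event -o custom-columns=NAME:.metadata.name` output
# (read from stdin) to a unique list of object names.
names_from_events() {
  tail -n +2 |        # drop the NAME header row
    cut -d "." -f1 |  # keep everything before the first dot
    sort -u           # one line per name
}

# Invented sample output, standing in for a real cluster.
sample_events='NAME
web-7d4b9c-abcde.17a3f0c2b1d4e5f6
web-7d4b9c-abcde.17a3f0c9aa01b2c3
worker-0.17a3f1d2e3f4a5b6'

printf '%s\n' "$sample_events" | names_from_events
```

Against a real cluster the input would come from the kubectl command above. Note that the list also contains names of non-pod objects that emitted events, and a pod name that itself contains a dot would be cut short.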
For exit code 143 refer to this doc.
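For context on the number itself: by convention, an exit code above 128 means the process was ended by signal number (code - 128), so 143 corresponds to signal 15, SIGTERM. A small sketch of that arithmetic follows. The function name and the short signal table are only for illustration and use the standard Linux signal numbers.

```shell
# Translate a container exit code into a readable description.
describe_exit_code() {
  code=$1
  if [ "$code" -le 128 ]; then
    echo "exit status $code"
    return
  fi
  sig=$((code - 128))
  case "$sig" in
    1)  name=SIGHUP ;;
    2)  name=SIGINT ;;
    9)  name=SIGKILL ;;
    15) name=SIGTERM ;;
    *)  name="signal $sig" ;;
  esac
  echo "terminated by $name"
}

describe_exit_code 143   # terminated by SIGTERM
describe_exit_code 137   # terminated by SIGKILL
```

An exit code of 143 therefore suggests the container was asked to stop (for example during a rollout or eviction) rather than crashing on its own, which would also explain why the application log shows nothing unusual.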

As far as I know, you cannot get the logs of terminated pods.

How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)`

My goal is to execute a benchmark deployed as a docker image. While doing so, I had too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example - it should just ping a domain a couple of times, and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ created new cluster (networking only)
in task definitions, created a taskdef
with docker image alpine:latest, command ping -c 4 google.com
then I select the cluster, switch to "tasks" tab, and enter the run dialog
with one of pre-created subnets
After executing:
the task appears in the cluster's tasks list in PENDING state
it takes a couple of minutes
eventually (using refresh button), it changes to the mentioned message - STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach outside net
What could I be doing wrong, and how can I fix it?

In my case too, the log group was the problem. The one I had configured wasn't working, so I enabled the "Auto-configure CloudWatch Logs" option in the "Log Configuration" of the container settings.
Also, if you open the stopped task, navigate to the container section, and expand it, you can see a detailed error message under the Details section.

It could be a problem with the entry point in the task definition, as pointed out in the comments on the question: Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
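One way an entry point of ["sh","-c"] can misbehave, which may or may not be what the comments referred to: sh -c takes only the next argument as the script, and any further arguments become positional parameters ($0, $1, ...). If the command arrives as separate words, only the first word runs. The sketch below shows this locally, with echo standing in for ping; no ECS call is involved.

```shell
# Command passed as separate words: the script is just "echo",
# and "hello" / "world" become $0 and $1, so nothing is printed.
split_words=$(sh -c echo hello world)

# Command passed as a single string: the whole line is the script.
single_string=$(sh -c 'echo hello world')

printf 'split words:   [%s]\n' "$split_words"
printf 'single string: [%s]\n' "$single_string"
```

With such an entry point, the command would need to reach the container as one string for all of its arguments to take effect.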

I just created the log group in my CloudWatch console, because it had not been created, and now everything works.

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud Composer environment (version 1.9.0), which runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometimes in the morning, I see that a few tasks failed during the night without a reason.
When I check the log in the UI, I see no log, and there is no log in the log folder in the GCS bucket either.
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it runs successfully on the first try.
Has anyone encountered a similar problem?

Empty logs often mean the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out-of-memory condition. If you go to your GKE cluster (the one under Composer's hood), you will probably see that there is indeed an evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Task Instances" that those tasks have no Start Date, Job Id, or worker (Hostname) assigned, which, together with the missing logs, is strong evidence that the pod died.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

GoCD Custom Command

I am trying to run a very simple custom command, "echo helloworld", in GoCD as per the Getting Started Guide Part 2. However, the job does not finish: the Console says "Waiting for console logs" and the raw output says "Console log for this job is unavailable as it may have been purged by Go or deleted externally."
My job looks like the following, which was taken from typing "echo" in the Lookup Command (this differs from the Getting Started example, which I tried first with the same result).

Judging from the screenshot, the problem seems to be that no agent is assigned to the task. For an agent to be assigned, it must satisfy all of these conditions:
An agent must be running, and connected to the server
The agent must be enabled on the "Agents" page
If you use environments, the job and the agent need to be in the same environment
The agent needs to have all of the resources assigned that are configured in the job

Found the issue.
The Pipelines have to be in the same Environment to work.

Kubernetes pods without any affinity suddenly stop scheduling because of MatchInterPodAffinity predicate

Without any known changes in our Kubernetes 1.6 cluster, all new or restarted pods are no longer scheduled. The error I get is:
No nodes are available that match all of the following predicates:: MatchInterPodAffinity (10), PodToleratesNodeTaints (2).
Our cluster was working perfectly before, and I really cannot see any configuration changes that were made before this occurred.
Things I already tried:
restarting the master node
restarting kube-scheduler
deleting affected pods, deployments, stateful sets
Some of the pods do have anti-affinity settings that worked before, but most pods do not have any affinity settings.
Cluster info:
Kubernetes 1.6.2
Kops on AWS
1 master, 8 main-nodes, 1 tainted data processing node
Is there any known cause for this?
Which settings and logs could I check to get more insight?
Is there any possibility to debug the scheduler?

The problem was that a Pod got stuck in deletion. That caused kube-controller-manager to stop working.
Deletion didn't work because the Pod/RS/Deployment in question had limits that conflicted with the maxLimitRequestRatio that we had set after the creation. A bug report is on the way.
The solution was to increase maxLimitRequestRatio and eventually restart kube-controller-manager.
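To make the conflict concrete: a LimitRange's maxLimitRequestRatio caps the limit divided by the request for a resource. The sketch below checks that rule for CPU values in millicores. The function name and the numbers are hypothetical, not taken from the cluster in question.

```shell
# Succeeds when limit / request <= max_ratio (integer millicores).
# Multiplying instead of dividing avoids integer-division rounding.
ratio_allowed() {
  limit_m=$1
  request_m=$2
  max_ratio=$3
  [ "$limit_m" -le $((request_m * max_ratio)) ]
}

# Hypothetical pod: request 100m, limit 1000m, i.e. a ratio of 10.
if ratio_allowed 1000 100 4; then
  echo "max ratio 4: allowed"
else
  echo "max ratio 4: rejected"
fi
if ratio_allowed 1000 100 10; then
  echo "max ratio 10: allowed"
else
  echo "max ratio 10: rejected"
fi
```

A pod created before a stricter ratio was introduced can end up on the "rejected" side of this check, which matches the situation described above where raising maxLimitRequestRatio resolved it.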
