how to view details of a particular pod that has been exited using kubectl - kubectl

I wish to view the logs of a particular pod (where I know the specific pod name from another logging application my company uses) using kubectl to determine the reason why it has continually been exiting with exitCode 143. However, when I run kubectl get pods, I am unable to see the specific pod I am looking for and only the pods that are running normally are listed. Would anyone know how I can get the details (and thus view the logs) for a specific pod name, even when it's no longer running?
EDIT: I have run kubectl logs <podname> but I cannot seem to find anything related to sigterm/exitCode 143 in the log output - is there another command I should be using?

Try using this command
kubectl logs <podname> --previous
This will show you the logs of the last run of the pod before it crashed. It is a handy feature in case you want to figure out why the pod crashed in the first place
Within Kubernetes Explorer, the easiest way to get back to logs from former/previous pods may be to use the events tab. There you can see which pods shutdown with the timestamp along with a brief reason and message. Find the previous pod of interest, select it, then in the detail pane there is an option to view logs.
If u want to see details of deleted pod:
Get a list of recently deleted pod names - up to 1 hour in the past unless you changed the ttl for kubernetes events - by running:
kubectl get event -o custom-columns=NAME:.metadata.name | cut -d "." -f1
You can then investigate further issues within your logging pipeline if you have one in place.
For exit code 143 refer to this doc.

As far as I know, you cannot get the logs of terminated pods.

Related

How to troubleshoot long pod kill time for GKE?

When using helm upgrade --install I'm every so often running into timeouts. The error I get is:
UPGRADE FAILED
Error: timed out waiting for the condition
ROLLING BACK
If I look in the GKE cluster logs on GCP, I see that when this happens its because this step takes an unusually long time to execute:
Killing container with id docker://{container-name}:Need to kill Pod
I've seen it range from a few seconds to 9 minutes. If I go into the log message's metadata to find the specific container and look at its logs, there is nothing in them suggesting a difference between it and a quickly killed container.
Any suggestions on how to keep troubleshooting this?
You could refer this troubleshooting guide for general issues connected with Google Kubernetes Engine.
As mentioned there, you may need to use the 'Troubleshooting Application' guide for further debugging the application pods or its controller objects.
I am assuming that you checked the logs(1) of the container that resides in the respective pod OR described(2)( look at the reason for termination) it using the below commands. If not, you can try these as well to get more valuable information.
1. kubectl logs POD_NAME -c CONTAINER_NAME -p
2. kubectl describe pods POD_NAME
Note: I saw a similar discussion thread reported in github.com about helm upgrade failure. You can have a look over there as well.

How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)`

My goal is to execute a benchmark deployed as a docker image. While doing so, I had too many issues, so I decided to first make something extremely trivial work.
So I decided to follow the guide in https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and use the "ping" example - it should just ping a domain couple of times, and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ created new cluster (networking only)
in task definitions, created a taskdef
with docker image alpine:latest, command ping -c 4 google.com
then I select the cluster, switch to "tasks" tab, and enter the run dialog
with one of pre-created subnets
After executing:
the task appears in the cluster's tasks list in PENDING state
it takes couple of minutes
eventually (using refresh button), it changes to the mentioned message - STOPPED (CannotStartContainerError: Error response from dae)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach outside net
What can I be doing wrong? How to fix?
In my case too the log group was the problem. The one I had configured wasnt working. Hence I enabled the "Auto-configure CloudWatch Logs" option in the "Log Configuration" of the container settings.
Also if you open the stopped task, navigate to the container section, expand it, under the Details section you can see a detailed error message. Screenshot below
It could be a problem with the entry point as pointed in the comments of the question (in the task definition) Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group in the LogConfiguration or something similar.
I just create de group log in my cloudwatch console because it have not created, and now everything is going well.

How to debug failed fargate task initialization

I have a fargate task which I have scheduled to run with CloudWatch Event rules, and output a timestamp to a database on a successful run. It also outputs a logfile to CloudWatch for every time it runs.
However, there was 1 time where the log file was not created, and the database not updated. I suspect the task was never even started, or had failed to start.
In CloudWatch, the event rule shows trigger and invocation at the time I expected the task to run, so I assume the task at least attempted to start.
My question is: is there any way I can debug or log information about the cluster failing to start a task?
Please let me know if I need to provide more information.
Edit: I should specify I'm looking for a way to read this information in a log file somewhere. I know I can see failed task reason in the web console, but that's only for relatively recent tasks.
I have posted the same question here: https://www.reddit.com/r/aws/comments/adtqvt/debugging_failed_fargate_task_initialization/ and StackOverflow: https://forums.aws.amazon.com/thread.jspa?messageID=884638&#884638
Go to the cluster and choose the Tasks tab
In the lower pane, choose Stopped for the Desired Task Status value
Locate the desired Task and click it's GUID
Scroll down to the Containers section and expand the relevant containers that are experiencing errors
You'll see some kind of Status reason for the error. In my case it was:
CannotStartContainerError: API error (500): failed to initialize logging driver: Cannot determine region for awslogs driver
Edit: I can't really take credit for figuring this out - found it here:
https://github.com/aws/amazon-ecs-agent/issues/1654#issuecomment-437178282
Try going to "CloudWatch -> Logs -> Insights" and click on "Run Query":
I just faced this problem and the lack of logs did make it quite difficult to resolve.
The problem in my case was the security group used for the task had been deleted. Hope this helps if any one has a similar issue.

Concurrent workflow not starting from PMCMD Command

I have a requirement to start workflow concurrently with multiple instances, all instances need to run in parallel. When I run an instance it is running and related param file is being picked up. But when I start another instance to run in parallel with previous instance, it is giving below Error.
"Start Workflow Advanced: ERROR: Workflow [wf_name]: Could not start execution of this workflow because the current run on this Integration Service has not completed yet."
I tried doing this using PMCMDcommand like below. It's starting without any param file and without instance name. But PMCMD log is showing the the workflow is started for the given instance successfully.
pmcmd startworkflow -sv 'INT_......' -d 'DOM_......' -u 'venkat' -p MyPass.... -f 'MyFold...' -nowait -rin $inst_name $wf_name
This is working fine in our test environment. But not working in QA. Is there a configuration setting to avoid this behavior.
Please make sure the workflow is properly configured to allow multiple executions: the Configure Concurrent Execution has to be enabled and Allow concurrent run... needs to be correctly set. If you run with same instance name, the Allow concurent run with same instance name must be chosen. Otherwise, choose the Allow concurent run only with unique instance name, add the instance name and desired parameter file to the list below.
In your command I don't see the parameterfile, so I assume the latter should be the proper setup.
The issue is resolved by restarting the integration service. We did not restart integration service to fix this issue. But that resolved this issue. When we contacted informatica support for resolution, below KB link is provided by them. https://kb.informatica.com/solution/23/Pages/59/501120.aspx
Please find the thread I have opened in Informatica network.
https://network.informatica.com/thread/83540

GoCD Custom Command

I am trying to run a very simple custom command "echo helloworld" in GoCD as per the Getting Started Guide Part 2 however, the job does not finish with the Console saying Waiting for console logs and raw output saying Console log for this job is unavailable as it may have been purged by Go or deleted externally.
My job looks like the following which was taken from typing "echo" in the Lookup Command (which is different to the Getting Started example which I tried first with the same result)
Judging from the screenshot, the problem seems to be that no agent is assigned to the task. For an agent to be assigned, it must satisfy all of these conditions:
An agent must be running, and connected to the server
The agent must be enabled on the "Agents" page
If you use environments, the job and the agent need to be in the same environment
The agent needs to have all of the resources assigned that are configured in the job
Found the issue.
The Pipelines have to be in the same Environment to work.