How to run docker task with Amazon ECS - getting error `STOPPED (CannotStartContainerError: Error response from dae)` - amazon-web-services

My goal is to execute a benchmark deployed as a Docker image. While doing so I ran into too many issues, so I decided to first make something extremely trivial work.
So I followed the guide at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/create-task-definition.html
and used the "ping" example - it should just ping a domain a couple of times and stop.
The problem is, I always receive this message in the task status:
STOPPED (CannotStartContainerError: Error response from dae)
I tried it with various subnets and security groups, but the result is always the same - the task starts, and after a minute or two fails with the message above.
I even tried it on a fresh new AWS account, using these steps:
in https://us-east-2.console.aws.amazon.com/ecs/ I created a new cluster (networking only)
in Task Definitions, I created a task definition
with the Docker image alpine:latest and the command ping -c 4 google.com
then I selected the cluster, switched to the "Tasks" tab, and opened the run dialog
with one of the pre-created subnets
After executing:
the task appears in the cluster's task list in the PENDING state
it takes a couple of minutes
eventually (after using the refresh button), it changes to the mentioned message - STOPPED (CannotStartContainerError: Error response from dae)
(a rough CLI equivalent of these steps is sketched at the end of this question)
My guess is that the reason is:
either the task cannot download the image
or the instance cannot reach the outside network
What could I be doing wrong, and how can I fix it?
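For reference, here is a rough CLI equivalent of the console steps above; this is only a sketch, and the cluster name, subnet ID, security group ID and execution role ARN are placeholders to replace with your own:
# register a minimal Fargate task definition for the ping test
aws ecs register-task-definition \
  --family ping-test \
  --requires-compatibilities FARGATE \
  --network-mode awsvpc \
  --cpu 256 --memory 512 \
  --execution-role-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
  --container-definitions '[{"name":"ping","image":"alpine:latest","essential":true,"command":["ping","-c","4","google.com"]}]'
# run it once; assignPublicIp=ENABLED matters when the subnet has no NAT gateway,
# because otherwise the pull of alpine:latest from Docker Hub has no route to the internet
aws ecs run-task \
  --cluster my-test-cluster \
  --launch-type FARGATE \
  --task-definition ping-test \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}'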

In my case too, the log group was the problem. The one I had configured wasn't working, so I enabled the "Auto-configure CloudWatch Logs" option in the "Log Configuration" section of the container settings.
Also, if you open the stopped task and expand its container section, you can see a detailed error message under the Details section.
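The same detailed error can also be read from the CLI; a sketch, assuming a cluster named my-test-cluster and substituting one of the returned task ARNs for the placeholder:
aws ecs list-tasks --cluster my-test-cluster --desired-status STOPPED
aws ecs describe-tasks --cluster my-test-cluster --tasks <task-arn> \
  --query 'tasks[].{stoppedReason:stoppedReason,containers:containers[].{name:name,reason:reason,exitCode:exitCode}}'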

It could be a problem with the entry point in the task definition, as pointed out in the comments on the question, e.g. Entrypoint: ["sh","-c"]
It could also be a bad reference, for example a wrong log group name in the LogConfiguration or something similar.
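For reference, the entry point and command pair in a container definition look roughly like this (a sketch; the image and command are just the ping example from the question):
{
  "name": "ping",
  "image": "alpine:latest",
  "entryPoint": ["sh", "-c"],
  "command": ["ping -c 4 google.com"]
}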

I just created the log group in my CloudWatch console, because it had not been created, and now everything works.
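If you prefer to create the missing log group up front rather than use the auto-configure option, something like this should do it (the group name is just an example):
aws logs create-log-group --log-group-name /ecs/ping-test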

Related

How to view details of a particular pod that has exited, using kubectl

I wish to view the logs of a particular pod (where I know the specific pod name from another logging application my company uses) using kubectl to determine the reason why it has continually been exiting with exitCode 143. However, when I run kubectl get pods, I am unable to see the specific pod I am looking for and only the pods that are running normally are listed. Would anyone know how I can get the details (and thus view the logs) for a specific pod name, even when it's no longer running?
EDIT: I have run kubectl logs <podname> but I cannot seem to find anything related to sigterm/exitCode 143 in the log output - is there another command I should be using?
Try using this command:
kubectl logs <podname> --previous
This will show you the logs from the last run of the pod before it crashed. It is a handy feature when you want to figure out why the pod crashed in the first place.
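If the previous logs are empty, the last state of the container (including the exit code) should also be visible in the pod description, assuming the pod object itself still exists:
kubectl describe pod <podname>
# look for the "Last State: Terminated" block, which includes the exit code (143 here) and a reason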
Within Kubernetes Explorer, the easiest way to get back to logs from former/previous pods may be to use the events tab. There you can see which pods shutdown with the timestamp along with a brief reason and message. Find the previous pod of interest, select it, then in the detail pane there is an option to view logs.
If you want to see details of a deleted pod:
Get a list of recently deleted pod names - up to 1 hour in the past, unless you have changed the TTL for Kubernetes events - by running:
kubectl get event -o custom-columns=NAME:.metadata.name | cut -d "." -f1
You can then investigate further issues within your logging pipeline, if you have one in place.
For exit code 143, refer to this doc; 143 is 128 + 15, i.e. the container was terminated by a SIGTERM.
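To narrow the events down to a single pod rather than listing every name, a field selector can help (the pod name is a placeholder, and the same one-hour event TTL applies):
kubectl get events --field-selector involvedObject.name=<podname> --sort-by=.lastTimestamp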
As far as I know, you cannot get the logs of terminated pods.

No CloudWatch logs for ECS task with reason "Essential container in task exited"

A task runs for a few seconds before terminating (I don't know why), and it's not pushing any logs.
I'm using the "awslogs" driver and the log group exists in CloudWatch.
The "Logs" tab is empty. The log-stream is created in CW but it's devoid of actual log events. There are also no results under Insights for that stream.
The task role has the permissions mentioned at https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_cloudwatch_logs.html .
Any idea what the deal is with the logs?
The command wasn't valid, nor was it comma-separated. The container was terminating too early in the workflow to log anything, yet after the point where any other deployment issue would have been identified. So it looked successful when in reality it wasn't even running yet. Interestingly, it would still take around a minute to terminate, so maybe that includes the overhead of pulling the image.
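For comparison, this is roughly what a valid command looks like; the console's Command field takes a comma-separated list, while the task definition JSON takes an array (the ping command is only an illustration):
Console Command field:  sh,-c,ping -c 4 google.com
Task definition JSON:   "command": ["sh", "-c", "ping -c 4 google.com"]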
The timestamps indicate that the task started and exited after a few seconds. awslogs will only send logs once the container has successfully started, so in this case it may not help. You can follow step 6 of the documentation to diagnose. Specifically, if you have a container that has stopped, expand the container and inspect the Status reason row to see what caused the task state to change. In most cases, that will lead you to the actual cause.

How to debug failed fargate task initialization

I have a fargate task which I have scheduled to run with CloudWatch Event rules, and output a timestamp to a database on a successful run. It also outputs a logfile to CloudWatch for every time it runs.
However, there was one occasion where the log file was not created and the database was not updated. I suspect the task was never even started, or had failed to start.
In CloudWatch, the event rule shows trigger and invocation at the time I expected the task to run, so I assume the task at least attempted to start.
My question is: is there any way I can debug or log information about the cluster failing to start a task?
Please let me know if I need to provide more information.
Edit: I should specify I'm looking for a way to read this information in a log file somewhere. I know I can see failed task reason in the web console, but that's only for relatively recent tasks.
I have posted the same question here: https://www.reddit.com/r/aws/comments/adtqvt/debugging_failed_fargate_task_initialization/ and StackOverflow: https://forums.aws.amazon.com/thread.jspa?messageID=884638&#884638
Go to the cluster and choose the Tasks tab
In the lower pane, choose Stopped for the Desired Task Status value
Locate the desired task and click its GUID
Scroll down to the Containers section and expand the relevant containers that are experiencing errors
You'll see some kind of Status reason for the error. In my case it was:
CannotStartContainerError: API error (500): failed to initialize logging driver: Cannot determine region for awslogs driver
Edit: I can't really take credit for figuring this out - found it here:
https://github.com/aws/amazon-ecs-agent/issues/1654#issuecomment-437178282
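That error means the awslogs driver could not work out which region to send logs to; setting awslogs-region explicitly in the container definition's log configuration is the usual fix. A sketch of that block (the group, region and prefix values are examples):
"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/my-task",
    "awslogs-region": "us-east-2",
    "awslogs-stream-prefix": "ecs"
  }
}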
Try going to "CloudWatch -> Logs -> Insights" and click on "Run Query":
I just faced this problem and the lack of logs did make it quite difficult to resolve.
The problem in my case was the security group used for the task had been deleted. Hope this helps if any one has a similar issue.

GoCD Custom Command

I am trying to run a very simple custom command, echo helloworld, in GoCD as per the Getting Started Guide Part 2. However, the job does not finish, with the console saying "Waiting for console logs" and the raw output saying "Console log for this job is unavailable as it may have been purged by Go or deleted externally".
My job looks like the following, which was taken from typing "echo" in the Lookup Command (this is different from the Getting Started example, which I tried first with the same result).
Judging from the screenshot, the problem seems to be that no agent is assigned to the task. For an agent to be assigned, it must satisfy all of these conditions:
An agent must be running, and connected to the server
The agent must be enabled on the "Agents" page
If you use environments, the job and the agent need to be in the same environment
The agent needs to have all of the resources assigned that are configured in the job
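A way to check the first two conditions without clicking through the UI is the agents API; a sketch assuming the default HTTP port and credentials, with the Accept header version adjusted to match your GoCD release:
curl -u admin:password 'http://localhost:8153/go/api/agents' -H 'Accept: application/vnd.go.cd.v7+json'
# the response lists each agent's state, whether it is enabled, and its environments and resources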
Found the issue.
The Pipelines have to be in the same Environment to work.

Hello World pipeline with ShellCommandActivity

I'm trying to create a simple data pipeline with a single activity of type ShellCommandActivity. I've attached the configuration of the activity and the EC2 resource.
When I execute this, the Ec2Resource sits in the WAITING_ON_DEPENDENCIES state and then after some time changes to TIMEDOUT. The ShellCommandActivity is always in the CANCELED state. I see the instance launch and very quickly change to the terminated state.
I've specified an S3 log file URL, but that never gets updated.
Can anyone give me any pointers? Also, is there any guidance out there on debugging this?
Thanks!!
You are currently forcing your instance to shut down after 1 minute, which gives the TIMEDOUT status if it can't finish in that time. Try increasing it to 50 minutes.
Also make sure you are using an AMI that runs Amazon Linux and that you are using full absolute paths in your scripts.
S3 log files are written as:
s3://bucket/folder/
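A minimal sketch of the pieces of a pipeline definition that control the timeout and the log location; object names and values are examples, and everything unrelated is omitted:
{
  "objects": [
    { "id": "Default", "name": "Default",
      "pipelineLogUri": "s3://my-bucket/datapipeline-logs/",
      "role": "DataPipelineDefaultRole", "resourceRole": "DataPipelineDefaultResourceRole" },
    { "id": "MyEc2Resource", "name": "MyEc2Resource", "type": "Ec2Resource",
      "terminateAfter": "50 Minutes" },
    { "id": "MyShellCommand", "name": "MyShellCommand", "type": "ShellCommandActivity",
      "runsOn": { "ref": "MyEc2Resource" },
      "command": "/bin/echo 'hello world'" }
  ]
}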