StoppedReason in ECS Fargate is truncated

In ECS Fargate, when a task fails, there is a "Stopped Reason" field which provides some useful logging. However, I have noticed that it gets truncated after 255 characters.
I checked the network tab and inspected the JSON of the HTTP response; it is truncated there as well, so the truncation happens server-side. Is there any way to get the complete message?
I found a thread where the same problem is discussed.
How can I see the whole, untruncated error message?

I eventually found the whole error message in CloudTrail. I searched by "Username", entering the task GUID as the username, which narrowed down the number of events I had to sift through. The full error message was in a "GetParameters" event.
Just FYI for anyone who reads this answer: the task GUID is the ID at the end of the taskArn; if you go to the Task in the console, it is the ID shown for the task, e.g. fc0398a94d8b4d4d9218a6d870914a80 –
Emmanuel N K
Jun 21 at 13:21
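For reference, a minimal boto3 sketch of that CloudTrail search, using the example task GUID from the comment above (substitute your own GUID):

import boto3

cloudtrail = boto3.client("cloudtrail")

# Search CloudTrail events where the "username" is the ECS task GUID
response = cloudtrail.lookup_events(
    LookupAttributes=[{
        "AttributeKey": "Username",
        "AttributeValue": "fc0398a94d8b4d4d9218a6d870914a80",  # your task GUID here
    }]
)
for event in response["Events"]:
    if event.get("EventName") == "GetParameters":
        print(event["CloudTrailEvent"])  # raw event JSON, including the untruncated reason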

Related

Is there a way to see the actual error message in AWS X-Ray instead of State?

I have enabled X-Ray for my Step Functions state machine, and in the X-Ray trace map I can locate, in the Subsegment section, which step in the state machine has caught an error, but it only says States.TaskFailed with no actual error message.
However, if I navigate to the Step Functions execution event history, I can locate a 'TaskStateExited' event, and I see something like:
"name": "xxxxxxx",
"output": {
"Error": "States.TaskFailed",
"Cause": xxxxxxxxxxx (the actual error message)
I wonder if there is a way to see this error message directly in X-Ray without navigating to the specific execution event history. Since X-Ray is supposed to make monitoring easier and help us debug, why doesn't it show the actual error message in the trace map?
I've only been able to do this manually by running my code in a try/except block which traps the error and then using a subsegment.addError() call to add the exception information to the trace segment before re-throwing the exception. I'm not sure of a way to get X-Ray to do that automatically... here's a thread on the AWS forums that provides a bit of background: https://forums.aws.amazon.com/thread.jspa?threadID=282800&tstart=0
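That answer refers to the Node.js SDK's addError() call; a rough equivalent in Python with the aws-xray-sdk (the subsegment name and the work function are placeholders) looks like this:

import traceback
from aws_xray_sdk.core import xray_recorder

subsegment = xray_recorder.begin_subsegment("call-downstream")  # placeholder name
try:
    do_work()  # placeholder for the code that may raise
except Exception as exc:
    # Attach the exception message and stack to the subsegment so it shows in the trace
    subsegment.add_exception(exc, traceback.extract_stack())
    raise  # re-throw so the failure still propagates to Step Functions
finally:
    xray_recorder.end_subsegment()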
Step Functions sends all the collected Errors and Causes to X-Ray, but they may not appear in the subsegment for the state; check the one for the task instead.
1- In X-Ray, check "Raw data" tab of the trace. Does the error appear in the JSON there?
2- In the Timeline tab you should be able to see the error under the task subsegment rather than the state subsegment.
If you still can't find the error in X-Ray, please post the state machine definition and the raw trace JSON.
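If you'd rather pull that raw trace JSON programmatically, a minimal boto3 sketch (the trace ID is a placeholder):

import boto3

xray = boto3.client("xray")

# Fetch the raw segment documents for a trace; error/cause details live inside them
response = xray.batch_get_traces(TraceIds=["TRACE-ID"])  # placeholder trace ID
for trace in response["Traces"]:
    for segment in trace["Segments"]:
        print(segment["Document"])  # raw segment JSON, including any error fields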

AWS Lambda Stops Running Randomly

Has anyone ever seen a Lambda function stop randomly?
I've got a method that subscribes to an SNS topic that is published to every hour. I recently had four messages come through to the subscriber Lambda, and three of the four worked perfectly.
For those three, CloudWatch shows all of the console logs I emit, I get responses from all of the APIs the method reaches out to, and each run ends with a success message. The fourth message logs its console log to CloudWatch, and then I get the "Request End" log immediately following, with nothing in between: no further console.logs, no error from Lambda, no timeout error, and no insight as to why it would have stopped. It just stopped.
I've never seen this kind of behavior before, and have yet to see a Lambda function stop working without logging the error (everything that runs is wrapped in a try/catch that has reliably logged errors until now).
Has anyone ever come across this and have any insight as to what it may be?

How to debug failed fargate task initialization

I have a Fargate task which I have scheduled to run with CloudWatch Event rules; on a successful run it writes a timestamp to a database, and it also writes a log file to CloudWatch every time it runs.
However, there was one occasion where the log file was not created and the database was not updated. I suspect the task was never even started, or failed to start.
In CloudWatch, the event rule shows a trigger and an invocation at the time I expected the task to run, so I assume the task at least attempted to start.
My question is: is there any way I can debug or log information about the cluster failing to start a task?
Please let me know if I need to provide more information.
Edit: I should specify that I'm looking for a way to read this information from a log file somewhere. I know I can see the failed task reason in the web console, but that is only available for relatively recent tasks.
I have posted the same question on Reddit: https://www.reddit.com/r/aws/comments/adtqvt/debugging_failed_fargate_task_initialization/ and on the AWS forums: https://forums.aws.amazon.com/thread.jspa?messageID=884638&#884638
1- Go to the cluster and choose the Tasks tab
2- In the lower pane, choose Stopped for the Desired Task Status value
3- Locate the desired task and click its GUID
4- Scroll down to the Containers section and expand the relevant containers that are experiencing errors
5- You'll see some kind of Status reason for the error. In my case it was:
CannotStartContainerError: API error (500): failed to initialize logging driver: Cannot determine region for awslogs driver
Edit: I can't really take credit for figuring this out - found it here:
https://github.com/aws/amazon-ecs-agent/issues/1654#issuecomment-437178282
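Since stopped tasks only stay visible in the console for a short while, here is a hedged boto3 sketch that pulls the same stop reasons programmatically (the cluster name is a placeholder); you could run it on a schedule and write the output to your own log:

import boto3

ecs = boto3.client("ecs")
cluster = "my-cluster"  # placeholder cluster name

# List recently stopped tasks, then fetch the stop reason for each one
task_arns = ecs.list_tasks(cluster=cluster, desiredStatus="STOPPED")["taskArns"]
if task_arns:
    for task in ecs.describe_tasks(cluster=cluster, tasks=task_arns)["tasks"]:
        print(task["taskArn"], "-", task.get("stoppedReason", "(no reason)"))
        for container in task["containers"]:
            # Container-level reason, e.g. CannotStartContainerError
            print("  ", container["name"], "-", container.get("reason", ""))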
Try going to "CloudWatch -> Logs -> Insights" and clicking "Run Query".
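If you'd rather run an Insights query from code instead of the console, a minimal boto3 sketch (the log group name, time range, and query string are assumptions):

import time
import boto3

logs = boto3.client("logs")

# Start an Insights query over the last hour and poll until it completes
query_id = logs.start_query(
    logGroupName="NAME-OF-LOGGROUP",  # placeholder log group
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString="fields @timestamp, @message | sort @timestamp desc | limit 50",
)["queryId"]

while True:
    result = logs.get_query_results(queryId=query_id)
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result["results"]:
    print([field["value"] for field in row])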
I just faced this problem, and the lack of logs did make it quite difficult to resolve.
The problem in my case was that the security group used for the task had been deleted. Hope this helps if anyone has a similar issue.

Resending Notification on error in Error Reporting

This is regarding re-sending of notifications for errors of the same kind.
In my current project, my errors are being grouped.
For example: if an SQL error occurs for the first time, I receive a notification, but when it occurs again after 2 or 3 hours it is grouped under the same log and no notification is sent.
On what basis does Error Reporting group the errors?
I tried randomising the error message in order to distinguish the messages, but they are still grouped under the same category (for example, messages like: service unavailable - 12, service unavailable - 23, etc.).
I want to receive a notification for each and every error, irrespective of its type or repetition.
Can you suggest a solution?
What you're describing is alerting based on a logs-based metric: https://cloud.google.com/logging/docs/logs-based-metrics/charts-and-alerts#creating_a_simple_alerting_policy_on_a_counter_metric
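Following that link, a minimal sketch of creating the counter metric with the google-cloud-logging Python client (the metric name and filter are assumptions); the alerting policy is then built on top of this metric in Cloud Monitoring:

from google.cloud import logging

client = logging.Client()

# Count every matching log entry; adjust the filter to match your errors
metric = client.metric(
    "service_unavailable_count",  # placeholder metric name
    filter_='severity>=ERROR AND textPayload:"service unavailable"',  # placeholder filter
    description="Counts service-unavailable errors so an alert can fire on each one",
)
metric.create()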

Cloudwatch logs - No Event Data after time elapses

I've looked on the AWS forums and elsewhere but haven't found a solution. I have a Lambda function that, when invoked, creates a log stream which populates with log events. After about 12 hours or so, the log stream is still present, but when I open it, I see the "no event data" placeholder with a link instead of my events.
The link explains how to start sending event data, but I already have this set up and I am sending event data; it just disappears after a certain time period.
I'm guessing there is some setting somewhere (either for the maximum storage allowed or for whether logs get purged), but if there is, I haven't found it.
Another reason for missing data in the log stream might be a corrupted agent-state file. First, check the agent's logs:
vim /var/log/awslogs.log
If you find something like "Caught exception: An error occurred (InvalidSequenceTokenException) when calling the PutLogEvents operation: The given sequenceToken is invalid. The next expected sequenceToken is: ...", you can regenerate the agent-state file as follows:
sudo rm /var/lib/awslogs/agent-state   # remove the corrupted state file
sudo service awslogs stop
sudo service awslogs start             # the agent recreates agent-state on startup
TL;DR: Just use the CLI. See Update 2 below.
This is really bizarre but I can replicate it...
I un-checked the "Expire Events After" box, and lo and behold I was able to open older log streams. What seems REALLY odd is that if I choose to display the "Stored Bytes" column, many of the streams are listed at 0 bytes even though they have log events.
Update 1:
This solution no longer works, as I can only view the log events in the first two log streams. What's more, the Stored Bytes column now displays different (and more accurate) data.
This leads me to believe that AWS made some kind of update.
Update 2:
Just use the CLI. I've verified that I can retrieve log events from the CLI that I cannot retrieve via the web console.
First install the CLI (if you haven't already) and use the following command:
aws logs get-log-events --log-group-name NAME-OF-LOGGROUP --log-stream-name LOG-STREAM-NAME  # be sure to quote or escape special characters such as /, [, $
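One caveat worth noting: get-log-events returns events in pages, so a single call may not show everything. A minimal boto3 sketch that follows the pagination token until the stream is exhausted (group and stream names are the same placeholders as above):

import boto3

logs = boto3.client("logs")

kwargs = {
    "logGroupName": "NAME-OF-LOGGROUP",   # placeholder
    "logStreamName": "LOG-STREAM-NAME",   # placeholder
    "startFromHead": True,                # read oldest events first
}
while True:
    response = logs.get_log_events(**kwargs)
    for event in response["events"]:
        print(event["timestamp"], event["message"])
    token = response["nextForwardToken"]
    # The API signals the end of the stream by returning the same token again
    if kwargs.get("nextToken") == token:
        break
    kwargs["nextToken"] = token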