I have created ECS tasks, but for some reason they are failing, and for the alert service I have SNS integrated with my Slack channel. When a container fails to start it sends an alert, and because the task keeps restarting every time until it is stopped manually, it sends an alert on every restart. I want to reduce these alerts. Is there some way I can do that, like putting a condition in the EventBridge rule? Please help. Below is the code used for the two cases:
The task and container are being replaced, not restarted. I don't think there's any way to reduce the number of alerts that will be sent out, since it's a new task each time. At least not with EventBridge/SNS directly.
You could look at creating a CloudWatch alarm that monitors the number of running tasks, and have it send an alert to your SNS topic when the count is 0 (or below some threshold). An alarm has settings such as the evaluation period that you can adjust to prevent too many alerts from occurring, and it would only send an alert when the state changed, not on every ECS task launch attempt.
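A minimal sketch of such an alarm, assuming Container Insights is enabled (which publishes the `RunningTaskCount` metric); the cluster name, service name, and SNS topic ARN below are placeholders, not taken from the question:

```python
import json

# Alarm definition for "service has no running tasks".
# Assumes Container Insights, which publishes RunningTaskCount under
# the ECS/ContainerInsights namespace; all names/ARNs are placeholders.
alarm_params = {
    "AlarmName": "my-service-no-running-tasks",
    "Namespace": "ECS/ContainerInsights",
    "MetricName": "RunningTaskCount",
    "Dimensions": [
        {"Name": "ClusterName", "Value": "my-cluster"},
        {"Name": "ServiceName", "Value": "my-service"},
    ],
    "Statistic": "Minimum",
    "Period": 60,
    "EvaluationPeriods": 5,   # five consecutive 1-minute periods below 1
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:my-slack-topic"],
}

# With credentials configured this would be created via:
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
print(json.dumps(alarm_params, indent=2))
```

Because the alarm only acts on state changes, repeated task replacements while the count stays at 0 produce one notification, not one per attempt.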
I've been trying to create a custom AWS CloudWatch dashboard to indicate the number of times a task running in an ECS service (Fargate launch type) has STOPPED due to containers crashing (whether a container-level or application-level issue).
I tried to retrieve the expected metric by writing a CloudWatch Logs Insights Query to query the /aws/ecs/containerinsights/{my-cluster-name}/performance Log Group. The query I used for this is written as below,
fields TaskId, KnownStatus, LaunchType, @timestamp
| filter ServiceName = "my-first-service" and KnownStatus = "STOPPED"
| limit 20
However, this doesn't output the STOPPED tasks, although it does output the RUNNING and PENDING tasks when I filter with KnownStatus equal to RUNNING or PENDING.
So my question is, doesn't ECS log STOPPED task logs to CloudWatch?
If it does, I'd appreciate it very much if anyone could provide an example query,
to indicate the number of times a task has STOPPED, and
to indicate the number of times each container within a task has STOPPED or crashed.
I also read in the documentation that Amazon EventBridge can be used to detect the STOPPED task events triggered by ECS, which in turn can write its logs back to CloudWatch. Hence, please feel free to advise on what would be the best practice to achieve my requirement.
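For reference, this is the kind of EventBridge event pattern I understand would match those STOPPED events (the cluster ARN is just a placeholder for my actual cluster):

```python
import json

# Pattern for ECS "Task State Change" events where the task has stopped;
# the cluster ARN is a placeholder.
event_pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "lastStatus": ["STOPPED"],
        "clusterArn": ["arn:aws:ecs:us-east-1:123456789012:cluster/my-cluster-name"],
    },
}

# This would be supplied as the rule's EventPattern, e.g.
#   events.put_rule(Name="ecs-stopped-tasks", EventPattern=json.dumps(event_pattern))
print(json.dumps(event_pattern))
```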
Thank you.
I have an AWS Step Function with many state transitions that can run for a half hour or more.
There are only a few states, and the application loops through them until it runs out of items to process.
I have a run that failed after about half an hour. I can look at the logging under the "Execution event history". However, since this logs every transition and state, there are thousands of events. I cannot page down to show enough events (clicking the "Load More" button) without hanging my browser window.
There is no way to sort or filter this list that I can see.
How can I find the cause of the failure? Is there a way to export the Execution event history somewhere? Or send it to CloudWatch?
You can use the AWS CLI command aws stepfunctions get-execution-history with the --reverse-order flag in order to get the logs from the most recent (where the errors will be) first.
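The same idea can be sketched with boto3. The pagination loop below is a sketch: the failure event types listed are a partial set, the execution ARN would be yours, and the client is passed in so the logic is testable without live credentials:

```python
# Walk the execution history newest-first and return the first
# failure-style event. FAILURE_TYPES is illustrative, not exhaustive.
FAILURE_TYPES = {"ExecutionFailed", "TaskFailed", "LambdaFunctionFailed"}

def first_failure(client, execution_arn):
    """Return the first failure event scanning newest events first, or None."""
    kwargs = {"executionArn": execution_arn, "reverseOrder": True}
    while True:
        page = client.get_execution_history(**kwargs)
        for event in page["events"]:
            if event["type"] in FAILURE_TYPES:
                return event
        token = page.get("nextToken")
        if token is None:
            return None
        kwargs["nextToken"] = token

# Usage with a real client:
#   sfn = boto3.client("stepfunctions")
#   print(first_failure(sfn, "arn:aws:states:...:execution:my-machine:run-1"))
```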
How do you process your steps? Docker containers on ECS or Fargate? Give us some details on that.
Your tasks should be sending out logs to CloudWatch as they execute.
You can also look at the Docker logs themselves on the physical machine, if you run Docker on a machine you can SSH to.
I have an instance in AWS whose CPU crosses the 90% threshold from time to time.
I created an alarm for this, but I received only one notification, during the first 5 minutes, even though the CPU stayed at 100% for 2 hours.
How do I configure the alarm so that I keep getting notifications the whole time?
CloudWatch does not send notifications continuously while the threshold is breached; it sends a notification only when the alarm state changes.
Alarms invoke actions for sustained state changes only. CloudWatch alarms do not invoke actions simply because they are in a particular state, the state must have changed and been maintained for a specified number of periods.
Ref: AWS Cloudwatch Documentation
One possible solution that I can think of is to create multiple CloudWatch alarms with multiple thresholds.
As the above answer says, the alarm is not triggered again. One thing you can do is change the alarm condition to a very large value and then back to the original value; the state change will then occur again.
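The multiple-thresholds idea can be sketched as a small helper that builds one alarm definition per threshold; the instance ID, topic ARN, and threshold values are illustrative:

```python
# One alarm per threshold, so each breach level produces its own
# state change (and therefore its own notification).
def cpu_alarms(instance_id, topic_arn, thresholds=(80, 90, 95)):
    """Build put-metric-alarm kwargs for each CPU threshold."""
    alarms = []
    for t in thresholds:
        alarms.append({
            "AlarmName": f"cpu-above-{t}-{instance_id}",
            "Namespace": "AWS/EC2",
            "MetricName": "CPUUtilization",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Statistic": "Average",
            "Period": 300,
            "EvaluationPeriods": 1,
            "Threshold": t,
            "ComparisonOperator": "GreaterThanThreshold",
            "AlarmActions": [topic_arn],
        })
    return alarms

# Each dict would then be sent with boto3:
#   for alarm in cpu_alarms("i-0123...", topic_arn):
#       cloudwatch.put_metric_alarm(**alarm)
```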
I have two elastic beanstalk environments.
One is the 'primary' web server environment and the other is a worker environment that handles cron jobs.
I have 12 cron jobs, setup via a cron.yaml file that all point at API endpoints on the primary web server.
Previously my cron jobs all ran on the web server environment, but of course this created duplicate cron jobs once that environment scaled up.
My new implementation works nicely, but when a cron job fails to run as expected it is repeated, generally within a minute or so.
I would rather avoid this behaviour and just attempt to run the cron job again at the next scheduled interval.
Is there a way to configure the worker environment/SQS so that failed jobs do not repeat?
Simply configure a CloudWatch Events (EventBridge) scheduled rule to take over your cron, and have it create an SQS message (either directly or via a Lambda function).
Your workers will now just have to handle SQS jobs and if needed, you will be able to scale the workers as well.
http://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html
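A sketch of those two calls; the rule name, queue ARN, and job payload are made up for illustration:

```python
import json

# Scheduled rule that fires every 5 minutes...
rule = {
    "Name": "cron-every-5-min",
    "ScheduleExpression": "rate(5 minutes)",
}

# ...and an SQS queue as its target, with a fixed job payload.
target = {
    "Rule": "cron-every-5-min",
    "Targets": [{
        "Id": "cron-queue",
        "Arn": "arn:aws:sqs:us-east-1:123456789012:cron-jobs",
        "Input": json.dumps({"job": "nightly-cleanup"}),
    }],
}

# With boto3:
#   events = boto3.client("events")
#   events.put_rule(**rule)
#   events.put_targets(**target)
```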
Yes, you can set the Max retries option on the Elastic Beanstalk worker environment and the Maximum receives setting on the SQS queue to 1. This ensures the message is attempted once; if it fails, it is sent to the dead letter queue.
With this approach, your environment may turn yellow when jobs fail, because the messages end up in the dead letter queue, which you can simply observe and ignore, but it may be annoying if you need all environments to be green. You can set the Message retention period on the dead letter queue to something short so the messages go away sooner.
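A sketch of the queue side of that setup (the queue URL and dead letter queue ARN are placeholders):

```python
import json

# Redrive policy: after a single receive, a failed message moves to the
# dead letter queue instead of being retried.
redrive_policy = {
    "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:worker-dlq",
    "maxReceiveCount": "1",
}
set_attrs = {
    "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/worker-queue",
    "Attributes": {"RedrivePolicy": json.dumps(redrive_policy)},
}

# With boto3: boto3.client("sqs").set_queue_attributes(**set_attrs)
print(json.dumps(set_attrs, indent=2))
```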
An alternative approach, if you're interested, is to return a status 200 OK in your code regardless of how the job ran. This will ensure that the SQS daemon deletes the message in the queue, so that it won't get picked up again.
Of course, the downside is that you would have to modify your code, but I can see how this would make sense if you don't care about the result.
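A minimal sketch of that idea; `run_job` stands in for whatever the endpoint actually does, and this is not the asker's code:

```python
import logging

def handle_cron_request(run_job):
    """Run the job, swallow any error, and always report success so the
    SQS daemon deletes the message instead of retrying it."""
    try:
        run_job()
    except Exception:
        logging.exception("cron job failed; acknowledging anyway")
    return 200  # any success response makes the daemon delete the message
```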
Here's a link to AWS documentation that explains all of the parameters.
Suppose I have an ec2 instance with service /etc/init/my_service.conf with contents
script
exec my_exec
end script
How can I monitor that EC2 instance so that if my_service stops running I can act on it?
You can publish a custom metric to CloudWatch in the form of a "heart beat".
Have a small script running via cron on your server that checks the process list to see whether my_service is running and, if it is, makes a put-metric-data call to CloudWatch.
The metric could be as simple as pushing the number "1" to your custom metric in CloudWatch.
Set up a CloudWatch alarm that triggers if the average for the metric falls below 1.
Make the period of the alarm >= the period the cron runs at; e.g. if the cron runs every 5 minutes, make the alarm fire if it sees the average below 1 for two 5-minute periods.
Make sure you also handle the situation in which the metric is not published at all (e.g. the cron fails to run or the whole machine dies); you would want to set up an alert for the case where the metric is missing (see: AWS Cloudwatch Heartbeat Alarm).
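Putting those pieces together, the alarm might look like the sketch below; the metric name, namespace, dimensions, and topic ARN are assumptions, not part of the question:

```python
import json

# Heartbeat alarm: fires when the average drops below 1 for two 5-minute
# periods, and treats a missing metric as breaching (covers the dead-box case).
heartbeat_alarm = {
    "AlarmName": "my-service-heartbeat",
    "Namespace": "Custom/Heartbeat",
    "MetricName": "ServiceRunning",
    "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    "Statistic": "Average",
    "Period": 300,
    "EvaluationPeriods": 2,
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "breaching",  # no data at all also alarms
    "AlarmActions": ["arn:aws:sns:us-east-1:123456789012:alerts"],
}

# With boto3: boto3.client("cloudwatch").put_metric_alarm(**heartbeat_alarm)
print(json.dumps(heartbeat_alarm, indent=2))
```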
Be aware that the custom metric will add an additional 50c to your AWS bill (not a big deal for one metric, but the equation changes drastically if you want to push hundreds or thousands of metrics; i.e. good to know it is not free, as one might expect).
See here for how to publish a custom metric: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html
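The cron script's two halves can be sketched like this; the service name, namespace, and instance ID are placeholders:

```python
import subprocess

def service_is_running(process_name):
    """True if pgrep finds a process matching the name."""
    return subprocess.run(
        ["pgrep", "-f", process_name], capture_output=True
    ).returncode == 0

def heartbeat_payload(instance_id):
    """kwargs for put-metric-data: push the number 1 as the heartbeat."""
    return {
        "Namespace": "Custom/Heartbeat",
        "MetricData": [{
            "MetricName": "ServiceRunning",
            "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
            "Value": 1,
        }],
    }

# The cron entry (every 5 minutes) would then do, with boto3:
#   if service_is_running("my_service"):
#       boto3.client("cloudwatch").put_metric_data(**heartbeat_payload("i-0abc..."))
```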
I am not sure CloudWatch is the right route for checking whether a service is running; this would be easier with a Nagios-style solution.
Nevertheless, you can try the CloudWatch custom metrics approach. Add a few lines of code that publish, say, the integer 1 to a custom metric every 5 minutes. You can then configure a CloudWatch alarm to send an SNS or mail notification when Sample Count or Sum deviates from the anticipated value.
script
  my_exec &
  # sketch: publish a heartbeat while the process stays alive
  while kill -0 $! 2>/dev/null; do
    aws cloudwatch put-metric-data --namespace "Custom/Heartbeat" \
      --metric-name ServiceRunning --value 1
    sleep 300
  done
end script
More Info
Publish Custom Metrics - http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/publishingMetrics.html