I have some AWS Lambda functions that run about twenty thousand times a day, so I would like to enable logging/alerting to monitor all the errors and exceptions.
The CloudWatch logs are too noisy, and it is difficult to see the errors.
I'm now planning to write the logs to an AWS S3 bucket, but this will have an impact on performance.
What's the best way you suggest to log and alert the errors?
An alternative would be to leave everything as it is (from the application's perspective) and use Amazon CloudWatch Logs metric filters.
You use metric filters to search for and match terms, phrases, or
values in your log events. When a metric filter finds one of the
terms, phrases, or values in your log events, you can increment the
value of a CloudWatch metric.
Once you have defined your filter, you can create a CloudWatch alarm on the metric and get notified as soon as your defined threshold is reached :-)
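As a concrete sketch of that setup with boto3 (the log group name, namespace, metric name, and SNS topic ARN below are placeholders, not taken from the question):

```python
# Placeholders: adjust the log group, namespace, and SNS topic to your setup.
LOG_GROUP = "/aws/lambda/my-function"
NAMESPACE = "MyApp"
METRIC = "LambdaErrors"
SNS_TOPIC = "arn:aws:sns:eu-west-1:123456789012:lambda-error-alerts"

def filter_params():
    """Parameters for logs.put_metric_filter: emit 1 per matching log event."""
    return {
        "logGroupName": LOG_GROUP,
        "filterName": "lambda-errors",
        "filterPattern": "?ERROR ?Exception",  # match either term
        "metricTransformations": [{
            "metricName": METRIC,
            "metricNamespace": NAMESPACE,
            "metricValue": "1",
            "defaultValue": 0.0,
        }],
    }

def alarm_params():
    """Parameters for cloudwatch.put_metric_alarm: notify on the first error."""
    return {
        "AlarmName": "lambda-errors",
        "Namespace": NAMESPACE,
        "MetricName": METRIC,
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",  # no matching events -> alarm stays OK
        "AlarmActions": [SNS_TOPIC],
    }

def main():
    import boto3  # needs AWS credentials; main() is not invoked in this sketch
    boto3.client("logs").put_metric_filter(**filter_params())
    boto3.client("cloudwatch").put_metric_alarm(**alarm_params())
```

The filter pattern syntax ("?ERROR ?Exception" means "match either term") is worth double-checking against the metric filter docs for your exact log format.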
Edit
I didn't check the link from @Renato Gama. Sorry. Just follow the instructions behind the link and your problem should be solved easily...
If you have not tried this already, I suggest creating CloudWatch alerts based on custom metric filters. Have a look here: https://www.opsgenie.com/blog/2014/08/how-to-use-cloudwatch-to-generate-alerts-from-logs
(Of course you don't have to use OpsGenie service as suggested on the post I linked, you can implement anything that will help you debug the problems)
Related
According to the docs I can define log-based metrics, but I can't seem to find a way to do this. My application logs a message like this:
2022-03-29T10:20:30 [INFO] Some action took 0.23 seconds
What I'm trying to do is extract the 0.23 as a metric that I can put on dashboard and monitor. How can I go about doing this?
Edit
I mostly solved this at the application level (my hope had been that no code changes would be needed) by using a logger from the client library (Python, in my case) and adding my metric to the log statement via structured logging. From there, I created a user-defined, log-based metric, which I can now monitor, set alerts on, etc. The only detail is that, as mentioned in the comments, this particular metric is a gauge metric type. I'm extracting it as a distribution, which mostly works out of the box, though I lose a negligible amount of precision.
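For reference, the structured-logging idea can be sketched with just the standard library rather than the Cloud Logging client (field names such as `duration_s` are made up for illustration). On most GCP runtimes a JSON log line is ingested as `jsonPayload`, and a log-based distribution metric can then extract the numeric field directly:

```python
import json
import logging
import re
from typing import Optional

logger = logging.getLogger("app")

def log_action(action: str, duration_s: float) -> str:
    """Emit the duration as a structured field instead of free text;
    returns the JSON line for illustration."""
    line = json.dumps({
        "message": f"{action} took {duration_s} seconds",
        "action": action,
        "duration_s": duration_s,  # numeric field for the log-based metric
        "severity": "INFO",
    })
    logger.info(line)
    return line

def parse_duration(message: str) -> Optional[float]:
    """Fallback: extract the value from the legacy free-text format."""
    m = re.search(r"took ([0-9.]+) seconds", message)
    return float(m.group(1)) if m else None
```

With the structured field in place, the metric definition no longer needs a regex at all; the fallback parser is only useful while old-format lines are still around.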
In our project, a total of 10 Glue jobs run daily. I would like to build a dashboard showing the last 7 days' job status, i.e. succeeded or failed. I tried to achieve this in CloudWatch with metrics, but was not able to do it. Please give me an idea of how to build this dashboard.
Probably a little late for the original questioner, but maybe helpful for others.
We had a similar task in our project. We have many jobs and need to monitor success and failure. In our experience, the built-in metrics aren't really reliable, nor do they really answer the question of whether a job was successful or not.
But we found a good way for us by generating custom metrics in a generic way for all jobs. This also works for existing jobs afterwards without having to change the code.
I wrote an article about it: https://medium.com/@ettefette/metrics-for-aws-glue-jobs-as-you-know-them-from-lambda-functions-e5e1873c615c
We have set CloudWatch alerts based on these metrics, and we use them in our Grafana dashboard to monitor the Glue jobs.
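The article's exact mechanism isn't reproduced here, but the core idea of a custom job-status metric can be sketched with boto3's `put_metric_data` (the `GlueJobs` namespace and `JobSucceeded` metric name are placeholders):

```python
import datetime

NAMESPACE = "GlueJobs"  # hypothetical custom namespace

def job_status_metric(job_name: str, succeeded: bool) -> dict:
    """Build one put_metric_data entry: 1 for success, 0 for failure."""
    return {
        "MetricName": "JobSucceeded",
        "Dimensions": [{"Name": "JobName", "Value": job_name}],
        "Timestamp": datetime.datetime.now(datetime.timezone.utc),
        "Value": 1.0 if succeeded else 0.0,
        "Unit": "Count",
    }

def publish(job_name: str, succeeded: bool) -> None:
    """Publish the status metric. Needs AWS credentials; not invoked here."""
    import boto3
    boto3.client("cloudwatch").put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[job_status_metric(job_name, succeeded)],
    )
```

A dashboard widget graphing the Sum of `JobSucceeded` per `JobName` over the last 7 days then shows the success/failure history, and an alarm can fire when a job reports 0.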
I'm writing a concurrent tailing utility for watching multiple AWS CloudWatch log groups across many regions simultaneously. In CloudWatch Logs there are log groups, which contain many log streams that are rotated occasionally. Thus, to tail a log group, one must find the latest log stream, read it in a loop, and occasionally check for a new log stream and start reading that in a loop.
I can't seem to find any documentation on this, but is there a set of published conditions upon which I can conclude that a log stream has been "closed?" I'm assuming I'll need to have multiple tasks tailing multiple log streams in a group up until a certain cut-off point, but I don't know how to logically determine that a log stream has been completed and to abandon tailing it.
Does anyone know whether such published conditions exist?
I don't think you'll find that published anywhere.
If AWS had some mechanism to know that a log stream was "closed" or would no longer receive log entries, I believe their own console for a stream would make use of it somehow. As it stands, when you view even a very old stream in the console, it will show this message at the bottom:
I know it is not a direct answer to your question, but I believe that is strong indirect evidence that AWS can't tell when a log stream is "closed" either. Resuming auto-retry on an old log stream generates needless traffic, so if they had a way to know the stream was "closed", they would disable that option for such streams.
Documentation says
A log stream is a sequence of log events that share the same source.
Since each new "source" will create a new log stream, and since CloudWatch supports many different services and options, there won't be a single answer. It depends on too many factors. For example, with the Lambda service, each lambda container will be a new source, and AWS Lambda may create new containers based on many factors like lambda execution volume, physical work in its data center, outages, changes to lambda code, etc. And that is just for one potential stream source for log streams.
You've probably explored options, but these may give some insights into ways to achieve what you're looking to do:
The CLI has a tail option that will include all log streams in a group: https://awscli.amazonaws.com/v2/documentation/api/latest/reference/logs/tail.html — though if you're building your own utility, the CLI likely won't be an option
Some options are discussed at how to view aws log real time (like tail -f) but there are no mentions of published conditions for when a stream is "closed"
When does AWS CloudWatch create new log streams? may yield some insights
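Absent any published "closed" condition, one practical pattern (essentially what the CLI's tail does) is to poll at the group level with `filter_log_events` instead of tracking individual streams. A sketch, with the AWS call kept behind a callable so the loop logic itself is testable (pagination via `nextToken` is omitted for brevity):

```python
import time

def poll_once(fetch, last_ts, seen):
    """One polling step over a whole log group.

    `fetch(start_ms)` returns events shaped like
    logs.filter_log_events(...)["events"]: dicts with "timestamp" (ms)
    and "message". Returns (new_events, last_ts, seen); `seen` holds
    keys of events at the frontier timestamp so re-fetched events are
    deduplicated. Assumes events arrive in chronological order.
    """
    new = []
    for ev in fetch(last_ts):
        key = (ev["timestamp"], ev["message"])
        if ev["timestamp"] < last_ts or key in seen:
            continue
        if ev["timestamp"] > last_ts:
            last_ts, seen = ev["timestamp"], set()
        seen.add(key)
        new.append(ev)
    return new, last_ts, seen

def tail_group(log_group):
    """Tail every stream in a group. Needs AWS credentials; runs forever."""
    import boto3
    logs = boto3.client("logs")

    def fetch(start_ms):
        return logs.filter_log_events(
            logGroupName=log_group, startTime=start_ms
        )["events"]

    last_ts, seen = 0, set()
    while True:
        events, last_ts, seen = poll_once(fetch, last_ts, seen)
        for ev in events:
            print(ev["message"])
        time.sleep(2)
```

This sidesteps the "when is a stream closed?" question entirely: the group-level query picks up whatever streams currently exist.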
AWS documentation is not descriptive enough to figure out the significance of the PendingTasks metric.
See: https://docs.aws.amazon.com/amazonswf/latest/developerguide/cw-metrics.html
I wanted to know whether these metrics are worth alarming on or monitoring.
When you schedule an SWF workflow, it automatically creates a task list for you, or you can select an already existing task list to place the workflow in.
You can see the task lists on your SWF dashboard:
PendingTasks creates a metric for each task list in each workflow domain and displays how many tasks are pending after each minute.
Whether this metric is worth alarming on is up to you, depending on your use case. If the number of pending tasks keeps growing, it probably means something got stuck or is taking longer than expected. It might be worth alarming on in that case.
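For example, an alarm on that metric could be created like this (the domain and task-list names are placeholders, and I'm assuming the `AWS/SWF` namespace with `Domain`/`TaskListName` dimensions — verify the exact dimension names against the metric in your CloudWatch console):

```python
def pending_tasks_alarm(domain: str, task_list: str, threshold: float) -> dict:
    """put_metric_alarm parameters: alert when pending tasks stay high."""
    return {
        "AlarmName": f"swf-pending-{task_list}",
        "Namespace": "AWS/SWF",          # assumed namespace
        "MetricName": "PendingTasks",
        "Dimensions": [
            {"Name": "Domain", "Value": domain},          # assumed dimension names
            {"Name": "TaskListName", "Value": task_list},
        ],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 3,  # sustained for 15 minutes, not a brief spike
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }

def create(domain: str, task_list: str, threshold: float) -> None:
    """Create the alarm. Needs AWS credentials; not invoked in this sketch."""
    import boto3
    boto3.client("cloudwatch").put_metric_alarm(
        **pending_tasks_alarm(domain, task_list, threshold)
    )
```

Using `EvaluationPeriods` greater than 1 avoids alarming on a momentary backlog that the workers clear on their own.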
I am trying to find a centralized solution for moving my application logging out of the database (RDS).
I was thinking to use CloudWatchLog but noticed that there is a limit for PutLogEvents requests:
The maximum rate of a PutLogEvents request is 5 requests per second
per log stream.
Even if I break my logs into many streams (based on EC2 instance and log type: error, info, warning, debug), the limit of 5 requests per second is still very restrictive for an active application.
The other solution is to somehow accumulate logs and send PutLogEvents with a batch of log records, but that means I am forced to use a database to accumulate those records.
So the questions are:
Maybe I'm wrong and the limit of 5 requests per second is not so restrictive?
Is there any other solution I should consider, for example DynamoDB?
PutLogEvents is designed to put several events per call, by definition (as per its name: PutLogEvent"S") :) The CloudWatch Logs agent does this on its own, and you don't have to worry about it.
However, please note: I don't recommend generating too many logs (e.g. don't run debug mode in production), as CloudWatch Logs can become pretty expensive as your log volume grows.
My advice would be to use a Logstash solution on an AWS instance.
Alternatively, you can run Logstash on another existing instance or container.
https://www.elastic.co/products/logstash
It is designed for this purpose and it does it wonderfully.
CloudWatch is not primarily designed for your needs.
I hope this helps somehow.
If you are calling this API directly from your application, the short answer is that you need to batch your log events (the limit is 5 requests per second for PutLogEvents).
If you are writing the logs to disk and pushing them afterwards, there is already an agent that knows how to push them: http://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/QuickStartEC2Instance.html
Meta: I would suggest that you prototype this and ensure it works for the log volume you have. Also keep in mind that, because of how the CloudWatch API works, only one application/user can push to a log stream at a time (see the sequence token you have to pass in), so you will probably need multiple streams, one per user or per log type, to ensure your applications are not competing for the same stream.
Meta meta: think about how your application behaves if the logging subsystem fails, and whether you can live with the possibility of losing logs (i.e., is it critical that you always have a guarantee you will get the logs?). This will probably drive what you do and which solution you ultimately pick.
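To make the batching concrete: instead of a database, events can be accumulated in memory and flushed with one PutLogEvents call per batch. A sketch, using the documented PutLogEvents constraints of 10,000 events / ~1 MB per call with 26 bytes of per-event overhead (verify these for your SDK version):

```python
MAX_BATCH_EVENTS = 10_000      # PutLogEvents per-call event limit
MAX_BATCH_BYTES = 1_048_576    # PutLogEvents per-call payload limit (~1 MB)
EVENT_OVERHEAD = 26            # bytes CloudWatch counts per event beyond the message

def batches(events, max_events=MAX_BATCH_EVENTS, max_bytes=MAX_BATCH_BYTES):
    """Group {"timestamp", "message"} events into PutLogEvents-sized batches."""
    batch, size = [], 0
    for ev in events:
        ev_size = len(ev["message"].encode("utf-8")) + EVENT_OVERHEAD
        if batch and (len(batch) >= max_events or size + ev_size > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(ev)
        size += ev_size
    if batch:
        yield batch

def push(log_group, log_stream, events):
    """Send accumulated events in batches. Needs AWS credentials; not
    invoked in this sketch. (Older API versions also required passing a
    sequenceToken between calls.)"""
    import boto3
    logs = boto3.client("logs")
    for batch in batches(events):
        logs.put_log_events(
            logGroupName=log_group, logStreamName=log_stream, logEvents=batch
        )
```

Each `put_log_events` call is one request, so batching like this turns thousands of individual log lines into a handful of API calls per second.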