How can I trigger an alert based on log output? - google-cloud-platform

I am using GCP and want to create an alert when a certain pattern stops appearing in the output logs of a process.
As an example, my CLI process will output "YYYY-MM-DD HH:MM:SS Successfully checked X" every second.
I want to know when this fails (indicated by no log output). I am collecting logs using the normal GCP log collector.
Can this be done?
I am creating the alerts via the UI at:
https://console.cloud.google.com/monitoring/alerting/policies/create

You can create an alert based on a log-based metric. First, create a log-based metric in Cloud Logging with the log filter that you want.
Then create an alerting policy that aggregates the metric per minute and fires when the value falls below 60.
You won't get an alert for each individual missing message, but on a per-minute basis you will be alerted whenever the expected count isn't reached.
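The alerting policy's condition boils down to counting matching log entries per one-minute window and flagging any window below the expected rate. A minimal local sketch of that logic, assuming the "Successfully checked" log format from the question (the threshold of 60 reflects one line per second):

```python
from collections import Counter
from datetime import datetime

EXPECTED_PER_MINUTE = 60  # one "Successfully checked" line per second


def low_minutes(log_lines, threshold=EXPECTED_PER_MINUTE):
    """Return the minutes whose matching-log count falls below threshold.

    This mirrors what the alerting policy does server-side: align the
    log-based metric on one-minute windows and compare each count to
    the threshold.
    """
    counts = Counter()
    for line in log_lines:
        if "Successfully checked" not in line:
            continue
        # Lines start with "YYYY-MM-DD HH:MM:SS"; truncate to the minute.
        ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        counts[ts.replace(second=0)] += 1
    return sorted(m for m, n in counts.items() if n < threshold)
```

Note the same caveat applies as with the real metric: a minute with zero matching entries produces no datapoint at all, so the alert condition must also treat missing data as a breach.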

Related

Trigger alert when expected associated event is not present

I have two events that should appear in the logs to represent the completion of a transaction. I need to set up an alert when a transaction starts but no completion log entry is found.
What I've done so far is create two user-defined log metrics:
order_form_starts denotes the submission of an order.
order_form_sents denotes the successful transmission of items of an order.
This entry should be present 1 or more times per ..._start.
I'm able to query the events individually, but I'm not sure how to express an alert that should be triggered for the "BAD" scenarios below. Some example sequences:
GOOD: start(order_id=1), end(order_id=1)
GOOD: start(order_id=1), end(order_id=1), end(order_id=1)
BAD: start(order_id=1)
BAD: start(order_id=1), end(order_id=2)
Start Event:
fetch global::logging.googleapis.com/user/order_form_starts
| group_by [resource.project_id], row_count()
End Event:
fetch global::logging.googleapis.com/user/order_form_sents
| group_by [resource.project_id], row_count()
I don't think you can express that directly in an alerting policy. If I had to implement this, I would create a Cloud Function that periodically queries the logs for both events, correlates them, and writes the result as a gauge metric to Cloud Monitoring.
From there you can define a threshold on that metric and create an alert.
You might also want to instrument your transactions with Google Cloud Trace, which is designed for tracking a request across multiple events.
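The correlation step such a Cloud Function would perform can be sketched locally: collect the order_ids seen in start and end events and report any start without a matching end. The `(kind, order_id)` event shape is illustrative, not taken from the metrics above:

```python
def unmatched_starts(events):
    """Return order_ids that have a start event but no matching end event.

    `events` is an iterable of (kind, order_id) tuples, where kind is
    "start" or "end". Duplicate end events for the same order are fine,
    since the question allows one or more ends per start.
    """
    starts, ends = set(), set()
    for kind, order_id in events:
        (starts if kind == "start" else ends).add(order_id)
    return sorted(starts - ends)
```

The Cloud Function would publish `len(unmatched_starts(...))` as the gauge value; the alerting policy then fires whenever the gauge is above 0.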

"Only numeric data" error on logs-base metric GCP

I have used the Ops agent to send logs to Cloud Logging.
Then I used these logs to create a logs-based metric with the field name jsonPayload.data.
After that, I reviewed the logs behind the metric to make sure the input data is correct.
But in the end, Cloud Monitoring shows the error "Only numeric data can be drawn as a line chart." I checked at the "review logs" step that the input data is numeric. Can anyone help me explain that?
You can see the metric by changing the aligner to percentile. A logs-based metric that extracts a value from a field (like jsonPayload.data) is recorded as a distribution rather than a plain numeric series, so the chart needs an aligner such as the 50th percentile to reduce each distribution to a single number.

Is it possible to set up CloudWatch Alarm for 3 or 4 mins period?

I need to receive a notification each time a certain message does not appear in logs for 3-4 minutes. It is a clear sign that the system is not working properly.
But it is only possible to choose 1 min or 5 mins. Is there any workaround?
"does not appear in logs for 3-4 minutes. It is a clear sign that the system is not working properly."
A CloudWatch alarm on a metric that is not pushed continuously can behave a bit differently than you expect.
You should consider using the alarm's "M out of N" option, with 3 out of 4.
https://aws.amazon.com/about-aws/whats-new/2017/12/amazon-cloudwatch-alarms-now-alerts-you-when-any-m-out-of-n-metric-datapoints-in-an-interval-are-above-your-threshold/
Also, if the metric you are referring to was created with a metric filter on a CloudWatch log group, edit the metric filter to include a default value. Then, whenever a log event is pushed that the filter expression does not match, a default value (say, 0) is still published, giving the metric more continuous datapoints.
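With boto3, the "M out of N" behaviour maps to the `EvaluationPeriods` and `DatapointsToAlarm` parameters of `put_metric_alarm`. A sketch of the alarm configuration, where the alarm name, namespace, and metric name are all made-up placeholders:

```python
# Parameters for cloudwatch.put_metric_alarm(); AlarmName, Namespace,
# and MetricName below are hypothetical placeholders.
alarm_params = {
    "AlarmName": "no-heartbeat-3-of-4",
    "Namespace": "MyApp",              # hypothetical custom namespace
    "MetricName": "IncomingMessages",  # hypothetical metric-filter metric
    "Statistic": "Sum",
    "Period": 60,                      # one-minute datapoints
    "EvaluationPeriods": 4,            # look at the last 4 datapoints...
    "DatapointsToAlarm": 3,            # ...and alarm when 3 of them breach
    "Threshold": 1,
    "ComparisonOperator": "LessThanThreshold",
    "TreatMissingData": "breaching",   # no datapoint at all counts as bad
}

# Uncomment to apply against a real account:
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

With one-minute datapoints, 3 breaching out of the last 4 approximates "no message for 3-4 minutes". `TreatMissingData="breaching"` covers the case where the filter publishes nothing at all; the default-value trick above achieves the same continuity at the metric level instead.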
If you work with CloudWatch alarms through the AWS CLI, the period is expressed in seconds; only the web interface limits the period to a fixed set of values.
https://docs.aws.amazon.com/cli/latest/reference/cloudwatch/describe-alarms.html

Cloud watch logs prepending timestamp to each line

We have the CloudWatch Logs agent set up, and the streamed logs have a timestamp prepended to the beginning of each line, which we can see after export:
2017-05-23T04:36:02.473Z "message"
Is there any configuration on the CloudWatch Logs agent that prevents this timestamp from being prepended to each log entry?
Alternatively, is there a way to export only the messages of the log events? We don't want the timestamp in our exported logs.
Thanks
Assuming you are able to retrieve those logs in a Lambda function (Python 3.x), you can use a regular expression to identify the timestamp and strip it from each event line.
^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\t
The above matches a timestamp such as 2019-10-10T22:11:00.123Z followed by a tab (adjust the trailing \t if your export separates with a space).
Here is a simple Python function:
import re

TIMESTAMP = re.compile(r'^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z\t')

def strip(event_log):
    # Remove the leading timestamp that CloudWatch prepends to the line.
    return TIMESTAMP.sub('', event_log)
I don't think it's possible. I needed the exact same behavior you are asking for, and it looks like it cannot be done unless you implement a man-in-the-middle processor that removes the timestamp from every log message, as suggested in the other answer.
Checking the CloudWatch Logs client API in the first place: it is required to send a timestamp with every log message you send to CloudWatch Logs (API reference).
The export-logs-to-S3 task API also has no parameter to control this behavior (API reference).

Filter AWS Cloudwatch Lambda's Log

I have a Lambda function and its logs in Cloudwatch (Log group and Log Stream). Is it possible to filter (in Cloudwatch Management Console) all logs that contain "error"? For example logs containing "Process exited before completing request".
In Log Groups there is a button "Search Events"; you must click on it first.
It then changes to "Filter Streams", where you can type your filter and select the beginning date-time.
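Beyond the console, the same search can be done programmatically. A sketch using boto3's `filter_log_events`, where the log group name is a made-up example you would replace with your Lambda's actual log group:

```python
# Parameters for logs.filter_log_events(); the log group name is hypothetical.
filter_params = {
    "logGroupName": "/aws/lambda/my-function",  # replace with your log group
    "filterPattern": '"error"',                 # quoted term: match the literal word
}

# Uncomment to run against a real account:
# import boto3
# paginator = boto3.client("logs").get_paginator("filter_log_events")
# for page in paginator.paginate(**filter_params):
#     for event in page["events"]:
#         print(event["message"])
```

A pattern like `"Process exited before completing request"` works the same way; CloudWatch treats a quoted string in a filter pattern as an exact term match.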
So this is kind of a side issue, but it was relevant for us. (I posted this to another answer on StackOverflow but thought it would be relevant to this conversation too)
We've noticed that tailing and searching logs gets really slow after a log group has a lot of Log Streams in it, like when an AWS Lambda Function has had a lot of invocations. This is because "tail" type utilities and searching need to connect to each log stream to run. Log Events get expired and deleted due to the policy you set on the Log Group itself, but the Log Streams never get cleaned up. I made a few little utility scripts to help with that:
https://github.com/four43/aws-cloudwatch-log-clean
Hopefully that saves you some agony waiting for those logs to be searched.
You can also use CloudWatch Logs Insights (https://aws.amazon.com/about-aws/whats-new/2018/11/announcing-amazon-cloudwatch-logs-insights-fast-interactive-log-analytics/), an AWS extension to CloudWatch Logs that provides a powerful query and analytics tool. However, it can be slow; some of my queries take up to a minute, which is acceptable if you really need the data.
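For the "error" search from the question, a Logs Insights query would look like this, using the standard @timestamp and @message fields:

```
fields @timestamp, @message
| filter @message like /error/
| sort @timestamp desc
| limit 50
```

The `like /…/` form is a regular-expression match; use `filter @message like "Process exited before completing request"` for a literal substring.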
You could also use a tool I created called SenseLogs. It downloads CloudWatch data to your browser, where you can run queries like the one you ask about. You can either do a full-text search for "error" or, if your log data is structured (JSON), use a JavaScript-like expression language to filter by field, e.g.:
error == 'critical'
Posting an update, as CloudWatch has changed since 2016: in Log Groups there is now a "Search all" button for a full-text search. Just type your search term.