Is there a way in Stackdriver to create an alert based on the absence of a specific line in the logs for a specific timeframe (say, 1 hour)?
I am trying to monitor (and be notified) when a GKE CronJob did not run in the last hour; I was not able to come up with any other way of achieving this.
You can create a log-based metric for a specific log entry following the steps here. Once that is created, you can create an alert based on that log-based metric following the instructions here.
You could configure the alert to trigger when the metric is below a certain threshold for a certain amount of time; however, you cannot define a specific time frame for the alert policy to run. The alert policy will run until it is deleted.
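For reference, here is a rough CLI sketch of that setup, assuming the CronJob writes a recognisable completion line to its container logs. The metric name, the log filter, and the policy file below are placeholders you would adapt to your own job:

# 1) Log-based metric counting the CronJob's completion log lines (filter is a placeholder)
gcloud logging metrics create cronjob-ran \
  --description="Counts completion lines written by the CronJob" \
  --log-filter='resource.type="k8s_container" AND textPayload:"job finished"'

# 2) Metric-absence alert policy: opens an incident when the metric has no data for 1 hour
cat > policy.json <<'EOF'
{
  "displayName": "CronJob did not run in the last hour",
  "combiner": "OR",
  "conditions": [{
    "displayName": "No cronjob-ran data for 1h",
    "conditionAbsent": {
      "filter": "metric.type=\"logging.googleapis.com/user/cronjob-ran\"",
      "aggregations": [{"alignmentPeriod": "300s", "perSeriesAligner": "ALIGN_COUNT"}],
      "duration": "3600s"
    }
  }]
}
EOF
gcloud alpha monitoring policies create --policy-from-file=policy.json

Per the metric-absence caveat quoted further down, the CronJob has to have written at least one matching entry before the absence condition can fire.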
Related
I have a requirement to send an email notification whenever no data is getting inserted into my BigQuery table. For this, I am using the Logging and Alerting mechanism, but I am still not able to receive any email. Here are the steps I followed:
I wrote a query in Logs Explorer as below:
Then I created a metric for those logs with metric type COUNTER; in the filter section I used the above query.
Then I created an alerting policy under Monitoring (screenshot attached). The alerting policy is based on the log-based metric I created before.
And then I configured a trigger as below:
In the notification channel, I added my email address.
Can someone please help me if I am missing something? My requirement is to receive an alert when no data has been inserted into a BigQuery table for more than a day.
Also, I can see in Metrics Explorer that the metric I created is not ACTIVE. Why is that?
As mentioned in the GCP docs:
Metric absence conditions require at least one successful measurement — one that retrieves data — within the maximum duration window after the policy was installed or modified.
For example, suppose you set the duration window in a metric-absence policy to 30 minutes. The condition isn't met if the subsystem that writes metric data has never written a data point. The subsystem needs to output at least one data point and then fail to output additional data points for 30 minutes.
Meaning, you need at least one data point (an insert job) before an incident can be created for the missing metric.
There are two options:
Create an artificial log entry to get the metric started and have at least one time series and data point (see the sketch after this list).
Run an insert job that would match the log-based metric that was created to get the metric started.
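If you go with the first option, a minimal sketch could look like the line below. It assumes your metric's filter matches fields you can set yourself (log name, severity, payload); audit-log fields such as protoPayload cannot be written manually, in which case the second option (a real insert job) is the safer bet:

# Hypothetical log name and payload -- must match whatever your metric's filter looks for
gcloud logging write my-seed-log '{"event": "bq_insert_seed"}' --payload-type=json --severity=INFO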
Regarding your last question, the metric you created is not active because no data points have been written to it within the previous 24 hours. As mentioned above, the metric must have at least one data point written to it.
Refer to custom metrics quota for more info.
Is it possible to create alerts for configuration activities?
On the dashboard of my GCP project, I'm able to see the history of activities. However, for security reasons, I would like to receive notifications when certain activities happen, e.g. setting an IAM policy on the project, deleting an instance of the project, etc. Is this possible?
I have looked into "metric-based alerting policies", but I'm only able to create alerts for uptime checks. Not sure what else to look for.
You are on the right path. You need to create a log-based metric and then create an alert when the counter crosses a threshold (1, for example).
Now a more straightforward solution is available: in one step, you can use log-based alerts. This allows you to set alerts on any log type and content. This new feature is in preview and was announced a few days ago.
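If you go the log-based-metric route, a sketch of the metric creation could look like this; the metric name and the exact audit-log method names are assumptions, so add whichever operations you care about:

gcloud logging metrics create config-activity \
  --description="IAM policy changes and instance deletions" \
  --log-filter='logName:"cloudaudit.googleapis.com%2Factivity" AND (protoPayload.methodName="SetIamPolicy" OR protoPayload.methodName="v1.compute.instances.delete")'

You can then alert on logging.googleapis.com/user/config-activity with a threshold of 1, as described above.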
I want to get notified if/when any VM is created in my infrastructure on GCP.
I see a Google library that can give me the list of VMs.
I can (probably) create a function to use this code.
Schedule the above function and check for differences.
But are storage-like triggers available for Compute?
Also, is there any other solution?
You have a third solution. You can use Cloud Run instead of Cloud Functions (the migration is very easy, let me know if you have issues).
With Cloud Run, you can use a trigger (the Eventarc feature), still in preview, based on the audit logs. It's very similar to the first solution proposed by LundinCast, but it's set up automatically by the Cloud Run trigger feature.
So, deploy your service on Cloud Run. Then configure a trigger on the v1.compute.instances.insert API, select your region or make the trigger global, and that's all! Your service will be triggered when a new instance is created.
As you can see in my screenshot, you will be asked to activate the audit logs to be able to use this feature. Because it's built in, it's done automatically for you!
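For reference, a CLI sketch of such a trigger; the service name, region, and service account are placeholders, and the method name follows the Compute Engine audit-log format:

# Assumes a Cloud Run service named "vm-watcher" already deployed in europe-west1
gcloud eventarc triggers create vm-created-trigger \
  --location=europe-west1 \
  --destination-run-service=vm-watcher \
  --event-filters="type=google.cloud.audit.log.v1.written" \
  --event-filters="serviceName=compute.googleapis.com" \
  --event-filters="methodName=v1.compute.instances.insert" \
  --service-account=SERVICE_ACCOUNT_EMAIL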
Using a Logging sink and a Pub/Sub-triggered Cloud Function
First, export the relevant logs to a Pub/Sub topic of your choice by creating a Logging sink. Include the logs created automatically during VM creation with the following log filter:
resource.type="gce_instance"
(protoPayload.methodName="beta.compute.instances.insert" OR protoPayload.methodName="compute.instances.insert")
Next, create a Cloud Function that will trigger every time a new log entry is sent to the Pub/Sub topic. You can process this new message as per your needs.
Note that with this option you'll have to handle the notification yourself (for example, by sending an email). It is useful, though, if you want to send different notifications based on some condition or perform additional actions apart from the notification.
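A rough sketch of that plumbing from the CLI; the project, topic, sink, function, and entry-point names are placeholders, and the function source is assumed to live in the current directory:

# Topic that will receive the exported log entries
gcloud pubsub topics create vm-creation-logs

# Sink exporting matching entries to the topic; grant the printed writer
# identity roles/pubsub.publisher on the topic afterwards
gcloud logging sinks create vm-creation-sink \
  pubsub.googleapis.com/projects/my-project/topics/vm-creation-logs \
  --log-filter='resource.type="gce_instance" AND (protoPayload.methodName="beta.compute.instances.insert" OR protoPayload.methodName="compute.instances.insert")'

# Function invoked for every message published to the topic
gcloud functions deploy vm-creation-notifier \
  --runtime=python39 \
  --entry-point=notify \
  --trigger-topic=vm-creation-logs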
Using a log-based metric and a Cloud Monitoring alert
You can use a log-based metric filtering logs for Compute Engine VM creation and set an alert on that metric to get notified.
First, create a counter log-based metric with a log filter similar to the one in the previous method, which will report a data point to Cloud Monitoring every time a new VM instance is created.
Then go to Cloud Monitoring and create an alert based on that metric that triggers every time a data point is reported.
This option is the easiest to set up and supports various notification channels out-of-the-box.
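A sketch of the metric for this option (the metric name is a placeholder; the filter mirrors the one above):

gcloud logging metrics create vm-created \
  --description="Counts Compute Engine instance insertions" \
  --log-filter='resource.type="gce_instance" AND (protoPayload.methodName="beta.compute.instances.insert" OR protoPayload.methodName="compute.instances.insert")'

The metric then shows up in Cloud Monitoring as logging.googleapis.com/user/vm-created and can be used in an alerting policy.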
Going along with LundinCast's answer.
Cloud Run --
Would have used it if it had not been for a zone issue on my side (I conclude this from a PoC I did).
Easy setup.
Containerised apps, so probably more code to maintain.
Public URL for the app.
Out-of-the-box support for requirements like mine.
Cloud Functions --
Sink setup for triggers can be time-consuming for a first-timer.
Easy coding and maintenance.
I'd like to know if it is possible to discover which resource is behind this cost in my Cost Explorer. Grouping by usage type, I can see it is Data Processing bytes, but I don't know which resource would be consuming this amount of data.
Any idea how to discover it in CloudWatch?
This is almost certainly because something is writing more data to CloudWatch than in previous months.
As stated in this AWS Support page about unexpected CloudWatch Logs bill increases:
Sudden increases in CloudWatch Logs bills are often caused by an increase in ingested or storage data in a particular log group. Check data usage using CloudWatch Logs Metrics and review your Amazon Web Services (AWS) bill to identify the log group responsible for bill increases.
Your screenshot identifies the large usage type as APS2-DataProcessing-Bytes. I believe the APS2 part tells you it relates to the ap-southeast-2 region, so start by looking in that region when following the instructions below.
Here's a brief summary of the steps you need to take to find out which log groups are ingesting the most data:
How to check how much data you're ingesting
The IncomingBytes metric shows you how much data is being ingested into your CloudWatch log groups in near-real time. This metric can help you determine:
Which log group is the highest contributor to your bill
Whether there's been a spike in the incoming data to your log groups or a gradual increase due to new applications
How much data was pushed in a particular period
To query a small set of log groups:
Open the Amazon CloudWatch console.
In the navigation pane, choose Metrics.
For each of your log groups, select the IncomingBytes metric, and then choose the Graphed metrics tab.
For Statistic, choose Sum.
For Period, choose 30 Days.
Choose the Graph options tab and choose Number.
At the top right of the graph, choose custom, and then choose Absolute. Select a start and end date that corresponds with the last 30 days.
For more details, and for instructions on how to query hundreds of log groups, read the full AWS support article linked above.
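If you prefer the CLI, here is a rough equivalent of the console steps for a single log group; the region and log group name are placeholders, and the date arithmetic assumes GNU date:

# Sum of ingested bytes for one log group over the last 30 days
aws cloudwatch get-metric-statistics \
  --region ap-southeast-2 \
  --namespace AWS/Logs \
  --metric-name IncomingBytes \
  --dimensions Name=LogGroupName,Value=/aws/lambda/my-function \
  --start-time "$(date -u -d '30 days ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 2592000 \
  --statistics Sum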
Apart from the steps Gabe mentioned, what helped me identify the resource that was creating a large number of logs was:
heading over to CloudWatch,
selecting the region shown in Cost Explorer,
selecting Log Groups,
and, from the settings under Log Groups, enabling the Stored bytes column.
This showed me which service was causing a lot of logs to be written to CloudWatch.
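The same check from the CLI, if that's easier (the region is a placeholder); it lists log groups with their stored bytes, biggest first:

aws logs describe-log-groups --region ap-southeast-2 \
  --query 'logGroups[*].[logGroupName,storedBytes]' --output text \
  | sort -k2 -n -r | head -20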
Having created log-based metrics in the Cloud Console, I now want to create alerts so that every time there is a new matching log entry, the alert triggers.
In trying to create a suitable metric, the most likely-looking options seem to be threshold or rate of change, but I don't think either will work for a policy of 1 log message => 1 alert.
Help appreciated.
Yes, today the only way to alert on the log message is to create a threshold condition on the log-based metric with a very small threshold (0.001) and a duration of 1 minute.
Thanks for using Stackdriver.
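In API terms, that workaround roughly corresponds to a condition like the following in the alerting policy file (the metric name is a placeholder):

{
  "displayName": "Any matching log entry",
  "conditionThreshold": {
    "filter": "metric.type=\"logging.googleapis.com/user/my_log_metric\"",
    "comparison": "COMPARISON_GT",
    "thresholdValue": 0.001,
    "duration": "60s",
    "aggregations": [{"alignmentPeriod": "60s", "perSeriesAligner": "ALIGN_COUNT"}]
  }
}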
You can use another alert-triggering service (like PagerDuty) that is pinged by the emails Stackdriver sends. PagerDuty can filter out all the emails that have the word RESOLVE in the subject; they can simply be thrown away if you'd like to avoid auto-resolving. Of course, Stackdriver and PagerDuty alerts will diverge from each other (states will be inconsistent), but you should consider PD the single source of truth in this case. It could be a possible workaround.
With log-based alerts, you can create alerts from the logs; an incident will be created for each matching entry.
https://cloud.google.com/blog/products/operations/create-logs-alerts-preview
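For reference, a log-based alert is still an AlertPolicy under the hood; a minimal sketch of the policy file (the filter, display names, and the rate-limit/auto-close values are just example assumptions) looks roughly like this and can be supplied to gcloud alpha monitoring policies create --policy-from-file in the same way as a metric-based policy:

{
  "displayName": "Log match alert",
  "combiner": "OR",
  "conditions": [{
    "displayName": "Matching log entry seen",
    "conditionMatchedLog": {
      "filter": "resource.type=\"gce_instance\" AND severity>=ERROR"
    }
  }],
  "alertStrategy": {
    "notificationRateLimit": { "period": "300s" },
    "autoClose": "1800s"
  }
}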