I have several alerts in GCP for specific causes/actions.
For example, for myFunction:
I get an alert (Slack/mail) if it fails (msg: "failed!"). The alert matches the specific text message "failed!".
But how do I create an alert if my function has not started within the last hour (i.e., no "started!" message)?
Any suggestions?
Create an alerting policy with a custom log-based metric that looks for msg: "started!", and in the Configuration section set the condition to "Is absent" with a duration of 1 hour.
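For example, the underlying log-based metric could use a filter along these lines (the resource type and function name here are assumptions; adjust them to your function and to the exact "started!" log message):
resource.type="cloud_function"
resource.labels.function_name="myFunction"
textPayload:"started!"
Once that counter metric exists, the "Is absent" condition with a 1-hour duration fires whenever no "started!" entry has been counted for an hour.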
I am running a Vertex AI custom training job (machine learning training using a custom container) on GCP. I would like to create a Pub/Sub message when the job fails so I can post a message on some chat like Slack. The log entry (Cloud Logging) looks like this:
{
  insertId: "xxxxx"
  labels: {
    ml.googleapis.com/endpoint: ""
    ml.googleapis.com/job_state: "FAILED"
  }
  logName: "projects/xxx/logs/ml.googleapis.com%2F1113875647681265664"
  receiveTimestamp: "2021-07-09T15:05:52.702295640Z"
  resource: {
    labels: {
      job_id: "1113875647681265664"
      project_id: "xxx"
      task_name: "service"
    }
    type: "ml_job"
  }
  severity: "INFO"
  textPayload: "Job failed."
  timestamp: "2021-07-09T15:05:52.187968162Z"
}
I am creating a Logs Router Sink with the following query:
resource.type="ml_job" AND textPayload:"Job failed" AND labels."ml.googleapis.com/job_state":"FAILED"
The issue I am facing is that Vertex AI will retry the job 3 times before declaring it a failure, but the log message is identical each time. Below are three examples; only the last one, which failed 3 times, really failed in the end.
In the log entries I don't have any retry count, for example. Any idea how to solve this? Creating a BigQuery table to keep track of the number of failures per resource.labels.job_id seems like overkill if I need to do that in every project. Is there a way to do a group by on resource.labels.job_id and count within a Logs Router Sink?
The log sink is quite simple: you provide a filter, and it publishes to a Pub/Sub topic every entry that matches that filter. No group by, no count, nothing!
I propose using a combination of log-based metrics and Cloud Monitoring.
First, create a log-based metric on your "Job failed" log entries.
Then create an alert on this log-based metric with the following key settings:
Set the group-by that you want, for example the job ID (I don't know which value is relevant for a Vertex AI job).
Set the alert to trigger when the count is equal to or above 3.
Add a notification channel and choose a Pub/Sub notification (still in beta).
With this configuration, the alert will be posted to Pub/Sub only once, when 3 occurrences of the same job ID have been observed (a rough sketch of the condition is shown below).
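As a sketch, the alert condition could be written in MQL along these lines (the metric name vertex_job_failed is a placeholder for whatever you call the log-based metric built from the "Job failed" filter, and resource.job_id comes from the ml_job resource labels shown in the log entry above):
fetch ml_job
| metric 'logging.googleapis.com/user/vertex_job_failed'
| align delta(1h)
| group_by [resource.job_id], sum(val())
| condition val() >= 3
The group_by collapses the retries of one job into a single time series, so the condition only becomes true once the same job_id has failed 3 times within the window.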
I have an application that I'm setting up logs-based monitoring for. The application will log whenever it completes a certain task. I want to ensure that the application completes this at least once every 6 hours.
I have tried to replicate this rule by configuring monitoring to fire an alert when the metric stays below 1 for the given amount of time.
Unfortunately, when the logs-based metric doesn't receive any logs, it appears to be treated as "no data" instead of a value of 0.
Is it possible to treat segments when no logs are received as a 0 so that the alert will fire?
Screenshot of my metric graph:
Screenshot of alert definition:
You can see that we receive a log for one time frame, but right afterwards the line disappears and an alert isn't triggered.
Try using absent_for with an MQL-based alert.
The absent_for table operation generates a table with two value columns, active and signal. The active column is true when there is data missing from the table input and false otherwise. This is useful for creating a condition query to be used to alert on the absence of inputs.
Example:
fetch gce_instance :: compute.googleapis.com/instance/cpu/usage_time
| absent_for 8h
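Applied to a log-based metric, the same pattern might look like this (the resource type gce_instance and the metric name task_completed are assumptions; substitute the resource your application runs on and the metric you created):
fetch gce_instance
| metric 'logging.googleapis.com/user/task_completed'
| absent_for 6h
The condition then evaluates to true when no data point has arrived for 6 hours, which sidesteps the "no data" vs. 0 problem of the threshold-based condition.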
I've enabled the notifications in Stackdriver and I'm getting notification e-mails for exceptions just fine.
The problem is that I don't get any notification for timeouts.
Is there any way to be notified when a Google Cloud Function is killed by timeout?
Even though a timeout is not reported as an error, you can still set up a metric for timeout log entries, and then an alert on the metric exceeding a zero threshold.
From the GCP console, go to the Stackdriver Logging viewer (/logs/viewer), and build a filter like this:
resource.type="cloud_function"
resource.labels.function_name="[YOUR_FUNCTION_NAME_HERE]"
"finished with status: timeout"
The third line is a "contains" text filter. Timeout messages consistently contain this text. You can add other things or modify as needed.
Click Create Metric. Give the metric a name like "Function timeouts", and make sure the type is counter. You can leave the optional fields blank. Submit the form, and you should be redirected to /logs/metrics.
Under User-defined Metrics, you should see your new metric. Click the three-dot button on the right and select Create alert from metric.
Give the alert policy a meaningful name. Under target, you may also get some red text about being unable to produce a line plot. Click the helpful link to switch the aligner to mean and the aggregator to none. Then under Configuration, set the condition to "is above," threshold to "0", and for "most recent value."
Proceed with building the notification and documentation as desired. Make sure you add a notification channel so you get alerted. The UI should include hints on each field.
More detail is in the official documentation.
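If you prefer MQL over the threshold form, the same condition might look roughly like this (the metric name function_timeouts is an assumption; it must match the name you gave the log-based metric above):
fetch cloud_function
| metric 'logging.googleapis.com/user/function_timeouts'
| align delta(5m)
| condition val() > 0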
Navigate to "Create alerting policy" using the search box at the top of the dashboard.
Under "What do you want to track?" click "Add Condition."
Configure the new condition like so:
Click "Add."
Click "Next."
Select a notification channel or create a new one.
I unchecked "Notify on incident resolution."
Click "Next."
Provide a descriptive alert name and optional documentation.
Click "Save."
Ensure that at the top of the policy you see the word "Enabled" along with a green checkmark.
I came up with a workaround for this by forcing an error before the Cloud Function times out. In terms of workflow, I find this much easier to control, and it consolidates all the errors in one place rather than having to configure settings elsewhere.
Basically something like the code snippet below:
exports.cloudFunction = async (event, context, callback) => {
  // Fire our own error ~2 seconds before the default 60s timeout. Note that
  // throwing inside the setTimeout callback would not be caught by the
  // try/catch below, so the timer passes the error to callback() directly.
  const timeout = setTimeout(() => {
    callback(new Error(`Timeout: ${JSON.stringify(event)}`));
  }, 58000); // 2sec buffer off the default 60s timeout

  try {
    // DO SOMETHING
    clearTimeout(timeout);
    callback();
  } catch (e) {
    clearTimeout(timeout);
    // HANDLE ERROR
    callback(e);
  }
};
Is it possible to set up an alert based on the status of a custom service? For example, the stackdriver-agent service crashed at one point. When running "service stackdriver-agent status" I receive an "Active: inactive (dead)" response.
Is it possible to set up an alert based on the condition above? The stackdriver-agent service is just an example; in theory, I would like to set up this alert condition for any service.
The answer is yes. In Stackdriver you can set up an alert for any process on your machine. By selecting the option Add Process Health Condition, you can configure alerts so that you are notified when your process starts or stops. Bear in mind that you first have to set up the Stackdriver agent on your machine and that this option is only available in Stackdriver Premium.
Thrahir's answer is a good one, though the UI has changed since then (click the right arrow next to "Metric" and "Uptime Check" to see other condition types; "Process Health" is the very last one).
If your service is a server, you might rather use an uptime check (https://cloud.google.com/monitoring/uptime-checks/) to monitor its state; that gives you a better analog to what the service's users will see than directly monitoring your processes does.
Aaron Sher, Stackdriver engineer
Setup
Note: Using pseudo-code instance notation: ObjectType("<name>", <attr>: <attr-value>).
We have a Container:
Container("k8s-snapshots") in a Pod("k8s-snapshots-0") in a StatefulSet("k8s-snapshots", spec.replicas: 1).
We expect at most 1 Pod to run at any point in time.
We have a Logs-based Counter Metric("k8s-snapshots/snapshot-created") with the filter:
resource.type="container"
resource.labels.cluster_name="my-cluster"
logName="projects/my-project/logs/k8s-snapshots"
jsonPayload.event:"snapshot.created"
We have a Stackdriver Policy:
Policy(
  Name: "snapshot metric absent",
  Condition: Condition(
    Metric("k8s-snapshots/snapshot-created"),
    is absent for: "more than 30 minutes"
  )
)
This policy is meant to monitor whether Container("k8s-snapshots") has stopped creating snapshots.
Expected result
An alert is triggered if no instance of Pod("k8s-snapshots-0") has logged any event matching Metric("k8s-snapshots/snapshot-created").
Result
Policy(Name: "snapshot metric absent") is violated each time Pod("k8s-snapshots-0") is rescheduled.
It seems like a sub-metric of the main logs-based metric is created for each instance of Pod("k8s-snapshots"), and Stackdriver alerts for each sub-metric.
Are you still experiencing the issue? With the Stackdriver API you have the ability to aggregate metrics (including custom metrics) in ways the UI does not yet support. You can also visit this link.
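As a sketch of that aggregation, an MQL condition could sum the metric across all pod instances before checking for absence, so a rescheduled pod does not leave behind its own empty time series (the resource type k8s_container and the metric path are assumptions based on the filter above, and you should verify that absent_for accepts the aggregated input in your setup):
fetch k8s_container
| metric 'logging.googleapis.com/user/k8s-snapshots/snapshot-created'
| group_by [resource.cluster_name], sum(val())
| absent_for 30m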