GCP Uptime Metric is giving unreliable alerts - google-cloud-platform

I'm trying to get an alert when a GCE VM is down by creating an alerting policy.
Metric: compute.googleapis.com/instance/uptime
Resource: VM instance
The condition is configured to trigger an alert when this metric is absent for 3 minutes.
To simulate this behavior, I stopped the VM, but it does not trigger an alert; meanwhile, no data is visible in the alerting policy's graph.
I have attached the trigger configuration.

None of these metrics (compute.googleapis.com/instance/uptime, the monitoring agent's uptime metric, or the CPU utilization metrics) give reliable alerts while the VM is stopped, unless you create the alerting policy with MQL (Monitoring Query Language).
"metrics associated with TERMINATED or DELETED Google Cloud resources are not considered for metric-absence policies. This means you can't use metric-absence policies to test for TERMINATED or DELETED Google Cloud VMs."
https://cloud.google.com/monitoring/alerts/types-of-conditions#metric-absence
So, per the statement above, we cannot use a metric-absence policy for a stopped VM, because it goes to the TERMINATED state some time after it is stopped. The reason is that the instance's stop time is only calculated once it returns to the RUNNING state.
But when you configure the same condition with MQL, using the same set of metrics, metric-absence policies work without any issues.
Sample:
Instead of configuring the condition by selecting a resource and metric, go to the Query Editor and enter the query below to get an alert when the development environment VM has not been running for 3 minutes:
fetch gce_instance
| metric 'compute.googleapis.com/instance/uptime'
| filter (metadata.user_labels.env == 'dev')
| group_by 1m, [value_uptime_aggregate: aggregate(value.uptime)]
| every 1m
| absent_for 180s
I'm not sure whether this is a bug, but it is a limitation of configuring the alerting condition in the traditional way, and you can work around it with MQL.
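As a sketch, the same MQL condition can also be deployed from the CLI by saving it as a policy file. The file layout follows the Cloud Monitoring AlertPolicy schema; the display names are made up, and notification channels are omitted for brevity:

```shell
# Hypothetical sketch: the MQL condition above, saved as an AlertPolicy
# file and created with gcloud. Display names are placeholders.
cat > mql-policy.json <<'EOF'
{
  "displayName": "dev VM not running",
  "combiner": "OR",
  "conditions": [{
    "displayName": "uptime absent for 3m",
    "conditionMonitoringQueryLanguage": {
      "query": "fetch gce_instance | metric 'compute.googleapis.com/instance/uptime' | filter (metadata.user_labels.env == 'dev') | group_by 1m, [value_uptime_aggregate: aggregate(value.uptime)] | every 1m | absent_for 180s"
    }
  }]
}
EOF
gcloud alpha monitoring policies create --policy-from-file=mql-policy.json
```

This assumes the gcloud alpha monitoring component is installed and the project is already set; attach a notification channel to the policy to actually receive the alerts.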

The behavior you're describing is unusual.
I reproduced your case and created the exact alerting policy, using the same metric compute.googleapis.com/instance/uptime with the same settings, forwarding all alerts to my email.
Unfortunately I wasn't able to reproduce this behavior. After experimenting with various settings (aggregation, absence time), I kept receiving alerting emails.
Try setting up the alerting policy again. If your goal is just to monitor the state of the VM (responding or not), you can use any other metric, such as CPU usage, which will be absent when the VM is off (or unresponsive).
Finally, you can try installing the monitoring agent on your VM, which will expose more metrics and thus more information about the machine.
Have a look at the how to manage alerting policies documentation, which may be useful to you. Additionally, this documentation describes alerting policy types and how to choose the appropriate one for your use case.
Ultimately, try creating another VM and setting up an alerting policy for it. If that doesn't work, your best bet is to go to Google IssueTracker and file a new bug report; it will be handled by the product team, though this may take some (or a lot of) time depending on the issue.

Related

How to setup cloudwatch alarm for beanstalk environment memory

I'm trying to setup the Cloudwatch Alarm for memory on all instances of an AWS Elastic Beanstalk environment. I've setup capability to get Memory usage on Cloudwatch using the following tutorial:
https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers-cw.html
Now I want to setup an alarm that would trigger if the MemoryUtilization of any of these instances go beyond a certain threshold. I can select all and setup alert on each of those separately, but I want to make sure that even if Beanstalk scales up the cluster or swaps an instance, the alert doesn't have to be reconfigured.
Is there a way I can setup alarm for a condition where Instance Name = "env-name" and Metric is MemoryUtilization?
What I understand from your question are the following requirements:
1. You have multiple metrics and want to use a logical OR condition when configuring an alarm, e.g. (avg metric1 > x || avg metric2 > y) ==> set alarm state to ALARM.
2. You want the alarm to consider new metrics as they become available when new instances are launched by Elastic Beanstalk during scale-out.
3. You want old metrics to stop being considered as soon as Elastic Beanstalk scales in.
I think this is currently not possible.
There is an ongoing discussion on the AWS discussion forums [1] which reveals that at least (1) is possible using Metric Math. The Metric Math feature supports a maximum of 10 metrics.
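For instance, (1) could be expressed as a metric-math alarm from the CLI. This is a sketch under assumptions: the instance IDs, namespace, and threshold are placeholders, and the AWS CLI must be configured:

```shell
# Hypothetical sketch of (1): alarm on the OR of two instance metrics via
# metric math (MAX of the two averages). Instance IDs are placeholders.
cat > metrics.json <<'EOF'
[
  {"Id": "m1", "ReturnData": false,
   "MetricStat": {"Metric": {"Namespace": "System/Linux",
     "MetricName": "MemoryUtilization",
     "Dimensions": [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]},
     "Period": 300, "Stat": "Average"}},
  {"Id": "m2", "ReturnData": false,
   "MetricStat": {"Metric": {"Namespace": "System/Linux",
     "MetricName": "MemoryUtilization",
     "Dimensions": [{"Name": "InstanceId", "Value": "i-0fedcba9876543210"}]},
     "Period": 300, "Stat": "Average"}},
  {"Id": "e1", "Expression": "MAX([m1, m2])", "Label": "max memory"}
]
EOF
aws cloudwatch put-metric-alarm \
  --alarm-name beanstalk-memory-high \
  --metrics file://metrics.json \
  --comparison-operator GreaterThanThreshold \
  --threshold 80 --evaluation-periods 1
```

Note that the instance IDs are still hard-coded here, which is exactly why this alone does not satisfy requirements (2) and (3).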
Solution
What you need to do is create a single metric that carries the information about whether the alarm should be triggered (a 'computed metric'). There are multiple ways to achieve this:
For complex metrics you could write a bash script and run it on an EC2 instance using cron. The script would first query the existing metrics using a dimension filter ('list-metrics'), then gather each metric ('get-metric-data'), aggregate them, and finally push the computed metric data point ('put-metric-data').
If the metric is rather simple, you could try the --aggregated option of the AWS put-metric-data script [2]:
option_settings:
  "aws:elasticbeanstalk:customoption":
    CloudWatchMetrics: "--mem-util --mem-used --mem-avail --disk-space-util --disk-space-used --disk-space-avail --disk-path=/ --auto-scaling --aggregated"
The documentation for the aggregated option says:
Adds aggregated metrics for instance type, AMI ID, and overall for the region.
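The first approach (the cron-driven 'computed metric' script) could be sketched roughly as follows. The namespace, metric name, and dimension are assumptions based on the custom memory metrics discussed above; adjust them to whatever your instances actually publish:

```shell
#!/bin/bash
# Hypothetical sketch of the cron-driven 'computed metric' approach.
# Namespace/metric/dimension names are assumptions.
NS="System/Linux"
START=$(date -u -d '-5 minutes' +%Y-%m-%dT%H:%M:%S)
END=$(date -u +%Y-%m-%dT%H:%M:%S)

# 1. query existing metrics using a dimension filter (list-metrics)
IDS=$(aws cloudwatch list-metrics --namespace "$NS" \
        --metric-name MemoryUtilization \
        --query 'Metrics[].Dimensions[?Name==`InstanceId`].Value' \
        --output text)

# 2. gather each metric and aggregate (here: max of the per-instance averages)
MAX=$(for id in $IDS; do
        aws cloudwatch get-metric-statistics --namespace "$NS" \
          --metric-name MemoryUtilization \
          --dimensions Name=InstanceId,Value="$id" \
          --start-time "$START" --end-time "$END" \
          --period 300 --statistics Average \
          --query 'Datapoints[0].Average' --output text
      done | sort -g | tail -n1)

# 3. push the computed metric data point (put-metric-data)
aws cloudwatch put-metric-data --namespace Custom/Beanstalk \
  --metric-name MaxMemoryUtilization --value "$MAX"
```

A single alarm on Custom/Beanstalk MaxMemoryUtilization then survives scale-out and scale-in, because the script re-discovers the instance list on every run.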
References
[1] https://forums.aws.amazon.com/thread.jspa?threadID=94984
[2] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html#put-metric-data
In the Elastic Beanstalk console for your environment:
Click the Monitoring link in the left-hand side navigation links.
Underneath the Overview, in the Monitoring section, click the Edit button.
Choose AWSEBAutoScalingGroup for the Resource.
Choose MemoryUtilization under CloudWatch Metric.
Modify Statistic and Description as desired.
Click the Add button, and then click the Save button in the Monitoring section.
Scroll down to find the new panel that was added. Click the bell icon in the upper right hand corner of the panel. This will take you to the settings to set up a new alarm.
If you do not see the MemoryUtilization metric available, verify that you have correctly set up the collection of the memory metrics.
CloudWatch cannot create alarms in a generic way. There are only two ways to accomplish the task.
1) Create a startup script in your AMI. When a new instance is launched, it is responsible for creating its own CloudWatch alarms. I used this approach a long time ago, and it is solid. However, running scripts on termination isn't reliable, so you'll have to clean out the old alarms periodically.
2) Use a tool that has decent capabilities (ahem... not CloudWatch). I recommend Blue Matador. With them, you don't even have to set up the alarms or thresholds; the machine learning automatically baselines your resources and creates alerts for you.
If you got here and don't know Beanstalk or CloudWatch well enough to contribute, start here: How to Monitor AWS Elastic Beanstalk with CloudWatch

get alert when instance is not active in GCE

For Compute Engine (GCE), I use Stackdriver for monitoring and alerting.
Most of the general metrics like CPU, disk I/O, memory, etc. are available, and I can set alerts on those metrics, or dead-or-alive checks by process name.
However, I cannot find any metric related to the status of the GCE instance itself.
My use case is simple: I'd like to know whether the instance is down or not.
Any suggestion appreciated.
Thanks.
I think the instance status is not a monitoring metric; there's just instance/uptime available.
(And I have no clue what it would return when the instance is terminated; possibly worth a try.)
But you can check servers with Uptime Checks and then report the incident.
You can also get the instance status with gcloud compute instances describe instance01.
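Building on that last point, here is a small sketch that checks the status from the CLI; the instance name and zone are placeholders:

```shell
# Hypothetical sketch: read the VM status and warn when it is not RUNNING.
# Instance name and zone are placeholders.
STATUS=$(gcloud compute instances describe instance01 \
           --zone us-central1-a --format='value(status)')
if [ "$STATUS" != "RUNNING" ]; then
  echo "instance01 is $STATUS"
fi
```

Run from cron, this gives a crude dead-or-alive check that works even for stopped or terminated instances, which the uptime metric cannot cover.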

How to extract an instance uptime based on incidents?

On stackdriver, creating an Uptime Check gives you access to the Uptime Dashboard that contains the uptime % of your service:
My problem is that uptime checks are restricted to HTTP/TCP checks. I have other services running, and those services report their health in different ways (say, for example, by a specific process running). I already have incident policies set up for these services, so if a service is not running I get notified.
Now I want to be able to look back and know how long the service was down for the last hour. Is there a way to do that?
There's no way to programmatically retrieve alerts at the moment, unfortunately. Many resource types expose uptime as a metric, though (e.g., instance/uptime on GCE instances) - could you pull those and do the math on them? Without knowing what resource types you're using, it's hard to give specific suggestions.
Aaron Sher, Stackdriver engineer
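As a sketch of the "do the math" idea: instance/uptime is reported as seconds of uptime per sampling window, so downtime over a period is the window length minus the total reported uptime. The sample values below are made up; real ones would come from the Monitoring API:

```shell
# Made-up per-minute uptime deltas (seconds of uptime in each 60 s window);
# real values would be read from the Monitoring API time series.
samples="60 60 60 0 0 60"
n=0; up=0
for s in $samples; do
  n=$((n + 1))      # count sampling windows
  up=$((up + s))    # total seconds of uptime observed
done
window=$((n * 60))  # total seconds covered by the windows
down=$((window - up))
echo "down ${down}s out of ${window}s"
```

With these sample values the instance was down 120 s out of 360 s, i.e. roughly 33% downtime over the window.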

I want to send metric alert (in group setting) of AWS instance with stackdriver monitoring

My question concerns settings for monitoring AWS metrics with Stackdriver.
I tried the steps below, but the alerting policy is not working.
How do I get an alerting policy with group settings to send alerts?
I don't want single-instance monitoring; I want group settings.
I completed the Stackdriver monitoring setup for the AWS accounts via role settings. Next, I configured a group-based alerting policy with the metrics below:
load average > 5
disk usage > 80%
The targets are several EC2 instances, collected in a group.
After completing these settings, I ran a stress test.
I looked at the metrics, and the graph exceeded the threshold.
But no alert was sent, and no incident was opened.
Details below.
Alert(Policy) Creation
go to [Alerting/ Policies/ TARGET POLICY]
[Add Condition], then select [Metric Threshold]
RESOURCE TYPE is Instance(EC2)
APPLIES TO is Group
Select group. This group is Including EC2 Instances.
CONDITION TRIGGERS IF: Any Member Violates
IF METRIC is [CPU Load Average (past 1m)]
CONDITION is above
THRESHOLD is 5 load
FOR is 1 minute
Enter a name and push [Save Policy]
Stress Test
SSH to the target instances.
Execute a stress test.
Confirm that the load average reached 5 or above.
But no alert fired.
Confirming in Stackdriver
Confirmed on the alert settings page that the load average reached 5.
But no incident was opened.
Other settings I tried
For GCP instances, alerts work correctly, in both group and single settings.
For AWS instances, alerts work in a single-instance configuration, but not with group settings.
Version info
stackdriver
stackdriver-agent version: stackdriver-agent.x86_64 5.5.2-366.amzn1
aws
OS: Amazon Linux
VERSION: 2016.03
ID_LIKE: rhel fedora
For more detail, please ask in the comments.
If the agent wasn't configured correctly and is sending metrics to the wrong project, that could lead to the behavior described: metrics show up for single instances, but any alerts that use group filters fail. GCP instances might work because monitoring GCE instances requires zero setup.
https://cloud.google.com/monitoring/agent/troubleshooting#verify-project
"If you are using an Amazon EC2 VM instance, or if you are using private-key credentials on your Google Compute Engine instance, then the credentials could be invalid or they could be from the wrong project. For AWS accounts, the project used by the agent must be the AWS connector project, typically named "AWS Link..."."
The instructions at https://cloud.google.com/monitoring/agent/troubleshooting#verify-running help verify that the agent is sending metrics correctly.
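As a quick first check on the instance itself, you can inspect which project the agent's credentials belong to. The path below is the one named in the troubleshooting guide, and the project_id field assumes the credentials are a standard service-account key file:

```shell
# Sketch: print the project the agent's credentials point at. The path is
# from the troubleshooting guide; 'project_id' assumes a service-account key.
grep '"project_id"' /etc/google/auth/application_default_credentials.json
```

For an AWS VM, the printed project should be the AWS connector project (typically named "AWS Link..."), not your main GCP project.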

Using a stop alarm with a g2.2xlarge instance on Amazon's ec2 aws

While working with a g2.2xlarge spot instance, I tried to set up an alarm that will notify me when the average CPU usage over a two-hour period drops below 5% and will then automatically stop the instance. Here's a link to a nice article Amazon wrote up on how to use the stop/start instance feature. The AWS alarms seem to allow you to do this; however, after the trigger goes off I get this reply:
Dear AWS customer,
We are unable to execute the 'Stop' action on Amazon EC2 instance i-e60e21ec that you specified in the Amazon CloudWatch alarm awsec2-i-e60e21ec-Low-CPU-Utilization.
You may want to check the alarm configuration to ensure that it is compatible with your instance configuration. You can also attempt to execute the action manually.
These are some possible reasons for this failure and steps you can try to resolve it:
Incompatible action selected:
Your instance’s configuration may not be compatible with the selected action.
To execute the 'Terminate' action, your instance may have Termination Protection enabled. Disable this feature if you want to terminate your instance. Once you do that, the alarm will execute the action after the next applicable alarm state change.
To execute the 'Stop' action, your instance’s root device type must be an EBS volume. If the root device type is the instance store, select the 'Terminate' action instead. Once you do that, the alarm will execute the action after the next applicable alarm state change.
Temporary service interruption: There may have been an issue with Amazon CloudWatch or Amazon EC2. We have retried the action without success. You can try to execute the action manually, or wait for the next applicable alarm state change.
Sincerely, Amazon Web Services
Stop seems to be an option for the free micro instance but not for these other instances. When I try to change the shutdown behavior to Stop under Actions, it says:
An error occurred while changing the shutdown behavior of this instance.
Modifying 'instanceInitiatedShutdownBehavior is not supported for spot instances.
Is there another way to get around this problem or will we have to wait until Amazon makes this feature available?
Use standard (on-demand) instances instead of spot instances. Spot instances allow you to bid on spare EC2 capacity; however, they may be shut down automatically if the spot price exceeds your bid.
They're not really intended for an always-on instance.