Stackdriver Alert based on service status - google-cloud-platform

Is it possible to setup an alert based on the status of a custom service. For example, stackdriver-agent service crashed at one point. When running 'service stackdriver-agent status" I receive an 'Active: inactive (dead)' response.
Is it possible to setup an alert based on the condition above? The stackdriver-agent service is just an example. In theory, I would like to setup this alert condition on any service.

The answer is yes. In Stackdriver you can set up an alarm for any process in your machine. Selecting the option Add Process Health Condition you can configure alarms to receive notifications if your process starts or stops. Bear in mind that you first have to set up the Stackdriver Agent in your machine and that this option is only available in Stackdriver premium.

Thrahir's answer is a good one, though the UI has changed since then (click the right arrow next to "Metric" and "Uptime Check" to see other condition types; "Process Health" is the very last one).
If your service is a server, you might rather use an uptime check (https://cloud.google.com/monitoring/uptime-checks/) to monitor its state; that gives you a better analog to what the service's users will see than directly monitoring your processes does.
Aaron Sher, Stackdriver engineer

Related

When I get 'services has reached steady state', in Amazon ECS does it means some tasks had stopped?

Does this means that my service tasks are stopping or it's ok to get these log messages?
actually opposite this. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments.
No it doesn't mean that any of your tasks had stopped. If a task stops you will see an event that clearly states so and will include a link to the specific task that was stopped. For example you will get something like this "service xxx has stopped 1 running tasks: task xxx."
If no tasks have been created or stopped in the last six hours the ECS console will duplicate the last event message to let you know that everything works as expected.
From the ECS docs:
"To ensure that this event view is helpful, we only show the 100 most recent events and duplicate event messages are omitted until either the cause is resolved or six hours passes. If the cause is not resolved within six hours, you will receive another service event message for that cause."
https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-event-messages.html
Check this thread here on the aws forums. https://forums.aws.amazon.com/thread.jspa?threadID=182793
This sounds like normal behavior. The service scheduler reports status periodically. A normal state indicates that there is nothing for it to do -- all tasks are healthy, there are no scaling requests or deployments. Are you seeing any issues?

Cloud composer tasks fail without reason or logs

I run Airflow in a managed Cloud-composer environment (version 1.9.0), whic runs on a Kubernetes 1.10.9-gke.5 cluster.
All my DAGs run daily at 3:00 AM or 4:00 AM. But sometime in the morning, I see a few Tasks failed without a reason during the night.
When checking the log using the UI - I see no log and I see no log either when I check the log folder in the GCS bucket
In the instance details, it reads "Dependencies Blocking Task From Getting Scheduled" but the dependency is the dagrun itself.
Although the DAG is set with 5 retries and an email message it does not look as if any retry took place and I haven't received an email about the failure.
I usually just clear the task instance and it run successfully on the first try.
Has anyone encountered a similar problem?
Empty logs often means the Airflow worker pod was evicted (i.e., it died before it could flush logs to GCS), which is usually due to an out of memory condition. If you go to your GKE cluster (the one under Composer's hood) you will probably see that there is indeed a evicted pod (GKE > Workloads > "airflow-worker").
You will probably see in "Tasks Instances" that said tasks have no Start Date nor Job Id or worker (Hostname) assigned, which, added to no logs, is a proof of the death of the pod.
Since this normally happens in highly parallelised DAGs, a way to avoid this is to reduce the worker concurrency or use a better machine.
EDIT: I filed this Feature Request on your behalf to get emails in case of failure, even if the pod was evicted.

uknown reason for google cloud compute engine VM shutdown?

One of the vm instances on google cloud compute was shutdown, with an event log in stackdriver without ip or actor (user or service or system) which initiated the event. The instance has onHostMaintenance set to migrate and automaticRestart set to true. This particular instance has migrated on maintenance without error before. The stackdriver event log looks like
{
actor: {
user: ""
},
event_subtype: "compute.instances.stop",
event_timestamp_us: "1531781734907624",
event_type: "GCE_API_CALL",
ip_address: "",
}
The user and ip_address fields are NOT redacted. They have empty values on actual log.
is this common? how does one identify the cause for shutdown in these peculiar cases ?
According to your event type, I dive in the doc of Activity logs,
Compute Engine API calls - GCE_API_CALL events are API calls that change the state of a resource.
It seems like someone may use API calls to shutdown your VM. Looking your settings and logs on the VM instance, it can't be a hostError or Maintenance event. As far as you turn on the onHostMaintenance set to migrate and automaticRestart set to true. Your VM will always be migrated to another hardware.
Out of curiosity, did you restart your VM? did it shutdown again? How often does it happen?
Seems like app engine had an outage during that time, incident .

Is there an AWS / Pagerduty service that will alert me if it's NOT notified

We've got a little java scheduler running on AWS ECS. It's doing what cron used to do on our old monolith. it fires up (fargate) tasks in docker containers. We've got a task that runs every hour and it's quite important to us. I want to know if it crashes or fails to run for any reason (eg the java scheduler fails, or someone turns the task off).
I'm looking for a service that will alert me if it's not notified. I want to call the notification system every time the script runs successfully. Then if the alert system doesn't get the "OK" notification as expected, it shoots off an alert.
I figure this kind of service must exist, and I don't want to re-invent the wheel trying to build it myself. I guess my question is, what's it called? And where can I go to get that kind of thing? (we're using AWS obviously and we've got a pagerDuty account).
We use this approach for these types of problems. First, the task has to write a timestamp to a file in S3 or EFS. This file is the external evidence that the task ran to completion. Then you need an http based service that will read that file and calculate if the time stamp is valid ie has been updated in the last hour. This could be a simple php or nodejs script. This process is exposed to the public web eg https://example.com/heartbeat.php. This script returns a http response code of 200 if the timestamp file is present and valid, or a 500 if not. Then we use StatusCake to monitor the url, and notify us via its Pager Duty integration if there is an incident. We usually include a message in the response so a human can see the nature of the error.
This may seem tedious, but it is foolproof. Any failure anywhere along the line will be immediately notified. StatusCake has a great free service level. This approach can be used to monitor any critical task in same way. We've learned the hard way that critical cron type tasks and processes can fail for any number of reasons, and you want to know before it becomes customer critical. 24x7x365 monitoring of these types of tasks is necessary, and helps us sleep better at night.
Note: We always have a daily system test event that triggers a Pager Duty notification at 9am each day. For the truly paranoid, this assures that pager duty itself has not failed in some way eg misconfiguratiion etc. Our support team knows if they don't get a test alert each day, there is a problem in the notification system itself. The tech on duty has to awknowlege the incident as per SOP. If they do not awknowlege, then it escalates to the next tier, and we know we have to have a talk about response times. It keeps people on their toes. This is the final piece to insure you have robust monitoring infrastructure.
OpsGene has a heartbeat service which is basically a watch dog timer. You can configure it to call you if you don't ping them in x number of minutes.
Unfortunately I would not recommend them. I have been using them for 4 years and they have changed their account system twice and left my paid account orphaned silently. I have to find a new vendor as soon as I have some free time.

Break out of loop in AWS SWF activity

I'm running permanent loop in SWF Activity. Say like a web crawler crawling a website www.example1.com. However, I don't want to wait until it finishes crawling, but at certain time I want to terminate the activity and switch it to craw website www.example2.com instead.
I have tried to use 'try-cancel', 'terminate', workflow by workflow-id. It seems like it just sends signal to SWF to indicate that the task is finished in the AWS console, but the Activity process on worker is still running.
Any solution for this?
When activity is cancelled a heartbeat call returns flag that indicates that. So your activity loop should include heartbeating code to support cancellation. See "activity heartbeat" section from "error handling" page of AWS Flow Framework for Java
Developer Guide for an example.