Google Cloud Monitoring - Get uptime check current status - google-cloud-platform

I created an uptime check for my website. Then, I found this documentation page that shows how to extract information regarding the uptime check with C#.
After running the code:
public static object GetUptimeCheckConfig(string configName)
{
    var client = UptimeCheckServiceClient.Create();
    UptimeCheckConfig config = client.GetUptimeCheckConfig(configName);
    if (config == null)
    {
        Console.Error.WriteLine(
            "No configuration found with the name {0}", configName);
        return -1;
    }
    Console.WriteLine("Name: {0}", config.Name);
    Console.WriteLine("Display Name: {0}", config.DisplayName);
    Console.WriteLine("Http Path: {0}", config.HttpCheck.Path);
    return 0;
}
I found that this method only returns the configuration of the check. I want to get information about its current status (working / broken), but that information seems to be missing.
I also tried this REST call helper - the requested information is missing there too.
Is it possible to extract the current health status of the resource?
Or do I need to choose a more complex way to extract the data (e.g. via webhooks)?

From GCP metrics docs:
To monitor the availability of a service, create an uptime check. These checks monitor the monitoring.googleapis.com/uptime_check/check_passed metric type. Don't configure an alerting policy to track a metric type such as compute.googleapis.com/instance/uptime if your goal is to monitor the availability of a service.
And then at uptime check docs:
To determine the status of your uptime checks using the API, monitor the metric monitoring.googleapis.com/uptime_check/check_passed. See Google Cloud metrics list for details.
Original answer:
Instead of GetUptimeCheckConfig, you want to use the timeSeries API.
You can try it in the API Explorer at https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.timeSeries/query
Request args:
projects/YOUR_PROJECT_ID
Request body:
{
  "query": "fetch uptime_url::monitoring.googleapis.com/uptime_check/request_latency | filter check_id = 'YOUR_CHECK_ID' | group_by [checker_location]"
}
* Just make sure you replace YOUR_PROJECT_ID and YOUR_CHECK_ID with the actual IDs.
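If you want to stay in C# rather than call the REST endpoint, a rough (untested) sketch with the Google.Cloud.Monitoring.V3 client could read the recent check_passed points mentioned in the docs above instead of the check configuration. The method name and the 10-minute window are just placeholders of mine, and I believe the check_id label matches the ID portion of the config name (the part after uptimeCheckConfigs/), so treat this as a sketch:

using System;
using Google.Api.Gax.ResourceNames;
using Google.Cloud.Monitoring.V3;
using Google.Protobuf.WellKnownTypes;

public static void PrintUptimeCheckStatus(string projectId, string checkId)
{
    var client = MetricServiceClient.Create();
    // Look at the last 10 minutes of uptime check results.
    var interval = new TimeInterval
    {
        StartTime = Timestamp.FromDateTime(DateTime.UtcNow.AddMinutes(-10)),
        EndTime = Timestamp.FromDateTime(DateTime.UtcNow),
    };
    string filter =
        "metric.type=\"monitoring.googleapis.com/uptime_check/check_passed\"" +
        $" AND metric.labels.check_id=\"{checkId}\"";
    foreach (TimeSeries series in client.ListTimeSeries(
        ProjectName.FromProject(projectId), filter, interval,
        ListTimeSeriesRequest.Types.TimeSeriesView.Full))
    {
        // One series per checker location; print its labels to identify it.
        foreach (var label in series.Metric.Labels)
            Console.WriteLine("{0} = {1}", label.Key, label.Value);
        foreach (Point point in series.Points)
        {
            // check_passed is a boolean metric: true means the check succeeded.
            Console.WriteLine("{0}: passed={1}",
                point.Interval.EndTime, point.Value.BoolValue);
        }
    }
}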

Related

GCP BigTable Metrics - what do 404 requests mean?

We switched to BigTable some time ago, and since then there are a number of "404 requests" and also a high number of errors in the GCP Metrics console.
We see no errors in our logs, and even data storage/retrieval seems to work as expected.
What is the cause of these errors, and how can we find out what is causing them?
As mentioned previously, 404 means the resource was not found. The relevant resource here is the Bigtable table (which could mean that either the instance ID or the table ID is misconfigured in your application).
I'm guessing that you are looking at the metrics under APIs & Services > Cloud Bigtable API. These metrics show the response code from the Cloud Bigtable service. You should be able to see this error rate under Monitoring > Metrics Explorer > metric:bigtable.googleapis.com/server/error_count, grouping by instance, method, error_code and app_profile. This will tell you which instance and which RPC are causing the errors, which lets you grep your source code for incorrect usages.
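If you would rather pull that metric programmatically than through Metrics Explorer, here is a rough (untested) sketch with the C# Google.Cloud.Monitoring.V3 client. The project ID, method name and one-hour window are placeholders of mine; printing the series labels shows which instance, RPC method and error code each series belongs to:

using System;
using System.Linq;
using Google.Api.Gax.ResourceNames;
using Google.Cloud.Monitoring.V3;
using Google.Protobuf.WellKnownTypes;

public static void PrintBigtableErrorCounts(string projectId)
{
    var client = MetricServiceClient.Create();
    var interval = new TimeInterval
    {
        StartTime = Timestamp.FromDateTime(DateTime.UtcNow.AddHours(-1)),
        EndTime = Timestamp.FromDateTime(DateTime.UtcNow),
    };
    string filter = "metric.type=\"bigtable.googleapis.com/server/error_count\"";
    foreach (TimeSeries series in client.ListTimeSeries(
        ProjectName.FromProject(projectId), filter, interval,
        ListTimeSeriesRequest.Types.TimeSeriesView.Full))
    {
        // The metric and resource labels identify the instance, RPC method and error code.
        Console.WriteLine("metric labels:   {0}",
            string.Join(", ", series.Metric.Labels.Select(l => l.Key + "=" + l.Value)));
        Console.WriteLine("resource labels: {0}",
            string.Join(", ", series.Resource.Labels.Select(l => l.Key + "=" + l.Value)));
        // error_count is a delta INT64 metric, so summing the points gives the
        // total number of errors in the window.
        Console.WriteLine("errors in the last hour: {0}",
            series.Points.Sum(p => p.Value.Int64Value));
    }
}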
A significantly more complex approach is to install an interceptor in the Bigtable client that:
dumps the resource name of the RPC
once you identify the problematic table name, logs the stack trace of the caller
Something along these lines:
BigtableDataSettings.Builder builder = BigtableDataSettings.newBuilder()
    .setProjectId("...")
    .setInstanceId("...");

ConcurrentHashMap<String, Boolean> seenTables = new ConcurrentHashMap<>();

builder.stubSettings().setTransportChannelProvider(
    EnhancedBigtableStubSettings.defaultGrpcTransportProviderBuilder()
        .setInterceptorProvider(() -> ImmutableList.of(new ClientInterceptor() {
          @Override
          public <ReqT, RespT> ClientCall<ReqT, RespT> interceptCall(
              MethodDescriptor<ReqT, RespT> methodDescriptor, CallOptions callOptions,
              Channel channel) {
            return new ForwardingClientCall.SimpleForwardingClientCall<ReqT, RespT>(
                channel.newCall(methodDescriptor, callOptions)) {
              @Override
              public void sendMessage(ReqT message) {
                Message protoMessage = (Message) message;
                FieldDescriptor desc = protoMessage.getDescriptorForType()
                    .findFieldByName("table_name");
                if (desc != null) {
                  String tableName = (String) protoMessage.getField(desc);
                  if (seenTables.putIfAbsent(tableName, true) == null) {
                    System.out.println("Found new tableName: " + tableName);
                  }
                  if ("projects/my-project/instances/my-instance/tables/my-mispelled-table".equals(
                      tableName)) {
                    new RuntimeException(
                        "Fake error to get caller location of mispelled table id").printStackTrace();
                  }
                }
                delegate().sendMessage(message);
              }
            };
          }
        }))
        .build()
);
Google Cloud Support here.
Without more insight I won't be able to provide valid information about this 404 issue.
The issue must be either a typo or a configuration problem, but I cannot confirm that with the shared data.
In order to provide more meaningful support, I would suggest you open a Public Issue Tracker issue or a Google Cloud Support ticket.

How to create a Logs Router Sink when a Vertex AI training job failed (after 3 attempts)?

I am running a Vertex AI custom training job (machine learning training using a custom container) on GCP. I would like to publish a Pub/Sub message when the job fails so I can post a message on some chat like Slack. The log entry (Cloud Logging) looks like this:
{
  insertId: "xxxxx"
  labels: {
    ml.googleapis.com/endpoint: ""
    ml.googleapis.com/job_state: "FAILED"
  }
  logName: "projects/xxx/logs/ml.googleapis.com%2F1113875647681265664"
  receiveTimestamp: "2021-07-09T15:05:52.702295640Z"
  resource: {
    labels: {
      job_id: "1113875647681265664"
      project_id: "xxx"
      task_name: "service"
    }
    type: "ml_job"
  }
  severity: "INFO"
  textPayload: "Job failed."
  timestamp: "2021-07-09T15:05:52.187968162Z"
}
I am creating a Logs Router Sink with the following query:
resource.type="ml_job" AND textPayload:"Job failed" AND labels."ml.googleapis.com/job_state":"FAILED"
The issue I am facing is that Vertex AI will retry the job 3 times before declaring it a failure, but the log message is identical for every attempt. Below you have 3 examples; only the last one, which failed 3 times, really failed in the end.
In the log entries I don't have any retry count, for example. Any idea how to solve this? Creating a BigQuery table to keep track of the number of failures per resource.labels.job_id seems like overkill if I need to do that in all my projects. Is there a way to group by resource.labels.job_id and count within a Logs Router Sink?
The log sink is quite simple: you provide a filter, and it publishes to a Pub/Sub topic each entry that matches this filter. No group by, no count, nothing!
I propose you use a combination of log-based metrics and Cloud Monitoring:
Firstly, create a log-based metric on your "Job failed." log entry (a sketch follows after this list)
Then create an alert on this log-based metric with the following key values:
Set the group-by that you want, for example the job ID (I don't know what the relevant value is for a Vertex AI job)
Set the alert to trigger when the threshold is equal to or above 3
Add a notification channel and set a Pub/Sub notification (still in beta)
With this configuration, the alert will be posted to Pub/Sub only once, when 3 occurrences of the same job ID occur.
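For the first step, here is a rough, untested sketch with the C# Google.Cloud.Logging.V2 client. The metric name "vertex-job-failed" is a placeholder of mine, and the label extractor is my assumption for exposing the job ID so the alert can group by it; the alerting policy itself (the remaining steps) would still be created in the console or via the Monitoring API:

using Google.Api;
using Google.Cloud.Logging.V2;

var client = MetricsServiceV2Client.Create();
var metric = new LogMetric
{
    // Hypothetical metric name; it becomes logging.googleapis.com/user/vertex-job-failed.
    Name = "vertex-job-failed",
    Description = "Counts 'Job failed.' entries from Vertex AI custom jobs",
    Filter = "resource.type=\"ml_job\" AND textPayload:\"Job failed\""
           + " AND labels.\"ml.googleapis.com/job_state\"=\"FAILED\"",
    MetricDescriptor = new MetricDescriptor
    {
        MetricKind = MetricDescriptor.Types.MetricKind.Delta,
        ValueType = MetricDescriptor.Types.ValueType.Int64,
        // Expose the job ID as a metric label so the alert can group by it.
        Labels =
        {
            new LabelDescriptor
            {
                Key = "job_id",
                ValueType = LabelDescriptor.Types.ValueType.String,
            },
        },
    },
    // Pull the label value out of the log entry's resource labels.
    LabelExtractors = { { "job_id", "EXTRACT(resource.labels.job_id)" } },
};
client.CreateLogMetric("projects/my-project-id", metric);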

Google cloud alert for not receiving message during an hour?

I have several alerts in GCP for specific causes/actions.
Like for myFunction:
I get an alert (Slack/mail) if it fails (msg: "failed!"). The alert works on the specific text message "failed!".
But how do I create an alert if my function has not started during the last hour (msg: "started!")?
Any suggestions?
Create an alerting policy with a custom log-based metric that looks for msg: "started!", and in the Configuration section set the condition to "Is absent" with a duration of 1 hr.
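For reference, the same absence condition could also be created programmatically. The following is a rough, untested sketch with the C# Google.Cloud.Monitoring.V3 client; the metric name "function-started", the project ID and the display names are placeholders of mine, and it assumes a log-based counter metric counting the "started!" entries of a Cloud Function already exists:

using System;
using Google.Api.Gax.ResourceNames;
using Google.Cloud.Monitoring.V3;
using Google.Protobuf.WellKnownTypes;

var client = AlertPolicyServiceClient.Create();
var policy = new AlertPolicy
{
    DisplayName = "myFunction did not start within 1 hour",
    Combiner = AlertPolicy.Types.ConditionCombinerType.Or,
    Conditions =
    {
        new AlertPolicy.Types.Condition
        {
            DisplayName = "\"started!\" log entries absent for 1 hour",
            // "Is absent" in the UI maps to a MetricAbsence condition in the API.
            ConditionAbsent = new AlertPolicy.Types.Condition.Types.MetricAbsence
            {
                Filter = "metric.type=\"logging.googleapis.com/user/function-started\""
                       + " AND resource.type=\"cloud_function\"",
                Duration = Duration.FromTimeSpan(TimeSpan.FromHours(1)),
            },
        },
    },
};
// Slack/mail channels can be attached via policy.NotificationChannels.
client.CreateAlertPolicy(ProjectName.FromProject("my-project-id"), policy);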

How can I avoid "IN_USED_ADDRESSES" error when starting multiple Dataflow jobs from the same template?

I have created a Dataflow template which allows me to import data from a CSV file in Cloud Storage into BigQuery. I use a Cloud Function for Firebase to create jobs from this template at a certain time every day. This is the code in the Function (with some irrelevant parts removed):
const filePath = object.name?.replace(".csv", "");

// Exit function if file changes are in temporary or staging folder
if (
  filePath?.includes("staging") ||
  filePath?.includes("temp") ||
  filePath?.includes("templates")
)
  return;

const dataflow = google.dataflow("v1b3");
const auth = await google.auth.getClient({
  scopes: ["https://www.googleapis.com/auth/cloud-platform"],
});

let request = {
  auth,
  projectId: process.env.GCLOUD_PROJECT,
  location: "asia-east1",
  gcsPath: "gs://my_project_bucket/templates/csv_to_bq",
  requestBody: {
    jobName: `csv-to-bq-${filePath?.replace(/\//g, "-")}`,
    environment: {
      tempLocation: "gs://my_project_bucket/temp",
    },
    parameters: {
      input: `gs://my_project_bucket/${object.name}`,
      output: biqQueryOutput,
    },
  },
};

return dataflow.projects.locations.templates.launch(request);
This function is triggered every time any file is written to Cloud Storage. I am working with sensors, so I have to import at least 89 different data sets, i.e. different CSV files, within 15 minutes.
The whole process works fine if there are only 4 jobs working at the same time. However, when the function tried to create the fifth job, the API returned many different types of errors.
Error 1 (not exact since somehow I cannot find the error anymore):
Error Response: [400] The following quotas were exceeded: IN_USE_ADDRESSES
Error 2:
Dataflow quota error for jobs-per-project quota. Project *** is running 25 jobs.
Please check the quota usage via GCP Console.
If it exceeds the limit, please wait for a workflow to finish or contact Google Cloud Support to request an increase in quota.
If it does not, contact Google Cloud Support.
Error 3:
Quota exceeded for quota metric 'Job template requests' and limit 'Job template requests per minute per user' of service 'dataflow.googleapis.com' for consumer 'project_number:****'.
I know I can space out starting jobs to avoid Error 2 and 3. However, I don't know how to start jobs in a way that won't fill up the addresses. So, how do I avoid that? If I cannot, then what approach should I use?
I had answered this in another post here - Which Compute Engine quotas need to be updated to run Dataflow with 50 workers (IN_USE_ADDRESSES, CPUS, CPUS_ALL_REGIONS ..)?.
Let me know if that helps.
This is a GCP external IP quota issue, and the best solution is to not use any public IPs for Dataflow jobs, as long as your pipeline resources stay within Google Cloud networks.
To disable public IPs for Dataflow jobs:
Create or update your subnetwork to allow Private Google Access. This is fairly simple to do using the console: VPC > networks > subnetworks > tick "Enable Private Google Access".
In the parameters of your Cloud Dataflow job, specify --usePublicIps=false and --network=[NETWORK] or --subnetwork=[SUBNETWORK].
Note: for internal IP "in use" errors, just change your subnet CIDR range to accommodate more addresses; for example, 20.0.0.0/16 gives you roughly 65k internal IP addresses.
This way, you will never exceed your internal IP range.

Get service Name of Task under aws fargate

We need to get the service name under which a Fargate task runs so we can perform some per-service configuration (we have one service per customer, and use the service name to identify them).
Knowing the service discovery namespace for our cluster and the task IP address, we are able to find out the service by doing the following:
Get the task IP address, either by calling the http://169.254.170.2/v2/metadata endpoint or by using the ECS_ENABLE_CONTAINER_METADATA method in my follow-up answer.
With the cluster namespace we call AWS.ServiceDiscovery.listNamespaces
From there we extract the nameSpace id.
We pass that to AWS.ServiceDiscovery.listServices
We pass the id of each service to AWS.ServiceDiscovery.listInstances
We flat map the results of that and look for an instance that matches our IP address above.
Voilà! That record gives us the service name.
It works fine; it just seems like a super circuitous path! I'm just wondering whether there is some shorter way to get this information.
Here's a working C# example in two steps. It gets the taskARN from the metadata to retrieve the task description, and then reads its Group property, which contains the name of the service. It uses AWSSDK.ECS to get the task description and Newtonsoft.Json to parse the JSON.
private static string getServiceName()
{
    // secret keys, should be encoded in license configuration object
    var ecsClient = new AmazonECSClient( ACCESS_KEY, SECRET_KEY );
    var request = new DescribeTasksRequest();
    // need cluster here if not default
    request.Cluster = NAME_OF_CLUSTER;
    request.Tasks.Add( getTaskArn() );
    var asyncResponse = ecsClient.DescribeTasksAsync( request );
    // probably need this synchronously for application to proceed
    asyncResponse.Wait();
    var response = asyncResponse.Result;
    string group = response.Tasks.Single().Group;
    // group returned in the form "service:[NAME_OF_SERVICE]"
    return group.Remove( 0, 8 );
}

private static string getTaskArn()
{
    // special URL for fetching internal Amazon information for ECS instances
    string url = @"http://169.254.170.2/v2/metadata";
    string metadata = getWebRequest( url );
    // use JObject to read the JSON return
    return JObject.Parse( metadata )[ "TaskARN" ].ToString();
}

private static string getWebRequest( string url )
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create( url );
    request.AutomaticDecompression = DecompressionMethods.GZip;
    using HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    using Stream stream = response.GetResponseStream();
    using StreamReader reader = new StreamReader( stream );
    return reader.ReadToEnd();
}
You can get the service name from the startedBy property of the task. Using the boto SDK you can call describe_tasks (or its equivalent in the aws-cli: aws ecs describe-tasks), which will provide a
'startedBy': 'string'
The tag specified when a task is started. If the task is started by an Amazon ECS service, then the startedBy parameter contains the deployment ID of the service that starts it.
From:
boto3 ecs client
aws-cli ecs client
Hope it helps.
The answer above requires reading the container metadata that appears if you set the ECS_ENABLE_CONTAINER_METADATA environment variable in the task. The workflow is then:
Read the container metadata file ecs-container-metadata.json to get the taskArn
Call the aws.ecs.describe-tasks function to get the startedBy property
Call aws.servicediscovery.get-service.
Two steps instead of three, if you don't count reading the metadata file. Better, to be sure, but I'm probably not going to change the way we do it for now.