Google PubSub - Counting messages in topic - google-cloud-platform

I've looked over the documentation for Google's PubSub, and also tried looking in Google Cloud Monitoring, but couldn't find any way to figure out the queue size in my topics.
Since I plan on using PubSub for analytics, it's important for me to monitor the queue count, so I could scale up/down the subscriber count.
What am I missing?

The metric you want to look at is "undelivered messages." You should be able to set up alerts or charts that monitor this metric in Google Cloud Monitoring under the "Pub/Sub Subscription" resource type. The number of messages that have not yet been acknowledged by subscribers, i.e., queue size, is a per-subscription metric as opposed to a per-topic metric. For info on the metric, see pubsub.googleapis.com/subscription/num_undelivered_messages in the GCP Metrics List (and others for all of the Pub/Sub metrics available).
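If you want to go beyond charts, here is a minimal sketch of creating an alert policy on that metric with the google-cloud-monitoring Python client (v2.x). The project id, subscription id, threshold and display names are placeholders, and in practice you would also attach notification channels:

from google.cloud import monitoring_v3
from google.protobuf import duration_pb2

project_id = "my-project"      # placeholder
subscription_id = "my-sub"     # placeholder

client = monitoring_v3.AlertPolicyServiceClient()

# Alert when the subscription backlog stays above 1000 messages for 5 minutes.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="Pub/Sub backlog above threshold",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'metric.type="pubsub.googleapis.com/subscription/num_undelivered_messages" '
            f'AND resource.labels.subscription_id="{subscription_id}"'
        ),
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=1000,
        duration=duration_pb2.Duration(seconds=300),
    ),
)

policy = monitoring_v3.AlertPolicy(
    display_name="Pub/Sub backlog too large",
    combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
    conditions=[condition],
)

created = client.create_alert_policy(
    name=f"projects/{project_id}", alert_policy=policy
)
print(created.name)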

This might help if you're looking into a programmatic way to achieve this:
from google.cloud import monitoring_v3
from google.cloud.monitoring_v3 import query
project = "my-project"
client = monitoring_v3.MetricServiceClient()
result = query.Query(
    client,
    project,
    'pubsub.googleapis.com/subscription/num_undelivered_messages',
    minutes=60).as_dataframe()
print(result['pubsub_subscription'][project]['subscription_name'][0])

The answer to your question is "no": there is no built-in Pub/Sub feature that shows these counts. The way you have to do it is via metric monitoring in Stackdriver (it took me some time to find that out too).
In practice, do the following, step by step:
Navigate from the GCloud Admin Console to: Monitoring
This opens a new window with the separate Stackdriver console
Navigate in Stackdriver: Dashboards > Create Dashboard
Click the Add Chart button at the top right of the dashboard screen
In the input box, type num_undelivered_messages and then SAVE

Updated version based on @steeve's answer (without the pandas dependency).
Please note that you have to specify end_time instead of using default utcnow().
import datetime
from google.cloud import monitoring_v3
from google.cloud.monitoring_v3 import query
project = 'my-project'
sub_name = 'my-sub'
client = monitoring_v3.MetricServiceClient()
result = query.Query(
    client,
    project,
    'pubsub.googleapis.com/subscription/num_undelivered_messages',
    end_time=datetime.datetime.now(),
    minutes=1,
).select_resources(subscription_id=sub_name)

for content in result:
    print(content.points[0].value.int64_value)

Here is a Java version:
package com.example.monitoring;

import static com.google.cloud.monitoring.v3.MetricServiceClient.create;
import static com.google.monitoring.v3.ListTimeSeriesRequest.newBuilder;
import static com.google.monitoring.v3.ProjectName.of;
import static com.google.protobuf.util.Timestamps.fromMillis;
import static java.lang.System.currentTimeMillis;

import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.TimeInterval;

public class ReadMessagesFromGcp {

    public static void main(String... args) throws Exception {
        String projectId = "put here";

        // Look at the last two minutes of data points.
        var interval = TimeInterval.newBuilder()
                .setStartTime(fromMillis(currentTimeMillis() - (120 * 1000)))
                .setEndTime(fromMillis(currentTimeMillis()))
                .build();

        var request = newBuilder().setName(of(projectId).toString())
                .setFilter("metric.type=\"pubsub.googleapis.com/subscription/num_undelivered_messages\"")
                .setInterval(interval)
                .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
                .build();

        var response = create().listTimeSeries(request);

        for (var subscriptionData : response.iterateAll()) {
            var subscription = subscriptionData.getResource().getLabelsMap().get("subscription_id");
            var numberOfMessages = subscriptionData.getPointsList().get(0).getValue().getInt64Value();
            if (numberOfMessages > 0) {
                System.out.println(subscription + " has " + numberOfMessages + " messages");
            }
        }
    }
}
Maven dependencies:
<dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-monitoring</artifactId>
    <version>3.3.2</version>
</dependency>
<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java-util</artifactId>
    <version>4.0.0-rc-2</version>
</dependency>
output
queue-1 has 36 messages
queue-2 has 4 messages
queue-3 has 3 messages

There is a way to count all messages published to a topic using custom metrics.
In my case I am publishing messages to a Pub/Sub topic via a Cloud Composer (Airflow) DAG that runs a Python script.
The Python script logs information about the DAG run.
logging.info(
    f"Total events in file {counter - 1}, total successfully published {counter - error_counter - 1}, total errors publishing {error_counter}. Events sent to topic: {TOPIC_PATH} from filename: {source_blob_name}.",
    {
        "metric": "<some_name>",
        "type": "completed_file",
        "topic": EVENT_TOPIC,
        "filename": source_blob_name,
        "total_events_in_file": counter - 1,
        "failed_published_messages": error_counter,
        "successful_published_messages": counter - error_counter - 1,
    },
)
I then have a Distribution custom metric which filters on resource_type, resource_label, jsonPayload.metric and jsonPayload.type. The metric also has the Field Name set to jsonPayload.successful_published_messages.
Custom metric filter:
resource.type=cloud_composer_environment AND resource.labels.environment_name={env_name} AND jsonPayload.metric=<some_name> AND jsonPayload.type=completed_file
That custom metric is then used in a Dashboard with the MQL setting of
fetch cloud_composer_environment
| metric 'logging.googleapis.com/user/my_custom_metric'
| group_by 1d, [value_pubsub_aggregate: aggregate(value.pubsub)]
| every 1d
| group_by [], [value_pubsub_aggregate_sum: sum(value_pubsub_aggregate)]
To get there, I first set up a chart with resource type: Cloud Composer environment, metric: my custom metric, no preprocessing step, alignment function: SUM, period: 1 day, and Group by function: mean.
Ideally you would just select sum for the Group by function, but it errors, which is why you then need to switch to MQL and manually enter sum instead of mean.
This will now count your published messages for up to 24 months which is the retention period set by Google for the custom metrics.
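If you also want to read that custom metric programmatically rather than only in a dashboard, a sketch along the lines of the earlier answers might look like this (the project id and metric name are placeholders; for a distribution metric each point value is a Distribution rather than a plain integer):

import time
from google.cloud import monitoring_v3

project_id = "my-project"                                      # placeholder
metric_type = "logging.googleapis.com/user/my_custom_metric"   # placeholder

client = monitoring_v3.MetricServiceClient()
now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"start_time": {"seconds": now - 24 * 3600}, "end_time": {"seconds": now}}
)

results = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": f'metric.type="{metric_type}"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for ts in results:
    # For a distribution metric the value is a Distribution object.
    print(dict(ts.resource.labels), ts.points[0].value)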

Related

GCP terraform - alerts module based on log metrics

As per the subject, I have set up log-based metrics for a platform in GCP, i.e. firewall, audit, route etc. monitoring.
Now I need to set up alert policies tied to these log-based metrics, which is easy enough to do manually in GCP.
However, I need to do it via Terraform, using this resource:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy#nested_alert_strategy
I might be missing something very simple, but I am finding it hard to understand, as the alert strategy is apparently required and yet does not seem to be supported?
I am also a bit confused about which kind of condition I should be using to match my already set up log-based metric.
This is my module so far. PS: I have tried using the same filter as I did for setting up the log-based metric, as well as the name of the log-based metric:
resource "google_monitoring_alert_policy" "alert_policy" {
display_name = var.display_name
combiner = "OR"
conditions {
display_name = var.display_name
condition_matched_log {
filter = var.filter
#duration = "600s"
#comparison = "COMPARISON_GT"
#threshold_value = 1
}
}
user_labels = {
foo = "bar"
}
}
var.filter is:
resource.type="gce_route" AND (protoPayload.methodName:"compute.routes.delete" OR protoPayload.methodName:"compute.routes.insert")
Got this resolved in the end.
It turns out this is a common issue:
https://issuetracker.google.com/issues/143436657?pli=1
I had to add AND resource.type="global" to the filter parameter in my Terraform module, after the metric name.
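For reference, the filter referencing the log-based metric might then look something like this (the metric name is a placeholder for your own log-based metric):
metric.type="logging.googleapis.com/user/my_log_based_metric" AND resource.type="global"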

How to get the Airflow user who manually triggered a DAG?

In the Airflow UI, one of the log events available under "Browse > Logs" is the event "Trigger", along with the DAG ID and the owner/user who is responsible for triggering this event. Is this information easily obtainable programmatically?
The use case is, I have a DAG that allows a subset of users to manually trigger the execution. Depending on the user who triggers the execution of this DAG, the behavior of code execution from this DAG will be different.
Thank you in advance.
You can directly fetch it from the Log table in the Airflow Metadata Database as follows:
from airflow.models.log import Log
from airflow.utils.db import create_session

with create_session() as session:
    results = session.query(Log.dttm, Log.dag_id, Log.execution_date, Log.owner, Log.extra) \
        .filter(Log.dag_id == 'example_trigger_target_dag', Log.event == 'trigger').all()

# Inspect one of the records
results[2]
Output:
(datetime.datetime(2020, 3, 30, 23, 16, 52, 487095, tzinfo=<TimezoneInfo [UTC, GMT, +00:00:00, STD]>),
'example_trigger_target_dag',
None,
'admin',
'[(\'dag_id\', \'example_trigger_target_dag\'), (\'origin\', \'/tree?dag_id=example_trigger_target_dag\'), (\'csrf_token\', \'IjhmYzQ4MGU2NGFjMzg2ZWI3ZjgyMTA1MWM3N2RhYmZiOThkOTFhMTYi.XoJ92A.5q35ClFnQjKRiWwata8dNlVs-98\'), (\'conf\', \'{"message": "kaxil"}\')]')
I will correct the previous answer a little:
with create_session() as session:
    results = session.query(Log.dttm, Log.dag_id, Log.execution_date,
                            Log.owner, Log.extra) \
        .filter(Log.dag_id == 'dag_id', Log.event == 'trigger') \
        .order_by(Log.dttm.desc()).all()
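Building on that, here is a minimal sketch of how you might use it inside the DAG to vary behaviour by triggering user. The function and task names are hypothetical, and this assumes the code runs where it can reach the Airflow metadata database:

from airflow.models.log import Log
from airflow.utils.db import create_session

def get_triggering_user(dag_id):
    """Return the owner of the most recent 'trigger' event for this DAG, or None."""
    with create_session() as session:
        row = (
            session.query(Log.owner)
            .filter(Log.dag_id == dag_id, Log.event == 'trigger')
            .order_by(Log.dttm.desc())
            .first()
        )
    return row[0] if row else None

def decide_path(**context):
    # Hypothetical callable for a BranchPythonOperator: pick a task based on the user.
    user = get_triggering_user(context['dag'].dag_id)
    return 'privileged_path' if user == 'admin' else 'default_path'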

How to read and parse data from PubSub topic into a beam pipeline and print it

I have a program which creates a topic in Pub/Sub and also publishes messages to the topic. I also have an automated Dataflow job (using a template) which saves these messages into my BigQuery table. Now I intend to replace the template-based job with a Python pipeline where my requirement is to read data from Pub/Sub, apply transformations, and save the data into BigQuery / publish it to another Pub/Sub topic. I started writing the script in Python and did a lot of trial and error to achieve it, but to my dismay I could not achieve it. The code looks like this:
import apache_beam as beam
from apache_beam.io import WriteToText

TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"

def run():
    o = beam.options.pipeline_options.PipelineOptions()
    p = beam.Pipeline(options=o)
    print("I reached here")
    # Read from PubSub into a PCollection.
    data = (
        p
        | "Read From Pub/Sub" >> beam.io.ReadFromPubSub(topic=TOPIC_PATH)
    )
    data | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
    print("Lines: ", data)

run()
I would really appreciate some help with this.
Note: I have my project set up on Google Cloud and I am running my script locally.
Here is the working code.
import apache_beam as beam

TOPIC_PATH = "projects/test-pipeline-253103/topics/test-pipeline-topic"
OUTPUT_PATH = "projects/test-pipeline-253103/topics/topic-repub"

class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]

def run():
    o = beam.options.pipeline_options.PipelineOptions()
    # Can be replaced by the --streaming execution param
    standard_options = o.view_as(beam.options.pipeline_options.StandardOptions)
    standard_options.streaming = True
    p = beam.Pipeline(options=o)

    print("I reached here")
    # Read from PubSub into a PCollection, print each element and republish it.
    data = p | beam.io.ReadFromPubSub(topic=TOPIC_PATH) | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)

    # Don't forget to run the pipeline!
    result = p.run()
    result.wait_until_finish()

run()
In summary:
You forgot to run the pipeline. Beam is a graph programming model, so in your previous code you built your graph but never ran it. Here, at the end, we run it (a non-blocking call) and wait for the end (a blocking call).
When you start your pipeline, Beam mentions that Pub/Sub works only in streaming mode. So you can start your code with the --streaming param, or do it programmatically as shown in my code.
Be careful: streaming mode means listening indefinitely on Pub/Sub. If you run this on Dataflow, your pipeline will stay up until you stop it, which can be expensive if you have few messages. Make sure this is the model you want.
An alternative is to run your pipeline for a limited period of time (use a scheduler to start it, and another one to stop it). But in that case you have to let messages accumulate. Here you use a topic as the entry of the pipeline. This option forces Beam to create a temporary subscription and listen for messages on that subscription, which means that messages published before this subscription was created won't be received or processed.
The idea is to create a subscription yourself; that way the messages will accumulate in it (up to 7 days by default). Then use the subscription name as the entry of your pipeline: beam.io.ReadFromPubSub(subscription=SUB_PATH). The messages will be drained and processed by Beam (order not guaranteed!).
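For example, with a pre-created subscription the read step might look like this (SUB_PATH is a hypothetical subscription in the same project; PrintValue and OUTPUT_PATH are reused from the code above):

SUB_PATH = "projects/test-pipeline-253103/subscriptions/test-pipeline-sub"  # hypothetical

data = (
    p
    | "Read From Pub/Sub" >> beam.io.ReadFromPubSub(subscription=SUB_PATH)
    | "Print" >> beam.ParDo(PrintValue())
    | "Republish" >> beam.io.WriteToPubSub(topic=OUTPUT_PATH)
)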
Based on the Beam programming guide, you simply have to add a transform step to your pipeline. Here is an example of a transform:
class PrintValue(beam.DoFn):
    def process(self, element):
        print(element)
        return [element]
Add it to your pipeline
data | beam.ParDo(PrintValue()) | beam.io.WriteToPubSub(topic=OUTPUT_PATH)
You can add as many transforms as you want. You can test values and put elements into tagged PCollections (to have multiple outputs) for a fan-out (see the sketch below), or use side inputs for a fan-in.
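For instance, here is a sketch of tagged outputs for a fan-out; the tag names and the size threshold are made up for illustration, and data, PrintValue and OUTPUT_PATH are reused from the code above:

from apache_beam import pvalue

class SplitBySize(beam.DoFn):
    def process(self, element):
        # Route small and large messages to different tagged outputs.
        if len(element) < 100:
            yield pvalue.TaggedOutput("small", element)
        else:
            yield pvalue.TaggedOutput("large", element)

split = data | beam.ParDo(SplitBySize()).with_outputs("small", "large")
split.small | "Republish small" >> beam.io.WriteToPubSub(topic=OUTPUT_PATH)
split.large | "Log large" >> beam.ParDo(PrintValue())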

Trying to disable all the Cloud Watch alarms in one shot

My organization is planning a maintenance window for the next 5 hours. During that time, I do not want CloudWatch to trigger alarms and send notifications.
Earlier, when I had to disable 4 alarms, I wrote the following code in AWS Lambda. This worked fine.
import boto3
import collections

client = boto3.client('cloudwatch')

def lambda_handler(event, context):
    response = client.disable_alarm_actions(
        AlarmNames=[
            'CRITICAL - StatusCheckFailed for Instance 456',
            'CRITICAL - StatusCheckFailed for Instance 345',
            'CRITICAL - StatusCheckFailed for Instance 234',
            'CRITICAL - StatusCheckFailed for Instance 123'
        ]
    )
But now I have been asked to disable all the alarms, which are 361 in number, so listing all those names by hand would take a lot of time.
Please let me know what I should do.
Use describe_alarms() to obtain a list of them, then iterate through and disable them:
import boto3

client = boto3.client('cloudwatch')

response = client.describe_alarms()
names = [alarm['AlarmName'] for alarm in response['MetricAlarms']]
disable_response = client.disable_alarm_actions(AlarmNames=names)
You might want some logic around the Alarm Name to only disable particular alarms.
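With 361 alarms you may also run into API limits: describe_alarms() returns results in pages, and disable_alarm_actions() only accepts a limited number of names per call (100, as far as I know). A sketch that handles both, using the chunk size of 100 as a conservative assumption:

import boto3

client = boto3.client('cloudwatch')

# Collect every alarm name via the paginator.
names = []
paginator = client.get_paginator('describe_alarms')
for page in paginator.paginate():
    names.extend(alarm['AlarmName'] for alarm in page['MetricAlarms'])

# Disable the alarm actions in chunks to stay under the per-call limit.
for i in range(0, len(names), 100):
    client.disable_alarm_actions(AlarmNames=names[i:i + 100])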
If you do not have the specific alarm ARNs, then you can use the logic in the previous answer. If you have a specific list of ARNs that you want to disable, you can fetch the names using this:
def get_alarm_names(alarm_arns):
    names = []
    response = client.describe_alarms()
    for i in response['MetricAlarms']:
        if i['AlarmArn'] in alarm_arns:
            names.append(i['AlarmName'])
    return names
Here's a full tutorial: https://medium.com/geekculture/terraform-structure-for-enabling-disabling-alarms-in-batches-5c4f165a8db7

get all metrics which have alarms boto

I am new to boto and am trying to get all the metrics that have alarms. Can someone please guide me on how to do that? Here is what I am trying to do. I can get all the metrics in the following way:
import boto.ec2.cloudwatch

conn = boto.ec2.cloudwatch.connect_to_region('ap-southeast-1')
metrics = conn.list_metrics()
for metric in metrics:
    print metric.name, metric.namespace
I know that there is a function "describe_alarms_for_metric" that returns the alarms for a metric. However, it is not working for me and gives me an empty list. Here is what I am trying:
for metric in metrics:
    print conn.describe_alarms_for_metric(metric.name, metric.namespace)
I can also see the list of all alarms using "describe_alarms", but I don't know which alarm is for which metric.
alarms = conn.describe_alarms()
for alarm in alarms:
    print alarm
describe_alarms() returns a list of boto.ec2.cloudwatch.alarm objects, which can be inspected to find out the metric and other details about the alarm.
alarms = conn.describe_alarms()
for alarm in alarms:
    print alarm.name
    print alarm.metric
    print alarm.namespace
For Boto3 apparently describe_alarms_for_metric() doesn't work unless you also supply a dimension - see the documentation:
Dimensions (list) -- The dimensions associated with the metric. If the metric has any associated dimensions, you must specify them in order for the call to succeed.
(dict) -- Expands the identity of a metric.
Name (string) -- [REQUIRED] The name of the dimension.
Value (string) -- [REQUIRED] The value representing the dimension measurement.
With that requirement I'm not sure what the point of this API is. An alternative is to use describe_alarms() through the paginator then specify a filter.
You can use the example here as a base:
import boto3

# Create CloudWatch client
cloudwatch = boto3.client('cloudwatch')

# List alarms of insufficient data through the pagination interface
paginator = cloudwatch.get_paginator('describe_alarms')
for response in paginator.paginate(StateValue='INSUFFICIENT_DATA'):
    print(response['MetricAlarms'])
Then modify it to add a filter:
paginator = cloudwatch.get_paginator('describe_alarms')
page_iterator = paginator.paginate()
filtered_iterator = page_iterator.search("MetricAlarms[?MetricName==`CPUUtilization` && Namespace==`AWS/EC2`]")
for alarm in filtered_iterator:
    print(alarm)
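For completeness, if you do know a metric's dimensions, describe_alarms_for_metric() itself works in Boto3; a sketch (the instance id is made up for illustration):

import boto3

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.describe_alarms_for_metric(
    MetricName='CPUUtilization',
    Namespace='AWS/EC2',
    Dimensions=[{'Name': 'InstanceId', 'Value': 'i-0123456789abcdef0'}],
)
for alarm in response['MetricAlarms']:
    print(alarm['AlarmName'])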
More information is available in the Boto3 CloudWatch API documentation.