Is there a 'value in time' concept in metrics, or how do I create one? - clojure

I am using metrics-clojure. Its documentation (http://metrics-clojure.readthedocs.io/en/latest/) lists gauges, counters, meters, timers and histograms.
What I want instead is to report a number.
Very much like a counter, but with a set! operation instead of just inc!/dec!, or like a meter that accepts a value.
One use case is processing batches of events. I can create a meter to watch the batches, but I would prefer to include the batch size such that the reporting end can use the correct units (so I can plot the number of events processed instead of the number of batches).
Another use case is wanting to produce a plot of some number that changes over time. Say again I was processing events, and I wanted to plot per event how many unique combinations of events I'd seen so far, how can I do this?
I can fake this a little bit with a gauge. I can create an atom, and have the gauge report the atom value, and set the atom value in the code... but I can't control when the gauge will report the value. So the value will only be plotted at points whenever the gauge happened to be queried, but I might want to record the values at more specific points (like the end of a batch, at intervals in a batch, or on every event).
And it seems convoluted.
Any suggestions?
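The gauge-plus-atom workaround described in the question, extended with explicit sample points, could be sketched roughly as follows. This is a Python illustration of the idea, not metrics-clojure code: ValueRecorder, set, value and mark are hypothetical names, and a real setup would still need a reporter that ships the recorded samples somewhere.

```python
import time
from typing import Callable, List, Tuple


class ValueRecorder:
    """Hypothetical settable metric: holds the latest value plus explicit samples."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._value = 0.0
        self.samples: List[Tuple[float, float]] = []

    def set(self, value: float) -> None:
        # Overwrite the value, unlike a counter's inc!/dec!.
        self._value = value

    def value(self) -> float:
        # What a gauge would return whenever the reporter happens to poll it.
        return self._value

    def mark(self) -> None:
        # Record the current value at a moment the caller chooses.
        self.samples.append((self._clock(), self._value))


recorder = ValueRecorder()
seen = set()
for batch in (["a", "b"], ["b", "c", "d"]):
    for event in batch:
        seen.add(event)
    recorder.set(len(seen))  # unique events seen so far
    recorder.mark()          # one sample at the end of each batch
```

Calling mark() on every event instead of once per batch would give the per-event plot the question asks about, at the cost of more samples.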

Related

How do I query Prometheus for the timeseries that was updated last?

I have 100 instances of a service that use one database. I want them to export a Prometheus metric with the number of rows in a specific table of this database.
To avoid hitting the database with 100 queries at the same time, I periodically elect one of the instances to do the measurement and set a Prometheus gauge to the number obtained. Different instances may be elected at different times. Thus, each of the 100 instances may have its own value of the gauge, but only one of them is “current” at any given time.
What is the best way to pick only this “current” value from the 100 gauges?
My first idea was to export two gauges from each instance: the actual measurement and its timestamp. Then perhaps I could take max(timestamp) and combine it with the actual metric using the and operator. But I can't figure out how to do this in PromQL, because max erases the instance label that I would need for and on(instance).
My second idea was to reset the gauge to −1 (some sentinel value) at some time after the measurement. But this looks brittle, because if I don’t synchronize everything tightly, the “current” gauge could be reset before or after the “new” one is set, causing gaps or overlaps. Similar considerations go for explicitly deleting the metric and for exporting it with an explicit timestamp (to induce staleness).
I figured out the first idea (not tested yet):
avg(my_rows_count and on(instance) topk(1, my_rows_count_timestamp))
avg could just as well be max or min; it only serves to erase the instance label from the final result.
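To make the selection logic of that query concrete, here is a small Python model of what it is meant to compute. It only mimics the semantics on plain dictionaries keyed by instance; the function name and the sample data are made up for illustration, and it does not talk to Prometheus.

```python
from typing import Dict


def current_value(values: Dict[str, float], timestamps: Dict[str, float]) -> float:
    """Return the measurement from the instance with the newest timestamp."""
    # topk(1, my_rows_count_timestamp): keep the series with the largest timestamp.
    newest_instance = max(timestamps, key=lambda inst: timestamps[inst])
    # and on(instance): keep only the value series carrying the same instance label.
    # The outer avg() then just drops the label; with one series it changes nothing.
    return values[newest_instance]


# Hypothetical data: instance "b" did the most recent measurement.
rows = {"a": 100.0, "b": 120.0, "c": 90.0}
measured_at = {"a": 1000.0, "b": 1300.0, "c": 1200.0}
latest = current_value(rows, measured_at)
```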
last_over_time should do the trick
last_over_time(my_rows_count[1m])
given only one of them is “current” at any given time, like you said.

Google my business graph showing in red with this output REDUCE_PERCENTILE_99

I am using the Google Business API. Suddenly all API requests stopped, and the graph shows this:
REDUCE_PERCENTILE_99
Any idea what this means?
Thank you.
After research, I found that a Reducer operation describes how to aggregate data points from multiple time series into a single time series, where the value of each data point in the resulting series is a function of all the already aligned values in the input time series.
REDUCE_PERCENTILE_99: Reduce by computing the 99th percentile of data points across time series for each alignment period. This reducer is valid for GAUGE and DELTA metrics of numeric and distribution type. The value of the output is DOUBLE.
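As a rough illustration of what such a reducer does, the sketch below collapses several already aligned series into one by taking a percentile per alignment period. The nearest-rank percentile method and all names here are assumptions made for the example; the quoted description does not say which interpolation the service actually uses.

```python
import math
from typing import Dict, List


def percentile_nearest_rank(values: List[float], pct: float) -> float:
    """Nearest-rank percentile (an assumed method, chosen for simplicity)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def reduce_percentile(series: Dict[str, List[float]], pct: float) -> List[float]:
    """Combine aligned series into one series, one percentile per period."""
    # zip(*...) groups the points that share an alignment period across series.
    return [percentile_nearest_rank(list(points), pct)
            for points in zip(*series.values())]


# Three hypothetical series, each with two alignment periods.
aligned = {"s1": [1.0, 10.0], "s2": [2.0, 20.0], "s3": [3.0, 30.0]}
p99 = reduce_percentile(aligned, 99)
p50 = reduce_percentile(aligned, 50)
```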

Cumulative sum of AWS Cloudwatch Metric

AWS Cloudwatch receives a count of 1 every time I start an image download. I am downloading 1,000s of images (on a cluster of EC2 instances) and would like to track the total progress.
I can't find any documentation on how to plot the cumulative sum of a metric. The AWS Cloudwatch Math Expressions looked promising, but they do not have an integrate function.
Currently, I can plot the sum of the started image downloads but only for periods, as seen below. Ideally, I'd like to plot the integral of this plot:
You can get a cumulative sum over the current range by using the SUM() function that is operated over the original range containing only the number One (1). Remember, you're looking for a single number in the end, so it's not much of a graph, but you need to turn the single value sum back into a time-series.
Define m1 as your metric. This is the metric you will want to use SUM() on.
Define an expression e1 as m1/m1. This results in a time-series with every value equal to 1. This is what will allow you to convert that SUM back to a time-series.
Define an expression e2 as SUM(m1) / e1. This is, effectively, the cumulative sum of m1 divided by one for every data-point in the original time-series. It will be a horizontal line on the graph, which will have every point on that horizontal line being the cumulative sum of metric m1. This is required because Cloudwatch can only plot a time-series on the chart, not a single value.
Make m1 and e1 invisible. You need them, but you don't need to see them.
Finally, change the chart type from Line to Number, since you only wanted the cumulative sum anyway.
The reason you can't use SUM() directly is because it is a single value. By dividing by a time-series containing all 1's, the entire graph is the result of the SUM(). Then, changing the chart to a Number effectively hides all the math and presents only the "final result".
It looks like RUNNING_SUM() has since been added, and it does what you need:
[Image: graph with RUNNING_SUM]
You can find RUNNING_SUM() under "Add math" -> "All functions".
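The difference between the two answers above can be seen on a toy series. The Python below only models the arithmetic, with made-up per-period counts; it is not CloudWatch code.

```python
from itertools import accumulate

# Hypothetical metric m1: image downloads started in each period.
started_per_period = [3, 5, 1, 2]

# SUM(m1) / e1: the same total repeated for every data point (a flat line).
total = sum(started_per_period)
flat_line = [total for _ in started_per_period]

# RUNNING_SUM(m1): the cumulative total up to and including each period.
running = list(accumulate(started_per_period))
```

Here flat_line only tells you the final total, while running shows progress over time, which is what the question asks for.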
You are correct. All Amazon CloudWatch metrics are for a defined period.
The maximum period for a metric is one day, so this is not suitable for a cumulative counter that you wish to continue beyond one day.
You would need to find an alternate method of storing the count, such as an Amazon DynamoDB table. Use an atomic counter via UpdateItem to increment the count.
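A minimal sketch of such an atomic counter, assuming a boto3 DynamoDB Table resource is passed in. The key name job_id and the attribute name download_count are hypothetical; update_item with an ADD update expression is the part that performs the atomic increment.

```python
from typing import Any


def increment_download_count(table: Any, job_id: str, amount: int = 1) -> int:
    """Atomically add `amount` to the counter and return the new total.

    `table` is assumed to be a boto3 DynamoDB Table resource (or anything
    with a compatible update_item method).
    """
    response = table.update_item(
        Key={"job_id": job_id},                     # hypothetical key schema
        UpdateExpression="ADD download_count :inc",  # creates the attribute if missing
        ExpressionAttributeValues={":inc": amount},
        ReturnValues="UPDATED_NEW",
    )
    return int(response["Attributes"]["download_count"])
```

With real AWS credentials, the table could be obtained with something like boto3.resource("dynamodb").Table("image_downloads"), where the table name is again only an example.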
You can also use a very long period.
Change your stat to SUM, and set your metric's period to 7 days. You'll get a time series of 1 point with the cumulative sum of all the downloads.
If you give each download a unique dimension value, you can keep your queries separate.

Mailgun: algorithm for event polling

We are implementing support for tracking of Mailgun events in our application. We reviewed the proposed event polling algorithm but find ourselves not quite comfortable with it. First, we would prefer not to discard the data that we have already fetched and then retry from scratch after a pause. It is not very efficient and leaves a door open for a long loop of retries, as it is not clear when the loop is supposed to end. Second, the "threshold age" seems to be the key to determine "trustworthiness", but its value is not defined, only a very large "half an hour" is suggested.
It is our understanding that the events become "trustworthy" after some threshold delay, let us call it D_max, when the events are guaranteed to reside in the event storage. If so, we can implement this algorithm in a different way, so that we do not fetch the data that we know are not "trustworthy" and make use of all data which have been fetched.
We would be fetching data periodically, and on each iteration we would:
Make a request to the events API specifying an ascending time range from T_1 to T_2 = now() - D_max. For the first iteration, T_1 can be set to some time in the past, "e.g., half an hour ago". For the subsequent iterations, T_1 is set to the value of T_2 from the previous iteration.
Fetch all pages one by one while the next page URL is returned.
Use all fetched events, as they are all "trustworthy".
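One iteration of the loop described above might look like the following Python sketch. The fetch_page callable is a hypothetical stand-in for the actual events API request (it takes the range start, range end and an optional page URL, and returns the events plus the next page URL or None); no real Mailgun client calls are shown.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Tuple[List[Dict], Optional[str]]  # (events, next page URL or None)


def poll_once(
    fetch_page: Callable[[float, float, Optional[str]], Page],
    t1: float,
    now: float,
    d_max: float,
) -> Tuple[List[Dict], float]:
    """Fetch all events in [T_1, T_2], where T_2 = now - D_max.

    Returns the events and the T_1 to use on the next iteration.
    """
    t2 = now - d_max
    if t2 <= t1:
        return [], t1  # nothing old enough to be trustworthy yet
    events: List[Dict] = []
    next_url: Optional[str] = None
    while True:
        page_events, next_url = fetch_page(t1, t2, next_url)
        events.extend(page_events)
        if not next_url:  # stop once no next page URL is returned
            break
    return events, t2
```

The caller would persist the returned T_1 between runs, so that a restart continues from the last completed window rather than from "half an hour ago".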
My questions are:
Q1: Are there any problems with this approach?
Q2: What is the minimum realistic value of D_max? Obviously, we can use "half an hour" for it, but we would like to be more agile in tracking events, so it would be great to know what is the minimum value we can set it to and still reliably fetch all events.
Thanks!
1: I see no problems with this solution (in fact I'm doing something very similar). I'm also storing the IDs of the events to make sure I'm not inserting duplicate entries.
2: I've been working through a similar process. Right now I am testing with D_max at 10 minutes.
Additionally, while testing, I'm running an extra nightly task that goes back over the entire day to validate a few things:
Am I missing existing metrics?
Diagnose if there is a problem with the assumptions I've made about D_max.

RRD graphs in Zenoss showing NaN on large time ranges

I am trying to create a COMMAND datasource with the JSON parser to monitor some values, for example from a script like this:
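import json
from random import random

# Emit one datapoint named 'random' for the device itself (empty component key).
print(json.dumps({
    'values': {
        '': {'random': random()},
    },
    'events': []
}))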
When I start zencommand, the appropriate RRD file is created, but the cur, avg and max values on the graph show NaN. Those NaNs are replaced by actual numbers when I zoom in to the current point in time, which is not very far from the start of monitoring.
Why doesn't it show correct min, max and avg values before I zoom in? Is that somehow related to consolidation? I read http://www.vandenbogaerdt.nl/rrdtool/min-avg-max.php, but that page doesn't say anything about NaN values.
And is there any way to zoom in to the current timestamp more quickly, so I can see some data faster?
When you are zoomed out, you'll be looking at the lower-granularity RRAs (Round Robin Archives). These do not get populated until enough data are in the higher-granularity ones; so, for example, if you have a 5min-granularity RRA, a 1hr-granularity RRA, and a 1day-granularity RRA, and have collected data for the last 45min, then you will see ~8 data points in your 'daily' graph (which uses the 5min RRA), but nothing in your 'monthly' (which will use the 1hr RRA) or your 'yearly' (which uses the 1day RRA).
This applies to any RRA; AVG, LAST, MAX, etc. Until the consolidated time window is complete, and the full complement of Primary Data Points has been collected for consolidation, the consolidated data point value is undefined.
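The effect can be modelled in a few lines of Python. This is a simplified sketch of the behaviour as described in this answer (an incomplete consolidation window yields an unknown value), not RRDtool's actual implementation; the function name and the sample values are made up.

```python
import math
from typing import List


def consolidate_avg(pdps: List[float], per_row: int, total_rows: int) -> List[float]:
    """Average `per_row` primary data points into each consolidated row.

    A row whose window is not yet complete is reported as NaN.
    """
    rows = []
    for i in range(total_rows):
        window = pdps[i * per_row:(i + 1) * per_row]
        if len(window) < per_row:
            rows.append(math.nan)  # window incomplete: value is undefined
        else:
            rows.append(sum(window) / per_row)
    return rows


# 45 minutes of 5-minute points, consolidated into a 1-hour row (12 points).
five_minute_points = [1.0] * 9
hourly = consolidate_avg(five_minute_points, per_row=12, total_rows=1)
```

Once a full hour of points exists, the same call returns a real number, which matches the NaNs disappearing on the zoomed-out graph after enough data has been collected.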
RRDtool picks the RRA to use based on the requested graph time range and pixel width, as well as the requested consolidation functions. Although there are ways to force RRDtool to use a higher-granularity RRA than it needs to, and to consolidate on the fly, this is inefficient and slow. It also makes having the lower-granularity RRA pointless and throws away one of the major benefits of RRDtool (that it performs consolidation at update time, making graphing faster).