Getting average value for a period with RRDTool

I am using RRDTool to fetch data from RRD databases, but I am having trouble getting the average/max value for a period (e.g. 12 hours). I want only one number representing the average/max of the period, as GPRINT produces in the graph function.

The trick is to use the graph function, but with PRINT instead of GPRINT. You can even leave out everything that actually draws on the graph and use the graph function just to calculate the numbers.
hth
tobi
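As a rough illustration of the PRINT approach, here is a minimal Python sketch that builds an `rrdtool graph` command containing only DEF, VDEF and PRINT directives, so nothing is drawn. The file name `temp.rrd`, the data source name `temp`, the variable names `v`, `v_avg`, `v_max`, and the helper names are assumptions made for the example. The image is sent to /dev/null, and the parsing assumes the PRINT results are the last lines of the command's output.

```python
import subprocess


def build_print_args(rrd_path, ds_name, period="12h"):
    """Build an `rrdtool graph` argument list that only prints numbers."""
    return [
        "rrdtool", "graph", "/dev/null",        # the image itself is discarded
        "--start", f"-{period}",
        f"DEF:v={rrd_path}:{ds_name}:AVERAGE",  # read the AVERAGE RRA
        "VDEF:v_avg=v,AVERAGE",                 # one number: average over the period
        "VDEF:v_max=v,MAXIMUM",                 # one number: maximum over the period
        "PRINT:v_avg:%.2lf",
        "PRINT:v_max:%.2lf",
    ]


def fetch_summary(rrd_path, ds_name, period="12h"):
    """Run rrdtool and return (average, maximum) parsed from the PRINT output."""
    result = subprocess.run(
        build_print_args(rrd_path, ds_name, period),
        capture_output=True, text=True, check=True,
    )
    # Assumption: the two PRINT values are the last two tokens of stdout.
    avg, peak = (float(x) for x in result.stdout.split()[-2:])
    return avg, peak
```

With an existing RRD file, `fetch_summary("temp.rrd", "temp")` would return the two numbers. Note that the maximum here is the maximum of the AVERAGE data; a DEF on a MAX RRA would be needed for the true peak.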

Related

RRDTool data values (e.g. max value) are different in different time resolutions

I'm currently experimenting a bit with RRDTool. I'm aware that accuracy drops as the selected time period gets longer, but I thought I could work around this with my data source settings.
For example, temperature and humidity from my house, resolution 1h:
And now with a resolution of 1d:
As you can see, there is a great difference in the max. value of the blue line.
I created my data sources and archives with these values:
"rrdtool create temp.rrd --step 30",
"DS:temp:GAUGE:60:U:U",
"DS:humidity:GAUGE:60:U:U",
"RRA:AVERAGE:0.5:1:1051200",
"RRA:MAX:0.5:1:1051200",
"RRA:MIN:0.5:1:1051200",
I thought that 1051200 rows (1 year = 31536000 s / 30 s resolution = 1051200) would be enough to save every value for a year, so there should be no need for interpolation.
Is it possible to get the exact values displayed even if the resolution changes (for example the max humidity (Luftfeuchtigkeit) at 99.9%)?
Here are my values for image creation:
"--start" => "-1h", (-1d etc-)
"--title" => "Haustemperatur",
"--vertical-label" => "°C / % RLF",
"--width" => 800,
"--height" => 600,
"--lower-limit" => "-5",
"DEF:temperatur=$rrdFile:temperatur:LAST",
"DEF:humidity=$rrdFile:humidity:LAST",
"LINE1:temperatur#33CC33:Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperatur:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humidity:MAX:Maximum\: %4.2lf %%\j",
Thanks for your help and any suggestions.
P.S. I'm using a library to generate the graphs and the database, please do not be surprised about possible syntax errors.

Your problem is that you are causing the values to be rolled-up on the fly at graph time, but have not correctly specified which rollup function to use. Your second graph is showing the MAXIMUM of the LAST in the interval, not the true Maximum.
There are a few issues to explain with this configuration:
Firstly, your RRD is defined using 3 RRAs with 1cdp=1pdp and different consolidation functions (AVG, MIN, MAX). This means they are functionally identical, but they do not save you any time at graphing as they have not done any pre-rollup for you! You should definitely consider having just one of these (probably AVG) and adding others at lower resolution to help speed up graphing when you have a bigger time window.
Secondly, you need to specify the on-the-fly rollup function. When graphing, RRDTool will work out the best RRA to use based on your DEF lines, and will perform any additional consolidation required on the fly. This can take a long time if the only available RRA is too high-granularity.
Your graph request uses DEF:temperatur=$rrdFile:temperatur:LAST but you do not actually have a LAST type RRA, so RRDTool will grab the last average. Your RRA data points are at 30s interval, but your second graph has (approx) 5min per pixel, meaning that RRDTool needs to grab the 10 entries from the RRA, and print the last. Looking at the data in the top graph, it seems that the last in that interval was the 66 value, though previous ones were 100.
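To make the effect concrete, here is a small Python sketch of on-the-fly consolidation. It is not RRDTool's implementation, only an illustration of what happens when ten 30-second samples are reduced to one 5-minute pixel with different functions. The sample values (nine readings of 100 followed by one of 66) are made up to mirror the situation described above.

```python
def consolidate(samples, per_bucket, cf):
    """Reduce consecutive groups of `per_bucket` samples to one value each."""
    funcs = {
        "LAST": lambda xs: xs[-1],
        "MAX": max,
        "MIN": min,
        "AVERAGE": lambda xs: sum(xs) / len(xs),
    }
    reduce_fn = funcs[cf]
    return [
        reduce_fn(samples[i:i + per_bucket])
        for i in range(0, len(samples), per_bucket)
    ]


# Ten 30 s humidity samples that fall into one 5-minute pixel (made-up values).
samples = [100.0] * 9 + [66.0]

print(consolidate(samples, 10, "LAST"))     # the spike to 100 is lost
print(consolidate(samples, 10, "MAX"))      # the spike is preserved
print(consolidate(samples, 10, "AVERAGE"))
```

With LAST, the pixel shows 66 even though the humidity was at 100 for most of the interval, which is why a MAX computed afterwards comes out too low.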
So you have a choice. Do you want the graph to show the average for the time period, the maximum, or both? Do you want the figures at the bottom to show the maximum of the average, or the maximum of everything?
For example
"DEF:temperatur=$rrdFile:temperatur:AVERAGE",
"DEF:humidity=$rrdFile:humidity:AVERAGE",
"DEF:temperaturmax=$rrdFile:temperatur:MAX:reduce=MAX",
"DEF:humiditymax=$rrdFile:humidity:MAX:reduce=MAX",
"LINE1:temperatur#33CC33:Temperatur",
"LINE1:temperaturmax#66EE66:Maximum Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperaturmax:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"LINE1:humiditymax#3333FF:Maximum Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humiditymax:MAX:Maximum\: %4.2lf %%\j",
In this case, we define a separate DEF for the maximum data set, so that we can always obtain the highest value even after consolidation. This is also used in the GPRINT so that we get the MAX of the MAX rather than the MAX of the AVERAGE. The Maximum line is now drawn separately to the average line, so that we can see the effect of any rollup of data - the lines will be together at high-resolution but get further apart as the time window widens and resolution decreases.
The DEF is set (with reduce=MAX) to force any rollup function used for the maxima to be MAX rather than AVG, so we can be sure to get the maximum rather than the average of the maxima.
We are also using AVERAGE rather than LAST in order to get more meaningful data after rollup. We could also add a separate DEF for LAST if we wanted to, though it is less useful.
Note that, if you ever expect to be generating graphs over more than a few days, you should definitely consider adding some lower-resolution RRAs for AVERAGE and MAX or else the graphs will generate very slowly. RRDTool is designed with the intention that data will be rolled up over time, rather than (as in a traditional database) every sample kept as-is. So, unless you really need to have 30s resolution data kept for an entire year, you may prefer to keep this high resolution data for only a week, and then have separate RRAs that roll up to 1 hour resolution and keep for longer. Many people keep the 30s for 2 days, then 30min-summary for 2 weeks, 2h-summary for 2 months, and then 1day-summary for 2 years.
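The arithmetic behind such a retention schedule can be sketched in Python as follows. The helper name `rra_spec` is made up for the example, and "2 months" and "2 years" are assumed to mean 60 and 730 days. Each RRA needs the number of primary data points per row (resolution divided by the 30 s step) and the number of rows (retention divided by resolution).

```python
DAY = 86400   # seconds per day
STEP = 30     # base step of the RRD in seconds, as in the question


def rra_spec(cf, resolution, retention, step=STEP, xff=0.5):
    """Return an RRA definition string for a given resolution and retention."""
    steps = resolution // step      # primary data points consolidated per row
    rows = retention // resolution  # rows needed to cover the retention period
    return f"RRA:{cf}:{xff}:{steps}:{rows}"


# (resolution in seconds, retention in seconds); month/year lengths are assumptions.
schedule = [
    (30, 2 * DAY),          # 30 s for 2 days
    (30 * 60, 14 * DAY),    # 30 min for 2 weeks
    (2 * 3600, 60 * DAY),   # 2 h for about 2 months
    (DAY, 730 * DAY),       # 1 day for about 2 years
]

for cf in ("AVERAGE", "MAX"):
    for resolution, retention in schedule:
        print(rra_spec(cf, resolution, retention))
```

For comparison, keeping 30 s data for a whole year needs 365 * 86400 / 30 = 1051200 rows per RRA, whereas the four archives above need 5760 + 672 + 720 + 730 rows in total.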
For more information, see the RRDTool manual pages.

How to use the output of an array formula in subsequent calculations?

I have a google sheet where I'm getting the duration of a Youtube video as follows:
=REGEXEXTRACT(IMPORTXML(A2,"//*[@itemprop='duration']/@content"),"PT(\d+)M(\d+)S")
This gives me two cells with two values (minutes and seconds). However, I want to perform further calculations on them (multiply the minutes by 60 and add the seconds). How can I 'access' these values within a function, if at all?

You want to retrieve the duration in seconds.
You want to achieve this using the built-in formulas of Google Sheets.
If my understanding is correct, how about these sample formulas?
Sample formula:
=VALUE(REGEXREPLACE(IMPORTXML(A2,"//*[@itemprop='duration']/@content"),"PT(\d+)M(\d+)S","00:$1:$2")*24*3600)
In this sample formula, the cell "A2" has a URL like https://www.youtube.com/watch?v=###.
The retrieved duration is converted to a time format, and the value is then converted to seconds.
For example, when IMPORTXML(A2,"//*[@itemprop='duration']/@content") returns PT1M10S, VALUE(REGEXREPLACE("PT1M10S","PT(\d+)M(\d+)S","00:$1:$2")*24*3600) returns 70.
Even when the video is longer than 1 hour, the duration is still returned as minutes and seconds, for example PT123M45S. In that case, =VALUE(REGEXREPLACE("PT123M45S","PT(\d+)M(\d+)S","00:$1:$2")*24*3600) returns 7425.
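For readers who want to check the arithmetic outside the spreadsheet, the same conversion can be sketched in Python. This is only an illustration of the minutes * 60 + seconds calculation, not part of the Sheets solution. The function name is made up, and the pattern deliberately covers only the PT<minutes>M<seconds>S form used above.

```python
import re

# Same pattern as in the formulas: minutes and seconds only.
DURATION_RE = re.compile(r"PT(\d+)M(\d+)S")


def duration_to_seconds(duration):
    """Convert a duration such as 'PT1M10S' to a number of seconds."""
    match = DURATION_RE.fullmatch(duration)
    if match is None:
        raise ValueError(f"unsupported duration format: {duration!r}")
    minutes, seconds = (int(group) for group in match.groups())
    return minutes * 60 + seconds


print(duration_to_seconds("PT1M10S"))    # 70
print(duration_to_seconds("PT123M45S"))  # 7425
```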
References:
REGEXREPLACE
VALUE
If I misunderstood your question and this was not the result you want, I apologize.
Added:
As another pattern, if you want to use =REGEXEXTRACT(IMPORTXML(A2,"//*[@itemprop='duration']/@content"),"PT(\d+)M(\d+)S"), how about the following formula?
Sample formula:
=QUERY(ARRAYFORMULA(VALUE(REGEXEXTRACT(IMPORTXML(A2,"//*[@itemprop='duration']/@content"),"PT(\d+)M(\d+)S"))),"SELECT Col1*60+Col2 label Col1*60+Col2 ''")
In this formula, the values from the array are used directly in the calculation.

or like this:
=TEXT(VALUE("00:"&SUBSTITUTE(REGEXREPLACE(
IMPORTXML(A1, "//*[@itemprop='duration']/@content"), "PT|S", ), "M",":")), "[ss]")*1
or shortest:
=REGEXREPLACE(IMPORTXML(A1,"//*[@itemprop='duration']/@content"),
"PT(\d+)M(\d+)S", "00:$1:$2")*86400

Cumulative sum of AWS Cloudwatch Metric

AWS Cloudwatch receives a count of 1 every time I start an image download. I am downloading 1,000s of images (on a cluster of EC2 instances) and would like to track the total progress.
I can't find any documentation on how to plot the cumulative sum of a metric. The AWS Cloudwatch Math Expressions looked promising, but they do not have an integrate function.
Currently, I can plot the sum of the started image downloads but only for periods, as seen below. Ideally, I'd like to plot the integral of this plot:

You can get a cumulative sum over the current range by using the SUM() function and dividing it by a time-series that contains only the number 1. Remember, you're looking for a single number in the end, so it's not much of a graph, but you need to turn the single-value sum back into a time-series.
Define m1 as your metric. This is the metric you will want to use SUM() on.
Define an expression e1 as m1/m1. This results in a time-series with every value equal to 1. This is what will allow you to convert that SUM back to a time-series.
Define an expression e2 as SUM(m1) / e1. This is, effectively, the cumulative sum of m1 divided by one for every data-point in the original time-series. It will be a horizontal line on the graph, which will have every point on that horizontal line being the cumulative sum of metric m1. This is required because Cloudwatch can only plot a time-series on the chart, not a single value.
Make m1 and e1 invisible. You need them, but you don't need to see them.
Finally, change the chart type from Line to Number, since you only wanted the cumulative sum anyway.
The reason you can't use SUM() directly is because it is a single value. By dividing by a time-series containing all 1's, the entire graph is the result of the SUM(). Then, changing the chart to a Number effectively hides all the math and presents only the "final result".
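The effect of the two expressions can be mimicked in plain Python. This is not CloudWatch code, just a sketch of the arithmetic on a list of per-period sums. The sample values are made up, and the sketch assumes every datapoint in m1 is non-zero so that m1/m1 is defined.

```python
def flat_sum_series(m1):
    """Mimic e1 = m1/m1 and e2 = SUM(m1)/e1 on a list of datapoints."""
    total = sum(m1)              # SUM(m1): a single scalar
    e1 = [v / v for v in m1]     # m1/m1: a series of ones (assumes v != 0)
    e2 = [total / one for one in e1]
    return e2


m1 = [3, 5, 2]                   # downloads started per period (made-up values)
print(flat_sum_series(m1))       # [10.0, 10.0, 10.0]
```

Every point of the resulting series carries the same total, which is why the Number chart type is a natural way to display it.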

It looks like RUNNING_SUM() has since been added, and it does what you need:
Graph with RUNNING_SUM
You can find RUNNING_SUM() under "Add math"->"All functions"
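Conceptually, a running sum turns the per-period counts into a cumulative curve. The following Python sketch shows that idea on a plain list. It is an illustration of the concept rather than a description of how CloudWatch implements the function, and the sample values are made up.

```python
from itertools import accumulate


def running_sum(datapoints):
    """Return the cumulative sum of a sequence of per-period values."""
    return list(accumulate(datapoints))


per_period = [3, 5, 2, 0, 4]     # downloads started in each period (made-up values)
print(running_sum(per_period))   # [3, 8, 10, 10, 14]
```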

You are correct. All Amazon CloudWatch metrics are for a defined period.
The maximum period for a metric is one day, so this is not suitable for a cumulative counter that you wish to continue beyond one day.
You would need to find an alternate method of storing the count, such as an Amazon DynamoDB table. Use an atomic counter via UpdateItem to increment the count.
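A minimal sketch of such an atomic increment is shown below. It only builds the request parameters for the DynamoDB UpdateItem operation. The table name `image-downloads`, the key attribute `counter_id`, and the attribute `download_count` are hypothetical names chosen for the example, and the actual call (shown in the trailing comment) assumes boto3 is installed and AWS credentials are configured.

```python
def build_increment_request(table_name, counter_id, amount=1):
    """Build UpdateItem parameters that atomically add `amount` to a counter."""
    return {
        "TableName": table_name,
        "Key": {"counter_id": {"S": counter_id}},           # hypothetical key schema
        "UpdateExpression": "ADD #n :inc",                   # ADD creates the attribute if missing
        "ExpressionAttributeNames": {"#n": "download_count"},
        "ExpressionAttributeValues": {":inc": {"N": str(amount)}},
        "ReturnValues": "UPDATED_NEW",                       # return the new counter value
    }


request = build_increment_request("image-downloads", "batch-1")

# With boto3 available, the request could be sent like this:
#   import boto3
#   response = boto3.client("dynamodb").update_item(**request)
#   new_total = int(response["Attributes"]["download_count"]["N"])
```

Because the increment happens inside DynamoDB, several EC2 instances can update the same counter without reading it first.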

You can also use a very long period.
Change your stat to SUM, and set your metric's period to 7 days. You'll get a time series of 1 point with the cumulative sum of all the downloads.
If you give each download a unique dimension value, you can keep your queries separate.

Can rrdtool store data for metrics, list of which changes over time, like, for example, top 10 processes consuming CPU?

We need to create a graph of the top 10 items, which will change from time to time (for example, the top 10 processes consuming CPU, or any other top 10 items for which we can generate values on the monitored server), with the names of the items shown on the graph.
Please tell me, is there any way to store this information using rrdtool?
Thanks

If you want to store this kind of information with rrdtool, you will have to create a separate rrd database for each item, update them accordingly and finally generate charts picking the 10 'top' rrd files ...
In other words, quite a lot of the magic has to happen in the script you write around rrdtool ... rrdtool will take care of storing the time series data ...
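The "picking" part of such a wrapper script could look roughly like the Python sketch below. It assumes the script already has the latest value for each item in a dictionary. The directory `/var/lib/rrd/top`, the file naming scheme, and the function names are made up for the example.

```python
import heapq


def top_items(latest_values, n=10):
    """Return the n (name, value) pairs with the highest values."""
    return heapq.nlargest(n, latest_values.items(), key=lambda item: item[1])


def rrd_path_for(name, base_dir="/var/lib/rrd/top"):
    """Map an item name to its own RRD file (hypothetical naming scheme)."""
    safe = "".join(c if c.isalnum() or c in "-_" else "_" for c in name)
    return f"{base_dir}/{safe}.rrd"


# Latest CPU percentage per process (made-up values).
latest = {"nginx": 12.5, "postgres": 40.1, "cron": 0.2, "python3 worker": 33.0}

for name, value in top_items(latest, n=3):
    print(name, value, rrd_path_for(name))
```

The selected file paths would then be turned into one DEF and one LINE directive each when building the rrdtool graph command, with the item name used as the legend.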

rrdtool: Compute 95th percentile of data within a sliding window

I'm using rrdtool to graph data about CPU usage as produced and stored by Munin. Munin (at least for us) stores each data-series in an .rrd file with 12 RRAs: "MIN", "MAX", and "AVERAGE" over each of the four periods "last 2d in 5m intervals", "last 9d in 30m intervals", "last 270d in 12h intervals", and "last 177y in 144d intervals".
I already know how to use rrdtool graph to produce a trend line indicating where my average CPU usage is going. (For simplicity, we can pretend I'm on a single-CPU system; in real life I have more code to deal with that.)
rrdtool graph /tmp/foo.png \
--start -12w --end +24w \
--lower-limit 0 --upper-limit 100 --rigid \
--title 'cpu usage' --width 620 --height 200 --border 0 \
--vertical-label 'cpu usage' \
DEF:idle=/var/lib/munin/mybox/mybox-cpu-idle-d.rrd:42:AVERAGE \
DEF:iowait=/var/lib/munin/mybox/mybox-cpu-iowait-d.rrd:42:AVERAGE \
CDEF:percent_used=100,idle,-,iowait,- \
AREA:percent_used#00880077:'cpu usage' \
VDEF:fit_m=percent_used,LSLSLOPE \
VDEF:fit_b=percent_used,LSLINT \
CDEF:trendline=percent_used,POP,fit_m,COUNT,*,fit_b,+ \
LINE1:trendline#FFBB00:'Trend since 12w ago'
The problem with this graph is that it shows only the average CPU usage trend. But my workload is spiky: usage is very low 90% of the time and then has brief spikes. What I really care about is the trend of the spikes in CPU usage.
So I could run the same command replacing AVERAGE with MAX... but the actual maxes are so randomly distributed (and usually close to 100%) that they don't produce any useful trend line.
So I'm thinking that the graph I actually want would be a graph of the 95th percentile (or maybe just the 75th percentile... ideally I'd be able to adjust the parameter), where that "percentile" is taken over the data in each consecutive 24-hour period.
Conceptually, I want to boil down our last 9 days of data (48 data points per day) into just 9 data points (1 data point per day — representing the Nth percentile of the 48 original points from that day).
And then I'd fit a line to that data using LSLSLOPE and LSLINT and display it on the same graph as the rest of this stuff.
But I can't figure out how to boil down the data in this way, using rrdtool's RPN facilities.
I know that I can use PERCENTNAN to get the scalar number that is the 95th percentile of my whole data-series, but I want a data-series consisting of 9 numbers, not just one scalar.
I know that I can use TRENDNAN to get a data-series that is the mean of a sliding window of my data-series, which would be good enough if only it gave me the median (50th percentile) instead of the mean, and then allowed me to adjust that parameter from "50" up to "95"... but it doesn't.
Alternatively, I know how to use Python to compute the series I want, using rrdtool first and rrdtool fetch, but then there's no simple way to feed that series back into rrdtool to create the graph.
I'm thinking maybe I could extract usage_today, usage_yesterday, usage_2d, usage_3d,... into nine separate series, use PERCENTNAN on them all individually, and then somehow fit a line to that. But that's mostly desperate handwaving; if someone posted an answer that actually made that approach work, I'd accept it.

RRDTool has 95th percentile functionality built in. Note, though, that the accuracy of the percentile calculations will depend on the granularity of the data available in the requested time period, so the bigger your 1-pdp RRA is, the better.
So, for example, to get a horizontal line at the 95th percentile, we can use these directives:
DEF:idlehr=/var/lib/munin/mybox/mybox-cpu-idle-d.rrd:42:AVERAGE:step=1
VDEF:pctidle=idlehr,95,PERCENTNAN
HRULE:pctidle#ff0000:95th_Percentile
The step=1 on the end of the DEF ensures that the highest resolution data available will be selected. This may be computationally intensive if you're graphing for a full year and high resolution data are available for this time window!
The problem is, though, that you want a graph showing a different value for each day -- in effect, a sliding window of percentile calculations, in the same way as TREND and PREDICT work, but with a step of one day. RRDTool cannot do this.
So, the answer is, you can show a graph for one day with a single value percentile for that day. You cannot create a graph with one data point per day, where that data point is calculated as the percentile for that day.
The only way I can think of to achieve this is to call rrdtool xport repeatedly to calculate the percentile values for a sequence of days, and then use that data to generate a bar graph in another graphing package.
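Once the data has been exported, the per-day percentile itself is easy to compute outside RRDTool. The Python sketch below assumes the export has already been parsed into (timestamp, value) pairs. It uses a simple nearest-rank percentile, which is an assumption and may not match RRDTool's PERCENTNAN exactly, and it groups by UTC day.

```python
import math
from collections import defaultdict

DAY = 86400  # seconds per day


def percentile(values, pct):
    """Nearest-rank percentile, ignoring missing (None/NaN) values."""
    clean = sorted(v for v in values if v is not None and not math.isnan(v))
    if not clean:
        return math.nan
    rank = max(1, math.ceil(pct * len(clean) / 100))
    return clean[rank - 1]


def daily_percentiles(rows, pct=95):
    """Group (timestamp, value) rows by UTC day and return one percentile per day."""
    buckets = defaultdict(list)
    for timestamp, value in rows:
        buckets[timestamp // DAY].append(value)
    return [(day * DAY, percentile(vals, pct)) for day, vals in sorted(buckets.items())]


# Two days of made-up data at 30-minute intervals.
rows = [(i * 1800, float(i + 1)) for i in range(20)]
rows += [(DAY + i * 1800, float(101 + i)) for i in range(10)]
print(daily_percentiles(rows))
```

The resulting list of one point per day is what could then be handed to another graphing package, or have a trend line fitted to it.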
