RRDTool data values (e.g. max value) are different in different time resolutions - rrdtool

currently I'm experimenting a bit with RRDTool. I'm aware that the accuracy gets lower the longer the time periods are selected. But I thought I could bypass this with my datasource settings.
For example temperature and humidity from my house, resoultion 1h:
And now with the resolution of 1d:
As you could see, there is a great difference for the max. value of the blue line.
I created my datasources and archives with this values:
"rrdtool create temp.rrd --step 30",
"DS:temp:GAUGE:60:U:U",
"DS:humidity:GAUGE:60:U:U",
"RRA:AVERAGE:0.5:1:1051200",
"RRA:MAX:0.5:1:1051200",
"RRA:MIN:0.5:1:1051200",
I thought that 1051200 (1 year = 31536000 / 30 s (resoulution) = 1051200) is correct for saving every value for a year and that there should be no need for interpolating.
Is it possible to get the exact values displayed even if the resolution changes (for example the max humidity (Luftfeuchtigkeit) at 99.9%)?
Here are my values for image creation:
"--start" => "-1h", (-1d etc-)
"--title" => "Haustemperatur",
"--vertical-label" => "°C / % RLF",
"--width" => 800,
"--height" => 600,
"--lower-limit" => "-5",
"DEF:temperatur=$rrdFile:temperatur:LAST",
"DEF:humidity=$rrdFile:humidity:LAST",
"LINE1:temperatur#33CC33:Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperatur:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humidity:MAX:Maximum\: %4.2lf %%\j",
Thanks for your help and any suggestions.
P.S. I'm using a library to generate the graphs and the database, please do not be surprised about possible syntax errors.

Your problem is that you are causing the values to be rolled-up on the fly at graph time, but have not correctly specified which rollup function to use. Your second graph is showing the MAXIMUM of the LAST in the interval, not the true Maximum.
There are a few issues to explain with this configuration:
Firstly, your RRD is defined using 3 RRAs with 1cdp=1pdp and different consolidation functions (AVG, MIN, MAX). This means they are functionally identical, but they do not save you any time at graphing as they have not done any pre-rollup for you! You should definitely consider having just one of these (probably AVG) and adding others at lower resolution to help speed up graphing when you have a bigger time window.
Secondly, you need to specify the on-the-fly rollup function. When graphing, RRDTool will work out the best RRA to use based on your DEF lines, and will perform any additional consolidation required on the fly. This can take a long time if the only available RRA is too high-granularity.
Your graph request uses DEF:temperatur=$rrdFile:temperatur:LAST but you do not actually have a LAST type RRA, so RRDTool will grab the last average. Your RRA data points are at 30s interval, but your second graph has (approx) 5min per pixel, meaning that RRDTool needs to grab the 10 entries from the RRA, and print the last. Looking at the data in the top graph, it seems that the last in that interval was the 66 value, though previous ones were 100.
So you have a choice. Do you want the graph to show the average for the time period, the maximum, or both? Do you want the figures at the bottom to show the maximum of the average, or the maximum of everything?
For example
"DEF:temperatur=$rrdFile:temperatur:AVERAGE",
"DEF:humidity=$rrdFile:humidity:AVERAGE",
"DEF:temperaturmax=$rrdFile:temperatur:MAX;reduce=MAX",
"DEF:humiditymax=$rrdFile:humidity:MAX;reduce=MAX",
"LINE1:temperatur#33CC33:Temperatur",
"LINE1:temperaturmax#66EE66:Maximum Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperaturmax:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"LINE1:humiditymax#3333FF:Maximum Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humiditymax:MAX:Maximum\: %4.2lf %%\j",
In this case, we define a separate DEF for the maximum data set, so that we can always obtain the highest value even after consolidation. This is also used in the GPRINT so that we get the MAX of the MAX rather than the MAX of the AVERAGE. The Maximum line is now drawn separately to the average line, so that we can see the effect of any rollup of data - the lines will be together at high-resolution but get further apart as the time window widens and resolution decreases.
TheDEF is set to force any rollup function used for the maxima to be MAX rather than AVG, so we can be sure to get the maximum rather than average of maxima.
We are also using AVERAGE rather than LAST in order to get more meaningful data after rollup. Note that we could also use a separate DEF for the LAST as well if we wanted to though it is of less usefulness.
Note that, if you ever expect to be generating graphs over more than a few days, you should definitely consider adding some lower-resolution RRAs for AVERAGE and MAX or else the graphs will generate very slowly. RRDTool is designed with the intention that data will be rolled up over time, rather than (as in a traditional database) every sample kept as-is. So, unless you really need to have 30s resolution data kept for an entire year, you may prefer to keep this high resolution data for only a week, and then have separate RRAs that roll up to 1 hour resolution and keep for longer. Many people keep the 30s for 2 days, then 30min-summary for 2 weeks, 2h-summary for 2 months, and then 1day-summary for 2 years.
For more information, see the RRDTool manual pages.

Related

Is there a way to just sum a google cloud metric within a time period?

With the Metrics Explorer in the Google Cloud Platform, is it possible to sum a metric within a defined time period? I have custom gauge metrics set up for various data that I care for. And I am just trying to sum them up over the course of days, weeks, and months. It is important that I count everything between the first of the month and the last of the month. But instead, the aligner seems to pick an arbitrary time to align to (e.g. the 4th of the month) and I can't be certain that I am getting the correct values.
For example, if I try to use delta, rate, or sum within a time window of May 1st 00:00 to June 1st 00:00, with an alignment of 31 days, I will see two numbers. One will be for the 14th of May and one for the 14th of June and they will add up to a very large number.
fetch generic_task
| metric 'custom.googleapis.com/go-metrics/request_counter'
| filter (metric.environment == 'prod')
| align delta(31d)
| every 31d
| group_by [metric.environment],
[value_request_counter_aggregate: aggregate(value.request_counter)]
That isn't so bad but if I change alignment, the sum of those numbers don't add up, such as if I tried 7d or 1d instead. Like, the number is twice as much as if it were counting data from outside the time range that I specified. And worse, upon reload of the webpage, it picks different days/times it will align on too.
To get around this, I have been reduced to setting the alignment to a fine amount and just tolerating a small amount of error/inconsistency.

Rrdtool graphing a digital event without jagged edges

I'm reading my energy meter output to keep track of actual energy used, etc. The energy cost is calculated with a tarif, and that changes during the day from one state to another (1 or 2). When I graph the tarif together with the actual usage over time, the sharp edge of the tarif state gets jagged, possibly caused by the averaging.
I have used DS:tarif:GAUGE:... to setup the db, and I'm using DEF:tarif:xx.rrd:tarif:AVERAGE to graph.
How do I record and graph a "digital" signal with sharp edges?
There are two things that you might be referring to here.
First, make sure you are not using --slope-mode. This option uses slopes rather than steps in the graph; this sound like the wrong option for you.
Next, you have the problem that your tariff DS is varying between two values (lets say a and b), but when you look at the higher-level graphs, this starts to average out and looks wrong.
When looking at your high-granularity (IE, close up) graphs, you have one (or more) pixels per RRA data point, and one sample per RRA data point. So, you'll see either a or b. However, when you move up to a lower granularity (IE, further away) graph, RRDtool will start to have multiple data points per pixel. Depending on your RRD file definition, rrdtool will then either consolodate on the fly, or move to using a different RRA with more samples per datapoint.
So, this means that you have multiple samples per pixel, and they need to somehow be combined. By default, RRDTool will average them, which can result in the jagged behaviour.
However, what do you want to happen? If the time interval corresponding to a single pixel has 2 instances of a and three of b, what should the graph show?
Here are a couple of suggestions on how you might do it.
Use background.
Since your tariff has only two values, you could use this to colour the background - eg, make a red or green background depending on tariff, and then draw your usage graph line over the top of this. Background colour can be done by having an area from 0 to inf using a semi-transparent colour like #ff808080
Use a special MAX (or MIN) RRA
Maybe you just want to display the maximum tariff for that period. So, you can create an additional MAX type RRA for each consolodation interval, and graph the MAX. Of course, this means that when you have 1 pixel = 1 day, you'll be seeing just the higher value as a straight line. I suppose a 'median' consolodation would be useful here, but for obvious reasons RRDTool doesn't have this.
A bigger graph
A large graph means more pixels, which means less averaging required.
Display your data differently
What are you trying to visualise? Possibly you don't need to see the average cost per unit over this time window, and a calculated sum of units x tariff would work better - particularly since this would be able to be averaged up without problem.
Don't use RRDTool
RRDTool is designed to progressively normalise, summarise and expire regular time-series data over time, and does so very efficiently. However, if you're interested in having the exact data forever then maybe you need a different database.
Pre-normalise your sampling times
If your data change frequently, then Data Normalisation might be changing them. Make sure you always store the data on a time step boundary - if your RRD step is 300 (5min) then ensure you store the data at timestamps which are a multiple of 300, and don't use N (meaning 'whatever the time is Now').
Thank you for this very elaborate answer. Because I have not been able to find suitable answers to my problem, your summary, and my situation, may hopefully also help others.
Looking back, I should have included more information, and before addressing your points, let me do that right now:
Here is a snippet of my database definition script:
# 5 min sample rate = 300 step rate
# 1D report: 1 step every 5 minutes (5 min sample rate=1), 20 per hr, 24hrs = 480 slots
# 1Wk report: 1 step every hour (20 x 5 min samples=20), 24 hrs, 7 days = 168 slots
# 1M report: 1 step every hour (20 x 5 min samples=20), 24 hrs, 30 days = 720 slots
# 1Y report: 1 step every day (24 Hr * 20 samples=480), 365 days = 365 slots
rrdtool create energy_mon.rrd --step 300 --start 1480943366 \
DS:meter_total:COUNTER:600:U:U \
DS:meter_low:COUNTER:600:U:U \
DS:meter_hi:COUNTER:600:U:U \
DS:energy:GAUGE:600:U:U \
DS:tarif:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:480 \
RRA:AVERAGE:0.5:20:168 \
RRA:AVERAGE:0.5:20:720 \
RRA:AVERAGE:0.5:480:365
Here is a snippet of my graphing script:
#daily
rrdtool graph $GDIR/energy_daily.png --start -1d \
-w 675 -h 250 \
--vertical-label "KWatt" \
--lower-limit=0 \
--watermark "`date`" \
DEF:energy=$DIR/energy_mon.rrd:energy:AVERAGE \
LINE1:energy$GREEN_COLOR:"Energie" \
DEF:tarif=$DIR/energy_mon.rrd:tarif:AVERAGE \
LINE1:tarif$BLACK_COLOR:"Tarief"
#two days
rrdtool graph $GDIR/energy_2days.png --start -2d \
-w 675 -h 250 \
--vertical-label "KWatt" \
--lower-limit=0 \
--watermark "`date`" \
DEF:energy=$DIR/energy_mon.rrd:energy:AVERAGE \
LINE1:energy$GREEN_COLOR:"Energie" \
DEF:tarif=$DIR/energy_mon.rrd:tarif:AVERAGE \
LINE1:tarif$BLACK_COLOR:"Tarief"
Here is a graph of the "1 day" result:
1 day
Here is a graph of the "last 48 hrs" :
2 day
In order to see my "tarif" on the graphs together with the actual energy usage in KWatts, I multiply the 1 (low) or 2 (normal) setting by 100 and store that number in the database. On weekdays, the tariff changes at 23:00 hrs to "low" and back to normal at 07:00 hrs. On the "daily" graph this is reported correctly, albeit with a 5 min. sample resolution "error".
As you can see, the "last 48 hrs" set by "-2d" in the graphing instructions, should be using the same RRA data from the database, but the tarif line does not switch from 100 to 200, but "sits" at about half way for approx. 1-1/2 hours in time, or more than 30 samples. This is not a rounding based on a pixel level resolution (;-).
The only explanation I have at this moment is that the "-2d" graph uses the second RRA, that is actually intended for a week's worth of data. If this is the answer, then how does one influence the relationship between the "-d/w/m/y" settings in the graphing instructions and the intended RRA's? I don't see a connection between the two, so it's probably selected automatically, and most likely (if it's that smart) based on the amount of data points. The first RRA does not have enough so it switches to use the second RRA. Seems plausible, but is this true?
BTW, using "-1w", "-1m" and "-1y" for the other graphs I use, show the same weird results on the tarif "signal.
In the meantime, I have been able to "fix" the graphs without redefining my database by adding a CDEF. Here is what I use for all graphs other than the "1d".
DEF:tarif=$DIR/energy_mon.rrd:tarif:AVERAGE \
CDEF:normaal=tarif,100,GT,200,100,IF \
LINE1:normaal$BLACK_COLOR:"Tarief"
To explain the CDEF, IF the tariff is > 100 (or actually 1=low), set the result to "normaal" which is the high tariff. IF not, set the tariff to "dal" or low.
Your advice to use the background color for tariff is a very good one, and because I can use that on the current database, I have tried that too.
So is your MIN/MAX solution, I thought about that earlier but put that off because that will require setting up a new database and filling that with previous data. (which I have in .csv form, and I have gone through that tedious exercise before - Which is why I have the --start in the rddtool create)
But, I would like to get to the bottom of this issue.
Do you or anybody else have any more words of wisdom to share or can verify/explain the working of the RRA <==> DEF relationship or selection?

rrdtool: Compute 95th percentile of data within a sliding window

I'm using rrdtool to graph data about CPU usage as produced and stored by Munin. Munin (at least for us) stores each data-series in an .rrd file with 12 RRAs: "MIN", "MAX", and "AVERAGE" over each of the four periods "last 2d in 5m intervals", "last 9d in 30m intervals", "last 270d in 12h intervals", and "last 177y in 144d intervals".
I already know how to use rrdtool graph to produce a trend line indicating where my average CPU usage is going. (For simplicity, we can pretend I'm on a single-CPU system; in real life I have more code to deal with that.)
rrdtool graph /tmp/foo.png \
--start -12w --end +24w \
--lower-limit 0 --upper-limit 100 --rigid \
--title 'cpu usage' --width 620 --height 200 --border 0 \
--vertical-label 'cpu usage' \
DEF:idle=/var/lib/munin/mybox/mybox-cpu-idle-d.rrd:42:AVERAGE \
DEF:iowait=/var/lib/munin/mybox/mybox-cpu-iowait-d.rrd:42:AVERAGE \
CDEF:percent_used=100,idle,-,iowait,- \
AREA:percent_used#00880077:'cpu usage' \
VDEF:fit_m=percent_used,LSLSLOPE \
VDEF:fit_b=percent_used,LSLINT \
CDEF:trendline=percent_used,POP,fit_m,COUNT,*,fit_b,+ \
LINE1:trendline#FFBB00:'Trend since 12w ago'
The problem with this graph is that it shows only the average CPU usage trend. But my workload is spiky: usage is very low 90% of the time and then has brief spikes. What I really care about is the trend of the spikes in CPU usage.
So I could run the same command replacing AVERAGE with MAX... but the actual maxes are so randomly distributed (and usually close to 100%) that they don't produce any useful trend line.
So I'm thinking that the graph I actually want would be a graph of the 95th percentile (or maybe just the 75th percentile... ideally I'd be able to adjust the parameter), where that "percentile" is taken over the data in each consecutive 24-hour period.
Conceptually, I want to boil down our last 9 days of data (48 data points per day) into just 9 data points (1 data point per day — representing the Nth percentile of the 48 original points from that day).
And then I'd fit a line to that data using LSLSLOPE and LSLINT and display it on the same graph as the rest of this stuff.
But I can't figure out how to boil down the data in this way, using rrdtool's RPN facilities.
I know that I can use PERCENTNAN to get the scalar number that is the 95th percentile of my whole data-series, but I want a data-series consisting of 9 numbers, not just one scalar.
I know that I can use TRENDNAN to get a data-series that is the mean of a sliding window of my data-series, which would be good enough if only it gave me the median (50th percentile) instead of the mean, and then allowed me to adjust that parameter from "50" up to "95"... but it doesn't.
Alternatively, I know how to use Python to compute the series I want, using rrdtool first and rrdtool fetch, but then there's no simple way to feed that series back into rrdtool to create the graph.
I'm thinking maybe I could extract usage_today, usage_yesterday, usage_2d, usage_3d,... into nine separate series, use PERCENTNAN on them all individually, and then somehow fit a line to that. But that's mostly desperate handwaving; if someone posted an answer that actually made that approach work, I'd accept it.
RRDTool has 95th percentile functionality built in. Note that the accuracy of the percentail calculations will depend on the granularity of the data available in the requested time period, though... so the bigger your 1-pdp RRA is, the better.
So, for example, to get a horizontal line at the 95th percentile, we can use these directives:
DEF:idlehr=/var/lib/munin/mybox/mybox-cpu-idle-d.rrd:42:AVERAGE:step=1
VDEF:pctidle=idlehr,95,PERCENTNAN
HRULE:pctidle#ff0000:95th_Percentile
The step=1 on the end of the DEF ensures that the highest resolution data available will be selected. This may be computationally intensive, if you're graphing for a full year and high resolution data are avaialable for this time window!
The problem is, though, that you want a graph showing a different value for each day -- in effect, a sliding window of percentile calculations, in the same way as TRED and PREDICT work, but with a step of one day. RRDTool cannot do this.
So, the answer is, you can show a graph for one day with a single value percentile for that day. You cannot create a graph with one data point per day, where that data point is calculated as the percentile for that day.
The only way I can think of to achieve this is to repeatedly call rrdtool xport iteratively to calculate the percentile values for a sequence of days, and then use that data to generate a bar graph in another graphing package.

RRDTool Counter increment lower than time

I create a standard RRDTool database with a default step of 5mn (300s).
I have different types of values in it, some gauges which are easily processed, but I have other values I would have in COUNTER but here is my problem :
I read the data in a program, and get the difference between values over two steps is good but the counter increment less than time (It can increment by less than 300 during a step), so my out value is wrong.
Is it possible to change the COUNTER for not be a number by second but by step or something like that, if it's not I suppose I have to calculate the difference in my program ?
Thank you for helping.
RRDTool is capable of handling fractional values, so there is no problem if the counter increments by less than the seconds interval since the last update.
RRDTool stores everything as a Rate. If your DS is of type GAUGE, then RRDTool assumes that the incoming value is alreayd a rate, and only applies Data Normalisation (more on this later). If the type is COUNTER or DERIVE, then the value/timepoint you are updating with is compared to the previous value/timepoint to obtain a rate thus: r=(x2 - x1)/(t2 - t1). The rate obtained is then Normalised. The other DS type is ABSOLUTE, which assumes the counter was reset on the last read, giving r=x2/(t2 - t1).
The Normalisation step adjusts the data point based on assuming a linear progression from the last data point so that it lies exactly on an interval boundary. For example, if your step is 5min, and you update at 12:06, the data point is adjusted back to what it would have been at 12:05, and stored against 12:05. However the last unadjusted DP is still preserved for use at the next update, so that overall rates are correct.
So, if you have a 300s (5min) interval, and the value increased by 150, the rate stored will be 0.5.
If the value you are graphing is something small, e.g. 'number of pages printed', this might seem counterintuitive, but it works well for large rates such as network traffic counters (which is what RRDTool was designed for).
If you really do not want to display fractional values in the generated graphs or output, then you can use a format string such as %.0f to enforce no decimal places and the displayed number will be rounded to the nearest integer.

RRD graphs in Zenoss showing NaN on large time ranges

I am trying to create COMMAND JSON datasource to monitor some values, for example from such script:
print json.dumps({
'values': {
'': {'random': random()},
},
'events': []
})
And when i just starting zencommand, appropriate rrd file is created, but cur, avg and max values on graph shows me NaN. That NaNs is replaced by actual numbers when I zoom in to a current point in time, which is not very far from start of monitoring.
Why it don't show correct min, max and avg values before I zoom in? Is that somehow related to consolidation? I read http://www.vandenbogaerdt.nl/rrdtool/min-avg-max.php, but that page don't tell anything about NaN values.
And is any way to quicker zoom in to the current timestamp to see some data faster?
When you are zoomed out, you'll be looking at the lower-granularity RRAs (Round Robin Archives). These do not get populated until enough data are in the higher-granularity ones; so, for example, if you have a 5min-granularity RRA, a 1hr-granularity RRA, and a 1day-granularity RRA, and have collected data for the last 45min, then you will see ~8 data points in your 'daily' graph (which uses the 5min RRA), but nothing in your 'monthly' (which will use the 1hr RRA) or your 'yearly' (which uses the 1day RRA).
This applies to any RRA; AVG, LAST, MAX, etc. Until the consolidated time window is complete, and the full complement of Primary Data Points has been collected for consolidation, the consolidated data point value is undefined.
RRDTool picks the RRA to use based on the requested graph data width and pixel width, as well as the requested consolidation functions. Although there are ways to force RRDtool to use a higher-granularity RRA than it needs to, and to consolidate on the fly, this is inefficient and slow. It also makes having the lower-granularity RRA pointless and throws away one of the major benefits of RRDtool (that it performs consolidation at update time making graphing faster)