RRDtool Percentage calculation

I want to calculate the percentage of usage of some features of my host via RRD queries. (I have Cacti installed, and Cacti stores the monitoring data in RRD files.)
For example, if there is 1 GB of total swap memory and I have currently used 250 MB, the return value of my query should be 0.25;
or, as another example, if the total network bandwidth is 200 and 100 is currently used, the desired return value is 0.50.
My questions are:
1) Can RRD tell me these total values (total memory of the host or total network bandwidth)?
2) Which query syntax can return such percentages as described in the examples?
If my questions are not clear, I can describe more.
If anyone can point me to some good documentation on RRD so I can figure this out myself, or can give me a good start, it would be greatly appreciated.

For calculations in rrdtool you can use CDEF expressions, either when drawing graphs or in rrdtool xport commands.
CDEF:perc=x,200,/
The expressions use reverse Polish notation (RPN) and are documented in man rrdgraph_data.
In this example:
perc is the name of the new computed field;
the part after = is an RPN expression meaning x/200, where x is a value defined earlier with a DEF.
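As a fuller sketch for your swap example (the file name and data-source names here are assumptions; list the real ones in your Cacti RRD with rrdtool info): note that an RRD only contains the data sources you actually store in it, so the total must either be collected as its own data source or written into the CDEF as a constant.
rrdtool xport --start -1h --end now \
DEF:used=swap.rrd:swap_used:AVERAGE \
DEF:total=swap.rrd:swap_total:AVERAGE \
CDEF:perc=used,total,/ \
XPORT:perc:"fraction of swap used"
With 250 MB used out of 1 GB, perc comes out as 0.25.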

Why bootstrap returns empty results, with values for standard deviation, CI, and others missing

I am trying to use the Stata bootstrap command to estimate the standard deviation of the median of the hours variable from the nlsw88 dataset.
I can't get an answer, but when I change the variable in my code from hours to wage, it works. The only difference between these two variables is that hours is an integer while wage has decimals. I even changed the storage type of hours from byte to float, but nothing changed.
sysuse nlsw88
bootstrap r(p50), nodots: summarize hours , detail
You are bootstrapping to estimate the median (p50), but in practice there is no variance in the median.
I'm not sure exactly how bootstrap resamples, but you are not setting a sample size, and 48.75% of the observations have the value 40, which is also close to the mean. All of this makes it very likely that the median is 40 in every sample, and that there is therefore no variance between the samples.
You must pick a very small sample size to be likely to get any variance in the median between the randomly drawn samples; for example, sample only 5 observations in each bootstrap round:
sysuse nlsw88
bootstrap r(p50) , size(5) nodots: summarize hours , detail
This might explain why you are getting missing values (which is your question), but it does not answer what you should do instead. Ask yourself whether you really meant to use the median, and whether bootstrap is the right method to use here.

Using Logistic Regression For Timeseries Data in Amazon SageMaker

For a project I am working on, which uses annual financial-report data (of multiple categories) from companies which have either been successful or gone bust/into liquidation, I previously created a (fairly well performing) model on AWS SageMaker using a linear-model algorithm (specifically, the AWS stock algorithm for logistic regression/classification problems - the 'Linear Learner' algorithm).
This model just produces a simple "company is in good health" or "company looks like it will go bust" binary prediction, based on one set of annual data fed in; e.g.
query input:
{"data": [{
"Gross Revenue": -4000,
"Balance Sheet": 10000,
"Creditors": 4000,
"Debts": 1000000
}]}
inference output: "in good health" / "in bad health"
I trained this model by simply ignoring which year each company's values were from and piling in all of the annual financial-report data (i.e. one year's financial data for one company = one input line) for the training, along with the label of "good" or "bad" - a good company was one which has existed for a while but hasn't gone bust; a bad company is one which was found to have eventually gone bust; e.g.:
label | Gross Revenue | Balance Sheet | Creditors | Debts
good  | 10000         | 20000         | 0         | 0
bad   | 0             | 5             | 100       | 10000
bad   | 20000         | 0             | 4         | 100000000
I hence used these multiple features (gross revenue, balance sheet...) along with the label (good/bad) in my training input, to create my first model.
I would like to use the same features as before as input (gross revenue, balance sheet...), but over multiple years; e.g. take the values from 2020 and 2019 and use these (along with the eventual company status of "good" or "bad") as the singular input for my new model. However, I'm unsure of the following:
is this an inappropriate use of logistic-regression machine learning? i.e. is there a more suitable algorithm I should consider?
is it fine, or terribly wrong, to just use the same technique as before but combine the data for both years into one input line, like:
label | Gross Revenue (2019) | Balance Sheet (2019) | Creditors (2019) | Debts (2019) | Gross Revenue (2020) | Balance Sheet (2020) | Creditors (2020) | Debts (2020)
good  | 10000 | 20000 | 0    | 0      | 30000 | 10000 | 40  | 500
bad   | 100   | 50    | 200  | 50000  | 100   | 5     | 100 | 10000
bad   | 5000  | 0     | 2000 | 800000 | 2000  | 0     | 4   | 100000000
I would personally expect that a company which has gotten worse over time (i.e. whose finances are worse in 2020 than in 2019) should be more likely to be found "bad"/likely to go bust, so I would hope that, if I feed in data like in the above example (i.e. the earlier year's data comes before the later year's data on an input line), my training job ends up creating a model which gives greater weighting to the earlier years' data when making predictions.
Any advice or tips would be greatly appreciated - I'm pretty new to machine learning and would like to learn more.
UPDATE:
Using Long Short-Term Memory recurrent neural networks (LSTM RNNs) is one potential route I think I could try, but these seem to be commonly used with multivariate data over many dates; my data only has 2 or 3 dates' worth of multivariate data per company. I would want to use the data I have for all the companies, over the few dates' worth of data there are, in training.
I once developed a so-called genetic time series in R. I used a genetic algorithm which sorted out the best solutions from multivariate data, which were then fitted on a VAR in differences or a VECM. Your data seems more macroeconomic or financial than user-centric, and VAR or VECM seems appropriate. (It is certainly possible to treat time-series data in other ways so that LSTM or similar approaches can be used, but those are very common.) However, I do not know whether VAR in differences or VECM works with binary classification labels. Perhaps if you calculated a metric outcome, which you later encode to a categorical feature (or label it first as a categorical), then VAR or VECM may also be appropriate.
Alternatively, you could add all yearly data points into one data point per firm to forecast its survival, but you would lose a lot of insight. If you are interested in time-series ML, which works a little differently than neural networks or elastic net (which could also be used with time series), let me know and we can work something out, or I'll paste you some sources.
Summary:
1.) It is possible to use LSTM or elastic net (time points may be dummies or treated as a cross-sectional panel), or you can use VAR in differences or VECM with a slightly different outcome variable.
2.) It is possible, but you will lose information over time.
All the best,
Patrick

Clarification regarding the difference between ALIGN_MEAN and ALIGN_SUM in Google Cloud Metrics Explorer

I am collecting memory metrics using the compute.googleapis.com/guest/memory/bytes_used metric in Google Cloud Metrics Explorer. I selected a particular instance ID and set the alignment period to 1 day, so that I get the metrics per day.
For the same alignment period:
In advanced aggregation, when I selected the aligner mean, I got 114.526 KiB for the free category of memory.
In advanced aggregation, when I selected the aligner sum, I got 63.750 MiB for the free category of memory.
I do not understand the formulae for how ALIGN_MEAN and ALIGN_SUM are calculated, given that I have set the alignment period to 1 day. Can anyone give me the formula and an explanation?
Thanks a lot for your help.
It's only a graphical representation. You chose an alignment period of 1 day, so you have 1 value per day.
If you look at the graph over 1 day, for the sum, you have 1 value, equal to about 63 MiB. I think the line slopes slightly downward because the value the day before is slightly higher.
Now, take the same value, but say: I want to see the mean value during the day. You have 1 value per day, about 63 MiB, so the graph shows you an interpolation of the mean per hour/minute. If you change the timeframe, the line changes. Even if you change the size of your screen, it can change!
Go to the week or month timeframe view. The "align sum" line should grow by about 60 MiB per day, while the "align mean" line should be flat at around 60 MiB (at the month view).
We can read about the difference between ALIGN_MEAN and ALIGN_SUM in this GCP documentation [1]:
ALIGN_MEAN: Align the time series by returning the mean value in each alignment period. This aligner is valid for GAUGE and DELTA metrics with numeric values. The valueType of the aligned result is DOUBLE.
ALIGN_SUM: Align the time series by returning the sum of the values in each alignment period. This aligner is valid for GAUGE and DELTA metrics with numeric and distribution values. The valueType of the aligned result is the same as the valueType of the input.
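As a worked illustration (the sampling interval here is an assumption, since it depends on how the metric is written): if the one-day alignment period contains N samples, ALIGN_SUM returns N times what ALIGN_MEAN returns. With 288 five-minute samples each reporting 100 KiB free, ALIGN_MEAN would give 100 KiB, while ALIGN_SUM would give 288 × 100 KiB = 28,800 KiB ≈ 28.1 MiB. In your case, 63.750 MiB ÷ 114.526 KiB ≈ 570, which suggests roughly 570 samples fell into that day's alignment period.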
I would also like to share this other document [2].
Best Wishes!
[1] https://cloud.google.com/monitoring/api/ref_v3/rest/v3/projects.alertPolicies#aligner
[2] https://cloud.google.com/monitoring/charts/metrics-selector#alignment

RRDtool graphing a digital event without jagged edges

I'm reading my energy meter's output to keep track of actual energy used, etc. The energy cost is calculated with a tariff that changes during the day from one state to another (1 or 2). When I graph the tariff together with the actual usage over time, the sharp edges of the tariff state get jagged, possibly caused by averaging.
I used DS:tarif:GAUGE:... to set up the database, and I'm using DEF:tarif=xx.rrd:tarif:AVERAGE to graph.
How do I record and graph a "digital" signal with sharp edges?
There are two things that you might be referring to here.
First, make sure you are not using --slope-mode. This option uses slopes rather than steps in the graph; it sounds like the wrong option for you.
Next, you have the problem that your tariff DS varies between two values (let's say a and b), but when you look at the higher-level graphs, this starts to average out and looks wrong.
When looking at your high-granularity (i.e. close-up) graphs, you have one (or more) pixels per RRA data point, and one sample per RRA data point, so you'll see either a or b. However, when you move up to a lower-granularity (i.e. further-away) graph, RRDtool will start to have multiple data points per pixel. Depending on your RRD file definition, rrdtool will then either consolidate on the fly, or move to using a different RRA with more samples per data point.
So, this means that you have multiple samples per pixel, and they need to somehow be combined. By default, RRDTool will average them, which can result in the jagged behaviour.
However, what do you want to happen? If the time interval corresponding to a single pixel has two instances of a and three of b, what should the graph show?
Here are a couple of suggestions on how you might do it.
Use background.
Since your tariff has only two values, you could use this to colour the background - e.g., make a red or green background depending on tariff, and then draw your usage graph line over the top of it. The background colour can be drawn as an area from 0 to INF using a semi-transparent colour like #ff808080.
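A sketch of the idea (the file name and the usage DEF are placeholders, and I'm assuming the tariff is stored as 1 or 2, as described in the question):
DEF:tarif=xx.rrd:tarif:AVERAGE
CDEF:hibg=tarif,1.5,GT,INF,0,IF
AREA:hibg#ff808080:"high tariff"
LINE1:usage#00FF00:"usage"
The CDEF paints the full-height area only where the (averaged) tariff is above 1.5; drawing it before the usage LINE keeps the line on top.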
Use a special MAX (or MIN) RRA
Maybe you just want to display the maximum tariff for that period. So, you can create an additional MAX-type RRA for each consolidation interval, and graph the MAX. Of course, this means that when you have 1 pixel = 1 day, you'll be seeing just the higher value as a straight line. I suppose a 'median' consolidation function would be useful here, but for obvious reasons RRDTool doesn't have one.
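For instance (a sketch; the steps and row counts are assumptions - mirror whatever your AVERAGE RRAs use), the create command would gain lines like:
RRA:MAX:0.5:1:480 \
RRA:MAX:0.5:20:168 \
RRA:MAX:0.5:480:365
and the graph would then use DEF:tarif=xx.rrd:tarif:MAX.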
A bigger graph
A large graph means more pixels, which means less averaging required.
Display your data differently
What are you trying to visualise? Possibly you don't need to see the average cost per unit over this time window, and a calculated sum of units x tariff would work better - particularly since that can be averaged up without problems.
Don't use RRDTool
RRDTool is designed to progressively normalise, summarise and expire regular time-series data over time, and does so very efficiently. However, if you're interested in having the exact data forever then maybe you need a different database.
Pre-normalise your sampling times
If your data change frequently, then data normalisation might be changing them. Make sure you always store the data on a time-step boundary - if your RRD step is 300 (5 min), then ensure you store the data at timestamps which are a multiple of 300, and don't use N (meaning 'whatever the time is Now').
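A minimal shell sketch of that rounding (assuming a 300-second step; the file name and value are placeholders):
# round the current time down to the previous 300-second boundary
ts=$(( $(date +%s) / 300 * 300 ))
rrdtool update xx.rrd $ts:$value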
Thank you for this very elaborate answer. Because I had not been able to find suitable answers to my problem elsewhere, your summary together with my situation will hopefully also help others.
Looking back, I should have included more information, so before addressing your points, let me do that right now:
Here is a snippet of my database definition script:
# 5 min sample rate = 300 step rate
# 1D report: 1 step every 5 minutes (5 min sample rate=1), 20 per hr, 24hrs = 480 slots
# 1Wk report: 1 step every hour (20 x 5 min samples=20), 24 hrs, 7 days = 168 slots
# 1M report: 1 step every hour (20 x 5 min samples=20), 24 hrs, 30 days = 720 slots
# 1Y report: 1 step every day (24 Hr * 20 samples=480), 365 days = 365 slots
rrdtool create energy_mon.rrd --step 300 --start 1480943366 \
DS:meter_total:COUNTER:600:U:U \
DS:meter_low:COUNTER:600:U:U \
DS:meter_hi:COUNTER:600:U:U \
DS:energy:GAUGE:600:U:U \
DS:tarif:GAUGE:600:U:U \
RRA:AVERAGE:0.5:1:480 \
RRA:AVERAGE:0.5:20:168 \
RRA:AVERAGE:0.5:20:720 \
RRA:AVERAGE:0.5:480:365
Here is a snippet of my graphing script:
#daily
rrdtool graph $GDIR/energy_daily.png --start -1d \
-w 675 -h 250 \
--vertical-label "KWatt" \
--lower-limit=0 \
--watermark "`date`" \
DEF:energy=$DIR/energy_mon.rrd:energy:AVERAGE \
LINE1:energy$GREEN_COLOR:"Energie" \
DEF:tarif=$DIR/energy_mon.rrd:tarif:AVERAGE \
LINE1:tarif$BLACK_COLOR:"Tarief"
#two days
rrdtool graph $GDIR/energy_2days.png --start -2d \
-w 675 -h 250 \
--vertical-label "KWatt" \
--lower-limit=0 \
--watermark "`date`" \
DEF:energy=$DIR/energy_mon.rrd:energy:AVERAGE \
LINE1:energy$GREEN_COLOR:"Energie" \
DEF:tarif=$DIR/energy_mon.rrd:tarif:AVERAGE \
LINE1:tarif$BLACK_COLOR:"Tarief"
Here is a graph of the "1 day" result: [graph: 1 day]
Here is a graph of the "last 48 hrs": [graph: 2 days]
In order to see my "tarif" on the graphs together with the actual energy usage in KWatts, I multiply the 1 (low) or 2 (normal) setting by 100 and store that number in the database. On weekdays, the tariff changes at 23:00 hrs to "low" and back to normal at 07:00 hrs. On the "daily" graph this is reported correctly, albeit with a 5 min. sample-resolution "error".
As you can see, the "last 48 hrs" graph, set by "-2d" in the graphing instructions, should be using the same RRA data from the database, but the tarif line does not switch from 100 to 200; instead it "sits" at about halfway for approx. 1-1/2 hours, or more than 30 samples. This is not a rounding based on pixel-level resolution (;-).
The only explanation I have at this moment is that the "-2d" graph uses the second RRA, the one actually intended for a week's worth of data. If this is the answer, then how does one influence the relationship between the "-d/w/m/y" settings in the graphing instructions and the intended RRAs? I don't see a connection between the two, so it's probably selected automatically, and most likely (if it's that smart) based on the number of data points. The first RRA does not have enough, so it switches to using the second RRA. Seems plausible, but is this true?
BTW, using "-1w", "-1m" and "-1y" for the other graphs I use shows the same weird results on the tarif "signal".
In the meantime, I have been able to "fix" the graphs without redefining my database by adding a CDEF. Here is what I use for all graphs other than the "1d".
DEF:tarif=$DIR/energy_mon.rrd:tarif:AVERAGE \
CDEF:normaal=tarif,100,GT,200,100,IF \
LINE1:normaal$BLACK_COLOR:"Tarief"
To explain the CDEF: IF the stored tariff value is greater than 100, the result is set to 200, the high ("normaal") tariff; if not, it is set to 100, the low ("dal") tariff. This snaps the averaged values back to the two real levels.
Your advice to use the background colour for the tariff is a very good one, and because I can use that on the current database, I have tried it too.
So is your MIN/MAX solution; I thought about that earlier but put it off because it will require setting up a new database and filling it with previous data (which I have in .csv form, and I have gone through that tedious exercise before - which is why I have the --start in the rrdtool create).
But I would like to get to the bottom of this issue.
Do you, or anybody else, have any more words of wisdom to share, or can you verify/explain the workings of the RRA <==> DEF relationship or selection?

rrdtool: Compute 95th percentile of data within a sliding window

I'm using rrdtool to graph data about CPU usage as produced and stored by Munin. Munin (at least for us) stores each data-series in an .rrd file with 12 RRAs: "MIN", "MAX", and "AVERAGE" over each of the four periods "last 2d in 5m intervals", "last 9d in 30m intervals", "last 270d in 12h intervals", and "last 177y in 144d intervals".
I already know how to use rrdtool graph to produce a trend line indicating where my average CPU usage is going. (For simplicity, we can pretend I'm on a single-CPU system; in real life I have more code to deal with that.)
rrdtool graph /tmp/foo.png \
--start -12w --end +24w \
--lower-limit 0 --upper-limit 100 --rigid \
--title 'cpu usage' --width 620 --height 200 --border 0 \
--vertical-label 'cpu usage' \
DEF:idle=/var/lib/munin/mybox/mybox-cpu-idle-d.rrd:42:AVERAGE \
DEF:iowait=/var/lib/munin/mybox/mybox-cpu-iowait-d.rrd:42:AVERAGE \
CDEF:percent_used=100,idle,-,iowait,- \
AREA:percent_used#00880077:'cpu usage' \
VDEF:fit_m=percent_used,LSLSLOPE \
VDEF:fit_b=percent_used,LSLINT \
CDEF:trendline=percent_used,POP,fit_m,COUNT,*,fit_b,+ \
LINE1:trendline#FFBB00:'Trend since 12w ago'
The problem with this graph is that it shows only the average CPU usage trend. But my workload is spiky: usage is very low 90% of the time and then has brief spikes. What I really care about is the trend of the spikes in CPU usage.
So I could run the same command replacing AVERAGE with MAX... but the actual maxes are so randomly distributed (and usually close to 100%) that they don't produce any useful trend line.
So I'm thinking that the graph I actually want would be a graph of the 95th percentile (or maybe just the 75th percentile... ideally I'd be able to adjust the parameter), where that "percentile" is taken over the data in each consecutive 24-hour period.
Conceptually, I want to boil down our last 9 days of data (48 data points per day) into just 9 data points (1 data point per day — representing the Nth percentile of the 48 original points from that day).
And then I'd fit a line to that data using LSLSLOPE and LSLINT and display it on the same graph as the rest of this stuff.
But I can't figure out how to boil down the data in this way, using rrdtool's RPN facilities.
I know that I can use PERCENTNAN to get the scalar number that is the 95th percentile of my whole data-series, but I want a data-series consisting of 9 numbers, not just one scalar.
I know that I can use TRENDNAN to get a data-series that is the mean of a sliding window of my data-series, which would be good enough if only it gave me the median (50th percentile) instead of the mean, and then allowed me to adjust that parameter from "50" up to "95"... but it doesn't.
Alternatively, I know how to use Python to compute the series I want, by first running rrdtool fetch; but then there's no simple way to feed that series back into rrdtool to create the graph.
I'm thinking maybe I could extract usage_today, usage_yesterday, usage_2d, usage_3d,... into nine separate series, use PERCENTNAN on them all individually, and then somehow fit a line to that. But that's mostly desperate handwaving; if someone posted an answer that actually made that approach work, I'd accept it.
RRDTool has 95th-percentile functionality built in. Note, though, that the accuracy of the percentile calculations will depend on the granularity of the data available in the requested time period... so the bigger your 1-pdp RRA is, the better.
So, for example, to get a horizontal line at the 95th percentile, we can use these directives:
DEF:idlehr=/var/lib/munin/mybox/mybox-cpu-idle-d.rrd:42:AVERAGE:step=1
VDEF:pctidle=idlehr,95,PERCENTNAN
HRULE:pctidle#ff0000:95th_Percentile
The step=1 on the end of the DEF ensures that the highest-resolution data available will be selected. This may be computationally intensive if you're graphing a full year and high-resolution data are available for this time window!
The problem is, though, that you want a graph showing a different value for each day -- in effect, a sliding window of percentile calculations, in the same way as TREND and PREDICT work, but with a step of one day. RRDTool cannot do this.
So, the answer is, you can show a graph for one day with a single value percentile for that day. You cannot create a graph with one data point per day, where that data point is calculated as the percentile for that day.
The only way I can think of to achieve this is to call rrdtool xport iteratively, calculating the percentile value for each day in the sequence, and then use that data to generate a bar graph in another graphing package.
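A sketch of that loop (using rrdtool graph with PRINT rather than xport, since PRINT can emit a single VDEF result per run; the .rrd paths and DS index 42 are taken from the question, the rest is illustrative):
now=$(date +%s)
for day in 1 2 3 4 5 6 7 8 9; do
  end=$(( now - (day - 1) * 86400 ))
  start=$(( end - 86400 ))
  rrdtool graph /dev/null --start $start --end $end \
    DEF:idle=/var/lib/munin/mybox/mybox-cpu-idle-d.rrd:42:AVERAGE \
    DEF:iowait=/var/lib/munin/mybox/mybox-cpu-iowait-d.rrd:42:AVERAGE \
    CDEF:percent_used=100,idle,-,iowait,- \
    VDEF:p95=percent_used,95,PERCENTNAN \
    PRINT:p95:"%6.2lf"
done
Each iteration prints the 95th percentile of CPU usage for one day (rrdtool graph also prints an image-size line, so filter the output as needed). Those nine numbers are the per-day series you can then plot, or fit a line to, in another tool.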