Get second-last modified data from an RRD file using rrdtool

How do I get the second-last modified data from an RRD file using rrdtool?
With the command rrdtool lastupdate we can get only the last modified data. I want to get the second-last modified data.
Can anyone tell me?

If you mean to get the actual data submitted, then you cannot do this. Remember that RRDTool stores normalised and consolidated data, not the raw data.
rrdtool lastupdate will give you the point in time and the raw data value(s) of the last actual update, before normalisation and consolidation. This is stored so that ongoing rates can be calculated. After the next update, this data is normalised and consolidated, so it is no longer available.
You can use rrdtool fetch to obtain the last entries in any RRA (after normalisation and consolidation). You can specify which RRA to use by giving the requested data resolution and consolidation function. Depending on the nature of your data (Gauge vs. Counter) and the time of submission (on the interval boundary or not), this may or may not match the values you submitted.
So, in summary, if you have a 5-minute-interval RRD, with a 1cdp=1pdp AVG RRA, and you submit data at 11:59, 12:04 and 12:08, then lastupdate will give you "12:08" plus the data value(s) submitted; fetch will give you "12:00" (the start of the only completed 5-min time bucket) plus the normalised data for the 12:00-12:05 bucket.
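For illustration, a minimal sketch using the python-rrdtool bindings (the rrdtool module, which takes CLI-style arguments); the filename and the 300-second step are assumptions matching the example above:

import rrdtool

# Timestamp and raw value(s) of the very last update, pre-normalisation.
print(rrdtool.lastupdate("example.rrd"))

# Normalised data from the AVERAGE RRA; the last completed 5-minute
# bucket is the closest thing to a "second-last" value once the RRD
# has rolled past it.
(start, end, step), ds_names, rows = rrdtool.fetch(
    "example.rrd", "AVERAGE",
    "--resolution", "300",   # select the 5-minute RRA
    "--start", "-900",       # the last 15 minutes
    "--end", "now")
for i, row in enumerate(rows):
    print(start + (i + 1) * step, dict(zip(ds_names, row)))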

Related

What is the maximum number of data values that amCharts can handle?

We are using amCharts 4 to show trend logs, and sometimes we end up with a lot of data that has to go into the chart. We'd like to know the maximum number of data points a chart can handle, so we know how much to aggregate (to reduce the data point count) before sending the data into the package. To show as accurate a representation of the data as possible, we don't want to aggregate more aggressively than we have to. Our charts are x/y charts plotting value vs. date/time for up to 8 series.
In one case, we have a data set well in excess of 600,000 data points across 8 series, and loading it into the chart, even in batches (i.e., loading one batch, then adding the remaining batches in turn), causes the charting package to run out of memory. In our test, it ran out of memory on the third batch, once the running total exceeded 600,000 data points, preventing further batches from being loaded. For large sites that use our product, it is quite common to have that much data when the user wants to chart 6 months or a year's worth of history; so it's important that we can show some representation of all that data, which is where aggregation comes in.
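As an illustration of the kind of aggregation meant here, a minimal mean-per-bucket downsampling sketch in Python; the bucket size and data shape are purely illustrative, and this is pre-processing done before handing data to amCharts, not an amCharts API:

from statistics import mean

def downsample(points, bucket_size):
    # Aggregate [(timestamp, value), ...] into one mean point per bucket.
    out = []
    for i in range(0, len(points), bucket_size):
        bucket = points[i:i + bucket_size]
        midpoint = bucket[len(bucket) // 2][0]   # representative timestamp
        out.append((midpoint, mean(v for _, v in bucket)))
    return out

# e.g. 600,000 raw points per series -> ~2,000 chart points:
# chart_data = downsample(raw_series, bucket_size=300)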

AWS Forecast cannot train the predictor due to missing data

This question is close, but doesn't quite help me with a similar issue, as I am using a single dataset and no related time series.
I am using AWS Forecast with a single time-series dataset (no related data, just the main DS). It is a daily dataset with about 10 years of data, ranging from 2010-2020.
I have 3572 data points in the original dataset; I manually filled missing data to ensure there were no missing days in the date range, for a total of 3739 data points. I lopped off everything in 2020 to create a validation dataset and then configured the predictor for a 180-day forecast. I keep getting the following error:
Unable to evaluate this dataset because there is missing data in the evaluation window for all items. Ensure that there is complete data for at least one item in the evaluation window starting from 2019-03-07T00:00:00 up to 2020-01-01T00:00.
There is definitely no missing data; I've double- and triple-checked the date range and the data fill, and every day between the start and end dates has a data point. I also tried adding a data point for 1/1/2020 (the set ended at 12/31/2019), and I continue to get this error. I can't figure out what it's asking me for, unless I'm missing something in my math about the forecast horizon and the backtest window offset?
Dataset example:
Brief model parameters (can share more if I'm missing something pertinent):
Total data points in training data: 3479
forecastHorizon = 180
create_predictor_response = forecast.create_predictor(
    PredictorName=predictorName,
    ForecastHorizon=forecastHorizon,
    PerformAutoML=True,
    PerformHPO=False,
    EvaluationParameters={"NumberOfBacktestWindows": 1,
                          "BackTestWindowOffset": 180},
    InputDataConfig={"DatasetGroupArn": datasetGroupArn},
    FeaturizationConfig={"ForecastFrequency": 'D'})
I noticed you don't have an entry for 6/24/10 (this American date format is the worst, btw).
I faced a similar problem when leaving out days just like that (assuming you're modelling at daily frequency) and letting Forecast's automatic gap filling insert NaN values (as opposed to zero, which is the default). I suggest you:
pre-fill literally every date within the range of the training data (and of the forecast window, if using related data)
choose zero as the option for automatic filling of missing values; I think mean or any other float value would also work, for that matter (see the sketch below)
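To illustrate the second suggestion, a minimal sketch using the boto3 Forecast client, extending the create_predictor call from the question. The attribute name target_value is the default target field for a CUSTOM-domain dataset; adjust it, and the placeholder variables, to your schema:

import boto3

forecast = boto3.client("forecast")

# predictorName and datasetGroupArn as defined in the question.
create_predictor_response = forecast.create_predictor(
    PredictorName=predictorName,
    ForecastHorizon=180,
    PerformAutoML=True,
    PerformHPO=False,
    EvaluationParameters={"NumberOfBacktestWindows": 1,
                          "BackTestWindowOffset": 180},
    InputDataConfig={"DatasetGroupArn": datasetGroupArn},
    FeaturizationConfig={
        "ForecastFrequency": "D",
        # Explicit filling featurization: fill gaps with zero, not NaN.
        "Featurizations": [{
            "AttributeName": "target_value",
            "FeaturizationPipeline": [{
                "FeaturizationMethodName": "filling",
                "FeaturizationMethodParameters": {
                    "frontfill": "none",
                    "middlefill": "zero",
                    "backfill": "zero",
                },
            }],
        }],
    },
)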
Let me know if that works! I am also using Forecast, and it's good to keep track of possible problems and solutions.

How does Amazon Redshift reconstruct a row from columnar storage?

Amazon describes columnar storage like this:
So I guess this means that in what PostgreSQL would call the "heap", blocks contain all the values for one column, then the next column, and so on.
Say I want to query for all people in their 30s, and I want to know their names. Columnar storage means less I/O is required to read just the age of every row and find those that are 30-something, because none of the other columns need to be read. Also, some efficient compression can be applied. That's neat, I guess.
Then what? This data structure alone doesn't explain how anything useful can happen after that. After determining which records are 30-something, how are the associated names found? What data structure is used? What are its performance characteristics?
If the Age column is the Sort Key, then the rows in the table will be stored in order of Age. This is great, because each 1MB storage block on disk keeps data for only one column, and it keeps a note of the minimum and maximum values within the block.
Thus, searching for the rows that contain an Age of 30 means that Redshift can "skip over" blocks that do not contain Age=30. Since reading from disk is the slowest part of a database, this means it can operate much faster.
Once it has found the blocks that potentially contain Age=30, it reads those blocks from disk. Blocks are compressed, so they might contain much more data than the 1MB on disk. This means many rows can be read with fewer disk accesses.
Once those blocks are decompressed into memory, it finds the rows with Age=30 and then loads the corresponding blocks for the Name column. The compression ratio would be different for the Name column since it is text and is not sorted, so this might result in loading more blocks from disk for Name than for Age.
Redshift then assembles the data from Name and Age for the desired rows and performs any remaining operations.
These operations are also parallelized across multiple nodes based on the Distribution Key, which distributes data based on a given column (or replicates it across nodes for frequently used tables). Data is typically distributed on a column that is frequently used in JOIN statements, so that related data is co-located on the same node. Each node returns its data to the Leader Node, which combines it and provides the final results.
Bottom line: Minimise the amount of data read from disk and parallelize operations on separate nodes.
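To see why the min/max metadata helps, here is a toy sketch of the block-skipping idea in Python; the blocks and their metadata are invented, and this models the concept rather than Redshift's actual implementation:

# Toy model: a sorted Age column split into blocks, each carrying
# min/max metadata like Redshift keeps for every 1MB block.
blocks = [
    {"min": 18, "max": 29, "values": [18, 21, 25, 29]},
    {"min": 29, "max": 41, "values": [29, 30, 33, 41]},  # only block read
    {"min": 42, "max": 65, "values": [42, 50, 65]},
]

def scan_for(age, blocks):
    for block in blocks:
        # Skip any block whose min/max range cannot contain the value.
        if block["min"] <= age <= block["max"]:
            yield from (v for v in block["values"] if v == age)

print(list(scan_for(30, blocks)))  # touches one block instead of three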
AFAIK every value in the columnar storage has an ID pointer (similar to the CTID you mentioned), and to produce the SELECT results, Redshift needs to find and combine the values with the same ID pointer for each selected column. If memory allows, this is done in memory; otherwise it spills to disk. This process is called materialization (not to be confused with materialized-view materialization). In your case there are two technically possible scenarios:
materialize all Age/Name pairs, then filter by Age=30, and output the result
filter Age column by Age=30, get IDs, get Name values with corresponding IDs, materialize pairs and output
I guess in this case #2 is what happens, because materialization is more expensive than filtering. However, there are plenty of scenarios where this is much less obvious (with complex queries and aggregations). It is the responsibility of the query optimizer to decide which is better. Either way, #1 is still better than row-oriented storage, because it would still read just the two columns.
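A toy contrast of the two scenarios in Python (real Redshift operates on compressed blocks; plain positional indices stand in for the ID pointers here):

ages = [25, 30, 41, 30]             # Age column; positions act as IDs
names = ["Ann", "Bob", "Cy", "Di"]  # Name column, same ID order

# Scenario 1: materialize all (Age, Name) pairs first, then filter.
rows = list(zip(ages, names))
result1 = [name for age, name in rows if age == 30]

# Scenario 2: filter the Age column first, then fetch only matching Names.
ids = [i for i, age in enumerate(ages) if age == 30]
result2 = [names[i] for i in ids]

assert result1 == result2 == ["Bob", "Di"]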

RRDtool graph is not generated correctly for monthly and yearly ranges

I have a case where I have collected SNMP data and stored it via rrdtool.
The daily and weekly graphs come out correctly, but when I view the monthly and yearly graphs, they show only a few days' worth of data rather than the full range, as shown below.
Daily graph code (working correctly):
/usr/bin/rrdtool graph /opt/elitecore/ManageEngine/AppManager11/working/graphs/daily-tps.png -v "TPS" -t "TIME" DEF:tps1=/root/graphs/Total_TPS.rrd:TPS:MAX -s -86400 CDEF:tps2=tps1,300,* LINE1:tps2#ff0000:TOTAL_TPS GPRINT:tps2:LAST:"Cur: %5.2lf" GPRINT:tps2:AVERAGE:"Avg: %5.2lf" GPRINT:tps2:MAX:"Max: %5.2lf" GPRINT:tps2:MIN:"Min: %5.2lf\t\t\t"
Monthly graph code (graph not coming out as expected):
/usr/bin/rrdtool graph /opt/elitecore/ManageEngine/AppManager11/working/graphs/monthly-tps.png -v "TPS" -t "WEEK" DEF:tps1=/root/graphs/Total_TPS.rrd:TPS:MAX -s -2592000 CDEF:tps2=tps1,300,* LINE1:tps2#ff0000:TOTAL_TPS GPRINT:tps2:LAST:"Cur: %5.2lf" GPRINT:tps2:AVERAGE:"Avg: %5.2lf" GPRINT:tps2:MAX:"Max: %5.2lf" GPRINT:tps2:MIN:"Min: %5.2lf\t\t\t"
Yearly graph code (graph not coming out as expected):
/usr/bin/rrdtool graph /opt/elitecore/ManageEngine/AppManager11/working/graphs/yearly-tps.png -v "TPS" -t "MONTH" DEF:tps1=/root/graphs/Total_TPS.rrd:TPS:MAX -s -31536000 CDEF:tps2=tps1,300,* LINE1:tps2#ff0000:TOTAL_TPS GPRINT:tps2:LAST:"Cur: %5.2lf" GPRINT:tps2:AVERAGE:"Avg: %5.2lf" GPRINT:tps2:MAX:"Max: %5.2lf" GPRINT:tps2:MIN:"Min: %5.2lf\t\t\t"
Kindly let me know if I am doing anything wrong.
Yours faithfully,
Jignesh Dholakiya
Answer
The graphs only show about five days' worth of data because that is all the data there is in your RRD. Your RRD is configured to automatically discard any data older than this.
Explanation
The graphs show that your RRD currently holds only a little over 5 days' worth of data to display. As you cannot graph data which you do not have, the graphs show what they do have, and nothing for the rest.
Your rrdtool info gives this for RRA definitions (trimmed for clarity):
step = 300
rra[0].cf = "MAX"
rra[0].rows = 1500
rra[0].pdp_per_row = 1
rra[0].xff = 5.0000000000e-01
This means that you have a single RRA, type MAX, which has 1pdp per row and 1500 rows.
As a result, your RRA spans (step) x (pdp per row) x (number of rows), which is 300 x 1 x 1500 = 450,000 seconds -- a little over 5 days.
Since your RRD only has a single RRA, all of your graph functions will use this one -- doing additional consolidation on the fly if necessary. Thus all your graphs use this single RRA.
However, your RRA is only 5-and-a-bit days long. Therefore, data will be expired and discarded when it is this old. As a result, only the last 5-and-a-bit days' worth of data are available at any time for graphing, which is what you see in the graphs.
Solution:
You need to keep the data for longer. There are two ways to do this --
Increase the length of the existing RRA
Create additional RRAs to hold the consolidated data for the lower-resolution graphs.
Option 1 is the simplest, as you can use rrdtool tune to grow the size of RRA number 0. However, it is very expensive in disk space (since you will be keeping the full-resolution data for the entire time period) and in CPU (RRDtool will have to consolidate on the fly when making yearly graphs). This option is only recommended if you really need the high-resolution data for the entire period -- for example, if you are calculating 95th percentiles.
Option 2 is the best. You add a new RRA, with the same CF but a larger pdp_per_row, for each graph you wish to create. For a weekly graph, use pdp_per_row=6 (half-hour consolidation); for monthly, use 24 (two-hourly); and for yearly, use 288 (daily consolidation). As time goes by, the data will be consolidated up into these new RRAs, and the graph functions will use them in preference. This is less computationally expensive and uses less disk space; however, you lose the high-resolution data over time, and your historical data will not be automatically consolidated into the new RRAs. Also, you cannot just add a new RRA to an existing RRD file -- you will need to either create a new RRD (see the sketch below) or use a tool such as rrdmerge.
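As a sketch of what Option 2's RRA set could look like when creating a new RRD, via the python-rrdtool bindings; the DS definition (type, heartbeat, bounds) and the row counts are assumptions to adapt to your setup:

import rrdtool

rrdtool.create(
    "Total_TPS_new.rrd",
    "--step", "300",            # 5-minute base step, as before
    "DS:TPS:GAUGE:600:0:U",     # assumption -- match your existing DS
    "RRA:MAX:0.5:1:1500",       # existing RRA: full resolution, ~5 days
    "RRA:MAX:0.5:6:1344",       # 30-minute buckets, ~4 weeks
    "RRA:MAX:0.5:24:1488",      # 2-hour buckets, ~4 months
    "RRA:MAX:0.5:288:732",      # 1-day buckets, ~2 years
)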

RRD graphs in Zenoss showing NaN on large time ranges

I am trying to create a COMMAND JSON datasource to monitor some values, for example from a script like this:
import json
from random import random

print(json.dumps({
    'values': {
        '': {'random': random()},
    },
    'events': [],
}))
When I start zencommand, the appropriate rrd file is created, but the cur, avg and max values on the graph show NaN. The NaNs are replaced by actual numbers when I zoom in to the current point in time, which is not very far from the start of monitoring.
Why doesn't it show correct min, max and avg values before I zoom in? Is that somehow related to consolidation? I read http://www.vandenbogaerdt.nl/rrdtool/min-avg-max.php, but that page doesn't say anything about NaN values.
And is there any way to zoom in to the current timestamp more quickly, to see some data sooner?
When you are zoomed out, you'll be looking at the lower-granularity RRAs (Round Robin Archives). These do not get populated until enough data are in the higher-granularity ones; so, for example, if you have a 5min-granularity RRA, a 1hr-granularity RRA, and a 1day-granularity RRA, and have collected data for the last 45 minutes, then you will see ~9 data points in your 'daily' graph (which uses the 5min RRA), but nothing in your 'monthly' graph (which uses the 1hr RRA) or your 'yearly' graph (which uses the 1day RRA).
This applies to any RRA consolidation function: AVERAGE, LAST, MAX, and so on. Until the consolidation time window is complete, and the full complement of Primary Data Points has been collected for consolidation, the consolidated data point's value is undefined.
RRDTool picks the RRA to use based on the requested graph time range and pixel width, as well as the requested consolidation function. Although there are ways to force RRDtool to use a higher-granularity RRA than it needs, and to consolidate on the fly, this is inefficient and slow. It also makes the lower-granularity RRAs pointless and throws away one of RRDtool's major benefits: consolidation at update time, which makes graphing faster.
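To make the arithmetic concrete, a tiny plain-Python sketch using the 5min/1hr/1day layout from the example above:

step = 300           # base step: 5 minutes
collected = 45 * 60  # 45 minutes of data so far

for label, pdp_per_row in [("5min", 1), ("1hr", 12), ("1day", 288)]:
    bucket = step * pdp_per_row          # seconds per consolidated point
    print(label, "RRA:", collected // bucket, "complete points")

# 5min RRA: 9 complete points  -> visible on the 'daily' graph
# 1hr RRA: 0 complete points   -> 'monthly' graph shows NaN
# 1day RRA: 0 complete points  -> 'yearly' graph shows NaN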