Write data to an RRD at irregular intervals - rrdtool

I'm trying to find out if I can store values captured at irregular intervals in an RRD.
I have a script which connects to an ActiveMQ server, subscribes to a queue or topic, and compares the message header timestamp with Time.now to give me a delta.
The data I get from my script looks like this:
000000.681 Time Delta
000000.793 Time Delta
000000.583 Time Delta
000001.994 Time Delta
The issue I face is that messages from ActiveMQ don't necessarily arrive at a regular interval (e.g. 1/sec, 1 every 2 sec). At peak times they could arrive at 5 a second, and at quiet times 1 every 10 seconds.
I'd like to be able to capture the output into an RRD so I can graph against it, but having had a look around on the internet it's not clear whether this can be done, or whether I'd be better off using another database/store to capture the data.
The eventual output I'd like is a graph showing the time delta for each message.
Having had a read of the docs, it looks like I could set the RRD --step to 1 second and the heartbeat to 2 seconds.
I found a couple of posts here and here which talk about being careful with the intervals and the fact that my data might be averaged, smoothed or otherwise messed about with when written to the RRD. But nothing I've found online has a use case similar to mine, so it's a bit hard to know where I should be looking. I'd like my data to be stored as a point for each message received.
I have a couple of RRDs set up for testing; one takes the AVERAGE, the other takes the LAST, to produce some graphs. My heartbeat is set to 100 seconds, but the interval is set to 1. I'm now getting data which looks correct. I'm also guessing that the empty spaces in the graph from the LAST RRA are due to my data coming in slower than 1 per second?
I'll post my create code & output as an answer.

rrdtool will always store data at regular intervals. As data is handed over to rrdtool, it first gets re-sampled to the --step interval, and then further consolidated to the intervals set up in the RRAs.
The exact arrival time of the data (to the millisecond) is taken into account as the re-sampling takes place.
If two data points are further apart than specified by mrhb (the minimum required heartbeat), the data is considered non-continuous and rrdtool will store 'unknown' for the affected interval.
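To illustrate (a minimal sketch with a hypothetical file name, using the 1-second step and 2-second heartbeat discussed in the question):
rrdtool create resample-demo.rrd --start 1420070400 --step 1 \
'DS:ds0:GAUGE:2:0:U' \
'RRA:AVERAGE:0.5:1:3600'
# Updates may arrive at irregular, sub-second times; rrdtool re-samples
# them onto the 1-second step boundaries.
rrdtool update resample-demo.rrd 1420070401.681:0.681
rrdtool update resample-demo.rrd 1420070402.793:0.793
# This gap is larger than the 2-second heartbeat, so the intervals in
# between are stored as unknown.
rrdtool update resample-demo.rrd 1420070410.100:1.994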

I ended up making two RRDs to experiment with:
rrdtool create test1.rrd \
--step '1' \
'DS:ds0:GAUGE:5:0:U' \
'RRA:AVERAGE:0.5:1:86400' \
'RRA:MAX:0.5:1:86400' \
'RRA:AVERAGE:0.5:60:10080' \
'RRA:MAX:0.5:60:10080' \
'RRA:AVERAGE:0.5:120:21600' \
'RRA:MAX:0.5:120:21600' \
'RRA:AVERAGE:0.5:300:105120' \
'RRA:MAX:0.5:300:105120'
and
rrdtool create test.rrd \
--step '1' \
'DS:ds0:GAUGE:5:0:U' \
'RRA:AVERAGE:0.5:1:86400' \
'RRA:LAST:0.5:1:86400' \
'RRA:AVERAGE:0.5:60:10080' \
'RRA:LAST:0.5:60:10080' \
'RRA:AVERAGE:0.5:120:21600' \
'RRA:LAST:0.5:120:21600' \
'RRA:AVERAGE:0.5:300:105120' \
'RRA:MAX:0.5:300:105120'
This allows me to store:
1 sec, archive kept for 1 day back
1 min, archive kept for 7 days back
2 min, archive kept for 30 days back
5 min, archive kept for 1 year back
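(The retention follows from step × steps-per-row × rows: 1 s × 1 × 86400 = 1 day, 1 s × 60 × 10080 = 7 days, 1 s × 120 × 21600 = 30 days, and 1 s × 300 × 105120 = 365 days.)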
Which makes these nice graphs:
The graphs were made in PHP with the following code:
<?php
$opts = array(
'--width', '600',
'--height', '100',
'--title', 'Avg Time Delta xxxxxxxxxx (Last 1 Hr)',
'--vertical-label', 'Time Delta',
'--watermark', 'xxxxxxxxxx',
'--start', 'end-1h',
'DEF:out=test.rrd:ds0:AVERAGE',
'DEF:max=test.rrd:ds0:MAX',
'AREA:out#9966FF:Avg Time Delta',
'LINE:max#996600:Max Time Delta',
);
$ret = rrd_graph("graphs/1hr-graph.png", $opts);
if( !is_array($ret) )
{
$err = rrd_error();
echo "rrd_graph() ERROR: $err\n";
}
echo '<img src="http://server/graphs/1hr-graph.png">';
echo '<BR>';
?>
<?php
$opts = array(
'--width', '600',
'--height', '100',
'--title', 'Last Time Delta xxxxxxxxxx (Last 1 Hr)',
'--vertical-label', 'Time Delta',
'--watermark', 'xxxxxxxxxx',
'--start', 'end-1h',
'DEF:avg=test1.rrd:ds0:AVERAGE',
'DEF:last=test1.rrd:ds0:LAST',
'AREA:avg#99AAFF:Avg Time Delta',
'LINE:last#99AA00:Last Time Delta',
);
$ret = rrd_graph("graphs/1hr-last.png", $opts);
if( !is_array($ret) )
{
$err = rrd_error();
echo "rrd_graph() ERROR: $err\n";
}
echo '<img src="http://server/graphs/1hr-last.png">';
?>
From my own sanity checking and watching the data in realtime, it looks like both of those graphs are correct, but they behave in slightly different ways. When the data feed this is monitoring is quiet and I'm only getting 1 msg every 10 sec, I get a lot of gaps in the LAST graphs, whereas the AVERAGE graphs are smoothed out to fill the gaps. I also tried setting another RRD to ABSOLUTE, but the graphs for that look 'wrong' and the times are all below 1.0.
So it looks like I can feed my RRD at whatever interval I like from my script. The RRD will re-sample my data to its defined interval (in my case 1 sec) and then do what it needs to do based on the data source type (GAUGE, ABSOLUTE, etc.). With my heartbeat set to 100 I should always receive some data before that 100 sec timeout, thus avoiding NaN entries in my database.
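In practice the feed is just one update per message, timestamped with 'now' (a sketch; $delta stands in for the measured time delta):
# Issue one update per received message; 'N' uses the current time, so the
# message arrival time is what gets re-sampled onto the 1-second step.
rrdtool update test.rrd "N:$delta"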
At the moment I can't tell how well behaved this config will be during times of disruption (e.g. delayed messages from the AMQ server). I will try to run some tests when I get some spare time and report back with anything significant.

Related

RRD Database have no Data after 3 days (everything is truncated)

I have created an RRD database (under PHP) with this code:
$opts = array( "--step", "60",
"DS:wattmin:GAUGE:300:0:8000",
"RRA:AVERAGE:0.5:1:2160",
"RRA:AVERAGE:0.5:5:2016",
"RRA:AVERAGE:0.5:15:2880",
"RRA:AVERAGE:0.5:60:8760",
"RRA:MIN:0.5:1:2160",
"RRA:MIN:0.5:5:2016",
"RRA:MIN:0.5:15:2880",
"RRA:MAX:0.5:60:8760",
"RRA:MAX:0.5:1:2160",
"RRA:MAX:0.5:5:2016",
"RRA:MAX:0.5:15:2880",
"RRA:MAX:0.5:60:8760");
$ret = rrd_create(RRD_DB_WP, $opts);
The image is created like this:
$graphs=array("-6h","-12h","-1d","-1w","-1m","-1y");
$opts = array(
"-e now",
"--vertical-label=°C",
"-h 250",
"DEF:inoctets=".RRD_DB_WP.":wattmin:AVERAGE",
"AREA:inoctets#60B5E8:Watt/min",
"GPRINT:inbits:LAST:Las\: %4.0lfW",
"GPRINT:inbits:AVERAGE:Avg\: %4.0lfW",
"GPRINT:inbits:MAX:Max\: %4.0lfW\\n",
"DEF:grad8=".RRD_DB_TEMPS.":grad8:LAST",
"LINE2:grad8#F5A9A9:Wasser ",
"GPRINT:grad8:LAST:Las\: %2.1lf°C",
"GPRINT:grad8:AVERAGE:Avg\: %2.1lf°C",
"GPRINT:grad8:MIN:Min\: %2.1lf°C",
"GPRINT:grad8:MAX:Max\: %1.1lf°C\\n" );
$ret = rrd_graph(RRD_OUT_PATH. "/waermepumpe".$graph.".gif", $opts);
Everything works fine, but all data after 3 days is truncated and the created graphs after that time are always empty.
If your graphs are showing no data past 3 days, then they may be being generated from the smaller RRA for some reason, which does not hold data back that far.
Possible reasons for this:
You have incorrectly created your RRD file and the larger RRAs are missing. Use rrdtool info to verify the RRD file structure is as you expect (see the sketch after this list).
You have not been collecting data for that long. Use rrdtool dump to view the content of the RRAs and confirm there really is data stored.
You have changed your RRA sizes or added new ones (using rrdtool tune) and, though they now exist, they do not yet have data.
Your graphing is using incorrect start and endpoints for the time window
Your graphing is forcing 60s granularity - and hence the smaller RRA - but is trying to graph a time window outside of that RRA
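A quick check along these lines (the file name here is hypothetical) shows whether the larger RRAs exist and whether they actually contain data:
# List the RRAs defined in the file; the rra[N].cf, rra[N].pdp_per_row and
# rra[N].rows entries should match what you intended to create.
rrdtool info waermepumpe.rrd | grep -E 'rra\[[0-9]+\]\.(cf|pdp_per_row|rows)'
# Dump the whole database to XML and check whether the larger RRAs hold
# anything other than NaN rows.
rrdtool dump waermepumpe.rrd > waermepumpe.xml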

Long lived state with Google Dataflow

Just trying to get my head around the programming model here. Scenario is I'm using Pub/Sub + Dataflow to instrument analytics for a web forum. I have a stream of data coming from Pub/Sub that looks like:
ID | TS | EventType
1 | 1 | Create
1 | 2 | Comment
2 | 2 | Create
1 | 4 | Comment
And I want to end up with a stream coming from Dataflow that looks like:
ID | TS | num_comments
1 | 1 | 0
1 | 2 | 1
2 | 2 | 0
1 | 4 | 2
I want the job that does this rollup to run as a stream process, with new counts being populated as new events come in. My question is, where is the idiomatic place for the job to store the state for the current topic id and comment counts? Assuming that topics can live for years. Current ideas are:
Write a 'current' entry for the topic id to BigTable and in a DoFn query what the current comment count for the topic id is coming in. Even as I write this I'm not a fan.
Use side inputs somehow? It seems like maybe this is the answer, but if so I'm not totally understanding.
Set up a streaming job with a global window, with a trigger that goes off every time it gets a record, and rely on Dataflow to keep the entire pane history somewhere. (unbounded storage requirement?)
EDIT: Just to clarify, I wouldn't have any trouble implementing any of these three strategies, or a million different other ways of doing it, I'm more interested in what is the best way of doing it with Dataflow. What will be most resilient to failure, having to re-process history for a backfill, etc etc.
EDIT2: There is currently a bug with the dataflow service where updates fail if adding inputs to a flatten transformation, which will mean you'll need to discard and rebuild any state accrued in the job if you make a change to a job that includes adding something to a flatten operation.
You should be able to use triggers and a combine to accomplish this.
PCollection<ID> comments = /* IDs from the source */;
PCollection<KV<ID, Long>> commentCounts = comments
    // Produce speculative results by triggering as data comes in.
    // Note that this won't trigger after *every* element, but it will
    // trigger relatively quickly (as the system divides incoming data
    // into work units). You could also throttle this with something like:
    //   AfterProcessingTime.pastFirstElementInPane()
    //       .plusDelayOf(Duration.standardMinutes(5))
    // which will produce output every 5 minutes.
    .apply(Window.triggering(
            Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
        .accumulatingFiredPanes())
    // Count the occurrences of each ID.
    .apply(Count.perElement());
// Produce an output String -- in your use case you'd want to produce
// a row and write it to the appropriate sink.
commentCounts.apply(ParDo.of(new DoFn<KV<ID, Long>, String>() {
  @Override
  public void processElement(ProcessContext c) {
    KV<ID, Long> element = c.element();
    // The pane info includes a strictly increasing index of the number
    // of panes that have been produced for this key.
    PaneInfo pane = c.pane();
    c.output(element.getKey() + " | " + pane.getIndex() + " | " + element.getValue());
  }
}));
Depending on your data, you could also read whole comments from the source, extract the ID, and then use Count.perKey() to get the counts for each ID. If you want a more complicated combination, you could look at defining a custom CombineFn and using Combine.perKey.
Since BigQuery does not support overwriting rows, one way to go about this is to write the events to BigQuery, and query the data using COUNT:
SELECT ID, COUNT(num_comments) from Table GROUP BY ID;
You can also do per-window aggregations of num_comments within Dataflow before writing the entries to BigQuery; the query above will continue to work.

WSO2 CEP Siddhi Queries

New to Siddhi CEP. Other than the regular docs on WSO2 CEP, can someone point to a good tutorial?
Here are our requirements. Point out some clues on the right ways of writing such queries.
We have a single stream of sensor device notifications (IoT application).
Stream input is via REST-JSON and output is also to be formatted as REST-JSON. (Hope this is possible on WSO2 CEP 3.1.)
Kind of execution plan required:
- If a device notification reports usage of Sensor 1, then monitor to see if, within 5 mins, a device notification also reports usage of Sensor 2. If found, then generate an output stream reporting composite-activity back on REST-JSON.
- If such composite-activity is not detected during a time slot in the morning, afternoon and evening, then generate a warning-event-stream status on REST-JSON. (So how do we find events which did not occur in time?)
- If such composite-activity is not found within some time slots in the morning, afternoon and evening, then report a failure1-event-stream status back on REST-JSON.
This should work day after day, so how will the previously processed data get deleted in WSO2 CEP?
Regards,
Amit
The queries can be as follows (these are draft queries and may require slight modifications to get them running).
To detect sensor 1 and then sensor 2 within 5 minutes (assuming sensorStream has id, value), you can simply use a pattern like the following with the 'within' keyword:
from e1=sensorStream[sensorId == '1'] -> e2=sensorStream[sensorId == '2']
select 'composite activity detected' as description, e1.value as sensor1Value, e2.value as sensor2Value
within 5 minutes
insert into compositeActivityStream;
To detect non-occurrences (id=1 arrives, but no id=2 within 5 minutes) we can use the following two queries:
from sensorStream[sensorId == '1']#window.time(5 minutes)
select *
insert into delayedSensor1Stream for expired-events;
from e1=sensorStream[sensorId == '1'] -> nonOccurringEvent = sensorStream[sensorId == '2'] or delayedEvent=delayedSensor1Stream
select 'id=2 not found' as description, e1.value as id1Value, nonOccurringEvent.sensorId as nonOccurringId
having (not(nonOccurringId instanceof string))
insert into nonOccurrenceStream;
This will detect non-occurrences immediately at the end of the 5 minutes after the arrival of the id=1 event.
For an explanation of the above logic, have a look at the non-occurrence sample of CEP 4.0.0 (the syntax is a bit different, but it's the same idea).
Now, since you need to periodically generate a report, we need another query. For convenience I assume you need a report every 6 hours (360 minutes) and use a time batch window here. Alternatively, with the new CEP 4.0.0 you can use the 'cron window' to generate this at specific times, which is better for your use case.
from nonOccurrenceStream#window.timeBatch(360 minutes)
select count(id1Value) as nonOccurrenceCount
insert into nonOccurrenceReportsStream for expired-events;
You can use http input/output adaptors and do json mappings with json builders and formatters for this use case.

How can I check if a message is about to pass the MessageRetentionPeriod?

I have an app that uses SQS to queue jobs. Ideally I want every job to be completed, but some are going to fail. Sometimes re-running them will work, and sometimes they will just keep failing until the retention period is reached. I want to keep failing jobs in the queue as long as possible, to give them the maximum possible chance of success, so I don't want to set a maxReceiveCount. But I do want to detect when a job reaches the MessageRetentionPeriod limit, as I need to send an alert when a job fails completely. Currently I have the max retention at 14 days, but some jobs will still not be completed by then.
Is there a way to detect when a job is about to expire, and from there send it to a deadletter queue for additional processing?
Before you follow my advice below: assuming I've done the math for the periods correctly, you will be better off enabling a redrive policy on the queue if you check for messages less often than every 20 minutes and 9 seconds.
SQS's "redrive policy" allows you to migrate messages to a dead letter queue after a threshold number of receives. The maximum receive count that AWS allows for this is 1000, and over 14 days that works out to about 20 minutes per receive. (For simplicity, that assumes your job never misses an attempt to read queue messages. You can tweak the numbers to build in a tolerance for failure.)
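If you go that route, the redrive policy is a one-time queue setting; a minimal sketch using the AWS CLI (the queue URL and dead-letter ARN are placeholders):
aws sqs set-queue-attributes \
    --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/my-queue \
    --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:us-east-1:123456789012:my-deadletter\",\"maxReceiveCount\":\"1000\"}"}'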
If you check more often than that, you'll want to implement the solution below.
You can check for this "cutoff date" (when the job is about to expire) as you process the messages, and send messages to the deadletter queue if they've passed the time when you've given up on them.
Pseudocode to add to your current routine:
Call GetQueueAttributes to get the count, in seconds, of your queue's Message Retention Period.
Call ReceiveMessage to pull messages off of the queue. Make sure to explicitly request that the SentTimestamp is visible.
Foreach message,
Find your message's expiration time by adding the message retention period to the sent timestamp.
Create your cutoff date by subtracting your desired amount of time from the message's expiration time.
Compare the cutoff date with the current time. If the cutoff date has passed:
Call SendMessage to send your message to the Dead Letter queue.
Call DeleteMessage to remove your message from the queue you are processing.
If the cutoff date has not passed:
Process the job as normal.
Here's an example implementation in Powershell:
$queueUrl = "https://sqs.amazonaws.com/0000/my-queue"
$deadLetterQueueUrl = "https://sqs.amazonaws.com/0000/deadletter"
# Get the message retention period in seconds
$messageRetentionPeriod = (Get-SQSQueueAttribute -AttributeNames "MessageRetentionPeriod" -QueueUrl $queueUrl).Attributes.MessageRetentionPeriod
# Receive messages from our queue.
$queueMessages = @(Receive-SQSMessage -QueueUrl $queueUrl -WaitTimeSeconds 5 -AttributeNames SentTimestamp)
foreach($message in $queueMessages)
{
# The sent timestamp is in epoch time.
$sentTimestampUnix = $message.Attributes.SentTimestamp
# For powershell, we need to do some quick conversion to get a DateTime.
$sentTimestamp = ([datetime]'1970-01-01 00:00:00').AddMilliseconds($sentTimestampUnix)
# Get the expiration time by adding the retention period to the sent time.
$expirationTime = $sentTimestamp.AddDays($messageRetentionPeriod / 86400 )
# I want my cutoff date to be one hour before the expiration time.
$cutoffDate = $expirationTime.AddHours(-1)
# Check if the cutoff date has passed.
if((Get-Date) -ge $cutoffDate)
{
# Cutoff Date has passed, move to deadletter queue
Send-SQSMessage -QueueUrl $deadLetterQueueUrl -MessageBody $message.Body
remove-sqsmessage -QueueUrl $queueUrl -ReceiptHandle $message.ReceiptHandle -Force
}
else
{
# Cutoff Date has not passed. Retry job?
}
}
This will add some overhead to every message you process. It also assumes that your message handler will receive the message in between the cutoff time and the expiration time. Make sure that your application is polling often enough to receive the message.

RRDTool: RRD file not updating

My RRD file is not updating, what is the reason?
The graph shows the legend with: -nanv
I created the RRD file using this syntax:
rrdtool create ups.rrd --step 300 \
DS:input:GAUGE:600:0:360 \
DS:output:GAUGE:600:0:360 \
DS:temp:GAUGE:600:0:100 \
DS:load:GAUGE:600:0:100 \
DS:bcharge:GAUGE:600:0:100 \
DS:battv:GAUGE:600:0:100 \
RRA:AVERAGE:0.5:12:24 \
RRA:AVERAGE:0.5:288:31
Then I updated the file with this syntax:
rrdtool update ups.rrd N:$inputv:$outputv:$temp:$load:$bcharge:$battv
And graphed it with this:
rrdtool graph ups-day.png \
-t "ups " \
-s -1day \
-h 120 -w 616 \
-a PNG \
-cBACK#F9F9F9 \
-cSHADEA#DDDDDD \
-cSHADEB#DDDDDD \
-cGRID#D0D0D0 \
-cMGRID#D0D0D0 \
-cARROW#0033CC \
DEF:input=ups.rrd:input:AVERAGE \
DEF:output=ups.rrd:output:AVERAGE \
DEF:temp=ups.rrd:temp:AVERAGE \
DEF:load=ups.rrd:load:AVERAGE \
DEF:bcharge=ups.rrd:bcharge:AVERAGE \
DEF:battv=ups.rrd:battv:AVERAGE \
LINE:input#336600 \
AREA:input#32CD3260:"Input Voltage" \
GPRINT:input:MAX:" Max %lgv" \
GPRINT:input:AVERAGE:" Avg %lgv" \
GPRINT:input:LAST:"Current %lgv\n" \
LINE:output#4169E1:"Output Voltage" \
GPRINT:output:MAX:"Max %lgv" \
GPRINT:output:AVERAGE:" Avg %lgv" \
GPRINT:output:LAST:"Current %lgv\n" \
LINE:load#FD570E:"Load" \
GPRINT:load:MAX:" Max %lg%%" \
GPRINT:load:AVERAGE:" Avg %lg%%" \
GPRINT:load:LAST:" Current %lg%%\n" \
LINE:temp#000ACE:"Temperature" \
GPRINT:temp:MAX:" Max %lgc" \
GPRINT:temp:AVERAGE:" Avg %lgc" \
GPRINT:temp:LAST:" Current %lgc"
You will need at least 13 updates, each 5 min apart (i.e. 12 PDPs, primary data points), before you can get a single CDP (consolidated data point) written to your RRAs, enabling you to get a data point on the graph. This is because your smallest-resolution RRA has a count of 12, meaning you need 12 PDPs to make one CDP.
Until you have enough data to write a CDP, you have nothing to graph, and your graph will always have unknown data.
Alternatively, add a smaller-resolution RRA (maybe count 1) so that you do not need to collect data for so long before you have a full CDP; see the sketch below.
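For example, a create along these lines (a sketch based on the definition above, with one extra count-1 RRA added) gives you a graphable point after every single 300-second step:
rrdtool create ups.rrd --step 300 \
DS:input:GAUGE:600:0:360 \
DS:output:GAUGE:600:0:360 \
DS:temp:GAUGE:600:0:100 \
DS:load:GAUGE:600:0:100 \
DS:bcharge:GAUGE:600:0:100 \
DS:battv:GAUGE:600:0:100 \
RRA:AVERAGE:0.5:1:288 \
RRA:AVERAGE:0.5:12:24 \
RRA:AVERAGE:0.5:288:31
# The added RRA:AVERAGE:0.5:1:288 keeps one CDP per 300-second step for a day.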
The update script needs to be run at exactly the same interval as defined in your database.
I see it has a step value of 300 so the database should be updated every 5 minutes.
Just place your update script in a cron job (you can also do it for your graph script).
For example,
sudo crontab -e
If run for the first time, choose your favorite editor (I usually go with Vim), then add the full path of your script and have it run every 5 minutes. So add this (don't forget to adjust the path):
*/5 * * * * /usr/local/update_script > /dev/null && /usr/local/graph_script > /dev/null
Save it, and wait a couple of minutes. I usually redirect the output to /dev/null to suppress anything the scripts print; otherwise, whenever an executed script produces output, cron will send a notification (mail) containing that output.
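If you would rather keep that output for debugging instead of discarding it, you can append it to a log file (the path here is hypothetical):
*/5 * * * * /usr/local/update_script >> /var/log/update_script.log 2>&1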