How does an rrdtool RRD database associate/bind an RRA to a DS?

How does an rrdtool RRD database associate/bind an RRA to a DS? An XML dump does not seem to reveal where this binding info is kept, and neither does rrdinfo. But this info must be in there, because multiple RRAs can be associated with a single DS. Or am I missing something?

Every DS is in every RRA. You do not need to bind a specific DS to specific RRAs, as the vector of values built from the full set of DSs is common to all of them.
The difference between RRAs is not that they have a different DS vector, but that they have different lengths and granularities, and different roll-up functions. This enables the RRA to pre-calculate summary data at storage time, so that at graph time, most of the work is already done, speeding up the process.
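For illustration, a database created roughly like this (a minimal sketch; the step, heartbeats and row counts are arbitrary) stores both ds1 and ds2 in all three RRAs, with no per-DS binding recorded anywhere:
rrdtool create example.rrd --step 300 \
DS:ds1:GAUGE:600:0:U \
DS:ds2:COUNTER:600:0:U \
RRA:AVERAGE:0.5:1:2016 \
RRA:AVERAGE:0.5:12:1488 \
RRA:MAX:0.5:12:1488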

How do I query Prometheus for the timeseries that was updated last?

I have 100 instances of a service that use one database. I want them to export a Prometheus metric with the number of rows in a specific table of this database.
To avoid hitting the database with 100 queries at the same time, I periodically elect one of the instances to do the measurement and set a Prometheus gauge to the number obtained. Different instances may be elected at different times. Thus, each of the 100 instances may have its own value of the gauge, but only one of them is “current” at any given time.
What is the best way to pick only this “current” value from the 100 gauges?
My first idea was to export two gauges from each instance: the actual measurement and its timestamp. Then perhaps I could take the max(timestamp) and combine it with the actual metric using the and operator. But I can’t figure out how to do this in PromQL, because max will erase the instance label I would need to and on.
My second idea was to reset the gauge to −1 (some sentinel value) at some time after the measurement. But this looks brittle, because if I don’t synchronize everything tightly, the “current” gauge could be reset before or after the “new” one is set, causing gaps or overlaps. Similar considerations go for explicitly deleting the metric and for exporting it with an explicit timestamp (to induce staleness).
I figured out the first idea (not tested yet):
avg(my_rows_count and on(instance) topk(1, my_rows_count_timestamp))
avg could just as well be max or min; it only serves to erase the instance label from the final result.
last_over_time should do the trick
last_over_time(my_rows_count[1m])
given only one of them is “current” at any given time, like you said.
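If the non-elected instances keep exporting their stale gauge values, you may still need to collapse the 100 series into one; something along these lines (untested, the 5m window is arbitrary) combines this with your topk idea by picking the series whose timestamp gauge is newest and then dropping the instance label:
max(last_over_time(my_rows_count[5m]) and on(instance) topk(1, last_over_time(my_rows_count_timestamp[5m])))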

RRDTOOL - RPN-Expression help, how to reference input in COMPUTE DS (add two DS values or the LAST DS value with a new number)?

I have a data feed that has a single value that increases over time until a forced wrap-around.
I have the wrap-around under control.
I pass the value from the data feed into an RRD GAUGE as ds1.
I want to add a couple of data sources to handle exceptions: when a certain condition is detected by my script (which calls rrdupdate), it should add some details for reporting.
When the condition is true in the script, I want to update the RRD with:
the normal value into ds1
the difference between the prior value and the current value, to be marked as batch exceptions, into ds2
a count (sum) of all ds2 values, kept in a similar way to ds1.
I've been playing with the config below, but wonder whether there is a method using COMPUTE, or whether I need to code all the logic into the bash script: poll rrdinfo, fetch the last_ds lines and prep the data accordingly. Does the RRD COMPUTE type have the ability to read other DSs?
If ds2.value > 0 then set ds3.value to (ds3.last_ds + ds2.value) ?
I looked at the rpn-expression documentation and found that it references 'input', but it does not show how to feed those inputs into the COMPUTE operation.
e.g.:
Current state
DS:ds1:GAUGE:28800:0:U
DS:ds2:COUNTER:1800:0:U
DS:ds3:GAUGE:1800:0:U
RRA:LAST:0.99999:1:72000
RRA:LAST:0.99999:4:17568
RRA:LAST:0.99999:8:18000
RRA:LAST:0.99999:32:4500
RRA:LAST:0.99999:96:1825
Desired state?
DS:ds1:GAUGE:28800:0:U
DS:ds2:COUNTER:1800:0:U
DS:ds3:COMPUTE:1800:0:U
DS:cs1:COMPUTE:input,0,GT,ds3,ds2,+,input,IF <-- what is 'input'? Is it passed via rrdupdate cs1:[value]?
RRA:LAST:0.99999:1:72000
RRA:LAST:0.99999:4:17568
RRA:LAST:0.99999:8:18000
RRA:LAST:0.99999:32:4500
RRA:LAST:0.99999:96:1825
Alternatively, ds1 could store the total without the exceptions, and I could use an AREA and a STACK to plot the total.
If someone is knowledgeable about rpn-expressions as used with RRD, it would be a massive help to clarify the rpn-expression 'input' reference and what is possible. There is very limited info online about this. If the script has to poll the RRD files for last_ds and do the calculations, that is fine; but if RRD has the smarts in the COMPUTE DS type, I'd rather use them.
Thank you.
A COMPUTE type datasource needs to have an RPN formula that completely describes it in terms of the other (non-COMPUTE) datasources. So you cannot have multiple definitions of the same source, nor is it populated until the other DSs for that time window have been populated.
So, for example, if you have datasources a and b, and you want a COMPUTE type datasource that is equal to a+b if b>0, and to a otherwise, you can use
DS:a:COUNTER:1800:0:U
DS:b:GAUGE:1800:0:U
DS:c:COMPUTE:b,0,GT,a,b,+,a,IF
From this, you can see how the definition of c uses RPN to define a single value from the values of a and b (and a constant). The calculation is performed solely within the configured time interval, and subsequently all three are stored and aggregated in the defined RRAs in the same way. You can also then use the graph functions over c exactly as you would for a or b; the compute function is used only at data storage time.
Here is a full working example for the benefit of the original poster:
rrdtool create test.rrd --step 1800 \
DS:a:COUNTER:28800:0:U \
DS:b:COUNTER:28000:0:U \
DS:c:GAUGE:3600:0:U \
DS:d:COUNTER:3600:0:U \
DS:x:COMPUTE:b,0,GT,a,b,+,a,IF \
RRA:LAST:0.99999:1:72000 \
RRA:LAST:0.99999:4:17568 \
RRA:LAST:0.99999:8:18000 \
RRA:LAST:0.99999:32:4500 \
RRA:LAST:0.99999:96:1825
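To the original 'input' question: a COMPUTE DS is not supplied a value on update; its PDPs are computed from the other data sources' PDPs at storage time. So an update of the database above would look something like this (arbitrary numbers, supplied for a, b, c and d in definition order, with x filled in automatically):
rrdtool update test.rrd N:100:5:42:7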

How would I merge related records in apache beam / dataflow, based on hundreds of rules?

I have data that I have to join at the record level. For example, data about users is coming in from different source systems, but there is no common primary key or user identifier.
Example Data
Source System 1:
{userid = 123, first_name="John", last_name="Smith", many other columns...}
Source System 2:
{userid = EFCBA-09DA0, fname="J.", lname="Smith", many other columns...}
There are about 100 rules I can use to compare one record to another to see if the customer in source system 1 is the same as the one in source system 2.
Some rules may be able to infer record values and add data to a master record about a customer.
Because some rules may infer/add data to any particular record, the rules must be re-applied again when a record changes.
We have millions of records per day we'd have to unify
Apache Beam / Dataflow implementation
An Apache Beam DAG is by definition acyclic, but I could just republish the data through Pub/Sub to the same DAG to make the algorithm cyclic.
I could create a PCollection of hashmaps that continuously do a self join against all other elements, but this seems like it's probably an inefficient method.
Immutability of a PCollection is a problem if I want to be constantly modifying things as they go through the rules. This sounds like it would be more efficient with Flink Gelly or Spark GraphX.
Is there any way you know of in Dataflow to process such a problem efficiently?
Other thoughts
Prolog: I tried running on a subset of this data with a subset of the rules, but SWI-Prolog did not seem scalable, and I could not figure out how I would continuously emit the results to other processes.
JDrools/Jess/Rete: Forward chaining would be perfect for the inference and efficient partial application, but this algorithm is more about applying many many rules to individual records, rather than inferring record information from possibly related records.
Graph database: Something like Neo4j or Datomic would be nice, since joins are at the record level rather than row/column scans, but I don't know if it's possible in Beam to do something similar.
BigQuery or Spanner: Brute-forcing these rules in SQL and doing full table scans per record is really slow. It would be much preferred to keep the graph of all records in memory and compute in-memory. We could also try to concatenate all columns and run multiple compare-and-update passes across all columns.
Or maybe there's a more standard way of solving this class of problems.
It is hard to say what solution works best for you from what I can read so far. I would try to split the problem further and try to tackle different aspects separately.
From what I understand, the goal is to combine together the matching records that represent the same thing in different sources:
records come from a number of sources:
it is logically the same data but formatted differently;
there are rules to tell if the records represent the same entity:
collection of rules is static;
So, the logic probably roughly goes like this (a rough code sketch follows below):
read a record;
try to find existing matching records;
if matching record found:
update it with new data;
otherwise save the record for future matching;
repeat;
To me this looks very high level and there's probably no single 'correct' solution at this level of detail.
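Still, as a very rough sketch of that loop in the Beam Python SDK (my interpretation, not a drop-in solution), a stateful DoFn can buffer previously seen records per key; matches(), merge() and blocking_key() below are placeholders for your real rule logic, and choosing a good blocking key is exactly the kind of question raised further down:
import apache_beam as beam
from apache_beam.coders import PickleCoder
from apache_beam.transforms.userstate import BagStateSpec

def matches(a, b):
    # placeholder for the ~100 comparison rules
    return a.get('last_name') == b.get('last_name')

def merge(a, b):
    # placeholder: combine/infer fields into a master record
    out = dict(b)
    out.update(a)
    return out

class MatchAndMerge(beam.DoFn):
    # per-key buffer of records seen so far (stateful DoFns need a keyed input)
    SEEN = BagStateSpec('seen', PickleCoder())

    def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
        key, record = element
        merged, keep = record, []
        for candidate in list(seen.read()):
            if matches(merged, candidate):
                merged = merge(merged, candidate)   # fold the matching record in
            else:
                keep.append(candidate)              # unrelated, keep it buffered
        seen.clear()
        for r in keep + [merged]:
            seen.add(r)
        yield (key, merged)

# usage (blocking_key is a placeholder for whatever coarse key you can derive):
#   records | beam.Map(lambda r: (blocking_key(r), r)) | beam.ParDo(MatchAndMerge())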
I would probably try to approach this by first understanding it in more detail (maybe you already do); a few thoughts:
what are the properties of the data?
are there patterns? E.g. when one system publishes something, do you expect something else from other systems?
what are the requirements in general?
latency, consistency, availability, etc;
how data is read from the sources?
can all the systems publish the records in batches in files, submit them into PubSub, does your solution need to poll them, etc?
can the data be read in parallel or is it a single stream?
Then the main question, how you can efficiently match a record in general, will probably look different under different assumptions and requirements as well. For example, I would think about:
can you fit all data in memory;
are your rules dynamic? Do they change at all, and what happens when they do;
can you split the data into categories that can be stored separately and matched efficiently, e.g. if you know you can try to match some things by id field, some other things by hash of something, etc;
do you need to match against all of historical/existing data?
can you have some quick elimination logic to not do expensive checks?
what is the output of the solution? What are the requirements for the output?

Can rrdtool store data for metrics whose list changes over time, for example the top 10 processes consuming CPU?

We need to create a graph of the top 10 items, which will change from time to time, for example the top 10 processes consuming CPU or any other top 10 items we can generate values for on the monitored server, with the possibility of showing the names of the items on the graph.
Please tell me, is there any way to store this information using rrdtool?
Thanks
If you want to store this kind of information with rrdtool, you will have to create a separate rrd database for each item, update them accordingly and finally generate charts picking the 10 'top' rrd files ...
In other words, quite a lot of the magic has to happen in the script you write around rrdtool ... rrdtool will take care of storing the time series data ...
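As an illustrative sketch (the file names, step and limits are placeholders), the wrapper script could maintain one file per item and pick the top 10 at graph time:
# one RRD per item, created the first time the item is seen
rrdtool create cpu_nginx.rrd --step 60 DS:cpu:GAUGE:120:0:100 RRA:AVERAGE:0.5:1:10080
# periodic update from the collector
rrdtool update cpu_nginx.rrd N:23.5
The script would then sort the files by their most recent value (e.g. via rrdtool lastupdate) and pass the 10 winners to rrdtool graph, using each file name as the item's label.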

Setting RRD deletion granularity

We currently use RRDtool through munin for trending of our services. We'd like to keep more data than we currently do; that is, we don't want the interstitial points deleted once the data gets older than a week. I can't find a flag that I can pass to RRDtool to do this.
We're aware that this will increase the storage requirements, but we'd like to make the decision as to how much data is too much, rather than have it made for us.
You can use rrdtool resize to modify the length of the RRAs.
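For example (the RRA index and row count are placeholders), growing the first RRA of one of the munin files by 10000 rows would look something like:
rrdtool resize some-metric.rrd 0 GROW 10000
rrdtool resize writes the result to a new file called resize.rrd in the current directory, which you then move over the original; rrdtool info shows the existing RRAs and their indices so you can pick the right one.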