gcp monitoring "Any time series violates" vs "All time series violate" - google-cloud-platform

What's the difference between the two options "Any time series violates" and "All time series violate"? I can easily imagine what the former does, but I have no idea what the latter would do.
All time series? How long is its range? And why does it have a "for" option?

First, what does "time series violates" mean? It's when the CURRENT VALUE of a metric is outside the expected range, e.g. above the specified threshold.
Second, "any/all/percent/number": let's say you have 5 time series, e.g. CPU usage on 5 instances. Then, depending on the dropdown option, the whole alert condition is violated when:
"any time series": any 1 of the time series is in violation
"all time series": all 5 of the time series are in violation
"percent of time series" (40%): 2 out of 5 of the time series are in violation, and yes, selecting 39% or 41% on small numbers will give you different results, so
"number of time series" (3): 3 out of 5 of the time series are in violation
Third, the "for" (a.k.a. Duration) box: it looks like it means "if my time series violates FOR 5 minutes, then violate the condition". For some simpler alerts this can even work, but once you try to combine it with, say, "metric is absent" or another more complicated configuration, you will see that what actually happens is "wait for 5 minutes after the problem is there, and only then trigger the violation".
In practice, using the "for" field is discouraged and it's better to keep it at the default "Most recent value".
If you do need "CPU usage is above 90% for 5 minutes", the correct way to do it is by denoising/smoothing your data:
set the alignment period to 5 minutes (or whatever sliding window you want)
then choose a reasonable aligner (like mean, which averages the values)
and then, while the chart will have fewer data points, they will be less noisy and you can act on the latest value.
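As a rough sketch of what that looks like outside the UI, here is how such a smoothed condition could be built with the Cloud Monitoring Python client (google-cloud-monitoring, monitoring_v3); the filter and threshold are placeholders, and the exact field names should be checked against the client version you use:

from google.cloud import monitoring_v3

# Hypothetical "CPU mean over 5 minutes is above 90%" condition.
condition = monitoring_v3.AlertPolicy.Condition(
    display_name="CPU usage (5 min mean) above 90%",
    condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
        filter=(
            'resource.type = "gce_instance" AND '
            'metric.type = "compute.googleapis.com/instance/cpu/utilization"'
        ),
        aggregations=[
            monitoring_v3.Aggregation(
                alignment_period={"seconds": 300},  # the 5-minute sliding window
                per_series_aligner=monitoring_v3.Aggregation.Aligner.ALIGN_MEAN,
            )
        ],
        comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
        threshold_value=0.9,
        duration={"seconds": 0},  # act on the most recent (already smoothed) value
        trigger=monitoring_v3.AlertPolicy.Condition.Trigger(count=1),  # "any time series"
    ),
)
# The condition would then be wrapped in an AlertPolicy and created with
# monitoring_v3.AlertPolicyServiceClient().create_alert_policy(...).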

"Any time series" will trigger if any single time series is in violation, within the window chosen in "for".
Let's say there are 5 time series: it will trigger if there is a violation in any one of them.
For "all time series", it will trigger only if 5 out of 5 are in violation.


gcp monitoring "number of time series violates"

What's the difference between "number of time series violates" and the other conditional triggers? I can easily imagine what the other conditional triggers do, but I have no idea what this one would do.
I would interpret "number of time series violates" in two ways.
Example A: I can have 5 VM instances, and the conditional trigger "number of time series violates" means at least 3 violations over time (one instance becomes absent 3 times), with "is absent" for 1 day.
Example B: I can have 5 VM instances, and the conditional trigger "number of time series violates" means at least 3 VM instances have to exceed the threshold (become absent), with "is absent" for 1 day.
Thank you in advance for clarifying my misunderstanding.
Example B is correct. Let's assume you have a condition on VM instances and CPU usage. With 5 VMs you have 5 different time series, one for each VM. When you set "Number of time series violates" to 3, that means 3 time series have exceeded the threshold, or in other words 3 out of 5 of the time series are in violation. Alternatively, you can use the percentage option and set it to 60%, which yields the same result.
Setting a longer time frame will give you the behaviour you are describing in Example A.
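For what it's worth, this choice maps to the condition's trigger if you build the policy through the API; a minimal sketch with the Python client (monitoring_v3, field names assumed to match the library as above), using the numbers from this example:

from google.cloud import monitoring_v3

# "Number of time series violates" = 3 (3 of the 5 VM time series must violate) ...
trigger_by_count = monitoring_v3.AlertPolicy.Condition.Trigger(count=3)

# ... which, with 5 time series, is equivalent to "Percent of time series" = 60%.
trigger_by_percent = monitoring_v3.AlertPolicy.Condition.Trigger(percent=60)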

GCP : Cloud Functions Graphs

When I execute a Cloud Function (CF) on GCP, it shows graphs for 4 parameters. Invocations and Active Instances are easy to understand. But I am unable to make sense of the other graphs, i.e. execution time & memory usage. This is a screenshot of one of our HTTP-triggered CFs. Can someone explain how exactly to make sense of this data? What does CF mean when it says "99th percentile: 882.85"?
Is 99th percentile good or bad?
It is neither good nor bad; these are distribution statistics for the execution time.
Look up what a percentile actually means in order to understand the chart.
E.g. 99% of the observed execution durations fall at or below 882.85 ms,
and the remaining 1% are extreme values above that.
Those 882.85 ms might simply be suboptimal, in case the function could run faster.
It's presented like this so that a few extreme values won't distort the whole statistic.
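As a hypothetical worked example of what such a figure means (the durations below are invented, not taken from the screenshot):

import statistics

# 1000 invented invocation durations in milliseconds: most fast, a few very slow.
durations_ms = [120] * 990 + [900] * 10

# statistics.quantiles(n=100) returns 99 cut points; index 98 is the 99th
# percentile, i.e. 99% of invocations finished at or below this duration.
p99 = statistics.quantiles(durations_ms, n=100)[98]
mean = statistics.mean(durations_ms)

print(f"mean = {mean:.1f} ms, 99th percentile = {p99:.1f} ms")
# The few slow invocations barely move the mean but dominate the 99th
# percentile, which is why the chart reports percentiles rather than a
# single average that extreme values could hide behind.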

RRDTool data values (e.g. max value) are different in different time resolutions

Currently I'm experimenting a bit with RRDTool. I'm aware that accuracy gets lower the longer the selected time period is, but I thought I could bypass this with my data source settings.
For example, temperature and humidity from my house, resolution 1 h:
And now with the resolution of 1d:
As you can see, there is a big difference in the max value of the blue line.
I created my data sources and archives with these values:
"rrdtool create temp.rrd --step 30",
"DS:temp:GAUGE:60:U:U",
"DS:humidity:GAUGE:60:U:U",
"RRA:AVERAGE:0.5:1:1051200",
"RRA:MAX:0.5:1:1051200",
"RRA:MIN:0.5:1:1051200",
I thought that 1051200 (1 year = 31536000 s / 30 s resolution = 1051200) is correct for saving every value for a year, and that there should be no need for interpolation.
Is it possible to get the exact values displayed even if the resolution changes (for example the max humidity (Luftfeuchtigkeit) at 99.9%)?
Here are my values for image creation:
"--start" => "-1h", (-1d etc-)
"--title" => "Haustemperatur",
"--vertical-label" => "°C / % RLF",
"--width" => 800,
"--height" => 600,
"--lower-limit" => "-5",
"DEF:temperatur=$rrdFile:temperatur:LAST",
"DEF:humidity=$rrdFile:humidity:LAST",
"LINE1:temperatur#33CC33:Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperatur:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humidity:MAX:Maximum\: %4.2lf %%\j",
Thanks for your help and any suggestions.
P.S. I'm using a library to generate the graphs and the database, please do not be surprised about possible syntax errors.
Your problem is that you are causing the values to be rolled-up on the fly at graph time, but have not correctly specified which rollup function to use. Your second graph is showing the MAXIMUM of the LAST in the interval, not the true Maximum.
There are a few issues to explain with this configuration:
Firstly, your RRD is defined using 3 RRAs with 1cdp=1pdp and different consolidation functions (AVG, MIN, MAX). This means they are functionally identical, but they do not save you any time at graphing as they have not done any pre-rollup for you! You should definitely consider having just one of these (probably AVG) and adding others at lower resolution to help speed up graphing when you have a bigger time window.
Secondly, you need to specify the on-the-fly rollup function. When graphing, RRDTool will work out the best RRA to use based on your DEF lines, and will perform any additional consolidation required on the fly. This can take a long time if the only available RRA is too high-granularity.
Your graph request uses DEF:temperatur=$rrdFile:temperatur:LAST but you do not actually have a LAST type RRA, so RRDTool will grab the last average. Your RRA data points are at 30s interval, but your second graph has (approx) 5min per pixel, meaning that RRDTool needs to grab the 10 entries from the RRA, and print the last. Looking at the data in the top graph, it seems that the last in that interval was the 66 value, though previous ones were 100.
So you have a choice. Do you want the graph to show the average for the time period, the maximum, or both? Do you want the figures at the bottom to show the maximum of the average, or the maximum of everything?
For example
"DEF:temperatur=$rrdFile:temperatur:AVERAGE",
"DEF:humidity=$rrdFile:humidity:AVERAGE",
"DEF:temperaturmax=$rrdFile:temperatur:MAX;reduce=MAX",
"DEF:humiditymax=$rrdFile:humidity:MAX;reduce=MAX",
"LINE1:temperatur#33CC33:Temperatur",
"LINE1:temperaturmax#66EE66:Maximum Temperatur",
"GPRINT:temperatur:LAST:\t\tAktuell\: %4.2lf °C",
"GPRINT:temperatur:AVERAGE:Schnitt\: %4.2lf °C",
"GPRINT:temperaturmax:MAX:Maximum\: %4.2lf °C\j",
"LINE1:humidity#0000FF:Relative Luftfeuchtigkeit",
"LINE1:humiditymax#3333FF:Maximum Luftfeuchtigkeit",
"GPRINT:humidity:LAST:Aktuell\: %4.2lf %%",
"GPRINT:humidity:AVERAGE:Schnitt\: %4.2lf %%",
"GPRINT:humiditymax:MAX:Maximum\: %4.2lf %%\j",
In this case, we define a separate DEF for the maximum data set, so that we can always obtain the highest value even after consolidation. This is also used in the GPRINT so that we get the MAX of the MAX rather than the MAX of the AVERAGE. The Maximum line is now drawn separately from the average line, so that we can see the effect of any rollup of data: the lines will be together at high resolution but get further apart as the time window widens and the resolution decreases.
The DEF's reduce parameter forces any rollup function used for the maxima to be MAX rather than AVERAGE, so we can be sure to get the maximum rather than the average of the maxima.
We also use AVERAGE rather than LAST in order to get more meaningful data after rollup. Note that we could also keep a separate DEF for the LAST if we wanted to, though it is less useful.
Note that, if you ever expect to be generating graphs over more than a few days, you should definitely consider adding some lower-resolution RRAs for AVERAGE and MAX or else the graphs will generate very slowly. RRDTool is designed with the intention that data will be rolled up over time, rather than (as in a traditional database) every sample kept as-is. So, unless you really need to have 30s resolution data kept for an entire year, you may prefer to keep this high resolution data for only a week, and then have separate RRAs that roll up to 1 hour resolution and keep for longer. Many people keep the 30s for 2 days, then 30min-summary for 2 weeks, 2h-summary for 2 months, and then 1day-summary for 2 years.
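A minimal sketch of such a layout, written here with the Python rrdtool bindings purely for illustration (the DS definitions mirror the ones in the question; adapt the call style to whatever library you actually use):

import rrdtool  # python rrdtool bindings, used here only as an example

# Hypothetical RRA layout for a 30 s step, following the rollup scheme above:
#   30 s detail    kept for  2 days   ->  5760 rows (1 step per row)
#   30 min summary kept for  2 weeks  ->   672 rows (60 steps per row)
#   2 h summary    kept for ~2 months ->   744 rows (240 steps per row)
#   1 day summary  kept for  2 years  ->   730 rows (2880 steps per row)
rrdtool.create(
    "temp.rrd", "--step", "30",
    "DS:temp:GAUGE:60:U:U",
    "DS:humidity:GAUGE:60:U:U",
    # averages
    "RRA:AVERAGE:0.5:1:5760",
    "RRA:AVERAGE:0.5:60:672",
    "RRA:AVERAGE:0.5:240:744",
    "RRA:AVERAGE:0.5:2880:730",
    # maxima, so the true peaks survive consolidation
    "RRA:MAX:0.5:1:5760",
    "RRA:MAX:0.5:60:672",
    "RRA:MAX:0.5:240:744",
    "RRA:MAX:0.5:2880:730",
)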
For more information, see the RRDTool manual pages.

Counting riemann events in rate function

Hi, I have a use case where I have to aggregate my application's response time over a time interval of 10, i.e. (rate 10 ...), and then calculate the average. The real problem is that there is no way to get the number of events out of Riemann's rate function for that interval of 10. Is there any way to do that other than using (fixed-time-window ...)?
Rate is unlikely to be the function you want for this. If I understand correctly, you would like your code to:
gather all the events that happen in ten minutes
once the events are all available, calculate the average of the :metric key
emit one event with the service name, and that average as the value of its metric.
If I'm not understanding then this answer won't fit, so let me know.
Rate takes a reporting interval and any number of streams to forward the resulting rate to. That first parameter to rate only determines how often it reports the current rate and has no effect on the period over which it's aggregated. The built-in rate function only has one available aggregation interval: it always reports in "metric per second". So it accumulates events for one second, averages the metric over that second, and it properly handles edge cases like reporting intervals with no events (emitting a zero metric for a reasonable number of intervals, though not forever). You should use rate where it fits, and not use it where you need explicit control over the aggregation period.
I often want events per minute, so I set the reporting period to 60 seconds, then multiply the output by 60 before forwarding it. This saves me handling all the edge cases in custom aggregation functions. Keep in mind that this loses some accuracy in the rounding.
(rate 60
  (adjust [:metric * 60]
    index datadog))
You may also want to do something like:
(fixed-time-window 10
  (smap folds/median
    ... your stuff ...))

Configure SQLite for real time operations

In short, this post would like to answer the following question: how (if possible) can we configure an SQLite database to be absolutely sure that any INSERT command will return in less than 8 milliseconds?
By configure, I mean compile options, database pragma options, and run-time options.
To give some background, we would like to apply the same INSERT statement at 120 fps. (1000 ms / 120 fps ≃ 8 ms)
The Database is created with the following strings:
"CREATE TABLE IF NOT EXISTS MYTABLE ("
"int1 INTEGER PRIMARY KEY AUTOINCREMENT, "
"int2 INTEGER, "
"int3 INTEGER, "
"int4 INTEGER, "
"fileName TEXT);
and the options:
"PRAGMA SYNCHRONOUS=NORMAL;"
"PRAGMA JOURNAL_MODE=WAL;"
The INSERT statement is the following one:
INSERT INTO MYTABLE VALUES (NULL, ?, ?, ?, ?)
The last ? (for fileName) is the name of a file, so it's a small string. Each INSERT is thus small.
Of course, I use precompiled statements to accelerate the process.
I have a little program that makes one insert every 8 ms and measures the time taken to perform this insert. To be more precise, the program makes one insert, THEN waits for 8 ms, THEN makes another insert, etc. In the end, 7200 inserts were pushed, so the program runs for about 1 minute.
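For reference, a minimal sketch of such a measurement loop, written here with Python's built-in sqlite3 module rather than the original program (the file name and inserted values are made up):

import sqlite3
import time

conn = sqlite3.connect("test.db")
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute(
    "CREATE TABLE IF NOT EXISTS MYTABLE ("
    "int1 INTEGER PRIMARY KEY AUTOINCREMENT, "
    "int2 INTEGER, int3 INTEGER, int4 INTEGER, fileName TEXT)"
)

timings_ms = []
for i in range(7200):
    start = time.perf_counter()
    conn.execute(
        "INSERT INTO MYTABLE VALUES (NULL, ?, ?, ?, ?)",
        (i, i, i, f"file_{i}.bin"),
    )
    conn.commit()                      # one transaction per insert, as in the test
    timings_ms.append((time.perf_counter() - start) * 1000)
    time.sleep(0.008)                  # wait 8 ms before the next insert

print(f"max insert time: {max(timings_ms):.2f} ms")
print(f"inserts above 8 ms: {sum(t > 8 for t in timings_ms)}")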
Here are two charts:
The first image shows how many milliseconds were spent on each insert as a function of time, expressed in minutes. As you can see, most of the time the insert time is 0, but there are spikes that can go higher than 100 ms.
The second image is a histogram of the same data. Values below 5 ms are not shown, but I can tell you that of the 7200 inserts, 7161 are below 5 milliseconds (they would give a huge peak at 0 that would make the chart less readable).
The total program time is
real 1m2.286s
user 0m1.228s
sys 0m0.320s.
Let's call it roughly 62 seconds. Don't forget that we spend 7200 × 8 ms ≈ 58 seconds waiting. So the 7200 inserts take about 4 seconds, which gives a rate of about 1800 inserts per second, and thus an average time of about 0.55 milliseconds per insert.
This is really great, except that in my case I want ALL THE INSERTS to be below 8 milliseconds, and the chart shows that this is clearly not the case.
So where do these peaks come from?
When the WAL file reaches a given size (1 MB in our case), SQLite makes a checkpoint (the WAL file is applied to the real database file). And because we passed PRAGMA SYNCHRONOUS=NORMAL, at this moment SQLite performs an fsync on the hard drive.
We suppose it is this fsync that makes the corresponding insert really slow.
This long insert time does not depend on the WAL file size. We played with the pragma WAL_AUTOCHECKPOINT (1000 by default) linked to the WAL file, and we could not reduce the height of the peaks.
We also tried PRAGMA SYNCHRONOUS=OFF. The performance is better but still not good enough.
For information, the dirty_background_ratio (/proc/sys/vm/dirty_background_ratio) on my computer was set to 0, meaning that all dirty pages in the cache must be flushed to the hard drive immediately.
Does anyone have an idea of how to "smooth" the chart, so that no insert time exceeds 8 ms?
By default, pretty much everything in SQLite is optimized for throughput, not latency.
WAL mode moves most delays into the checkpoint, but if you don't want those big delays, you have to use more frequent checkpoints, i.e., do a checkpoint after each transaction.
In that case, WAL mode does not make sense; it's better to try journal_mode=PERSIST.
(This will not help much because the delay comes mostly from the synchronization, not from the amount of data.)
If the WAL/journal operations are too slow, and if even synchronous=off is not fast enough, then your only choice is to disable transaction safety and try journal_mode=memory or even =off.
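A minimal sketch of those options, using Python's built-in sqlite3 module purely for illustration (MYTABLE is the table from the question; pick one variant depending on how much durability you can give up):

import sqlite3

conn = sqlite3.connect("test.db")  # assumes MYTABLE from the question already exists

# Option 1: keep WAL, but checkpoint after every transaction so the cost is
# paid in small, predictable pieces instead of one large spike (the fsync
# itself may still exceed the 8 ms budget).
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute("INSERT INTO MYTABLE VALUES (NULL, 1, 2, 3, 'a.bin')")
conn.commit()
conn.execute("PRAGMA wal_checkpoint(TRUNCATE)")

# Option 2: a rollback journal that is kept (not deleted) between transactions.
conn.execute("PRAGMA journal_mode=PERSIST")

# Option 3: give up transaction safety entirely for the lowest latency.
conn.execute("PRAGMA journal_mode=MEMORY")  # or journal_mode=OFF
conn.execute("PRAGMA synchronous=OFF")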