Unit Testing Prometheus Alerts: input series and interval

I have been writing unit tests for my Prometheus alerts. I have just increased the range interval in my alert rule, so I need to update my existing test. This is the modified test:
- interval: 15m
  # Series data.
  input_series:
    - series: 'some_bucket{service_name="some-service", le="1000"}'
      values: 6 6 6 6 6 6 6
    - series: 'some_bucket{service_name="some-service", le="10000"}'
      values: 10 11 12 13 14 14 14
    - series: 'some_bucket{service_name="some-service", le="+Inf"}'
      values: 10 100 200 300 400 500 600
  alert_rule_test:
    - eval_time: 5m
      alertname: someName
      exp_alerts: []
    - eval_time: 15m
      alertname: someName
      exp_alerts:
        - exp_labels:
            severity: error
            service_name: some-service
          exp_annotations:
            summary: "a summary"
            description: "adescription"
And my alert rule is:
histogram_quantile(0.95, sum by(le) (rate(some_bucket{service_name="some-service"}[15m]))) >= 1000
The test works fine: the alert does not trigger at the eval_time of 5 minutes, and it does trigger at 15 minutes. My question is about the interval set at the top:
- interval: 15m
My understanding is that this should be the scrape interval, but if I change it to 1m the test fails. Why is that? Does it mean that my time series/input data needs changing?
Thank you

The given interval is not the scrape interval per se but the time between consecutive values in the series.
Setting interval to 15m means that your series (with seven entries each, so six gaps between them) define data for 6 x 15 = 90 minutes.
Setting it to 1m means that the same seven values cover only six minutes, so after six minutes your test data is empty. I couldn't find this behavior in any documentation, but I guess the series is either undefined or treated as a missing value from then on.
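The arithmetic above can be sketched in a few lines of Python. This is only an illustration of how the values are laid out on the timeline, assuming the first value sits at time 0; sample_times is a hypothetical helper and not part of promtool.

```python
def sample_times(interval_minutes, values):
    """Pair each value with its timestamp in minutes (assumes the first sample is at t=0)."""
    return [(i * interval_minutes, v) for i, v in enumerate(values)]


# Values of the le="+Inf" series from the question.
inf_bucket = [10, 100, 200, 300, 400, 500, 600]

# With interval: 15m the seven values span 0m..90m.
print(sample_times(15, inf_bucket)[-1])  # (90, 600)

# With interval: 1m the same values span only 0m..6m,
# so nothing is defined around an eval_time of 15m or 16m.
print(sample_times(1, inf_bucket)[-1])   # (6, 600)
```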
The following test passes with interval: 15m. As written here with interval: 1m it fails, and the output shows 'nil' as the values for the buckets.
evaluation_interval: 1m
tests:
  - interval: 1m
    # Series data.
    input_series:
      - series: 'some_bucket{service_name="some-service", le="1000"}'
        values: 6 6 6 6 6 6 6
      - series: 'some_bucket{service_name="some-service", le="10000"}'
        values: 10 11 12 13 14 14 14
      - series: 'some_bucket{service_name="some-service", le="+Inf"}'
        values: 10 100 200 300 400 500 600
    promql_expr_test:
      - expr: histogram_quantile(0.95, sum by(le) (rate(some_bucket{service_name="some-service"}[15m])))
        eval_time: 15m
        exp_samples:
          - value: 10000
      - expr: some_bucket
        eval_time: 16m
        exp_samples:
          - labels: 'some_bucket{service_name="some-service",le="1000"}'
            value: 6
          - labels: 'some_bucket{service_name="some-service",le="10000"}'
            value: 11
          - labels: 'some_bucket{service_name="some-service",le="+Inf"}'
            value: 100

Related

Prometheus: the "for" is breaking my test

I have this alert, which I am trying to cover with unit tests:
- alert: Alert name
  annotations:
    summary: 'Summary.'
    book: "https://link.com"
  expr: sum(increase(app_receiver{app="app_name", exception="exception"}[1m])) > 0
  for: 5m
  labels:
    severity: 'critical'
    team: 'myteam'
This test scenario fails every time unless I comment out for: 5m in the rule, in which case it passes.
rule_files:
  - ./../../rules/test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'app_receiver{app="app_name", exception="exception"}'
        values: '0 0 0 0 0 0 0 0 0 0'
      - series: 'app_receiver{app="app_name", exception="exception"}'
        values: '0 0 0 0 0 10 20 40 60 80'
    alert_rule_test:
      - alertname: Alert name
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              severity: 'critical'
              team: 'myteam'
            exp_annotations:
              summary: 'Summary.'
              book: "https://link.com"
The result of this test:
FAILED:
alertname:Alert name, time:5m,
exp:"[Labels:{alertname=\"Alert name\", severity=\"critical\", team=\"myteam\"} Annotations:{book=\"https://link.com\", summary=\"Summary.\"}]",
got:"[]
Can someone please help me fix this test and explain why it fails?

What do series values stand for in Prometheus unit test?

I am trying to understand what the series values stand for in a Prometheus unit test.
The official documentation does not explain this.
For example, suppose I want to fire an alert if any instance is down for more than 10 seconds.
alerting-rules.yml
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 seconds."
alerting-rules.test.yml
rule_files:
  - alerting-rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: localhost:9090
              job: prometheus
            exp_annotations:
              summary: "Instance localhost:9090 down"
              description: "localhost:9090 of job prometheus has been down for more than 10 seconds."
Originally, I thought that because interval: 1m is 60 seconds and there are 15 numbers, each value stands for 60 / 15 = 4 seconds (1 means up, 0 means down).
However, when the values are
values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
or
values: '1 1 1 1 1 1 1 1 1 0 0 0 0 0 0'
both pass the test when I run promtool test rules alerting-rules.test.yml.
But the values below fail:
values: '1 1 1 1 1 1 1 1 1 1 0 0 0 0 0'
So my original idea that each number stands for 4s must be wrong. If it were correct, the test would fail only when there are fewer than three 0s.
What do the series values stand for in a Prometheus unit test?

Your assumption is incorrect. The values are not spread across a single interval. Each value is the sample the series has at successive multiples of the interval, starting at time 0. For example:
values: 1  1  1  1  1  1
#       0m 1m 2m 3m 4m 5m
In your example you evaluate the alert at 10m (eval_time), so the alert must already be firing at that point. The rule has for: 10s and evaluation_interval is 1m, so up == 0 has to hold at the 9m evaluation as well, and the sample at 9m is the tenth value. When you change the tenth value to 1, the condition only becomes true at 10m, the alert is still pending there, and the test fails because the alert is not triggered as expected.
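To see how the three value strings from the question behave, they can be replayed in a small Python model. This is a simplified sketch, not promtool's implementation: it assumes the first value sits at time 0, that the rule is evaluated once per evaluation_interval, and it ignores staleness handling. is_firing and parse are hypothetical helpers.

```python
def is_firing(values, eval_time_s, interval_s=60, for_s=10, eval_interval_s=60):
    """Simplified model of the InstanceDown rule (up == 0 with for: 10s)."""
    active_since = None  # time at which up == 0 started to hold without a break
    firing = False
    for t in range(0, eval_time_s + 1, eval_interval_s):
        idx = t // interval_s  # index of the sample at (or just before) time t
        down = idx < len(values) and values[idx] == 0
        if down:
            if active_since is None:
                active_since = t  # alert becomes pending
            firing = (t - active_since) >= for_s
        else:
            active_since = None
            firing = False
    return firing


def parse(text):
    """Turn a values string such as '1 1 0' into a list of ints."""
    return [int(x) for x in text.split()]


all_down = parse('0 0 0 0 0 0 0 0 0 0 0 0 0 0 0')
nine_up = parse('1 1 1 1 1 1 1 1 1 0 0 0 0 0 0')
ten_up = parse('1 1 1 1 1 1 1 1 1 1 0 0 0 0 0')

print(is_firing(all_down, 600))  # True
print(is_firing(nine_up, 600))   # True: pending at 9m, firing at 10m
print(is_firing(ten_up, 600))    # False: only pending at 10m
```

Under these assumptions the model reproduces the results reported in the question: the first two series pass at eval_time: 10m and the third fails.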

K8s cronjob is not scheduled correctly several times a day

I have a cluster with a CronJob that runs fine. I scheduled it to run every 3 minutes, and I notice that sometimes the jobs do not run at all for 6 to 9 minutes (two or three intervals).
This happens several times a day and I'm not sure why. How can I find out what the problem is? Is there a way to work around it?
We use k8s 1.14.7.
I also tried changing the interval to 10 minutes and I still see the same pattern: several times a day the job does not run for 20 to 30 minutes (two or three intervals).
The job takes only about 30 seconds to execute and does not run in parallel (it runs like a singleton job).
The logs of the jobs that do run don't show anything.
This is the CronJob:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: fdc-job
  namespace: {{required "Mon namespace variable '(.Values.mon.namespace)' is required" .Values.mon.namespace}}
spec:
  suspend: false
  schedule: "*/3 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 10
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          serviceAccountName: mon-sa
          containers:
            - name: cluster-check
              image: {{required "A valid .Values.artifactory.host entry required!" .Values.artifactory.host}}/{{(.Values.mon.image.repository)}}:{{(.Values.mon.image.tag)}}
              args: ["fdc"]
          restartPolicy: Never
          activeDeadlineSeconds: 100
          imagePullSecrets:
            - name: docker-images-secret
Update
Running kubectl get pods returns the following for the last 40 minutes:
cluster-1569490560-b78nw   0/1   Completed   0   39m
cluster-1569490740-8gcwl   0/1   Completed   0   36m
cluster-1569490920-t9hwj   0/1   Completed   0   33m
cluster-1569491280-qz5sp   0/1   Completed   0   27m
cluster-1569491460-r2dwv   0/1   Completed   0   24m
cluster-1569491640-qn7r8   0/1   Completed   0   21m
cluster-1569492180-vkxcs   0/1   Completed   0   12m
cluster-1569492360-ksn7s   0/1   Completed   0   9m41s
cluster-1569492540-qqwwc   0/1   Completed   0   6m40s
cluster-1569492720-v2dr2   0/1   Completed   0   3m40s
As you can see, the job normally runs every 3 minutes, but the runs from 15, 18 and 30 minutes ago are not shown because they were never executed. Any idea why?
In addition, I upgraded k8s to 1.15.4, which did not solve the problem.
The command kubectl get cronjobs returns:
NAME      SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cluster   */3 * * * *   False     0        2m49s           10d
Any clue or direction would be very helpful.
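The missing runs can be read off the pod listing above with a few lines of Python. This sketch assumes that the numeric part of each pod name is the Unix timestamp of the scheduled run and that the schedule is every 180 seconds; missed_runs is a hypothetical helper, and the sketch only locates the gaps without explaining their cause.

```python
import re

# Pod names copied from the kubectl get pods output above.
pod_names = [
    "cluster-1569490560-b78nw", "cluster-1569490740-8gcwl",
    "cluster-1569490920-t9hwj", "cluster-1569491280-qz5sp",
    "cluster-1569491460-r2dwv", "cluster-1569491640-qn7r8",
    "cluster-1569492180-vkxcs", "cluster-1569492360-ksn7s",
    "cluster-1569492540-qqwwc", "cluster-1569492720-v2dr2",
]


def missed_runs(names, period_s=180):
    """Return the scheduled timestamps that have no pod.

    Assumption: the digits between the hyphens are the scheduled Unix time.
    """
    stamps = sorted(int(re.search(r"-(\d+)-", name).group(1)) for name in names)
    missed = []
    for prev, cur in zip(stamps, stamps[1:]):
        # Every multiple of the period strictly between two runs was skipped.
        missed.extend(range(prev + period_s, cur, period_s))
    return missed


print(missed_runs(pod_names))  # [1569491100, 1569491820, 1569492000]
```

Under that assumption, three scheduled runs have no pod, which matches the three gaps described in the question.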

Django ORM QUERY Adjacent row sum with sqlite

In my database I'm storing data as below:
id  amt
--  ----
1   100
2   -50
3   100
4   -100
5   200
I want to get output like below, where balance is the running sum of amt:
id  amt   balance
--  ----  -------
1   100   100
2   -50   50
3   100   150
4   -100  50
5   200   250
How can I do this with the Django ORM?

Data frames pandas python

I have a data frame that looks like this:
id  age  sallary
1   16   500
2   21   1000
3   25   3000
4   30   6000
5   40   25000
and a list of ids that I would like to ignore: [1,3,5].
How can I get a data frame that contains all the remaining rows (2 and 4)?
Many thanks to everyone.

Call isin and negate the result using ~:
In [42]:
ignore_ids = [1, 3, 5]
df[~df.id.isin(ignore_ids)]
Out[42]:
   id  age  sallary
1   2   21     1000
3   4   30     6000
