I have this alert, which I am trying to cover with unit tests:
- alert: Alert name
  annotations:
    summary: 'Summary.'
    book: "https://link.com"
  expr: sum(increase(app_receiver{app="app_name", exception="exception"}[1m])) > 0
  for: 5m
  labels:
    severity: 'critical'
    team: 'myteam'
This test scenario fails every time unless the for: 5m line is commented out in the rule. With that line commented out, the test passes.
rule_files:
  - ./../../rules/test.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'app_receiver{app="app_name", exception="exception"}'
        values: '0 0 0 0 0 0 0 0 0 0'
      - series: 'app_receiver{app="app_name", exception="exception"}'
        values: '0 0 0 0 0 10 20 40 60 80'
    alert_rule_test:
      - alertname: Alert name
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              severity: 'critical'
              team: 'myteam'
            exp_annotations:
              summary: 'Summary.'
              book: "https://link.com"
The result of this test:
FAILED:
alertname:Alert name, time:5m,
exp:"[Labels:{alertname=\"Alert name\", severity=\"critical\", team=\"myteam\"} Annotations:{book=\"https://link.com\", summary=\"Summary.\"}]",
got:"[]"
Can someone please help me fix this test and explain the reason for the failure?
I have been writing unit tests for my Prometheus alerts and I have just increased the interval range in my alert, therefore I need to modify my current test. This is my modified test:
- interval: 15m
  # Series data.
  input_series:
    - series: 'some_bucket{service_name="some-service", le="1000"}'
      values: 6 6 6 6 6 6 6
    - series: 'some_bucket{service_name="some-service", le="10000"}'
      values: 10 11 12 13 14 14 14
    - series: 'some_bucket{service_name="some-service", le="+Inf"}'
      values: 10 100 200 300 400 500 600
  alert_rule_test:
    - eval_time: 5m
      alertname: someName
      exp_alerts: []
    - eval_time: 15m
      alertname: someName
      exp_alerts:
        - exp_labels:
            severity: error
            service_name: some-service
          exp_annotations:
            summary: "a summary"
            description: "adescription"
and my alert rule is:
histogram_quantile(0.95, sum by(le) (rate(some_bucket{service_name="some-service"}[15m]))) >= 1000
The test works fine: the alert does not fire at the eval_time of 5 minutes, and it does fire once it reaches the correct interval. My question is about the interval set at the top:
- interval: 15m
My understanding is that this should be the scrape interval, but if I change it to 1m the test fails. Why is that? Does it mean that my time series/input data needs changing?
Thank you
The given interval is not the scrape interval per se, but the time between consecutive values in the series.
Setting interval to 15m means that your series (with seven entries each, so six gaps between them) define data for 6 x 15 = 90 minutes.
Setting it to 1m means that your test data runs out after six minutes. I couldn't find the behavior past that point in any documentation, but I guess it is either undefined or treated as a missing value.
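To make the arithmetic concrete, here is a minimal Python sketch. It is not part of promtool, and the helper name sample_times is made up for illustration. It assumes that the first value of a series is placed at time 0 and every following value one interval later, which is my understanding of how the unit-test loader lays out input_series.

```python
def sample_times(values, interval_minutes):
    """Pair each value with the minute at which the loader is assumed to place it.

    Assumption: the first value sits at t=0 and each later value is one
    interval after the previous one.
    """
    return [(i * interval_minutes, v) for i, v in enumerate(values)]


# The le="+Inf" series from the question.
inf_bucket = [10, 100, 200, 300, 400, 500, 600]

with_15m = sample_times(inf_bucket, 15)
with_1m = sample_times(inf_bucket, 1)

# Seven values give six gaps, so the data covers 6 x interval minutes.
print("interval 15m, last sample:", with_15m[-1])
print("interval 1m, last sample:", with_1m[-1])
```

With interval: 1m the last sample lands at minute 6, so a 15m range evaluated at 15m no longer covers the same data as it does with interval: 15m.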
The following test passes with interval: 15m. With interval: 1m, as written below, it fails, and you can see that you get 'nil' as the values for the buckets.
evaluation_interval: 1m
tests:
  - interval: 1m
    # Series data.
    input_series:
      - series: 'some_bucket{service_name="some-service", le="1000"}'
        values: 6 6 6 6 6 6 6
      - series: 'some_bucket{service_name="some-service", le="10000"}'
        values: 10 11 12 13 14 14 14
      - series: 'some_bucket{service_name="some-service", le="+Inf"}'
        values: 10 100 200 300 400 500 600
    promql_expr_test:
      - expr: histogram_quantile(0.95, sum by(le) (rate(some_bucket{service_name="some-service"}[15m])))
        eval_time: 15m
        exp_samples:
          - value: 10000
      - expr: some_bucket
        eval_time: 16m
        exp_samples:
          - labels: 'some_bucket{service_name="some-service",le="1000"}'
            value: 6
          - labels: 'some_bucket{service_name="some-service",le="10000"}'
            value: 11
          - labels: 'some_bucket{service_name="some-service",le="+Inf"}'
            value: 100
I am trying to extract the text between the two strings using the following regex.
(?s)Non-terminated Pods:.*?in total.\R(.*)(?=Allocated resources)
This regex looks fine in regex101, but somehow it does not print the pod details when used with perl or grep -P. The command below produces empty output.
kubectl describe node |perl -le '/(?s)Non-terminated Pods:.*?in total.\R(.*)(?=Allocated resources)/m; printf "$1"'
Here is the sample input:
PodCIDRs: 10.233.65.0/24
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-1 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp8 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
Allocated resources:
Question:
How can I extract the information from the above output so that it looks like the block below? What is wrong with the regex or the command that I am using?
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-1 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp8 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
Question 2: What if I have two blocks of similar input? How do I extract the pod details from both?
E.g., if the input is:
PodCIDRs: 10.233.65.0/24
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-1 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp8 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
Allocated resources:
....some
.......random data...
PodCIDRs: 10.233.65.0/24
Non-terminated Pods: (7 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-2 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp3-2 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
Allocated resources:
With some obvious assumptions, and keeping it close to the pattern in the question:
perl -0777 -wnE'
    @pods = /Non-terminated\s+Pods:\s+\([0-9]+\s+in\s+total\)\n(.*?)\nAllocated resources:/gs;
    say for @pods
' input-file
(note modifiers on the regex in this line, which is too wide to fit on screen: /gs)
The regex from the question works when used instead of the one in this answer (and with no /s modifier, as it should) on a single block of text. To work with multiple blocks, the (.*) in it needs to be changed to (.*?), so that it doesn't match all the way to the last Allocated...
The question doesn't say precisely how that regex is "used with perl", so I can't say what failed.
Comments on the command-line program above:
The -0777 switch makes it read the whole file into a string, available in the program in the variable $_, to which the regex is bound by default.
There is also the switch -g, an alias for -0777, available starting with 5.36.0.
We still need the -n switch so that the program iterates over the "lines" of input (STDIN or a file). In this case the input record separator is undefined, so it's all just one "line".
The regex captures are returned since the match operator is in list context, being assigned to the array @pods.
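For readers who want to see the effect of (.*?) versus (.*) without Perl at hand, here is a rough Python sketch of the same idea. The SAMPLE text is a shortened, made-up stand-in for the kubectl output, and the names HEADER, FOOTER, lazy and greedy exist only for this illustration. re.DOTALL plays the role of the /s modifier, and re.findall returns the captures much like the match in list context does.

```python
import re

# Shortened stand-in for two "describe node" blocks (illustrative only).
SAMPLE = """\
PodCIDRs: 10.233.65.0/24
Non-terminated Pods: (7 in total)
  Namespace  Name
  default    foo
Allocated resources:
....some
PodCIDRs: 10.233.65.0/24
Non-terminated Pods: (7 in total)
  Namespace  Name
  default    foo-1
Allocated resources:
"""

HEADER = r"Non-terminated\s+Pods:\s+\([0-9]+\s+in\s+total\)\n"
FOOTER = r"\nAllocated resources:"

# Non-greedy capture: stops at the first footer after each header.
lazy = re.findall(HEADER + r"(.*?)" + FOOTER, SAMPLE, flags=re.DOTALL)

# Greedy capture: runs on to the last footer in the whole text.
greedy = re.findall(HEADER + r"(.*)" + FOOTER, SAMPLE, flags=re.DOTALL)

print("non-greedy blocks:", len(lazy))
print("greedy blocks:", len(greedy))
```

The greedy variant returns a single capture that swallows everything between the first header and the last footer, including the unrelated lines in between.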
Using GNU grep, you can use your regex with some tweaks:
kubectl describe node |
grep -zoP '(?s)Non-terminated Pods:.*?in total.\R\K(.*?)(?=Allocated resources)'
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-1 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp8 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
Used \K (match reset) after \R to remove that line from the output.
Used the -z option to treat input and output data as sequences of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline.
PS: The same regex will work with the second input as well, with the header line shown before each block.
Alternatively, you can use any version of sed for this job as well:
kubectl describe node |
sed -n '/Non-terminated Pods:.*in total.*/,/Allocated resources:/ {//!p;}'
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-1 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp8 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
With your shown samples, please try the following GNU awk code, written and tested in GNU awk. A simple explanation: RS is set to Non-terminated Pods:.*Allocated resources: for Input_file. Then, in the main program, if RT is not empty, the gsub function substitutes (^|\n)Non-terminated Pods:[^\n]*\n or \nAllocated resources:\n* with an empty string in the RT variable, and its value is printed, which gives the output as per the shown samples.
awk -v RS='Non-terminated Pods:.*Allocated resources:' '
RT{
gsub(/(^|\n)Non-terminated Pods:[^\n]*\n|\nAllocated resources:\n*/,"",RT)
print RT
}
' Input_file
A possible solution for very big files is to read the input line by line.
Select the range of lines of interest and drop the last one, which is not part of the desired output.
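use strict;
use warnings;

while(<>) {
    # The range operator is true from the header line through "Allocated resources:".
    # \s* tolerates whatever indentation precedes "Namespace" in the real output.
    if( /^\s*Namespace/ .. /^Allocated resources:/ ) {
        print unless /^Allocated resources:/;   # drop the closing line
    }
}

exit 0;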
Output
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-1 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp8 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default foo-1 0 (0%) 0 (0%) 0 (0%) 0 (0%) 105s
kube-system nginx-proxy-kube-worker-2 25m (1%) 0 (0%) 32M (1%) 0 (0%) 9m8s
kube-system nodelocaldns-xbjp3-2 100m (5%) 0 (0%) 70Mi (4%) 170Mi (10%) 7m4s
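The same line-by-line idea can be sketched in Python with a small generator that keeps only one flag of state, so memory use stays constant for large inputs. This is an illustrative equivalent rather than a translation of the Perl range operator; the function name pod_lines and the sample lines are made up for the example.

```python
def pod_lines(lines):
    """Yield the lines from each "Namespace" header up to, but not including,
    the following "Allocated resources:" line."""
    inside = False
    for line in lines:
        if not inside and line.lstrip().startswith("Namespace"):
            inside = True  # the header line opens a block
        if inside:
            if line.startswith("Allocated resources:"):
                inside = False  # the closing line ends the block and is skipped
                continue
            yield line


# Made-up, shortened input for illustration.
sample = [
    "PodCIDRs: 10.233.65.0/24",
    "Non-terminated Pods: (1 in total)",
    "  Namespace  Name",
    "  default    foo",
    "Allocated resources:",
    "....some",
]

selected = list(pod_lines(sample))
for line in selected:
    print(line)
```

Because pod_lines accepts any iterable of lines, it can be fed sys.stdin or an open file object directly instead of a list.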
When I start passenger, multiple processes have the same connection.
bundle exec passenger-status
Requests in queue: 0
* PID: 13830 Sessions: 0 Processed: 107 Uptime: 1h 24m 22s
CPU: 0% Memory : 446M Last used: 41s ago
* PID: 13909 Sessions: 0 Processed: 0 Uptime: 41s
CPU: 0% Memory : 22M Last used: 41s ago
ss -antp4 | grep ':3306 '
ESTAB 0 0 XXX.XXX.XXX.XXX:55488 XXX.XXX.XXX.XXX:3306 users:(("ruby",pid=13909,fd=14),("ruby",pid=13830,fd=14),("ruby",pid=4672,fd=14)) #<= 4672 is preloader process?
ESTAB 0 0 XXX.XXX.XXX.XXX:55550 XXX.XXX.XXX.XXX:3306 users:(("ruby",pid=13830,fd=24))
ESTAB 0 0 XXX.XXX.XXX.XXX:55552 XXX.XXX.XXX.XXX:3306 users:(("ruby",pid=13909,fd=24))
Is the connection using port 55488 correct?
I believe that inconsistencies occur when multiple processes refer to the same connection. But I can't find the problem in my application.
I am using Rails 4.x and passenger 6.0.2
I am trying to understand what series values stand for in Prometheus unit test.
The official doc does not provide any info.
For example, fire an alert if any instance is down over 10 seconds.
alerting-rules.yml
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 seconds."
alerting-rules.test.yml
rule_files:
  - alerting-rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: localhost:9090
              job: prometheus
            exp_annotations:
              summary: "Instance localhost:9090 down"
              description: "localhost:9090 of job prometheus has been down for more than 10 seconds."
Originally, I thought that because interval: 1m is 60 seconds and there are 15 numbers, 60 / 15 = 4s, so each value stands for 4 seconds (1 means up, 0 means down).
However, when the values are
values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
or
values: '1 1 1 1 1 1 1 1 1 0 0 0 0 0 0'
Both will pass the test when I run promtool test rules alerting-rules.test.yml.
But below will fail:
values: '1 1 1 1 1 1 1 1 1 1 0 0 0 0 0'
So my original idea that each number stands for 4s must be wrong. If that assumption were correct, the test would fail only with fewer than three 0s.
What do series values stand for in Prometheus unit test?
Your assumption is incorrect. The values are not spread across a single interval; each value is the sample the series has at successive interval steps, starting at time 0. For example, with interval: 1m:
values: 1 1 1 1 1 1
# 0m 1m 2m 3m 4m 5m
In your example you evaluate the alert at 10m (with eval_time), so what matters are the samples at and just before that time. Because the rule has for: 10s and rules are evaluated once per minute, up == 0 must already hold at the 9m evaluation (the tenth value) for the alert to be firing at 10m. When you change the tenth value to 1, the alert is only pending at 10m, so it does not fire as expected and the test fails.
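The interaction between the values, the evaluation interval and the for: clause can be played through with a small Python model. This is a simplified sketch, not Prometheus code: it assumes one sample and one rule evaluation per minute starting at time 0, ignores staleness handling, and treats the alert as firing once the condition has held for at least the for: duration. The function name alert_states is made up for the example.

```python
def alert_states(values, for_seconds, interval_seconds=60):
    """Return 'inactive', 'pending' or 'firing' for each evaluation step.

    Simplified model: values[i] is the sample at i * interval_seconds and the
    rule `up == 0` is evaluated at exactly those times.
    """
    states = []
    active_since = None
    for step, value in enumerate(values):
        now = step * interval_seconds
        if value == 0:  # the expression `up == 0` is true
            if active_since is None:
                active_since = now  # first evaluation at which it became true
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None
            states.append("inactive")
    return states


passing = [1] * 9 + [0] * 6    # zeros start at 9m
failing = [1] * 10 + [0] * 5   # zeros start at 10m

# State at eval_time: 10m (step 10) with for: 10s.
print(alert_states(passing, 10)[10])
print(alert_states(failing, 10)[10])
```

In this model the second series reaches firing only at 11m, which is consistent with the test at eval_time: 10m failing for it.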
I have a cluster with a CronJob that runs OK. I scheduled the CronJob to run every 3 minutes,
and I noticed that sometimes the jobs do not run at all for 6-9 minutes (two or three intervals).
This happens several times a day and I'm not sure why. How can I check what the problem might be?
Is there a way to overcome this?
We use k8s 1.14.7.
This is the CronJob.
I also tried changing the interval to 10 minutes and I still see this pattern, i.e. several times a day the job does not run for 20-30 minutes (2-3 intervals).
The job execution time is just 30 seconds and it does not run in parallel (it runs like a singleton job).
The logs (for the jobs that do run) don't show anything.
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: fdc-job
  namespace: {{required "Mon namespace variable '(.Values.mon.namespace)' is required" .Values.mon.namespace}}
spec:
  suspend: false
  schedule: "*/3 * * * *"
  successfulJobsHistoryLimit: 1
  failedJobsHistoryLimit: 1
  startingDeadlineSeconds: 10
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          serviceAccountName: mon-sa
          containers:
            - name: cluster-check
              image: {{required "A valid .Values.artifactory.host entry required!" .Values.artifactory.host}}/{{(.Values.mon.image.repository)}}:{{(.Values.mon.image.tag)}}
              args: ["fdc"]
          restartPolicy: Never
          activeDeadlineSeconds: 100
          imagePullSecrets:
            - name: docker-images-secret
Update
Running the command kubectl get pods returns the following for the last 40 minutes:
cluster-1569490560-b78nw 0/1 Completed 0 39m
cluster-1569490740-8gcwl 0/1 Completed 0 36m
cluster-1569490920-t9hwj 0/1 Completed 0 33m
cluster-1569491280-qz5sp 0/1 Completed 0 27m
cluster-1569491460-r2dwv 0/1 Completed 0 24m
cluster-1569491640-qn7r8 0/1 Completed 0 21m
cluster-1569492180-vkxcs 0/1 Completed 0 12m
cluster-1569492360-ksn7s 0/1 Completed 0 9m41s
cluster-1569492540-qqwwc 0/1 Completed 0 6m40s
cluster-1569492720-v2dr2 0/1 Completed 0 3m40s
As you can see, the job runs every 3 minutes, but the runs
at 15, 18 and 30 minutes ago are not shown because they were not executed. Any idea?
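The gaps can also be located mechanically from the pod names. The sketch below is a rough Python helper, not a kubectl feature. It assumes that the numeric part of each pod name is the scheduled time as a Unix timestamp in seconds, which the 180-second spacing between consecutive names suggests; the function names are made up for the example.

```python
import re

# Pod names copied from the kubectl get pods output above.
POD_NAMES = [
    "cluster-1569490560-b78nw",
    "cluster-1569490740-8gcwl",
    "cluster-1569490920-t9hwj",
    "cluster-1569491280-qz5sp",
    "cluster-1569491460-r2dwv",
    "cluster-1569491640-qn7r8",
    "cluster-1569492180-vkxcs",
    "cluster-1569492360-ksn7s",
    "cluster-1569492540-qqwwc",
    "cluster-1569492720-v2dr2",
]


def scheduled_times(pod_names):
    """Extract the numeric middle part of each name (assumed Unix seconds)."""
    times = []
    for name in pod_names:
        match = re.search(r"-(\d+)-[a-z0-9]+$", name)
        if match:
            times.append(int(match.group(1)))
    return sorted(times)


def missed_runs(times, step_seconds):
    """List the schedule slots between consecutive runs that have no pod."""
    missed = []
    for previous, current in zip(times, times[1:]):
        expected = previous + step_seconds
        while expected < current:
            missed.append(expected)
            expected += step_seconds
    return missed


missed = missed_runs(scheduled_times(POD_NAMES), 180)
print(missed)
```

For this listing the helper reports three empty slots, matching the three runs missing from the output.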
In addition, I've upgraded k8s to version 1.15.4, which didn't solve the problem.
The command kubectl get cronjobs returns
NAME SCHEDULE SUSPEND ACTIVE LAST SCHEDULE AGE
cluster */3 * * * * False 0 2m49s 10d
Any clue or direction will be very helpful...