I am trying to understand what the series values stand for in Prometheus unit tests. The official docs don't provide any info.
For example: fire an alert if any instance is down for over 10 seconds.
alerting-rules.yml
groups:
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 10s
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 10 seconds."
alerting-rules.test.yml
rule_files:
  - alerting-rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="prometheus", instance="localhost:9090"}'
        values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
    alert_rule_test:
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: localhost:9090
              job: prometheus
            exp_annotations:
              summary: "Instance localhost:9090 down"
              description: "localhost:9090 of job prometheus has been down for more than 10 seconds."
Originally, I thought that because interval: 1m is 60 seconds and there are 15 numbers, 60 / 15 = 4, so each value stands for 4 seconds (1 means up, 0 means down).
However, when the values are
values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
or
values: '1 1 1 1 1 1 1 1 1 0 0 0 0 0 0'
Both will pass the test when I run promtool test rules alerting-rules.test.yml.
But below will fail:
values: '1 1 1 1 1 1 1 1 1 1 0 0 0 0 0'
So my original thought that each number stands for 4 seconds is wrong. If my assumption were correct, the test should fail only when there are fewer than three trailing 0s.
What do series values stand for in Prometheus unit test?
Your assumption is incorrect. The numbers in values don't subdivide the interval; each number is the value the series holds at each successive interval, starting at time 0. For example:
values: 1 1 1 1 1 1
#       0m 1m 2m 3m 4m 5m
In your example, since you evaluate at 10m (with eval_time), the evaluation sees the sample at t = 10m, which is the eleventh value (index 10). The rule checks up == 0 with for: 10s, so up must already have been 0 for at least 10 seconds before the evaluation. When you change the tenth value (t = 9m) to 1, the alert is still pending at 10m instead of firing as expected, so the test fails.
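A small Python sketch of the mapping (my reading of promtool's behavior, not taken from its source): sample i of a series is written at t = i × interval, starting at t = 0.

```python
def expand_series(values, interval_s):
    # Sample i is written at t = i * interval (assumption stated above)
    return [(i * interval_s, float(v)) for i, v in enumerate(values.split())]

samples = expand_series("1 1 1 1 1 1 1 1 1 1 0 0 0 0 0", 60)
assert samples[0] == (0, 1.0)     # first sample sits at t = 0
assert samples[10] == (600, 0.0)  # the sample seen at eval_time 10m
```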
Related
I have been writing unit tests for my Prometheus alerts and I have just increased the interval range in my alert, therefore I need to modify my current test. This is my modified test:
- interval: 15m
  # Series data.
  input_series:
    - series: 'some_bucket{service_name="some-service", le="1000"}'
      values: 6 6 6 6 6 6 6
    - series: 'some_bucket{service_name="some-service", le="10000"}'
      values: 10 11 12 13 14 14 14
    - series: 'some_bucket{service_name="some-service", le="+Inf"}'
      values: 10 100 200 300 400 500 600
  alert_rule_test:
    - eval_time: 5m
      alertname: someName
      exp_alerts: []
    - eval_time: 15m
      alertname: someName
      exp_alerts:
        - exp_labels:
            severity: error
            service_name: some-service
          exp_annotations:
            summary: "a summary"
            description: "a description"
and my alert rule is:
histogram_quantile(0.95, sum by(le) (rate(some_bucket{service_name="some-service"}[15m]))) >= 1000
The test works fine: it does not trigger at the eval_time of 5 minutes and it does when it hits the correct interval. My question is about the interval set at the top:
- interval: 15m
My understanding is that this should be the scrape interval, but if I change it to 1m the test fails. Why is that? Does it mean that my time series / input data needs changing?
Thank you
The given interval is not the scrape interval per se but the time between the values in the series.
Setting interval to 15m means that your series (with seven entries each, so six gaps between them) define data for 6 × 15 = 90 minutes.
Setting it to 1m means that after six minutes your test data runs out. I couldn't find this behavior described in any documentation, but I guess it is either undefined or treated as a missing value.
The following test passes with interval: 15m. Setting it to 1m breaks the test, and you can see that you get 'nil' as values for the buckets.
evaluation_interval: 1m

tests:
  - interval: 15m
    # Series data.
    input_series:
      - series: 'some_bucket{service_name="some-service", le="1000"}'
        values: 6 6 6 6 6 6 6
      - series: 'some_bucket{service_name="some-service", le="10000"}'
        values: 10 11 12 13 14 14 14
      - series: 'some_bucket{service_name="some-service", le="+Inf"}'
        values: 10 100 200 300 400 500 600
    promql_expr_test:
      - expr: histogram_quantile(0.95, sum by(le) (rate(some_bucket{service_name="some-service"}[15m])))
        eval_time: 15m
        exp_samples:
          - value: 10000
      - expr: some_bucket
        eval_time: 16m
        exp_samples:
          - labels: 'some_bucket{service_name="some-service",le="1000"}'
            value: 6
          - labels: 'some_bucket{service_name="some-service",le="10000"}'
            value: 11
          - labels: 'some_bucket{service_name="some-service",le="+Inf"}'
            value: 100
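The arithmetic behind the "6 x 15 = 90 minutes" above can be made explicit (same assumption as before: sample i sits at t = i × interval):

```python
def series_span_minutes(n_values, interval_min):
    # n samples leave n - 1 gaps, so the last sample sits at (n - 1) * interval
    return (n_values - 1) * interval_min

assert series_span_minutes(7, 15) == 90  # data all the way to t = 90m
assert series_span_minutes(7, 1) == 6    # data runs out after t = 6m
```

With interval: 1m, a query at eval_time: 16m looking back 15 minutes finds no samples at all, which is why the buckets come back as nil.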
In Stata I have a list of subjects and contributions from an economic experiment.
There are multiple rounds being played for each treatment. Now I want to keep track of those who contributed in the first period and give them either 1 if a contributor or 0 if a defector. The game is played for multiple periods, but I only really care about the first round. My current code looks like this
g firstroundcont = 0
replace firstroundcont = 1 if c>0 & period==1
This however results in everyone getting a 0 for every subsequent period meaning that they are not "identified" as either a "first round" contributor or a defector for all other periods in the dataset. The table below shows a snippet of how my data looks and how the variable firstroundcont should look.
sessionID  period  subject  group  contribution  firstroundcont
1          1       1        1      4             1
1          1       2        1      0             0
1          1       3        1      2             1
1          1       4        2      10            1
1          1       5        2      0             0
1          1       6        2      0             0
1          2       1        1      0             1
1          2       2        1      5             0
1          2       3        1      0             1
#JR96 is right: this sorely and surely needs a data example. But I guess you want something with the flavour of
bysort id (period) : gen wanted = c[1] > 0
See https://www.stata.com/support/faqs/data-management/creating-dummy-variables/ and https://www.stata-journal.com/article.html?article=dm0099 for more on how to get indicators in one step. The business of generating with 0 and then replacing with 1 can usually be cut to a direct one-line statement.
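The grouped "first value" logic behind that one-liner is easy to mimic outside Stata; a plain-Python sketch using the question's first three subjects (illustrative only, not Stata):

```python
rows = [  # (subject, period, contribution), from the question's table
    (1, 1, 4), (1, 2, 0),
    (2, 1, 0), (2, 2, 5),
    (3, 1, 2), (3, 2, 0),
]

# For each subject, take the contribution in their earliest period,
# mirroring bysort subject (period) : ... c[1]
first_contrib = {}
for subject, period, c in sorted(rows, key=lambda r: (r[0], r[1])):
    first_contrib.setdefault(subject, c)

# Flag every row of a subject by whether their first-period contribution > 0
firstroundcont = [int(first_contrib[s] > 0) for s, _, _ in rows]
assert firstroundcont == [1, 1, 0, 0, 1, 1]
```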
I have this alert, which I am trying to cover with unit tests:
- alert: Alert name
  annotations:
    summary: 'Summary.'
    book: "https://link.com"
  expr: sum(increase(app_receiver{app="app_name", exception="exception"}[1m])) > 0
  for: 5m
  labels:
    severity: 'critical'
    team: 'myteam'
This test scenario fails every time unless for: 5m is commented out in the rule; in that case it passes.
rule_files:
  - ./../../rules/test.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'app_receiver{app="app_name", exception="exception"}'
        values: '0 0 0 0 0 0 0 0 0 0'
      - series: 'app_receiver{app="app_name", exception="exception"}'
        values: '0 0 0 0 0 10 20 40 60 80'
    alert_rule_test:
      - alertname: Alert name
        eval_time: 5m
        exp_alerts:
          - exp_labels:
              severity: 'critical'
              team: 'myteam'
            exp_annotations:
              summary: 'Summary.'
              book: "https://link.com"
The result of this test:
FAILED:
alertname:Alert name, time:5m,
exp:"[Labels:{alertname=\"Alert name\", severity=\"critical\", team=\"myteam\"} Annotations:{book=\"https://link.com\", summary=\"Summary.\"}]",
got:"[]
Can someone please help me fix this test and explain the reason for the failure?
Given that I have the following output:
Loopback1 is up, line protocol is up
Hardware is Loopback
Description: ** NA4-ISIS-MGMT-LOOPBACK1_MPLS **
Internet address is 84.116.226.27/32
MTU 1514 bytes, BW 8000000 Kbit, DLY 5000 usec,
reliability 255/255, txload 1/255, rxload 1/255
Encapsulation LOOPBACK, loopback not set
Keepalive set (10 sec)
Last input 12w3d, output never, output hang never
Last clearing of "show interface" counters never
Input queue: 0/75/0/0 (size/max/drops/flushes); Total output drops: 0
Queueing strategy: fifo
Output queue: 0/0 (size/max)
5 minute input rate 0 bits/sec, 0 packets/sec
5 minute output rate 0 bits/sec, 0 packets/sec
0 packets input, 0 bytes, 0 no buffer
Received 0 broadcasts (0 IP multicasts)
0 runts, 0 giants, 0 throttles
0 input errors, 0 CRC, 0 frame, 0 overrun, 0 ignored, 0 abort
6 packets output, 456 bytes, 0 underruns
0 output errors, 0 collisions, 0 interface resets
0 output buffer failures, 0 output buffers swapped out
How can I match "Loopback1" and not "Loopback"?
In other words, how can I match the interface name only if there is a number next to it, in Tcl?
Use a lookahead:
Loopback(?=\d+)
It matches only the Loopback part when it is followed by one or more digits (the digits themselves are not consumed). If you want to match the name together with the number, use Loopback\d+
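Tcl's regexp engine supports the same (?=...) lookahead syntax; here is a quick check of both patterns in Python's re, used only because it is handy to run:

```python
import re

line = "Loopback1 is up, line protocol is up"
hw   = "Hardware is Loopback"

# Lookahead: match "Loopback" only when digits follow, without consuming them
assert re.search(r'Loopback(?=\d+)', line).group() == 'Loopback'
assert re.search(r'Loopback(?=\d+)', hw) is None

# Without the lookahead, the digits are part of the match
assert re.search(r'Loopback\d+', line).group() == 'Loopback1'
```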
I have a Django view which creates 500-5000 new database INSERTS in a loop. Problem is, it is really slow! I'm getting about 100 inserts per minute on Postgres 8.3. We used to use MySQL on lesser hardware (smaller EC2 instance) and never had these types of speed issues.
Details:
Postgres 8.3 on Ubuntu Server 9.04.
Server is a "large" Amazon EC2 with database on EBS (ext3) - 11GB/20GB.
Here is some of my postgresql.conf -- let me know if you need more
shared_buffers = 4000MB
effective_cache_size = 7128MB
My python:
for k in kw:
    k = k.lower()
    p = ProfileKeyword(profile=self)
    logging.debug(k)
    p.keyword, created = Keyword.objects.get_or_create(keyword=k, defaults={'keyword': k})
    if not created and ProfileKeyword.objects.filter(profile=self, keyword=p.keyword).count():
        # checking created is just a small optimization to save some database hits on new keywords
        pass  # duplicate entry
    else:
        p.save()
Some output from top:
top - 16:56:22 up 21 days, 20:55, 4 users, load average: 0.99, 1.01, 0.94
Tasks: 68 total, 1 running, 67 sleeping, 0 stopped, 0 zombie
Cpu(s): 5.8%us, 0.2%sy, 0.0%ni, 90.5%id, 0.7%wa, 0.0%hi, 0.0%si, 2.8%st
Mem: 15736360k total, 12527788k used, 3208572k free, 332188k buffers
Swap: 0k total, 0k used, 0k free, 11322048k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
14767 postgres 25 0 4164m 117m 114m S 22 0.8 2:52.00 postgres
1 root 20 0 4024 700 592 S 0 0.0 0:01.09 init
2 root RT 0 0 0 0 S 0 0.0 0:11.76 migration/0
3 root 34 19 0 0 0 S 0 0.0 0:00.00 ksoftirqd/0
4 root RT 0 0 0 0 S 0 0.0 0:00.00 watchdog/0
5 root 10 -5 0 0 0 S 0 0.0 0:00.08 events/0
6 root 11 -5 0 0 0 S 0 0.0 0:00.00 khelper
7 root 10 -5 0 0 0 S 0 0.0 0:00.00 kthread
9 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenwatch
10 root 10 -5 0 0 0 S 0 0.0 0:00.00 xenbus
18 root RT -5 0 0 0 S 0 0.0 0:11.84 migration/1
19 root 34 19 0 0 0 S 0 0.0 0:00.01 ksoftirqd/1
Let me know if any other details would be helpful.
One common reason for slow bulk operations like this is each insert happening in its own transaction. If you can get all of them to happen in a single transaction, it could go much faster.
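The effect is easy to demonstrate with Python's stdlib sqlite3 (not Postgres, but the per-statement commit overhead is the same pattern): grouping all the INSERTs under one transaction means a single COMMIT instead of thousands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE kw (k TEXT)")

rows = [("kw%d" % i,) for i in range(5000)]
with conn:  # one transaction: a single COMMIT covers all 5000 INSERTs
    conn.executemany("INSERT INTO kw VALUES (?)", rows)

assert conn.execute("SELECT COUNT(*) FROM kw").fetchone()[0] == 5000
```

In Django, wrapping the loop in a single transaction (transaction.atomic() on current versions; transaction.commit_on_success on versions contemporary with this question) has the same effect.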
Firstly, ORM operations are always going to be slower than pure SQL. I once wrote an update to a large database in ORM code and set it running, but quit it after several hours when it had completed only a tiny fraction. After rewriting it in SQL the whole thing ran in less than a minute.
Secondly, bear in mind that your code here is doing up to four separate database operations for every row in your data set - the get in get_or_create, possibly also the create, the count on the filter, and finally the save. That's a lot of database access.
Bearing in mind that a maximum of 5000 objects is not huge, you should be able to read the whole dataset into memory at the start. Then you can do a single filter to get all the existing Keyword objects in one go, saving a huge number of queries in the Keyword get_or_create and also avoiding the need to instantiate duplicate ProfileKeywords in the first place.
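A runnable sketch of that plan, with the two ORM lookups replaced by plain sets so the logic stands alone. In real code, existing_keywords and existing_links would each come from one filter(...__in=...) query and the inserts from bulk creates; the names here are illustrative:

```python
def plan_inserts(incoming, existing_keywords, existing_links):
    incoming = {k.lower() for k in incoming}             # dedupe up front
    new_keywords = incoming - existing_keywords          # one bulk INSERT for Keyword
    new_links = incoming - existing_links                # one bulk INSERT for ProfileKeyword
    return new_keywords, new_links

new_kw, new_links = plan_inserts(
    ["Django", "postgres", "django"],  # duplicates collapse in the set
    existing_keywords={"django"},
    existing_links=set(),
)
assert new_kw == {"postgres"}
assert new_links == {"django", "postgres"}
```

Two queries and two bulk inserts total, instead of up to four round trips per keyword.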