Prometheus series values for time metrics - unit-testing

I'm defining a data series for testing a Prometheus alert using the container_last_seen metric from the cadvisor exporter.
How do I enter timestamp series values, as returned by the container_last_seen metric? I'm testing the Prometheus alerts on an Apple Mac; in production they run on Linux boxes.
Here's one thing I tried:
input_series:
  - series: |
      container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
It seems that whatever I put in the values for the series is not accepted.
I've also tried durations, e.g. '0h+1mx60'.
Since time() - container_last_seen{...} is a legal expression, container_last_seen is clearly a timestamp, and I would expect a timestamp to be represented as a Unix epoch number. Executing the query on Prometheus does return Unix epoch times, but putting those numbers in a series is rejected with the error below.
promtool is recognising the different types but giving much the same error:
➜ promtool test rules alertrules-service-oriented-test.yml
Unit Testing: alertrules-service-oriented-test.yml
FAILED:
1:1: parse error: unexpected number "0" in series values
If the values are '1h+0mx61', promtool correctly identifies the values as durations:
1:1: parse error: unexpected duration "1h" in series values
Note that when this test is commented out, there is no 1:1: parse error and the tests complete successfully, so the problem is not hiding in some out-of-sight part of the test file.
Thanks for any insights.
Here's the alert:
alertrules.yaml:
groups:
- name: containers
  interval: 15s
  rules:
  - alert: prod_container_crashing
    expr: |
      count by (instance, container_label_com_docker_swarm_service_name)
      (
        count_over_time(container_last_seen{container_label_com_docker_swarm_service_name!="",env="prod"}[15m])
      ) - 1 > 2
    for: 5m
    labels:
      service: prod
      type: container
      severity: critical
    annotations:
      summary: "prod {{ $labels.container_label_com_docker_swarm_service_name }}"
      description: "{{ $labels.container_label_com_docker_swarm_service_name }} in prod cluster on {{ $labels.instance }} is crashing"
and here's the test file:
alertrules_test.yml:
rule_files:
  - alertrules.yml
evaluation_interval: 1m
tests:
  - name: container_tests
    interval: 15s
    input_series:
      - series: |
          container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
        values: '1563968832+0x61'
    alert_rule_test:
      - eval_time: 15m
        alertname: prod_container_crashing
        exp_alerts:
          - exp_labels:
              service: prod
              type: container
              severity: critical
            exp_annotations:
              summary: prod service1
              description: service1 in prod cluster on 10.0.0.1 is crashing

When the series: value is all on one line, without a > or | YAML block scalar indicator, e.g.
  - series: container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
the error goes away, though I don't know why. So this doesn't appear to be a data-typing issue.
It's a shame for readability reasons; either Prometheus or the Go YAML library may have a squeaky wheel in its implementation.
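For reference, here is a minimal sketch of the working one-line form in context (same metric, labels, and values as in the test file above):
input_series:
  # One-line form, no '|' block scalar: this is the variant promtool accepted.
  # '1563968832+0x61' expands to the same epoch value repeated at each step (increment 0).
  - series: container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'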

Related

How to set table format as default in google cloud shell?

When I try to list anything, my result is not grouped as a table (as in the video). Each region is listed separately with its descriptions, something like this:
NAME: us-west3
CPUS: 0/24
DISKS_GB: 0/4096
ADDRESSES: 0/8
RESERVED_ADDRESSES: 0/8
STATUS: UP
TURNDOWN_DATE:
NAME: us-west4
CPUS: 0/24
DISKS_GB: 0/4096
ADDRESSES: 0/8
RESERVED_ADDRESSES: 0/8
STATUS: UP
TURNDOWN_DATE:
Please try:
gcloud config set accessibility/screen_reader False
And then repeat the command.

Set Alertmanager to distribute alerts to different channels by job name

I want to send my alerts to two different distribution lists in Alertmanager for Prometheus. The only way to distinguish my alerts is by their job name.
My alerts look like the ones below:
sample1:
Labels
alertname = SyslogErrors
instance = 22.32.23.32:2324
job = my-job-sample-service-dev
message = Exception raised during message subscription. Trying again in 60 seconds
monitor = server1
severity = critical
Annotations
description = Errors have been found for my-job-sample-service-dev application in /data/logs/messages/my-job-sample-service-dev syslog file
Source
sample2:
Labels
alertname = SyslogErrors
instance = 22.32.23.32:2324
job = my-job-sample-service-pre-dev
message = Exception raised during message subscription. Trying again in 60 seconds
monitor = server1
severity = critical
Annotations
description = Errors have been found for my-job-sample-service-pre-dev application in /data/logs/messages/my-job-sample-service-pre-dev syslog file
Source
Here is my sample Alertmanager config file:
global:
  smtp_smarthost: 'mail.server.com:25'
  smtp_from: 'dev@server.com'
  smtp_require_tls: false
templates:
  - '/etc/alertmanager/template/*.tmpl'
route:
  receiver: mail-receiver-dev
  group_by: ['alertname']
  group_wait: 3s
  group_interval: 5s
  repeat_interval: 1h
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
    - receiver: 'mail-pre-dev'
      group_wait: 10s
      match_re:
        - job = .*pre-dev.*
    - receiver: 'mail-dev'
      group_wait: 10s
      match_re:
        - job = .*dev.*
receivers:
  - name: 'mail-dev'
    email_configs:
      - to: 'dev-group@server.com'
        send_resolved: true
  - name: 'mail-pre-dev'
    email_configs:
      - to: 'pre-dev-group@server.com'
        send_resolved: true
I am using the below link as a reference:
reference
Testing config file link
Test label set for the tool linked above: {service="foo-service",severity="critical",job="my-job-sample-service-dev"}
So the question is: how do I send alerts to different channels by matching the job name with a regex? At the moment, when I test, all the alerts go to pre-dev.
Change the following:
match_re:
  - job = .*pre-dev.*
To:
matchers:
  - job =~ ".*pre-dev.*"
Note: match_re is deprecated in favour of matchers, but if you want to keep using it, the correct syntax is a map of label name to regex:
match_re:
  job: ".*pre-dev.*"
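Applied to the routes section of your config, a sketch might look like this (receiver names as in the original config; the pre-dev route is listed first because the .*dev.* regex also matches the pre-dev job names and Alertmanager dispatches to the first child route that matches):
route:
  receiver: mail-receiver-dev
  group_by: ['alertname']
  routes:
    # Child routes are evaluated in order; keep the more specific
    # pre-dev matcher ahead of the broader dev one.
    - receiver: 'mail-pre-dev'
      group_wait: 10s
      matchers:
        - job =~ ".*pre-dev.*"
    - receiver: 'mail-dev'
      group_wait: 10s
      matchers:
        - job =~ ".*dev.*"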

Error when deploying google cloud function - run out of memory?

I have used the following deployment for the example code used in the tutorial for a Google Cloud Function. The function should simply print the statements below when a new item is added to my bucket (which happens every half hour).
Example code (file is also called hello_gs.py):
def hello_gcs(event, context):
    print('Event ID: {}'.format(context.event_id))
    print('Event type: {}'.format(context.event_type))
    print('Bucket: {}'.format(event['bucket']))
    print('File: {}'.format(event['name']))
    print('Metageneration: {}'.format(event['metageneration']))
    print('Created: {}'.format(event['timeCreated']))
    print('Updated: {}'.format(event['updated']))
I deploy it with:
gcloud functions deploy hello_gcs \
--trigger-resource bucket1 \
--trigger-event google.storage.object.finalize
I get the following error in my logs:
insertId: "000000-f7b8ac5b-61f2-4d37-902a-b21ab56372c9"
labels: {1}
logName: "projects/project-name-v2/logs/cloudfunctions.googleapis.com%2Fcloud-functions"
receiveTimestamp: "2021-10-20T11:38:19.093774441Z"
resource: {2}
severity: "ERROR"
textPayload: "Function cannot be initialized. Error: memory limit exceeded.
"
timestamp: "2021-10-20T11:38:18.112056018Z"
And yet the function is so simple and small that I find this hard to understand. Any ideas what I am doing wrong here? Any help would be appreciated.

Task result to a file

I have a simple playbook that runs Cisco NX-OS commands, and the playbook ran successfully.
I would like to know how to save all the results into a file, regardless of how many hosts I have, and how to use a Survey to input the filename.
Currently, here is my code:
---
- name: run multiple commands on remote nodes
  nxos_command:
    commands:
      - show clock
      - show int status
      - show cdp neigh
      - show int desc
      - show port-channel summ
      - show vpc
      - show vpc role
I also tried this code:
---
- name: run multiple commands on remote nodes
  register: myshell_output
  nxos_command:
    commands:
      - show clock
      - show int status
      - show cdp neigh
      - show int desc
      - show port-channel summ
      - show vpc
      - show vpc role

- name: Saving data to local file
  copy:
    content: "{{ myshell_output.stdout|join('\n') }}"
    dest: "/tmp/hello.txt"
  delegate_to: localhost
It gives me an error:
FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'stdout'\n\nThe error appears to be in '/tmp/awx_1869_7__9l_9l/project/roles/bcpcommands/tasks/main.yml': line 3, column 3, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n- name: run multiple commands on remote nodes\n ^ here\n"}
I normally limit the hosts via the Ansible Tower LIMIT field.
Ideally, could the output file also include the hostname and the commands that I keyed in?
Thanks
You probably got the indentation wrong. Try:
---
- hosts: my_host
  tasks:
    - name: run multiple commands on remote nodes
      nxos_command:
        commands:
          - show clock
          - show int status
          - show cdp neigh
          - show int desc
          - show port-channel summ
          - show vpc
          - show vpc role
      register: myshell_output

    - debug:
        msg: "{{ myshell_output }}"

    - name: Saving data to local file and include hostname
      copy:
        content: "{{ myshell_output.stdout|join('\n') }} hostname: {{ inventory_hostname }}"
        dest: "/tmp/hello.txt"
      delegate_to: localhost
Edit the hostname (my_host) to match your inventory. Check the debug output: the registered variable must contain an stdout field, otherwise the copy task will fail.
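If you want one result file per host instead of a single /tmp/hello.txt (which each host would overwrite in turn), a sketch of the copy task could put the hostname in the destination path; the /tmp location and the per-host file name here are assumptions, adjust to taste:
    - name: Save each host's output to its own file on the controller
      copy:
        # One file per inventory host, e.g. /tmp/switch01.txt (hypothetical path)
        content: |
          hostname: {{ inventory_hostname }}
          {{ myshell_output.stdout | join('\n') }}
        dest: "/tmp/{{ inventory_hostname }}.txt"
      delegate_to: localhost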

Connecting cassandra-stress to AWS Keyspaces

I've provisioned a keyspace on AWS and, in order to make sure it can achieve our desired performance, I'm trying to run the cassandra-stress tool against it and compare it with other architectures we're experimenting with.
I managed to connect to it using the following cqlshrc:
[connection]
port = 9142
factory = cqlshlib.ssl.ssl_transport_factory
[ssl]
validate = true
certfile = /root/.cassandra/AmazonRootCA1.pem
And the following command (hoping that Python 3 support arrives soon enough; the development was completed this February according to their Jira ticket):
cqlsh cassandra.eu-central-1.amazonaws.com 9142 -u "myuser-at-722222222222" -p "12/12ZmHmtD1klsDk9cgqt/XXXXXXXXxUz6Sy687z/U=" --ssl --cqlversion="3.4.4"
Surprisingly or not, when using the official AWS guides things tend to work.
So I went on and tried connecting the cassandra-stress tool (I have it inside a Docker container, I'd rather keep my OS Java free) to the same Keyspace.
First I converted the AWS AmazonRootCA1.pem into cassandra_truststore.jks using the following commands (explained here):
openssl x509 -outform der -in AmazonRootCA1.pem -out temp_file.der
keytool -import -alias cassandra -keystore cassandra_truststore.jks -file temp_file.der
Now when I'm trying to run the actual tool like this:
./cassandra-stress write -node cassandra.eu-central-1.amazonaws.com -port native=9142 thrift=9142 jmx=9142 -transport truststore=/root/.cassandra/cassandra_truststore.jks truststore-password=mypassword -mode native cql3 user="myuser-at-722222222222" password="12/12ZmHmtD1klsDk9cgqt/XXXXXXXXxUz6Sy687z/U="
I'm getting the following error:
******************** Stress Settings ********************
Command:
Type: write
Count: -1
No Warmup: false
Consistency Level: LOCAL_ONE
Target Uncertainty: 0.020
Minimum Uncertainty Measurements: 30
Maximum Uncertainty Measurements: 200
Key Size (bytes): 10
Counter Increment Distibution: add=fixed(1)
Rate:
Auto: true
Min Threads: 4
Max Threads: 1000
Population:
Sequence: 1..1000000
Order: ARBITRARY
Wrap: true
Insert:
Revisits: Uniform: min=1,max=1000000
Visits: Fixed: key=1
Row Population Ratio: Ratio: divisor=1.000000;delegate=Fixed: key=1
Batch Type: not batching
Columns:
Max Columns Per Key: 5
Column Names: [C0, C1, C2, C3, C4]
Comparator: AsciiType
Timestamp: null
Variable Column Count: false
Slice: false
Size Distribution: Fixed: key=34
Count Distribution: Fixed: key=5
Errors:
Ignore: false
Tries: 10
Log:
No Summary: false
No Settings: false
File: null
Interval Millis: 1000
Level: NORMAL
Mode:
API: JAVA_DRIVER_NATIVE
Connection Style: CQL_PREPARED
CQL Version: CQL3
Protocol Version: V4
Username: myuser-at-722222222222
Password: *suppressed*
Auth Provide Class: null
Max Pending Per Connection: 128
Connections Per Host: 8
Compression: NONE
Node:
Nodes: [cassandra.eu-central-1.amazonaws.com]
Is White List: false
Datacenter: null
Schema:
Keyspace: keyspace1
Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
Replication Strategy Pptions: {replication_factor=1}
Table Compression: null
Table Compaction Strategy: null
Table Compaction Strategy Options: {}
Transport:
factory=org.apache.cassandra.thrift.TFramedTransportFactory; truststore=/root/.cassandra/cassandra_truststore.jks; truststore-password=mypassword; keystore=null; keystore-password=null; ssl-protocol=TLS; ssl-alg=SunX509; store-type=JKS; ssl-ciphers=TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA;
Port:
Native Port: 9142
Thrift Port: 9142
JMX Port: 9142
Send To Daemon:
*not set*
Graph:
File: null
Revision: unknown
Title: null
Operation: WRITE
TokenRange:
Wrap: false
Split Factor: 1
java.lang.RuntimeException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra.eu-central-1.amazonaws.com/3.127.48.183:9142 (com.datastax.driver.core.exceptions.TransportException: [cassandra.eu-central-1.amazonaws.com/3.127.48.183] Channel has been closed))
at org.apache.cassandra.stress.settings.StressSettings.getJavaDriverClient(StressSettings.java:220)
at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpacesNative(SettingsSchema.java:79)
at org.apache.cassandra.stress.settings.SettingsSchema.createKeySpaces(SettingsSchema.java:69)
at org.apache.cassandra.stress.settings.StressSettings.maybeCreateKeyspaces(StressSettings.java:228)
at org.apache.cassandra.stress.StressAction.run(StressAction.java:57)
at org.apache.cassandra.stress.Stress.run(Stress.java:143)
at org.apache.cassandra.stress.Stress.main(Stress.java:62)
Caused by: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: cassandra.eu-central-1.amazonaws.com/3.127.48.183:9142 (com.datastax.driver.core.exceptions.TransportException: [cassandra.eu-central-1.amazonaws.com/3.127.48.183] Channel has been closed))
at com.datastax.driver.core.ControlConnection.reconnectInternal(ControlConnection.java:233)
at com.datastax.driver.core.ControlConnection.connect(ControlConnection.java:79)
at com.datastax.driver.core.Cluster$Manager.init(Cluster.java:1424)
at com.datastax.driver.core.Cluster.getMetadata(Cluster.java:403)
at org.apache.cassandra.stress.util.JavaDriverClient.connect(JavaDriverClient.java:160)
at org.apache.cassandra.stress.settings.StressSettings.getJavaDriverClient(StressSettings.java:211)
... 6 more
I've tried changing some parameters, such as the JKS password (just in case I had it wrong), but I got a different error message, so that's probably not the issue.
Did I miss something?
Try using TLP Stress instead.
tlp-stress run RandomPartitionAccess -d 10m --host cassandra.us-east-1.amazonaws.com --port 9142 --username alice --password fLyWYFlTCD5J2gzGAZ --ssl --max-requests 4000 --dc us-east-2 --threads 10
https://thelastpickle.com/tlp-stress/