Set Alertmanager to distribute alerts to different channels by job name (regex)

I want to send my alerts to two different distribution lists in Alertmanager for Prometheus. The only way to distinguish my alerts is by their job name.
My alerts look like the samples below:
sample1:
Labels
alertname = SyslogErrors
instance = 22.32.23.32:2324
job = my-job-sample-service-dev
message = Exception raised during message subscription. Trying again in 60 seconds
monitor = server1
severity = critical
Annotations
description = Errors have been found for my-job-sample-service-dev application in /data/logs/messages/my-job-sample-service-dev syslog file
sample2:
Labels
alertname = SyslogErrors
instance = 22.32.23.32:2324
job = my-job-sample-service-pre-dev
message = Exception raised during message subscription. Trying again in 60 seconds
monitor = server1
severity = critical
Annotations
description = Errors have been found for my-job-sample-service-pre-dev application in /data/logs/messages/my-job-sample-service-pre-dev syslog file
Here is my sample Alertmanager config file:
global:
  smtp_smarthost: 'mail.server.com:25'
  smtp_from: 'dev@server.com'
  smtp_require_tls: false

templates:
  - '/etc/alertmanager/template/*.tmpl'

route:
  receiver: mail-receiver-dev
  group_by: ['alertname']
  group_wait: 3s
  group_interval: 5s
  repeat_interval: 1h
  # All alerts that do not match the following child routes
  # will remain at the root node and be dispatched to 'default-receiver'.
  routes:
    - receiver: 'mail-pre-dev'
      group_wait: 10s
      match_re:
        - job = .*pre-dev.*
    - receiver: 'mail-dev'
      group_wait: 10s
      match_re:
        - job = .*dev.*

receivers:
  - name: 'mail-dev'
    email_configs:
      - to: 'dev-group@server.com'
        send_resolved: true
  - name: 'mail-pre-dev'
    email_configs:
      - to: 'pre-dev-group@server.com'
        send_resolved: true
I am using the link below as a reference:
reference
and this tool to test the config file:
Testing config file link
Test input for the testing tool: {service="foo-service",severity="critical",job="my-job-sample-service-dev"}
So the question is: how do I send alerts to different channels using a regex on the job label? At the moment, when I test, all the alerts go to pre-dev.

Change the following:

match_re:
  - job = .*pre-dev.*

To:

matchers:
  - job =~ ".*pre-dev.*"

Note: "match_re" is deprecated and should be replaced by "matchers", but if you still want to use it, the correct syntax is a mapping of label name to regex:

match_re:
  job: ".*pre-dev.*"
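For completeness, here is a minimal sketch of the corrected child routes, assuming the same receiver names as in the question. The pre-dev route should stay first: child routes are evaluated in order and the first match wins, so alerts whose job contains "pre-dev" are caught before the broader ".*dev.*" matcher, while plain dev alerts fall through to the second route.

routes:
  - receiver: 'mail-pre-dev'
    group_wait: 10s
    matchers:
      - job =~ ".*pre-dev.*"
  - receiver: 'mail-dev'
    group_wait: 10s
    matchers:
      - job =~ ".*dev.*"

If amtool is available, you can also check which receiver a given label set would hit without sending a real alert, for example:

amtool config routes test --config.file=alertmanager.yml severity=critical job=my-job-sample-service-pre-dev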

Related

Prometheus series values for time metrics

I'm defining a data series for testing a Prometheus alert using the container_last_seen metric from the cadvisor exporter.
How do I enter timestamp series values, as returned by the container_last_seen metric? I'm testing the Prometheus alerts on an Apple Mac; in production they run on Linux boxes.
Here's one thing I tried:
input_series:
  - series: |
      container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
It seems whatever I put in the values for the series is not accepted.
I've also tried durations: '0h+1mx60'
Since time() - container_last_seen{...} is a legal expression, container_last_seen is definitely a timestamp, and I would expect a timestamp to be represented by a Unix epoch number. Executing the query in Prometheus gives Unix epoch times, but putting such numbers in a series is rejected with the error below.
promtool is recognising the different types but giving much the same error:
➜ promtool test rules alertrules-service-oriented-test.yml
Unit Testing: alertrules-service-oriented-test.yml
FAILED:
1:1: parse error: unexpected number "0" in series values
If the values are '1h+0mx61', promtool correctly identifies the values as durations:
1:1: parse error: unexpected duration "1h" in series values
Note that when this test is commented out, there is no 1:1: parse error and the tests complete successfully, so the problem is not hiding in some out-of-sight part of the test file.
Thanks for any insights.
Here's the alert:
alertrules.yaml:
- name: containers
  interval: 15s
  rules:
    - alert: prod_container_crashing
      expr: |
        count by (instance, container_label_com_docker_swarm_service_name)
        (
          count_over_time(container_last_seen{container_label_com_docker_swarm_service_name!="",env="prod"}[15m])
        ) - 1 > 2
      for: 5m
      labels:
        service: prod
        type: container
        severity: critical
      annotations:
        summary: "pdce {{ $labels.container_label_com_docker_swarm_service_name }}"
        description: "{{ $labels.container_label_com_docker_swarm_service_name }} in prod cluster on {{ $labels.instance }} is crashing"
and here's the test file:
alertrules_test.yml:
rule_files:
  - alertrules.yml

evaluation_interval: 1m

tests:
  - name: container_tests
    interval: 15s
    input_series:
      - series: |
          container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
        values: '1563968832+0x61'
    alert_rule_test:
      - eval_time: 15m
        alertname: prod_container_crashing
        exp_alerts:
          - exp_labels:
              service: prod
              type: container
              severity: critical
            exp_annotations:
              summary: prod service1
              description: service1 in prod cluster on 10.0.0.1 is crashing
When the series: value is all on one line, without a > or | YAML block scalar, e.g.

- series: container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
  values: '1563968832+0x61'

the error is not there, and I don't know why, so this doesn't appear to be a data-typing issue. It's a shame for readability reasons: either Prometheus or Go may have a squeaky wheel in its YAML handling.
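For what it's worth, here is a minimal sketch of two forms that should parse, under the assumption that the trailing newline kept by the | literal block scalar is what trips promtool's series parser; the >- variant strips that newline and is only an assumption here, not something confirmed above:

input_series:
  # one-line form, confirmed above to work
  - series: container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'
  # folded-and-stripped block scalar, which yields the same single-line string (assumption)
  - series: >-
      container_last_seen{container_label_com_docker_swarm_service_name="service1",env="prod",instance="10.0.0.1"}
    values: '1563968832+0x61'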

Google Kubernetes Engine: "The node was low on resource: ephemeral-storage ... which exceeds its request of 0"

I have a GKE cluster where I create jobs through Django; it runs my C++ code images, and the builds are triggered through GitHub. It was working just fine up until now. However, I recently pushed a new commit to GitHub (a really small change, three or four lines of basic operations) and it built an image as usual. But this time, when trying to create the job, it said Pod errors: BackoffLimitExceeded, Error with exit code 137, and the job is not completed.
I did some digging into the problem, and by running kubectl describe POD_NAME I got this output from a failed pod:
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  kube-api-access-nqgnl:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason               Age    From               Message
  ----     ------               ----   ----               -------
  Normal   Scheduled            7m32s  default-scheduler  Successfully assigned default/xvb8zfzrhhmz-jk9vf to gke-cluster-1-default-pool-ee7e99bb-xzhk
  Normal   Pulling              7m7s   kubelet            Pulling image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest"
  Normal   Pulled               4m1s   kubelet            Successfully pulled image "gcr.io/videoo3-360019/github.com/videoo-io/videoo-render:latest" in 3m6.343917225s
  Normal   Created              4m1s   kubelet            Created container jobcontainer
  Normal   Started              4m     kubelet            Started container jobcontainer
  Warning  Evicted              3m29s  kubelet            The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
  Normal   Killing              3m29s  kubelet            Stopping container jobcontainer
  Warning  ExceededGracePeriod  3m19s  kubelet            Container runtime did not kill the pod within specified grace period.
The error occurs because of this line:
The node was low on resource: ephemeral-storage. Container jobcontainer was using 91144Ki, which exceeds its request of 0.
I do not have a YAML file where I set my pod information; instead, I handle the configuration through a Django call, which looks like this:
def kube_create_job_object(name, container_image, namespace="default", container_name="jobcontainer", env_vars={}):
    # Body is the object Body
    body = client.V1Job(api_version="batch/v1", kind="Job")
    # Body needs Metadata
    # Attention: Each JOB must have a different name!
    body.metadata = client.V1ObjectMeta(namespace=namespace, name=name)
    # And a Status
    body.status = client.V1JobStatus()
    # Now we start with the Template...
    template = client.V1PodTemplate()
    template.template = client.V1PodTemplateSpec()
    # Passing Arguments in Env:
    env_list = []
    for env_name, env_value in env_vars.items():
        env_list.append(client.V1EnvVar(name=env_name, value=env_value))
    print(env_list)
    security = client.V1SecurityContext(privileged=True, allow_privilege_escalation=True, capabilities=client.V1Capabilities(add=["CAP_SYS_ADMIN"]))
    container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security)
    template.template.spec = client.V1PodSpec(containers=[container], restart_policy='Never')
    body.spec = client.V1JobSpec(backoff_limit=0, ttl_seconds_after_finished=600, template=template.template)
    return body

def kube_create_job(manifest, output_uuid, output_signed_url, webhook_url, valgrind, sleep, isaudioonly):
    credentials, project = google.auth.default(
        scopes=['https://www.googleapis.com/auth/cloud-platform', ])
    credentials.refresh(google.auth.transport.requests.Request())
    cluster_manager = ClusterManagerClient(credentials=credentials)
    cluster = cluster_manager.get_cluster(name=f"path/to/cluster")
    with NamedTemporaryFile(delete=False) as ca_cert:
        ca_cert.write(base64.b64decode(cluster.master_auth.cluster_ca_certificate))
    config = client.Configuration()
    config.host = f'https://{cluster.endpoint}:443'
    config.verify_ssl = True
    config.api_key = {"authorization": "Bearer " + credentials.token}
    config.username = credentials._service_account_email
    config.ssl_ca_cert = ca_cert.name
    client.Configuration.set_default(config)
    # Setup K8 configs
    api_instance = kubernetes.client.BatchV1Api(kubernetes.client.ApiClient(config))
    container_image = get_first_success_build_from_list_builds(client)
    name = id_generator()
    body = kube_create_job_object(name, container_image,
                                  env_vars={
                                      "PROJECT": json.dumps(manifest),
                                      "BUCKET": settings.GS_BUCKET_NAME,
                                  })
    try:
        api_response = api_instance.create_namespaced_job("default", body, pretty=True)
        print(api_response)
    except ApiException as e:
        print("Exception when calling BatchV1Api->create_namespaced_job: %s\n" % e)
    return body
What causes this and how can I fix it? Am I supposed to set resource/limit variables to a value, and if so, how can I do that inside my Django job call?
It looks like you are running out of storage on the actual node itself. Since your job spec does not have a request for ephemeral storage, it can be scheduled on any node, and in this case it appears that the particular node it landed on does not have enough storage available.
I'm not a Python expert, but it looks like you should be able to do something like:
storage_size = SOME_VALUE
requests = {'ephemeral-storage': storage_size}
resources = client.V1ResourceRequirements(requests=requests)
container = client.V1Container(name=container_name, image=container_image, env=env_list, stdin=True, security_context=security, resources=resources)
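For illustration, here is a slightly fuller sketch of how that could slot into the question's kube_create_job_object, assuming the official kubernetes Python client; the sizes are hypothetical placeholders and should reflect what the job actually writes to local disk:

from kubernetes import client

# Hypothetical sizes; tune them to the job's real scratch-space usage.
resources = client.V1ResourceRequirements(
    requests={"ephemeral-storage": "1Gi"},  # what the scheduler reserves on the node
    limits={"ephemeral-storage": "2Gi"},    # ceiling before the kubelet evicts the pod
)

container = client.V1Container(
    name=container_name,
    image=container_image,
    env=env_list,
    stdin=True,
    security_context=security,
    resources=resources,
)

With a non-zero request in place, the scheduler will only place the pod on nodes that can reserve that much ephemeral storage, instead of treating the request as 0.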

CloudWatch logs acting weird

I have two log files with multi-line log statements. Both of them have the same datetime format at the beginning of each log statement. The configuration looks like this:
state_file = /var/lib/awslogs/agent-state
[/opt/logdir/log1.0]
datetime_format = %Y-%m-%d %H:%M:%S
file = /opt/logdir/log1.0
log_stream_name = /opt/logdir/logs/log1.0
initial_position = start_of_file
multi_line_start_pattern = {datetime_format}
log_group_name = my.log.group
[/opt/logdir/log2-console.log]
datetime_format = %Y-%m-%d %H:%M:%S
file = /opt/logdir/log2-console.log
log_stream_name = /opt/logdir/log2-console.log
initial_position = start_of_file
multi_line_start_pattern = {datetime_format}
log_group_name = my.log.group
The CloudWatch Logs agent is sending the log1.0 logs correctly to my log group on CloudWatch; however, it's not sending the logs for log2-console.log.
awslogs.log says:
2016-11-15 08:11:41,308 - cwlogs.push.batch - WARNING - 3593 - Thread-4 - Skip event: {'timestamp': 1479196444000, 'start_position': 42330916L, 'end_position': 42331504L}, reason: timestamp is more than 2 hours in future.
2016-11-15 08:11:41,308 - cwlogs.push.batch - WARNING - 3593 - Thread-4 - Skip event: {'timestamp': 1479196451000, 'start_position': 42331504L, 'end_position': 42332092L}, reason: timestamp is more than 2 hours in future.
The server time is correct, though. Another weird thing is that the positions mentioned in start_position and end_position do not exist in the actual log file being pushed.
Anyone else experiencing this issue?
I was able to fix this.
The state of awslogs was broken. The state is stored in a sqlite database in /var/awslogs/state/agent-state. You can access it via
sudo sqlite3 /var/awslogs/state/agent-state
sudo is needed to have write access.
List all streams with
select * from stream_state;
Look up your log stream and note the source_id, which is part of a JSON data structure in the v column.
Then, list all records with this source_id (in my case it was 7675f84405fcb8fe5b6bb14eaa0c4bfd) in the push_state table
select * from push_state where k="7675f84405fcb8fe5b6bb14eaa0c4bfd";
The resulting record has a JSON data structure in the v column which contains a batch_timestamp, and this batch_timestamp seems to be wrong: it was in the past, and any log entries more than 2 hours newer than it were no longer processed.
The solution is to update this record. Copy the v column, replace the batch_timestamp with the current timestamp and update with something like
update push_state set v='... insert new value here ...' where k='7675f84405fcb8fe5b6bb14eaa0c4bfd';
Restart the service with
sudo /etc/init.d/awslogs restart
I hope it works for you!
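Pulling those steps together, the session looks roughly like this; the source_id is just the example value from above, and the '...' stands for your own copied v value with only batch_timestamp changed:

sudo sqlite3 /var/awslogs/state/agent-state
sqlite> select * from stream_state;
sqlite> select * from push_state where k='7675f84405fcb8fe5b6bb14eaa0c4bfd';
sqlite> update push_state set v='... copied v with a current batch_timestamp ...' where k='7675f84405fcb8fe5b6bb14eaa0c4bfd';
sqlite> .quit
sudo /etc/init.d/awslogs restart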
We had the same issue and the following steps fixed it.
If log groups are not updating with the latest events, run these steps (a shell sketch follows the list):
1. Stop the awslogs service.
2. Delete the file /var/awslogs/state/agent-state.
3. Update the /var/awslogs/etc/awslogs.conf configuration from hostname to instance ID, e.g. change log_stream_name = {hostname} to log_stream_name = {instance_id}.
4. Start the awslogs service.
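The same steps as a rough shell sketch; the paths are the ones from this answer, and the sed one-liner is just one assumed way to make the config change:

sudo service awslogs stop
sudo rm /var/awslogs/state/agent-state
# switch the stream name from {hostname} to {instance_id}
sudo sed -i 's/{hostname}/{instance_id}/' /var/awslogs/etc/awslogs.conf
sudo service awslogs start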
I was able to resolve this issue on Amazon Linux by:
sudo yum reinstall awslogs
sudo service awslogs restart
This method retained my config files in /var/awslogs/, though you may wish to back them up before a reinstall.
Note: In my troubleshooting, I had also deleted my Log Group via the AWS Console. The restart fully reloaded all historical logs, but at the present timestamp, which is of less value. I'm unsure whether deleting the Log Group was necessary for this method to work. You might want to look at setting the initial_position config to end_of_file before you restart.
I found the reason: the time zone in my Docker container was inconsistent with the time zone of my host machine. After setting the two time zones to be consistent, the problem was solved.
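For reference, one common way to keep a container's time zone aligned with the host, assuming a Linux image that honours the TZ variable and the standard localtime file, is something like:

docker run \
  -e TZ=UTC \
  -v /etc/localtime:/etc/localtime:ro \
  my-image

Here my-image and TZ=UTC are placeholders; the point is only that the container and the host must agree on the time, otherwise the agent sees timestamps that look hours in the future and skips them.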

AWS logs agent setup

We have recently set up the AWS logs agent on one of our test servers. Our log files usually contain multi-line events; e.g., one of our log events is:
[10-Jun-2016 07:30:16 UTC] SQS Post Response: Array
(
    [Status] => 200
    [ResponseBody] => <?xml version="1.0"?><SendMessageResponse xmlns="http://queue.amazonaws.com/doc/2009-02-01/"><SendMessageResult><MessageId>053c7sdf5-1e23-wa9d-99d8-2a0cf9eewe7a</MessageId><MD5OfMessageBody>8e542d2c2a1325a85eeb9sdfwersd58f</MD5OfMessageBody></SendMessageResult><ResponseMetadata><RequestId>4esdfr30-c39b-526b-bds2-14e4gju18af</RequestId></ResponseMetadata></SendMessageResponse>
)
The log agent reference documentation says to use the 'multi_line_start_pattern' option for such logs. Our AWS logs agent config is as follows:
[httpd_info.log]
file = /var/log/httpd/info.log*
log_stream_name = info.log
initial_position = start_of_file
log_group_name = test.server.name
multi_line_start_pattern = '(\[)+\d{2}-[a-zA-Z]{3}+-\d{4}'
However, the logs agent breaks the aforementioned and similar events apart. They are being reported to CloudWatch Logs as follows:
Event 1:
[10-Jun-2016 11:21:26 UTC] SQS Post Response: Array
Event 2:
( [Status] => 200 [ResponseBody] => <?xml version="1.0"?><SendMessageResponse xmlns="http://queue.amazonaws.com/doc/2009-02-01/"><SendMessageResult><MessageId>053c7sdf5-1e23-wa9d-99d8-2a0cf9eewe7a</MessageId><MD5OfMessageBody>8e542d2c2a1325a85eeb9sdfwersd58f</MD5OfMessageBody></SendMessageResult><ResponseMetadata><RequestId>4esdfr30-c39b-526b-bds2-14e4gju18af</RequestId></ResponseMetadata></SendMessageResponse>
Event 3:
)
This happens despite the fact that it's only a single event. Any clue what's going on here?
I think all you need to add is the following to your awslogs.conf
datetime_format = %d-%b-%Y %H:%M:%S UTC
time_zone = UTC
multi_line_start_pattern = {datetime_format}
http://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AgentReference.html
multi_line_start_pattern
Specifies the pattern for identifying the start of a log message. A log message is made of a line that matches the pattern and any following lines that don't match the pattern. The valid values are regular expression or {datetime_format}. When using {datetime_format}, the datetime_format option should be specified. The default value is '^[^\s]', so any line that begins with a non-whitespace character closes the previous log message and starts a new log message.
If that datetime format didn't work, you would need to update your regex to actually match your specific datetime. I don't think the one you have listed above actually works for your given format.
You could try this for instance:
[\d{2}-[\w]{3}-\d{4}\s{1}\d{2}:\d{2}:\d{2}\s{1}\w+]
does match
[10-Jun-2016 11:21:26 UTC]
See here: http://www.regexpal.com/?fam=96811
Once completed, restart the service and check whether it's parsing correctly.
$ sudo service awslogs restart
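Putting that together, the stanza from the question would look roughly like this with the datetime_format approach suggested above (a sketch reusing the question's paths and names; adjust the format string if your timestamps differ):

[httpd_info.log]
file = /var/log/httpd/info.log*
log_stream_name = info.log
initial_position = start_of_file
log_group_name = test.server.name
datetime_format = %d-%b-%Y %H:%M:%S UTC
time_zone = UTC
multi_line_start_pattern = {datetime_format}

After editing, restart the agent as shown above so the new pattern takes effect.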

AWS Worker tier cron - server error #500 - "post http 1.1 500 AWS aws-sqsd/2.0"

I'm trying to set up a cron job on Elastic Beanstalk. The task is being scheduled; for testing purposes it should run every minute. However, it is not working. It is a Django app, running in two environments: one is the worker and the other one is "hosting" the application.
That part is working. The task is being triggered, but the work is not actually executed (the files are not being deleted).
Here is views.py:
@login_required
def delete_expired_files(request):
    users = DemoUser.objects.all()
    for user in users:
        documents = Document.objects.filter(owner=user.id)
        if documents:
            for doc in documents:
                now = timezone.now()
                if now >= doc.date_published + timedelta(days=doc.owner.group.valid_time):
                    doc.delete()
    return redirect("user_home")
cron.yml:
version: 1
cron:
  - name: "delete_expired_files"
    url: "http://networksapp.elasticbeanstalk.com/networks_app/delete_expired_files"
    schedule: "* * * * *"
However, the access_log part of the log file shows this:
"POST /myapp/management/commands/delete_expired_files HTTP/1.1" 500 124709 "-" "aws-sqsd/2.0"
This is the log file I am accessing so far:
Log file content
Why is this happening? How can I fix it?
Thank you so much.