AWS CloudWatch Alarm on Error fails on Insufficient Data - amazon-web-services

I'm trying to create an alarm that fires when a JSON log message has an error log level. My filter works fine, but when I create the alarm it always falls into INSUFFICIENT_DATA, seemingly because there are no errors.
Any ideas?

The way around this was to define two metric filters that publish to the same metric name but with inverse patterns. The filter that matches error-level log messages must return a metric value of 1, while the second filter should match all messages (or at least one message within each evaluation period) and return a metric value of 0. The presence of the 0 values avoids the INSUFFICIENT_DATA state.
When the alarm is created from the metric, the values from both filters are combined. With a Sum statistic and an alarm rule of > 0, the alarm triggers only when error messages arrive and never runs into INSUFFICIENT_DATA.
Here is an example using the boto3 client:
import boto3

# Configuration -- adjust to your environment
log_group_name = 'myLogGroup'
# create this SNS topic with your email subscription...
sns_topic_arn = 'arn:aws:sns:eu-west-1:1234567:log_error'
sys_type = 'production'

metrics_namespace = 'LogMetrics'
metric_name = 'ErrorCount_%s' % sys_type

cloudwatch_client = boto3.client('cloudwatch')
logs_client = boto3.client('logs')

# Seed the metric with a zero so it exists before the filters report anything
print('Put metric %s' % metric_name)
cloudwatch_client.put_metric_data(
    Namespace=metrics_namespace,
    MetricData=[
        {
            'MetricName': metric_name,
            'Unit': 'Count',
            'Value': 0
        },
    ]
)

# Filter 1: matches ERROR-level log messages and reports a value of 1
print('Put metric filter levelname-ERROR')
logs_client.put_metric_filter(
    logGroupName=log_group_name,
    filterName='levelname-ERROR',
    filterPattern='{ $.levelname = "ERROR" }',
    metricTransformations=[
        {
            'metricNamespace': metrics_namespace,
            'metricValue': '1',
            'metricName': metric_name,
        }]
)

# Filter 2: matches every message and reports a value of 0, so the metric
# always has data and the alarm never falls into INSUFFICIENT_DATA
print('Put metric filter catchAll')
logs_client.put_metric_filter(
    logGroupName=log_group_name,
    filterName='catchAll',
    filterPattern='',
    metricTransformations=[
        {
            'metricNamespace': metrics_namespace,
            'metricValue': '0',
            'metricName': metric_name,
        }]
)

# Alarm: Sum > 0 over 5 minutes means at least one ERROR message arrived
print('Put metric alarm, email on error')
cloudwatch_client.put_metric_alarm(
    AlarmName='email on error',
    AlarmDescription='email on error',
    ActionsEnabled=True,
    AlarmActions=[
        sns_topic_arn,
    ],
    MetricName=metric_name,
    Namespace=metrics_namespace,
    Statistic='Sum',
    Period=300,
    Unit='Count',
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold'
)
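As an aside (not part of the original answer): recent versions of the CloudWatch API let the alarm itself decide how to treat missing data, which may remove the need for the catch-all filter. A minimal sketch, reusing the variables defined above:

# Treat missing data as "not breaching": with no log events at all, the alarm
# stays in OK instead of INSUFFICIENT_DATA (assumes the variables defined above).
cloudwatch_client.put_metric_alarm(
    AlarmName='email on error',
    AlarmDescription='email on error',
    ActionsEnabled=True,
    AlarmActions=[sns_topic_arn],
    MetricName=metric_name,
    Namespace=metrics_namespace,
    Statistic='Sum',
    Period=300,
    Unit='Count',
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='notBreaching'
)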

Related

Export DynamoDB metrics logs to S3 or CloudWatch

I'm trying to use DynamoDB metrics logs in an external observability tool.
To do that, I need to get this log data from S3 or from CloudWatch log groups (not from Insights or CloudTrail).
If there isn't a way to use CloudWatch directly, I'll need to export these metric logs from DynamoDB to S3 and, from there, export them to CloudWatch or read the data directly from S3.
Do you know if this is possible?
You could try using Logstash; it has input plugins for CloudWatch and S3:
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-cloudwatch.html
https://www.elastic.co/guide/en/logstash/current/plugins-inputs-s3.html
AWS publishes DynamoDB metrics (per table operation, per table, and per account) as CloudWatch metrics, and you can also create as many custom metrics as you need. If you use Python, you can read them with boto3; the CloudWatch client has a get_metric_data method.
Try this with your metrics:
import boto3
from datetime import date, datetime, timedelta

cloudwatch_client = boto3.client('cloudwatch')

yesterday = date.today() - timedelta(days=1)
today = date.today()

response = cloudwatch_client.get_metric_data(
    MetricDataQueries=[
        {
            'Id': 'some_request',
            'MetricStat': {
                'Metric': {
                    # AWS service metrics live in the 'AWS/DynamoDB' namespace
                    'Namespace': 'AWS/DynamoDB',
                    'MetricName': 'metric_name',  # e.g. 'ConsumedReadCapacityUnits'
                    'Dimensions': []
                },
                'Period': 3600,
                'Stat': 'Sum',
            }
        },
    ],
    StartTime=datetime(yesterday.year, yesterday.month, yesterday.day),
    EndTime=datetime(today.year, today.month, today.day),
)
print(response)
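As a side note, if you are unsure which metric names are available, you can first list everything AWS publishes in the DynamoDB namespace (a minimal sketch reusing cloudwatch_client from above):

# Discover the DynamoDB metrics available in this account/region
paginator = cloudwatch_client.get_paginator('list_metrics')
for page in paginator.paginate(Namespace='AWS/DynamoDB'):
    for metric in page['Metrics']:
        print(metric['MetricName'], metric['Dimensions'])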

Creating a CloudWatch Metrics from the Athena Query results

My Requirement
I want to create a CloudWatch metric from Athena query results.
Example
I want to create a metric like user_count for each day.
In Athena, I will write an SQL query like this
select date,count(distinct user) as count from users_table group by 1
In the Athena editor I can see the result, but I want to see these results as a metric in CloudWatch.
CloudWatch-Metric-Name ==> user_count
Dimensions ==> Date,count
If I have this CloudWatch metric and these dimensions, I can easily create a monitoring dashboard and send alerts.
Can anyone suggest a way to do this?
You can use CloudWatch custom widgets, see "Run Amazon Athena queries" in Samples.
It's somewhat involved, but you can use a Lambda for this. In a nutshell:
Setup your query in Athena and make sure it works using the Athena console.
Create a Lambda that:
Runs your Athena query
Pulls the query results from S3
Parses the query results
Sends the query results to CloudWatch as a metric
Use EventBridge to run your Lambda on a recurring basis (a minimal scheduling sketch follows the example Lambda code below)
Here's an example Lambda function in Python that does step #2. Note that the Lambda function will need IAM permissions to run queries in Athena, read the results from S3, and then put a metric into CloudWatch.
import time
import boto3

query = 'select count(*) from mytable'
DATABASE = 'default'
bucket = 'BUCKET_NAME'
path = 'yourpath'

def lambda_handler(event, context):
    # Run the query in Athena
    client = boto3.client('athena')
    output = "s3://{}/{}".format(bucket, path)

    # Execution
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': DATABASE
        },
        ResultConfiguration={
            'OutputLocation': output,
        }
    )

    # The S3 file name uses the QueryExecutionId, so
    # grab it here so we can pull the S3 file.
    qeid = response["QueryExecutionId"]

    # Occasionally Athena hasn't written the file
    # before the Lambda tries to pull it out of S3, so pause a few seconds.
    # Note: you are charged for the time the Lambda is running.
    # A more elegant but more complicated solution would try to get the
    # file first and then sleep.
    time.sleep(3)

    # Get the query result from S3.
    s3 = boto3.client('s3')
    objectkey = path + "/" + qeid + ".csv"

    # Load the object as a file
    file_content = s3.get_object(
        Bucket=bucket,
        Key=objectkey)["Body"].read()

    # Split the file on line breaks
    lines = file_content.decode().splitlines()

    # Get the second line in the file (the first line is the CSV header)
    count = lines[1]

    # Remove double quotes
    count = count.replace("\"", "")

    # Convert string to int since CloudWatch wants a numeric value
    count = int(count)

    # Post the query result as a CloudWatch metric
    cloudwatch = boto3.client('cloudwatch')
    response = cloudwatch.put_metric_data(
        MetricData=[
            {
                'MetricName': 'MyMetric',
                'Dimensions': [
                    {
                        'Name': 'DIM1',
                        'Value': 'dim1'
                    },
                ],
                'Unit': 'None',
                'Value': count
            },
        ],
        Namespace='MyMetricNS'
    )
    return response
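For the EventBridge step, here is a minimal scheduling sketch; the rule name athena-metric-hourly, the function name athena-to-cloudwatch, and the ARN are hypothetical placeholders:

import boto3

events_client = boto3.client('events')
lambda_client = boto3.client('lambda')

# Hypothetical placeholders -- use your real function name and ARN
function_name = 'athena-to-cloudwatch'
function_arn = 'arn:aws:lambda:us-east-1:123456789012:function:athena-to-cloudwatch'

# Create (or update) a rule that fires once per hour
rule = events_client.put_rule(
    Name='athena-metric-hourly',
    ScheduleExpression='rate(1 hour)',
    State='ENABLED'
)

# Allow EventBridge to invoke the Lambda
lambda_client.add_permission(
    FunctionName=function_name,
    StatementId='athena-metric-hourly-invoke',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)

# Point the rule at the Lambda
events_client.put_targets(
    Rule='athena-metric-hourly',
    Targets=[{'Id': function_name, 'Arn': function_arn}]
)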

subFolder is empty when using a Google IoT Core gateway and Pub/Sub

I have a device publishing through a gateway on the events topic (/devices/<dev_id>/events/motion) to Pub/Sub. It's landing in Pub/Sub correctly, but subFolder is just an empty string.
On the gateway I'm publishing using the code below. f"mb.{device_id}" is the device ID (not the gateway ID), and attribute could be anything: motion, temperature, etc.
def report(self, device_id, attribute, value):
    topic = f"/devices/mb.{device_id}/events/{attribute}"
    timestamp = datetime.utcnow().timestamp()
    client.publish(topic, json.dumps({"v": value, "ts": timestamp}))
And this is the Cloud Function listening on the Pub/Sub subscription.
def iot_to_bigtable(event, context):
    payload = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    timestamp = payload.get("ts")
    value = payload.get("v")
    if not timestamp or value is None:
        raise BadDataException()

    attributes = event.get("attributes", {})
    device_id = attributes.get("deviceId")
    registry_id = attributes.get("deviceRegistryId")
    attribute = attributes.get("subFolder")
    if not device_id or not registry_id or not attribute:
        raise BadDataException()
A sample of the event in Pub/Sub:
{
    #type: 'type.googleapis.com/google.pubsub.v1.PubsubMessage',
    attributes: {
        deviceId: 'mb.26727bab-0f37-4453-82a4-75d93cb3f374',
        deviceNumId: '2859313639674234',
        deviceRegistryId: 'mb-staging',
        deviceRegistryLocation: 'europe-west1',
        gatewayId: 'mb.42e29cd5-08ad-40cf-9c1e-a1974144d39a',
        projectId: 'mb-staging',
        subFolder: ''
    },
    data: 'eyJ2IjogImxvdyIsICJ0cyI6IDE1OTA3NjgzNjcuMTMyNDQ4fQ=='
}
Why is subFolder empty? Based on the docs, I was expecting it to be the attribute (i.e. motion or temperature).
This issue has nothing to do with Cloud IoT Core. It is instead caused by how Pub/Sub handles failed messages: it was redelivering messages from ~12 hours ago that had failed (and didn't have a subFolder attribute).
You fix this by purging the subscription in Pub/Sub.
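For reference, one way to purge is to seek the subscription to the current time so that everything already published is marked as acknowledged. A rough sketch with the Python Pub/Sub client, where the subscription name is a hypothetical placeholder and the exact seek call may vary by client version:

from datetime import datetime, timezone
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
# 'my-subscription' is a placeholder -- use the subscription bound to the Cloud Function
subscription_path = subscriber.subscription_path('mb-staging', 'my-subscription')

# Seeking to "now" marks everything published before this moment as acknowledged
subscriber.seek(request={
    "subscription": subscription_path,
    "time": datetime.now(timezone.utc),
})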

How to pull specific information out of a alarm event in Lambda

I set up a CPU alarm for an EC2 instance that triggers an SNS topic whose endpoint is a Lambda function. The Lambda function then sends me an email and a Slack message telling me that an instance is in the ALARM state and exactly which instance it came from. I have the email and Slack working; now I just need to get the instance ID from the event that my Lambda receives from the alarm.
I get the following event in the Lambda function. I want to just pull out the instance ID from it, which in this example would be "i-07db9e2f61d100". It is located in "Dimensions".
How about also pulling out the "AlarmName" (which would be "cpu-mon" in this example)?
Here is all the data in the event I receive:
{'Records': [{'EventSource': 'aws:sns', 'EventVersion': '1.0', 'EventSubscriptionArn': 'arn:aws:sns:us-east-2:Alarm-test:db99f3fe-1c4b', 'Sns': {'Type': 'Notification', 'MessageId': '9921c85a-6f59-50c0', 'TopicArn': 'arn:aws:sns:us-east-2:4990:Alarm-test', 'Subject': 'ALARM: "cpu-mon" in US East (Ohio)', 'Message': '{"AlarmName":"cpu-mon","AlarmDescription":"Alarm when CPU exceeds 70 percent","AWSAccountId":"000000000","NewStateValue":"ALARM","NewStateReason":"Threshold Crossed: 2 out of the last 2 datapoints [99.8333333333333 (26/08/19 19:19:00), 99.1803278688525 (26/08/19 19:18:00)] were greater than the threshold (70.0) (minimum 2 datapoints for OK -> ALARM transition).","StateChangeTime":"2019-08-26T19:20:52.350+0000","Region":"US East (Ohio)","OldStateValue":"OK","Trigger":{"MetricName":"CPUUtilization","Namespace":"AWS/EC2","StatisticType":"Statistic","Statistic":"AVERAGE","Unit":"Percent","Dimensions":[{"value":"i-07db9e2f61d100","name":"InstanceId"}],"Period":60,"EvaluationPeriods":2,"ComparisonOperator":"GreaterThanThreshold","Threshold":70.0,"TreatMissingData":"","EvaluateLowSampleCountPercentile":""}}', 'Timestamp': '2019-08-26T19:20:52.403Z', 'SignatureVersion': '1', 'Signature': 'UeWhS==', 'SigningCertUrl': 'https://sns.us-east-2.amazonaws.com/SimpleNotificationService-63f9.pem', 'UnsubscribeUrl': 'https://sns.us-east-2.amazonaws.com/?Action=Unsubscribe&SubscriptionArn=arn:aws:sns:us-east-2:49:Alarm-test:dfe-1c4b-4db9', 'MessageAttributes': {}}}]}
Here is my Lambda function (Python):
# Sends Slack and text message
import json
import subprocess
import boto3

session = boto3.Session(
    region_name="us-east-1"
)
sns_client = session.client('sns')

def lambda_handler(event, context):
    print("THIS IS THE EVENT - " + str(event))
    msg = str(event)
    data = json.dumps({'text': msg})

    # Send text alerts
    alertNumbers = ["1-xxx-xxx-xxxx"]

    # Send text message
    for i in range(len(alertNumbers)):
        sns_client.publish(
            PhoneNumber=alertNumbers[i],
            Message=msg,
            MessageAttributes={
                'AWS.SNS.SMS.SenderID': {
                    'DataType': 'String',
                    'StringValue': 'SENDERID'
                },
                'AWS.SNS.SMS.SMSType': {
                    'DataType': 'String',
                    'StringValue': 'Promotional'
                }
            }
        )

    # Send Slack message
    subprocess.call([
        'curl',
        '-X', 'POST',
        '-H', 'Content-type: application/json',
        '--data', data,
        'https://hooks.slack.com/services/000000'
    ])
Thanks for any help!
You simply need to access the data of the event and put it where you want it.
Inside your lambda_handler add this as the first line:
message = json.loads(event['Records'][0]['Sns']['Message'])
Now the SNS message is available as message. Getting the AlarmName is as simple as message['AlarmName'], and the instance ID is at message['Trigger']['Dimensions'][0]['value'].
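Putting that together, a minimal sketch of a handler that pulls out both values (the print/return handling is just illustrative):

import json

def lambda_handler(event, context):
    # The alarm payload arrives as a JSON string inside the SNS record
    message = json.loads(event['Records'][0]['Sns']['Message'])

    alarm_name = message['AlarmName']                           # e.g. "cpu-mon"
    instance_id = message['Trigger']['Dimensions'][0]['value']  # e.g. "i-07db9e2f61d100"

    text = "Alarm {} fired for instance {}".format(alarm_name, instance_id)
    print(text)
    return text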

CloudWatch boto3 put_log_events giving incorrect return

I'm using boto3 to post log events from some code. The results show that 0 bytes are stored; everything else appears valid (the next sequence token and the creation time), but there are no events in the log. The message I'm sending is just a simple message = "test". When I call this function, the response unexpectedly looks like a log stream description rather than the documented return value. Anyone know what might be causing this?
kwargs = {'logGroupName': self.log_group_name,
          'logStreamName': self.log_stream_name,
          'logEvents': [
              {
                  'timestamp': ts,
                  'message': message
              },
          ]}

token = self.get_seq_token()
if token:
    print('token: ' + token)
    kwargs.update({'sequenceToken': token})

response = self.client.put_log_events(**kwargs)
Results seem to be a log stream:
{'storedBytes': 0, 'creationTime': 1481640079355,
'uploadSequenceToken': 'validtoken',
'logStreamName': 'test_stream_1',
'lastIngestionTime': 1481640079447,
'arn': 'arn:aws:logs:us-east-1:[aws_id]:log-group:test_group:log-stream:test_stream_1'}
From the documentation I was expecting:
{
    'nextSequenceToken': 'string',
    'rejectedLogEventsInfo': {
        'tooNewLogEventStartIndex': 123,
        'tooOldLogEventEndIndex': 123,
        'expiredLogEventEndIndex': 123
    }
}
The incorrect result was a red herring: the real error was that the timestamp was too far in the past. CloudWatch Logs expects event timestamps in milliseconds, so the Unix time needs to be multiplied by 1000:
ts = int(time.time()*1000)
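For context, a minimal sketch of the corrected call with a millisecond timestamp (the group and stream names are taken from the output above; sequence-token handling is simplified):

import time
import boto3

client = boto3.client('logs')

# CloudWatch Logs expects event timestamps in milliseconds since the epoch
ts = int(time.time() * 1000)

client.put_log_events(
    logGroupName='test_group',
    logStreamName='test_stream_1',
    logEvents=[{'timestamp': ts, 'message': 'test'}],
    # sequenceToken='...'  # required when the stream already contains events
)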
Related to this:
amazon CloudWatchLogs putLogEvents