How to capture Lambda code-level errors in CloudWatch? - amazon-web-services

I am trying to set up a CloudWatch alarm for Lambda execution. I am able to set up the ALARM and OK states for Errors, but whenever there is a syntax error in my code the alarm goes to INSUFFICIENT_DATA.
I added my code below:
import json
import sys

print("Buckle your seat belt even if you are in back seat")

def lambda_handler(event, context):
    try:
        print("value 1 = " + event['key'])
        print("value 2 = " + event['key2'])
        print("value 3 = " + event['key3'])
        return event['key1']
    except Exception as e:
        print(sys.exc_info()[0])
        raise
Test Data:
{ "key3": "value3", "key2": "value2", "key1": "value1" }
Here is the error I am generating:
{
  "stackTrace": [
    [
      "/var/task/lambda_function.py",
      6,
      "lambda_handler",
      "print( \"value 1 = \" + event['key'])"
    ]
  ],
  "errorType": "KeyError",
  "errorMessage": "'key'"
}
I can create a metric filter for KeyError and set my alarm on it, but I want a single alarm that covers all errors, whether they are system-level (Lambda execution failures) or code-level (such as KeyError).
Can anyone please help me capture syntax errors and data errors in one single CloudWatch alarm?
Thanks

If you want to catch the error in the logs, you can use the logging module and call logger.debug(e) or logger.error(e), where e is the exception that was raised and caught. The exception will then be available in the CloudWatch logs, and you can set up a metric filter and alarm on top of that.
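A minimal sketch of that approach (assuming Python 3 and the standard logging module; logger.exception is a convenience wrapper around logger.error that also records the traceback):

import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)

def lambda_handler(event, context):
    try:
        return event['key1']
    except Exception as e:
        # Logged at ERROR level with the traceback, so a CloudWatch Logs
        # metric filter matching "ERROR" can count code-level failures
        logger.exception("Unhandled error: %s", e)
        raise

Re-raising the exception also keeps the built-in Lambda Errors metric populated, so execution-level failures are still counted.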

Related

Getting error while testing AWS Lambda function: "Invalid database identifier"

Hi, I'm getting an error while testing my Lambda function:
{
  "errorMessage": "An error occurred (InvalidParameterValue) when calling the DescribeDBInstances operation: Invalid database identifier: <RDS instance id>",
  "errorType": "ClientError",
  "stackTrace": [
    " File \"/var/task/lambda_function.py\", line 25, in lambda_handler\n db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']\n",
    " File \"/var/runtime/botocore/client.py\", line 391, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
    " File \"/var/runtime/botocore/client.py\", line 719, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
  ]
}
And here is my Lambda code:
import json
import boto3
import logging
import os

# Logging
LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

# Initialise Boto3 for RDS
rdsClient = boto3.client('rds')

def lambda_handler(event, context):
    # Log input event
    LOGGER.info("RdsAutoRestart Event Received, now checking if event is eligible. Event Details ==> %s", event)

    # Input event from the SNS topic originated from RDS event notifications
    snsMessage = json.loads(event['Records'][0]['Sns']['Message'])
    rdsInstanceId = snsMessage['Source ID']
    stepFunctionInput = {"rdsInstanceId": rdsInstanceId}
    rdsEventId = snsMessage['Event ID']

    # Retrieve RDS instance ARN
    db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
    db_instance = db_instances[0]
    rdsInstanceArn = db_instance['DBInstanceArn']

    # Filter on the Auto Restart RDS Event. Event code: RDS-EVENT-0154.
    if 'RDS-EVENT-0154' in rdsEventId:
        # Log input event
        LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        # Verify that instance is tagged with auto-restart-protection tag. The tag is used to classify instances that are required to be terminated once started.
        tagCheckPass = 'false'
        rdsInstanceTags = rdsClient.list_tags_for_resource(ResourceName=rdsInstanceArn)
        for rdsInstanceTag in rdsInstanceTags["TagList"]:
            if 'auto-restart-protection' in rdsInstanceTag["Key"]:
                if 'yes' in rdsInstanceTag["Value"]:
                    tagCheckPass = 'true'
                    # Log instance tags
                    LOGGER.info("RdsAutoRestart verified that the instance is tagged auto-restart-protection = yes, now starting the Step Functions Flow")
                else:
                    tagCheckPass = 'false'
                    # Log instance tags
                    LOGGER.info("RdsAutoRestart Event detected, now verifying that instance was tagged with auto-restart-protection == yes")

        if 'true' in tagCheckPass:
            # Initialise StepFunctions Client
            stepFunctionsClient = boto3.client('stepfunctions')

            # Start StepFunctions WorkFlow
            # StepFunctionsArn is stored in an environment variable
            stepFunctionsArn = os.environ['STEPFUNCTION_ARN']
            stepFunctionsResponse = stepFunctionsClient.start_execution(
                stateMachineArn=stepFunctionsArn,
                name=event['Records'][0]['Sns']['MessageId'],
                input=json.dumps(stepFunctionInput)
            )
        else:
            LOGGER.info("RdsAutoRestart Event detected, and event is not eligible")

    return {
        'statusCode': 200
    }
I'm trying to stop an Amazon RDS database that starts automatically after 7 days. I'm following this AWS document: Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS | AWS Architecture Blog
Can anyone help me?
The error message says: Invalid database identifier: <RDS instance id>
It seems to be coming from this line:
db_instances = rdsClient.describe_db_instances(DBInstanceIdentifier=rdsInstanceId)['DBInstances']
The error message is saying that the rdsInstanceId variable contains <RDS instance id>, which seems to be an example value rather than a real value.
In looking at the code on Field Notes: Stopping an Automatically Started Database Instance with Amazon RDS | AWS Architecture Blog, it is asking you to create a test event that includes this message:
"Message": "{\"Event Source\":\"db-instance\",\"Event Time\":\"2020-07-09 15:15:03.031\",\"Identifier Link\":\"https://console.aws.amazon.com/rds/home?region=<region>#dbinstance:id=<RDS instance id>\",\"Source ID\":\"<RDS instance id>\",\"Event ID\":\"http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154\",\"Event Message\":\"DB instance started\"}",
If you look closely at that line, it includes this part to identify the Amazon RDS instance:
dbinstance:id=<RDS instance id>
I think that you are expected to modify the provided test event to fill in your own values for anything in <angled brackets> (such as the instance identifier of your Amazon RDS instance).
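For illustration only (the identifier database-1 below is a hypothetical value; use your own), the decoded SNS Message in the test event needs to carry a real identifier, because the code reads Source ID and passes it straight to DescribeDBInstances:

{
  "Event Source": "db-instance",
  "Source ID": "database-1",
  "Event ID": "http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0154",
  "Event Message": "DB instance started"
}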

An error has occurred: The server encountered an error processing the Lambda response

I am using AWS Lex and AWS Lambda to create a chatbot. The request and response formats are as follows.
Event being passed to AWS Lambda
{
  "alternativeIntents": [
    {
      "intentName": "AMAZON.FallbackIntent",
      "nluIntentConfidence": null,
      "slots": {}
    }
  ],
  "botVersion": "$LATEST",
  "dialogState": "ConfirmIntent",
  "intentName": "OrderBeverage",
  "message": "you want to order 2 pints of beer",
  "messageFormat": "PlainText",
  "nluIntentConfidence": {
    "score": 0.92
  },
  "responseCard": null,
  "sentimentResponse": null,
  "sessionAttributes": {},
  "sessionId": "2021-05-10T09:13:06.841Z-bSWmdHVL",
  "slotToElicit": null,
  "slots": {
    "Drink": "beer",
    "Quantity": "2",
    "Unit": "pints"
  }
}
Response format:
{
  "statusCode": 200,
  "dialogAction": {
    "type": "Close",
    "fulfillmentState": "Fulfilled",
    "message": {
      "contentType": "PlainText",
      "content": "Message to convey to the user. For example, Thanks, your pizza has been ordered."
    }
  }
}
AWS Lambda Python implementation:
import json

def lambda_handler(event, context):
    # TODO implement
    slots = event["slots"]
    drink, qty, unit = slots["Drink"], slots["Quantity"], slots["Unit"]
    retStr = "your order of " + qty + " " + unit + " of " + drink + " is coming right up!"
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": retStr
            }
        }
    }
The formats are in accordance with the documentation, but I am still getting an error when the Lambda response is processed. What is the issue?
This error occurs when the execution of the Lambda function fails and throws an error back to Amazon Lex.
I have attempted to recreate your environment using the python code shared and the test input event.
The output format that you have specified in the original post is correct. Your problem appears to lie with the input test event. The input message that you are using differs from what Lex is actually sending to your Lambda function.
Try adding some additional debugging to your Lambda function to log the event that Lex passes into it and then use the logged event as your new test event.
Ensure that you have CloudWatch logging enabled for the Lambda function so that you can view the input message in the logs.
Here's how my Lambda function looks:
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.DEBUG)

def dispatch(event):
    # TODO implement
    slots = event["slots"]
    drink, qty, unit = slots["Drink"], slots["Quantity"], slots["Unit"]
    retStr = "your order of " + qty + " " + unit + " of " + drink + " is coming right up!"
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": retStr
            }
        }
    }

def lambda_handler(event, context):
    logger.debug('event={}'.format(event))
    response = dispatch(event)
    logger.debug(response)
    return response
Now if you test via the Lex console you will find your error in the CloudWatch logs:
[ERROR] KeyError: 'slots'
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 33, in lambda_handler
    response = dispatch(event)
  File "/var/task/lambda_function.py", line 19, in dispatch
    slots = event["slots"]
Using this error trace and the logged event, you should see that slots is nested within currentIntent.
You will need to update your code to extract the slot values from the correct place.
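A minimal sketch of that change, assuming the Lex V1 input format where the slot values sit under currentIntent (verify against your own logged event):

def dispatch(event):
    # In the Lex V1 event, slots are nested under currentIntent rather than at the top level
    slots = event["currentIntent"]["slots"]
    drink, qty, unit = slots["Drink"], slots["Quantity"], slots["Unit"]
    retStr = "your order of " + qty + " " + unit + " of " + drink + " is coming right up!"
    return {
        "dialogAction": {
            "type": "Close",
            "fulfillmentState": "Fulfilled",
            "message": {
                "contentType": "PlainText",
                "content": retStr
            }
        }
    }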
Trust this helps you.

ETIMEOUT from Tedious in Lambda

Setup
AWS Lambda (3s timeout)
NodeJS 12.x
mssql 2.6.1
tedious (a dependency of mssql, so it installs tedious 6.7.0)
SQL Server DB in RDS (db.t3.small)
I'm also dealing with a fair bit of traffic, with roughly 1k invocations per minute.
Problem
Most of the time the Lambda executes just fine. Roughly 0.35% of the time the Lambda throws an error. The logs look like this:
In the screenshot you can see that the function STARTs, prints some debug info, then throws an error, ENDs, and REPORTs.
While this is a "timeout" error, the error message says,
RequestError: Timeout: Request failed to complete in 15000ms
This confuses me because as you see in the REPORT log, the invocation time was just 255.33ms total.
Question
The obvious question is how does something timeout after 15 seconds in just 255ms? Is this an issue with tedious, mssql, my code, or something else? If my code is relevant to the question please let me know and I can add it. I assume the code is basically functional because it works > 99% of the time.
Failed Theories:
The logs are interleaved and the errors are not from the 255ms invocation. That's wrong because in the screenshot the request IDs match up.
There's an intermittent error connecting to the DB, possibly a DNS issue. I have seen very rare EAI_AGAIN DNS errors when resolving the DB host, but that doesn't seem to match the steadiness of this error.
I didn't spot anything super helpful in the GitHub issues for Tedious.
Full Error
{
  "errorType": "Runtime.UnhandledPromiseRejection",
  "errorMessage": "RequestError: Timeout: Request failed to complete in 15000ms",
  "reason": {
    "errorType": "RequestError",
    "errorMessage": "Timeout: Request failed to complete in 15000ms",
    "code": "ETIMEOUT",
    "originalError": {
      "errorType": "RequestError",
      "errorMessage": "Timeout: Request failed to complete in 15000ms",
      "code": "ETIMEOUT",
      "message": "Timeout: Request failed to complete in 15000ms",
      "stack": [
        "RequestError: Timeout: Request failed to complete in 15000ms",
        " at RequestError (/var/task/node_modules/mssql/node_modules/tedious/lib/errors.js:32:12)",
        " at Connection.requestTimeout (/var/task/node_modules/mssql/node_modules/tedious/lib/connection.js:1212:46)",
        " at Timeout._onTimeout (/var/task/node_modules/mssql/node_modules/tedious/lib/connection.js:1180:14)",
        " at listOnTimeout (internal/timers.js:549:17)",
        " at processTimers (internal/timers.js:492:7)"
      ]
    },
    "name": "RequestError",
    "number": "ETIMEOUT",
    "precedingErrors": [],
    "stack": [
      "RequestError: Timeout: Request failed to complete in 15000ms",
      " at Request.userCallback (/var/task/node_modules/mssql/lib/tedious/request.js:429:19)",
      " at Request.callback (/var/task/node_modules/mssql/node_modules/tedious/lib/request.js:56:14)",
      " at Connection.endOfMessageMarkerReceived (/var/task/node_modules/mssql/node_modules/tedious/lib/connection.js:2407:20)",
      " at Connection.dispatchEvent (/var/task/node_modules/mssql/node_modules/tedious/lib/connection.js:1279:15)",
      " at Parser.<anonymous> (/var/task/node_modules/mssql/node_modules/tedious/lib/connection.js:1072:14)",
      " at Parser.emit (events.js:315:20)",
      " at Parser.EventEmitter.emit (domain.js:482:12)",
      " at Parser.<anonymous> (/var/task/node_modules/mssql/node_modules/tedious/lib/token/token-stream-parser.js:37:14)",
      " at Parser.emit (events.js:315:20)",
      " at Parser.EventEmitter.emit (domain.js:482:12)"
    ]
  },
  "promise": {},
  "stack": [
    "Runtime.UnhandledPromiseRejection: RequestError: Timeout: Request failed to complete in 15000ms",
    " at process.<anonymous> (/var/runtime/index.js:35:15)",
    " at process.emit (events.js:315:20)",
    " at process.EventEmitter.emit (domain.js:482:12)",
    " at processPromiseRejections (internal/process/promises.js:209:33)",
    " at processTicksAndRejections (internal/process/task_queues.js:98:32)"
  ]
}

When using Lambda to generate ELBv2 attributes (name specifically), receiving an error from Lambda that the name is longer than 32 characters

I am building a CloudFormation template that uses a Lambda function to generate the name of the load balancer built by the template.
When the function runs, it fails with the following error:
Failed to validate attributes of ELB arn:aws-us-gov:elasticloadbalancing:us-gov-west-1:273838691273:loadbalancer/app/dev-fu-WALB-18VHO2DJ4MHK/c69c48fd3464de01. An error occurred (ValidationError) when calling the DescribeLoadBalancers operation: The load balancer name 'arn:aws-us-gov:elasticloadbalancing:us-gov-west-1:273838691273:loadbalancer/app/dev-fu-WALB-18VHO2DJ4MHK/c69c48fd3464de01' cannot be longer than '32' characters.
It is obviously pulling the ARN rather than the name of the ELBv2.
I opened a ticket with AWS to no avail, and also with the company that wrote the script... same results.
I have attached the script and any help is greatly appreciated.
import cfn_resource
import boto3
import boto3.session
import logging

logger = logging.getLogger()
handler = cfn_resource.Resource()

# Retrieves DNSName and source security group name for the specified ELB
@handler.create
def get_elb_attributes(event, context):
    properties = event['ResourceProperties']
    elb_name = properties['PORALBName']
    elb_template = properties['PORALBTemplate']
    elb_subnets = properties['PORALBSubnets']
    try:
        client = boto3.client('elbv2')
        elb = client.describe_load_balancers(
            Names=[
                elb_name
            ]
        )['LoadBalancers'][0]
        for az in elb['AvailabilityZones']:
            if not az['SubnetId'] in elb_subnets:
                raise Exception("ELB does not include VPC subnet '" + az['SubnetId'] + "'.")
        target_groups = client.describe_target_groups(
            LoadBalancerArn=elb['LoadBalancerArn']
        )['TargetGroups']
        target_group_arns = []
        for target_group in target_groups:
            target_group_arns.append(target_group['TargetGroupArn'])
        if elb_template == 'geoevent':
            if elb['Type'] != 'network':
                raise Exception("GeoEvent Server requires network ElasticLoadBalancer V2.")
        response_data = {}
        response_data['DNSName'] = elb['DNSName']
        response_data['TargetGroupARNs'] = target_group_arns
        msg = 'ELB {} found.'.format(elb_name)
        logger.info(msg)
        return {
            'Status': 'SUCCESS',
            'Reason': msg,
            'PhysicalResourceId': context.log_stream_name,
            'StackId': event['StackId'],
            'RequestId': event['RequestId'],
            'LogicalResourceId': event['LogicalResourceId'],
            'Data': response_data
        }
    except Exception as e:
        error_msg = 'Failed to validate attributes of ELB {}. {}'.format(elb_name, e)
        logger.error(error_msg)
        return {
            'Status': 'FAILED',
            'Reason': error_msg,
            'PhysicalResourceId': context.log_stream_name,
            'StackId': event['StackId'],
            'RequestId': event['RequestId'],
            'LogicalResourceId': event['LogicalResourceId']
        }
The error says:
An error occurred (ValidationError) when calling the DescribeLoadBalancers operation
So, looking at where it calls DescribeLoadBalancers:
elb = client.describe_load_balancers(
    Names=[
        elb_name
    ]
)['LoadBalancers'][0]
The error also said:
The load balancer name ... cannot be longer than '32' characters.
The name comes from:
properties = event['ResourceProperties']
elb_name = properties['PORALBName']
So, the information is being passed into the Lambda function via event, which comes from whatever is triggering the function. You'll need to find out what is triggering the function and discover what information it is actually sending. Your problem lies outside of the code listed.
Other options
In your code, you can send event to the debug logs (e.g. print(event)) and see whether the ELB name is being passed in a different field.
Alternatively, you could call describe_load_balancers without a Name filter to retrieve a list of all load balancers, then use the ARN (that you have) to find the load balancer of interest. Simply loop through all the results until you find the one that matches the ARN you have. Then, continue as normal.
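A rough sketch of that fallback (the helper name and the use of a paginator are illustrative, not taken from the original template):

import boto3

def find_elb_by_arn(elb_arn):
    # Scan every load balancer and return the one whose ARN matches
    client = boto3.client('elbv2')
    paginator = client.get_paginator('describe_load_balancers')
    for page in paginator.paginate():
        for lb in page['LoadBalancers']:
            if lb['LoadBalancerArn'] == elb_arn:
                return lb
    raise Exception("No load balancer found with ARN '{}'.".format(elb_arn))

From there you can read lb['LoadBalancerName'], lb['DNSName'] and so on, exactly as the existing code does.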

How to use boto3 waiters to take snapshot from big RDS instances

I started migrating my code to boto 3 and one nice addition I noticed are the waiters.
I want to create a snapshot from a DB instance, and I want to check for its availability before I resume with my code.
My approach is the following:
# Notice: Step: Check snapshot availability [1st account - Oregon]
print("--- Check snapshot availability [1st account - Oregon] ---")
new_snap = client1.describe_db_snapshots(DBSnapshotIdentifier=new_snapshot_name)['DBSnapshots'][0]
# print(pprint.pprint(new_snap))  # debug

waiter = client1.get_waiter('db_snapshot_completed')
print("Manual snapshot is -pending-")
sleep(60)
waiter.wait(
    DBSnapshotIdentifier=new_snapshot_name,
    IncludeShared=True,
    IncludePublic=False
)
print("OK. Manual snapshot is -available-")
...but the documentation says that it polls the status every 15 seconds, 40 times. That is 10 minutes, yet a rather big DB will need more than that.
How could I use the waiter to allow for that?
Waiters have the configuration parameters 'delay' and 'max_attempts', like this:
waiter = rds_client.get_waiter('db_instance_available')
print("waiter delay: " + str(waiter.config.delay))
waiter.py on github
You could do it without the waiter if you like.
From the documentation for that waiter:
Polls RDS.Client.describe_db_snapshots() every 15 seconds until a successful state is reached. An error is returned after 40 failed checks.
Basically that means it does the following:
RDS = boto3.client('rds')
RDS.describe_db_snapshots()
You can just run that but filter to your snapshot ID. Here is the syntax: http://boto3.readthedocs.io/en/latest/reference/services/rds.html#RDS.Client.describe_db_snapshots
response = client.describe_db_snapshots(
    DBInstanceIdentifier='string',
    DBSnapshotIdentifier='string',
    SnapshotType='string',
    Filters=[
        {
            'Name': 'string',
            'Values': [
                'string',
            ]
        },
    ],
    MaxRecords=123,
    Marker='string',
    IncludeShared=True|False,
    IncludePublic=True|False
)
This will end up looking something like this:
snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
Then you can just loop until that returns a snapshot which is available. So here is a very rough idea:
import boto3
import time

RDS = boto3.client('rds')

snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
while snapshot_description['DBSnapshots'][0]['Status'] != 'available':
    print("still waiting")
    time.sleep(15)
    snapshot_description = RDS.describe_db_snapshots(DBSnapshotIdentifier='YOURIDHERE')
I think the other answer alluded to this solution but here it is expressly.
[snip]
...
# Create your waiter
waiter_db_snapshot = client1.get_waiter('db_snapshot_completed')

# Increase the max number of tries as appropriate
waiter_db_snapshot.config.max_attempts = 120

# Add a 60 second delay between attempts
waiter_db_snapshot.config.delay = 60

print("Manual snapshot is -pending-")
....
[snip]