Boto3 invocations of long-running Lambda runs break with TooManyRequestsException - amazon-web-services

Experience with "long-running" Lambdas
In my company, we recently ran into this behaviour when triggering Lambdas that run for more than 60 seconds (boto3's default timeout for both connection establishment and reads).
The beauty of invoking a Lambda with boto3 using the 'InvocationType' 'RequestResponse' is that the API returns the result state of the respective Lambda run, so we wanted to stick to that.
The issue seems to be that the client fires too many requests per minute on the standing connection to the API. We therefore experimented with the boto3 client configuration, but increasing the read timeout resulted in new (unwanted) invocations after each timeout period, and increasing the connection timeout triggered a new invocation after the Lambda had finished.

Workaround
As various investigations and experiments with boto3's Lambda client did not result in a working setup using 'RequestResponse' invocations,
we now circumvent the problem by making use of CloudWatch Logs. For this, the Lambda has to be set up to write to an accessible log group, so that these logs can then be queried for the state. You would invoke the Lambda and monitor it like this:
import time
import boto3
lambda_client = boto3.client('lambda')
logs_client = boto3.client('logs')
invocation = lambda_client.invoke(
    FunctionName='your_lambda',
    InvocationType='Event'
)
# Identifier of the invoked Lambda run
request_id = invocation['ResponseMetadata']['RequestId']
while True:
    # filter the logs for the Lambda end event
    events = logs_client.filter_log_events(
        logGroupName='your_lambda_loggroup',
        filterPattern=f'"END RequestId: {request_id}"'
    ).get('events', [])
    if len(events) > 0:
        # the Lambda invocation finished
        break
    # pause between polls so the CloudWatch Logs API is not hammered
    time.sleep(5)
This approach works for us now, but it's honestly ugly. To make it slightly better, I recommend setting a time range filter in the filter_log_events call.
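The time range filter could look roughly like the following sketch. It takes the logs client as a parameter and passes startTime (epoch milliseconds) to filter_log_events, and it gives up after a deadline instead of looping forever. The function name, the timeout and the poll interval are assumptions made for illustration, and pagination via nextToken is ignored for brevity.

```python
import time


def wait_for_end_event(logs_client, log_group, request_id, start_ms,
                       timeout_s=900.0, poll_s=5.0):
    """Poll CloudWatch Logs until the END line for request_id shows up.

    Returns True if the END line was found, False if timeout_s elapsed first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = logs_client.filter_log_events(
            logGroupName=log_group,
            # Only scan events logged at or after start_ms (epoch milliseconds).
            startTime=start_ms,
            filterPattern=f'"END RequestId: {request_id}"',
        )
        if response.get('events'):
            return True
        if time.monotonic() >= deadline:
            return False
        # Pause between polls to stay well below the API rate limits.
        time.sleep(poll_s)
```

A caller would typically record start_ms = int(time.time() * 1000) right before invoking the Lambda and pass boto3.client('logs') as logs_client.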
One thing that has not been tested yet: the above approach only tells you whether the Lambda terminated, not its state (failed or successful), and the default logs don't hold anything useful in that regard. Therefore, I will investigate whether a Lambda run can know its own request ID at runtime. The Lambda code could then be prepared to also write error messages containing the request ID, which can be filtered for in turn.
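On the request ID question: in the Python runtime, the context object passed to the handler carries the ID of the current invocation as context.aws_request_id. A minimal sketch of a handler that writes one searchable status line per run could look like this. The do_work function and the JSON field names are hypothetical placeholders, not part of the setup described above.

```python
import json


def do_work(event):
    """Hypothetical placeholder for the real business logic."""
    if event.get('fail'):
        raise ValueError('simulated failure')
    return {'ok': True}


def lambda_handler(event, context):
    # The Python runtime exposes the current request ID on the context object.
    request_id = context.aws_request_id
    try:
        result = do_work(event)
    except Exception as exc:
        # One searchable log line per failed run, keyed by request ID.
        print(json.dumps({'request_id': request_id,
                          'status': 'FAILED',
                          'error': str(exc)}))
        raise
    print(json.dumps({'request_id': request_id, 'status': 'SUCCEEDED'}))
    return result
```

The polling side could then filter the log group for the request ID together with FAILED or SUCCEEDED instead of the END line.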

Related

AWS lambda: Execute function on timeout

I am developing a lambda function that migrates logs from an SFTP server to an S3 bucket.
Due to the size of the logs, the function sometimes is timing out - even though I have set the maximum timeout of 15 minutes.
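try:
    logger.info(f'Migrating {log_name}... ')
    transfer_to_s3(log_name, sftp)
    logger.info(f'{log_name} was migrated successfully ')
except Exception:
    # Ordinary errors end up here; a Lambda timeout does not raise an exception
    logger.exception(f'Migrating {log_name} failed')
    raise
If transfer_to_s3() fails due to a timeout, the logger.info(f'{log_name} was migrated successfully ') line won't be executed.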
I want to ensure that in this scenario, I will somehow know that a log was not migrated due to timeout.
Is there a way to force lambda to perform an action, before exiting, in the case of a timeout?
Probably a better way would be to use SQS for that:
Log info ---> SQS queue ---> Lambda function
If the Lambda successfully moves the files, it removes the log info from the SQS queue. If it fails, the log info persists in the SQS queue (or goes to a DLQ for special handling), so the next Lambda invocation can handle it.
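Independently of the queue, the function can also notice that it is about to run out of time. The Python runtime's context object offers get_remaining_time_in_millis(), which can be checked between transfers. The sketch below assumes that the work can be split into one transfer per log. The function name, the transfer callable and the 30-second safety margin are illustrative assumptions. A single transfer that is too large can still hit the timeout.

```python
def migrate_logs(log_names, transfer, context, safety_margin_ms=30_000):
    """Transfer logs one by one and stop starting new ones when time is short.

    Returns (migrated, skipped) so the caller can log or re-queue the rest.
    """
    migrated, skipped = [], []
    for name in log_names:
        # Ask the runtime how much time is left before the Lambda times out.
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            skipped.append(name)
            continue
        transfer(name)
        migrated.append(name)
    return migrated, skipped
```

The skipped names could be logged as "not migrated due to timeout" or left in the queue for the next invocation.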

What happens if timeout handler is not cancelled inside lambda function?

I have a lambda function that sets timeout handler with a certain delay (60 seconds) at the beginning.
I'd like to know the exact behavior of Lambda when the timeout handler has not been cancelled by the time the Lambda returns its response (in less than 60 seconds). In particular, when there are hundreds of Lambda invocations, will the uncancelled timeout handler from a previous execution affect the next invocation that runs on the same instance? More info: the Lambda function is invoked asynchronously.
You haven't mentioned which language you're using or provided any code indicating how you're creating timeouts, but the general process is described at AWS Lambda execution environment.
Lambda freezes the execution environment following an invocation. It remains frozen up to a certain maximum amount of time (15 mins afaik) and is thawed if a new invocation happens quickly enough, in which case the prior execution environment is re-used.
A key quote from the documentation is:
Background processes or callbacks that were initiated by your Lambda function and did not complete when the function ended [will] resume if Lambda reuses the execution environment. Make sure that any background processes or callbacks in your code are complete before the code exits.
As you wrote in the comments, the Lambda is written in Python.
This simple example shows that a timer armed in one invocation carries over to the next invocation:
The code:
import json
import signal
import random
def delayed(val):
    print("Delayed:", val)
def lambda_handler(event, context):
    r = random.random()
    print("Generated", r)
    signal.signal(signal.SIGALRM, lambda *args: delayed(r))
    signal.setitimer(signal.ITIMER_REAL, 1)
    return {'statusCode': 200}
Yields: [screenshot of the CloudWatch logs]
Think about the way AWS implements Lambdas:
When a Lambda is invoked, a container is started and the environment begins to initialize (this is the cold-start phase).
During this initialization, the Python interpreter starts and, behind the scenes, AWS code fetches events from the Lambda service and triggers your handler.
This initialization is costly, so AWS prefers to keep the same "process" waiting for the next event. In the happy flow, it arrives "fast enough" after the previous one finished, so the initialization is spared and everyone is happy.
Otherwise, after a short period, the container is shut down.
As long as the interpreter is still alive, the signal armed in one invocation will leak into the next invocation.
Note also the concurrency of Lambdas: two invocations that run in parallel run in different containers and thus have different interpreters, so this alarm will not leak between them.
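To avoid the leak, the timer can be disarmed before the handler returns. The following is a minimal sketch, assuming the timeout is implemented with signal.setitimer as in the example above. The handler name on_timeout and the 60-second delay are taken from the question's description, not from real code.

```python
import signal


def on_timeout(signum, frame):
    print("Timeout handler fired")


def lambda_handler(event, context):
    signal.signal(signal.SIGALRM, on_timeout)
    # Arm a 60-second timer, as described in the question.
    signal.setitimer(signal.ITIMER_REAL, 60)
    try:
        return {'statusCode': 200}  # the real work would go here
    finally:
        # A delay of 0 disarms the timer, so nothing is pending
        # when the execution environment is frozen or reused.
        signal.setitimer(signal.ITIMER_REAL, 0)
```

Putting the reset into a finally block means it also runs when the handler raises.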

Lambda Low Latency Messaging Options

I have a Lambda that requires messages to be sent to another Lambda to perform some action. In my particular case it is passing a message to a Lambda in order for it to perform HTTP requests and refresh cache entries.
Currently I am relying on the AWS SDK to send an SQS message. The mechanics of this are working fine. The concern that I have is that the SQS send method call takes around 50ms on average to complete. Considering I'm in a Lambda, I am unable to perform this in the background and expect for it to complete before the Lambda returns and is frozen.
This is further compounded if I need to make multiple SQS send calls, which is particularly bad as the Lambda is responsible for responding to low-latency HTTP requests.
Are there any alternatives in AWS for communicating between Lambdas that does not require a synchronous API call, and that exhibits more of a fire and forget and asynchronous behavior?
Though there are several approaches to trigger one Lambda from another, (in my experience) one of the fastest methods is to invoke the target Lambda directly by its ARN.
Did you try invoking one Lambda from the other using the AWS SDKs?
(For example, in Python using Boto3, I achieved it like this.)
As shown below, the parameter InvocationType = 'Event' invokes the target Lambda asynchronously.
The code below takes 2 parameters: name, which can be either your target Lambda function's name or its ARN, and params, a JSON-serializable object with the input you want to pass. Try it out!
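import boto3, json
from botocore.exceptions import ClientError
def invoke_lambda(name, params):
    lambda_client = boto3.client('lambda')
    params_bytes = json.dumps(params).encode()
    try:
        # InvocationType='Event' queues the call and returns immediately.
        # LogType only applies to synchronous invocations.
        response = lambda_client.invoke(FunctionName = name,
                                        InvocationType = 'Event',
                                        LogType = 'Tail',
                                        Payload = params_bytes)
    except ClientError as e:
        print(e)
        return None
    return response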
Hope it helps!
For more, refer to Lambda's Invoke Event on Boto3 docs.
Alternatively, you can use Lambda's Async Invoke as well.
It's difficult to give exact answers without knowing what language you are writing the Lambda function in. To at least make "warm" function invocations faster, I would make sure you are creating the SQS client outside of the Lambda event handler so it can reuse the connection. The AWS SDK should use an HTTP connection pool, so it doesn't have to re-establish a connection and go through the SSL handshake and all that every time you make an SQS request, as long as you reuse the SQS client.
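Along the same lines, if several messages have to be sent per request, they can be grouped so that fewer round trips are needed. The sketch below assumes Python with boto3 and uses the SQS send_message_batch call, which accepts up to 10 entries per request. The helper name and the way the Id values are generated are assumptions for illustration. The client is passed in so that one client created outside the handler can be reused.

```python
def send_in_batches(sqs_client, queue_url, bodies, batch_size=10):
    """Send message bodies with as few SendMessageBatch calls as possible.

    Returns the list of entries that SQS reported as failed.
    """
    failed = []
    for start in range(0, len(bodies), batch_size):
        # Each entry needs an Id that is unique within its batch.
        entries = [
            {'Id': str(start + offset), 'MessageBody': body}
            for offset, body in enumerate(bodies[start:start + batch_size])
        ]
        response = sqs_client.send_message_batch(QueueUrl=queue_url,
                                                 Entries=entries)
        failed.extend(response.get('Failed', []))
    return failed
```

Each batch is still a synchronous API call, so this reduces the number of calls rather than removing the latency of a single one.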
If that's still not fast enough, I would have the Lambda function handling the HTTP request pass off the "background" work to another Lambda function, via an asynchronous call. Then the first Lambda function can return an HTTP response, while the second Lambda function continues to do work.
You might also try Lambda Destinations, depending on your use case. With this you don't need to put things in a queue manually.
https://aws.amazon.com/blogs/compute/introducing-aws-lambda-destinations/
But it limits your flexibility. From my point of view, chaining Lambdas directly is an antipattern; if you need that, go for Step Functions.

Is there any way to connect multiple requests/triggers from SQS to a single-threaded Lambda function?

My app uses Lambda function (1) to import data into a third-party database server. Sometimes (1) throws errors, and I use SQS to store the messages that failed in (1). Lambda function (2) reads all messages in SQS and re-imports them by calling (1) again. (2) is triggered whenever SQS receives a message.
Full error flow: Lambda (1) => SQS => Lambda (2) => Lambda (1).
The problem is that if the DB server is under maintenance, this becomes an infinite loop until the DB server is active again.
My solution is to create a Lambda function (3) that acts as a flag and checks the DB server status. It runs when SQS receives a new message and runs repeatedly until the DB server is active again. Only then is Lambda (2) called.
I also want Lambda (3) to be single-threaded (a singleton?), so that all requests from SQS go through one thread.
=> With this solution, the system only needs to retry on one thread if the DB server is down.
New flow: Lambda (1) => SQS => Single thread Lambda (3) => Lambda (2) => Lambda (1)
My questions are:
Is my solution possible or not?
If it is possible, how do I set up Lambda (3)?
If it is impossible, is there any other way to resolve my problem?
Please help, thank you!
It is possible by using throttling and CloudWatch scheduled event triggers.
You can set up a CloudWatch scheduled event to periodically run Lambda function 3 (the one responsible for DB status checking). I am not sure what you mean by single-threaded, but I guess you mean that at most one instance of that function runs at a time. This is easy because a CloudWatch scheduled event runs that function just once per interval, which you can specify.
Once function (3) detects that the DB is unhealthy, it can set a concurrency limit on your Lambda function that reads messages from SQS (2) and throttle it down to 0, so that Lambda function (2) cannot be executed at all.
When function (3) detects that the DB is healthy, it removes this concurrency limit from function (2).
So the code of Lambda function (3) could look something like this:
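import boto3
# "lambda" is a reserved word in Python, so the client needs another name
lambda_client = boto3.client('lambda')
# db_is_not_healthy and function_2 stand for your own health check and function name
if db_is_not_healthy:
    # throttle the SQS consumer (2) down to zero concurrent executions
    lambda_client.put_function_concurrency(
        FunctionName=function_2,
        ReservedConcurrentExecutions=0
    )
else:
    # remove the limit so that (2) can process messages again
    lambda_client.delete_function_concurrency(
        FunctionName=function_2
    )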
How exactly you set up your Lambda health checks, when to start them, when to stop them and how often to ping the DB depends on your particular use case and how much you are willing to pay for it.
For example, you could start pinging the DB only after there have been some errors with it. Once Lambda function (1) receives an error response, it can enable the health checks, i.e. Lambda (3), by unthrottling it. Once Lambda (3) decides that the DB is healthy again, it can throttle itself, so that these health checks are performed only when there are problems with the DB.
This is definitely not the most elegant solution, but it should work after some tweaking.

Can I schedule a lambda function execution with a lambda function?

I'm looking for the ability to programmatically schedule a lambda function to run a single time with another lambda function. For example, I made a request to myFirstFunction with date and time parameters, and then at that date and time, have mySecondFunction execute. Is that possible only with stateless AWS services? I'm trying to avoid an always-on ec2 instance.
Most of the results I'm finding for scheduling Lambda functions have to do with CloudWatch and regularly scheduled events, not ad-hoc events.
This is a perfect use case for AWS Step Functions.
Use a Wait state with SecondsPath or TimestampPath to add the required delay before executing the next state.
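As a rough sketch of that idea, the state machine definition (Amazon States Language) could be built like this. The state names and the input key run_at are hypothetical choices, and function_arn would be the ARN of mySecondFunction.

```python
import json


def build_delayed_invoke_definition(function_arn):
    """Return an Amazon States Language definition as a JSON string.

    The execution input is expected to contain an ISO 8601 timestamp
    under the (hypothetical) key "run_at".
    """
    definition = {
        "StartAt": "WaitUntilRunAt",
        "States": {
            "WaitUntilRunAt": {
                "Type": "Wait",
                # Wait until the timestamp found at this path of the input.
                "TimestampPath": "$.run_at",
                "Next": "InvokeSecondFunction",
            },
            "InvokeSecondFunction": {
                "Type": "Task",
                "Resource": function_arn,
                "End": True,
            },
        },
    }
    return json.dumps(definition, indent=2)
```

myFirstFunction would then start one execution per request, for example with the Step Functions client's start_execution call and an input such as {"run_at": "2030-01-01T00:00:00Z"}.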
What you're trying to do (schedule a Lambda from a Lambda) is not possible with the current AWS services.
So, in order to avoid an always-on EC2 instance, there are other options:
1) Use AWS default or custom metrics. You can use, for example, ApproximateNumberOfMessagesVisible or CPUUtilization (if your app causes high CPU utilization when it processes a request). You can also create a custom metric and fire it when your instance is idle (depending on the app that's running on your instance).
The problem with this option is that you'll waste minutes you have already paid for (AWS always charges a full hour, no matter if you used your instance for only 15 minutes).
2) A better option, in my opinion, would be to run a Lambda function once per minute to check if your instances are idle and shut them down only if they are close to the full hour.
import boto3
from datetime import datetime, timezone
def lambda_handler(event, context):
    print('ManageInstances function executed.')
    # pairs of [instance id, SQS queue URL]
    environments = [
        ['instance-id-1', 'SQS-queue-url-1'],
        ['instance-id-2', 'SQS-queue-url-2'],
        # ...
    ]
    ec2_client = boto3.client('ec2')
    for environment in environments:
        instance_id = environment[0]
        queue_url = environment[1]
        print('Instance:', instance_id)
        print('Queue:', queue_url)
        rsp = ec2_client.describe_instances(InstanceIds=[instance_id])
        if rsp:
            status = rsp['Reservations'][0]['Instances'][0]
            if status['State']['Name'] == 'running':
                # LaunchTime is timezone-aware (UTC), so compare it with an aware "now"
                current_time = datetime.now(timezone.utc)
                diff = current_time - status['LaunchTime']
                total_minutes = divmod(diff.total_seconds(), 60)[0]
                minutes_to_complete_hour = 60 - divmod(total_minutes, 60)[1]
                print('Started time:', status['LaunchTime'])
                print('Current time:', str(current_time))
                print('Minutes passed:', total_minutes)
                print('Minutes to reach a full hour:', minutes_to_complete_hour)
                if minutes_to_complete_hour <= 2:
                    sqs_client = boto3.client('sqs')
                    response = sqs_client.get_queue_attributes(QueueUrl=queue_url, AttributeNames=['All'])
                    messages_in_flight = int(response['Attributes']['ApproximateNumberOfMessagesNotVisible'])
                    messages_available = int(response['Attributes']['ApproximateNumberOfMessages'])
                    print('Messages in flight:', messages_in_flight)
                    print('Messages available:', messages_available)
                    # stop the instance only when the queue is completely empty
                    if messages_in_flight + messages_available == 0:
                        ec2_resource = boto3.resource('ec2')
                        instance = ec2_resource.Instance(instance_id)
                        instance.stop()
                        print('Stopping instance.')
            else:
                print('Status was not running. Nothing is done.')
        else:
            print('Problem while describing instance.')
UPDATE - I wouldn't recommend using this approach. The timing of TTL deletions has changed, and they no longer happen close to the TTL time. The only guarantee is that the item will be deleted at some point after the TTL. Thanks #Mentor for highlighting this.
2 months ago AWS announced DynamoDB item TTL, which allows you to insert an item and mark when you wish for it to be deleted. It will be deleted automatically when the time comes.
You can use this feature in conjunction with DynamoDB Streams to achieve your goal: your first function inserts an item into a DynamoDB table. The record's TTL should be the time at which you want the second Lambda triggered. Set up a stream that triggers your second Lambda. In this Lambda, check the event type and run your logic only if the event is a deletion.
Bonus point - you can use the table item as a mechanism for the first lambda to pass parameters to the second lambda.
About DynamoDB TTL:
https://aws.amazon.com/blogs/aws/new-manage-dynamodb-items-using-time-to-live-ttl/
It does depend on your use case, but wanting to trigger something at a later date is a common pattern. The way I do it serverlessly is with a React application that triggers an action to store a date in the future. I take a date in a format like 24-12-2020 and convert it using Date(), after checking that this format is parsed correctly; I might also try 12-24-2020 and see what I get(!). When I am happy, I convert it to a Unix timestamp. In JavaScript (React) I use this code:
new Date(action.data).getTime() / 1000
where action.data is the date and maybe the time for the action.
I run React in Amplify (serverless) and store the value in DynamoDB (serverless). I then run a Lambda function (serverless) that checks DynamoDB for any due dates: it compares the current Unix time with the stored Unix time. Both are numbers, so the comparison is easy. This seems to me to be super easy and very reliable.
I just set the cron schedule of the Lambda to whatever is needed, depending on the approximate frequency required. In most cases running a Lambda every five minutes is pretty good, although if I were only operating this in a certain time zone for a business weekday application, I would restrict the schedule a little more. Lambda is free for the first 1 million requests per month, so running it every few minutes will cost nothing. Obviously things change, so you will need to look that up for your region.
You will never get perfect timing in this scenario. For the vast majority of use cases, however, it will be close enough, depending on the schedule of the Lambda function: you could set it up to check every minute or just once per day. It all depends on your application.
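The comparison done by the polling Lambda can be sketched as follows. This assumes that each stored item carries its Unix timestamp under a key named run_at, which is a hypothetical name. Reading the items from DynamoDB is left out.

```python
import time


def due_items(items, now=None):
    """Return the items whose stored Unix timestamp is not in the future."""
    now = time.time() if now is None else now
    return [item for item in items if item['run_at'] <= now]
```

The scheduled Lambda would run its action for every returned item and then mark or delete it, so that it is not picked up again on the next run.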
Alternatively, if I wanted an instant reaction to an event, I might use SNS, SQS, or Kinesis to stream a message instantly. It all depends on your use case.
I'd opt for enqueuing deferred work to SQS using message timers in myFirstFunction.
When this was written, SQS could not be used as a Lambda event source (AWS added SQS triggers for Lambda in 2018), so the options were to either periodically schedule mySecondFunction to check the queue via scheduled CloudWatch Events (somewhat of a variant of the other options you've found) or use a CloudWatch alarm on ApproximateNumberOfMessagesVisible to fire an SNS message to a Lambda and avoid constant polling for queues that are frequently inactive for long periods.
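Enqueuing with a message timer could look like the sketch below, assuming Python with boto3. The DelaySeconds parameter of send_message accepts at most 900 seconds (15 minutes), so the helper refuses longer delays instead of silently shortening them. The function name and the error handling are illustrative assumptions.

```python
from datetime import datetime, timezone

MAX_DELAY_SECONDS = 900  # SQS message timers allow at most 15 minutes


def enqueue_deferred(sqs_client, queue_url, body, run_at, now=None):
    """Send body to the queue so that it becomes visible at run_at.

    run_at and now must be timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    # A run_at in the past simply means "deliver immediately".
    delay = max(0, int((run_at - now).total_seconds()))
    if delay > MAX_DELAY_SECONDS:
        raise ValueError(
            f'delay of {delay}s exceeds the SQS limit of {MAX_DELAY_SECONDS}s')
    return sqs_client.send_message(QueueUrl=queue_url,
                                   MessageBody=body,
                                   DelaySeconds=delay)
```

For delays beyond 15 minutes, one of the other approaches in this thread would be needed.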