When a Lambda times out, it outputs a message to CloudWatch (if enabled) saying "Task timed out".
It would be beneficial to attach additional info (such as the context of the offending call) to the message. Right now I'm writing the context to CloudWatch at the start of the invocation - but it would sometimes be preferable if everything was contained within a single message.
Is something like that possible?
Unfortunately there is no almost-timed-out hook. You may, however, be able to inspect the context object you get in the Lambda handler to check the remaining run time, and print the additional info when it gets close to the timeout.
In Python you could use context.get_remaining_time_in_millis(), as per the documentation, to get that info.
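A minimal sketch of that idea (the 5-second threshold, the do_some_work helper, and the assumption of a Records-style event are placeholders of mine, not anything from the question):

import json

def lambda_handler(event, context):
    for record in event.get('Records', []):
        do_some_work(record)  # hypothetical unit of work
        # between units of work, check how close we are to the timeout
        if context.get_remaining_time_in_millis() < 5000:
            print('About to time out, current context:', json.dumps(record))
            break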
There is no timeout hook for Lambda, but one can be implemented with a little bit of code:
import signal

def timeout_handler(_signal, _frame):
    # raised shortly before the real timeout, so extra information can still be logged
    raise Exception('other information')

def handler(event, context):
    # register the handler and schedule SIGALRM one second before the Lambda times out
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(int(context.get_remaining_time_in_millis() / 1000) - 1)
    # ... your handler logic ...
    signal.alarm(0)  # cancel the alarm if the work finishes in time
We implemented something like this for a lot of custom handlers in CloudFormation.
Related: experience with "long-running" Lambdas
In my company, we recently ran into this behaviour when triggering Lambdas that run for more than 60 seconds (boto3's default timeout for connection establishment and reads).
The beauty of the Lambda invocation with boto3 (using the 'InvocationType' 'RequestResponse') is that the API returns the result state of the respective Lambda run, so we wanted to stick to that.
The issue seems to be that the client fires too many requests per minute on the standing connection to the API. We therefore experimented with the boto3 client configuration, but increasing the read timeout resulted in new (unwanted) invocations after each timeout period, and increasing the connection timeout triggered a new invocation after the Lambda had finished.
Workaround
As various investigations and experiments with boto3's Lambda client did not result in a working setup using 'RequestResponse' invocations, we now circumvent the problem by making use of CloudWatch Logs. For this, the Lambda has to be set up to write to an accessible log group. These logs can then be queried for the state. You would invoke the Lambda and monitor it like this:
import time

import boto3

lambda_client = boto3.client('lambda')
logs_client = boto3.client('logs')

invocation = lambda_client.invoke(
    FunctionName='your_lambda',
    InvocationType='Event'
)

# Identifier of the invoked Lambda run
request_id = invocation['ResponseMetadata']['RequestId']

while True:
    # filter the logs for the Lambda end event
    events = logs_client.filter_log_events(
        logGroupName='your_lambda_loggroup',
        filterPattern=f'"END RequestId: {request_id}"'
    ).get('events', [])
    if len(events) > 0:
        # the Lambda invocation finished
        break
    # wait a bit between polls to avoid throttling the CloudWatch Logs API
    time.sleep(5)
This approach works for us now, but it's honestly ugly. To make it slightly better, I recommend setting a time range in the filter_log_events call.
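A sketch of what that could look like inside the loop above (the five-minute window is an arbitrary choice on my part):

import time

now_ms = int(time.time() * 1000)
events = logs_client.filter_log_events(
    logGroupName='your_lambda_loggroup',
    filterPattern=f'"END RequestId: {request_id}"',
    # startTime/endTime are epoch milliseconds; only search recent logs
    startTime=now_ms - 5 * 60 * 1000,
    endTime=now_ms
).get('events', [])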
One thing that was not tested (yet): the above approach only tells whether the Lambda terminated, but not its state (failed or successful), and the default logs don't hold anything useful in that regard. Therefore, I will investigate whether a Lambda run can know its own request id during runtime. Then the Lambda code can be prepared to also write error messages with the request id, which can then be filtered for again.
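As a pointer for that investigation: in the Python runtime the request id is available on the context object as context.aws_request_id, so the invoked Lambda could tag its error output with it, roughly like this (a sketch we have not verified end to end; do_the_work is a placeholder):

def handler(event, context):
    try:
        do_the_work(event)  # hypothetical body of your_lambda
    except Exception as exc:
        # tag the error with the request id so the caller can filter for it
        print(f'ERROR RequestId: {context.aws_request_id} {exc}')
        raise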
I have about a dozen or so GCF functions (Python) which run in series, once a day. In order to keep the correct sequence, I use PubSub. So for example:
topic1 triggers function1 -> function1 runs -> function1 writes a message to topic2 -> topic2 triggers function2 -> function2 runs -> etc.
This use case is low throughput and a very straightforward (I thought) way to use GCF and PubSub together to each other's advantage. The functions use pubsub_v1 in Python to publish messages. There are no problems with IAM, permissions, etc. The code looks like:
from google.cloud import pubsub_v1
# Publish message
publisher = pubsub_v1.PublisherClient()
topic2 = publisher.topic_path('my-project-name', 'topic2_id')
publish_message = '{short json message to be published}'
print('sending message ' + publish_message)
publisher.publish(topic2, publish_message.encode("utf-8"))
And I deploy function1 and other functions using:
gcloud functions deploy function1 --entry-point=my_python_function --runtime=python37 \
--trigger-topic=topic1 --memory=4096MB --region=us-central1 \
--source="url://source-repository-with-my-code"
However, recently I have started to see some really weird behaviour. Basically, function1 runs, the logs look great, the message has seemingly been published to topic2... then nothing. function2 doesn't begin execution or show anything in the logs to suggest it's been triggered. There are no logs suggesting either success or failure. So essentially it seems that either:
the message from function1 to topic2 is not getting published, despite function1 finishing with Function execution took 24425 ms, finished with status: 'ok'
the message from function1 to topic2 is getting published, but topic2 is not triggering function2.
Is this expected behaviour for PubSub? These failures seem completely random. I went months with everything working very reliably, and now suddenly I have no idea whether the messages are going to be delivered or not. It also seems really difficult to track the lifespan of these PubSub messages to see where exactly they're going missing. I've read in the docs about dead letter topics etc, but I don't really understand how to set up something that makes it easy to track.
Is it normal for very low frequency, short messages to "fail" to be delivered?
Is there something I'm missing or something I should be doing, e.g. in the publisher.publish() call to ensure more reliable delivery?
Is there a transparent way to see what's going on and see where these messages are going missing? Setting up a new subscription which I can view in the console and see which messages are being delivered and which are failing, something like that?
If I need 100% (or close to that) reliability, should I be ditching GCF and PubSub? What's better?
The issue here is that you aren't waiting for publisher.publish to actually succeed. This method returns a future and may not complete synchronously. If you want to ensure the publish has completed successfully, you need to call result() on the value returned from publish:
future = publisher.publish(topic2, publish_message.encode("utf-8"))
future.result()
You will also want to ensure that you have "Retry on failure" enabled on your cloud function by passing the --retry argument to gcloud functions deploy. That way, if the publish fails, the message from topic1 will be redelivered to the cloud function to be tried again.
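Applied to your deploy command, that would just mean appending the flag (everything else unchanged):

gcloud functions deploy function1 --entry-point=my_python_function --runtime=python37 \
    --trigger-topic=topic1 --memory=4096MB --region=us-central1 \
    --source="url://source-repository-with-my-code" --retry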
I have a handful of Python Lambda functions with tracing enabled, they start like this:
import json
import boto3
from aws_xray_sdk.core import patch
from aws_xray_sdk.core import xray_recorder
patch(['boto3'])
def lambda_handler(event, context):
...
With this, tracing works for every Lambda function itself, and I can see the sub-service calls to DynamoDB or Kinesis made through boto3.
But how can I connect various Lambda functions together in one trace? I'm thinking of generating a unique string in the first function and writing it into the message stored in Kinesis. Another function would then pick up the string from the Kinesis message and trace it again.
How would this be possible in a way to then see the whole connected trace in X-Ray?
If your upstream service which invokes your Lambda functions has tracing enabled, your functions will automatically send traces. From your question, I'm not sure how your functions are invoked. If one function is directly invoking another function, you'll have a single trace for them.
For your approach of invoking Lambdas with Kinesis messages, I'm not sure it would achieve what you want, for several reasons. First, Kinesis is not integrated with X-Ray, which means it will not propagate the trace header to the downstream Lambda. Second, the segment and the trace header of a Lambda function are not directly accessible from your function's code, since they are generated by the Lambda runtime upon invocation and are thus immutable. Explicitly overriding the trace id in a Lambda function may result in undesired behavior in your service graph.
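One thing that does work without touching trace ids is to record your own correlation id as an annotation on a subsegment in each function, and then look up the related traces with an annotation filter expression (e.g. annotation.correlation_id = "..."). A rough sketch, assuming you carry a correlation_id field in your Kinesis message payload (extract_correlation_id and process are placeholders):

from aws_xray_sdk.core import xray_recorder

def lambda_handler(event, context):
    correlation_id = extract_correlation_id(event)  # hypothetical helper
    # annotations can't be added to the Lambda facade segment, so open a subsegment
    subsegment = xray_recorder.begin_subsegment('correlation')
    subsegment.put_annotation('correlation_id', correlation_id)
    try:
        process(event)  # hypothetical business logic
    finally:
        xray_recorder.end_subsegment()

The traces are still not stitched into one service graph, but they can all be found by filtering on that annotation.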
I have a Lambda that requires messages to be sent to another Lambda to perform some action. In my particular case it is passing a message to a Lambda in order for it to perform HTTP requests and refresh cache entries.
Currently I am relying on the AWS SDK to send an SQS message. The mechanics of this are working fine. My concern is that the SQS send call takes around 50 ms on average to complete. Since I'm in a Lambda, I am unable to perform this in the background and expect it to complete before the Lambda returns and is frozen.
This is further compounded if I need to make multiple SQS send calls, which is particularly bad as the Lambda is responsible for responding to low-latency HTTP requests.
Are there any alternatives in AWS for communicating between Lambdas that does not require a synchronous API call, and that exhibits more of a fire and forget and asynchronous behavior?
Though there are several approaches to trigger one Lambda from another, (in my experience) one of the fastest methods is to invoke the target Lambda directly by its ARN.
Did you try invoking one Lambda from the other using AWS SDKs?
(For example, in Python using Boto3, I achieved it like this.)
See below: the parameter InvocationType='Event' invokes the target Lambda asynchronously.
The code below takes two parameters (name, which can be either your target Lambda function's name or its ARN, and params, a JSON-serialisable object with the input parameters you want to pass). Try it out!
import json

import boto3
from botocore.exceptions import ClientError

def invoke_lambda(name, params):
    lambda_client = boto3.client('lambda')
    params_bytes = json.dumps(params).encode()
    try:
        # LogType='Tail' only has an effect for synchronous (RequestResponse) invocations
        response = lambda_client.invoke(FunctionName=name,
                                        InvocationType='Event',
                                        LogType='Tail',
                                        Payload=params_bytes)
    except ClientError as e:
        print(e)
        return None
    return response
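For example (the function name and payload here are placeholders):

invoke_lambda('my-cache-refresher', {'action': 'refresh', 'key': 'abc'})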
Hope it helps!
For more, refer to Lambda's Invoke Event on Boto3 docs.
Alternatively, you can use Lambda's Async Invoke as well.
It's difficult to give exact answers without knowing what language you are writing the Lambda function in. To at least make "warm" function invocations faster, I would make sure you are creating the SQS client outside of the Lambda event handler so it can reuse the connection. The AWS SDK should use an HTTP connection pool, so it doesn't have to re-establish a connection and go through the SSL handshake every time you make an SQS request, as long as you reuse the SQS client.
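A minimal sketch of that, assuming Python and a placeholder queue URL:

import boto3

# create the client once, outside the handler, so warm invocations
# reuse the already-established HTTP connection
sqs = boto3.client('sqs')

def lambda_handler(event, context):
    sqs.send_message(
        QueueUrl='https://sqs.us-east-1.amazonaws.com/123456789012/my-queue',  # placeholder
        MessageBody='refresh cache entry'
    )
    return {'statusCode': 200}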
If that's still not fast enough, I would have the Lambda function handling the HTTP request pass off the "background" work to another Lambda function, via an asynchronous call. Then the first Lambda function can return an HTTP response, while the second Lambda function continues to do work.
You might also try Lambda Destinations, depending on your use case. With them you don't need to put things in a queue manually.
https://aws.amazon.com/blogs/compute/introducing-aws-lambda-destinations/
But it limits your flexibility. From my point of view, chaining Lambdas directly is an antipattern, and if you need that, go for Step Functions.
In this tutorial it is written:
Set reasonable timeout periods, and report when they're about to be exceeded
If an operation doesn't execute within its defined timeout period, the
function raises an exception and no response is sent to
CloudFormation.
To avoid this, ensure that the timeout value for your Lambda functions
is set high enough to handle variations in processing time and network
conditions. Consider also setting a timer in your function to respond
to CloudFormation with an error when a function is about to time out;
this can help prevent function timeouts from causing custom resource
timeouts and delays.
What is the exact solution behind this? Should I implement the timeout on the AWS Lambda side, or can I just set a timeout period in the CustomResource properties?
AFAIK, you can't set a timeout on a CustomResource.
What they are writing about in your citation is that it's up to you to signal to CloudFormation just before your function times out.
You can find out the remaining time by querying the context object, which is the second parameter of your handler function. For example, in Python:
def handler(event, context):
    print("Time left:", context.get_remaining_time_in_millis())
You will see that the method call is similar in other languages, e.g. Java:
context.getRemainingTimeInMillis()
So, you could query the remaining time in a loop and, when that value gets low (e.g. 3000 ms), check whether your resource is still not created and send an error signal to CloudFormation.
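For the signalling part, Lambda-backed custom resources whose code is provided inline in the template get the cfnresponse helper module; a rough sketch of that loop could look like this (the 3-second threshold and resource_is_ready are placeholders, and if your code isn't inline you need to bundle an equivalent of cfnresponse yourself):

import time

import cfnresponse  # available to inline (ZipFile) custom resource functions

def handler(event, context):
    try:
        while not resource_is_ready():  # hypothetical readiness check
            if context.get_remaining_time_in_millis() < 3000:
                # report failure to CloudFormation before the Lambda itself times out
                cfnresponse.send(event, context, cfnresponse.FAILED,
                                 {'Reason': 'about to hit the Lambda timeout'})
                return
            time.sleep(1)
        cfnresponse.send(event, context, cfnresponse.SUCCESS, {})
    except Exception:
        cfnresponse.send(event, context, cfnresponse.FAILED, {})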
Second, do increase the timeout on your function, as they recommend.