AWS Lambda Stops Running Randomly - amazon-web-services

Has anyone ever seen a Lambda function stop randomly?
I've got a method that subscribes to an SNS topic that is published to every hour. I recently had 4 messages come through to the subscriber Lambda, and 3 of the 4 worked perfectly.
For those three, CloudWatch shows all of the console logs I have logged, I get responses from all of the APIs the method reaches out to, and each run ends with a success message. For the fourth message, the initial console log is written to CloudWatch and then the "Request End" log follows immediately. There are no further console.logs, no error from Lambda, no timeout error, and no insight as to why it would have stopped; it just stopped.
I've never seen this kind of behavior before and have yet to have a Lambda function stop working without logging out the error (everything that runs is written in a try/catch that has reliably logged the errors until now).
Has anyone ever come across this and have any insight as to what it may be?

Related

AWS Lex, Amazon Connect, and Lambda. Dialog flow freezes after intent is fulfilled

I have an Amazon Connect instance which uses a Lex v2 bot in the "get customer input" block. There is only one intent, which is triggered and reaches fulfillment fine. It is fulfilled by a Lambda function, which executes fully with no errors in the code.
I tested in the Lex v2 console and the bot works fine: it returns the intent fulfilled message and the closing response. In the Connect dialog flow, I get the intent fulfilled message and then nothing. There is no closing response, and the dialog flow does not continue to the next prompt block, even when I remove the closing response. The call doesn't even hang up; it just keeps the user on the line and is unresponsive. The Lambda fulfillment code runs all the way through with no errors.
I suspect the issue is occurring somewhere between fulfilling the intent and sending the closing response. I don't know what could be holding up the bot at this point. I have a closing message I want to play for the user before I disconnect/terminate the call but the dialog flow is never getting to this point for reasons I cannot seem to find.
Any help is appreciated even just a point in the right direction. I have been looking through docs but haven't found anything regarding this specific issue. Thanks!

GCP PubSub mysteriously/silently failing with Cloud Functions

I have about a dozen or so GCF functions (Python) which run in series, once a day. In order to keep the correct sequence, I use PubSub. So for example:
topic1 triggers function1 -> function1 runs -> function1 writes a message to topic2 -> topic2 triggers function2 -> function2 runs -> etc.
This use case is low throughput and a very straightforward (I thought) way to use GCF and PubSub together to each other's advantage. The functions use pubsub_v1 in Python to publish messages. There are no problems with IAM, permissions, etc. The code looks like:
from google.cloud import pubsub_v1
# Publish message
publisher = pubsub_v1.PublisherClient()
topic2 = publisher.topic_path('my-project-name', 'topic2_id')
publish_message = '{short json message to be published}'
print('sending message ' + publish_message)
publisher.publish(topic2, publish_message.encode("utf-8"))
And I deploy function1 and other functions using:
gcloud functions deploy function1 --entry-point=my_python_function --runtime=python37 \
--trigger-topic=topic1 --memory=4096MB --region=us-central1 \
--source="url://source-repository-with-my-code"
However, recently I have started to see some really weird behaviour. Basically, function1 runs, the logs look great, message has seemingly been published to topic2...then nothing. function2 doesn't begin execution or show anything in the logs to suggest it's been triggered. No logs suggesting either success or failure. So essentially it seems that either:
the message from function1 to topic2 is not getting published, despite function1 finishing with Function execution took 24425 ms, finished with status: 'ok'
the message from function1 to topic2 is getting published, but topic2 is not triggering function2.
Is this expected behaviour for PubSub? These failures seem completely random. I went months with everything working very reliably, and now suddenly I have no idea whether the messages are going to be delivered or not. It also seems really difficult to track the lifespan of these PubSub messages to see where exactly they're going missing. I've read in the docs about dead letter topics etc, but I don't really understand how to set up something that makes it easy to track.
Is it normal for very low frequency, short messages to "fail" to be delivered?
Is there something I'm missing or something I should be doing, e.g. in the publisher.publish() call to ensure more reliable delivery?
Is there a transparent way to see what's going on and see where these messages are going missing? Setting up a new subscription which I can view in the console and see which messages are being delivered and which are failing, something like that?
If I need 100% (or close to that) reliability, should I be ditching GCF and PubSub? What's better?
The issue here is that you aren't waiting for publisher.publish to actually succeed. This method returns a future and may not complete synchronously. If you want to ensure the publish has completed successfully, you need to call result() on the value returned from publish:
future = publisher.publish(topic2, publish_message.encode("utf-8"))
future.result()
You will also want to ensure that you have "Retry on failure" enabled on your cloud function by passing the --retry argument to gcloud functions deploy. That way, if the publish fails, the message from topic1 will be redelivered to the cloud function to be tried again.
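As a minimal sketch of the advice above: the helper below blocks until the publish future resolves and lets any failure propagate, so that a function deployed with --retry would be tried again. The helper name publish_and_wait and the 30-second default timeout are assumptions for illustration; publisher is expected to be a pubsub_v1.PublisherClient as in the question.

```python
def publish_and_wait(publisher, topic_path, message, timeout=30.0):
    """Publish `message` and block until the publish has completed.

    `publisher` is assumed to be a pubsub_v1.PublisherClient (or any object
    whose publish() returns a future). The helper name and the default
    timeout are illustrative, not part of the Pub/Sub API.
    """
    future = publisher.publish(topic_path, message.encode("utf-8"))
    # result() returns the message ID on success. It raises if the publish
    # failed or did not finish within `timeout` seconds; the exception is
    # deliberately not caught so the function invocation fails visibly.
    message_id = future.result(timeout=timeout)
    print("published message " + str(message_id) + " to " + topic_path)
    return message_id
```

Logging the returned message ID in function1 also gives you something concrete to look for when checking whether function2 was ever triggered.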

AWS Lambda: is there a way that I can watch live log printed by a function while it is executing

I am new to AWS. I have just developed a Lambda function (Python) which prints messages while executing. However, I am not sure where I can watch the log output while the function is executing.
I found the CloudWatch log for the function, but it seems that the log is only available after the function has completed.
Hope you can help,
many thanks
You are correct -- the print() messages will be available in CloudWatch Logs.
It is possible that a long-running function might show logs before it has completed (I haven't tried that), but AWS Lambda functions only run for a maximum of 15 minutes and most complete in under one second. It is not expected that you would need to view logs while a function is running.
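If you do want to check for output while a longer function is still running, one option is to poll CloudWatch Logs yourself. This is a minimal sketch, assuming logs_client is a boto3.client("logs") and that the function writes to the default /aws/lambda/<function-name> log group; the helper name fetch_new_log_events is hypothetical. Events may only show up after some delay, and only the first page of results is read.

```python
def fetch_new_log_events(logs_client, function_name, start_ms):
    """Print and return log events with a timestamp >= start_ms.

    `logs_client` is assumed to be boto3.client("logs"). Returns a tuple
    (events, next_start_ms); pass next_start_ms to the next call so that
    events already seen are not fetched again.
    """
    # Default log group that Lambda writes to for a function.
    log_group = "/aws/lambda/" + function_name
    # Only the first page is read here; a fuller version would follow
    # the nextToken field of the response.
    response = logs_client.filter_log_events(
        logGroupName=log_group,
        startTime=start_ms,
    )
    events = response.get("events", [])
    next_start_ms = start_ms
    for event in events:
        print(event["message"].rstrip())
        # Timestamps are in milliseconds since the epoch.
        next_start_ms = max(next_start_ms, event["timestamp"] + 1)
    return events, next_start_ms
```

Calling it in a loop with a short sleep, feeding next_start_ms back in as start_ms, approximates a live tail.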

AWS Lambda triggered twice for a single SQS Message

I have a system where a Lambda is triggered with an SQS queue as its event source. Each message gets our own internal unique id to differentiate between two requests.
Lambda keeps the message in flight while processing it and deletes it from the queue automatically after a successful invocation, so ideally duplicate processing of a unique message should never occur.
But when I checked my logs, a message with the same unique id was processed twice, within 100 milliseconds of each other.
So it seems like two Lambdas were triggered for one message and something failed on the AWS end; it was either the visibility timeout or something else. I have read online that a few others have gone through the same situation.
Can anyone who has gone through the same situation explain how they solved it? Or can people with scalable systems who don't have this kind of issue help me out with the reasons why I could be having it?
Note: One single message was successfully executed twice; this wasn't a case of retry on failure.
I faced a similar issue, where a Lambda (let's call it lambda-1) is triggered through a queue and further invokes lambda-2 synchronously (https://docs.aws.amazon.com/lambda/latest/dg/invocation-sync.html). The message basically goes in flight, returns to the queue when the visibility timeout expires, and triggers lambda-1 again. This goes on in a loop.
As per the link above:
"For functions with a long timeout, your client might be disconnected
during synchronous invocation while it waits for a response. Configure
your HTTP client, SDK, firewall, proxy, or operating system to allow
for long connections with timeout or keep-alive settings."
Making asynchronous calls in lambda-1 can resolve this issue. In the case above, invoking lambda-2 with InvocationType='Event' returns immediately, so lambda-1 can finish and the item is deleted from the queue.
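A minimal sketch of that asynchronous call, assuming lambda_client is a boto3.client("lambda"); the helper name invoke_async is hypothetical, and the function name and payload are whatever lambda-1 would pass to lambda-2.

```python
import json


def invoke_async(lambda_client, function_name, payload):
    """Queue an asynchronous invocation and return without waiting for it.

    `lambda_client` is assumed to be boto3.client("lambda"). The helper
    name is illustrative only.
    """
    response = lambda_client.invoke(
        FunctionName=function_name,
        # "Event" asks Lambda to queue the event and return straight away,
        # instead of holding the connection open until the target finishes.
        InvocationType="Event",
        Payload=json.dumps(payload).encode("utf-8"),
    )
    # For an "Event" invocation, status 202 means the event was accepted.
    if response["StatusCode"] != 202:
        raise RuntimeError("async invoke of " + function_name + " was not accepted")
    return response["StatusCode"]
```

Note that with this invocation type lambda-1 no longer sees lambda-2's result or errors, so lambda-2 needs its own error handling.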

"LAMBDA_RUNTIME" Error on high-volume Lambda Function

I'm currently using a Lambda function written in JavaScript that is set up with an SQS event source to automatically pull messages from an SQS queue and do some basic processing on the message contents. I cannot show the code, but the summary of the Lambda function's execution is basically:
For each message in the batch it receives as part of the event:
It parses the body, which is a JSON string, into a JavaScript object.
It reads an object from S3 that is listed in the parsed message using getObject.
It puts a record into a DynamoDB table using put.
If there were no errors, it deletes the individual SQS message that was processed from the Queue using deleteMessage.
This SQS queue is high-volume and receives messages in bulk, regularly building up a backlog of millions of messages. The Lambda is normally able to scale to process hundreds of thousands of messages concurrently. This solution has worked well for me with other applications in the past, but I'm now encountering the following intermittent error that reliably begins to appear as the Lambda scales up:
[ERROR] [#############] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 400.
I've been unable to find any information anywhere about what this error means and what causes it. There appears to be no discernible pattern as to which executions encounter it. The function is usually able to run for a brief period without encountering the error and scale to expected levels. But then the error starts to appear quite suddenly and completely destroys the Lambda throughput by forcing it to auto-scale down.
Does anyone know what this "LAMBDA_RUNTIME" error means and what might cause it? My Lambda Function runtime is Node v12.
Your function is being invoked asynchronously, so when it finishes it signals the caller whether it was successful.
You should have an error some milliseconds earlier, probably an unhandled exception that is not being logged. If that's the case, your function ends without knowing about the exception and tries to post a success response.
I have the same error, except that I get a 413 response code:
[ERROR] [1638918279694] LAMBDA_RUNTIME Failed to post handler success response. Http response code: 413.
I went to the Lambda function in the AWS console and ran the test with a custom event I built, and the error I got there was:
{
"errorMessage": "Response payload size exceeded maximum allowed payload size (6291556 bytes).",
"errorType": "Function.ResponseSizeTooLarge"
}
So this is the actual error, which CloudWatch doesn't return but the testing section of the Lambda function console does.
I think I'll have to return info to an S3 file or something, but that's another matter.
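A rough sketch of that S3 idea, assuming s3_client is a boto3.client("s3"). The helper name build_response, the bucket and key arguments, and the shape of the returned dictionary are all hypothetical, and the 5 MB threshold is just a conservative value chosen to stay under the roughly 6 MB limit reported in the error above.

```python
import json

# Conservative threshold (an assumption), kept below the ~6 MB response
# limit reported in the error message to leave room for the wrapper fields.
SAFE_RESPONSE_BYTES = 5 * 1024 * 1024


def build_response(result, s3_client, bucket, key, limit=SAFE_RESPONSE_BYTES):
    """Return `result` inline if it is small enough, else a pointer to S3.

    `s3_client` is assumed to be boto3.client("s3"); `bucket` and `key`
    are placeholders chosen by the caller.
    """
    body = json.dumps(result).encode("utf-8")
    if len(body) <= limit:
        return {"inline": True, "result": result}
    # Too large to return directly: store the payload in S3 and hand the
    # caller the location instead.
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return {"inline": False, "bucket": bucket, "key": key}
```

The caller then has to check the inline flag and fetch the object from S3 when the payload was offloaded.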