OpenTelemetry Lambda Layer - amazon-web-services

Is there any way to lessen the Lambda Layer dropped events? It keeps on dropping the traces before they reached the central collector. Before it exports the traces, it will then fetch the token to make an authorized sending of traces to the central collector. But it does not push the traces as it is being dropped because the lambda function execution is already done.
Lambda Extension Layer Reference: https://github.com/open-telemetry/opentelemetry-lambda/tree/main/collector
Exporter Error:
Exporting failed. No more retries left. Dropping data.
{
"kind": "exporter",
"data_type": "traces",
"name": "otlp",
"error": "max elapsed time expired rpc error: code = DeadlineExceeded desc = context deadline exceeded",
"dropped_items": 8
}

I encountered the same problem and did some research.
Unfortunately, it is a known issue that has not been resolved yet in the latest version of AWS Distro for OpenTelemetry Lambda (ADOT Lambda)
Github issue tickets:
Spans being dropped github.com/aws-observability/aws-otel-lambda issue #229
ADOT collector sends response to Runtime API instead of waiting for sending traces github.com/open-telemetry/opentelemetry-lambda issue #224
The short answer: currently the otel collector extension does not work reliably as it gets frozen by the lamda environment while it is still sending data to the exporters. As a workaround, you can send the traces directly to a collector running outside the lambda container.
The problem is:
the lambda sends the traces to the collector extension process during its execution
the collector queues them for sending them on to the configured exporters
the collector extension does not wait for the collector to finish processing its queue before telling the lambda environment that the extension is done; instead it always immediately tells the environment immediately that it's done, without looking at what the collector is doing
when the lambda is done, the extension is already done, so the lambda container is frozen until the next lambda invocation.
the container is thawed when the next lambda invocation arrives. if the next invocation comes soon and takes long enough, the collector may be able to finish sending the traces to the exporters. if not, the connection to the backend system times out before sending is complete.
What complicates the solution is that it is very hard for an extension to detect whether the main lambda has finished processing.
Ideally, a telemetry extension would:
Wait for the lambda to finish processing
Check if the lambda sent it any data to process and forward
Wait for all processing and forwarding to complete (if any)
Signal to the lambda environment that the extension is done
The lambda extension protocol doesn't tell the extension when the main lambda has finished processing (it would be great if AWS could add that to the extension protocol as a new event type).
There is a proposed PR that tries to work around this by assuming that lambdas always send traces, so instead of waiting for the lambda to complete, it waits for a TCP request to the OTLP receiver to arrive. This works, but it makes the extension hang forever if the lambda never sends any traces.
Note: the same problem that we see here for traces also exists for metrics.

Related

AWS lambda: Execute function on timeout

I am developing a lambda function that migrates logs from an SFTP server to an S3 bucket.
Due to the size of the logs, the function sometimes is timing out - even though I have set the maximum timeout of 15 minutes.
try:
logger.info(f'Migrating {log_name}... ')
transfer_to_s3(log_name, sftp)
logger.info(f'{log_name} was migrated succesfully ')
If transfer_to_s3() fails due to timeoutlogger.info(f'{log_name} was migrated succesfully') line won't be executed.
I want to ensure that in this scenario, I will somehow know that a log was not migrated due to timeout.
Is there a way to force lambda to perform an action, before exiting, in the case of a timeout?
Probably a better way would be to use SQS for that:
Logo info ---> SQS queue ---> Lambda function
If lambda successful moves the files, it removes the log info from SQS queue. If it fails, the log info persists in the SQS queue (or goes to DLQ for special handling), so the next lambda invocation can handle it.

Is it necessary for a Lambda to delete messages from an SQS queue after processing?

I'm looking at the the AWS SQS documentation here: https://docs.aws.amazon.com/sdk-for-net/v3/developer-guide/ReceiveMessage.html#receive-sqs-message
My understanding is that we need to delete the message using AmazonSQSClient.DeleteMessage() once we're done processing it, but is this necessary when we're working with an SQS triggered Lambda?
I'm testing with a Lambda function that's triggered by an SQSEvent, and unless I'm mistaken, it appears that if the Lambda function runs to completion without throwing any errors, the message does NOT return to the SQS queue. If this is true, the I would rather avoid making that unnecessary call to AmazonSQSClient.DeleteMessage().
Here is a similar question from 2019 with the top answer saying that the SDK does not delete messages automatically and that they need to be explicitly deleted within the code. I'm wondering if anything has changed since then.
Thoughts?
The key here is that you are using the AWS Lambda integration with SQS. In that instance AWS Lambda handles retrieving the messages from the queue (making them available via the event object), and automatically deletes the message from the queue for you if the Lambda function returns a success status. It will not delete the message from the queue if the Lambda function throws an error.
When using AWS Lambda integration with SQS you should not be using the AWS SDK to interact with the SQS queue at all.
Update:
Lambda now supports partial batch failure for SQS whereby the Lambda function can return a list of failed messages and only those will become visible again.
Amazon SQS doesn't automatically delete a message after retrieving it for you, in case you don't successfully receive the message (for example, if the consumers fail or you lose connectivity). To delete a message, you must send a separate request which acknowledges that you've successfully received and processed the message.
This has not changed and likely won’t change in the future as there us no way for SQS to definitively know in all cases if messages have successfully been processed. If SQS started to “assume” what happens downstream it risk becoming unreliable in many scenarios.
Yes, otherwise the next time you ask for a set of messages, you will get the same messages back - maybe not on the next call, but eventually you will. You likely don't want to keep processing the same set of messages over and over.

AWS Lambda triggered twice for a sigle SQS Message

I have a system where a Lambda is triggered with event source as an SQS Queue.Each message gets our own internal unique id to differentiate between two requests .
Now lambda deletes the message from the queue automatically after sqs invocation and keeps the message in inflight while processing it so duplicate processing of a unique message should never occur ideally.
But when I checked my logs a message with the same unique id was processed within 100 milliseconds of the time frame of each other.
So This seems like two lambdas were triggered for one message and something failed at the end of aws it was either visibility timeout or something else.I have read online that few others have gone through the same situation.
Can anyone who has gone through the same situation explain how did they solve it or people with current scalable systems who don't have this kind of issue can help me out with the reasons why I could be having it ?
Note:- One single message was successfully executed Twice this wasn't the case of retry on failure.
I faced a similar issue, where a lambda (let's call it lambda-1) is triggered through a queue, and lambda-1 further invokes lambda-2 'synchronously' (https://docs.aws.amazon.com/lambda/latest/dg/invocation-sync.html) and the message basically goes to inflight and return back after visibility timeout expiry and triggers lambda-1 again. This goes on in a loop.
As per the link above:
"For functions with a long timeout, your client might be disconnected
during synchronous invocation while it waits for a response. Configure
your HTTP client, SDK, firewall, proxy, or operating system to allow
for long connections with timeout or keep-alive settings."
Making async calls in lambda-1 can resolve this issue. In the case above, invoking lambda-2 with InvocationType='Event' returns back, which in-turn deletes the item from queue.

SNS > AWS Lambda asyncronous invocation queue vs. SNS > SQS > Lambda

Background
This archhitecture relies solely on Lambda's asyncronous invocation mechanism as described here:
https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html
I have a collector function that is invoked once a minute and fetches a batch of data in that can vary drastically in size (tens of of KB to potentially 1-3MB). The data contains a JSON array containing one-to-many records. The collector function segregates these records and publishes them individually to an SNS topic.
A parser function is subribed the SNS topic and has a concurrency limit of 3. SNS asynchronously invokes the parser function per record, meaning that the built-in AWS managed Lambda asyncronous queue begins to fill up as the instances of the parser maxes out at 3. The Lambda queueing mechanism initiates retries at incremental backups when throttling occurs, until the invocation request can be processed by the parser function.
It is imperitive that a record does not get lost during this process as they can not be resurrected. I will be using dead letter queues where needed to ensure they ultimately end up somewhere in case of error.
Testing this method out resulted in no lost invocation. Everything worked as expected. Lambda reported hundreds of throttle responses but I'm relying on this to initiate the Lambda retry behaviour for async invocations. My understanding is that this behaivour is effectively the same as that which I'd have to develop and initiate myself if I wanted to retry consuming a message coming from SQS.
Questions
1. Is the built-in AWS managed Lambda asyncronous queue reliable?
The parser could be subject to a consistent load of 200+ invocations per minute for prelonged periods so I want to understand whether the Lambda queue can handle this as sensibly as an SQS service. The main part that concerns me is this statement:
Even if your function doesn't return an error, it's possible for it to receive the same event from Lambda multiple times because the queue itself is eventually consistent. If the function can't keep up with incoming events, events might also be deleted from the queue without being sent to the function. Ensure that your function code gracefully handles duplicate events, and that you have enough concurrency available to handle all invocations.
This implies that an incoming invocation may just be deleted out of thin air. Also in my implementation I'm relying on the retry behaviour when a function throttles.
2. When a message is in the queue, what happens when the message timeout is exceeded?
I can't find a difinitive answer but I'm hoping the message would end up in the configured dead letter queue.
3. Why would I use SQS over the Lambda queue when SQS presents other problems?
See the articles below for arguments against SQS. Overpulling (described in the second link) is of particular concern:
https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/
https://medium.com/#zaccharles/lambda-concurrency-limits-and-sqs-triggers-dont-mix-well-sometimes-eb23d90122e0
I can't find any articles or discussions of how the Lambda queue performs.
Thanks for reading!
Quite an interesting question. There's a presentation that covered queues in detail. I can't find it at the moment. The premise is the same as this queues are leaky buckets
So what if I add more Leaky Buckets. We'll you've delayed the leaking, however it's now leaking into another bucket. Have you solved the problem or delayed it?
What if I vibrate the buckets at different frequencies?
Further reading:
operate lambda
message expiry
message timeout
DDIA / DDIA Online
SQS Performance
sqs failure modes
mvce is missing from this question so I cannot address the the precise problem you are having.
As for an opinion on which to choose for SQS and Lambda queue I'll point to the Meta on this
sqs faq mentions Kinesis streams
sqs sns kinesis comparison
TL;DR;
It depends
I think the biggest advantage of using your own queue is the fact that you as a user have visibility into the state of the your backpressure.
Using the Lambda async invoke method, you have the potential to get throttled exceptions with the 'guarantee' that lambda will retry over an interval. If using a SQS source queue instead, you have complete visibility into the state of your message processing at all times with no ambiguity.
Secondly regarding overpulling. In theory this is a concern but in practice its never happened to me. I've run applications requiring thousands of transactions per second and never once had problems with SQS -> Lambda. Obviously set your retry policy appropriately and use a DLQ as transient/unpredictable errors CAN occur.

How can I know if SQS has a new message?

I am writing an application that use lambda function that send request to a spring boot application which will call other service. I have to use sqs (required). So sqs is between lambda and spring. The question is how do my spring application know if there is new message in sqs.
I heard about long pooling, but I don't know if this is what I need.
Do I need to set a loop that open the long pooling forever or something?
Is it efficient? I mean if there are 10 message in sqs, The connection will be opened ten times?
I aslo find using while loop here: Check for an incoming message in aws sqs
Thanks
The answer you linked is accurate.
You must write a program that polls SQS for a message (or up to 10 messages). It is more efficient to use long polling because you require less calls.
If you wish to know about a message very quickly, then you will need to poll continually. That is, as soon as it comes back and says "nothing to receive", you should call it again. To reduce the frequency of these calls, you can set long polling, up to a maximum of 20 seconds. This means that, if there are no messages in the queue, the ReceiveMessages() option will take 20 seconds before it returns a response of "no messages". If, however, a message arrives in the meantime, it will respond immediately. The long polling option is specified when making the ReceiveMessages() request.
If you do not require instant notification, your application could call less often (eg every minute, or every few minutes). This would involve less calls to Amazon SQS.
When making the ReceiveMessages() call, your application can request up to 10 messages. This means that multiple messages might be returned.
Once your application has finished processing a message, it must call DeleteMessage() to have the message removed from the queue. This is a failsafe that will automatically put the message back on the queue if there is a problem with the application and the message doesn't get correctly processed.
This is a great video from the AWS re:Invent conference that explains Amazon SQS (and Amazon SNS) in detail: AWS re:Invent SVC 105: AWS Messaging