Terrible performance spikes updating AWS S3 files from Lambda - amazon-web-services

I am calling S3.putObject from inside a JavaScript Lambda to update small JSON files of 2-10 KB.
Usual performance is fine - typically well under 100 ms. However, I get occasional worrying spikes of 10-20 seconds, and I have had a few spikes in the range of minutes. The worst so far is 587 seconds!
Has anyone ever seen anything similar?
It looks like the problem is not specific to S3 - it is caused by network glitches plus the fact that the default SDK HTTP timeout is 120 seconds. The fix is to set maxRetries and httpOptions as shown below.
import aws from "aws-sdk";

aws.config.logger = console;

const config = {
  maxRetries: 5,
  httpOptions: { timeout: 4 * 1000, connectTimeout: 2 * 1000 }
};

const S3 = new aws.S3(config);
Also note that setting the logger causes timings and retries to be written to the CloudWatch log.
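For completeness, here is a minimal sketch of how the configured client might be used from the Lambda handler - the bucket name and key layout are placeholders, not from the original post:

export const handler = async (event) => {
  // With the config above, a hung connection is abandoned after ~4 s and
  // retried up to 5 times instead of stalling for the default 120 s.
  await S3.putObject({
    Bucket: "my-bucket",                 // hypothetical bucket
    Key: `items/${event.id}.json`,       // hypothetical key layout
    Body: JSON.stringify(event.payload),
    ContentType: "application/json"
  }).promise();
  return { status: "ok" };
};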

Related

AWS S3 Client throws "Operator called default onErrorDropped\njava.util.concurrent.CompletionException"

I am using the below AWS S3 AsyncClient from my Spring WebFlux application. As part of performance testing I had to introduce an eventLoopGroup in the S3 client config below to keep S3 operations working when the request rate was really high: 50 concurrent users, each triggering 20 requests, each request uploading 5 documents to S3. Before this I was getting the error below:
reactor.core.publisher.Operators : Operator called default onErrorDropped\njava.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.\nConsider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.\nIncreasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process
Introducing an eventLoopGroup with numberOfThreads set to 15 drastically reduced the error count from around 1000 to 6. However, once it fails, subsequent requests are not let through and fail with the same error. I was assuming that after some time the used threads would be returned to the thread pool and become available again, but I have to restart the Spring WebFlux application to get rid of the errors. Kindly let me know whether the changes I have made below are correct.
@Bean
public S3AsyncClient s3client() {
    SdkAsyncHttpClient httpClient =
        NettyNioAsyncHttpClient.builder()
            .readTimeout(Duration.ofSeconds(30))
            .writeTimeout(Duration.ZERO)
            .maxConcurrency(300)
            .eventLoopGroup(SdkEventLoopGroup.builder().numberOfThreads(15).build())
            .connectionAcquisitionTimeout(Duration.ofSeconds(30))
            .build();

    S3Configuration serviceConfiguration =
        S3Configuration.builder()
            .checksumValidationEnabled(false)
            .chunkedEncodingEnabled(true)
            .build();

    S3AsyncClientBuilder s3AsyncClientBuilder =
        S3AsyncClient.builder()
            .httpClient(httpClient)
            .region(Region.AP_SOUTHEAST_2)
            .serviceConfiguration(serviceConfiguration);

    return s3AsyncClientBuilder.build();
}

AWS SQS Polling Issue

I have encountered a weird SQS situation for which I can't find a satisfying answer.
I created a delay queue that should delay (what a surprise) incoming events for 4 seconds, after which they should be processed by a Lambda. Order is not an issue here.
The issue, though, is that the "approximate age of the oldest message" metric (stat: Max) sometimes reaches over 1 minute, which is weird since there aren't that many messages, as you can see in the picture. My expectation would be that the event gets processed immediately after the 4-second delay.
The reserved concurrency of that Lambda is 50, so the SQS poller should have no problem invoking more Lambda instances if there is too much traffic. But traffic isn't really a problem, as you can see.
The queue is configured like this:
Default visibility timeout: 120 sec
Delivery delay: 4 sec
Dead-letter queue: No (it is only one event generated by AWS, so no bad pills)
Message retention period: 4 days
The Lambda config:
Batch size: 5 (tried also 1 or 10 - not much of a difference for the mentioned metric)
Batch window: None
Reserved concurrency: 50
Timeout: 20 secs
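For reference, a rough sketch of how the same queue and event source configuration could be expressed with the JavaScript SDK - the queue name, function name and ARN are placeholders, not taken from the original setup:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS({apiVersion: '2012-11-05'});
const lambda = new AWS.Lambda();

async function createDelayQueueSetup() {
  // Queue with a 4 s delivery delay, 120 s visibility timeout and 4-day retention.
  const { QueueUrl } = await sqs.createQueue({
    QueueName: 'my-delay-queue',                  // hypothetical name
    Attributes: {
      DelaySeconds: '4',
      VisibilityTimeout: '120',
      MessageRetentionPeriod: String(4 * 24 * 60 * 60)
    }
  }).promise();

  // Reserved concurrency of 50 on the consuming Lambda.
  await lambda.putFunctionConcurrency({
    FunctionName: 'my-consumer',                  // hypothetical name
    ReservedConcurrentExecutions: 50
  }).promise();

  // Event source mapping with batch size 5 and no batch window.
  await lambda.createEventSourceMapping({
    EventSourceArn: 'arn:aws:sqs:eu-central-1:123456789012:my-delay-queue', // placeholder ARN
    FunctionName: 'my-consumer',
    BatchSize: 5
  }).promise();

  return QueueUrl;
}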
I can't explain the reason for those old messages (ApproximateAgeOfOldestMessage). Any help would be highly appreciated
Best
Patrick
I contacted AWS Support. Apparently it is a bug on the AWS side.
Response from AWS Support:
I have just received an update from the backend service team and the team has confirmed that they have identified an issue of unexpected spikes in "ApproximateAgeOfOldestMessage" metrics that triggers when messages are sent to SQS with a configured delay. This issue's root cause is that our internal system uses recently processed delayed messages to calculate the "ApproximateAgeOfOldestMessage," which results in a higher than the actual value for "ApproximateAgeOfOldestMessage" metrics. They have now identified a fix for this issue and will start deploying the fix soon. After this update, when messages are sent to Amazon SQS with a configured delay, you may see the "ApproximateAgeOfOldestMessage" metrics value come down for the queues to the accurate value.
So if you encounter the same problem, you will have to wait for the mentioned fix. Hope it comes soon.

AWS API Gateway Slowness

This has me perturbed. I have a basic API Gateway that is supposed to be capped at 10,000 requests per second with 5,000-request bursts. However, when hooked up to Lambdas, the best I can hit currently is ~70 requests/second.
The endpoints I have are basic Lambda proxies created with the Serverless Framework (HTTP EDGE).
I know that the Lambda itself is not the bottleneck, as I have the same issue when I replace the Lambda with an empty function. I have 100+ allocated concurrency for the Lambda, but it never appears to hit the limit.
functions:
  loadtest:
    handler: loadtest/index.handler
    reservedConcurrency: 200
    events:
      - http: POST load_test
I'm wondering if there's something that I am overlooking here. My test runs for a minute and attempts to hit 200 req/sec (it works fine with others, so it's not my bandwidth). The delays grow to as much as 20-30 s at some point, so clearly something is choking up.
If it's a warm up issue - how long am I expected to run such load until everything is running warm?
Any ideas on where to look or additional information that I could share?
[Edit] I am using node12.x and I even tried with this code:
const AWS = require('aws-sdk');
AWS.config.update({region: '<my-region>'});
var sqs = new AWS.SQS({apiVersion: '2012-11-05'});

exports.handler = async (event, context) => {
  return {"status": "ok", ... };
};
The results were basically the same. I'm not sure where the bottleneck is, to be honest. I can try further testing with concurrency on the Lambda side, but going from 100 to 200 had no effect - the completed requests clock in at around 70/s for an empty function.
Also, I'm using the loadtest npm package to perform the load test, and this is what the output looks like:
{ totalRequests: 8200,
  totalErrors: 0,
  totalTimeSeconds: 120.00341689999999,
  rps: 68,
  meanLatencyMs: 39080.6,
  maxLatencyMs: 78490,
  minLatencyMs: 427,
  percentiles: { '50': 38327, '90': 70684, '95': 74569, '99': 77679 },
  errorCodes: {},
  instanceIndex: 0 }
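For context, output like the above can be produced with the loadtest package's programmatic API along these lines - the URL, body and concurrency value here are assumptions, since the exact invocation is not shown in the post:

const loadtest = require('loadtest');

const options = {
  url: 'https://<api-id>.execute-api.<region>.amazonaws.com/dev/load_test', // placeholder URL
  method: 'POST',
  body: {},
  contentType: 'application/json',
  maxSeconds: 120,          // ~2-minute run, matching the output above
  requestsPerSecond: 200,   // target rate from the question
  concurrency: 20           // assumed value, not stated in the post
};

loadtest.loadTest(options, (error, result) => {
  if (error) {
    return console.error('Load test failed:', error);
  }
  // result contains totalRequests, rps, meanLatencyMs, percentiles, ...
  console.log(result);
});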
Here's a picture of how provisioned concurrency looked over that period of time. I ran this for 2 minutes with the target at 200 req/sec.
It appears that this is actually an issue with WSL2 and Node.js. The exact nature of it is still unclear, but it is not an issue with API Gateway itself. I demonstrated this by running the test on a MacBook: everything worked fine and request counts were high.
There are posts suggesting that this is an issue with the Node HTTP client & DNS, so perhaps that's a good starting point, but the above question is moot.

Google Cloud Function Timeout Setting doesn't work

I can't get a Google Cloud Function to run for more than 60 secs, even when the timeout is set to 540 secs! Any suggestions?
I set the timeout flag on deployment to --timeout=540, and I know the setting goes through, because the 540 sec timeout setting appears in the GCP web UI. I have also tried to manually edit the timeout to 540 through the GCP web UI. But in any case I still get the DEADLINE_EXCEEDED after just ~62000 ms.
I have tried both the pub/sub and HTTPS methods as the function trigger, but I still get the premature function timeout at ~60 s.
I'm running the latest CLI, with these function settings:
trigger: http/pubsub (both tested, same result)
availableMemoryMb: 2048
runtime: nodejs6
status: ACTIVE
timeout: 540s
Thanks for any inputs!
Br Markus
I have used the documentation code for a delay and executed a Cloud Function with the same specifications as yours. In the documentation, the execution is delayed by 120000 ms (2 mins). I edited that and set it to 500000 ms. This, plus the normal time the Cloud Function takes to execute, reaches the desired execution time (around 9 minutes). If you set 540000 to test the code, it will finish with a timeout error at ~540025 ms, because that value itself exceeds the configured timeout of the Cloud Function, which is also the default maximum timeout limit of a Cloud Function: 9 minutes.
I also tried creating the function using this command:
gcloud functions deploy [FUNCTION_NAME] --trigger-http --timeout=540
After successful deployment, I updated the code manually in the GCP Cloud Functions UI as follows:
exports.timeoutTest = (req, res) => {
  setTimeout(() => {
    let message = req.query.message || req.body.message || 'Hello World today!';
    res.status(200).send(message);
    res.end();
  }, 500000);
};
Both times the Cloud Function was executed and returned with status code 200. This means that you can set the timeout to more than 60 secs, which is the default value.
If you have revised everything correctly and you still have this issue, I recommend you start afresh: create a new Cloud Function and use the documentation link I provided.
The 60-second timeout is not coming from the GCP Cloud Function setting. For instance, if this is a Django/Gunicorn app, the timeout is coming from the gunicorn timeout set in app.yaml:
entrypoint: gunicorn -t 3600 -b :$PORT project_name.wsgi
This will set a timeout of 3600 seconds for gunicorn.
I believe I'm some years late but here is my suggestion.
If you're using the "Test the function" button in the "Testing tab" of the Cloud Function (in the GCP Cloud Console), it says right next to the button that:
Testing in the Cloud Console has a 60s timeout. Note that this is different from the limit set in the function configuration.
I hope you fixed it and this answer can help someone in the future.
Update: Second try ("Test the function") was precisely 9 minutes
From: 23:15:38
Till: 23:24:38
And it is exactly the 9 minutes, although the message again only mentioned 60 seconds and popped up much earlier than the actual stop.
Function execution took 540004 ms, finished with status: 'timeout'
This time, with a lot of memory (2 GB), the timeout clearly made it stop. The message is perhaps just popping up earlier since it has not been programmed in detail - my guess. You should always look at the logs to see what is happening.
I guess the core of your question is outdated then: at least as of 01/2022, you do get the demanded timeout regardless of what you may read, and you should just not care about the messages.
First try ("Test the function") 8 minutes after reached memory limit
A screenshot of how it looks like in 2022/01 if you get over the 60 seconds (with 540s maximum timeout for this example function set in the "Edit" menu of the CF):
Function being tested has exceeded the 60s timeout imposed by the Cloud Functions testing utility.
Yet in reality, when using just the "Testing tab", the timeout is at least 300 s / 5 minutes, which can be seen next to the "Test the function" button:
Testing in the Cloud Console has a 5 minute timeout. Note that this is different from the limit set in the function configuration.
But it is even more. I know from testing (started from the "Testing tab" --> "Test Function" in the Cloud Function) that you have at least 8 minutes:
From 22:31:43:
Till 22:39:53
And this was stopped first by the 256 MB limit and only secondly by time (it is a bit unclear why there were both messages).
Therefore, your question about why you get only a 60-second timeout might rather be why these messages are wrong (as in my case). Perhaps GCP did not make the effort to parametrize the messages for each function.
Perhaps you get even slightly more time when you start with gcloud from the terminal, but that is not so likely since 9 minutes is the maximum anyway.

Amazon API gateway timeout

I have an issue with API Gateway. I made a few API methods that sometimes take longer than 10 seconds to run, and Amazon returns a 504 error. Here is a screenshot below:
Please help! How can I increase the timeout?
Thanks!
Right now the default limit for Lambda invocation or HTTP integration is 30s according to http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html and this limit is not configurable.
As of Dec 2017, the maximum value is still 29 seconds, but you are now able to customize the timeout value.
https://aws.amazon.com/about-aws/whats-new/2017/11/customize-integration-timeouts-in-amazon-api-gateway/
This can be set in the "Integration Request" of each method in API Gateway.
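The same integration timeout can also be changed programmatically. Here is a minimal sketch using the JavaScript SDK's updateIntegration call; the REST API ID, resource ID and new timeout value are placeholders for illustration:

const AWS = require('aws-sdk');
const apigateway = new AWS.APIGateway();

// Lower the integration timeout to 5 seconds for a single method.
// restApiId and resourceId are hypothetical - look yours up with
// getRestApis()/getResources() or in the console.
apigateway.updateIntegration({
  restApiId: 'abc123def4',
  resourceId: 'xyz789',
  httpMethod: 'POST',
  patchOperations: [
    { op: 'replace', path: '/timeoutInMillis', value: '5000' } // maximum is 29000
  ]
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Integration timeout is now', data.timeoutInMillis, 'ms');
});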
Finally, in 2022 we have a workaround. Unfortunately AWS did not change API Gateway, so that is still 29 seconds, but you can use a built-in HTTPS endpoint on the Lambda itself: Built-in HTTPS Endpoints for Single-Function Microservices,
which is confirmed to have no such timeout - so essentially you can have the full 15-minute window of the Lambda timeout: https://twitter.com/alex_casalboni/status/1511973229740666883
For example, this is how you define a function with the HTTP endpoint using aws-cdk and TypeScript:
const backendApi = new lambda.Function(this, 'backend-api', {
  memorySize: 512,
  timeout: cdk.Duration.seconds(40),
  runtime: lambda.Runtime.NODEJS_16_X,
  architecture: Architecture.ARM_64,
  handler: 'lambda.handler',
  code: lambda.Code.fromAsset(path.join(__dirname, '../dist')),
  environment: {
    ...parsedDotenv
  }
});

backendApi.addFunctionUrl({
  authType: lambda.FunctionUrlAuthType.NONE,
  cors: {
    // Allow this to be called from websites on https://example.com.
    // Can also be ['*'] to allow all domains.
    allowedOrigins: ['*']
  }
});
You can't increase the timeout, at least not now. Your endpoints must complete in 10 seconds or less. You need to work on improving the speed of your endpoints.
http://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html
Lambda functions will time out after a maximum of 5 minutes; API Gateway requests will time out after 29 sec. You can't change that, but you can work around it with an asynchronous execution pattern. I wrote a blog post about it:
https://joarleymoraes.com/serverless-long-running-http-requests/
I wanted to comment on joarleymoraes' post but don't have enough reputation. The only thing to add is that you don't HAVE to refactor to use async; it just depends on your backend and how you can split it up, plus your client-side retries.
If you aren't seeing a high percentage of 504s and you aren't ready for async processing, you can implement client-side retries with exponential backoff so they aren't permanent failures.
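A minimal sketch of that client-side pattern, assuming a fetch-style HTTP client; the attempt count and delays below are illustrative, not from the original answer:

// Retry 5xx responses (including API Gateway 504s) with exponential backoff and jitter.
async function callWithRetries(url, options = {}, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const response = await fetch(url, options);
    // Success or a client error (4xx): return immediately, no retry.
    if (response.status < 500) {
      return response;
    }
    if (attempt === maxAttempts - 1) {
      break; // out of attempts
    }
    // Exponential backoff capped at 10 s, plus a little jitter.
    const backoffMs = Math.min(1000 * 2 ** attempt, 10000) + Math.random() * 100;
    await new Promise((resolve) => setTimeout(resolve, backoffMs));
  }
  throw new Error(`Request to ${url} still failing after ${maxAttempts} attempts`);
}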
The AWS SDK automatically implements retries with backoff, so it can make this easier, especially since Lambda Layers allow you to maintain the SDK for your functions without having to constantly update your deployment packages.
Once you do that, it will result in less visibility into those timeouts, since they are no longer permanent failures. This can buy you some time to deal with the core problem, which is that you are seeing 504s in the first place. That can certainly mean refactoring your code to be more responsive, splitting up large functions into more "microservice"-type concepts, and reducing external network calls.
The other benefit to retries is that if you retry all 5xx responses from an application, it can cover a lot of different issues which you might see during normal execution. It is generally considered in all applications that these issues are never 100% avoidable so it's best practice to go ahead and plan for the worst!
All of that being said, you should still work on reducing the lambda execution time or going async. This will allow you to set your timeout values to a much smaller number, which allows you to fail faster. This helps a lot for reducing the impact on the front end, since it doesn't have to wait 29 seconds to retry a failed request.
The timeout can be decreased but cannot be increased beyond 29 seconds. The backend of your method should return a response within 29 seconds, otherwise API Gateway will throw a 504 timeout error.
Alternatively, as suggested in some answers above, you can change the backend to send status code 202 (Accepted), meaning the request has been received successfully, and the backend then continues further processing. Of course, we need to consider the use case and its requirements before implementing this workaround.
Lambda functions have a maximum execution time of 15 minutes, but since API Gateway has a strict 29-second timeout policy, you can do the following things to overcome this.
For an immediate fix, try increasing your Lambda function's memory size. E.g., if your Lambda function has 128 MB of memory, you can increase it to 256 MB. More memory helps the function execute faster.
OR
You can use the Lambda invoke() function, which is part of the aws-sdk. With invoke(), instead of going through API Gateway you can call that function directly. But this is useful on the server side only.
OR
The best method to tackle this is: make the request to API Gateway -> inside the function, push the received data into an SQS queue -> immediately return the response -> have a Lambda function ready that triggers when data is available in this SQS queue -> inside this triggered function, do your actual time-consuming work -> save the result to a data store -> if the call comes from the client side (browser/mobile app), implement long-polling to get the final processed result from the same data store.
Now, since the API immediately returns the response after pushing the data to SQS, your main function's execution time will be much shorter, which resolves the API Gateway timeout issue.
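A minimal sketch of the front half of that pattern - the API-facing Lambda enqueueing the work and returning straight away. The queue URL environment variable and response shape are illustrative assumptions, not from the original answer:

const AWS = require('aws-sdk');
const sqs = new AWS.SQS();

// Hypothetical queue URL, injected via an environment variable.
const QUEUE_URL = process.env.WORK_QUEUE_URL;

// API Gateway-facing handler: enqueue the work and return immediately,
// well inside the 29-second integration timeout.
exports.handler = async (event) => {
  const jobId = `${Date.now()}-${Math.random().toString(16).slice(2)}`;

  await sqs.sendMessage({
    QueueUrl: QUEUE_URL,
    MessageBody: JSON.stringify({ jobId, payload: event.body })
  }).promise();

  // 202 Accepted: processing continues asynchronously in the SQS-triggered worker.
  return {
    statusCode: 202,
    body: JSON.stringify({ jobId, status: 'processing' })
  };
};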
There are other methods, like using WebSockets or writing event-driven code, but the methods above are much simpler to implement and manage.
While you cannot increase the timeout, you can link Lambdas together if the work is something that can be split up.
Using the aws-sdk:
var aws = require('aws-sdk');

var lambda = new aws.Lambda({
  region: 'us-west-2' // change to your region
});

lambda.invoke({
  FunctionName: 'name_of_your_lambda_function',
  Payload: JSON.stringify(event, null, 2) // pass params
}, function(error, data) {
  if (error) {
    context.done('error', error);
  } else if (data.Payload) {
    context.succeed(data.Payload);
  }
});
Source: Can an AWS Lambda function call another
AWS Documentation: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/Lambda.html
As of May 21, 2021, this is still the same. The hard limit for the maximum timeout is 30 seconds. Below is the official documentation on quotas for API Gateway.
https://docs.aws.amazon.com/apigateway/latest/developerguide/limits.html#http-api-quotas
The timeout limits cannot be increased, so a response should be returned within 30 seconds. The workarounds I usually use:
Send the result in an async way: the Lambda function should trigger another process and send a response to the client saying "Successfully started process X", and the other process should notify the client asynchronously once it finishes (hit an endpoint, send a Slack notification or an email, ...). You can find a lot of interesting resources on this topic.
Utilize the full potential of multiprocessing in your Lambda function and increase the memory for faster computing time.
Eventually, if you need to return the result in a sync way and one Lambda function cannot do the job, you could integrate API Gateway directly with Step Functions so you would have multiple Lambda functions working in parallel. It may seem complicated, but in fact it is quite simple.
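For the Step Functions option, the direct API Gateway service integration typically issues a StartExecution call against the state machine. For illustration, here is roughly what the equivalent call looks like with the JavaScript SDK; the state machine ARN and input are placeholders:

const AWS = require('aws-sdk');
const stepfunctions = new AWS.StepFunctions();

// Kick off a state machine that fans work out to several Lambdas in parallel.
stepfunctions.startExecution({
  stateMachineArn: 'arn:aws:states:us-east-1:123456789012:stateMachine:my-workflow', // placeholder ARN
  input: JSON.stringify({ requestId: 'abc-123', payload: {} })                       // placeholder input
}, (err, data) => {
  if (err) console.error(err);
  else console.log('Started execution:', data.executionArn);
});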
Custom timeout between 50 and 29,000 milliseconds for WebSocket APIs and between 50 and 30,000 milliseconds for HTTP APIs. The default timeout is 29 seconds for WebSocket APIs and 30 seconds for HTTP APIs.