AWS S3 Client throws "Operator called default onErrorDropped java.util.concurrent.CompletionException"

I am using the AWS S3 AsyncClient below from my Spring WebFlux application. As part of performance testing, I had to introduce an eventLoopGroup in the S3 client config below to keep S3 operations working when the request rate was very high: 50 concurrent users, each triggering 20 requests, with 5 documents per request to upload to S3. Before this change I was getting the error below:
reactor.core.publisher.Operators : Operator called default onErrorDropped
java.util.concurrent.CompletionException: software.amazon.awssdk.core.exception.SdkClientException: Unable to execute HTTP request: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process
Introducing an eventLoopGroup with numberOfThreads set to 15 drastically reduced the error count from around 1,000 to 6. However, once a request fails, subsequent requests are not let through and fail with the same error. I was assuming that after some time the used threads would be returned to the thread pool and become available again, but I have to restart the Spring WebFlux application to get rid of the errors. Kindly let me know whether the changes I have made below are correct.
@Bean
public S3AsyncClient s3client() {
    // Async HTTP transport: up to 300 pooled connections, 15 event-loop threads,
    // and 30 s to acquire a connection from the pool before failing.
    SdkAsyncHttpClient httpClient =
        NettyNioAsyncHttpClient.builder()
            .readTimeout(Duration.ofSeconds(30))
            .writeTimeout(Duration.ZERO) // no write timeout
            .maxConcurrency(300)
            .eventLoopGroup(SdkEventLoopGroup.builder().numberOfThreads(15).build())
            .connectionAcquisitionTimeout(Duration.ofSeconds(30))
            .build();
    S3Configuration serviceConfiguration =
        S3Configuration.builder()
            .checksumValidationEnabled(false)
            .chunkedEncodingEnabled(true)
            .build();
    return S3AsyncClient.builder()
        .httpClient(httpClient)
        .region(Region.AP_SOUTHEAST_2)
        .serviceConfiguration(serviceConfiguration)
        .build();
}
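For context on the onErrorDropped part of the error: Reactor typically logs this when a failure arrives after the subscriber has already been cancelled or timed out, so the error has nowhere to go. A minimal sketch of wiring a client like this into WebFlux so failures stay inside the reactive pipeline; the bucket and key names are illustrative assumptions, not from the question:

import reactor.core.publisher.Mono;
import software.amazon.awssdk.core.async.AsyncRequestBody;
import software.amazon.awssdk.services.s3.S3AsyncClient;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectResponse;

public class S3Uploader {
    private final S3AsyncClient s3client;

    public S3Uploader(S3AsyncClient s3client) {
        this.s3client = s3client;
    }

    // Wrap the SDK's CompletableFuture in a Mono so an upload failure is
    // delivered to the reactive chain instead of being dropped.
    public Mono<PutObjectResponse> upload(byte[] document) {
        PutObjectRequest request = PutObjectRequest.builder()
                .bucket("example-bucket") // illustrative bucket name
                .key("example-key")       // illustrative object key
                .build();
        return Mono.fromFuture(() -> s3client.putObject(request, AsyncRequestBody.fromBytes(document)));
    }
}

Using the supplier overload of Mono.fromFuture defers the SDK call to subscription time, which also keeps any downstream retry operators well-defined.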

Related

How to implement resiliency (retry) in a nested service call chain

We have a webpage that queries an item from an API gateway which in turn calls a service that calls another service and so on.
Webpage --> API Gateway --> service#1 --> service#2 --> data store (RDBMS, S3, Azure blob)
We want to make the operation resilient so we added a retry mechanism at every layer.
Webpage --retry--> API Gateway --retry--> service#1 --retry--> service#2 --retry--> data store.
This, however, could cause a cascading failure: if the data store doesn't respond in time, every layer will time out and retry. In other words, if each layer has the same connection timeout and is configured to retry 3 times, there will be a total of 3^4 = 81 retries to the data store (which is called a retry storm).
One way to fix this is to increase the timeout at each layer in order to give the layer below time to retry.
Webpage --5m timeout--> API Gateway --2m timeout--> service#1
This however is unacceptable because the timeout at the webpage will be too long.
How should I address this problem?
Should there only be one layer that retries? Which layer? And how can the layer know if the error is transient?
A couple of possible solutions (and you can/should use both) are to retry only on specific conditions and to implement rate limiters/circuit breakers.
Retry On is a technique where you don't retry on every condition, but only on specific ones. This could be a specific error code or a specific header value. E.g., in your current situation, DO NOT retry on timeouts; only retry on server failures. In addition, you could have each layer retry on different conditions.
Rate limiting means sticking either a local or a global rate-limiter service inline on the connections. This helps short-circuit the thundering herd in case it starts up. E.g., rate limit the data layer to X req/s (insert real values here) and the gateway to Y req/s; then even if a service attempts lots of retries, they won't pass too far down the chain. Similar to this is circuit breaking, where each layer only permits X active connections to any downstream; it's just another way to slow those retry storms. A sketch of both ideas follows.
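As a concrete sketch of "retry on" combined with a circuit breaker, here is roughly how it could look with a library such as Resilience4j. The library choice, the ServerFailureException marker type, and all thresholds are assumptions for illustration, not something prescribed above:

import java.time.Duration;
import java.util.function.Supplier;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class ResilientCall {
    // Hypothetical marker for a 5xx-style failure from the downstream service.
    static class ServerFailureException extends RuntimeException {}

    public static Supplier<String> decorate(Supplier<String> callService2) {
        // "Retry on": retry only on server failures, never on timeouts.
        RetryConfig retryConfig = RetryConfig.custom()
                .maxAttempts(3)
                .waitDuration(Duration.ofMillis(200))
                .retryOnException(e -> e instanceof ServerFailureException)
                .build();
        // Circuit breaker: open once half the recent calls fail, shedding
        // load so a retry storm cannot pass far down the chain.
        CircuitBreakerConfig breakerConfig = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)
                .waitDurationInOpenState(Duration.ofSeconds(10))
                .build();
        Retry retry = Retry.of("service2", retryConfig);
        CircuitBreaker breaker = CircuitBreaker.of("service2", breakerConfig);
        // The breaker wraps the retry, so retries stop once the circuit opens.
        return CircuitBreaker.decorateSupplier(breaker, Retry.decorateSupplier(retry, callService2));
    }
}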

How to fix aws lambda timeouts on synchronous invokes with C++ SDK?

When I invoke my Lambda function it takes between 1 and 15 seconds to execute. If I invoke the function via the C++ SDK, I get timeouts. These timeouts seem to occur after a few seconds (this is human judgment only; I did not actually time it).
Question: How do I tell the SDK to wait for slow lambdas to return and not to timeout?
Things that did not work:
In the JS SDK you can change this in the HTTP settings. There is no such option in the C++ SDK's HTTPOptions.
It does not help to give the lambda client a config with a larger connectionTimeoutMS (socket timeout). Also, the httpRequestTimeoutMs of the client is set to 0 by default, meaning it will wait forever.
I am using synchronous requests, which do not seem to have an extra option for timeouts.
Additional information:
I am using a single client to run multiple requests in parallel.
The error also happens if I use async requests.
Related:
How do I troubleshoot retry and timeout issues when invoking a Lambda function using an AWS SDK?
The same thing gave me a hard time once. You may have found the solution already, but for others, here is what I did. There is a client configuration that can change the default timeouts. The default connect timeout for sending a request is 1 second and the default socket read timeout is 3 seconds; if the request completes within this time, all is good, otherwise a retry is invoked according to the Lambda settings. The behavior of these two settings is well explained in their respective header file.
You can also play with the memory size of the Lambda: the higher the Lambda's memory, the lower its response time for the same workload.
Aws::Client::ClientConfiguration m_ClientConfig;
m_ClientConfig.requestTimeoutMs = 300000; // i.e. 300 seconds
m_ClientConfig.connectTimeoutMs = 300000;

From the ClientConfiguration header:

/**
 * Socket read timeouts for HTTP clients on Windows. Default 3000 ms. This should be more than adequate for most services. However, if you are transferring large amounts of data
 * or are worried about higher latencies, you should set to something that makes more sense for your use case.
 * For Curl, it's the low speed time, which contains the time in number milliseconds that transfer speed should be below "lowSpeedLimit" for the library to consider it too slow and abort.
 */
long requestTimeoutMs;

/**
 * Socket connect timeout. Default 1000 ms. Unless you are very far away from the data center you are talking to, 1000 ms is more than sufficient.
 */
long connectTimeoutMs;

API Gateway occasionally spikes 5XX errors in Production

Our API Gateway and Lambdas are regularly used and work just fine most of the time, however we see spikes in 5XX errors now and then which causes a spike in customer complaints and other issues. When I look at the logs during this time I see a flood of the following error:
Execution failed due to configuration error: Malformed Lambda proxy response
There are no other details beyond this. After 10 or 15 minutes it will go away, along with the customer complaints. I've read that it may happen if you exceed your concurrency limit, but looking at the dashboard, it doesn't look like we ever break above 150 concurrent executions.
The calls themselves being hit work consistently as well, aside from these random spikes in 5XXs.
What else might be causing this inconsistency?
I've been looking through logs to try to get this figured out. I have made the logs as verbose as possible and there is nothing there. We'll have a normal call with a success response, then a few minutes later this error comes up with no other logging, just the error alone. Then a few minutes after that, the logs start for the next successful call.
10:25:42 Successfully completed execution
10:25:42 Method completed with status: 200
10:42:01 Execution failed due to configuration error: Malformed Lambda proxy response
12:21:21 Successfully completed execution
12:21:21 Method completed with status: 200
Logging can't go further because the lambdas are never even executed. So we have no details on the payload sent to it, or any internal logging for the call, etc. It just immediately fails at the API Gateway level.
Edit: We still get these spikes, but we are working on splitting the Lambdas out more. We have an ExpressJS app that handles the lion's share of all requests, so we are breaking more off, especially high-traffic requests, into their own Lambdas to see if this helps, in case there is an issue where a container gets too backlogged or times out because it is handling long-running requests (upwards of 20 s) while also being hammered by requests that finish in under 500 ms.
The other theory is that an error triggered somewhere kills the process, or something else goes wrong, and that container stays bad until it gets destroyed and respawned; that would explain why these errors spike and then go away after a few minutes. Breaking the Lambdas up more should reduce the odds of an error in one cascading and impacting all other requests.
We also increased the Lambda's resources to see if that would help it handle so many requests.
This usually happens when your call times out or there is a delay in your Lambda execution.
If you are accessing an external resource such as RDS or making an external network call, wrap it in a promise and guard it with a timeout. This way you can identify which resource has a bottleneck or takes a long time to execute.
exports.handler = function (event, context, callback) {
    var response = {};             // the response object returned on success
    var err = "An error occurred"; // returned if the timeout below fires first

    // Fail fast if the work has not called back within 3000 ms, so the
    // bottleneck is surfaced instead of a silent gateway timeout.
    setTimeout(function () {
        callback(err, response);
    }, 3000);

    // Actual code here: query the external resource, then call callback(null, response)
};
Also, check for any missing callbacks. That will also cause this issue.
Hope this helps.
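For reference, a proxy integration response must at minimum contain a status code and a string body; anything else produces exactly this "Malformed Lambda proxy response" error. A minimal sketch in Java, assuming the aws-lambda-java-events library:

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayProxyResponseEvent;

public class ProxyHandler implements RequestHandler<APIGatewayProxyRequestEvent, APIGatewayProxyResponseEvent> {
    @Override
    public APIGatewayProxyResponseEvent handleRequest(APIGatewayProxyRequestEvent request, Context context) {
        // API Gateway proxy integrations require at least statusCode and a
        // string body; returning anything else is reported as malformed.
        return new APIGatewayProxyResponseEvent()
                .withStatusCode(200)
                .withBody("{\"ok\":true}");
    }
}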

Speech API daily requests quota counter increases too much after one request

I am aware of the limitations listed here, but I need some clarification on the quota limit.
I'm using the Node.js library to make a simple asynchronous speech-to-text request, using a .raw file stored in my bucket.
After the request is done, when checking the API Manager traffic, the daily requests counter has increased by 50 to 100 requests.
I am not using any request libraries or other frameworks, just the code from the gCloud docs.
var file = URL.bucket + "audio.raw"; // file from the earlier upload
speech.startRecognition(file, config).then((data) => {
    var operation = data[0];
    // Registers a listener; the library polls the service until the
    // long-running operation completes.
    operation.on('complete', function (transcript) {
        console.log(transcript);
    });
});
I believe this has to do with the operation.on call, which registers a listener for the operation to complete, but continues to poll the service until it does finally finish.
I think, based on looking at some code, that you can change the interval at which the service will poll using the longrunning.initialRetryDelayMillis setting, which should reduce the number of requests you see in your quota consumption.
Some places to look:
Speech client constructor: https://github.com/GoogleCloudPlatform/google-cloud-node/blob/master/packages/speech/src/index.js#L70
GAX Speech Client's constructor: https://github.com/GoogleCloudPlatform/google-cloud-node/blob/master/packages/speech/src/v1/speech_client.js#L67
Operations client: https://github.com/googleapis/gax-nodejs/blob/master/lib/operations_client.js#L74
GAX's Operation: https://github.com/googleapis/gax-nodejs/blob/master/lib/longrunning.js#L304

web service best practice - server timeout longer than http client timeout

I am trying to build a web service on top of HBase, so the code looks roughly like:
@GET
@Path("/blabla")
@Override
public List<String> getEvents($$$params$$$) {
    ......
    // call HBase to query the events
    ......
}
When the HBase service is down, the HBase Java API keeps retrying the connection to the HBase region server until it eventually times out and throws a runtime exception:
NoServerForRegionException: Unable to find region for event,,99999999999999 after 10 tries.
The logic has no problem; my issue is that the HttpClient times out well before HBase finishes retrying. Then my web service's API consumer gets no response at all. Ugly.
Question -
What's the best practice if the server's timeout is potentially longer than that of the HTTP connection itself? How can the web service respond to the client gracefully in this case?
Set the caching on your Scan object to some reasonable value. Another thing: since you are using a web service to show the results to your users, I assume you are showing only a few rows (or records) at a time. You can use HBase's PageFilter so that you fetch only a specified number of rows each time and don't have to wait to get all the rows in one shot. Both settings are sketched below.
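A rough sketch of both suggestions against the standard HBase client API; the "event" table name comes from the exception in the question, while the caching and page-size values are placeholder assumptions:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class EventQuery {
    // Return at most pageSize row keys from the "event" table per call.
    public static List<String> getEvents(long pageSize) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        Scan scan = new Scan();
        scan.setCaching(100);                     // rows fetched per RPC round trip
        scan.setFilter(new PageFilter(pageSize)); // stop scanning after pageSize rows
        List<String> events = new ArrayList<>();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("event"));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                events.add(Bytes.toString(result.getRow()));
            }
        }
        return events;
    }
}

With a bounded page size, each request returns quickly even when the table is large, which also keeps the response well inside the HTTP client's timeout.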