AWS S3 rate limit and SlowDown errors - amazon-web-services

I'm refactoring a job that uploads ~1.2 million small files to AWS; previously the upload was done file by file on a 64-CPU machine using multiple processes. I switched to an async + multiprocess approach, following the S3 rate limits and the performance best-practices guidelines, to make it faster. With sample data I can cut execution time to as little as one tenth of the original, but with production loads S3 returns "SlowDown" errors.
The business logic produces a folder structure like this:
s3://bucket/this/will/not/change/<shard-key>/<items>
The objects are split evenly across ~30 shard keys, so every prefix contains ~40k items.
Every process writes to its own prefix and launches batches of 3k PUT requests asynchronously until completion. After each batch there is a sleep that ensures we do not start the next batch before 1.1 s has passed, so that we stay within the 3,500 PUT requests per second limit.
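Simplified, each worker process runs a loop like the sketch below (the bucket name, prefix, and item payloads are placeholders, and the exact client-creation call depends on the aiobotocore version):

```python
import asyncio
import time

import aiobotocore  # newer releases expose aiobotocore.session.get_session instead

BUCKET = "bucket"                          # placeholder
PREFIX = "this/will/not/change/shard-00"   # placeholder shard prefix
BATCH_SIZE = 3000                          # 3k PUTs per batch, as described above
BATCH_INTERVAL = 1.1                       # seconds between batch starts


async def upload_shard(items):
    """items: list of (key_suffix, bytes) pairs destined for one shard prefix."""
    session = aiobotocore.get_session()
    async with session.create_client("s3") as s3:
        for start in range(0, len(items), BATCH_SIZE):
            batch = items[start:start + BATCH_SIZE]
            began = time.monotonic()
            # Fire the whole batch concurrently and wait for every PUT to finish.
            await asyncio.gather(*(
                s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}/{name}", Body=body)
                for name, body in batch
            ))
            # Sleep out the rest of the 1.1 s window before the next batch.
            elapsed = time.monotonic() - began
            if elapsed < BATCH_INTERVAL:
                await asyncio.sleep(BATCH_INTERVAL - elapsed)
```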
The problem is that we receive SlowDown errors for ~1 hour, after which the job writes all the files in ~15 minutes. If we lower the limit to 1k/s it gets even worse: the job runs for hours and never finishes.
This is the distribution of the errors over time for the 3k/sec limit:
We are using Python 3.6 with aiobotocore to run the requests asynchronously.
Trial and error to figure out how to mitigate this takes forever on production data, and testing with a smaller quantity of data gives different results (it works flawlessly).
Did I miss any documentation on how to make this scale up correctly?

Related

How can I cancel a download that's too big or slow in a lambda?

I'm using axios in a Lambda function to download a file from a user-provided URL. Obviously that file could be any size and might be served at any speed, and I am concerned that this creates denial-of-service and denial-of-wallet risks.
I don't know whether AWS charges for Lambda ingress; I haven't been able to find a definitive answer yet. Even if it doesn't, large files could still force my Lambdas to run longer (costing me money) and potentially push me up against the rate limits I have set, in part, to mitigate flooding-attack risk (denying people service).
Likewise, very slow downloads might cause my Lambdas to run until they time out. My timeouts are set fairly high because there is processing to do once the file is downloaded. I'd rather bail out after a small handful of seconds, since the input data should always be small and fast.
So what I want is for downloads to abort if they hit a preset maximum size in bytes OR a maximum download time.
If adding these limits isn't possible with Axios then I'm open to using different libraries like node-fetch.
On the axios side itself, you can set timeout and maxContentLength to limit the request time and the download size. The Lambda maximum timeout is 15 minutes.
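The thread itself is about axios/Node, but the same pair of guards can be sketched in Python (matching the stack of the main question above) by streaming the response and aborting as soon as either a byte limit or a time limit is exceeded; both limits below are arbitrary placeholders:

```python
import time
import requests

MAX_BYTES = 5 * 1024 * 1024   # placeholder 5 MB cap
MAX_SECONDS = 5               # placeholder overall time cap


def bounded_download(url):
    """Download url, aborting if it exceeds MAX_BYTES or MAX_SECONDS."""
    start = time.monotonic()
    body = bytearray()
    # stream=True lets us inspect the body chunk by chunk instead of
    # buffering the whole (possibly huge) response in one call.
    with requests.get(url, stream=True, timeout=MAX_SECONDS) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            body.extend(chunk)
            if len(body) > MAX_BYTES:
                raise ValueError("download exceeded size limit")
            if time.monotonic() - start > MAX_SECONDS:
                raise TimeoutError("download exceeded time limit")
    return bytes(body)
```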
If you expect many lengthy requests, it is better to use EC2: huge numbers of Lambda invocations at high memory and long duration end up more costly than EC2. Serverless is cost-effective and operationally easy, especially for spiky workloads; for a steady 24/7 workload with long processing times, a VM is the better choice.

AWS Lambda time out in 6 seconds

I am using the Serverless Framework with Node.js (version 4.4) to create AWS Lambda functions. The default timeout for Lambda execution is 6 seconds. I am connecting to a MySQL database using the Sequelize ORM. I see errors like "execution timed out". Sometimes my code works properly even with this error, but sometimes nothing works after the timeout. It's really hard for me to make sense of this timeout, and I am afraid that increasing it will incur more charges.
If you are seeing errors like "execution timed out", then you are probably cutting off your Lambdas with a timeout that is too low.
There might be several reasons for this:
Initialization of the container can be slow; this should only happen on the first call to a container. With a low memory setting and lots of libraries to load it can take quite a while (usually this shouldn't be a problem with Node).
Connecting to a database can be slow
If you reuse database connections, it's possible that they are stale, and this can lead to a timeout (see the sketch after this list).
Your database queries may be slow.
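The question is about Node and Sequelize, but the connection-reuse pitfall is runtime-agnostic. A rough Python/PyMySQL sketch of the usual workaround (create the connection outside the handler, then revalidate it on every invocation) looks like this; host, credentials, and the query are placeholders:

```python
import pymysql

# Created once per container, outside the handler, so warm invocations can
# reuse the connection instead of paying the connection cost every time.
_conn = pymysql.connect(
    host="db.example.internal",   # placeholder
    user="app",                   # placeholder
    password="secret",            # placeholder
    database="app",               # placeholder
    connect_timeout=5,
)


def handler(event, context):
    # A connection kept across invocations may have gone stale;
    # ping(reconnect=True) transparently reconnects if it has.
    _conn.ping(reconnect=True)
    with _conn.cursor() as cur:
        cur.execute("SELECT 1")   # placeholder query
        return cur.fetchone()
```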
To mitigate the problem, temporarily add some logging to your Lambda and increase the timeout so you can figure out what actually takes so long. Unless you are already a heavy Lambda user, you are unlikely to use up your 400,000 free GB-seconds a month. If you run your Lambdas with 128 MB, that equates to 3,200,000 seconds per month, about 103,000 seconds (or more than 28 hours) per day. Try testing with higher memory settings as well; depending on the case this can even reduce the total GB-seconds consumed.
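A quick sanity check of that arithmetic (400,000 GB-seconds is the standard monthly Lambda free tier; the memory sizes are just examples):

```python
FREE_GB_SECONDS = 400_000  # monthly Lambda free tier

for memory_mb in (128, 256, 512, 1024):
    seconds_per_month = FREE_GB_SECONDS / (memory_mb / 1024)
    hours_per_day = seconds_per_month / 31 / 3600
    print(f"{memory_mb} MB: {seconds_per_month:,.0f} s/month, ~{hours_per_day:.1f} h/day")
```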
As others have pointed out, you only pay for the time actually used, so if your Lambda finishes faster than the timeout you only pay for the time consumed (in 100 ms increments).

Get a Constant rps in Jmeter

I have been trying to use JMeter to test my server. I have a CloudSearch endpoint on AWS, and I have to test whether it can scale up to 25,000 requests per second without failing. I tried JMeter with a Constant Throughput Timer set to a target throughput of 1,500,000 per minute, running 1,000 threads for 10 minutes. But when I review the aggregate report it shows an average of only 25 requests per second. How do I get an average of around 25,000 requests per second?
The Constant Throughput Timer can only pause threads to keep them at the specified "Target Throughput" value, so make sure you provide enough virtual users (threads) to generate the desired "requests per minute" value.
You don't have enough threads to achieve that many requests per second.
To get an average of ~25,000 requests per second, you have to increase the number of threads.
Remember, the number of threads affects the results if your server slows down: if it does and you don't have enough threads, you will not be injecting the expected load and will end up with fewer transactions performed.
You need to increase the number of concurrent users to at least 25,000 (this assumes a 1-second response time; with a 2-second response time you would need 50,000).
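That rule of thumb is just Little's Law (required concurrency is roughly the target throughput times the response time); a quick check with the numbers from this thread, where the response times are assumptions:

```python
def threads_needed(target_rps, avg_response_time_s):
    # Little's Law: required concurrency = arrival rate * time spent per request
    return int(target_rps * avg_response_time_s)

print(threads_needed(25_000, 1.0))  # 25000 threads at a 1 s response time
print(threads_needed(25_000, 2.0))  # 50000 threads at a 2 s response time
```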
JMeter's default configuration is not suitable for high loads; it is fine for test development and debugging, but when it comes to real load you need to take some precautions, e.g.:
Run JMeter test in non-GUI mode
Increase JVM Heap Size and tune other parameters
Disable all listeners during the test
If the above tips don't help, follow the other recommendations from the article "9 Easy Solutions for a JMeter Load Test 'Out of Memory' Failure", or consider distributed testing.

Counting number of requests per second generated by JMeter client

This is how the application is set up:
2 c4.8xlarge instances
10 m4.4xlarge JMeter clients generating load; each client uses 70 threads
While conducting a load test on a simple GET request (a 685-byte page), I came across an issue of reduced throughput after some time of running. A throughput of about 18,000 requests/sec is reached with 700 threads, stays at this level for 40 minutes, and then drops. The thread count remains 700 throughout the test. I have run tests with different load patterns, but the results have been the same.
The application response time stays consistently low throughout the test.
According to the ELB monitor, there is a reduction in the number of requests (and, I suppose, hence the lower throughput).
No errors are encountered during the test run. I also set a connect timeout on the HTTP request, but still no errors.
I discussed this issue with AWS support at length, and according to them I am not hitting any network limit during test execution.
Given that the number of threads remains constant during the test run, what are these threads doing? Is there a metric I can check to find out the number of requests generated (not hits/sec) by a JMeter client instance?
Testplan - http://justpaste.it/qyb0
Try adding the following test elements:
HTTP Cache Manager
and especially the DNS Cache Manager, as it might be that all your threads are hitting only one c4.8xlarge instance while the other one sits idle. See the article "The DNS Cache Manager: The Right Way To Test Load Balanced Apps" for an explanation and details.

Auto scaling batch jobs on Elastic Beanstalk

I am trying to set up scalable background image processing using Elastic Beanstalk.
My setup is the following:
The application server (running on Elastic Beanstalk) receives a file, puts it on S3, and sends a request over SQS to process it.
The worker server (also running on Elastic Beanstalk) polls the SQS queue, takes a request, loads the original image from S3, processes it into 10 different variants, and stores them back on S3.
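Simplified, the worker loop looks like the sketch below (the queue URL, bucket, and the actual image processing are placeholders):

```python
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/image-jobs"  # placeholder
BUCKET = "image-bucket"                                                    # placeholder


def make_variants(original):
    """Produce the 10 variants; the actual image processing is out of scope here."""
    raise NotImplementedError


def poll_forever():
    while True:
        # Long-poll so an idle worker isn't busy-looping against SQS.
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1,
                                   WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            key = msg["Body"]  # assume the message body carries the S3 key
            original = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
            for suffix, data in make_variants(original).items():
                s3.put_object(Bucket=BUCKET, Key=f"{key}.{suffix}", Body=data)
            # Delete the message only after every variant is safely stored.
            sqs.delete_message(QueueUrl=QUEUE_URL,
                               ReceiptHandle=msg["ReceiptHandle"])
```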
These upload events are happening at a rate of about 1-2 batches per day, 20-40 pics each batch, at unpredictable times.
Problem:
I am currently using one micro instance for the worker. Generating one variant of a picture can take anywhere from 3 seconds to 25-30 (it seems the first ones are done in 3, but then the micro instance slows down; I think this is due to its bursty-workload design). Anyway, when I upload 30 pictures, the job takes 30 pics * 10 variants each * 30 seconds = 2.5 hours to process?!
Obviously this is unacceptable. I tried using a "small" instance instead; performance is consistent there, but at about 5 seconds per variant that is still 30 * 10 * 5 s = 25 minutes per batch. Still not really acceptable.
What is the best way to attack this problem so that it gets the fastest results while staying price-efficient?
Solutions I can think of:
Rely on Beanstalk auto scaling. I've tried that, setting up auto scaling based on CPU utilization, and it seems very slow to react and unreliable. I've tried setting the measurement period to 1 minute and the breach duration to 1 minute, with thresholds of 70% to scale up and 30% to scale down, in increments of 1. It takes the system a while to scale up and then a while to scale down; I can probably fine-tune it, but it still feels wrong. Ideally I would like a faster machine than micro (small, medium?) for these spikes of work, but with Beanstalk that means running at least one all the time, and since the system is idle most of the time that makes no sense price-wise.
Abandon Beanstalk for the worker, implement my own monitor of the SQS queue running on a micro, and let it fire up a larger machine (or group of larger machines) when there are enough pending messages in the queue, then terminate them the moment the queue is detected to be idle. That seems like a lot of work, unless there is a ready-made solution out there. In any case, I would lose the Beanstalk benefits of deploying code through git, managing environments, etc.
I don't like either of these solutions.
Is there any other nice approach I am missing?
Thanks
CPU utilization on a micro instance is probably not the best metric to use for autoscaling in this case.
Length of the SQS queue would probably be the better metric to use, and the one that makes the most natural sense.
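As a rough sketch of what scaling on queue length can look like with CloudWatch and boto3 (the alarm name, queue name, threshold, and policy ARN are all placeholders; this assumes the worker's Auto Scaling group already has a scale-up policy attached):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm whenever messages are waiting in the worker queue; its action points
# at the scale-up policy of the worker's Auto Scaling group.
cloudwatch.put_metric_alarm(
    AlarmName="image-worker-queue-backlog",                      # placeholder
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "image-jobs"}],   # placeholder
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:autoscaling:region:account:scalingPolicy/scale-up"],  # placeholder ARN
)
# A mirror alarm on the same metric (e.g. zero visible messages for several
# periods) can drive the scale-down policy.
```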
Needless to say, if you can budget for a bigger baseline machine, everything will run that much faster.