I'm trying to reach the limit of 3500 PUT req/s on a prefix in AWS S3. However, I'm getting throttled too early. I'm using the Java AWS SDK.
The application ramps up the number of parallel requests. I noticed that if I reach a throughput of 2,500-2,600 req/s too quickly (about 40 seconds after the start), AWS throttles me. However, if I slow down the ramp-up, there is no throttling.
I'm using a rate limiter (specifically the Resilience4j RateLimiter) to cap requests at 3,499 per second.
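For reference, the load-generation logic looks roughly like this (a simplified sketch, assuming the AWS SDK for Java v2; the bucket name, key pattern, and thread-pool size are placeholders rather than my exact code):

```java
import io.github.resilience4j.ratelimiter.RateLimiter;
import io.github.resilience4j.ratelimiter.RateLimiterConfig;
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class S3PutLoad {
    public static void main(String[] args) {
        // Allow at most 3,499 permits per one-second window.
        RateLimiterConfig config = RateLimiterConfig.custom()
                .limitForPeriod(3499)
                .limitRefreshPeriod(Duration.ofSeconds(1))
                .timeoutDuration(Duration.ofSeconds(5))
                .build();
        RateLimiter limiter = RateLimiter.of("s3-puts", config);

        S3Client s3 = S3Client.create();
        ExecutorService pool = Executors.newFixedThreadPool(256); // parallelism is ramped up in the real app

        for (int i = 0; i < 100_000; i++) {
            final String key = "my-prefix/object-" + i; // placeholder key pattern
            pool.submit(RateLimiter.decorateRunnable(limiter, () ->
                    s3.putObject(PutObjectRequest.builder()
                                    .bucket("my-bucket") // placeholder bucket
                                    .key(key)
                                    .build(),
                            RequestBody.fromString("payload"))));
        }
        pool.shutdown();
    }
}
```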
Why can't I reach the limit? Has anyone ever reached it? There are a lot of resources on the Internet pointing to the same page, Best practices design patterns: optimizing Amazon S3 performance, but no one says whether that limit is actually reachable :/
PS: There is no throttling on GET requests, even when I reach a throughput higher than 5,500 req/s.
Related
I am currently using the Basic cluster tier on Confluent Cloud, and I only have one topic with 9 partitions. I have a REST API set up using AWS Lambda which publishes messages to Kafka.
Currently I am stress testing the pipeline with 5k-10k requests per second, and I found that latency shoots up to 20-30 seconds to publish a 1 KB record, whereas a single request normally takes around 300 ms.
I added producer configurations such as linger.ms = 500 ms and batch.size = 100 KB. I see some improvement (15-20 seconds per request), but that still feels far too high.
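For context, the producer setup looks roughly like this (a trimmed-down sketch; the bootstrap server, topic name, and payload are placeholders, and the Confluent Cloud SASL credentials are omitted):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class LambdaKafkaPublisher {
    private static KafkaProducer<String, String> buildProducer() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "pkc-xxxxx.confluent.cloud:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // The two settings mentioned above: wait up to 500 ms to fill a batch of up to 100 KB.
        props.put(ProducerConfig.LINGER_MS_CONFIG, "500");
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, String.valueOf(100 * 1024));
        // Confluent Cloud also requires SASL_SSL with an API key/secret (omitted here).
        return new KafkaProducer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaProducer<String, String> producer = buildProducer()) {
            producer.send(new ProducerRecord<>("my-topic", "key", "1kb-payload")); // placeholder topic and record
            producer.flush();
        }
    }
}
```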
Is there anything I am missing, or is this a limitation of the Basic cluster on Confluent Cloud? All of the cluster configurations were left at their defaults.
I identified that the issue is with the API requests getting throttled. As mentioned by Chris Chen, the AWS SDK's exponential back-off strategy causes the average time to shoot up. I have requested an increase in concurrent executions from AWS, and I am confident that will solve the issue.
I have been using the Amazon S3 service to store some files.
I have uploaded 4 videos, and they are public. I'm using a third-party video player (JW Player) for those videos. As a new user on the AWS Free Tier, my free PUT, POST and LIST requests are almost used up out of the 2,000 allowed, and for four videos that seems ridiculous.
Am I missing something? Shouldn't one upload be one PUT request? I don't understand how I've hit that limit already.
The AWS Free Tier for Amazon S3 includes:
5GB of standard storage (normally $0.023 per GB)
20,000 GET requests (normally $0.0004 per 1,000 requests)
2,000 PUT requests (normally $0.005 per 1,000 requests)
In total, that is 5 × $0.023 + 20 × $0.0004 + 2 × $0.005 ≈ $0.133, so the whole allowance is worth up to about 13.3 cents every month!
So, don't be too worried about your current level of usage, but do keep an eye on charges so you don't get too many surprises. You can always Create a Billing Alarm to Monitor Your Estimated AWS Charges.
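If you'd rather create that alarm programmatically than through the console, a rough sketch with the AWS SDK for Java v1 could look like the following. The alarm name, threshold, and SNS topic ARN are placeholders, and note that the AWS/Billing metric is only published in us-east-1 once you enable billing alerts in your account:

```java
import com.amazonaws.regions.Regions;
import com.amazonaws.services.cloudwatch.AmazonCloudWatch;
import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder;
import com.amazonaws.services.cloudwatch.model.ComparisonOperator;
import com.amazonaws.services.cloudwatch.model.Dimension;
import com.amazonaws.services.cloudwatch.model.PutMetricAlarmRequest;
import com.amazonaws.services.cloudwatch.model.Statistic;

public class BillingAlarm {
    public static void main(String[] args) {
        // Billing metrics live in us-east-1 regardless of where your resources run.
        AmazonCloudWatch cloudWatch = AmazonCloudWatchClientBuilder.standard()
                .withRegion(Regions.US_EAST_1)
                .build();

        cloudWatch.putMetricAlarm(new PutMetricAlarmRequest()
                .withAlarmName("estimated-charges-over-10-usd")           // placeholder name
                .withNamespace("AWS/Billing")
                .withMetricName("EstimatedCharges")
                .withDimensions(new Dimension().withName("Currency").withValue("USD"))
                .withStatistic(Statistic.Maximum)
                .withPeriod(6 * 60 * 60)                                  // billing data only updates a few times a day
                .withEvaluationPeriods(1)
                .withThreshold(10.0)                                      // placeholder threshold in USD
                .withComparisonOperator(ComparisonOperator.GreaterThanThreshold)
                .withAlarmActions("arn:aws:sns:us-east-1:123456789012:billing-alerts")); // placeholder SNS topic
    }
}
```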
The AWS Free Tier is provided to explore AWS services. It is not intended for production usage.
It would be very hard to find the reason for this without debugging a bit, so I would suggest you try the following:
See if you have CloudTrail enabled. If yes, you can track the API calls to S3 to see if anything looks wrong there.
Note that if CloudTrail is enabled and delivers its logs into an S3 bucket, those deliveries themselves also consume some of your requests.
See if you have server access logging enabled at the bucket level; that will give you more insight into which requests are reaching your bucket (see the sketch after this list).
Your videos are public, and that is the biggest concern here, as you don't know who can access them.
Set up CloudWatch alarms to avoid surprises, and look at the logs to find the source of the issue.
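For the bucket-level logging point above, a minimal sketch of turning on server access logging with the AWS SDK for Java v1 might look like this; the bucket names and prefix are placeholders, and the target bucket must already grant the S3 log delivery group permission to write:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLoggingConfiguration;
import com.amazonaws.services.s3.model.SetBucketLoggingConfigurationRequest;

public class EnableAccessLogging {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Deliver access logs for "my-video-bucket" into "my-log-bucket" under the "s3-access/" prefix.
        BucketLoggingConfiguration logging =
                new BucketLoggingConfiguration("my-log-bucket", "s3-access/"); // placeholder target bucket and prefix

        s3.setBucketLoggingConfiguration(
                new SetBucketLoggingConfigurationRequest("my-video-bucket", logging)); // placeholder source bucket
    }
}
```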
I have a Cloud Function that makes Vision API requests, specifically Document Text Detection requests. My peak request rate is usually around 120-150 requests per minute on an average day.
I've suddenly been getting resource quota exceeded errors for Vision API requests with a request rate at 2500 requests per minute. Some things to note:
I've had no code changes in 3 months
I deleted and redeployed the Cloud Function making these requests to stop any problematic image that was causing a runaway loop
Neither my code calling the API nor the Cloud Functions themselves were being retried, so there really wasn't a way for my request rate to increase that much overnight with no changes introduced.
The service account making the Vision calls is making the normal number of requests and is only used by the Cloud Function, i.e. it is not being used by someone's local script.
I've since turned on retries to mitigate this, since it will "work" with exponential back-off, but that is expensive, especially with the Vision API. Is there anything I can do to find the root cause of this issue?
To identify the specific quota being exceeded, the Stackdriver Monitoring API helps by exposing quota metrics.
GCP lets you inspect which quota is being exceeded in greater depth using the Stackdriver API and UI; the quota metrics appear in the Metrics Explorer.
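As a rough sketch of querying those metrics programmatically (assuming the Java client for Cloud Monitoring, and assuming the relevant metric type is serviceruntime.googleapis.com/quota/exceeded — verify the exact metric name in Metrics Explorer first; the project ID is a placeholder):

```java
import com.google.cloud.monitoring.v3.MetricServiceClient;
import com.google.monitoring.v3.ListTimeSeriesRequest;
import com.google.monitoring.v3.ProjectName;
import com.google.monitoring.v3.TimeInterval;
import com.google.monitoring.v3.TimeSeries;
import com.google.protobuf.util.Timestamps;

public class QuotaExceededCheck {
    public static void main(String[] args) throws Exception {
        String projectId = "my-project"; // placeholder
        try (MetricServiceClient client = MetricServiceClient.create()) {
            long nowMillis = System.currentTimeMillis();
            TimeInterval lastHour = TimeInterval.newBuilder()
                    .setStartTime(Timestamps.fromMillis(nowMillis - 60L * 60 * 1000))
                    .setEndTime(Timestamps.fromMillis(nowMillis))
                    .build();

            ListTimeSeriesRequest request = ListTimeSeriesRequest.newBuilder()
                    .setName(ProjectName.of(projectId).toString())
                    // Assumed metric type; the metric labels indicate which quota was hit.
                    .setFilter("metric.type=\"serviceruntime.googleapis.com/quota/exceeded\"")
                    .setInterval(lastHour)
                    .setView(ListTimeSeriesRequest.TimeSeriesView.FULL)
                    .build();

            for (TimeSeries series : client.listTimeSeries(request).iterateAll()) {
                System.out.println(series.getMetric().getLabelsMap());
            }
        }
    }
}
```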
We're using RDS on a db.t2.large instance, and an Auto Scaling group of EC2 instances writes data to the database during the day. In rush hours we receive about 50,000 HTTP requests, each of which reads/writes MySQL data.
This varies each day, but for today's example, during an hour:
We're seeing "Connect Error (2002) Connection timed out" from our PHP instances, about 187 times a minute.
RDS CPU won't rise above 50%
DB Connections won't go above 30 (max is set to 5000).
Free storage is ~300 GB (the disk size is deliberately large to provide high IOPS)
Write IOPS hit 1,500 while bursting, but drop to 900 after rush hours once the burst limit has expired.
Read IOPS hit 300 every 10 minutes and stay around 150 in between.
Disk Write Throughput averages between 20 and 25 MB/s
Disk Read Throughput is between 0.75 and 1.5 MB/s
CPU Credit Balance is around 500, so we don't have a need for the CPU burst.
And when it comes to the network, I see a potential limit we're hitting:
Network Receive Throughput reaches 1.41 MB/s and stays around 1.5 MB/s for an hour.
During this time, Network Transmit Throughput is 5 to 5.2 MB/s, with drops to 4 MB/s every 10 minutes, which coincides with our cron jobs that are processing data (mainly reading).
I've tried placing the EC2 instances in a different AZ and in the same AZ as the RDS instance, but this has no effect.
During this time I can connect fine from my local workstation via an SSH tunnel (EC2 -> RDS), and from the EC2 instances to RDS directly as well.
The PHP scripts are set to time-out after 5 sec of trying to connect to ensure a fast response. I've increased this limit to 15 sec now for some scripts.
But which limit are we hitting on RDS? Before we start migrating or changing instance types, we'd like to know the source of this problem. I've also just enabled Enhanced Monitoring to get more details on this issue.
If more info needed, I'll gladly elaborate where needed.
Thanks!
Update 25/01/2016
On datasage's recommendation we increased the RDS disk size to 500 GB, which gives us 1,500 baseline IOPS with 3,600 burst. It uses around 1,200 IOPS (so it isn't even bursting now), and the timeouts still occur.
Connection timeouts are set to 5 sec and 15 sec as mentioned before; this makes no difference.
Update 26/01/2016
RDS Screenshot from our peak hours:
Update 28/01/2016
I've changed the sync_binlog setting to 0, because initially I thought we were hitting the EBS throughput limit (GP-SSD, 160 Mbit/s). This gives us a significant drop in disk throughput, and the IOPS are lower as well, but we still see the connection timeouts occur.
When we plot the times at which the errors occur, we see that each minute, around the :40-second mark, the timeouts start and continue for about 25 seconds; then there are no errors for about 35 seconds, and then it starts again. This happens during the peak hour of our incoming traffic.
Apparently it was the Network Performance keeping us back. When we upgraded our RDS instance to an m4.xlarge (with High Network Performance) the issues were resolved.
This was a last resort for us, but it solved our problem in the end.
In the DynamoDB documentation and in many places around the internet I've seen that single digit ms response times are typical, but I cannot seem to achieve that even with the simplest setup. I have configured a t2.micro ec2 instance and a DynamoDB table, both in us-west-2, and when running the command below from the aws cli on the ec2 instance I get responses averaging about 250 ms. The same command run from my local machine (Denver) averages about 700 ms.
aws dynamodb get-item --table-name my-table --key file://key.json
When looking at the CloudWatch metrics in the AWS console it says the average get latency is 12 ms though. If anyone could tell me what I'm doing wrong or point me in the direction of information where I can solve this on my own I would really appreciate it. Thanks in advance.
The response times you are seeing are largely due to the cold-start time of the AWS CLI. When running your get-item command, the CLI has to be loaded into memory, fetch temporary credentials (if using an EC2 IAM role on your t2.micro instance), and establish a secure connection to the DynamoDB service. Only after all of that does it execute the get-item request and finally print the results to stdout. Your command also has to read the key.json file off the filesystem, which adds further overhead.
In my experience running on a t2.micro instance, the AWS CLI has around 200 ms of startup overhead, which seems in line with what you are seeing.
This will not be an issue with long running programs, as they only pay a similar overhead price at start time. I run a number of web services on t2.micro instances which work with DynamoDB and the DynamoDB response times are consistently sub 20ms.
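To see this for yourself, a small sketch like the following (AWS SDK for Java v1; the table name and key are placeholders) typically shows the first call being much slower than the rest, because later calls reuse the already-warmed client and connection:

```java
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.GetItemRequest;
import java.util.Map;

public class WarmClientExample {
    // Reuse one client for the life of the process so credentials and the
    // TLS connection are cached between calls.
    private static final AmazonDynamoDB DDB = AmazonDynamoDBClientBuilder.defaultClient();

    public static void main(String[] args) {
        GetItemRequest request = new GetItemRequest()
                .withTableName("my-table")                        // placeholder table
                .addKeyEntry("id", new AttributeValue("item-1")); // placeholder key

        // The first call pays the connection/credential setup cost...
        timedGet(request);
        // ...subsequent calls on the same client are typically much faster.
        for (int i = 0; i < 5; i++) timedGet(request);
    }

    private static void timedGet(GetItemRequest request) {
        long start = System.nanoTime();
        Map<String, AttributeValue> item = DDB.getItem(request).getItem();
        System.out.printf("got %s in %d ms%n", item, (System.nanoTime() - start) / 1_000_000);
    }
}
```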
There are a lot of factors that go into the latency you will see when making a REST API call. DynamoDB can provide latencies in the single digit milliseconds but there are some caveats and things you can do to minimize the latency.
The first thing to consider is distance and speed of light. Expect to get the best latency when accessing DynamoDB when you are using an EC2 instance located in the same region. It is normal to see higher latencies when accessing DynamoDB from your laptop or another data center. Note that each region also has multiple data centers.
There are also performance costs from the client side based on the hardware, network connection, and programming language that you are using. When you are talking millisecond latencies the processing time on your machine can make a difference.
Another likely source of latency is the TLS handshake. Establishing an encrypted connection requires multiple round trips and computation on both sides. However, as long as you use keep-alive for the connection, you only pay this overhead for the first query; successive queries are substantially faster since they do not incur this initial penalty. Unfortunately the AWS CLI doesn't keep the connection alive between invocations, but the AWS SDKs for most languages will manage this for you automatically.
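For example, with the AWS SDK for Java v1 you can make the connection reuse explicit through ClientConfiguration; this is only a sketch, not a tuning recommendation, and the values shown are arbitrary:

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;

public class KeepAliveClient {
    public static AmazonDynamoDB build() {
        ClientConfiguration config = new ClientConfiguration()
                .withTcpKeepAlive(true)          // keep the TCP connection open between requests
                .withMaxConnections(50)          // connection pool size; arbitrary example value
                .withConnectionTimeout(1_000)    // ms allowed to establish a connection
                .withSocketTimeout(2_000);       // ms allowed to wait for a response

        // Reuse this client across requests so the TLS handshake is only paid once per pooled connection.
        return AmazonDynamoDBClientBuilder.standard()
                .withClientConfiguration(config)
                .build();
    }
}
```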
Another important consideration is that the latency that DynamoDB reports in the web console is the average. While DynamoDB does provide reliable average low double digit latency, the maximum latency will regularly be in the hundreds of milliseconds or even higher. This is visible by viewing the maximum latency in CloudWatch.
They recently announced DAX (Preview).
Amazon DynamoDB Accelerator (DAX) is a fully managed, highly available, in-memory cache for DynamoDB that delivers up to a 10x performance improvement – from milliseconds to microseconds – even at millions of requests per second. For more information, see In-Memory Acceleration with DAX (Preview).