RDS connection timeout in rush hours (with ~50,000 HTTP requests) - amazon-web-services

We're using RDS on a db.t2.large instance, and an auto-scaling group of EC2 instances writes data to the database during the day. In rush hours we're handling about 50,000 HTTP requests per hour, each of which reads/writes MySQL data.
This varies each day, but for today's example, during one hour:
We're seeing "Connect Error (2002) Connection timed out" from our PHP instances, about 187 times a minute.
RDS CPU won't rise above 50%.
DB Connections won't go above 30 (max is set to 5000).
Free storage is ~300 GB (the disk is deliberately large to provide high IOPS).
Write IOPS hit 1500 (bursting), but drop to 900 once the burst allowance expires, after rush hours.
Read IOPS spike to 300 every 10 minutes and sit around 150 in between.
Disk Write Throughput averages between 20 and 25 MB/s.
Disk Read Throughput is between 0.75 and 1.5 MB/s.
CPU Credit Balance is around 500, so we're not short on CPU burst capacity.
And when it comes to the network, I see a potential limit we're hitting:
Network Receive Throughput reaches 1.41 MB/s and stays around 1.5 MB/s for an hour.
During this time, Network Transmit Throughput sits at 5 to 5.2 MB/s, with drops to 4 MB/s every 10 minutes that coincide with our cron jobs processing data (mainly reading).
I've tried placing the EC2 instances in the same and in different AZs, but this has no effect.
During this time I can still connect fine from my local workstation via an SSH tunnel (EC2 -> RDS), and from the EC2 instances to RDS as well.
The PHP scripts are set to time out after 5 seconds of trying to connect, to ensure a fast response; I've now increased this limit to 15 seconds for some scripts.
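For reference, this is roughly the fail-fast pattern described, sketched in Python with PyMySQL since the actual PHP code isn't shown in the question; the endpoint and credentials are placeholders:

```python
import pymysql

# Placeholder endpoint and credentials; not from the question.
RDS_HOST = "mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com"

try:
    conn = pymysql.connect(
        host=RDS_HOST,
        user="app",
        password="secret",
        database="app",
        connect_timeout=5,  # fail fast: give up on connecting after 5 seconds
    )
except pymysql.err.OperationalError as exc:
    # The PHP equivalent of this surfaces as "Connect Error (2002)
    # Connection timed out" in our logs.
    print(f"Connection failed: {exc}")
```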
But which limit are we hitting on RDS? Before we start migrating or changing instance types, we'd like to know the source of this problem. I've also just enabled Enhanced Monitoring to get more details on this issue.
If more info needed, I'll gladly elaborate where needed.
Thanks!
Update 25/01/2016
On the recommendation of datasage, we increased the RDS disk size to 500 GB, which gives us a 1500 IOPS baseline (burstable to 3000). It now uses around 1200 IOPS (so it isn't even bursting), and the timeouts still occur.
Connection timeouts are set to 5 s and 15 s as mentioned before; this makes no difference.
Update 26/01/2016
RDS Screenshot from our peak hours:
Update 28/01/2016
I've changed the sync_binlog setting to 0, because initially I thought we were hitting the EBS throughput limit (gp2 SSD, 160 MB/s). This gives us a significant drop in disk throughput, and the IOPS are lower as well, but we still see the connection timeouts occur.
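For anyone reproducing this: on RDS the change goes through the DB parameter group rather than SET GLOBAL. A sketch using boto3; the parameter group name and region are placeholders:

```python
import boto3

rds = boto3.client("rds", region_name="eu-west-1")  # region assumed

# "my-mysql-params" is a placeholder for the parameter group
# attached to the DB instance.
rds.modify_db_parameter_group(
    DBParameterGroupName="my-mysql-params",
    Parameters=[{
        "ParameterName": "sync_binlog",
        "ParameterValue": "0",       # fewer binlog flushes, less disk throughput
        "ApplyMethod": "immediate",  # sync_binlog is dynamic, no reboot needed
    }],
)
```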
When we plot the times at which the errors occur, we see that each minute, at around :40 seconds, the timeouts start and continue for about 25 seconds; then there are no errors for about 35 seconds, and the cycle repeats. This happens during the peak hour of our incoming traffic.
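The periodicity is easiest to see by bucketing the error timestamps by second-of-minute; roughly like this, assuming a log file with one timestamp per error (file name and format are assumptions):

```python
from collections import Counter

# Assumes one error timestamp per line, e.g. "2016-01-28 09:15:41".
seconds = Counter()
with open("connect_errors.log") as fh:
    for line in fh:
        ts = line.strip()
        if ts:
            seconds[int(ts[-2:])] += 1  # bucket by second-of-minute

# Crude text histogram (scaled down): the :40-:59 buckets stand out.
for sec in sorted(seconds):
    print(f":{sec:02d}  {'#' * (seconds[sec] // 10)}")
```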

Apparently it was the network performance holding us back. When we upgraded our RDS instance to a db.m4.xlarge (with "High" network performance), the issues were resolved.
This was a last resort for us, but it solved our problem in the end.

Related

AWS S3 PUT requests get throttled

I'm trying to reach the limit of 3500 PUT req/s on a prefix in AWS S3. However, I'm getting throttled too early. I'm using the Java AWS SDK.
The application ramps up the number of parallel requests. I noticed that if I reach a throughput of 2500-2600 req/s too early (about 40 seconds from the start), AWS throttles me. However, if I slow down the pace, there is no throttling.
I'm using a rate limiter (the Resilience4j rate limiter, specifically), limiting the requests per second to 3499.
Why can't I reach the limit? Has anyone ever reached it? There are a lot of resources on the Internet pointing to the same page, Best practices design patterns: optimizing Amazon S3 performance, but no one says that the limit is actually reachable.
PS: There's no throttling on GET requests, even when I reach a throughput higher than 5500 req/s.
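For context, a rough Python analogue of the described setup (the original uses the Java SDK with a Resilience4j limiter); it's single-threaded for brevity, whereas the real test ramps parallel requests, and the bucket/prefix are placeholders:

```python
import time
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET, PREFIX = "my-bucket", "load-test/"  # placeholders

TARGET_RPS = 3499  # just under the documented 3,500 PUT/s per prefix

def put_loop(n_requests: int) -> None:
    interval = 1.0 / TARGET_RPS
    for i in range(n_requests):
        start = time.monotonic()
        try:
            s3.put_object(Bucket=BUCKET, Key=f"{PREFIX}{i}", Body=b"x")
        except ClientError as exc:
            # Throttling shows up as a 503 SlowDown error.
            print(f"request {i}: {exc.response['Error']['Code']}")
        elapsed = time.monotonic() - start
        if elapsed < interval:
            time.sleep(interval - elapsed)  # pace to the target rate
```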

AWS Application Load Balancer (ALB): How many http2 persistent-connections can it keep alive at the same time?

Essentially what the subject says.
I'm new to this sport and need some high-level pieces of information to figure out the behaviour of ALB towards http2 persistent connections.
I know that ALB supports http2 persistent connections:
https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout
I can't find anything in the docs explaining how the size(s) of the http2-connection-pools (maintained by ALB) are configured (if at all). Any links on this specific aspect?
Does the ALB, by default, maintain a fixed-size http2-connection-pool between itself and the browsers (clients), or are these pools dynamically sized? If they are fixed-size, how big are they by default? If they are dynamic, what rules govern their expansion/contraction, and what's the maximum number of persistent http2-connections they can hold? 30k? 40k?
Let's assume we have 20k http2-clients that run single-page-applications (SPAs) with sessions lasting up to 30mins. These clients need to enjoy ultra-low latency for their semi-frequent http2-requests through AWS ALB (say 1 request per 4secs which translates to about 5k requests/second landing on the ALB):
Does it make sense to configure the ALB to have a hefty http2-connection-pool, so as to ensure that all 20k http2-connections from our clients are indeed kept alive throughout the lifetime of the client session?
Reasoning: this way no http2-connection is ever closed and reopened, which guarantees lower jitter, because re-establishing an http2-connection involves some extra latency (at least that's my intuition about this, and I'd be happy to stand corrected if I'm missing something).
I asked this question in the amazon forums:
https://repost.aws/questions/QULRcA_-73QxuAOyGYWhExng/aws-application-load-balancer-and-http-2-persistent-connections-keep-alive
And I got this answer, which covers every aspect of the question in good detail:
<< So, when it comes to the concurrent connection limits of an Application Load Balancer, there are no upper limits on the amount of traffic it can serve; it can scale automatically to meet the vast majority of traffic workloads.
An ALB will scale up aggressively as traffic increases, and scale down conservatively as traffic decreases. As it scales up, new higher capacity nodes will be added and registered with DNS, and previous nodes will be removed. This effectively gives an ALB a dynamic connection pool to work with.
When working with the client behavior you have described, the main attribute you'll want to look at when configuring your ALB will be the Connection Idle Timeout setting. By default, this is set to 60 seconds, but can be set to a value of up to 4000 seconds. In your situation, you can set a value that will meet your need to maintain long-term connections of up to 30 minutes without the connection being terminated, in conjunction with utilizing HTTP keep-alive options within your application.
As you might expect, an ALB will start with an initial capacity that may not immediately meet your workload. But as stated above, the ALB will scale up aggressively, and scale down conservatively, scaling up in minutes, and down in hours, based on the traffic received. I highly recommend checking out our best practices for ELB evaluation page to learn more about scaling and how you can test your application to better understand how an ALB will behave based on your traffic load. I will highlight from this page that depending on how quickly traffic increases, the ALB may return an HTTP 503 error if it has not yet fully scaled to meet traffic demand, but will ultimately scale to the necessary capacity. When load testing, we recommend that traffic be increased at no more than 50 percent over a five minute interval.
When it comes to pricing, ALBs are charged for each hour that the ALB is running, and the number of Load Balancer Capacity Units (LCU) used per hour. LCUs are measured based on a set of dimensions on which traffic is processed; new connections, active connections, processed bytes, and rule evaluations, and you are charged based only on the dimension with the highest usage in a particular hour.
As an example using the ELB Pricing Calculator, assuming the ~20,000 connections are ramped up by 10 connections per second, with an average connection duration of 30 minutes (1800 seconds) and sending 1 request every 4 seconds for a total of 1GB of processed data per hour, you could expect a rough cost output of:
1 GB per hour / 1 GB processed bytes per hour per LCU (EC2 instances and IP addresses as targets) = 1 processed-bytes LCU
10 new connections per second / 25 new connections per second per LCU = 0.40 new-connections LCUs
10 new connections per second x 1,800 seconds = 18,000 active connections
18,000 active connections / 3,000 connections per LCU = 6 active-connections LCUs
1 rule per request - 10 free rules = -9 paid rules per request; max(-9, 0) = 0 rule-evaluation LCUs
max(1 processed-bytes LCU, 0.40 new-connections LCUs, 6 active-connections LCUs, 0 rule-evaluation LCUs) = 6 maximum LCUs
1 load balancer x 6 LCUs x 0.008 USD per LCU-hour x 730 hours per month = 35.04 USD
Application Load Balancer LCU usage charges (monthly): 35.04 USD
>>
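The quoted arithmetic can be reproduced in a few lines; the per-LCU rates (25 new connections/s, 3,000 active connections, 1 GB processed per hour) are the published dimensions for EC2 targets, and the traffic numbers come from the example above:

```python
# Reproduces the quoted LCU estimate (EC2 instances / IP addresses as targets).
new_conn_lcus = 10 / 25            # 10 new connections/s at 25 per LCU -> 0.40
active_conns = 10 * 1800           # 30-minute sessions -> 18,000 concurrent
active_lcus = active_conns / 3000  # 3,000 active connections per LCU -> 6.0
bytes_lcus = 1 / 1                 # 1 GB/hour at 1 GB per LCU -> 1.0
rule_lcus = max(1 - 10, 0)         # first 10 rule evaluations are free -> 0

# An ALB bills only the highest of the four dimensions in each hour.
lcus = max(new_conn_lcus, active_lcus, bytes_lcus, rule_lcus)
monthly = 1 * lcus * 0.008 * 730   # 1 ALB x $0.008 per LCU-hour x 730 h/month
print(f"{lcus:g} LCUs -> {monthly:.2f} USD/month")  # 6 LCUs -> 35.04 USD/month
```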

Why does AWS RDS still show burst balance > 0 with a 2 TB gp2 disk?

According to what I know about gp2 from the AWS docs (link), gp2 disks have burst capability when they are smaller than 1000 GB.
Once a disk is bigger than 1000 GB, its baseline performance exceeds the 3000 IOPS burst performance, so the "burst" term should no longer apply.
However, as I see on my current prod database with 2 TB of gp2 storage, burst balance still somehow applies to me, and storage is considerably faster while the burst balance is above 0.
Apparently the AWS burst rules have changed. Does anybody know the current rules, so I can plan my hardware accordingly?
I made a request to AWS support about this.
It was a lengthy thread in which I learned several important facts.
I have saved my conversation at this link, so it's not lost to the community.
Answer: burst balance may still apply to storage bigger than 1 TB, because several volumes may be used to serve the storage space. If a volume is smaller than 1 TB, burst balance is utilized for that volume.
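To make that concrete, a small sketch of the gp2 rules as support described them; the four-volume stripe is purely an assumption, since RDS doesn't expose how it splits the storage:

```python
def gp2_volume_perf(size_gb: int) -> tuple[int, int]:
    """Baseline and burst IOPS for a single gp2 volume, per the EBS docs."""
    baseline = min(max(3 * size_gb, 100), 16_000)   # 3 IOPS/GB, floor 100
    burst = 3_000 if size_gb < 1_000 else baseline  # burst only below 1 TB
    return baseline, burst

# Assumption: RDS stripes the 2 TB of storage across four 500 GB volumes.
base, burst = gp2_volume_perf(500)
print(f"4 x 500 GB: baseline {4 * base} IOPS, burst up to {4 * burst} IOPS")
# -> baseline 6000 IOPS, burst up to 12000 IOPS, consistent with the
#    steady 6K and the >11K spikes discussed below
```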
Other facts that were unclear to me:
The database may look like it's capped by IOPS limits (due to internal IOPS merge operations), but in reality it may be capped by network throughput.
Network throughput is guaranteed by EBS-Optimized. In the RDS docs you won't find explicit tables showing how instance classes relate to throughput, but they're in the EBS docs.
For some Nitro-based instances, EBS-Optimized allows working at the class's maximum throughput for 30 minutes every 24 hours. For smaller instances this means that for 30 minutes the database's performance may skyrocket compared to its poor baseline.
I've run into the same issue with EFS: provisioning enough capacity (storage and throughput) is one thing; provisioning burst capacity is something else. In this case it appears you are exceeding your burst capacity here as well. If you have a read-heavy application, consider using a read replica or a caching scheme. Alternatively, you can increase your 2 TB disk to 4 TB or look into a provisioned-IOPS solution.
From the screen capture, I can see that AWS is already delivering the performance they promised for your instance (6K IOPS, consistently).
So the remaining question is why there is still burst performance that lets you go above 11K IOPS (the 7:00-9:00 timeframe) for a limited time.
My guess is that the 3K IOPS burst limit only applies to volumes smaller than 1 TB. For bigger volumes, you can burst up to "baseline performance + 3K IOPS" (around 9K in your case) until the I/O credits run out. I haven't seen any documentation on this, though.

Azure Event Hub - Outgoing messages/bytes don't increase after increasing throughput units and IEventProcessor instances

I have an Event Hub instance with 200 partitions (Dedicated cluster). Originally I had a consumer group with 70 instances of IEventProcessor + 1 throughput unit.
It appears I can only get 30M outgoing messages per hour, while incoming is double that amount. So I increased to 20 throughput units and 100 processor instances, but the outgoing messages still don't increase beyond 30M, and I don't see any throttling messages.
Are there other EventHub limits that I should adjust here?
EDIT 1:
After setting the prefetch and batch size to 1000, I still only see a moderate increase (screenshot on Imgur).
A couple of things I can recommend checking and doing:
Increase the batch and prefetch sizes (see the sketch after this list).
Check client-side resources like CPU and available memory; high utilization there can become a bottleneck.
If the hosts are in a different region than the Event Hub, network latency can slow down receives; co-locate them if so.
Consider creating a support ticket so the product group (PG) can do a proper investigation.
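On the first point, a sketch of where those knobs live, shown with the Python SDK (azure-eventhub v5) since the poster's client library isn't stated beyond IEventProcessor; the connection string and hub name are placeholders:

```python
from azure.eventhub import EventHubConsumerClient

client = EventHubConsumerClient.from_connection_string(
    "<connection-string>",      # placeholder
    consumer_group="$Default",
    eventhub_name="my-hub",     # placeholder
)

def on_event_batch(partition_context, events):
    # Process the whole batch in one callback; len(events) <= max_batch_size.
    print(f"partition {partition_context.partition_id}: {len(events)} events")

with client:
    client.receive_batch(
        on_event_batch=on_event_batch,
        max_batch_size=1000,    # the batch size raised to 1000 in EDIT 1
        prefetch=1000,          # the prefetch size raised to 1000 in EDIT 1
    )
```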

EC2 - Huge latency in saving streaming data to local file

I've written some Python code to receive streaming tick trading data from an API (the TWS API of Interactive Brokers, via IB Gateway) and append the data to a file on the local machine. On a daily basis, the amount of data is roughly no more than 1 GB, but that 1 GB of streaming data per day is composed of several million read/write operations of a few hundred bytes each.
When I run the code on my local machine, the latency between the timestamp associated with the received tick data and the moment the data is appended to the file is on the order of 0.5 to 2 seconds. However, when I run the same code on an EC2 instance, the latency explodes to minutes or even hours.
At 4:30 UTC the markets open. The first chart shows that the latency is not due to RAM, CPU or, presumably, IOPS. The volume type is gp2, with 100 IOPS for the t2.micro and 900 IOPS for the m5.large.
How can I find what's causing the huge latency?
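One way to start narrowing this down is to instrument the append path itself, separating API-delivery delay from disk-write delay; a minimal sketch, assuming each tick carries an epoch timestamp:

```python
import time

def append_tick(fh, tick_ts: float, payload: str) -> None:
    """Append one tick and log delivery vs. write latency separately."""
    arrived = time.time()
    fh.write(payload + "\n")
    fh.flush()                  # push the write out of Python's userspace buffer
    written = time.time()
    # If "delivery" dominates, the bottleneck is upstream of the disk (API,
    # network, or the event loop); if "write" dominates, it's the EBS volume.
    print(f"delivery {arrived - tick_ts:.3f}s, write {written - arrived:.3f}s")

with open("ticks.log", "a") as fh:
    append_tick(fh, time.time() - 0.5, "sample tick")  # hypothetical tick
```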