We're gathering logs into Graylog, which stores them in Elasticsearch. Three i3.xlarge nodes hold the most recent 30 days of logs, and three r5a.xlarge nodes hold another 700 days of logs older than one month. The warm-storage drives are 6144 GiB gp3 EBS volumes with 3000 IOPS and 125 MiB/s throughput.
These six instances alone consume about two thirds of our monthly budget.
I'd like to know whether the storage IOPS limits on either instance type are actually being saturated, since I'm looking for possible savings.
CloudWatch gives me numbers like:
Up to 40k DiskReadOps on instance store volumes
Up to 29k DiskWriteOps on instance store volumes
Up to 30k EBSReadOps on the warm-log EBS volumes
Up to 27k EBSWriteOps on the warm-log EBS volumes
Given the EBS limits, I'm not quite sure how to interpret these numbers: do they mean that only 3k of 30k requests per second get processed and the rest are queued? Or is there some burst capability even with gp3?
Is my storage saturated IOPS-wise, or is there room for optimization?
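For reference, here is a minimal boto3 sketch of how these figures can be turned into average IOPS (the instance ID is a placeholder). The key point is that CloudWatch reports these metrics as operation counts per period, not per-second rates, so the Sum has to be divided by the period length before comparing it against the 3000 IOPS limit:

    # Minimal sketch (assumes boto3 credentials are configured; the
    # instance ID is a placeholder): convert per-period op counts to IOPS.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    end = datetime.now(timezone.utc)
    period = 300  # seconds; CloudWatch sums the op counts over each period

    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="EBSReadOps",  # DiskReadOps for instance-store volumes
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
        StartTime=end - timedelta(hours=6),
        EndTime=end,
        Period=period,
        Statistics=["Sum"],
    )

    for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
        # e.g. a Sum of 30,000 ops over 300 s is only 100 IOPS on average,
        # far below the 3000 IOPS gp3 baseline.
        print(point["Timestamp"], f"{point['Sum'] / period:,.0f} IOPS")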
Related
According to what I know about gp2 from the AWS docs (link), gp2 volumes have burst capability when they are smaller than 1000 GB.
Once a volume is larger than 1000 GB, its baseline performance exceeds the 3000 IOPS burst performance, so the "burst" term should no longer apply.
However, as I see on my current production database with 2 TB of gp2 storage, burst balance still somehow applies to me, and the storage is considerably faster while the burst balance is above 0.
Apparently the AWS burst terminology has changed. Does anybody know the current rules, so I can plan my hardware accordingly?
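To make the documented model concrete, here is a small sketch of the gp2 math as I understand it from the docs (my own reading, not an official formula):

    # gp2 model as documented (my reading): baseline is 3 IOPS per GiB,
    # floored at 100 and capped at 16,000 IOPS; volumes below 1,000 GiB
    # can burst to 3,000 IOPS while credits last.

    def gp2_baseline_iops(size_gib: int) -> int:
        return min(max(100, 3 * size_gib), 16_000)

    def gp2_burst_relevant(size_gib: int) -> bool:
        # At 1,000 GiB the baseline reaches the 3,000 IOPS burst ceiling,
        # so bursting should stop mattering beyond that size.
        return gp2_baseline_iops(size_gib) < 3_000

    for size in (100, 500, 1_000, 2_000):
        print(f"{size} GiB: baseline {gp2_baseline_iops(size)} IOPS, "
              f"burst relevant: {gp2_burst_relevant(size)}")
    # 2,000 GiB -> baseline 6,000, burst supposedly irrelevant -- yet my
    # 2 TB volume still shows a burst balance, which is exactly my question.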
I made a request to AWS support about this.
It was a lengthy thread in which I learned several important facts.
I have saved my conversation at this link, so it isn't lost to the community.
Answer: burst balance may still apply to storage larger than 1 TB, because the storage space may be served by several volumes. If an individual volume is smaller than 1 TB, burst balance is utilized for that volume.
Other facts that were unclear to me:
The database may look like it's capped by IOPS limits (due to internal merging of I/O operations), but in reality it may be capped by network throughput.
Network throughput is guaranteed by EBS-Optimized. The RDS docs have no explicit tables showing how instance types relate to throughput, but they are in the EBS docs.
For some Nitro-based instances, EBS-Optimized allows running at the maximum throughput for the class for only 30 minutes in each 24 hours. For smaller instances this means the database may show skyrocketing performance for 30 minutes, compared to a poor baseline.
I've run into that issue with EFS: provisioning enough capacity (storage and throughput) is one thing, provisioning burst capacity is something else. In this case it appears you are running into the same issue, exceeding your burst capacity. If you have a read-heavy application, consider using a replica or a caching scheme. Alternatively, you can increase your 2 TB disk to 4 TB or look into a Provisioned IOPS solution.
From the screen capture, I can see that AWS is already delivering the performance promised for your instance (6K IOPS, consistently).
So the remaining question is why there is still burst behavior that lets you exceed 11K IOPS for a limited time (the 7:00-9:00 timeframe).
My guess is that the 3K IOPS burst ceiling applies only to volumes smaller than 1 TB. For bigger volumes, you can burst up to "baseline performance + 3K IOPS" (around 9K in your case) until the I/O credits run out. I have not seen any documentation on this, though.
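In numbers (assuming the standard 3 IOPS/GiB gp2 baseline), that guess works out like this:

    # Worked numbers for the guess above (3 IOPS per GiB assumed for gp2).
    size_gib = 2_000
    baseline = 3 * size_gib            # 6,000 IOPS: the consistent 6K observed
    burst_ceiling = baseline + 3_000   # ~9,000 IOPS while I/O credits last
    print(baseline, burst_ceiling)     # 6000 9000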
I've written some Python code to receive streaming tick trading data from an API (the TWS API of Interactive Brokers, via IB Gateway) and append the data to a file on the local machine. On a daily basis, the amount of data is roughly no more than 1 GB. However, that 1 GB of streaming data per day is composed of several million read/write operations of a few hundred bytes each.
When I run the code on my local machine, the latency between the timestamp of the received tick data and the moment the data is appended to the file is on the order of 0.5 to 2 seconds. However, when I run the same code on an EC2 instance, the latency explodes to minutes or hours.
At 4:30 UTC the markets open. The first chart shows that the latency is not due to RAM, CPU, or presumably IOPS. The volume type is gp2, with 100 IOPS for the t2.micro and 900 IOPS for the m5.large.
How can I find what's causing the huge latency?
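For what it's worth, this is roughly how I timestamp things to split the latency into a network part and a write part (a simplified sketch; the real code uses the TWS API callbacks, and the file path is a placeholder):

    # Simplified sketch of the append path, instrumented to separate
    # network delay from write delay.
    import time

    LOG_PATH = "ticks.csv"  # placeholder

    def on_tick(tick_timestamp: float, payload: str) -> None:
        received = time.time()
        with open(LOG_PATH, "a") as f:   # one small append per tick
            f.write(f"{tick_timestamp},{payload}\n")
        written = time.time()
        print(f"network: {received - tick_timestamp:.3f}s, "
              f"write: {written - received:.3f}s, "
              f"total: {written - tick_timestamp:.3f}s")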
I am using immutable deployments.
I have included the script described at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/mon-scripts.html to monitor EC2 memory at the Auto Scaling group level.
Now, with every deployment, CloudWatch metrics are created for the temporary Auto Scaling group.
How do I delete them when the temporary Auto Scaling group is deleted?
Or, more generally, how do I delete CloudWatch metrics created during a deployment?
I ask because my metric list will grow with every deployment.
It is not possible to delete metrics from Amazon CloudWatch. Metrics will eventually rotate out.
Yes, this will increase the list of metrics, but typically AWS users query for a specific subset of metrics, so it doesn't matter how many different metrics are actually being stored by CloudWatch.
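As an illustration, a minimal boto3 sketch (the namespace is the one used by the monitoring scripts; the Auto Scaling group name is a placeholder) that lists only the metrics for the group currently in use:

    # Minimal sketch: instead of deleting stale metrics, filter the listing
    # down to the Auto Scaling group that is currently live.
    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

    paginator = cloudwatch.get_paginator("list_metrics")
    for page in paginator.paginate(
        Namespace="System/Linux",  # namespace used by the monitoring scripts
        MetricName="MemoryUtilization",
        Dimensions=[{"Name": "AutoScalingGroupName", "Value": "my-live-asg"}],
    ):
        for metric in page["Metrics"]:
            print(metric["MetricName"], metric["Dimensions"])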
From the CloudWatch FAQs:
Data points with a period of less than 60 seconds are available for 3 hours. These data points are high-resolution custom metrics.
Data points with a period of 60 seconds (1 minute) are available for 15 days
Data points with a period of 300 seconds (5 minute) are available for 63 days
Data points with a period of 3600 seconds (1 hour) are available for 455 days (15 months)
Data points that are initially published with a shorter period are aggregated together for long-term storage. For example, if you collect data using a period of 1 minute, the data remains available for 15 days with 1-minute resolution. After 15 days this data is still available, but is aggregated and is retrievable only with a resolution of 5 minutes. After 63 days, the data is further aggregated and is available with a resolution of 1 hour. If you need availability of metrics longer than these periods, you can use the GetMetricStatistics API to retrieve the datapoints for offline or different storage.
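As a sketch of that last suggestion (assuming boto3; the metric name and instance ID are placeholders), 1-minute datapoints can be pulled one day at a time before the 15-day window closes, since GetMetricStatistics returns at most 1,440 datapoints per call:

    # Sketch: archive 1-minute datapoints before they are aggregated away.
    # GetMetricStatistics returns at most 1,440 datapoints per call, so we
    # fetch one day (1,440 minutes) at a time.
    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
    now = datetime.now(timezone.utc)

    for day in range(14):
        end = now - timedelta(days=day)
        resp = cloudwatch.get_metric_statistics(
            Namespace="System/Linux",
            MetricName="MemoryUtilization",
            Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
            StartTime=end - timedelta(days=1),
            EndTime=end,
            Period=60,                 # 1-minute resolution
            Statistics=["Average"],
        )
        for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
            print(point["Timestamp"].isoformat(), point["Average"])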
We're using RDS on a db.t2.large instance, and an Auto Scaling group of EC2 instances writes data to the database during the day. In rush hours we handle about 50,000 HTTP requests, each of which reads/writes MySQL data.
This varies each day, but for today's example, during a single hour:
We're seeing "Connect Error (2002) Connection timed out" from our PHP instances, about 187 times a minute.
RDS CPU won't rise above 50%.
DB connections won't go above 30 (max is set to 5000).
Free storage is ~300 GB (the disk is deliberately large to provide high IOPS).
Write IOPS hit 1500 in burst but drop to 900 after rush hours, once the burst allowance has expired.
Read IOPS spike to 300 every 10 minutes and hover around 150 in between.
Disk write throughput averages between 20 and 25 MB/sec.
Disk read throughput is between 0.75 and 1.5 MB/sec.
CPU credit balance is around 500, so we have no need for CPU burst.
And when it comes to the network, I see a potential limit we're hitting:
Network receive throughput reaches 1.41 MB/sec and stays around 1.5 MB/sec during an hour.
During this time network transmit throughput is 5 to 5.2 MB/sec, with drops to 4 MB/sec every 10 minutes, which coincides with our cron jobs that process data (mainly reading).
I've tried placing the EC2 instances in the same and in different AZs, but this has no effect.
During this time I can connect fine from my local workstation via an SSH tunnel (EC2 -> RDS), and from the EC2 instances to RDS as well.
The PHP scripts are set to time out after 5 seconds of trying to connect, to ensure a fast response. I've now increased this limit to 15 seconds for some scripts.
But which limit are we hitting on RDS? Before we start migrating or changing instance types, we'd like to know the source of this problem. I've also just enabled Enhanced Monitoring to get more details on this issue.
If more info needed, I'll gladly elaborate where needed.
Thanks!
Update 25/01/2016
On datasage's recommendation we increased the RDS disk size to 500 GB, which gives us 1500 IOPS with 3600 burst. It now uses around 1200 IOPS (so it isn't even bursting), and the time-outs still occur.
Connection time-outs are set to 5 and 15 seconds as mentioned before; this makes no difference.
Update 26/01/2016
RDS Screenshot from our peak hours:
Update 28/01/2016
I've changed the sync_binlog setting to 0, because I initially thought we were hitting the EBS throughput limits (GP-SSD, 160 Mbit/s). This gives us a significant drop in disk throughput, and the IOPS are lower as well, but we still see the connection time-outs occur.
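(For reference, since RDS exposes no my.cnf, the change goes through the DB parameter group; a minimal boto3 sketch with a placeholder group name:)

    # Minimal sketch: set sync_binlog=0 on RDS via the DB parameter group.
    # sync_binlog is a dynamic parameter, so it can be applied immediately.
    import boto3

    rds = boto3.client("rds", region_name="eu-west-1")
    rds.modify_db_parameter_group(
        DBParameterGroupName="my-mysql-params",  # placeholder
        Parameters=[{
            "ParameterName": "sync_binlog",
            "ParameterValue": "0",
            "ApplyMethod": "immediate",
        }],
    )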
When we plot the times at which the errors occur, we see that each minute, at around :40 seconds, the time-outs start and continue for about 25 seconds; then there are no errors for about 35 seconds, and it starts again. This happens during the peak hour of our incoming traffic.
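A minimal sketch of a probe that makes such a pattern visible (our actual plot came from the PHP error logs; the endpoint below is a placeholder):

    # Probe the RDS endpoint once per second with a plain TCP connect and
    # log how long each attempt takes; with the pattern above, failures
    # cluster around the same seconds of every minute.
    import socket
    import time
    from datetime import datetime, timezone

    HOST = "mydb.xxxxxxxx.eu-west-1.rds.amazonaws.com"  # placeholder
    PORT = 3306

    while True:
        started = time.time()
        try:
            socket.create_connection((HOST, PORT), timeout=5).close()
            status = f"ok in {time.time() - started:.2f}s"
        except OSError as exc:
            status = f"FAILED after {time.time() - started:.2f}s ({exc})"
        print(datetime.now(timezone.utc).strftime("%H:%M:%S"), status)
        time.sleep(1)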
Apparently it was the network performance holding us back. When we upgraded our RDS instance to an m4.xlarge (with high network performance), the issues were resolved.
This was a last resort for us, but it solved our problem in the end.
I browsed the Amazon RDS pricing site today and now want to know how they actually calculate the I/O rate. What does "$0.10 per 1 million requests" really mean?
Can anyone give a simple example of how many I/Os a simple query from EC2 to MySQL on RDS produces?
In general, it is the price for the EBS storage service. Amazon states something like this for EBS (section "Projecting Costs"):
As an example, a medium sized website database might be 100 GB in size and expect to average 100 I/Os per second over the course of a month. This would translate to $10 per month in storage costs (100 GB x $0.10/month), and approximately $26 per month in request costs (~2.6 million seconds/month x 100 I/O per second x $0.10 per million I/O).
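Worked through in numbers (assuming a 30-day month), the quoted example comes out as:

    # The quoted pricing example, worked through (30-day month assumed).
    seconds_per_month = 30 * 24 * 3600                   # 2,592,000 ~ "2.6 million"
    storage_gb, avg_iops = 100, 100

    storage_cost = storage_gb * 0.10                     # $0.10 per GB-month
    io_cost = seconds_per_month * avg_iops / 1e6 * 0.10  # $0.10 per million I/Os

    print(f"storage ${storage_cost:.2f}/month, I/O ${io_cost:.2f}/month")
    # storage $10.00/month, I/O $25.92/month -- the "~$26" in the quote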
If you have a running application on Linux, here is an article on how to measure EBS costs: