Understanding IOPS and IOPS Burst - amazon-web-services

OK, so I am going through A Cloud Guru's course for the Solutions Architect Associate and I am having trouble understanding what IOPS bursts are. Here are the notes from the course:
EBS Volume Types
General Purpose SSD (GP2)
General purpose, balances both price and performance.
Ratio of 3 IOPS per GB with up to 10,000 IOPS and the ability to burst up
to 3000 IOPS for extended periods of time for volumes under 1 TiB.
After doing some research I understand IOPS to mean input/output operations, meaning reads and writes to disk, I assume. What I don't understand is what it means to have 3 IOPS per gig. Does that mean that for every gig of space on the drive you can read/write 3 times to the disk? That doesn't seem right. The other part I don't understand is what "the ability to burst" means. My guess is that it means how much can be read/written at once over the course of the read/write operation, but I'm just guessing.

Actually, IOPS means Input/Output Operations Per Second. When you choose your EBS volume type it comes with a baseline IOPS value, meaning that the number of operations per second is limited by the volume architecture.
With a ratio of 3 IOPS per GiB, the baseline scales with the size of the volume: a 100 GiB volume gets a baseline of 300 input/output operations per second. There are some technical details here, about block sizes and so on, that I'm not the best person to explain. But this is a summary.
One thing you must understand is that IOPS is not just a technical thing but also a commercial thing for Amazon. They sell EBS with a limited (baseline) IOPS, but if you need more IOPS you can pay extra and create volumes with Provisioned IOPS (which can go up to 20,000).
About the burst: what I describe here is not exactly how it works, but it may help you understand a little more than you do now. Some volumes can raise their IOPS for a limited period of time when needed, at no extra cost. This is a temporary burst, and it normally lasts on the order of minutes to tens of minutes. It means that if the volume receives requests that exceed your defined IOPS, the service will provide more than your defined IOPS for a brief period to keep your service quality consistent. But if you need it for a long period, you will need to pay for it.
I think this burst is based on some type of credits: if you don't use your available IOPS for a while, the volume accumulates credits to spend on bursts. But this last point should be confirmed by more experienced people.
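The "3 IOPS per GiB" ratio and the credit idea can be sketched as plain arithmetic. This is a rough illustration, not AWS code: the function names are made up for this example, the 100 IOPS floor and the 5.4 million credit bucket are the figures given in the EBS documentation for gp2, and the 10,000 IOPS ceiling is simply the number from the course notes above (AWS has changed that ceiling over time).

```python
import math

GP2_IOPS_PER_GIB = 3            # baseline ratio quoted in the course notes
GP2_MIN_BASELINE = 100          # floor for very small volumes (EBS docs)
GP2_BURST_IOPS = 3000           # burst ceiling quoted in the course notes
GP2_CREDIT_BUCKET = 5_400_000   # I/O credits in a full bucket (EBS docs)


def gp2_baseline_iops(size_gib, max_iops=10_000):
    """Baseline IOPS of a gp2 volume: 3 per GiB, clamped to [100, max_iops]."""
    return min(max(GP2_IOPS_PER_GIB * size_gib, GP2_MIN_BASELINE), max_iops)


def gp2_burst_seconds(size_gib):
    """Seconds a full credit bucket lasts while running at 3000 IOPS.

    Credits drain at (burst rate - baseline rate) because the bucket is
    refilled at the baseline rate while it is being spent.
    """
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return math.inf  # baseline already matches the burst rate
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


if __name__ == "__main__":
    for size in (10, 100, 500, 1000):
        print(size, gp2_baseline_iops(size), gp2_burst_seconds(size))
```

Under these assumptions a 100 GiB volume has a baseline of 300 IOPS, and a full bucket sustains 3000 IOPS for 2000 seconds (about 33 minutes) before the volume drops back to its baseline.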

Related

Why does AWS RDS still show burst balance 0 with disk size 2TB gp2?

According to what I know about gp2 from the AWS docs (link), gp2 disks have burst capability when they are smaller than 1000GB.
Once a disk is larger than 1000GB, its baseline performance exceeds the 3000 IOPS burst performance, so the term "burst" should no longer apply.
However, as I see on my current prod database with 2TB of gp2 storage, burst balance somehow still applies to me, and storage is considerably faster while the burst balance is above 0.
Apparently the meaning of "burst" in AWS has changed. Does anybody know the current terms, so I can plan my hardware accordingly?
I made request to AWS support about this.
It was a lengthy thread where I got to know several important facts.
I have saved my conversation at this link, so it's not lost for the community.
Answer: burst balance may still apply to storage bigger than 1TB, because several volumes may be used to serve the storage space. If a volume is smaller than 1TB, burst balance is used for that volume.
Other facts that were obscure for me:
A database may look like it's capped by IOPS limits (due to the internal IOPS submerge operation), but in reality it may be capped by network throughput.
Network throughput is guaranteed by EBS-Optimized. In the RDS docs you won't find explicit tables showing how instances relate to throughput, but they are there in the EBS docs.
For some of the Nitro-based instances, EBS-Optimized allows working at the maximum throughput for the class for 30 minutes every 24 hours. For smaller instances this means that for 30 minutes the database may show skyrocketing performance compared to its poor baseline.
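To make the "several volumes" answer concrete, here is a small sketch of how the numbers could add up if the 2TB of storage were striped evenly across a handful of gp2 volumes. The stripe count of 4 is an assumption chosen for illustration (the support answer above only says "several volumes"), and `striped_gp2_limits` is a hypothetical helper, not an AWS API.

```python
def striped_gp2_limits(total_gib, stripe_count=4):
    """Aggregate gp2 limits when storage is striped over equal volumes.

    stripe_count is an ASSUMPTION for illustration; the actual number of
    underlying volumes is not stated in the thread.
    """
    per_volume_gib = total_gib / stripe_count
    # gp2 rule of thumb: 3 IOPS per GiB with a 100 IOPS floor
    per_volume_baseline = max(3 * per_volume_gib, 100)
    # a volume whose baseline is below 3000 IOPS can burst up to 3000
    per_volume_burst = max(per_volume_baseline, 3000)
    return {
        "per_volume_gib": per_volume_gib,
        "aggregate_baseline_iops": per_volume_baseline * stripe_count,
        "aggregate_burst_iops": per_volume_burst * stripe_count,
    }


if __name__ == "__main__":
    print(striped_gp2_limits(2000))
```

With four 500 GiB volumes, each volume is below 1TB and therefore still has a burst balance; the aggregate baseline works out to 6000 IOPS and the aggregate burst ceiling to 12000 IOPS. Whether RDS really lays the storage out this way for a given instance should be confirmed with AWS.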
I've run into that issue with EFS: provisioning enough capacity (storage and throughput) is one thing, provisioning burst capacity is something else. In this case it appears that you are running into the same issue: exceeding your burst capacity. If you have a read-heavy application, consider using a replica or a caching scheme. Alternatively, you can increase your 2TB disk to 4TB or look into a Provisioned IOPS solution.
From the screen capture, I can see that AWS is already delivering the performance they promised for your instance (6K IOPS, consistently).
So the remaining question is why there is still burst performance that lets you burst up to > 11K IOPS (the 7:00 - 9:00 timeframe) for a limited time.
My guess is that the 3K IOPS burst limit only applies to instances with less than 1TB of storage. For larger instances, you can burst up to "baseline performance + 3K IOPS" (around 9K in your case) until the I/O credit runs out. I have not seen any documentation on this, though.

How to compute initial Auto-scaling limits for DynamoDb table

Our table has bursty writes, expected once a week. We have auto-scaling enabled, with a provisioned capacity of 5 WCUs and 70% target utilization. This suffices for our off-peak (non-bursty) traffic. However, during the bursty writes, the consumed WCUs reach around 1.5-2k, which leads to a lot of throttled writes and ultimately to failed writes as well.
1) Is auto-scaling suitable for such a use case?
2) If yes, what should our initial provisioned capacity be?
This answer will tell you why auto-scaling is not working for you:
https://stackoverflow.com/a/53005089/4985580
This answer will tell you how you can configure your SDK to retry operations over a much longer period (and therefore stop your operation failures during peak requests):
What should be done when the provisioned throughput is exceeded?
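As a rough picture of what "retry over a much longer period" means, here is a generic capped exponential backoff with jitter. It is a sketch only: `ThrottledError` and `write_with_retries` are hypothetical names invented for this example and are not part of any AWS SDK. The SDKs ship their own retry settings, which are normally the better place to configure this.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for the SDK's throttling exception."""


def backoff_delays(max_retries=10, base=0.05, cap=20.0, rng=random.random):
    """Yield one sleep duration per retry: random in [0, min(cap, base * 2^n))."""
    for attempt in range(max_retries):
        yield rng() * min(cap, base * 2 ** attempt)


def write_with_retries(write, max_retries=10, sleep=time.sleep):
    """Call write() until it succeeds or the retries are used up."""
    for delay in backoff_delays(max_retries):
        try:
            return write()
        except ThrottledError:
            sleep(delay)  # wait longer (on average) after each failure
    return write()  # final attempt; a throttling error now propagates
```

Raising `max_retries` and `cap` stretches the total time a write is allowed to wait, which helps ride out a short throttling window but does not add capacity.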
Ultimately you should probably move your tables to on-demand.
For tables using on-demand mode, DynamoDB instantly accommodates
customers’ workloads as they ramp up or down to any previously
observed traffic level. If the level of traffic hits a new peak,
DynamoDB adapts rapidly to accommodate the workload.
No, auto-scaling is not suitable for your needs. It takes a few minutes to scale up, and each time it does so by increasing your current capacity by a fixed percentage. There's also a limited number of times it scales up or down per day, so you can't get from 5 to 2,000 in a matter of minutes. You may not even get there in a matter of hours.
I'd suggest trying on-demand mode, or manually setting capacity to 2,000 some time before you actually need it (it doesn't really scale instantly).
I strongly advise reading the ENTIRE DynamoDB documentation with regard to best practices for primary keys, GSIs and data architecture. Depending on the size of your table (larger than 10 GB), the 2,000 units may get spread across partitions and you could potentially still have throttled requests.
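The partition remark can be illustrated with a back-of-the-envelope calculation. This sketch assumes the simplest possible model: one partition per 10 GB of data (the figure mentioned above) and provisioned capacity divided evenly between partitions. Real DynamoDB behaviour is more involved (partitions can also split because of throughput, and capacity can be rebalanced), so treat `per_partition_wcu` as a hypothetical helper for intuition only.

```python
import math


def per_partition_wcu(provisioned_wcu, table_size_gb, partition_size_gb=10):
    """WCUs available to one partition under an even-split ASSUMPTION."""
    partitions = max(1, math.ceil(table_size_gb / partition_size_gb))
    return provisioned_wcu / partitions


if __name__ == "__main__":
    # 2,000 WCUs on a 50 GB table: five partitions, 400 WCUs each
    print(per_partition_wcu(2000, 50))
```

Under this model, a burst of writes that all land on one partition key could be throttled at around 400 WCUs even though the table as a whole has 2,000 provisioned.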

Does increasing the index cause the write IOPS of AWS RDS to rise?

Does increasing the index cause the write IOPS of AWS RDS to rise?
The RDS instance I use is a db.m3.xlarge with 50G of storage.
The write IOPS of the instance is currently 120.
The write IOPS peak for 50G of storage is 150 (3 IOPS per GiB × 50 GiB).
According to the official documentation:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_Storage.html
Baseline I/O performance for General Purpose SSD storage is 3 IOPS for each GiB, which means that larger volumes have better performance.
How do I find out why the RDS write IOPS is rising?
Let me try to answer this by dividing it into two parts:
Do I have an I/O problem?
Finding the reason behind high write IOPS on the MySQL (RDS) server
Do I have an I/O problem?
When using AWS RDS, one does not have traditional OS tools such as sysstat, iostat, dstat or sar. The tools for understanding what is happening in RDS are the CloudWatch metrics and the graphs provided.
Read and Write IOPS metrics:
By summing up the ReadIOPS and WriteIOPS you will see how much IOPS your operations consume.
DiskQueueDepth metric: The DiskQueueDepth metric provides the number of outstanding I/Os (read/write requests) waiting to access the disk. If this metric is frequently above 2, you should expect to face performance issues sooner or later.
Using the above two graphs it is easy to identify if you are under-provisioned or over-provisioned in IOPS.
If your DiskQueueDepth is consistently between 0 and 0.5, you are over-provisioned.
If your DiskQueueDepth is consistently above 2, you are under-provisioned.
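The two rules of thumb above can be written down as a small classifier over a list of DiskQueueDepth samples (for example, values exported from CloudWatch). The thresholds 0.5 and 2 come from the answer; the `consistency` parameter, which decides how many samples count as "consistently", is an assumption added for this sketch.

```python
def classify_queue_depth(samples, low=0.5, high=2.0, consistency=0.9):
    """Classify IOPS provisioning from DiskQueueDepth samples.

    consistency is a hypothetical knob: the share of samples that must
    fall in a band before the behaviour is called "consistent".
    """
    if not samples:
        raise ValueError("need at least one DiskQueueDepth sample")
    total = len(samples)
    share_low = sum(1 for s in samples if 0 <= s <= low) / total
    share_high = sum(1 for s in samples if s > high) / total
    if share_high >= consistency:
        return "under-provisioned"
    if share_low >= consistency:
        return "over-provisioned"
    return "inconclusive"
```

Anything that does not clearly fall into either band is reported as inconclusive, which is a cue to look at the ReadIOPS and WriteIOPS graphs as well.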
Finding the reason behind high write IOPS on the MySQL (RDS) server
There are several ways to profile your queries, but as you are using RDS with MySQL, I would recommend using PERFORMANCE_SCHEMA to do it easily, as you won't need external software (some of which is not fully RDS-compatible).
You can refer to this video for an introduction to query profiling, with examples like IOPS and temporary table creation monitoring by query pattern, user and table. For a more specific guide (especially for the configuration of metrics), you can have a look at the official manual and the sys schema documentation.
If you need a quick look at what is going on, you can run SHOW GLOBAL STATUS like 'com\_%'; and SHOW GLOBAL STATUS like 'Hand%'; at regular time intervals to see whether the number of SQL queries per unit of time or the number of engine row operations per unit of time has increased.
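Since the status counters are cumulative, the useful number is the difference between two snapshots divided by the time between them. A minimal sketch, assuming each snapshot has already been fetched into a dict of variable name to value (the fetching itself is left out, and `status_rates` is a name made up for this example):

```python
def status_rates(before, after, interval_seconds):
    """Per-second rate of each counter present in both snapshots.

    Values may be strings, since that is how status rows often arrive
    from a database client.
    """
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {
        name: (int(after[name]) - int(before[name])) / interval_seconds
        for name in after
        if name in before
    }


if __name__ == "__main__":
    t0 = {"Com_insert": "1000", "Handler_write": "5000"}
    t1 = {"Com_insert": "1600", "Handler_write": "11000"}
    print(status_rates(t0, t1, 60))
```

Comparing these rates between a quiet period and a period with high WriteIOPS shows whether the extra writes come with more statements (Com_ counters) or with more row operations per statement (Handler_ counters).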
To conclude, an increase in write IOPS normally may mean extra SQL load (obviously), but also many other things, such as too many temporary tables or worse query plans being executed due to a change in the query optimiser's plan or in your data cardinality/size. It is critical to identify the underlying cause first before taking any action.
Hope this helps!

How to calculate Target Utilization in DynamoDB table?

We know the minimum, maximum provisioned capacity for a certain table.
For example, our minimum capacity is 200 reads per second and our maximum is 1000 reads per second, so what should the target utilization percentage be?
Some background for a complete answer; DynamoDB provides an Autoscaling option for managing throughput capacity. With autoscaling you define a minimum, maximum and target utilization.
DynamoDB Autoscaling will then vary the provisioned throughput between the maximum and minimum bounds you set. It aims to keep the ratio of consumed to provisioned throughput at the target utilization.
Target utilization is the ratio of consumed capacity units to
provisioned capacity units, expressed as a percentage
A good starting point is to ask: why not set target utilization to 100%? This sounds efficient, because you would only be paying for the throughput you use. But there is a problem with this:
DynamoDB auto scaling modifies provisioned throughput settings only
when the actual workload stays elevated (or depressed) for a sustained
period of several minutes
So, imagine your target utilization is 100% and you have increased demand on your table for 15 minutes. For the first 5 minutes you might be saved by burst capacity, in the second lot of 5 minutes you are likely to see database read/write failures as your throughput is exceeded, and then after around 10 minutes Autoscaling should kick in and increase your throughput.
This is the problem you are trying to avoid by setting target utilization (i.e. an increase in demand causing throttling). You need to consider two things:
1) What is the biggest change in throughput capacity usage you see over a time period of 15 minutes expressed as a percentage? Leave this amount of room in your target utilization.
2) How much do you care if you have some database throttling? (i.e. some database read/writes fail?) Adjust your target utilization higher or lower depending on your appetite for cost saving versus throttling.
Let's say you look over one week of data and find that, in any 15 minute period, the largest increase in throughput you see is 20%. That gives you a target utilization of 80% (because then your increased demand fits within the spare provisioned capacity until autoscaling catches up)*. However, let's say you are cautious and you really aren't OK with database throttling, so to be on the safe side, you might go with 70%.
Hope that helps make some decisions. In summary, your target utilization should be a function of how quickly your throughput capacity changes, and how averse you are to throttling.
EDIT: *The maths isn't perfect here, but you get the idea, I think. And it's probably a close enough approximation.
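The reasoning above can be sketched in a few lines. The sample data and function names are made up for illustration, and the samples are assumed to be consumed capacity at 5 minute intervals, so a window of 3 samples covers 15 minutes. Two formulas are shown: the "100% minus the largest increase" approximation used in the answer, and the headroom calculation it approximates (a load of `1 + increase` times the current consumption must still fit within the provisioned capacity).

```python
def largest_relative_increase(consumed, window=3):
    """Largest fractional rise from any sample to one up to `window` later."""
    best = 0.0
    for i, start in enumerate(consumed):
        if start <= 0:
            continue  # cannot express growth relative to zero
        for later in consumed[i + 1 : i + 1 + window]:
            best = max(best, (later - start) / start)
    return best


def target_utilization_simple(increase):
    """The approximation used in the answer: 100% minus the increase."""
    return 1.0 - increase


def target_utilization_headroom(increase):
    """Utilization at which consumption * (1 + increase) still fits."""
    return 1.0 / (1.0 + increase)


if __name__ == "__main__":
    samples = [100, 105, 110, 120, 118]  # hypothetical 5-minute readings
    inc = largest_relative_increase(samples)
    print(inc, target_utilization_simple(inc), target_utilization_headroom(inc))
```

For a 20% increase the approximation gives 80% and the headroom formula gives about 83%, so the simpler rule errs slightly on the cautious side.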

Are Amazon's Provisioned IOPS a hard limit or guaranteed minimum?

Are Provisioned IOPS a hard limit? Could IO spikes exceed the number of provisioned IOPS?
I couldn't easily find a definitive answer on the question. While I pretty much assume that it's enforced as a hard limit the documentation isn't very conclusive.
According to published benchmarking results and some testing of my own, it isn't possible to exceed the number of provisioned IOPS. This very much suggests a hard limit.
UPDATE:
I've just seen an RDS instance with 4000 provisioned IOPS getting around 4400 IOPS for about 20 minutes while concurrently importing dumps (a few GB). Thus Provisioned IOPS are not a hard limit. Still, although it's possible to exceed the provisioned limit, it's probably not a good idea to rely on that.