Creating a complete copy of an AWS EBS volume from a snapshot

Creating a complete copy of an AWS EBS volume from a snapshot - amazon-web-services

To Devs,
I am getting lazy reads when I create an EBS volume from a snapshot and attach it to an EC2 node.
I would like to create an EBS volume with a complete copy so that the first read is not slow.
Is there a way to do this?
Thanks,
Marc

You and everybody else. According to an AWS rep that I talked with at the AWS Summit in NYC, Amazon is well aware of the issue. Of course, there's a difference between "being aware" of an issue and actually fixing it ...
For now, the best you can do is follow the AWS instructions and use dd or fio to touch every block on the device before it's mounted. The benefit of fio is that it will run parallel threads.
Beware that you will be limited by the IO performance of your volume. One IO is 16k on an gp2 volume, so divide your volume size by that to determine how many IOs it will take to touch every block, and then divide that by the IOPS for your volume (taking into account burst IOPS).
For example (and these are rough numbers!), a 1 TB volume will require 67,108,864 IOs to read fully. The default non-provisioned performance of a 1TB gp2 volume is 3,000 IOPS, this will take 22,369 seconds or somewhat more than 6 hours. Smaller volumes will be able to use burst IOPS to get above their basic allotment, but may run into throughput limits.

Related

Why is disk IO on my new AWS EC2 instance so much slower?

I have a regular EC2 instance A with a 200GB SSD filled with data. I used this disk to create an AMI and used that AMI to spin up another EC2 instance B with the same specs.
B started almost instantaneously which surprised me since I thought there would be a delay while AWS copies my 200GB EBS to the SSD corresponding to the new instance. However I noticed IO is extremely slow on B. It takes 3x as long to parse data on B.
Why is this, and how can I overcome this? It's too slow for my application which requires fast disk IO.

This happens because a newly-created EBS volume is built from S3 on-demand: when EC2 first reads a block from that volume it's retrieved from S3. You only get the "full" EBS performance once all blocks have been loaded. This is a huge problem, btw, for big databases restored from snapshot.
One solution may be fast snapshot restore. Although the docs don't describe what's happening behind the scenes, my guess is that they do a parallel disk copy from an existing EBS image. However, you will pay $0.75 per hour per snapshot, and are limited to 10 restores per hour.
Given the use-case that you described in another question, I think that the best solution is to keep an on-demand instance that you start and stop for your job. Assuming you're using Linux, you are charged per-second, so if you only run for 10-20 minutes out of the hour, you'll pay a pro-rated price. And unlike spot instances, you'll know that the machine will always be available and always be able to finish the job.
Another alternative is to just leave the spot instance running. If you're running for a significant portion of every hour, you're not really saving that much by shutting the instance down.

Issue with EBS burst balance AWS

I've got an EBS volume (16GB) attached to a EC2 instance that has full access to an RDS instance. The thing is I've extracted the DB to the RDS instance, so I don't use the EC2 instance for storing the web application database anymore. I did this because I was having a lot of problems with the EBS credits (they were consuming very quickly). I thought that by having the DB on a separate instance (RDS) this will decrease to almost cero the EBS credit consumption because I'm not reading nor writing on the EBS but on the RDS. However, the EBS credits keep consuming (and decrease to 0) every time users access to the web application and I don't understand why. Perhaps is because I still don't fully understand how EBS credit usage works... Can anyone enlighten me with this? Thanks a lot in advance.

You can review volume types including info on their burst credits here. You should also review I/O Characteristics and Monitoring. From that page:
If your I/O latency is higher than you require, check
VolumeQueueLength to make sure your application is not trying to drive
more IOPS than you have provisioned. If your application requires a
greater number of IOPS than your volume can provide, you should
consider using a larger gp2 volume with a higher base performance
level or an io1 volume with more provisioned IOPS to achieve faster
latencies.
You should review that metric and the others it mentions if this is causing you performance problems. If your IOPs are constantly above your baseline and causing them to queue you will always consume credits as fast as they are given.

What does IOPS (in Amazon EBS) mean in practice?

I have some images needed for an app. There are many images (50,000+) but the overall size is small (40 Mb). Initially, I thought I would simply use S3 but it is painfully slow to upload. As a temporary solution, I wanted to attach an EBS containing the images and that would be fine. However, reading a bit about EBS General Purpose (gp2) I noticed the following description:
GP2 is the default EBS volume type for Amazon EC2 instances. These
volumes are backed by solid-state drives (SSDs) and are suitable for a
broad range of transactional workloads, including dev/test
environments, low-latency interactive applications, and boot volumes.
GP2 is designed to offer single-digit millisecond latencies, deliver a
consistent baseline performance of 3 IOPS/GB to a maximum of 10,000
IOPS, and provide up to 160 MB/s of throughput per volume.
It is that 3 IOPS/GB quantity that is worrying me. What does this mean in practical terms? Suppose that you need an e-commerce site for a small amount of users (e.g. < 10,000 requests per minute) and these images need to be retrieved. Amazon describes how IOPS are measured:
When small I/O operations are physically contiguous, Amazon EBS
attempts to merge them into a single I/O up to the maximum size. For
example, for SSD volumes, a single 1,024 KiB I/O operation would count
as 4 operations, while 256 I/O operations at 4 KiB each would count as
256 operations.
Does this actually mean that if I want to retrieve 50 images of 10kB each in under a second, I would require 50 IOPS and easily exceed the baseline of 3 IOPS?
UPDATE:
Thanks to Mark B's suggestion, I was able to use S3 to upload my files. However, I'm still wondering about the amount of IOPS needed to perform common tasks such as running a database or serving other files for a web application. I would be glad to hear some reference values regarding the minimal values of IOPS based on your experience.

You are missing the "/GB" part of that statement. The baseline is 3 IOPS per GB. If your EBS volume is 100GB, then you would have a baseline of 300 IOPS. For a GP2 EBS volume you have to multiple the size of the volume by 3 to get the IOPS.
Note that any GP2 volume under 1TB is also able to burst at up to 3,000 IOPS, so any limited increases in IO should still perform very well.
Also, I will add that S3 sounds like a better fit for your use case. If you are seeing slow upload speeds to S3, that is a problem that can be solved. You can use CloudFront to provide a nearby edge location that you can upload to.
In my experience uploads to S3 are never any slower than uploads to an EC2 instance that your EBS volume would be attached to.
Update:
To answer your additional question, the minimum IOPS needed will depend on many variables such as the amount of RAM available, the type of application you are running, how well the application caches values in memory, the average size of your IO operations, etc. It's really difficult to pin-down an exact number and state that you need exactly X IOPS for an application.
You also need to remember that any volume under 1TB in size can still burst up to 3,000 IOPS for several seconds. So even if your application needs high IOPS when it is in use, if it doesn't see much usage the IOPS burst feature might be all it ever needs.
In general I usually start with something like a 100GB volume with 300 IOPS and test the performance of my app against that. A web server that operates entirely within RAM might never need more than that. For something like a database you would probably start out with the amount of disk space you think you will need and then start performance testing. CloudWatch will show the amount of IOPS your application is using, and if you see it maxing out at the limits of your volume then you would know you need to increase the available IOPS. Rinse and repeat until you no longer max out the available IOPS during your performance tests.

#Mark B's answer is probably correct, in that it points out your IOPs are based on the size of your EBS volume. For what you want, S3 is the best option.
But depending on your use case and requirements, EBS may be needed. This is especially true if you want to run a database. In that case, you have a couple of options.
You can get Provisioned IOPS - if you know you need 5000 IOPS, but only need say 100GB of storage (which with gp2 would normally provide you with around 300 IOPS), you can use io1 volumes. There is an extra cost to this, and you'll want to make sure that it's attached to an EBS optimized instance, but you can get up to 20k IOPS if needed.
If you're doing a lot of sequential reads (reading in a large data set?) then there's a new type of EBS, st1. This is good for 500MB/s, and is less than 1/2 the cost of gp2.
Finally, there's one other scenario you could consider (say, you're a bit of a madman, and want to try doing strange things). If you can grab an archive from somewhere, and all you care about is serving them up from a really fast file system, you could put them on an instance that has instance storage. This is a locally-attached SSD, so it's very fast. The only drawback is that when your instance stops, you data is gone.
To address your update, "how many IOPS do you need for a database", the answer is "it depends". Every database engine has different requirements, and every database use has different usage patterns. Take a look at this if you want more information. But basically, test & monitor. If you're worried, over provision at launch, and scale down as needed. Or take a guess, and increase if you run into problems - is it more important to minimize costs, or provide good performance to your end users?

As per your use case, s3 is a better option but if one wants to use an EBS volume and thinks that they require more IOPS, they can choose gp3 volume type instead of gp2. In gp3 volume, one can increase upto 16,000 IOPS independent of throughput (also, throughput can be increase upto 1000 MiB/s independently of IOPS).

General Purpose SSD (gp2) volumes offer cost-effective storage that is ideal for a broad range of workloads. These volumes deliver single-digit millisecond latencies and the ability to burst to 3,000 IOPS for extended periods of time. Between a minimum of 100 IOPS (at 33.33 GiB and below) and a maximum of 16,000 IOPS (at 5,334 GiB and above), baseline performance scales linearly at 3 IOPS per GiB of volume size. AWS designs gp2 volumes to deliver 90% of the provisioned performance 99% of the time. A gp2 volume can range in size from 1 GiB to 16 TiB.
link:
Link
Sometimes performance also varies:
According to AWS Doc, instance types can support maximum performance for 30 minutes at least once every 24 hours. If you have a workload that requires sustained maximum performance for longer than 30 minutes, select an instance type according to baseline performance
link:
Link

AWS t2.medium performance issues after adding 32 gb volume

On AWS EC2 t2.medium instance we run a site http://www.pricesimply.com/
Our database is installed on the same machine.
By default we had 8 gb storage and the site speed was lightning quick.
Then, we added a 32 gb volume Type - general purpose volume.
Only difference between these 2 volumes -
8 gb default volume - IOPS 24/3000
32 gb new added volume - IOPS 96/3000
Volume type & Availability zones are same.
The site MUCH SLOWER now compared to earlier.

Some random ideas:
1) Performance does vary between volumes. Do some benchmarks to see if it's really slower. (Unlikely, but possible.)
2) Perhaps the volume is a red herring -- Maybe your entire dataset was small enough to fit into RAM before, and now that you've expanded and grown, your data doesn't fit, creating constant I/O?
3) If the drive was created from a snapshot, it may be fetching your data from your snapshot in the background, slowing the drive.

Adding an additional disk shouldn't slow down your machine, you'll need to investigate things a bit more to identify the bottleneck.
Poor performing infrastructure generally fits into one of three categories:
CPU: Check your CPU utilization to see if the t2.medium instance is a suitable size. Amazon CloudWatch can show you CPU history.
Memory (RAM): Your application may be short on memory, causing page swaps to disk. You'll need to monitor memory utilization from within your instance. (CloudWatch cannot see memory utilization.)
Disk IO: If you are reading & writing to disk a lot, then this could be your bottleneck. CloudWatch can give you some metrics, especially the Queue Length, which indicates that IO was waiting to be processed.
Once you've identified which of these three factors appears to be the bottleneck, try to improve them:
CPU: Use a larger instance type
Memory: Use an instance type with more RAM
Disk: Use a faster disk
You are using General Purpose (SSD) EBS volumes. These volumes have an IOPS (Input/Output per second) related to volume size. So, your "96/3000" volume gives a guaranteed 96 IOPS (about the speed of a magnetic hard disk) with the ability to burst up to 3000 IOPS if you have enough IO 'credits'. If you are continually using more than 96 IOPS, you will run out of credits and will be limited to 96 IOPS.
See: Amazon EBS Volume Types

Does taking a snapshot of an EBS volume increase reliability?

The EBS documentation states:
As an example, volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume.
..but this doesn't give any indication of the AFR for a volume with, for example:
No snapshot at all; or
A fresh snapshot with no modified data.
I've seen it suggested that missing or damaged blocks can be automatically/silently recovered from snapshots but I can't see any reference to this in the documentation. Is this true?
Can I assume that if I have a volume with no changed data and a fresh snapshot, my AFR for the volume matches S3's reliability?

I took a three day class from AWS last year, and they told us unequivocally that taking snapshots greatly increases the reliability of an EBS volume. They did not explain why that was so, but hinted that EBS volumes store changes from the latest snapshot and that the snapshot itself is very stable (stored in S3). Successive snapshots apparently use little storage, as AWS is smart enough to store diffs.
They did not give any hard numbers on failure rates, though. They suggested configuring multiple EBS volumes using RAID if reliability of the volume is essential. However, they also recommended architecting your application so that it can tolerate failure of any instance, making it less important for each EBS volume to be durable.

Snapshots taken of EBS Volumes are stored in S3. These snapshots get all the durability and availability benefits of S3. You can also copy snapshots to other regions, which is a nice insurance policy against a regional level outage.
If your EBS volume fails, you can then recover from your last snapshot. The more recent your snapshot, the more up-to-date your recovery story is. With the incremental nature of EBS snapshots performing them on a frequent basis is very practical.
EBS also provides "recovery volumes", which you can see from this AWS forum thread.
To my knowledge, the act of taking a snapshot doesn't directly impact the AFR of an active, running EBS volume. Rather, it just makes it easier for you to recover in the event of a failure.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js