Why is disk IO on my new AWS EC2 instance so much slower? - amazon-web-services

I have a regular EC2 instance A with a 200GB SSD filled with data. I used this disk to create an AMI and used that AMI to spin up another EC2 instance B with the same specs.
B started almost instantaneously, which surprised me since I thought there would be a delay while AWS copies my 200GB EBS volume to the disk backing the new instance. However, I noticed that IO is extremely slow on B: it takes 3x as long to parse data on B.
Why is this, and how can I overcome this? It's too slow for my application which requires fast disk IO.

This happens because a newly-created EBS volume is built from its snapshot on demand: the snapshot lives in S3, and when EC2 first reads a block from the volume, that block is retrieved from S3. You only get the "full" EBS performance once all blocks have been loaded. This is, by the way, a huge problem for big databases restored from snapshots.
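If you can tolerate a one-time warm-up after launching B, one workaround is to force every block to be read once before your application needs it (this is what the AWS docs call initializing a volume restored from a snapshot). A minimal sketch, assuming the data volume shows up as /dev/xvdf (the device name is just an example; check lsblk):
# read every block once so it gets pulled down from S3
sudo dd if=/dev/xvdf of=/dev/null bs=1M status=progress
# or do a faster parallel read of every block with fio
sudo fio --filename=/dev/xvdf --rw=read --bs=1M --iodepth=32 --ioengine=libaio --direct=1 --name=volume-initialize
Either way, the first full pass is slow (that is the lazy load happening), but reads after that run at the volume's normal speed.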
One solution may be Fast Snapshot Restore. Although the docs don't describe what's happening behind the scenes, my guess is that they do a parallel disk copy from an existing EBS image. However, you will pay $0.75 per hour per snapshot, and you are limited to 10 restores per hour.
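If you want to try it, enabling Fast Snapshot Restore is a single CLI call per snapshot and Availability Zone; the snapshot ID and AZ below are placeholders:
aws ec2 enable-fast-snapshot-restores --availability-zones us-east-1a --source-snapshot-ids snap-0123456789abcdef0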
Given the use-case that you described in another question, I think that the best solution is to keep an on-demand instance that you start and stop for your job. Assuming you're using Linux, you are charged per-second, so if you only run for 10-20 minutes out of the hour, you'll pay a pro-rated price. And unlike spot instances, you'll know that the machine will always be available and always be able to finish the job.
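A rough sketch of that start/run/stop cycle, driven from a cron job or a small script (the instance ID is a placeholder):
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0
# ... ssh in (or use SSM) and run the job ...
aws ec2 stop-instances --instance-ids i-0123456789abcdef0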
Another alternative is to just leave the spot instance running. If you're running for a significant portion of every hour, you're not really saving that much by shutting the instance down.

Related

What does EC2 store and why does it even need a storage solution like EBS or Instance Store?

If you use EC2 and launch instances, you can add EBS volumes, i.e. a storage option. What I still don't understand is why. Why does EC2 even need a storage option like EBS or Instance Store? What does EC2 actually store? And why does it make sense for EBS to exist?
I know that an EBS volume is persistent block storage and that its data is not lost when the instance stops, unlike an instance store. I just don't really understand what EBS is useful for. For which cases and applications is EBS used? Or does using EBS have more to do with snapshots, which you can create to capture data and save it to S3?
I've already read a lot and tried to make sense of it, but I can't get any further here. I would be really happy if someone could shed some light on this for me.
Thanks in advance!
Think of an Amazon EC2 instance as a normal computer. Inside, there is CPU, RAM and (perhaps) a hard disk.
When an EC2 instance has a hard disk, it is called Instance Store and it behaves just like a normal hard disk in a computer. However, when you turn off the instance and stop paying for it, AWS can give that computer to somebody else. Rather than handing your data to somebody else, the disk is erased. So, anything you stored on the Instance Store is gone! (In truth, the instance store is also a virtualised disk, but this is close enough.)
In fact, in the early days of EC2, this was the only storage available. If you wanted to keep data after the instance was turned off, you first had to copy it to Amazon S3. People didn't like this, so they invented Amazon EBS.
If you want to keep your data so that it is still there when you turn on the instance in future, it needs to be stored on a network disk and that is what Amazon EBS provides. Think of it a bit like a USB drive that you can plug into one computer, then disconnect it and plug it into another computer. However, rather than being a physical device, it uses a storage service that keeps multiple copies of the data (in case a disk fails) and lets you modify the size of the disk. You are charged based on the amount of storage space assigned and how long the data is kept ("GB-Month").
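To make the USB-drive analogy concrete, here is a rough CLI sketch (all IDs and the Availability Zone are placeholders) of creating a volume, attaching it, and later moving it to a different instance:
# create a 100 GiB gp3 volume in the same AZ as the instance
aws ec2 create-volume --availability-zone us-east-1a --size 100 --volume-type gp3
# plug it into one instance
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf
# later: unplug it and plug it into another instance; the data survives the move
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0fedcba9876543210 --device /dev/sdf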
Amazon EBS Snapshots are simply a backup of the disk. A snapshot contains all the data currently on the disk, allowing you to create a new disk anytime that will contain an exact copy of the disk as it was when the snapshot was created. This is great for backups, but is also very useful for creating multiple EC2 instances with the same disk content. An Amazon Machine Image (AMI) is actually just an Amazon EBS Snapshot plus a bit of metadata. When a new EC2 instance is launched, it uses an AMI to populate the boot disk rather than loading the operating system from scratch every time.
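As a hedged sketch of the two operations (IDs and names are placeholders): a snapshot backs up a single volume, while an AMI captures a whole instance so you can launch copies of it:
# back up one volume
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "nightly backup"
# capture an instance as an AMI (its EBS snapshots plus metadata) to launch clones from
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "my-configured-server"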
It is possible to create an AMI that populates an Instance Store disk. This way, you don't actually need to use an Amazon EBS volume. This is good for instances that don't need to permanently keep any data -- they could simply store information in a database or Amazon S3 instead of saving it on disk. Instance Store disks can be very fast since they don't send data across the network, so this is very useful in some situations.
In summary:
Instance Store is a normal disk in a computer (but it gets erased when the instance turns off so nobody else sees your data)
Amazon EBS volumes are network-attached storage that stays around until you delete it

Loading large amount of data from local machine into Amazon Elastic Block Store

I am interested in doing some machine learning using an AWS EC2 instance. I have played around with launching instances with an attached EBS volume, and I was able to load files into it via scp on my local command line. I will have several gigabytes of data to load onto this EBS volume (I know that isn't a lot by ML standards, but that's not really my point). I would like to know what the appropriate way to load this data is. I'm concerned about racking up large fees because I did something in a silly way.
So far I have just uploaded a few files to the EC2 instance's associated EBS manually via the command line, like this:
scp -i keys/ec2-ml-micro2.pem data/BB000000001.png ubuntu@<my instance ip>:/data
This seems to me to be a rather primitive approach (not that that is always a bad thing). Is it the "right" way? I'm not opposed to letting a batch job run overnight like this, but I am not sure if it may incur some data transfer fees. I've looked around for information on this, and I have read the page on EBS pricing. I didn't see anything on costs associated with loading data, but I just wanted to confirm with someone who has done something similar that this is the correct approach, and if not, what a better one would be.
When managing large objects in AWS, always consider S3 as the initial option: it provides effectively unlimited storage capacity and is a better fit for object storage than EBS (block storage). EBS bills you for the size of the volume that you provision, which means there is a chance you over-provision (wasted cost) or under-provision (which can lead to poor performance or even downtime).
With S3 you are billed for the storage you actually consume, per GB per month (a pay-for-what-you-use model), and it is very cheap compared to EBS.
Lastly, evaluate the AWS machine learning services first; one of them might fit your use case and save you a lot of time and effort.
Data Transfer from S3 to EBS within the same region is free of charge.
AWS Pricing Details
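As a rough sketch of that workflow (the bucket name and paths are placeholders), you could stage the dataset in S3 once from your local machine and then pull it onto the instance whenever you need it; the second copy is an in-region transfer:
# from the local machine: upload the dataset to S3
aws s3 sync ./data s3://my-ml-bucket/data
# on the EC2 instance: pull it down onto the attached EBS volume
aws s3 sync s3://my-ml-bucket/data /data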

What AWS service can I use to efficiently process large amounts of S3 data on a weekly basis?

I have a large amount of images stored in an AWS S3 bucket.
Every week, I run a classification task on all these images. The way I'm doing it currently is by downloading all the images to my local PC, processing them, then making database changes once the process is complete.
I would like to reduce the amount of time spent downloading images to increase the overall speed of the classification task.
EDIT2:
I actually am required to process 20,000 images at a time to increase the performance of the classification engine. This means I can't use Lambda, since the maximum RAM available is 3GB and I need 16GB to process all 20,000 images.
The classification task uses about 16GB of RAM. What AWS service can I use to automate this task? Is there a service that can be put on the same VLAN as the S3 Bucket so that images transfer very quickly?
The entire process takes about 6 hours to do. If I spin up an EC2 with 16GB of RAM it would be very cost ineffective as it would finish after 6 hours then spend the remainder of the week sitting there doing nothing.
Is there a service that can automate this task in a more efficient manner?
EDIT:
Each image is around 20-40KB. The classification is a neural network, so I need to download each image so I can feed it through the network.
Multiple images are processed at the same time (batches of 20,000), but the processing part doesn't actually take that long. The longest part of the whole process is the downloading part. For example, downloading takes about 5.7 hours, processing takes about 0.3 hours in total. Hence why I'm trying to reduce the amount of downloading time.
For your purpose you can still use an EC2 instance. And if you have a large amount of data to download from S3, you can attach an EBS volume to the instance.
You need to set up the instance with all the tools and software required for running your job. When you don't have any process to run, you can shut down the instance, and boot it up again when you want to run the process.
EC2 instances are not charged for the time they are in the stopped state. You will be charged for the EBS volume and any Elastic IP attached to the instance.
You will also be charged for the storage of the EC2 image (AMI) in S3.
But I think these costs will be less than the cost of running the EC2 instance all the time.
You can schedule starting and stopping the instance using the AWS Instance Scheduler.
https://www.youtube.com/watch?v=PitS8RiyDv8
You can also use Auto Scaling, but that would be a more complex solution than using the Instance Scheduler.
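Putting the stop/start approach together, a rough sketch of what the per-run job script could look like once the scheduler has started the instance (the bucket, paths and classifier command are hypothetical placeholders):
#!/bin/bash
set -euo pipefail
# pull this week's batch from S3; same region, so this is much faster than downloading to a local PC
aws s3 sync s3://my-image-bucket/batch/ /data/images/
# run the classification job
python3 /opt/classifier/run.py --input /data/images --output /data/results.csv
# push the results back, then shut the OS down; with the default "stop" shutdown behaviour, instance billing stops
aws s3 cp /data/results.csv s3://my-image-bucket/results/
sudo shutdown -h now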
I would look into Kinesis streams for this, but it's hard to tell because we don't know exactly what processing you are doing to the images.

cloning an amazon machine instance

I have two Amazon machine instances running. Both of them are m3.xlarge instances. One of them has the right software and configuration that I want to use. I want to create a snapshot of the EBS volume for that machine and use that as the EBS volume to boot the second machine from. Can I do that and expect it to work without shutting down the first machine?
It is well described in the AWS documentation:
"You can take a snapshot of an attached volume that is in use. However, snapshots only capture data that has been written to your Amazon EBS volume at the time the snapshot command is issued. This might exclude any data that has been cached by any applications or the operating system. If you can pause any file writes to the volume long enough to take a snapshot, your snapshot should be complete. However, if you can't pause all file writes to the volume, you should unmount the volume from within the instance, issue the snapshot command, and then remount the volume to ensure a consistent and complete snapshot."
I use Amazon as well, with 3 different clusters. With one of my clusters, after setting up 25 instances, I realized there was a small issue in the configuration, and I already had live traffic going to them, so I couldn't shut them down.
You can snapshot the first machine's volume while it's still running; I had to do this myself. It took a little while, but ultimately it worked out. Please note that Amazon cannot guarantee the consistency of the disk when doing this.
I did a snapshot of the entire thing, fixed what needed to be fixed, and spooled up 25 new servers and terminated the other 25 (easier than modifying volumes, etc.). But you can create a new volume from the new snapshot, attach it to an instance, and do what needs to be done to get it to boot off that volume without much of a headache.
Being that I went the easy route of spooling up new instances after my snapshot was complete, I can't walk you through how to get a running instance to boot off a new volume.
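For completeness, here is a hedged sketch of the snapshot-to-new-volume path described above (all IDs, names, the AZ and the /data mount point are placeholders). Freezing the filesystem for the moment the snapshot call is issued addresses the consistency caveat, and create-image with --no-reboot is the shortcut if you would rather launch the clone from an AMI:
# optionally pause writes while the point-in-time snapshot is registered
sudo fsfreeze -f /data
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "clone source"
sudo fsfreeze -u /data
# create a new volume from that snapshot in the target AZ and attach it to the second machine
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 --availability-zone us-east-1a
aws ec2 attach-volume --volume-id vol-0fedcba9876543210 --instance-id i-0123456789abcdef0 --device /dev/sdf
# or skip the manual volume juggling: create an AMI from the running instance and launch the clone from it
aws ec2 create-image --instance-id i-0123456789abcdef0 --name "configured-clone" --no-reboot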

Storing data on EC2 vs S3 in AWS

I'm fairly new to AWS and am having trouble understanding what the best practice should be for hosting a product I'm developing...
I have a tool I plan on running on an EC2 instance, but it only needs to run a couple of times a year; the rest of the time I can have the instance stopped and not incur charges. However, the product uses quite a bit of data (approx. 15-25 GB depending on the run). I understand that S3 is meant for storing data long term, but is there any reason why I can't just leave the data on the EC2 instance's volume (even when it's stopped)? Or do I have to do manual copies from S3 every time I want to execute a run?
Even if the instance is off, you are still incurring charges for the EBS volumes. S3 also has the advantage of being somewhat more durable.
It would probably be a good idea to back up your data or snapshot your volume to S3 as a precaution, but I would not worry about transferring data back and forth.
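If you do want that precaution, a snapshot before you stop the instance is a single CLI call (the volume ID is a placeholder), and you only pay for the snapshot storage:
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 --description "backup before stopping for the season"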