Amazon EC2 and EBS using Windows AMIs - amazon-web-services

I put our application on EC2 (Windows 2003 x64 server) and attached up to 7 EBS volumes. The app is very I/O intensive to storage -- typically we use DAS with NTFS mount points (usually around 32 mount points, each to 1TB drives), so I tried to replicate that using EBS, but the I/O rates are bad, as in 22MB/s tops. We suspect the NIC to the EBS volumes (which are dynamic SANs, if I read correctly) is limiting the pipeline. Our app mostly streams to disk (not random access), so for us it works better when very little gets in the way of our talking to the disk controllers and handling I/O directly.
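As a rough back-of-envelope illustration of that suspected bottleneck (a sketch only: the 22MB/s figure is from the observation above, and the network ceiling is an assumed placeholder, not an AWS specification), striping writes across several volumes only helps until the instance's shared network link to EBS saturates:

```python
# Back-of-envelope estimate of aggregate streaming throughput when striping
# across several EBS volumes (e.g. Windows dynamic-disk striping / RAID-0).
# Both numbers below are illustrative assumptions, not AWS specs.

PER_VOLUME_MBPS = 22     # observed single-volume streaming rate (from above)
INSTANCE_NIC_MBPS = 120  # assumed instance network ceiling; EBS traffic shares the NIC

def striped_throughput(n_volumes, per_volume=PER_VOLUME_MBPS, ceiling=INSTANCE_NIC_MBPS):
    """Aggregate rate scales with volume count until the network link saturates."""
    return min(n_volumes * per_volume, ceiling)

for n in (1, 2, 4, 7):
    print(f"{n} volume(s): ~{striped_throughput(n)} MB/s")
```

Under these assumed numbers, going from 4 to 7 volumes buys little, which would match the "NIC is limiting the pipeline" theory.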
Also, when I create a volume and attach it, I see it appear in the instance (fine), then I make it into a dynamic disk pointing to my mount point and quick format it -- when I do this, does all the data on the volume get wiped? Because it certainly seems so when I attach it to another AMI. I must be missing something.
I'm curious whether anyone has experience putting I/O intensive apps up on the EC2 cloud, and if so, what's the best way to set up the volumes?
Thanks!

I've had limited experience, but I have noticed one small thing:
The initial write is generally slower than subsequent writes.
So if you're streaming a lot of data to disk, like writing logs, this will likely bite you. But if you create a big file, fill it with data, and do a lot of random-access I/O against it, performance improves the second time you write to any specific location.
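One common workaround for that first-write penalty is to touch every block once up front, before the real workload starts. A minimal file-level sketch (the path, size, and chunk size are all illustrative; on a real volume you would point this at a file on the mounted EBS device, or pre-read/pre-write the raw device with dd):

```python
# Sketch: pre-touch a file's blocks once so later streaming writes land on
# storage that has already paid the first-write penalty described above.
import os

def prewarm_file(path, size_bytes, chunk=1024 * 1024):
    """Write zeros across the whole file once, fsync, and return bytes written."""
    written = 0
    with open(path, "wb") as f:
        buf = b"\0" * chunk
        while written < size_bytes:
            step = min(chunk, size_bytes - written)
            f.write(buf[:step])
            written += step
        f.flush()
        os.fsync(f.fileno())  # make sure the zeros actually reach the device
    return written
```

You would run this once after attaching and formatting the volume, sized to cover the region your app will stream into.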

Related

VMware snapshots: roughly what percentage of the source VM's disk space do they usually occupy? And can they be used to roll back software?

I would like to update Samba on a 3TB NAS. My boss suggested making a clone, but there is no storage large enough to hold a full copy. If a VM snapshot takes less space and can restore Samba to its previous state in case of failure, it would be the better idea.
There's no real guide on how much space snapshots occupy. That will greatly depend on the activity on the VM where the snapshot has been taken. If it's an active VM (a database or something of the like), there could be a considerable amount of data written. If it's a lightly used VM, there could be little to no data written to the backend datastore.

GCE : Snapshot restore vs Disk Image

Will restoring a snapshot recreate the environment EXACTLY as it was at the point of the snapshot? I am specifically referring to the operating system and installed software.
If not, then I assume that a disk image is the correct approach.
Snapshot and Disk Image use the same process. Both take a point in time copy of a storage device by performing a block copy.
Will either restore exactly (bit-for-bit) as the source? Yes, if you shut down the VM instance. Maybe, if you do not. Google (and AWS, Azure, etc.) strongly recommend that you shut down your VM before these types of operations. The reason is that file system data could be cached in memory and not yet flushed to disk. A consistent snapshot requires that all applications and the OS participate in the snapshot process. Few applications do.
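The cached-writes problem is easy to demonstrate at the file level. A toy sketch, with an ordinary file standing in for a block device and a plain copy standing in for the snapshot (all names here are hypothetical): flushing and fsyncing before the copy is the file-level analogue of shutting the VM down first.

```python
# A point-in-time copy only captures what has actually reached the disk.
# Flush the application's buffer AND the OS page cache before copying.
import os
import shutil

def consistent_copy(src, dst, fh):
    """Flush app buffers and force OS writeback, then take the copy."""
    fh.flush()              # push Python's userspace buffer to the OS
    os.fsync(fh.fileno())   # push the OS page cache down to the device
    shutil.copyfile(src, dst)
```

Without the two flush steps, the copy could miss whatever was still sitting in memory -- which is exactly why an unflushed live snapshot is only "maybe" bit-for-bit identical.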

EBS volume resize precautions

On one of my AWS instances running Ubuntu 16.04, I have a MySQL replica database on a 1TB ext4 EBS volume. I plan to increase it to 2TB. Before I increase the size of the volume and extend the filesystem using the resize2fs command, do I need to take any precautions? Is there any possibility of data corruption? If so, would it be sane to create an EBS snapshot of this volume?
Do I need to take any precautions?
You shouldn't need to take any unusual precautions -- just standard best practices, like maintaining backups and having a tested recovery plan. Anything can go wrong at any time, even when you're sitting, doing nothing.
Important
Before modifying a volume that contains valuable data, it is a best practice to create a snapshot of the volume in case you need to roll back your changes.
https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/ebs-modify-volume.html
But this is not indicative of the operation being especially risky. Anecdotally, I've never experienced complications, and have occasionally resized an EBS volume and then its filesystem under a live, master, production database.
Is there any possibility of data corruption?
The possibility of data corruption is always there, no matter what you are doing... but this seems to be a safe operation. The additional space becomes available immediately, and there is no I/O freeze or disruption.
If so would it be sane to create a EBS snapshot of this volume?
As noted above, yes.
Concerns about errors creeping in later are valid, but EBS maintains internal consistency checks and will disable a volume if they fail, to avoid further scrambling of data, so that you can do a controlled recovery and repair operation.
This would not help if EBS is perfectly storing data that was corrupted by something on the instance, such as a defect in resize2fs, but it seems to be a solid utility. It doesn't move your existing data -- it just fleshes out the filesystem structures as needed so the filesystem can use the entire free space that has become available.
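Putting the steps discussed above together, the grow-in-place sequence looks roughly like this (a sketch only: the volume ID, device name, and sizes are placeholders, and the instance stays running throughout):

```shell
# 1. Take the recommended safety snapshot before modifying the volume.
aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
    --description "pre-resize safety snapshot"

# 2. Grow the EBS volume from 1 TB to 2 TB (size is in GiB).
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 2048

# 3. On the instance: confirm the kernel sees the new size, then grow ext4.
lsblk                      # the block device should now show 2 TB
sudo resize2fs /dev/xvdf   # online resize; existing data is not moved
```

The extra space is usable as soon as resize2fs returns; no unmount or downtime is needed for an ext4 grow.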

Data Intensive process in EC2 - any tips?

We are trying to run an ETL process on a High I/O instance in Amazon EC2. The same process locally on a very well equipped laptop (with an SSD) takes about 1/6th of the time. This process basically transforms data (30 million rows or so) from flat tables into a 3rd normal form schema in the same Oracle instance.
Any ideas on what might be slowing us down?
Or another option is to simply move off of AWS and rent beefy boxes (raw hardware) with SSDs in something like Rackspace.
We have moved most of our ETL processes off of AWS/EMR. We host most of it on Rackspace and are getting a lot more CPU/storage/performance for the money. Don't get me wrong, AWS is awesome, but there comes a point where it's not cost effective. On top of that, you never know how they are really managing/virtualizing the hardware underneath your specific application.
My two cents.

Cloud hosting - shared storage with direct access

We have an application deployed across AWS using the EC2 and EBS services.
The infrastructure is divided into layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
The application and database instances use a number of EBS volumes (root, data), which lets us attach/detach them and take EBS snapshots to S3. It's pretty much the default way AWS works.
But an EBS volume lives in a specific Availability Zone and can be attached to only one instance at a time.
The media server is one of the bottlenecks, so we'd like to scale it with a master/slave schema. For the media server storage we'd therefore like to try a distributed file system that can be attached to multiple servers. What do you advise?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
One thing to note: since GlusterFS is a layer above the other storage layers (particularly on Amazon), you might not get impressive disk performance. Actually it might be much worse than what you have now, for the sake of scalability... basically you could be better off designing your application to serve streaming media from a CDN that already has the correct infrastructure for your type of application. It's something to think about.
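To give a sense of how little setup is involved, here is a minimal two-node replicated volume (a sketch: the hostnames, brick paths, volume name, and mount point are all assumptions, not from the original post):

```shell
# Run from media1: enroll the second server into the trusted pool.
sudo gluster peer probe media2

# Create a replica-2 volume backed by one brick on each server.
sudo gluster volume create mediavol replica 2 \
    media1:/bricks/media media2:/bricks/media
sudo gluster volume start mediavol

# Each media server mounts the same volume; the app just sees local paths.
sudo mount -t glusterfs media1:/mediavol /var/media
```

With replica mode, a file written through one server's mount is readable from the other, which is what makes the master/slave media-server scheme work without application changes.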
HBase/Hadoop
Cassandra
MogileFS
A good, similar question (if I understand correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems; just find the one you need.
The above are only the ones I personally know of (I haven't tested them).