AWS: Splitting software & data into different volumes

AWS recommends keeping data & OS on separate EBS volumes. I have a webserver running on EC2 with an EBS volume. On a bare VM, I install the following:
- webserver, wsgi, pip & related software/config (some in /etc, some in /home/<user>)
- server code & static assets in /var/www/
- log files are written to /var/log/<respective-folder>
- maintenance scripts in /home/<user>/
The database server is separate. For a webserver, which of the above items would benefit from higher IOPS, and for which ones does it not matter? My understanding is that the server code & log files should be moved to a separate EBS volume with higher IOPS. Or should I just move everything (except the software installed in /etc, i.e. the webserver itself) to a separate volume with better IOPS?

I would recommend a separate EBS volume for code, logs, and maintenance scripts in case you need to move it to another server. That gives you a faster TTR (time to resolution) than having to rebuild an entire server.
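Moving such a data volume between instances is just a detach/attach. A minimal boto3 sketch, assuming placeholder volume and instance IDs and that both instances are in the same Availability Zone:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

VOLUME_ID = "vol-0123456789abcdef0"   # hypothetical data volume (code/logs/scripts)
OLD_INSTANCE = "i-0aaaaaaaaaaaaaaaa"  # failing server
NEW_INSTANCE = "i-0bbbbbbbbbbbbbbbb"  # replacement server in the same AZ

# Detach from the old instance and wait until the volume is free.
ec2.detach_volume(VolumeId=VOLUME_ID, InstanceId=OLD_INSTANCE)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

# Attach to the replacement instance; mount it from the OS afterwards.
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=NEW_INSTANCE, Device="/dev/sdf")
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])
```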
The code shouldn't change much after deployment, so I would use a General Purpose SSD (gp2) volume here and look towards a caching layer (Varnish for full-page caching, a CDN for static assets) rather than worrying about disk I/O. A CDN is a quick win and absorbs most of the reads for static assets. At 50GB you get 150 IOPS, and with the static assets offloaded, the I/O should be fine.
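The 50GB / 150 IOPS figure follows from gp2's baseline of roughly 3 IOPS per provisioned GiB. A tiny illustration only; real gp2 volumes also have a minimum baseline and burst credits, so check the current EBS documentation rather than relying on this formula:

```python
def gp2_baseline_iops(size_gib: int) -> int:
    """Rough gp2 baseline: ~3 IOPS per provisioned GiB.

    Illustration only -- actual gp2 volumes have a minimum baseline and
    can burst well above this; see the EBS docs for exact numbers.
    """
    return 3 * size_gib

print(gp2_baseline_iops(50))  # 150, matching the figure above
print(gp2_baseline_iops(10))  # 30, the log-volume figure below
```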
As for logs, if you are a high-traffic site then you should definitely focus on I/O here, as you don't want blocking I/O on log writes. This mainly concerns access logs rather than error logs, since logging shouldn't go beyond ERROR level on a production system. If you aren't high traffic, then you should be fine with a General Purpose SSD; at 10GB you get 30 IOPS, and that is generally enough.
What are your maintenance scripts doing? If they are generating and outputting files, then a General Purpose SSD will do, but if they need high I/O you should revisit and optimize the code instead, as high-IOPS disks get expensive and that cost is usually wasted on maintenance that only runs intermittently.
As for your webserver and the rest of the installed software, that should be managed as infrastructure as code (via OpsWorks or Puppet) and doesn't need much in terms of I/O, since those are mostly memory-resident processes once built and deployed.

Related

AWS EFS for critical small files with frequent updates

Our application's user base has reached 2M users and we are planning to scale up the application using AWS.
The main problem we are facing is the handling of shared data, which includes cache, uploads, models, sessions, etc.
One option is AWS EFS, but we worry it will kill the application's performance, since the files are really small (ranging from a few bytes to a few MBs) and are updated very frequently.
We can use Memcache/Redis for sessions and S3 for uploads but still need to manage cache, models, and some other shared files.
Is there any alternative to EFS or any way to make EFS work for this scenario where small files are updated frequently?
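For example, moving sessions to Redis would look something like this (a rough sketch only; the endpoint, key scheme, TTL, and JSON serialization are all placeholders, not part of our current setup):

```python
import json
from typing import Optional

import redis

# Placeholder Redis endpoint (e.g. an ElastiCache node); adjust for your setup.
r = redis.Redis(host="sessions.example.internal", port=6379, db=0)

SESSION_TTL = 3600  # seconds of inactivity before a session expires

def save_session(session_id: str, data: dict) -> None:
    # SETEX writes the value and its TTL in a single call.
    r.setex(f"session:{session_id}", SESSION_TTL, json.dumps(data))

def load_session(session_id: str) -> Optional[dict]:
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw is not None else None
```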
Small files and frequent updates should not be a problem for EFS.
The problem some users encountered in the original release was that it had two dimensions tightly coupled together -- the amount of throughput available was a function of how much you were paying, and how much you were paying was a function of the total size of the filesystem (all files combined, regardless of individual file sizes)... so the larger, the faster.
But they have, since then, introduced "provisioned throughput," allowing you to decouple these two dimensions.
This default Amazon EFS throughput bursting mode offers a simple experience that is suitable for a majority of applications. Now with Provisioned Throughput, applications with throughput requirements greater than those allowed by Amazon EFS’s default throughput bursting mode can achieve the throughput levels required immediately and consistently independent of the amount of data.
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-efs-now-supports-provisioned-throughput/
If you use this feature, you pay for the difference between the throughput you provision, and the throughput that would have been included, anyway, based on the size of the data.
See also Amazon EFS Performance in the Amazon Elastic File System User Guide.
Provisioned Throughput can be activated and deactivated, so don't confuse it with the two performance modes, General Purpose and Max I/O: one of those must be selected when creating the filesystem, and that selection can't be changed later. The performance modes relate to a tradeoff in the underlying infrastructure, and the recommended practice is to select General Purpose unless observed metrics give you a reason not to. Max I/O also does not have the same metadata consistency model as General Purpose.
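A minimal boto3 sketch of both knobs, with placeholder IDs and throughput numbers (the performance mode is fixed at creation; the throughput mode can be switched later):

```python
import boto3

efs = boto3.client("efs", region_name="us-east-1")  # region is an assumption

# Performance mode is chosen at creation time and cannot be changed later.
fs = efs.create_file_system(
    CreationToken="app-shared-files",   # any unique idempotency token
    PerformanceMode="generalPurpose",   # the recommended default
    ThroughputMode="bursting",          # default, size-based throughput
)

# Later, if burst throughput is not enough, switch to provisioned throughput.
# (In practice, wait until the filesystem's LifeCycleState is 'available' first.)
efs.update_file_system(
    FileSystemId=fs["FileSystemId"],
    ThroughputMode="provisioned",
    ProvisionedThroughputInMibps=64,    # placeholder figure
)
```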

Strategies for optimizing the performance of IO intensive jobs on AWS

I wrote a script that analyzes a lot of files on an AWS cluster.
Running it on the cloud seems to be slower than I expected - the filesystem is shared via NFS, so the round-trip through the network seems to be the limiting step here. Bottom line - the processing power of the cluster is limited by the speed of the internal network, which is considerably slower than the SSD the data is located on.
How would you optimize the cluster so that IO intensive jobs will run efficiently?
There isn't much you can do given the circumstances.
Obviously the speed of the NFS itself is the drawback.
Consider:
Chunking - grab only the pieces of each file you actually need, and do as much work as possible with each chunk.
Copying locally - create a locking mechanism, copy the file in full to local storage, process it, and push the result back. This can require a lot of work (what if the worker gives up and doesn't clear the lock?); a minimal sketch of this pattern follows this answer.
Optimize the NFS share - increase I/O throughput by clustering the NFS servers, putting the storage behind RAID, and so on.
With a remote FS you want to limit the amount of back and forth. You can get creative, but creativity can become a problem in itself.
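A minimal sketch of the copy-locally pattern, using a lock file on the NFS share so two workers don't process the same input. The paths and the process_file step are placeholders, and a stale-lock strategy is still needed, as noted above:

```python
import os
import shutil

NFS_INPUT = "/mnt/nfs/input"    # hypothetical NFS-mounted directories
NFS_OUTPUT = "/mnt/nfs/output"
LOCAL_TMP = "/tmp/work"         # instance-local (EBS/instance store) scratch space

def try_lock(name: str) -> bool:
    """Create a lock file atomically on the shared FS; fail if it already exists.

    A crashed worker leaves the lock behind -- in practice you would also
    record a timestamp/owner and reap stale locks.
    """
    try:
        fd = os.open(os.path.join(NFS_INPUT, name + ".lock"),
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.close(fd)
        return True
    except FileExistsError:
        return False

def process_file(path: str) -> str:
    """Placeholder for the actual analysis; returns the path of the result."""
    return path + ".result"

def handle(name: str) -> None:
    if not try_lock(name):
        return                                            # another worker has it
    os.makedirs(LOCAL_TMP, exist_ok=True)
    local = os.path.join(LOCAL_TMP, name)
    shutil.copy(os.path.join(NFS_INPUT, name), local)     # one big sequential read
    result = process_file(local)                          # all further I/O is local
    shutil.copy(result, os.path.join(NFS_OUTPUT, os.path.basename(result)))
    os.remove(os.path.join(NFS_INPUT, name + ".lock"))    # push back, then unlock
```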

What factors should be considered to move to Amazon Storage?

We have a Django application running on Webfaction. In this application, users upload a lot of images. So far we have not had any issues. Soon, we expect about 10,000 users. I was wondering: should we move to a cloud storage solution like S3? How will the move help us?
Some of the advantages of moving to a remote storage such as S3 are:
Central storage location: You don't need to worry about managing a shared NFS mount as you bring up new webservers to handle additional load.
Offloading requests: Your servers will not take on the load of serving the media.
Some disadvantages are:
Additional cost: You pay for the storage and the bandwidth.
More moving parts: A file system is fairly easy to understand, manage and test. Remote APIs aren't perfect and some of the problems are out of your control.
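If you do move, the usual Django route is the django-storages package with its S3 backend, so FileField/ImageField uploads go straight to a bucket. A minimal sketch; the bucket name and region are placeholders, and the exact setting names may differ between django-storages and Django versions:

```python
# settings.py -- hypothetical fragment using the django-storages S3 backend
INSTALLED_APPS = [
    # ...existing apps...
    "storages",
]

# Send uploaded media (ImageField/FileField) to S3 instead of local disk.
DEFAULT_FILE_STORAGE = "storages.backends.s3boto3.S3Boto3Storage"

AWS_STORAGE_BUCKET_NAME = "myapp-user-uploads"   # placeholder bucket name
AWS_S3_REGION_NAME = "us-east-1"                 # placeholder region
# Credentials are best supplied via an IAM role or environment variables
# rather than hard-coded in settings.
```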

Cloud hosting - shared storage with direct access

We have an application deployed on AWS using the EC2 and EBS services.
The infrastructure is broken down into layers (independent instances):
application (with load balancer)
database (master-slave standard schema)
media server (streaming)
background processing (redis, delayed_job)
The application and database instances use a number of EBS block storage devices (root, data), which lets us attach/detach them and take EBS snapshots to S3. That's the standard way AWS works.
But an EBS volume lives in a specific Availability Zone and can be attached to only one instance at a time.
The media server is one of the bottlenecks, so we'd like to scale it with a master/slave schema. For the media server storage we'd like to try a distributed file system that can be attached to multiple servers. What do you advise?
If you're not Facebook or Amazon, then you have no real reason to use something as elaborate as Hadoop or Cassandra. When you reach that level of growth, you'll be able to afford engineers who can choose/design the perfect solution to your problems.
In the meantime, I would strongly recommend GlusterFS for distributed storage. It's extremely easy to install, configure and get up and running. Also, if you're currently streaming files from local storage, you'll appreciate that GlusterFS also acts as local storage while remaining accessible by multiple servers. In other words, no changes to your application are required.
I can't tell you the exact configuration options for your specific application, but there are many available such as distributed, replicated, striped data. You can also play with cache settings to avoid hitting disks on every request, etc.
One thing to note: since GlusterFS is a layer above the other storage layers (particularly on Amazon), you might not get impressive disk performance. It might actually be much worse than what you have now, for the sake of scalability... basically, you could be better off designing your application to serve streaming media from a CDN, which already has the right infrastructure for your type of application. It's something to think about.
HBase/Hadoop
Cassandra
MogileFS
A good similar question (if I understand yours correctly):
Lustre, Gluster or MogileFS?? for video storage, encoding and streaming
There are many distributed file systems, just find the one you need.
The above are just the ones I personally know of (I haven't tested them).

Amazon EC2 and EBS using Windows AMIs

I put our application on EC2 (Windows 2003 x64 server) and attached up to 7 EBS volumes. The app is very I/O intensive to storage -- typically we use DAS with NTFS mount points (usually around 32 mount points, each to a 1TB drive), so I tried to replicate that using EBS, but the I/O rates are bad, topping out at around 22MB/s. We suspect the NIC to the EBS storage (which is dynamic SAN, if I read correctly) is limiting the pipeline. Our app mostly streams its disk access (not random), so it works better for us when very little gets in the way of talking to the disk controllers and handling I/O directly.
Also, when I create a volume and attach it, I see it appear in the instance (fine), then I make it into a dynamic disk pointing to my mount point and quick-format it -- when I do this, does all the data on the volume get wiped? It certainly seems so when I attach it to another AMI. I must be missing something.
I'm curious whether anyone has experience putting I/O-intensive apps on EC2, and if so, what's the best way to set up the volumes?
I've had limited experience, but I have noticed one small thing:
The initial write is generally slower than subsequent writes.
So if you're streaming a lot of data to disk, like writing logs, this will likely bite you. But if you create a big file, fill it with data, and then do a lot of random-access I/O against it, performance improves from the second write to any given location onwards.
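A hedged sketch of that pre-fill idea: sequentially write every block of a scratch file once up front, so the application's later writes don't pay the first-write cost. The path and size are placeholders, and whether this helps at all depends on the EBS generation, so measure before relying on it:

```python
import os

SCRATCH_PATH = r"D:\scratch\prefill.dat"   # placeholder path on the EBS volume
SIZE_BYTES = 10 * 1024**3                  # placeholder size: 10 GiB
CHUNK = 4 * 1024**2                        # write in 4 MiB chunks

# Sequentially write the whole file once, so every block has been touched
# before the application starts doing real I/O against it.
block = b"\0" * CHUNK
with open(SCRATCH_PATH, "wb") as f:
    written = 0
    while written < SIZE_BYTES:
        f.write(block)
        written += CHUNK
    f.flush()
    os.fsync(f.fileno())   # make sure the data actually reaches the volume
```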