Best way to store shared files between ec2 instances - amazon-web-services

My website supports uploading images by the users. I'm trying to figure out what is the best strategy to save those files given that I have more than one ec2 instance running. Amazon Elastic File System sounds perfect but it's still in preview mode. What is the best alternative?

You almost certainly want to use S3 to share images between EC2 instances unless you have some very unique circumstances that won't allow it.
Best to not store any user data on the instance itself if you can avoid it; makes it easier to scale and to recover from crashes. S3 is a perfect super-redundant place to keep 'stuff' that costs next to nothing.
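A minimal sketch of that advice in Python, assuming boto3 is the SDK in use. The bucket name, the `uploads/<user_id>/...` key layout, and the helper names are hypothetical, not something from the question. The S3 client is passed in as a parameter, so any instance can handle the upload and none of them has to keep the image on its own disk.

```python
import uuid


def build_key(user_id, filename):
    """Return a collision-resistant object key for one uploaded image."""
    # Hypothetical layout: uploads/<user_id>/<random hex>-<original name>.
    safe_name = filename.replace("/", "_")
    return f"uploads/{user_id}/{uuid.uuid4().hex}-{safe_name}"


def store_upload(s3_client, bucket, user_id, filename, data, content_type):
    """Write the uploaded bytes to S3 and return the key to record in the database."""
    key = build_key(user_id, filename)
    # put_object is a boto3 S3 client call; Body accepts bytes.
    s3_client.put_object(Bucket=bucket, Key=key, Body=data, ContentType=content_type)
    return key


# Usage on an instance that has credentials (for example via an IAM role),
# assuming boto3 is installed and "my-image-bucket" is a placeholder name:
#   import boto3
#   key = store_upload(boto3.client("s3"), "my-image-bucket", 42,
#                      "cat.png", image_bytes, "image/png")
```

Only the returned key needs to be stored alongside the user record; every instance can then read the same object back from S3.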

Related

Syncing remote folders from several machines to one AWS instance

I have 3 AWS P instances processing some heavy stuff and saving results to relevant /home/user/folder
Also I have a main server with the same folder where I want to collect results from those 3 instances
Each instance works on its own part of the whole task, their results in sub folders not overlapping
Instances are 2 TB each, so I would like to get results from each instance as soon as they appear
This way when its job is done, I won't spend half a day copying results to the main server
I think one way of solving this is running something like this on each instance:
*/30 * * * * rsync -az /home/user/folder/ ubuntu@1.1.1.1:/home/user/folder/
Are there any smarter ways of achieving the same result, given that all of the instances are on AWS?
I also thought about (1) detachable storage and (2) storing on S3 but being new to AWS I might overlook some hidden pitfalls in such workflows, especially when it comes to terabytes of data and expensive instances.
How do you collect processed data from remote instances?
I would consider using the rclone tool, which can easily be configured for a shared S3 bucket. Just be aware of the difference between copy and sync mode. It can reach several gigabits per second of throughput, depending on your instance type.
Link for the project: rclone.org
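The copy/sync caveat matters because `rclone sync` makes the destination identical to the source, deleting destination files that the source does not have, while `rclone copy` only adds or updates files. The toy Python model below is not rclone code. It only mimics that difference on dictionaries that map a path to its content, to show why copy is the safer mode when several workers push into one shared location.

```python
def copy_mode(source, dest):
    """Model of copy: add or overwrite files from source, never delete from dest."""
    result = dict(dest)
    result.update(source)
    return result


def sync_mode(source, dest):
    """Model of sync: dest ends up identical to source, so extra dest files vanish."""
    return dict(source)


# Results already collected from worker 1; then worker 2 pushes to the same root.
collected = {"part1/a.bin": "A"}
worker2 = {"part2/b.bin": "B"}

print(sorted(copy_mode(worker2, collected)))  # both parts are kept
print(sorted(sync_mode(worker2, collected)))  # worker 1's results are gone
```

If each worker targets only its own subfolder, sync is harmless; pointing it at the shared root is where results get lost.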
My thoughts on some of the options mentioned in OP and comments, as well as some other ones I thought of:
EFS: create an EFS and mount it as an NFS drive on all the instances. It's the easiest but probably costs the most.
s3fs: have all the instances mount the same S3 bucket using s3fs. This is likely the most inexpensive solution. You also don't need to worry about running out of disk space. The downside is that the performance is not going to be that good compared to mounted NFS drives.
EBS volumes: attach an EBS volume to each worker instance for them to write the results to. When they are done, detach the volumes and attach them to the main server. This will be the fastest and still cheaper than EFS. If you can't or won't do all the detaching/attaching manually you'll need to write some scripts.
Old school NFS shares: there is nothing wrong with a plain vanilla NFS setup without any of those fancy AWS acronyms. :-)
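For the EBS option, the detach/attach step could be scripted roughly as follows. This is a sketch that assumes boto3; the function name and the default device name are hypothetical. It also assumes the volume has already been unmounted inside the worker, and that the volume and the main server are in the same availability zone, which EBS requires.

```python
def move_volume(ec2_client, volume_id, target_instance_id, device="/dev/sdf"):
    """Detach an EBS volume from its worker and attach it to the main server."""
    # Unmount the filesystem on the worker before calling this.
    ec2_client.detach_volume(VolumeId=volume_id)
    # Block until the volume reports 'available', i.e. it is fully detached.
    ec2_client.get_waiter("volume_available").wait(VolumeIds=[volume_id])
    ec2_client.attach_volume(
        VolumeId=volume_id, InstanceId=target_instance_id, Device=device
    )
    # Block until the volume reports 'in-use' on the target instance.
    ec2_client.get_waiter("volume_in_use").wait(VolumeIds=[volume_id])


# Usage, assuming boto3 is installed; the IDs are placeholders:
#   import boto3
#   move_volume(boto3.client("ec2"), "vol-...", "i-...")
```

After the call returns, the volume still has to be mounted on the main server before the results can be read.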

choosing a hosting platform that allows file and directory creation

I am trying to launch a project where my server generates user files and directories. Since heroku doesn't allow that, i am trying to find the best platform that will fit my needs without changing a bunch of my code.
my node server is storing data to firebase along with some files on the server itself. I realize this is not best practice but it is what it is for now
What would you recommend?
You can store your objects in S3. Do not store files on VMs in case of any failure.
Depending on your needs, an EBS volume would be a good start. It is meant to be redundant and the chances of losing any data are very small. The advantage is that it lives on if you terminate or stop an instance.
The newer EFS is very fast and can be mounted to multiple machines, much like an NFS file system. It is redundant across availability zones and will also survive a machine stop/termination.
S3 is an object store and isn't really meant for file system I/O. It can easily store files but it doesn't have nearly the performance of either EBS or EFS. It lives on after machine termination - indeed, it can be accessed with HTTP when properly configured.
Ultimately, you can create files normally on the EC2 with instance store, EBS, or EFS. The instance store data is lost if you terminate or even stop the instance. Be careful with that - you can easily lose tons of data when it is on instance store and not properly backed up.

Which AWS services and specs should I best use for a file sharing web system?

I'm building a web system where 100-150 users will keep uploading/downloading ~10 GB total worth of audio files everyday (average of 150 total uploads and 250 total downloads per day).
I'm still trying to read about the whole AWS ecosystem and I need help with the following:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticeable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case? (i.e. frequent uploads and downloads) Will the General Purpose ones (T2, M4 etc) be enough to handle that load? (see above)
I can provide more info on my requirements/use cases if needed. Thanks!
Start with S3. S3 is a web API for putting and retrieving huge amounts of data, whereas EBS is a block device that is attached to an instance and formatted with a filesystem. S3 will be more scalable from a data warehousing perspective, and in terms of access from multiple concurrent instances (should you do that in the future). Only use EBS if you actually need a filesystem for some reason. It doesn't sound like you do.
From there, you can look into some data archiving if you end up having huge amounts of data that doesn't need to be regularly available, to save some money.
Yes, use a t2 to start. Though, you should design your system so that it doesn't really matter, and you can easily teardown/replace instances. Using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application to newly launched instances, though. You should /assume/ that your instance will go down, disappear, etc. So, you should be able to failover to another one on demand.
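To put the load from the question in perspective, here is a back-of-the-envelope calculation in Python. It assumes the ~10 GB per day covers uploads and downloads together and uses decimal gigabytes; the peak factor of 10 is an arbitrary assumption for illustration, not a measurement.

```python
GB = 10**9  # decimal gigabyte, in bytes

daily_bytes = 10 * GB          # ~10 GB moved per day (from the question)
transfers_per_day = 150 + 250  # uploads + downloads (from the question)
peak_factor = 10               # assumption: peak traffic is 10x the daily average

avg_bits_per_second = daily_bytes * 8 / 86_400
avg_file_mb = daily_bytes / transfers_per_day / 10**6

print(f"average throughput: {avg_bits_per_second / 1e6:.2f} Mbit/s")
print(f"assumed peak:       {avg_bits_per_second * peak_factor / 1e6:.2f} Mbit/s")
print(f"average file size:  {avg_file_mb:.0f} MB")
```

Under these assumptions the average is below 1 Mbit/s with files of about 25 MB each, which is a light load and consistent with starting on a small instance.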

Process data in AWS S3 from EC2 instance

I'm wondering what is the best way of processing huge amounts of images stored in AWS S3 buckets from an Ec2 instance located in the same availability zone.
Should I download the images that I need each time I have to process them and then delete when I'm done, and do the same thing every time I need to do some processing?
Or is there a better way, like mounting the S3 bucket into the EC2 instance? I have seen tools like Fuse for mounting, but I am not sure if this is the best way of processing the data.
First of all, note that any EC2 instance can be terminated, so keep both the data and the results in durable storage such as S3.
If you can fetch a whole image into memory and process it there, I can't see a need to write it to disk first. On the other hand, if the images are very large, you might end up fetching each part many times. So there is no easy answer, at least without more information.
You can also look at map-reduce solutions and how they keep data close to the processing unit. Spark, for example, is able to process data in memory.
As for mounting: there are other options that can be mounted, such as Elastic File System or Elastic Block Store.
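The "fetch into memory, process, write the result back" approach could look roughly like this in Python. It is a sketch that assumes boto3 and images small enough to fit in memory; the function name, the `transform` callback, and the `processed/` output prefix are hypothetical.

```python
def process_images(s3_client, bucket, prefix, transform, out_prefix="processed/"):
    """Run every object under prefix through transform(bytes) -> bytes, in memory."""
    # Keep out_prefix outside of prefix so results are not picked up as input.
    paginator = s3_client.get_paginator("list_objects_v2")
    written = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page when nothing matches the prefix.
        for obj in page.get("Contents", []):
            key = obj["Key"]
            body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
            out_key = out_prefix + key
            s3_client.put_object(Bucket=bucket, Key=out_key, Body=transform(body))
            written.append(out_key)
    return written


# Usage, assuming boto3 is installed; bucket name and resize_fn are placeholders:
#   import boto3
#   process_images(boto3.client("s3"), "my-bucket", "raw/", resize_fn)
```

Nothing is written to the instance's disk, so a terminated instance loses at most the image it was working on.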

need some guidance on usage of Amazon AWS

every once in a while i read/hear about AWS and now i tried reading the docs.
But such docs seem to be written for people who already know which AWS they need to use and only search for how it can be used.
So, to understand AWS better, I'll try to sketch a hypothetical web application with a few questions.
The app's purpose is to modify content like videos or images. So a user has some kind of web interface where he can upload his files and choose some settings, and a server grabs the file and modifies it (e.g. re-encoding). The service also extracts the audio track of a video and tries to index the spoken words so the customer can search within his videos. (Well, it's just hypothetical.)
So my questions:
given my own domain 'oneofmydomains.com' is it possible to host the complete webinterface on AWS? i thought about using GWT to create the interface and just deliver the JS/images via AWS, but which one, simple storage? what about some kind of index.html, is there an EC2 instance needed to host a webserver which has to run 24/7 causing costs?
now the user has the interface with a login form, is it possible to manage logins with an AWS? here i also think about an EC2 instance hosting a database, but it would also cause costs and im not sure if there is a better way?
the user has logged in and uploads a file. which storage solution could be used to save the customers original and modified content?
now the user wants to browse the status of his uploads, this means i need some kind of ACL, so that the customer only sees his own files. do i need to use a database (e.g. EC2) for this, or does amazon provide some kind of ACL, so the GWT webinterface will be secure without any EC2?
the customers files are reencoded and the audio track is indexed. so he wants to search for a video. Which service could be used to create and maintain the index for each customer?
hope someone can give a few answers so i understand AWS better on how one could use it
thx!
Amazon AWS offers a whole ecosystem of services which should cover all aspects of a given architecture, from hosting to data storage, or messaging, etc. Whether they're the best fit for purpose will have to be decided on a case by case basis. Seeing as your question is quite broad I'll just cover some of the basics of what AWS has to offer and what the different types of services are for:
EC2 (Elastic Compute Cloud)
Amazon's cloud solution, which is basically the same as older virtual machine technology, but the 'cloud' adds extras such as automated provisioning, scaling, billing etc.
you pay for what you use (by the hour); a basic instance (single CPU, 1.7 GB RAM) would probably cost you just under $3 a day if you run it 24/7 (on a Windows instance, that is)
there's a number of different OS to choose from including linux and windows, linux instances are cheaper to run without the license cost associated with windows
once you've set up the server the way you want, including any server updates/patches, you can create your own AMI (Amazon Machine Image) which you can then use to bring up another identical instance
however, if all your html are baked into the image it'll make updates difficult, so normal approach is to include a service (windows service for instance) which will pull the latest deployment package from a storage (see S3 later) service and update the site at start up and at intervals
there's the Elastic Load Balancer (which has its own cost but only one is needed in most cases) which you can put in front of all your web servers
there's also the CloudWatch (again, extra cost) service which you can enable on a per instance basis to help you monitor the CPU, network in/out, etc. of your running instance
you can set up AutoScalers which can automatically bring up or terminate instances based on some metric, e.g. terminate 1 instance at a time if average CPU utilization is less than 50% for 5 mins, bring up 1 instance at a time if average CPU goes beyond 70% for 5 mins
you can use the instances as web servers, use them to run a DB, or a Memcache cluster, etc. choice is yours
typically, I wouldn't recommend having Amazon instances talk to a DB outside of Amazon because the round trip is much longer; the usual approach is to use SimpleDB (see below) as the database
the AmazonSDK contains enough classes to help you write some custom monitor/scaling service if you ever need to, but the AWS console allows you to do most of your configuration anyway
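The scaling rule described in the list above (remove one instance when average CPU stays below 50% over the window, add one when it goes above 70%) can be modelled in a few lines of Python. This is only a model of the decision logic, not the Auto Scaling API; the function names and the minimum/maximum bounds are hypothetical.

```python
def scaling_decision(cpu_samples, low=50.0, high=70.0):
    """Return -1, 0 or +1 instances for one window of CPU samples (in percent)."""
    if not cpu_samples:
        return 0  # no data, so leave the fleet alone
    average = sum(cpu_samples) / len(cpu_samples)
    if average > high:
        return 1
    if average < low:
        return -1
    return 0


def apply_decision(current, decision, minimum=1, maximum=10):
    """Clamp the new instance count so the fleet never drops below the minimum."""
    return max(minimum, min(maximum, current + decision))


# Five one-minute samples standing in for the 5-minute window.
busy = [82.0, 75.5, 91.0, 78.0, 80.0]
idle = [12.0, 20.0, 8.0, 15.0, 10.0]
print(apply_decision(2, scaling_decision(busy)))  # one instance is added
print(apply_decision(2, scaling_decision(idle)))  # one instance is removed
```

The gap between the two thresholds keeps the fleet from flapping when the load hovers around a single cut-off.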
SimpleDB
Amazon's non-relational, key-value data store, compared to a traditional database you tend to pay a penalty on per query performance but get high scalability without having to do any extra work.
you pay for usage, i.e. how much work it takes to execute your query
extremely scalable by default, Amazon scales up SimpleDB instances based on traffic without you having to do anything, AND any control for that matter
data are partitioned in to 'domains' (equivalent to a table in normal SQL DB)
data are non-relational; if you need a relational model then check out Amazon RDS. I don't have any experience with it, so I'm not the best person to comment on it.
you can execute SQL like query against the database still, usually through some plugin or tool, Amazon doesn't provide a front end for this at the moment
be aware of 'eventual consistency', data are duplicated on multiple instances after Amazon scales up your database, and synchronization is not guaranteed when you do an update so it's possible (though highly unlikely) to update some data then read it back straight away and get the old data back
there's 'Consistent Read' and 'Conditional Update' mechanisms available to guard against the eventual consistency problem, if you're developing in .Net, I suggest using SimpleSavant client to talk to SimpleDB
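The idea behind a conditional update can be shown with a small in-memory model in Python. This is not the SimpleDB API; `VersionedStore` and `increment` are hypothetical names, and the version counter stands in for whatever attribute the real condition would check. A write succeeds only if the item is still at the version the writer last read, so a stale read cannot silently overwrite newer data.

```python
class VersionedStore:
    """In-memory stand-in for a store that supports conditional writes."""

    def __init__(self):
        self._items = {}

    def get(self, key):
        """Return (value, version), or None if the key does not exist."""
        return self._items.get(key)

    def put_if_version(self, key, value, expected_version):
        """Write only if the stored version still equals expected_version."""
        current = self._items.get(key)
        current_version = current[1] if current else 0
        if current_version != expected_version:
            return False  # someone else wrote first
        self._items[key] = (value, current_version + 1)
        return True


def increment(store, key, retries=5):
    """Read-modify-write a counter, retrying when the condition fails."""
    for _ in range(retries):
        current = store.get(key)
        value, version = current if current else (0, 0)
        if store.put_if_version(key, value + 1, version):
            return value + 1
    raise RuntimeError("too much contention")
```

A failed condition is the signal to re-read the item and try again, which is what the retry loop does.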
S3 (Simple Storage Service)
Amazon's storage service, again, extremely scalable, and safe too - when you save a file on S3 it's replicated across multiple nodes so you get some DR ability straight away.
you pay for the storage you use, plus requests and data transfer
files are stored against a key
you create 'buckets' to hold your files, and each bucket has a unique url (unique across all of Amazon, and therefore S3 accounts)
CloudBerry S3 Explorer is the best UI client I've used in Windows
using the AmazonSDK you can write your own repository layer which utilizes S3
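A repository layer on top of S3 might look like the following Python sketch. It assumes boto3 as the SDK; the class name and the `customers/<id>/` key layout are hypothetical. Keeping each customer's files under their own prefix is one simple way to make sure a customer only ever sees their own uploads, and a presigned URL lets the browser download a file without the bucket being public.

```python
class S3Repository:
    """Store and look up customer files under a per-customer key prefix."""

    def __init__(self, s3_client, bucket):
        self._s3 = s3_client
        self._bucket = bucket

    def _key(self, customer_id, name):
        # Hypothetical layout: customers/<customer_id>/<file name>.
        return f"customers/{customer_id}/{name}"

    def save(self, customer_id, name, data):
        key = self._key(customer_id, name)
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)
        return key

    def list_names(self, customer_id):
        prefix = self._key(customer_id, "")
        # A single list_objects_v2 call returns at most 1000 keys;
        # larger listings would need pagination.
        response = self._s3.list_objects_v2(Bucket=self._bucket, Prefix=prefix)
        return [obj["Key"][len(prefix):] for obj in response.get("Contents", [])]

    def download_url(self, customer_id, name, expires_in=3600):
        # Time-limited link to one object.
        return self._s3.generate_presigned_url(
            "get_object",
            Params={"Bucket": self._bucket, "Key": self._key(customer_id, name)},
            ExpiresIn=expires_in,
        )


# Usage, assuming boto3 is installed; the bucket name is a placeholder:
#   import boto3
#   repo = S3Repository(boto3.client("s3"), "my-media-bucket")
```

The web application still has to authenticate the user and pass the right customer ID; the prefix only scopes what the repository will touch.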
Sorry if this is a bit long-winded, but those are the 3 most popular web services that Amazon provides, and they should cover all the requirements you've mentioned. We've been using Amazon AWS for some time now and there are still some kinks and bugs, but it's generally moving forward and pretty stable.
One downside of using something like AWS is vendor lock-in: whilst you could run your services outside of Amazon in your own datacenter, or move files out of S3 (at a cost, though), getting out of SimpleDB will likely represent the bulk of the work during a migration.