Which AWS services and specs should I best use for a file sharing web system? - amazon-web-services

I'm building a web system where 100-150 users will keep uploading/downloading roughly 10 GB worth of audio files in total every day (an average of 150 uploads and 250 downloads per day in total).
I'm still trying to read up on the whole AWS ecosystem and I need help with the following:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I've read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticeable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case? (i.e. frequent uploads and downloads) Will the General Purpose ones (T2, M4 etc) be enough to handle that load? (see above)
I can provide more info on my requirements/use cases if needed. Thanks!

Start with S3. S3 is a web API for storing and retrieving huge amounts of data, whereas EBS is a block device that has to be attached to a single EC2 instance (EFS is the NFS-style shared filesystem). S3 will be more scalable from a data warehousing perspective, and in terms of access from multiple concurrent instances (should you need that in the future). Only use EBS if you actually need a local filesystem for some reason. It doesn't sound like you do.
From there, you can look into data archiving if you end up with large amounts of data that don't need to be regularly available, to save some money.
Yes, use a t2 to start. You should design your system so that the instance type doesn't really matter, though, and so that you can easily tear down/replace instances. Using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application on newly launched instances. You should assume that your instance will go down, disappear, etc., so you should be able to fail over to another one on demand.
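To make that concrete, here is a minimal sketch of the S3 pattern with Python/boto3 (the bucket name and object keys are placeholders): the web app hands the browser short-lived presigned URLs, so the ~10 GB/day of audio never has to pass through your EC2 instance.

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-audio-files"  # placeholder bucket name

    # Presigned PUT: the browser uploads directly to S3, bypassing your web server.
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": BUCKET, "Key": "uploads/track-001.mp3"},
        ExpiresIn=900,  # link valid for 15 minutes
    )

    # Presigned GET: a time-limited download link for the same object.
    download_url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": "uploads/track-001.mp3"},
        ExpiresIn=3600,
    )

Because the files live in S3 rather than on an instance, any replacement instance can serve the same data immediately, which is exactly what makes the teardown/replace pattern painless.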

Related

Shifting/Migrating PHP Codeigniter project to AWS

Our company has a software product consisting of a web app plus Android and iOS apps.
We have more than 350 clients, i.e. more than 350 MySQL databases (one per client) and one code repository (PHP CodeIgniter). When a new client purchases our software we just copy an empty template database and the client is able to use the software. That is our architecture.
Now we are planning to move to AWS, but we do not know which AWS services we really need for this type of architecture.
We are on CodeIgniter 3.1, PHP 7 and MySQL.
You can implement this sort of system on a single EC2 instance, simply installing the same software you have on your current server. However, in that case you are likely better off hosting it somewhere cheaper than AWS.
What I recommend instead is that you implement it using RDS, EC2, S3 and CloudFront.
RDS
I recommend running your database on RDS:
The database server has a completely different resource profile than PHP, so when the database and PHP run on the same instance it is very hard to figure out what is happening once you hit performance problems. A lack of CPU can lead to a lack of memory and vice versa.
Built-in point-in-time recovery for up to 35 days has saved my bacon many times; it is great when you have a bug that is hard to reproduce or when someone (you) has accidentally deleted a large amount of data (see the restore sketch after this list).
On top of this I recommend going for Aurora MySQL instead of plain MySQL RDS, especially as I expect your database size on disk to be smaller than 50 GB:
On MySQL RDS you need to commission at least 100 GB of disk to get good enough performance for production, because throughput scales with the provisioned size of the underlying EBS disks (100 GB gives you roughly 100 x 50 KB per second).
By comparison, on Aurora you get the read performance of six storage replicas without having to commit to any amount of disk space. This saves money and performs better.
Aurora is also much faster at point-in-time restores as well as at "dumb" queries, i.e. table scans.
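As an illustration of the point-in-time recovery mentioned above, a restore is essentially one API call. A rough boto3 sketch (instance identifiers and timestamp are placeholders; for Aurora the equivalent call is restore_db_cluster_to_point_in_time):

    from datetime import datetime, timezone
    import boto3

    rds = boto3.client("rds")

    # Restore a new instance from the automated backups, e.g. to just before
    # someone accidentally deleted a pile of data.
    rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier="prod-mysql",            # placeholder
        TargetDBInstanceIdentifier="prod-mysql-restored",   # placeholder
        RestoreTime=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    )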
EC2
I recommend looking at nothing older than the t3, c5 or m5 instances, as they run on the new Nitro hypervisor and are significantly faster while being cheaper. From experience, you can go down a notch from your existing CPU count with these instances.
If you can, use c6/m6/t4 instances.
I have also found the c5a and its equivalents to be just as performant.
AWS recommends always using auto scaling, but if you are coming from a single server somewhere else you are already winning, because you can restore within minutes.
Once you hit $600 per month in EC2 charges, definitely look at auto scaling. Virtually every web app can be written in a way that allows a server to be replaced at any point in time. With auto scaling you can then use Spot Instances at a 50-90% discount for your 2nd/3rd/etc. instance and save serious money.
S3
Store all customer-provided files on S3; do NOT get into a shared file system.
This is much cheaper than any disk or file system and has numerous automation features, such as versioning, cross-region replication, archiving, event triggers, etc.
Do not ever make your bucket publicly accessible.
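For instance, enabling versioning and blocking all public access are each a single API call; a sketch with boto3 (the bucket name is a placeholder):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "customer-files"  # placeholder bucket name

    # Keep old versions of every object (protects against accidental overwrites/deletes).
    s3.put_bucket_versioning(
        Bucket=BUCKET,
        VersioningConfiguration={"Status": "Enabled"},
    )

    # Block every form of public access on the bucket.
    s3.put_public_access_block(
        Bucket=BUCKET,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )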
CloudFront
The key benefit of storing all customer-provided files on S3 is that you can serve them through CloudFront without paying for CPU. CloudFront only charges for traffic delivered and S3 only charges for space used; a file delivered through CloudFront does not consume your server's CPU, sockets or network bandwidth. On top of this, transfer from EC2 to S3 and from S3 to CloudFront is free of charge, so you only pay for the traffic you already had to pay for anyway.
You need to secure your clients' files properly with signed URLs or signed cookies. For this you can either create a separate S3 bucket for each client or use one single bucket.
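For the signed-URL route, botocore ships a CloudFrontSigner helper; here is a rough sketch, assuming the cryptography package for the RSA signing (the key-pair ID, private-key path, distribution domain and file path are all placeholders):

    from datetime import datetime, timedelta, timezone

    from botocore.signers import CloudFrontSigner
    from cryptography.hazmat.primitives import hashes, serialization
    from cryptography.hazmat.primitives.asymmetric import padding

    def rsa_signer(message):
        # Sign with the private key that belongs to your CloudFront key pair.
        with open("cloudfront-private-key.pem", "rb") as f:  # placeholder path
            key = serialization.load_pem_private_key(f.read(), password=None)
        return key.sign(message, padding.PKCS1v15(), hashes.SHA1())

    signer = CloudFrontSigner("KXXXXXXXXXXXXX", rsa_signer)  # placeholder key-pair ID

    # URL that stops working after one hour.
    url = signer.generate_presigned_url(
        "https://dxxxxxxxxxxxx.cloudfront.net/client-42/report.pdf",  # placeholder
        date_less_than=datetime.now(timezone.utc) + timedelta(hours=1),
    )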
Bonus: SQS
Many things in a web application do not need to be done right now. They can wait a bit, sometimes a couple of hundred milliseconds, sometimes minutes or hours.
For anything that can wait, I recommend implementing a background process that reads from an SQS queue. Your web application needs minimal time to push the required work and its parameters into the queue, and your background process can then work on the items in (rough) order of entry. Even if you use your normal web servers to process the background queue, you already get a better distribution of server load over time, because you cannot control the volume of web requests but you can control the speed at which you process background items (to a degree, of course).
Later, when you have a lot of background processing and a lot of traffic, you can consider using different servers for background processing.
There are also lots of ways to hook other event-driven code onto the items that go into your queue, including monitoring for exceeded limits on certain items, etc.
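A minimal sketch of that pattern with boto3 (the queue URL, the message shape and the process() callback are placeholders): the web request only enqueues, and a worker long-polls the queue and does the actual work.

    import json
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/background-jobs"  # placeholder

    def enqueue_transcode(file_key):
        # Called from the web request: push the job and return immediately.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"task": "transcode", "key": file_key}),
        )

    def worker_loop(process):
        # Background process: long-poll the queue and handle items roughly in order.
        while True:
            resp = sqs.receive_message(
                QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
            )
            for msg in resp.get("Messages", []):
                process(json.loads(msg["Body"]))
                sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])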

Loading large amount of data from local machine into Amazon Elastic Block Store

I am interested in doing some machine learning using an AWS EC2 instance. I have played around with launching instances with an attached EBS volume, and I was able to load files into it via scp on my local command line. I will have several gigabytes of data to load onto this EBS volume (I know that isn't a lot by ML standards, but that's not really my point). I would like to know what the appropriate way to load this data is. I'm concerned about racking up large fees because I did something in a silly way.
So far I have just uploaded a few files to the EC2 instance's associated EBS manually via the command line, like this:
scp -i keys/ec2-ml-micro2.pem data/BB000000001.png ubuntu@<my instance ip>:/data
This seems to me to be a rather primitive approach (not that that is always a bad thing). Is it the "right" way? I'm not opposed to letting a batch job run overnight like this, but I am not sure whether it may incur data transfer fees. I've looked around for information on this, and I have read the page on EBS pricing. I didn't see anything on costs associated with loading data, but I just wanted to confirm with someone who has done something similar that this is the correct approach, and if not, what a better one is.
When managing large objects in AWS, always check S3 as the initial option: it provides unlimited storage capacity and is the better fit for an object store compared to EBS (block storage). EBS bills you for the size of the volume you provision, which means you may end up over-provisioned (overhead cost) or under-provisioned (which can lead to poor performance or even downtime).
With S3 you are billed per GB per month for the storage you actually consume, a pay-for-what-you-use model, and it's very cheap compared to EBS.
And lastly, evaluate first whether any of the AWS machine learning services fit your use cases; it will save you a lot of time and effort.
Data Transfer from S3 to EBS within the same region is free of charge.
AWS Pricing Details
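If you go the S3 route, the transfer-cost picture is simple: data transferred into S3 is free, and you pay for what is stored (plus a small per-request charge). A rough boto3 sketch that replaces the scp loop (the bucket name and key prefix are placeholders); the AWS CLI's `aws s3 sync` does the same thing from the shell.

    from pathlib import Path

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "ml-training-data"  # placeholder bucket name

    # Upload every .png under data/ straight to S3 instead of scp'ing it to an EBS volume.
    for path in Path("data").glob("*.png"):
        s3.upload_file(str(path), BUCKET, f"images/{path.name}")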

Is EFS a substitute of HDFS for distributed storage?

Our business requirement is to read millions of files and process them in parallel (and later index them in ES). This is a one-time operation, and after processing them we won't read those million files again. Now, we want to distribute the file storage and at the same time ensure data retention. I did some research and made this list:
EBS: The data is retained even after the EC2 instance is shut down. It is accessible from a single EC2 instance in our AWS region, so it is useful if we split the data ourselves and provide it to different EC2 instances. It offers redundancy and encryption, and it is easy to scale. We can use it if we divide the chunks manually and distribute them to the different servers we have.
EFS: It allows us to mount the file system across multiple availability zones and instances (accessible from multiple EC2 instances). Since EFS is a managed service, we don't have to worry about maintaining and deploying the FS.
S3: Not limited to access from EC2, but S3 is not a file system.
HDFS: Extremely good at scale but is only performant with double or triple replication. Scaling down HDFS is painful and buggy. "It also lacks encryption at storage and network levels. It has also been connected to various controversies because cybercriminals can easily exploit the frameworks that are built on Java." Not sure how big of a concern this is considering our servers are pretty secure.
There is also the small-files problem in Hadoop, explained in https://data-flair.training/forums/topic/what-is-small-file-problem-in-hadoop/ Considering most of the files we receive are less than 1 MB, this can cause memory issues if we go beyond a certain number, so it will not give us the performance we think it should.
My confusion is in HDFS:
I went through a lot of resources that talk about "S3" vs "HDFS", and surprisingly there are no clear resources on "EFS" vs "HDFS", which leaves me unsure whether they are really substitutes for each other or complementary.
For example, one question I found was "Has anyone tried using AWS EFS mounts as yarn scratch and HDFS directories?" -> what does it mean to have EFS mount as HDFS directory?
"Using EBS volumes for HDFS prevents data locality" - What does it mean to use "EBS volume" for HDFS?
What does it mean to run "HDFS in the cloud"?
References
https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html
https://cloud.google.com/blog/products/storage-data-transfer/hdfs-vs-cloud-storage-pros-cons-and-migration-tips
https://www.knowledgehut.com/blog/big-data/top-pros-and-cons-of-hadoop
https://data-flair.training/blogs/13-limitations-of-hadoop/
Any of these kinds of storage could work, but as your situation is a one-time scenario you need a choice that is:
Cost-optimized
Well-performing
Secure
I cannot answer all your questions, but for your use case I assume you access the data from EC2 instances; if you had mentioned how these files are produced and processed, and roughly how large each file is, I could perhaps help you better.
Considerations:
EBS has provisioned (i.e. limited) throughput and forces you to provision capacity and to remove the data after processing. FYI: you can set the retention policy of an EBS volume so that it is deleted on EC2 termination, but not on shutdown.
If you really need the fastest option and don't care about cost, EBS is a good idea with proper provisioning, as you are charged for the volume's lifetime and size.
EFS is NAS-style storage and also needs the data to be removed after processing.
HDFS is a distributed file system and is the best choice at petabyte scale, but it is not used as a one-shot solution; you need installation and configuration.
Personally I propose S3, as it does not impose a throughput limit on your side and, using a VPC endpoint, you can achieve up to 25 Gbps. Alternatively, you can use S3 lifecycle policies to remove your data automatically based on tags or after a given number of days, or to archive it if needed.
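As an example of such a lifecycle policy, here is a boto3 sketch that expires everything under a prefix a week after upload (the bucket name, prefix and 7-day window are placeholders):

    import boto3

    s3 = boto3.client("s3")

    s3.put_bucket_lifecycle_configuration(
        Bucket="one-time-ingest",  # placeholder bucket name
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "expire-processed-input",
                    "Filter": {"Prefix": "input/"},   # only the raw input files
                    "Status": "Enabled",
                    "Expiration": {"Days": 7},        # delete 7 days after creation
                }
            ]
        },
    )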

What is difference between AWS EFS and S3?

AWS released its new Elastic File System this week. See http://aws.amazon.com/efs/
The page doesn't contain many details. I'd like to know its performance compared to S3, as well as the other differences.
You almost can't compare EFS and S3 because they are two very different things, even though there is some overlap in their functionality, or at least their apparent functionality.
They both store things and they both have a storage pricing model that scales linearly with usage over time.
But S3 is an object store with an HTTP interface and a mixed consistency model....
...while EFS is an actual filesystem with an NFS interface and as such will almost certainly offer immediate consistency.
S3, coupled with a utility like s3fs, can be used in a way that mimics a filesystem, but not to the point of behaving in all ways like an actual filesystem.
One way of looking at EFS is that it is an answer to the question, "how do I attach an EBS volume to multiple instances at the same time?" Previously, of course, the answer was, "you can't." You can mount the filesystem exposed by EFS on any number of instances and the result should be very similar to what you'd see if you had a "shared volume."
Its performance compared to S3 is not really a fair comparison, again, because they are different things for different purposes, but EFS will almost without question be "faster" by any meaningful definition of the word.
Also, no software should be required in order to mount an EFS filesystem on a Linux system.
As already mentioned, EFS is completely different from S3.
The simplest way to compare them is to look at what the underlying technology is.
S3 is an object store, meaning it is a higher-layer data storage system; essentially it is "blob" storage, storing each piece of data as an object in an underlying simple database.
It's designed for write-once, read-many access, perfect for media data like images or video, particularly as it is distributed and offers a very high level of redundancy.
EFS is a network storage system; underlying it is a storage array (SAN), and it offers the standard protocol for multi-session network file systems (NFS).
It's built on high speed SSD drives and is intended for shared storage for your ec2 instances, think file servers.
It's been a long time coming for AWS and IMO this was one of the biggest missing key components for aws to really be a competitor to on-premise enterprise data centers.
Performance for EFS will be scalable and although I have not seen the details yet I am sure it will allow for provisioned IOPS just like EBS.
EFS is also considerably (10x) more expensive than S3, at $0.30 vs $0.03 per GB-month. From an IOPS perspective you should see better performance from EFS, as it's SSD-based and doesn't have the overhead of HTTP on top as S3 does. It's essentially NAS as a service.
Two additional differences between the two:
AWS S3 offers Server-Side Encryption: http://docs.aws.amazon.com/AmazonS3/latest/dev/serv-side-encryption.html
The same is not currently offered in AWS EFS
Files stored in AWS S3 as public are accessible to anyone via a public URL.
In AWS EFS, however, to achieve the same you'll need to deploy a web server that serves your files.
Choosing between EFS and S3 depends on your usage pattern.
EFS availability and durability are comparable to S3's,
but the two have different usage patterns.
S3 has four common usage patterns:
serving static web content
hosting entire static websites
storing data for large-scale analytics
backup and archiving of critical data
EFS is designed for applications that concurrently access data from multiple EC2 instances.
Simply put, one EFS file system can be attached to multiple EC2 instances; you can't do that with EBS.
Amazon claims that S3 performance is more than any current user needs.
EFS has two performance modes:
General Purpose
Max I/O
General Purpose is the default and is appropriate for most types of operation,
but if your workload will exceed 7,000 file operations per second, then Max I/O is your target.
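The performance mode is chosen when the file system is created and cannot be changed afterwards. A small boto3 sketch (the creation token is a placeholder):

    import boto3

    efs = boto3.client("efs")

    fs = efs.create_file_system(
        CreationToken="shared-media-fs",   # placeholder idempotency token
        PerformanceMode="maxIO",           # or "generalPurpose", the default
        Encrypted=True,
    )
    print(fs["FileSystemId"])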

need some guidance on usage of Amazon AWS

Every once in a while I read or hear about AWS, and now I have tried reading the docs.
But the docs seem to be written for people who already know which AWS service they need to use and are only looking up how to use it.
So, to understand AWS better, I'll sketch a hypothetical web application and ask a few questions about it.
The app's purpose is to modify content like videos or images. A user has some kind of web interface where they can upload files and choose some settings, and a server grabs the file and modifies it (e.g. re-encoding). The service also extracts the audio track of a video and tries to index the spoken words so the customer can search within their videos. (Well, it's just hypothetical.)
So my questions:
Given my own domain 'oneofmydomains.com', is it possible to host the complete web interface on AWS? I thought about using GWT to create the interface and just delivering the JS/images via AWS, but via which service: Simple Storage? What about some kind of index.html: is an EC2 instance needed to host a web server that has to run 24/7, causing costs?
Now the user has the interface with a login form; is it possible to manage logins with an AWS service? Here I'm also thinking about an EC2 instance hosting a database, but that would also cause costs and I'm not sure whether there is a better way.
The user has logged in and uploads a file. Which storage solution could be used to save the customer's original and modified content?
Now the user wants to browse the status of their uploads, which means I need some kind of ACL so that the customer only sees their own files. Do I need to use a database (e.g. on EC2) for this, or does Amazon provide some kind of ACL, so the GWT web interface is secure without any EC2 instance?
The customer's files are re-encoded and the audio track is indexed, and now they want to search for a video. Which service could be used to create and maintain the index for each customer?
I hope someone can give a few answers so that I understand AWS better and how one could use it.
thx!
Amazon AWS offers a whole ecosystem of services which should cover all aspects of a given architecture, from hosting to data storage, or messaging, etc. Whether they're the best fit for purpose will have to be decided on a case by case basis. Seeing as your question is quite broad I'll just cover some of the basics of what AWS has to offer and what the different types of services are for:
EC2 (Elastic Compute Cloud)
Amazon's compute service, which is basically the same as older virtual machine technology, but the 'cloud' offers additional conveniences such as automated provisioning, scaling, billing, etc.
You pay for what you use (by the hour); the basic instance (single CPU, 1.7 GB RAM) would probably cost you just under $3 a day if you run it 24/7 (on a Windows instance, that is).
There are a number of different OSes to choose from, including Linux and Windows; Linux instances are cheaper to run because they don't carry the license cost associated with Windows.
Once you've set up the server the way you want, including any server updates/patches, you can create your own AMI (Amazon Machine Image), which you can then use to bring up another identical instance.
However, if all your HTML is baked into the image it will make updates difficult, so the normal approach is to include a service (a Windows service, for instance) which pulls the latest deployment package from a storage service (see S3 later) and updates the site at startup and at intervals.
There's the Elastic Load Balancer (which has its own cost, but only one is needed in most cases) which you can put in front of all your web servers.
There's also the CloudWatch service (again, extra cost) which you can enable on a per-instance basis to help you monitor the CPU, network in/out, etc. of your running instances.
You can set up auto scaling, which can automatically bring up or terminate instances based on some metric, e.g. terminate one instance at a time if average CPU utilization is below 50% for 5 minutes, bring up one instance at a time if average CPU goes beyond 70% for 5 minutes.
You can use the instances as web servers, use them to run a DB or a Memcached cluster, etc.; the choice is yours.
Typically, I wouldn't recommend having Amazon instances talk to a DB outside of Amazon because the round trip is much longer; the usual approach is to use SimpleDB (see below) as the database.
The Amazon SDK contains enough classes to help you write a custom monitoring/scaling service if you ever need to, but the AWS console allows you to do most of your configuration anyway.
SimpleDB
Amazon's non-relational key-value data store. Compared to a traditional database you tend to pay a penalty in per-query performance, but you get high scalability without having to do any extra work.
You pay for usage, i.e. how much work it takes to execute your query.
Extremely scalable by default; Amazon scales SimpleDB up based on traffic without you having to do anything (and without you having any control over it, for that matter).
Data is partitioned into 'domains' (equivalent to a table in a normal SQL DB).
Data is non-relational; if you need a relational model then check out Amazon RDS. I don't have any experience with it, so I'm not the best person to comment on it.
You can still execute SQL-like queries against the database, usually through some plugin or tool; Amazon doesn't provide a front end for this at the moment.
Be aware of 'eventual consistency': data is duplicated on multiple instances after Amazon scales up your database, and synchronization is not guaranteed when you do an update, so it's possible (though highly unlikely) to update some data, then read it back straight away and get the old data back.
There are 'consistent read' and 'conditional update' mechanisms available to guard against the eventual-consistency problem. If you're developing in .NET, I suggest using the SimpleSavant client to talk to SimpleDB.
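For what it's worth, here is roughly what a consistent read looks like with Python/boto3 rather than .NET (the domain, item and attribute names are made up):

    import boto3

    sdb = boto3.client("sdb")  # SimpleDB

    # Write an attribute, then read it back with ConsistentRead to avoid stale data.
    sdb.put_attributes(
        DomainName="videos",      # placeholder domain
        ItemName="video-42",
        Attributes=[{"Name": "status", "Value": "encoded", "Replace": True}],
    )
    resp = sdb.get_attributes(
        DomainName="videos", ItemName="video-42", ConsistentRead=True
    )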
S3 (Simple Storage Service)
Amazon's storage service, again, extremely scalable, and safe too - when you save a file on S3 it's replicated across multiple nodes so you get some DR ability straight away.
You pay for the storage you use, for requests, and for data transfer out.
Files are stored against a key.
You create 'buckets' to hold your files, and each bucket has a unique URL (unique across all of Amazon, and therefore across S3 accounts).
CloudBerry S3 Explorer is the best UI client I've used in Windows
Using the Amazon SDK you can write your own repository layer which utilizes S3.
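As a sketch of what such a repository layer can look like (shown here with Python/boto3 rather than the .NET SDK; the bucket name is a placeholder):

    import boto3

    class S3Repository:
        """Minimal repository layer over a single S3 bucket."""

        def __init__(self, bucket):
            self.bucket = bucket
            self.s3 = boto3.client("s3")

        def save(self, key, data):
            self.s3.put_object(Bucket=self.bucket, Key=key, Body=data)

        def load(self, key):
            return self.s3.get_object(Bucket=self.bucket, Key=key)["Body"].read()

        def list(self, prefix=""):
            resp = self.s3.list_objects_v2(Bucket=self.bucket, Prefix=prefix)
            return [obj["Key"] for obj in resp.get("Contents", [])]

    repo = S3Repository("my-app-files")  # placeholder bucket name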
Sorry if this is a bit long-winded, but those are the three most popular web services that Amazon provides, and they should cover all the requirements you've mentioned. We've been using Amazon AWS for some time now; there are still some kinks and bugs, but it's generally moving forward and pretty stable.
One downside to using something like AWS is vendor lock-in. While you could run your services outside of Amazon in your own datacenter, or move files out of S3 (at a cost, though), getting out of SimpleDB is likely to represent the bulk of the work during a migration.