Storing data on EC2 vs S3 in AWS

I'm fairly new to AWS and am having trouble understanding what the best practice should be for hosting a product I'm developing...
I have a tool I plan on running on an EC2 instance, but it only needs to run a couple of times a year; the rest of the time I can keep the instance stopped and not incur charges. However, the product uses quite a bit of data (approx. 15-25 GB depending on the run). I understand that S3 is meant for storing data long term, but is there any reason why I can't just leave the data on the EC2 instance (even when it's stopped)? Or do I have to copy it over from S3 manually every time I want to execute a run?

Even if the instance is off, you are still incurring charges for the EBS volumes. S3 also has the advantage of being somewhat more durable.
It would probably be a good idea to back up your data or snapshot your volume to S3 as a precaution, but I would not worry about transferring data back and forth.
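As a concrete illustration of the snapshot suggestion, here is a minimal boto3 sketch that snapshots the instance's data volume before a long idle period. The volume ID and tags are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2")
# Snapshot the data volume (hypothetical volume ID) before leaving the instance stopped.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Pre-shutdown backup of tool data",
    TagSpecifications=[{"ResourceType": "snapshot",
                        "Tags": [{"Key": "purpose", "Value": "pre-shutdown-backup"}]}],
)
# Wait until the snapshot is fully stored (in S3, behind the scenes) before relying on it.
ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
print("snapshot ready:", snapshot["SnapshotId"])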

Related

Shifting/Migrating PHP Codeigniter project to AWS

Our company has a software product consisting of a web app, an Android app and an iOS app.
We have more than 350 clients, which means more than 350 MySQL databases (one per client) and a single code repository (PHP CodeIgniter). When a new client purchases our software, we just copy the old empty database and the client is able to use the software. This is our architecture.
Now we are planning to shift to AWS, but we do not know which AWS services we really need for this type of architecture.
We are on CodeIgniter 3.1, PHP 7 and MySQL.
You can implement this sort of system on a single EC2 instance, simply installing the same software as you have on your current server. However, in that case you are likely better off hosting it somewhere cheaper than AWS.
What I recommend instead is that you implement it using RDS, EC2, S3 and CloudFront.
RDS
I recommend running your database on RDS:
The database server has a completely different resource profile than PHP, so if you run into performance problems it is very hard to figure out what is happening when the database and PHP sit on the same instance. A lack of CPU can lead to a lack of memory and vice versa.
Built-in point-in-time recovery for up to 35 days has saved my bacon many, many times. It is great when you have a bug that is hard to reproduce, or when someone (you) has accidentally deleted a large amount of data (a restore sketch follows at the end of this section).
On top of this, I recommend going with Aurora MySQL instead of RDS for MySQL, especially as I expect your database size on disk to be smaller than 50 GB:
On RDS for MySQL you need to provision at least 100 GB of disk to get good enough performance for production; 100 GB gives you roughly 100 x 50 KB per second on the EBS disks that are used.
By comparison, on Aurora you get the read performance of six different storage locations without having to commit to any amount of disk space. This saves money and is more performant.
Aurora is also much faster at point-in-time restores, as well as at "dumb" queries, i.e. table scans.
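To make the point-in-time recovery benefit concrete, here is a minimal, hedged boto3 sketch of restoring an Aurora cluster to a specific moment. The cluster identifiers and timestamp are hypothetical, and in practice you would also create instances inside the restored cluster and adjust networking.

import boto3
from datetime import datetime, timezone

rds = boto3.client("rds")
# Restore a copy of the production cluster (hypothetical identifiers) to just before the mistake.
rds.restore_db_cluster_to_point_in_time(
    SourceDBClusterIdentifier="prod-aurora-cluster",
    DBClusterIdentifier="prod-aurora-cluster-restored",
    RestoreToTime=datetime(2024, 1, 15, 3, 0, tzinfo=timezone.utc),
    RestoreType="copy-on-write",
)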
EC2
I recommend looking at nothing older than the t3, c5 or m5 instances, as they use the new Nitro hypervisor and are significantly faster while being cheaper. From experience, you can go down a notch from your existing CPU count with these instances.
If you can, use c6/m6/t4 instances.
I have also found c5a and equivalents to be just as performant.
AWS recommends always using auto scaling, but if you are coming from a single server somewhere else, you are already winning because you can restore within minutes.
Once you hit $600 per month in EC2 charges, definitely look at auto scaling. Virtually every web app can be written in a way that allows a server to be replaced at any point in time. With auto scaling you can then use Spot Instances at a 50-90% discount for your 2nd/3rd/etc. instance and save serious money.
S3
Store all customer-provided files on S3; DO NOT get into a shared file system.
This is much cheaper than any disk or file system and has numerous automation features, such as versioning, cross-region backup, archiving, event triggers, etc.
Do not ever make your bucket publicly accessible.
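To back up that last point, here is a minimal boto3 sketch that turns on S3 Block Public Access for the bucket; the bucket name is a hypothetical placeholder.

import boto3

s3 = boto3.client("s3")
# Block every form of public access on the bucket holding customer files (hypothetical name).
s3.put_public_access_block(
    Bucket="example-customer-files",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)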
CloudFront
The key benefit of storing all customer-provided files on S3 is that you can serve them with CloudFront without paying for CPU. CloudFront only charges for traffic delivered and S3 only charges for space used, so a file delivered through CloudFront does not consume your server's CPU, sockets or network bandwidth. On top of this, transfer from EC2 to S3 and from S3 to CloudFront is free of charge; you are only charged for the traffic you already had to pay for anyway.
You need to secure your clients' files properly with signed URLs or signed cookies. For this you can either create a separate S3 bucket for each client or use one single bucket.
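As an illustration of the signed URL option, here is a hedged sketch using botocore's CloudFrontSigner together with the cryptography package. The key pair ID, private key file, distribution domain and object path are all hypothetical placeholders.

from datetime import datetime, timedelta
from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def rsa_signer(message):
    # Sign with the private key whose public half is registered with CloudFront (hypothetical file).
    with open("cloudfront_private_key.pem", "rb") as f:
        key = serialization.load_pem_private_key(f.read(), password=None)
    return key.sign(message, padding.PKCS1v15(), hashes.SHA1())

signer = CloudFrontSigner("K2JCJMDEHXQW5F", rsa_signer)  # hypothetical key pair ID
url = signer.generate_presigned_url(
    "https://d111111abcdef8.cloudfront.net/client-123/report.pdf",  # hypothetical distribution/path
    date_less_than=datetime.utcnow() + timedelta(minutes=15),
)
print(url)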
Bonus: SQS
Many things in a web application do not need to be done right now. They can wait a bit, sometimes a couple of hundred milliseconds, sometimes minutes or hours.
For anything that can wait, I recommend implementing a background process that reads from an SQS queue (a sketch follows at the end of this section). Your web application needs minimal time to push the required work and its parameters into an SQS queue, and your background process can then work through it in (rough) order of entry. Even when you use your normal web servers to process the background queues, you already get a better distribution of server load over time: you cannot control the volume of web requests, but you can control the speed at which you process background items (to a degree, of course).
Later, when you have a lot of background processing and a lot of traffic, you can consider using different servers for background processing.
There are also lots of ways to hook other event-driven code onto the items that go into your queue, including monitoring for exceeded limits on certain items, etc.
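Here is a minimal boto3 sketch of the queue pattern described above: the web application enqueues a job, and a background worker polls the queue and processes entries roughly in order. The queue URL and message fields are hypothetical placeholders.

import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/background-jobs"  # hypothetical

# In the web request: push the work item and return immediately.
sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps({"task": "resize_upload", "file_id": 42}))

# In the background worker: long-poll, process, then delete each message.
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        job = json.loads(msg["Body"])
        # ... do the actual work for `job` here ...
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])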

Loading large amount of data from local machine into Amazon Elastic Block Store

I am interested in doing some machine learning using an AWS EC2 instance. I have played around with launching instances with an attached EBS volume, and I was able to load files into it via scp on my local command line. I will have several gigabytes of data to load onto this EBS volume (I know that isn't a lot by ML standards, but that's not really my point). I would like to know the appropriate way to load this data. I'm concerned about racking up large fees because I did something in a silly way.
So far I have just uploaded a few files to the EC2 instance's associated EBS manually via the command line, like this:
scp -i keys/ec2-ml-micro2.pem data/BB000000001.png ubuntu@<my instance ip>:/data
This seems to me to be a rather primitive approach (not that that is always a bad thing). Is it the "right" way? I'm not opposed to letting a batch job run overnight like this, but I am not sure if it may incur some data transfer fees. I've looked around for information on this, and I have read the page on EBS pricing. I didn't see anything about costs associated with loading data, but I just wanted to confirm with someone who has done something similar that this is the correct approach, and if not, what a better one would be.
When managing large objects in AWS, always check S3 as the initial option: it provides unlimited storage capacity and is the better fit for object storage compared to EBS (block storage). EBS bills you for the size of the volume you provision, which means there is a chance you over-provision (overhead cost) or under-provision (which can lead to poor performance or even downtime).
With S3 you are billed per GB per month for the storage you actually consume, a pay-for-what-you-use model, and it is very cheap compared to EBS.
And lastly, evaluate first the AWS machine learning services that might fit your use case; it will save you a lot of time and effort.
Data Transfer from S3 to EBS within the same region is free of charge.
AWS Pricing Details
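If you follow that suggestion, a minimal boto3 sketch of the workflow might look like this: push the files to S3 from your local machine, then pull them onto the instance when you start a run. The bucket name and paths are hypothetical placeholders.

from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "example-ml-training-data"  # hypothetical bucket name

# On the local machine: upload the data set once.
for path in Path("data").glob("*.png"):
    s3.upload_file(str(path), BUCKET, f"data/{path.name}")

# On the EC2 instance, at the start of a run: pull the files down (same-region transfer is free).
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="data/").get("Contents", []):
    s3.download_file(BUCKET, obj["Key"], f"/data/{Path(obj['Key']).name}")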

Why is disk IO on my new AWS EC2 instance so much slower?

I have a regular EC2 instance A with a 200GB SSD filled with data. I used this disk to create an AMI and used that AMI to spin up another EC2 instance B with the same specs.
B started almost instantaneously, which surprised me, since I thought there would be a delay while AWS copied my 200 GB EBS volume to the SSD backing the new instance. However, I noticed that IO is extremely slow on B: it takes 3x as long to parse data on B.
Why is this, and how can I overcome this? It's too slow for my application which requires fast disk IO.
This happens because a newly-created EBS volume is built from S3 on-demand: when EC2 first reads a block from that volume it's retrieved from S3. You only get the "full" EBS performance once all blocks have been loaded. This is a huge problem, btw, for big databases restored from snapshot.
One solution may be fast snapshot restore. Although the docs don't describe what's happening behind the scenes, my guess is that they do a parallel disk copy from an existing EBS image. However, you will pay $0.75 per hour per snapshot, and are limited to 10 restores per hour.
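If you want to try fast snapshot restore, here is a minimal boto3 sketch that enables it for one snapshot in one Availability Zone. The snapshot ID and AZ are hypothetical placeholders, and the per-hour charge mentioned above applies for as long as it stays enabled.

import boto3

ec2 = boto3.client("ec2")
# Enable fast snapshot restore so volumes created from this snapshot come up fully initialized.
ec2.enable_fast_snapshot_restores(
    AvailabilityZones=["us-east-1a"],                # hypothetical AZ
    SourceSnapshotIds=["snap-0123456789abcdef0"],    # hypothetical snapshot ID
)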
Given the use-case that you described in another question, I think that the best solution is to keep an on-demand instance that you start and stop for your job. Assuming you're using Linux, you are charged per-second, so if you only run for 10-20 minutes out of the hour, you'll pay a pro-rated price. And unlike spot instances, you'll know that the machine will always be available and always be able to finish the job.
Another alternative is to just leave the spot instance running. If you're running for a significant portion of every hour, you're not really saving that much by shutting the instance down.
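A hedged sketch of the start/stop pattern suggested above, using boto3; the instance ID is a hypothetical placeholder and the job itself would be kicked off separately (for example over SSH or SSM).

import boto3

ec2 = boto3.client("ec2")
INSTANCE_ID = "i-0123456789abcdef0"  # hypothetical on-demand instance

ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
# ... run the job on the instance here, then stop it so you only pay for the minutes used ...
ec2.stop_instances(InstanceIds=[INSTANCE_ID])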

Which AWS services and specs should I best use for a file sharing web system?

I'm building a web system where 100-150 users will keep uploading/downloading ~10 GB total worth of audio files every day (an average of 150 total uploads and 250 total downloads per day).
I'm still trying to read up on the whole AWS ecosystem and I need help with the following:
For file storage, should I use S3 or EBS volumes mounted to an EC2 instance? From what I read, S3 is much cheaper and more scalable than EBS, but it's also slower. Is the speed difference really that huge or noticeable for my use case? What are the advantages of a mounted EBS volume vs. S3?
What would be the best EC2 instance type for my use case (i.e. frequent uploads and downloads)? Will the general-purpose ones (T2, M4, etc.) be enough to handle that load? (See above.)
I can provide more info on my requirements/use cases if needed. Thanks!
Start with S3. S3 is a web API for putting and retrieving huge amounts of data, whereas EBS is a block device attached to a single EC2 instance. S3 will be more scalable from a data warehousing perspective, and in terms of access from multiple concurrent instances (should you need that in the future). Only use EBS if you actually need a filesystem for some reason. It doesn't sound like you do.
From there, you can look into some data archiving if you end up having huge amounts of data that doesn't need to be regularly available, to save some money.
Yes, use a t2 to start. You should design your system, though, so that the instance type doesn't really matter and you can easily tear down/replace instances; using S3 helps with that pattern. You still need to figure out how you will deploy and configure your application on newly launched instances. You should assume that your instance will go down, disappear, etc., so you should be able to fail over to another one on demand.
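Since the answer suggests putting uploads and downloads directly on S3, here is a minimal boto3 sketch of presigned URLs, which let the browser transfer audio files to and from S3 without the traffic passing through your EC2 instance. The bucket and key names are hypothetical placeholders.

import boto3

s3 = boto3.client("s3")
BUCKET = "example-audio-files"  # hypothetical

# URL a client can use to upload a file directly to S3 (valid for 15 minutes).
upload_url = s3.generate_presigned_url(
    "put_object", Params={"Bucket": BUCKET, "Key": "user-42/track.wav"}, ExpiresIn=900)

# URL a client can use to download the same file directly from S3.
download_url = s3.generate_presigned_url(
    "get_object", Params={"Bucket": BUCKET, "Key": "user-42/track.wav"}, ExpiresIn=900)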

amazon ec2 free server with persistent data

I will install a website on the free EC2 tier from Amazon, but I read something worrying: I have a simple website which uses a database. Users come to my website and post information, send comments... If for some reason the instance breaks or Amazon shuts it down, will I lose all the information posted on my website and in the database? Will all the files users uploaded and the information saved be gone?
If so, why would someone use EC2 if you lose all your data when some problem happens? Problems always happen, so at some point I will certainly lose my data!
I know I can save an image of my current OS in AWS, but do I need to save the image every time a user posts something to my website? That seems ridiculous. I know I am missing something here, but from what I found on Google, people keep saying I should use EBS, and it's not in the free plan. So how is the AWS EC2 free plan a good idea if my data will always be at risk of being lost?
Typically you would want to use an EBS-backed instance. Since the free tier does not support that, but does offer EBS storage, create your database on an EBS volume for data you cannot lose.
30 GB of Amazon Elastic Block Storage, plus 2 million I/Os and 1 GB of snapshot storage*
http://aws.amazon.com/free/
You should have a means to quickly launch a new instance, and you should back up the data on your EBS partition because EBS volumes can and do fail from time to time.
UPDATE
It seems that Micro instances are in fact EBS-backed.
It is still advisable to attach a separate EBS volume, because it makes it much more convenient to back up the database (you create a snapshot of the EBS volume... you can find scripts online to accomplish that, which vary a bit depending on your choice of database and file system; a rough sketch follows below).
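As a rough, hedged sketch of such a backup script using boto3: snapshot the data volume, tag it, and prune snapshots older than a retention window. The volume ID, tag and retention period are hypothetical, and for a consistent database backup you would typically flush/lock the database or freeze the filesystem just before the snapshot call.

from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical data volume

# Create the nightly snapshot and tag it so it can be found again.
ec2.create_snapshot(
    VolumeId=VOLUME_ID,
    Description="nightly database backup",
    TagSpecifications=[{"ResourceType": "snapshot",
                        "Tags": [{"Key": "backup", "Value": "nightly"}]}],
)

# Prune snapshots from this backup job that are older than 14 days.
cutoff = datetime.now(timezone.utc) - timedelta(days=14)
snaps = ec2.describe_snapshots(OwnerIds=["self"],
                               Filters=[{"Name": "tag:backup", "Values": ["nightly"]}])
for snap in snaps["Snapshots"]:
    if snap["StartTime"] < cutoff:
        ec2.delete_snapshot(SnapshotId=snap["SnapshotId"])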