Is switching over to Amazon S3 for Drupal 7 image hosting worth it?

So I just have a quick question regarding Amazon S3.
I have a small Drupal 7 site hosted on a VPS without much storage space. I put together the site for members of my school's Photographic Arts Committee to upload photos of school events and projects.
The full-quality photos are stored in a private folder on the server, and the images displayed on the site are watermarked 2048px width ones stored publicly.
I'm worried that I'm going to blow through my storage space very fast, and I fear that I'm going to blow my not-really-existent budget by using Amazon S3 with the module in Drupal.
So I would like to know whether using Amazon S3 is a worthwhile investment; I'd be willing to spend roughly $5 on it.
My monthly usage will include about 3 GB of uploads and probably 20 GB of downloads at most, obviously slowly increasing.
Also, I'm a bit confused about storage billing: do I have to pay for, say, my 50 GB of accumulated storage from previous months' uploads, or just the 3 GB of storage I added this month?
PS: I live in South Africa and will probably use the Ireland S3 region, as it has the best latency.
Any feedback much appreciated!
Thanks.

S3 may be a good option in your case, given your limited storage space.
You can calculate things fairly easily. The per-request charges are tiny, but including them anyway, here's the formula for Ireland:
(GB of storage * 0.03) + (average image size in GB * requests * 0.09) + (requests * 0.004/10000)
There are some volume discounts and some "first N GB of transfer free" allowances, but this is a good ceiling, especially for a low-volume site like you describe. Also note that storing the full-size images (and not downloading them) means only the first term of the formula matters. As an example, say you have 5 GB of full-size images plus another 1 GB of 350 kB "2048px" images that get 10,000 image views per month:
full-size: 5*.03=.15
2048 hosting/downloads: (1*.03)+(0.00033*10000*.09)+(10000*.004/10000)=0.331
So, your monthly costs are about 50 cents.
What happens if your site is slashdotted? Imagine you get 10 million hits:
full-size: 5*.03=.15
2048 hosting/downloads: (1*.03)+(0.00033*10000000*.09)+(10000000*.004/10000)=301.03
So, your monthly cost is now over $300. (this is why billing alarms are important!)
Now, let's imagine you put cloudfront in front of S3 (which is a really good idea for several reasons) and look at the pricing in this scenario. (I've simplified the pricing here a little bit, and assuming nothing is loaded twice by the same browser, so no caching)
full-size: 5*.03=.15
2048 hosting/downloads: (1*.03)+(0.00033*10000000*.085)+(10000000*.009/10000)=289.53
So it saved about $10 and gave you better performance.
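If you want to sanity-check your own numbers, the formula above is easy to script. Here's a minimal Python sketch; the rates are the same assumed Ireland prices used in the examples, so treat them as placeholders and check the current price list:
    # Rough estimator for the formula above. The rates are assumptions taken
    # from the worked examples (Ireland, at the time of writing); substitute
    # current prices from the S3 pricing page before relying on the output.
    def s3_monthly_cost(storage_gb, avg_object_gb, downloads,
                        storage_rate=0.03,            # $/GB-month of storage
                        transfer_rate=0.09,           # $/GB transferred out
                        get_rate=0.004 / 10000):      # $ per GET request
        storage = storage_gb * storage_rate
        transfer = avg_object_gb * downloads * transfer_rate
        requests = downloads * get_rate
        return storage + transfer + requests

    # Quiet month: 5 GB of originals (never downloaded) plus 1 GB of
    # 350 kB watermarked images viewed 10,000 times.
    quiet = s3_monthly_cost(5, 0, 0) + s3_monthly_cost(1, 0.00033, 10000)
    print(f"quiet month: ${quiet:.2f}")        # roughly $0.48

    # "Slashdotted" month: 10 million views of the watermarked images.
    busy = s3_monthly_cost(5, 0, 0) + s3_monthly_cost(1, 0.00033, 10000000)
    print(f"busy month:  ${busy:.2f}")         # roughly $301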
If you need more features (image resizing, for instance), you may want to consider a photo host like Flickr or Smugmug. They pay for bandwidth, which makes your costs more predictable.

Related

How much would AWS EC2 cost for a project of my type?

I have tried many times to install RStudio Server on an AWS instance using terminal commands, without any luck. I can install it using http://www.louisaslett.com/RStudio_AMI/
and following a YouTube video, but I cannot get the Dropbox sync to stop "syncing". I have tried installing a fresh version using the terminal and PuTTY and other methods without much success.
What I wanted to use AWS for was to use the bandwidth / computing time.
I basically wanted to run an R script to download a bunch of documents, which could take 2 weeks. I had hoped to save these to a large Dropbox account I have access to, but unfortunately
library("RStudioAMI")
linkDropbox()
excludeSyncDropbox("*")
doesn't seem to work for me; the whole Dropbox folder gets synced onto my AWS instance and I run out of space.
So basically... I think I will forget dropbox and just use AWS storage.
I want to download approx. 500 GB, or perhaps 1 TB, worth of data (running an R script to download documents and save them). It just connects to a website and downloads a document, so no ML or heavy computing power is needed, just a consistent connection. Once the documents are fully downloaded, I would like to transfer them to an external hard drive I have for further analysis.
So my question is, "approximately" how much do you think this may cost? I don't mind paying $20-30; I just don't want to go in with inexperience/without knowledge and rack up hundreds of dollars.
Additionally: what other instances/servers do you suggest I pay for? I feel like I don't need that much power, just consistency.
Here is another SO question I opened:
Amazon AWS Dropbox link error: "No directories are being ignored."
There will be three main costs for your scenario:
Amazon EC2, which is charged hourly. You do not need much processing power, so a t3.small would probably be adequate if you're not doing any big computations. It's only about 2c/hour, which is $7 for 2 weeks.
An Amazon EBS disk volume attached to your Amazon EC2 instance for storing the data. A General Purpose volume is 10c/GB/month. So, 1TB for 2 weeks would be $50. If you configure it to use "Cold HDD (sc1)", then it's a quarter of that price.
Data Transfer for when you download from AWS. If you are using AWS in the USA, it is 9c/GB. So, 1TB = $90. This would be your major cost.
There might be some other minor costs, but they won't be significant compared to the above.
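If it helps to see those three line items together, here's a small back-of-the-envelope sketch in Python; all rates are the approximate figures quoted above, so treat them as assumptions rather than current prices:
    # Back-of-the-envelope estimate for the scenario above. The rates are the
    # approximate figures quoted in this answer; check current EC2/EBS/data
    # transfer pricing for your region before relying on them.
    hours = 24 * 14                    # about 2 weeks of runtime

    ec2 = 0.02 * hours                 # small instance at ~2c/hour        -> ~$7
    ebs_gp = 0.10 * 1000 * 0.5         # 1 TB gp volume, ~10c/GB-month,
                                       # kept for about half a month       -> ~$50
    ebs_sc1 = ebs_gp / 4               # Cold HDD (sc1) is roughly a quarter of that
    transfer = 0.09 * 1000             # 1 TB out of AWS at ~9c/GB         -> ~$90 (dominant cost)

    print(f"EC2:            ${ec2:.0f}")
    print(f"EBS (gp):       ${ebs_gp:.0f}  (sc1: ${ebs_sc1:.0f})")
    print(f"Data transfer:  ${transfer:.0f}")
    print(f"Total (gp):     ${ec2 + ebs_gp + transfer:.0f}")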
Or, given that your basic goal is to collect and download data, you could just do it on a computer at home.
If you are not strictly limited to EC2 (which I think you are not, considering the requirements you stated and that the AMI approach failed for you), AWS Lightsail would be a much better solution.
It has a bundled data transfer allowance and acceptable performance.
Here is the 1-month plan:
512 MB Memory
1 Core Processor
20 GB SSD Disk
1 TB Transfer (data in costs nothing; only data out is billed, e.g. from Lightsail to your local PC)
Additional SSD - $10 for 1 TB
The average network performance I see for that instance is about 30 megabytes per second. You can just shut everything down and be billed only for the hours you used in the month.

AWS S3 Standard Infrequent Access vs Reduced Redundancy storage class when coupled with CloudFront?

I'm using CloudFront to cache and distribute all of my thumbnails currently stored on S3 in Standard storage class. Since CloudFront caches originals and accesses them only every 24 hours, it makes sense to use a cheaper storage class than Standard: either Standard Infrequent Access (IA) or Reduced Redundancy (RR). But I'm not sure which one would be more suitable and cost effective.
Standard-IA has the cheapest storage among all (58% cheaper than Standard class and 47% cheaper than RR), but 60% more expensive requests than both Standard and RR. However, all files under 128kb stored in Standard-IA class are rounded to 128kb when calculating cost, which would apply to most of my thumbnail images.
Meanwhile, storage in RR class is only 20% cheaper than Standard, but its request cost is 60% cheaper than that of Standard-IA.
I'm unsure which one would be most cost effective in practice and would appreciate anyone with experience using both to give some feedback.
There's a problem with the premise of your question. The fact that CloudFront may cache your objects for some period of time actually has little relevance when selecting an S3 storage class.
REDUCED_REDUNDANCY is sometimes less expensive¹ because S3 stores your data on fewer physical devices, reducing the reliability somewhat in exchange for lower pricing... and in the event of failures, the object is statistically more likely to be lost by S3. If S3 loses the object because of the reduced redundancy, CloudFront will at some point begin returning errors.
The deciding factor in choosing this storage class is whether the object is easily replaced.
Reduced Redundancy Storage (RRS) is an Amazon S3 storage option that enables customers to reduce their costs by storing noncritical, reproducible data at lower levels of redundancy than Amazon S3’s standard storage. It provides a cost-effective, highly available solution for distributing or sharing content that is durably stored elsewhere, or for storing thumbnails, transcoded media, or other processed data that can be easily reproduced.
https://aws.amazon.com/s3/reduced-redundancy/
STANDARD_IA (infrequent access) is less expensive for a different reason: the storage savings are offset by retrieval charges. If an object is downloaded more than once per month, the combined charge will exceed the cost of STANDARD. It is intended for objects that will genuinely be accessed infrequently. Since CloudFront has multiple edge locations, each with its own independent cache,² whether an object is "currently stored in" CloudFront is not a question with a simple yes/no answer. It is also not possible to "game the system" by specifying large Cache-Control: max-age values. CloudFront has no charge for its cache storage, so it's only sensible that an object can be purged from the cache before the expiration time you specify. Indeed, anecdotal observations confirm what the docs indicate, that objects are sometimes purged from CloudFront due to a relative lack of "popularity."
The deciding factor in choosing this storage class is whether the increased data transfer (retrieval) charges will be low enough to justify the storage charge savings that they offset. Unless the object is expected to be downloaded less than once or twice a month, this storage class does not represent a cost savings.
Standard/Infrequent Access should be reserved for things you really don't expect to be needed often, like tarballs and database dumps and images unlikely to be reviewed after they are first accessed, such as (borrowing an example from my world) a proof-of-purchase/receipt scanned and submitted by a customer for a rebate claim. Once the rebate has been approved, it's very unlikely we'll need to look at that receipt again, but we do need to keep it on file. Hello, Standard_IA. (Note that S3 does this automatically for me, after the file has been stored for 30 days, using a lifecycle policy on the bucket).
Standard - IA is ideally suited for long-term file storage, older data from sync and share, backup data, and disaster recovery files.
https://aws.amazon.com/s3/faqs/#sia
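For what it's worth, a lifecycle rule like the one mentioned above takes only a few lines to set up with boto3. Here's a minimal sketch; the bucket name and prefix are placeholders, not something from the original setup:
    import boto3

    # Minimal sketch: transition objects under a prefix to STANDARD_IA once
    # they are 30 days old. Bucket name and prefix are hypothetical.
    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-rebate-receipts",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "receipts-to-standard-ia",
                    "Filter": {"Prefix": "receipts/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "STANDARD_IA"},
                    ],
                },
            ]
        },
    )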
Side note: one alternative mechanism for saving some storage cost is to gzip -9 the content before storing, and set Content-Encoding: gzip. I have been doing this for years with S3 and am still waiting for my first support ticket to come in reporting a browser that can't handle it. Even content that is allegedly already compressed -- such as .xlsx spreadsheets -- will often shrink a little bit, and every byte you squeeze out means slightly lower storage and download bandwidth charges.
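If you want to try the same trick, a minimal boto3 sketch might look like this; file, bucket, and key names are hypothetical, and the point is simply compressing at level 9 and setting Content-Encoding: gzip so browsers decompress transparently:
    import gzip
    import boto3

    s3 = boto3.client("s3")

    # Compress at the maximum level before uploading (equivalent to gzip -9).
    with open("report.xlsx", "rb") as f:               # hypothetical local file
        body = gzip.compress(f.read(), compresslevel=9)

    s3.put_object(
        Bucket="example-bucket",                       # hypothetical bucket
        Key="reports/report.xlsx",
        Body=body,
        ContentEncoding="gzip",                        # lets clients decompress on the fly
        ContentType="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    )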
Fundamentally, if your content is easily replaceable, such as resized images where you still have the original... or reports that can easily be rerun from source data... or content backed up elsewhere (AWS is essentially always my first choice for cloud services, but I do have backups of my S3 assets stored in another cloud provider's storage service, for example)... then reduced redundancy is a good option.
¹ REDUCED_REDUNDANCY is sometimes less expensive only in some regions as of late 2016. Prior to that, it was priced lower than STANDARD, but in an odd quirk of the strange world of competitive pricing, as a result of S3 price reductions announced in November, 2016, in some AWS regions, the STANDARD storage class is now slightly less expensive than REDUCED_REDUNDANCY ("RRS"). For example, in us-east-1, Standard was reduced from $0.03/GB to $0.023/GB, but RRS remained at $0.024/GB... leaving no obvious reason to ever use RRS in that region. The structure of the pricing pages leaves the impression that RRS may no longer be considered a current-generation offering by AWS. Indeed, it's an older offering than both STANDARD_IA and GLACIER. It is unlikely to ever be fully deprecated or eliminated, but they may not be inclined to reduce its costs to a point that lines up with the other storage classes if it's no longer among their primary offerings.
² "CloudFront has multiple edge locations, each with its own independent cache" is still a technically true statement, but CloudFront quietly began to roll out and then announced some significant architectural changes in late 2016, with the introduction of the regional edge caches. It is now, in a sense, "less true" that the global edge caches are indepenent. They still are, but it makes less of a difference, since CloudFront is now a two-tier network, with the global (outer tier) edge nodes sometimes fetching content from the regional (inner tier) edge nodes, instead of directly from the origin server. This should have the impact of increasing the likelihood of an object being considered to be "in" the cache, since a cache miss in the outer tier might be transformed into a hit by the inner tier, which is also reported to have more available cache storage space than some or all of the outer tier. It is not yet clear from external observations how much of an impact this has on hit rates on S3 origins, as the documentation indicates the regional edges are not used for S3 (only custom origins) but it seems less than clear that this universally holds true, particularly with the introduction of Lambda#Edge. It might be significant, but as of this writing, I do not believe it to have any material impact on my answer to the question presented here.
Since CloudFront caches originals and accesses them only every 24 hours
You can actually make CloudFront cache things for much longer if you want. You just need to add metadata to your objects that sets a Cache-Control header, and according to the S3 documentation you can specify an age of up to 100 years. You simply set a max-age in seconds, so if you really want your objects cached for 100 years:
Cache-Control: max-age=3153600000
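If you manage the objects with boto3 rather than the console, a minimal sketch for setting that header might look like this (bucket, key, and file names are placeholders):
    import boto3

    s3 = boto3.client("s3")

    # Set Cache-Control at upload time (bucket/key/file are hypothetical).
    with open("photo-001.jpg", "rb") as f:
        s3.put_object(
            Bucket="example-thumbnails",
            Key="thumbs/photo-001.jpg",
            Body=f,
            ContentType="image/jpeg",
            CacheControl="max-age=3153600000",
        )

    # For objects already in the bucket, copy them over themselves and
    # replace the metadata.
    s3.copy_object(
        Bucket="example-thumbnails",
        Key="thumbs/photo-001.jpg",
        CopySource={"Bucket": "example-thumbnails", "Key": "thumbs/photo-001.jpg"},
        MetadataDirective="REPLACE",
        ContentType="image/jpeg",
        CacheControl="max-age=3153600000",
    )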
As for your main question regarding SIA vs. RR, you've pretty much hit on all the differences between the two. It's just a matter of calculating the costs of using one vs. the other. You'll just need to run some calculations and see what the cost estimates are. If you have 100 thumbnails all under 128K then SIA will charge you for 100 * 128K bytes, whereas RR will just charge you for the costs of the total size of those 100 thumbnails. Similarly, if you set a fairly high cache timeout in CloudFront then you may see only 10 fetches from S3 each day, so SIA would charge you for retrieval of 10 * 128K bytes each day while RR would only charge you for the cost of the size of those 10 thumbnails.
Using some real numbers based on the size & quantity of your thumbnails and the amount of traffic you anticipate it should be pretty easy to come up with cost estimates.
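To make that concrete, here's the shape of such a calculation in Python. Every price and traffic figure below is a placeholder assumption; plug in the current rates from the S3 pricing page and your own thumbnail counts:
    # All prices below are assumptions for illustration only; substitute the
    # current Standard-IA and Reduced Redundancy rates for your region.
    PRICE_IA_STORAGE_GB = 0.0125     # $/GB-month, Standard-IA (assumed)
    PRICE_IA_RETRIEVAL_GB = 0.01     # $/GB retrieved from Standard-IA (assumed)
    PRICE_IA_GET = 0.001 / 1000      # $ per GET against Standard-IA (assumed)
    PRICE_RR_STORAGE_GB = 0.024      # $/GB-month, Reduced Redundancy (assumed)
    PRICE_RR_GET = 0.004 / 10000     # $ per GET against Standard/RR (assumed)

    THUMB_COUNT = 100_000            # number of thumbnails (assumed)
    AVG_THUMB_KB = 40                # real average thumbnail size (assumed)
    IA_BILLED_KB = max(AVG_THUMB_KB, 128)   # Standard-IA bills small objects as 128 KB
    ORIGIN_FETCHES = 300             # S3 fetches per month after CloudFront caching (assumed)

    def kb_to_gb(kb):
        return kb / (1024 * 1024)

    ia = (kb_to_gb(IA_BILLED_KB * THUMB_COUNT) * PRICE_IA_STORAGE_GB
          + kb_to_gb(AVG_THUMB_KB * ORIGIN_FETCHES) * PRICE_IA_RETRIEVAL_GB
          + ORIGIN_FETCHES * PRICE_IA_GET)

    rr = (kb_to_gb(AVG_THUMB_KB * THUMB_COUNT) * PRICE_RR_STORAGE_GB
          + ORIGIN_FETCHES * PRICE_RR_GET)

    print(f"Standard-IA:        ${ia:.2f}/month")
    print(f"Reduced Redundancy: ${rr:.2f}/month")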
FYI, you might also want to take a look at some of these slideshows and/or these videos. These are all from Amazon's re:Invent conferences, and these links should provide you with S3-specific presentations at those conferences.

How to reduce Amazon CloudFront costs?

I have a site that has exploded in traffic the last few days. I'm using Wordpress with W3 Total Cache plugin and Amazon Cloudfront to deliver the images and files from the site.
The problem is that the cost of Cloudfront is quite huge, near $500 just the past week. Is there a way to reduce the costs? Maybe using another CDN service?
I'm new to CDN, so I might not be implementing this well. I've created a cloudfront distribution and configured it on W3 Total Cache Plugin. However, I'm not using S3 and don't know if I should or how. To be honest, I'm not quite sure what's the difference between Cloudfront and S3.
Can anyone give me some hints here?
I'm not quite sure what's the difference between Cloudfront and S3.
That's easy. S3 is a data store. It stores files, and is super-scalable (easily scaling to serving 1000's of people at once.) The problem is that it's centralized (i.e. served from one place in the world.)
CloudFront is a CDN. It caches your files all over the world so they can be served faster. If you squint, it looks like they are 'storing' your files, but the cache can be lost at any time (or if they boot up a new node), so you still need the files at your origin.
CF may actually hurt you if you have too few hits per file. For example, in Tokyo, CF may have 20 nodes. It may take 100 requests to a file before all 20 CF nodes have cached your file (requests are randomly distributed). Of those 100 requests, 20 of them will hit an empty cache and see an additional ~200 ms of latency while CF fetches the file from the origin. They generally cache your file for a long time.
I'm not using S3 and don't know if I should
Probably not. Consider using S3 if you expect your site to massively grow in media (i.e. lots of user photo uploads).
Is there a way to reduce the costs? Maybe using another CDN service?
That entirely depends on your site. Some ideas:
1) Make sure you are serving the appropriate headers. And make sure your expires time isn't too short (should be days or weeks, or months, ideally).
The "best practice" is to never expire pages, except maybe your index page which should expire every X minutes or hours or days (depending on how fast you want it updated.) Make sure every page/image says how long it can be cached.
2) As stated above, CF is only useful if each page is requested > 100's of times per cache time. If you have millions of pages, each requested a few times, CF may not be useful.
3) Requests from Asia are much more expensive than those from the US. Consider launching your server in Tokyo if you're more popular there.
4) Look at your web server logs and see how often CF is requesting each of your assets (see the log-counting sketch at the end of this answer). If it's more often than you expect, your cache headers are set up wrong. If you set "cache this for months", you should only see a handful of requests per day (as they boot new servers, etc.), and a few hundred requests when you publish a new file (i.e. one request per CF edge node).
Depending on your setup, other CDNs may be cheaper. And depending on your server, other setups may be less expensive. (i.e. if you serve lots of small files, you might be better off doing your own caching on EC2.)
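Regarding point 4, here's a rough way to count CloudFront origin fetches per asset from a combined-format access log. The log path and format are assumptions; adjust the regex and fields for your own server:
    import re
    from collections import Counter

    # Count how often CloudFront fetched each asset from your origin.
    # Assumes a combined-format access log at the path below.
    LOG_PATH = "/var/log/nginx/access.log"

    hits = Counter()
    line_re = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*".*"(?P<agent>[^"]*)"$')

    with open(LOG_PATH) as log:
        for line in log:
            m = line_re.search(line)
            # CloudFront identifies itself in the User-Agent when fetching from the origin.
            if m and "Amazon CloudFront" in m.group("agent"):
                hits[m.group("path")] += 1

    for path, count in hits.most_common(20):
        print(f"{count:6d}  {path}")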
You could give Cloudflare a go. It's not a full CDN, so it might not have all the features of CloudFront, but the basic package is free and it will offload a lot of traffic from your server.
https://www.cloudflare.com
Amazon CloudFront costs are based on two factors:
Number of requests
Data transferred in GB
Solution
Reduce image requests. To do that, combine small images into a single sprite image and use that:
https://www.w3schools.com/css/tryit.asp?filename=trycss_sprites_img (image sprites)
Don't use the CDN for video files, because videos are large and they drive the CDN cost way up.
What components make up your bill? One thing to check with the W3 Total Cache plugin is the number of invalidation requests it sends to CloudFront. It's known to send a large number of invalidation paths on each change, which can add up.
Aside from that, if your spend is predictable, one option is to use a CloudFront Security Savings Bundle to save up to 30% by committing to a minimum spend for a one-year period. It's self-service, so you can sign up in the console and purchase additional commitments as your usage grows.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/savings-bundle.html
Don't forget that CloudFront has three different price classes, which limit how widely your content is distributed but, in exchange, make it cheaper.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/PriceClass.html
The key here is this:
"If you choose a price class that doesn’t include all edge locations, CloudFront might still occasionally serve requests from an edge location in a region that is not included in your price class. When this happens, you are not charged the rate for the more expensive region. Instead, you’re charged the rate for the least expensive region in your price class."
It means that you could use Price Class 100 (the cheapest one) and still occasionally be served from edge locations in regions you are not paying for <3

Is there a way to realistically model or estimate AWS usage?

This question is specifically for aws and s3 but it could be for other cloud services as well
Amazon charges for S3 by storage (which is easily estimated as the amount of data stored times the price).
But it also charges for requests, which are really hard to estimate: a page that has one image stored in S3 technically generates one request per user per visit, but caching reduces that. Furthermore, how can I understand the costs with 1,000 users?
Are there tools that will extrapolate data of the current usage to give me estimates?
As you mention, it depends on a lot of different factors. Calculating the cost per GB is not that hard, but estimating the number of requests is a lot more difficult.
There are no tools that I know of that will calculate the AWS S3 costs based on historic access logs or the like. These calculations would also not be that accurate.
The best you can do is calculate the costs based on a worst-case scenario: assume that nothing will be cached and that you will get peak requests all the time. In 99% of cases, the actual outcome will be lower than that calculation.
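For example, a worst-case sketch along those lines might look like this; every number below is an assumption to be replaced with your own traffic estimates and the current S3 rates:
    # Worst-case sketch: assume zero caching, so every page view fetches every
    # image straight from S3. All numbers are assumptions for illustration.
    USERS_PER_MONTH = 1000
    VISITS_PER_USER = 10
    IMAGES_PER_PAGE = 5
    AVG_IMAGE_MB = 0.5

    STORAGE_GB = 50
    PRICE_STORAGE_GB = 0.023        # $/GB-month (assumed S3 Standard rate)
    PRICE_TRANSFER_GB = 0.09        # $/GB out to the internet (assumed)
    PRICE_GET = 0.004 / 10000       # $ per GET request (assumed)

    requests = USERS_PER_MONTH * VISITS_PER_USER * IMAGES_PER_PAGE
    transfer_gb = requests * AVG_IMAGE_MB / 1024

    worst_case = (STORAGE_GB * PRICE_STORAGE_GB
                  + transfer_gb * PRICE_TRANSFER_GB
                  + requests * PRICE_GET)

    print(f"{requests} requests, {transfer_gb:.1f} GB out, worst case ~ ${worst_case:.2f}/month")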
If the outcome of that calculation is acceptable pricing-wise, you're good to go. If it is way more than your budget allows, then you should think about various ways to lower these costs (caching being one of them).
Cost calculation beforehand is purely to indicate whether the project or environment can realistically stay below budget; it's not meant to provide a 100% accurate estimate. The most important thing is to keep track of the costs after everything has been deployed. Set up billing/budget alerts and check for possible savings.
The AWS pricing calculator should help you get started: https://calculator.aws/
Besides using the calculator, I tend to prefer the actual pricing pages of each individual service and calculate it within a spreadsheet. This gives me a more in-depth overview of the actual costs.

Backup: Amazon S3 or Glacier - lots of little files?

I'm trying to understand the complicated Amazon Glacier pricing model. I don't want to store a huge amount of data, just a few GB, say 10. I hope never to download the files, and if I did need to, I wouldn't care how long it takes.
Is there a cost per file I upload? Is it cheaper to zip lots of tiny files and upload them in a few chunks, or does having, say, 10,000 images not matter? (I cannot get a straight answer to this from searching.)
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
I know this is a bit old, but you may still find my answer helpful (I hope). The other answer is based on S3, which wasn't your question, I believe.
Glacier is intended for rare file access. With that in mind, they sort of punish you if you need to retrieve many files at once. In your particular case I would suggest uploading 10,000 separate files instead of, say, 100 ZIP files with 100 files each. The reason is very simple: Glacier will let you download only 5% of your total archive for free each month, prorated daily. So if, for example, you need to download 10 photos you took on a weekend, you would be able to get those 10 photos for free if they are spread across the vault. On the other hand, if you have a ZIP file containing 100 photos, you'll be forced to download a file that is probably more than 5% of the total archive, meaning you'll pay retrieval fees.
The only reason it makes sense to upload fewer files is to avoid a high number of upload requests (10,000 files usually mean 10,000 requests). Requests are charged at $0.05 per 1,000. These fees are much lower than retrieval fees (taking into account the limits imposed), which is why I would always recommend uploading separate files. Of course, you may zip files that make sense to keep together.
Retrieval costs are very complex in Amazon Glacier. They have a good explanation here:
http://aws.amazon.com/glacier/faqs/#How_much_data_can_I_retrieve_for_free
But even there you'll need to pay attention to the calculations to get a clear idea of how costs are billed.
Regarding this question:
Am I able to request the download of a whole Archive/Bucket or is it file-by-file?
Requests are file-by-file, although you can select many files at once and download them together.
Deciding whether to use S3 or Glacier really depends on your needs for file access. If you will rarely need access to your files, then Glacier is your answer. Otherwise, for 10 GB, S3 can still be cheap and is more flexible than Glacier.
In my case I find family photos to be a very precious thing. That's why I have a 100GB backup on glacier with all my family photos. I don't intend to access it unless there is some kind of disaster at home. In that case, I think I would not mind the retrieval cost if that saved something I really care about. But that's just me.
Detailed pricing information for S3 is available here. Specifics of the API functions available are here.
For S3, you are mostly charged for upload bandwidth (bytes sent TO S3), download bandwidth (bytes received FROM S3), and storage (bytes IN S3). You are also charged for the number and type of API calls.
So, if you upload your 10GB of data to S3 in 10,000 1MB files, store it for a month, and then download each of the files once, you'll be charged:
$0.00 for upload bandwidth (this is free)
$0.10 for the 10,000 PUT requests to upload the files
$0.95 for storing the 10GB for a month
$1.08 for 10GB download bandwidth (the first GB is free, then $0.12/GB)
$0.01 for the 10,000 GET requests to download the files
That's $2.14. If you uploaded and downloaded once each, but kept the data for a year, only the storage cost would go up to 12 * $0.95, or $11.40. If your files averaged only 100KB, so you had 100,000 of them, you'd pay 10 times as much for the PUT and GET requests, or $1.10 instead of $0.11.
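Reproducing those line items in a few lines of Python (the rates are the ones quoted above, which are historical; check the current pricing page before reusing them):
    # Reproduce the worked example above: 10,000 files, 10 GB total,
    # stored for one month and downloaded once each.
    files = 10_000
    total_gb = 10

    upload_bw = 0.0                                  # inbound transfer is free
    put_requests = files / 1_000 * 0.01              # $0.01 per 1,000 PUTs
    storage = total_gb * 0.095                       # $0.095 per GB-month
    download_bw = max(total_gb - 1, 0) * 0.12        # first GB free, then $0.12/GB
    get_requests = files / 10_000 * 0.01             # $0.01 per 10,000 GETs

    total = upload_bw + put_requests + storage + download_bw + get_requests
    print(f"one month, one full download: ${total:.2f}")   # about $2.14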
You can only upload and download a single file per operation. If you combined your files into one using Zip, you'd only save by using fewer operations, which, as you can see, are pretty cheap to start with.
There is one quirk here, though. I'm pretty sure you are charged for all bandwidth usage when uploading and downloading, including request headers, not just the bodies containing your data. So if your files were really tiny the request headers might become significant, perhaps as much as the files themselves. In that case your bandwidth costs would double.
Glacier pricing is more complicated, and I've never used it myself. Basically, it reduces storage cost by almost ten-fold, leaving the other costs the same, and adding costs to archive and restore per object. Those costs seem to be significant if you have a lot of small objects, need to get a lot of your files at a time, or get files frequently. Glacier seems to be best when you have a lot of data (terabytes or more, not just gigabytes), but few operations. Given that you only have 10GB of data, S3 is so inexpensive it doesn't seem worth it to consider Glacier.
Finally, AWS has a free usage tier for the first year, which looks like it would cover all your costs except for half the storage charges.
Better to use a few larger files than lots of small ones.
There are two approaches to putting files into Amazon Glacier. You either interact with vaults directly, or use S3 as frontend.
I am using S3 (and Amazon Management Console) so that I am able to see content of the archive and at the same time have it stored cheaply in Glacier.
This approach has one drawback: since storing any piece of information in Glacier has some per-object data overhead (which you pay for too), there is logically a break-even point. Before the April 2014 price reduction I did the calculation, and the critical size was about 16 kB; storing smaller files in Glacier (using AWS S3 as the frontend) was more expensive than keeping them only on S3. With the price reduction for S3 storage (Glacier did not change), the break-even point went even higher.
I guess that even without S3 as a frontend the situation will be similar, though a bit friendlier to smaller files.
On November 21, 2016, Amazon updated the free tier policy for Glacier retrievals, replacing the "5% of your average monthly storage" policy with a flat 10 GB free per month. However, if your retrieval policy was set prior to that day, then you're still on the "5%" policy and the other answers here still apply to you.
If your retrieval policy was set after Nov 21, 2016, and you're in the OP's shoes:
You're only storing 10GB, so you could retrieve all of your data for free once per month using Standard retrievals. It would make no difference if all 10,000 photos are zipped into one zip file or not (for retrievals).
The only variable in this scenario is number of upload requests. 10,000 requests at a price of $0.05 per 1,000 is only $0.50 and that's a one time fee for your specific case.
More pricing info at AWS Glacier FAQ
UPDATE:
Glacier docs recommend using multipart upload for files larger than 100MB.
I came to this conclusion independently after a couple timeouts when trying to upload an 8GB file.
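If you go through S3 as the front end (as described in an earlier answer), boto3's managed transfer switches to multipart automatically above a configurable threshold, which is one way to avoid those timeouts. A minimal sketch with placeholder names:
    import boto3
    from boto3.s3.transfer import TransferConfig

    # Multipart-friendly upload through S3. boto3 switches to multipart
    # automatically above the threshold, avoiding single-request timeouts
    # on large files. Bucket/key/filename are hypothetical.
    s3 = boto3.client("s3")

    config = TransferConfig(
        multipart_threshold=100 * 1024 * 1024,   # use multipart above ~100 MB
        multipart_chunksize=100 * 1024 * 1024,
    )

    s3.upload_file(
        "backup-2016.tar",                       # hypothetical local archive
        "example-photo-backups",                 # hypothetical bucket
        "backups/backup-2016.tar",
        Config=config,
    )
    # Transition to Glacier via a lifecycle rule on the bucket (as described
    # above), or, on newer S3, by adding ExtraArgs={"StorageClass": "GLACIER"}.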