We keep user-specific downloadable files in AWS S3 buckets in the N. Virginia region. Our clients download the files from these buckets all over the world. File sizes range from 1 to 20 GB. For larger files, clients in non-US locations experience, and complain about, slow or interrupted downloads. How can we optimize these downloads?
We are thinking about the following approaches:
Accelerated downloads via S3 Transfer Acceleration (higher cost)
Use of the CloudFront CDN with an S3 origin (since our downloads are all different files, each downloaded just once or twice, will a CDN help, given that the first request still fetches the data from the US bucket?)
Use of Akamai as the CDN (same concern as with CloudFront; the only difference is that we have a better price deal with Akamai at the org level)
Depending on the user's location (we know in advance where the download will happen), we could keep each file in a bucket created in that AWS region.
So I'm looking for recommendations in terms of cost and download speed. Which option would be better to explore further?
As each file will only be downloaded a few times, you won't benefit from CloudFront's caching: the likelihood that the download requests all hit the same CloudFront node and that this node hasn't evicted the file from its cache yet is probably near zero, especially for such large files.
On the other hand, you do gain something else by using CloudFront or S3 Transfer Acceleration (the latter being essentially the former without caching): requests enter AWS's network already at the edge, so you avoid the congested public networks between the user's location and the location of your S3 bucket, which are usually the main reason for slow and interrupted downloads.
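For what it's worth, the Transfer Acceleration side of this is only a couple of API calls. A rough boto3 sketch (bucket name and object key are placeholders) could look like this:

```python
# Enable Transfer Acceleration on the bucket, then hand out a presigned
# download URL that uses the *.s3-accelerate.amazonaws.com endpoint.
# "my-downloads-bucket" and the object key are placeholders.
import boto3
from botocore.config import Config

s3 = boto3.client("s3")
s3.put_bucket_accelerate_configuration(
    Bucket="my-downloads-bucket",
    AccelerateConfiguration={"Status": "Enabled"},
)

# A separate client configured to build URLs against the accelerate endpoint.
s3_accel = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
download_url = s3_accel.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-downloads-bucket", "Key": "exports/big-archive.zip"},
    ExpiresIn=3600,  # valid for one hour
)
print(download_url)
```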
Storing the data depending on the user's location would improve the situation as well, although CloudFront edge locations are usually closer to a user than the nearest AWS region with S3. Another reason not to distribute the files across different S3 buckets depending on the user's location is the management overhead: you need to manage multiple S3 buckets, store each file in the correct bucket, and point each user to the correct bucket. While storing could be simplified with S3 Replication (you could use a filter to replicate only the relevant objects to the target bucket for each region), the overhead of managing multiple endpoints for multiple customers remains. Also, while you state that you know the location of your customers, what happens if a customer changes location and suddenly wants to download an object that is now stored on the other side of the world? You'd have the same problem again.
In your situation I'd probably choose option 2 and set up CloudFront in front of S3. I'd prefer CloudFront over S3 Transfer Acceleration, as it gives you more flexibility: You can use your own domain with HTTPS, you can later on reconfigure origins when the location of the files changes, etc. Depending on how far you want to go you could even combine that with S3 replication and have multiple origins for your CloudFront distribution to direct requests for different files to S3 buckets in different regions.
Which solution to choose depends on your use case and constraints. One constraint for you seems to be cost; another could be the maximum file size of 20GB supported by CloudFront, if you have files to distribute that are larger than that.
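If you do go with CloudFront, a minimal distribution in front of the existing bucket can be created with a single API call. The boto3 sketch below is only a starting point (bucket name and origin ID are placeholders; a production setup would also restrict bucket access and add a custom domain with HTTPS):

```python
# Create a bare-bones CloudFront distribution with the S3 bucket as origin.
# Names are placeholders; error handling and access restrictions are omitted.
import time
import boto3

cloudfront = boto3.client("cloudfront")
response = cloudfront.create_distribution(
    DistributionConfig={
        "CallerReference": str(time.time()),  # any unique string
        "Comment": "Large-file downloads via CloudFront",
        "Enabled": True,
        "Origins": {
            "Quantity": 1,
            "Items": [{
                "Id": "s3-downloads",
                "DomainName": "my-downloads-bucket.s3.amazonaws.com",
                "S3OriginConfig": {"OriginAccessIdentity": ""},
            }],
        },
        "DefaultCacheBehavior": {
            "TargetOriginId": "s3-downloads",
            "ViewerProtocolPolicy": "redirect-to-https",
            "ForwardedValues": {"QueryString": False, "Cookies": {"Forward": "none"}},
            "MinTTL": 0,
        },
    },
)
print(response["Distribution"]["DomainName"])  # clients download via this domain
```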
Here's my situation and my goal(s):
I have a SaaS where users (globally) can upload audio files. These audio files are then later streamed (via HTML5 <audio>) to potentially anyone in the world. Currently, the only bucket hosting files is in us-west-2, which is obviously problematic when EU customers upload files, and EU users stream audio.
How can I have AWS:
Serve up audio files to a user, using the appropriate region based on their geographical location
Receive uploads using the S3 bucket (region) closest to the user uploading files
I thought maybe CloudFront would do the trick, but AFAIK, CloudFront requires a file to be downloaded once before it actually caches it, and that won't work for my SaaS. A common use case is that someone in the US might upload an important audio file for someone in Germany to listen to. I would need that person in Germany to experience as fast a streaming experience as possible, and currently I'm getting complaints of slow load times and choppy audio.
S3 cross-region replication might make sense (replicating to eu-central-1 as a good starting point, to cover customers in Scandinavia, other European countries, and the UK), but I'm not sure how to make a single S3 URL pull the file from a specific bucket based on the user's geographical location.
What's the best solution here, and how do I execute it?
To improve file upload performance, you can use Amazon S3 Transfer Acceleration which enables fast, easy, and secure transfers of files over long distances between your client and an S3 bucket. Transfer Acceleration takes advantage of Amazon CloudFront’s globally distributed edge locations. As the data arrives at an edge location, data is routed to Amazon S3 over an optimized network path.
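For example, handing the uploading client a presigned URL that targets the accelerate endpoint might look roughly like this boto3 sketch (bucket and key are placeholders, and acceleration is assumed to already be enabled on the bucket):

```python
# Generate a presigned PUT URL against the bucket's Transfer Acceleration
# endpoint so the upload enters AWS's network at the nearest edge location.
import boto3
from botocore.config import Config

s3 = boto3.client("s3", config=Config(s3={"use_accelerate_endpoint": True}))
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": "my-audio-uploads", "Key": "users/42/session.mp3"},
    ExpiresIn=900,  # 15 minutes
)
# The client then PUTs the file body to upload_url.
print(upload_url)
```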
To improve file download performance, you can use Amazon CloudFront caching. Since CloudFront caches content from the first request onwards, if you need to improve even the first-request performance per region, you can populate the cache ahead of time by requesting the URLs from the different regions.
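Note that each warming request only populates the edge location closest to wherever the request is made from, so something like the sketch below would need to run from a worker in each region you care about (the CloudFront URLs are placeholders):

```python
# Naive cache-warming sketch: fetch each CloudFront URL so the edge location
# serving this worker caches it before real users ask for it.
import requests

URLS = [
    "https://dXXXXXXXX.cloudfront.net/audio/episode-001.mp3",
    "https://dXXXXXXXX.cloudfront.net/audio/episode-002.mp3",
]

for url in URLS:
    resp = requests.get(url, stream=True)
    for _ in resp.iter_content(chunk_size=1 << 20):  # drain the body in 1 MiB chunks
        pass
    print(url, resp.status_code, resp.headers.get("x-cache"))  # e.g. "Miss from cloudfront"
```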
Background
We have developed an e-commerce application, and I want to use a CDN to improve the speed of the app and to reduce the load on the host.
The application is hosted on an EC2 server, and now we are going to use CloudFront.
Questions
After reading a lot of articles and documentation, I have created a distribution for my sample site. From this experimentation I have come to understand the following points, and I want to be sure whether I am right about them or not.
When we create a distribution, it takes all the accessible data from the given origin path; we don't need to copy or sync our files to CloudFront.
We just have to change the paths in our application to use the distribution's CNAME (if a CNAME is given).
There is no difference between placing the images/JS/CSS files on S3 or on our own host; CloudFront will fetch them either way.
The application will have thousands of product pictures. Should we place them on S3, or is it OK if they stay on the host itself? Please share any good article explaining the difference between the two approaches.
If S3 is significantly better, then I'll have to write a program to sync all such data to S3.
Thanks for the help.
Some reasons to store the images on Amazon S3 rather than your own host (and then serve them via Amazon CloudFront):
Less load on your servers
Even though content is cached in Amazon CloudFront, your servers will still be hit with requests for the first access of each object from every edge location (each edge location maintains its own cache), repeated every time that the object expires. (Refreshes generate a conditional request, and CloudFront will only re-download content that has changed or been flushed from the cache.)
More durable storage
Amazon S3 keeps copies of your data across multiple Availability Zones within the same Region. You could also replicate data between your servers to improve durability but then you would need to manage the replication and pay for storage on every server.
Lower storage cost
Storing data on Amazon S3 is lower cost than storing it on Amazon EBS volumes. If you are planning on keeping your data in both locations, then adding S3 obviously costs more, but you should also consider storing it only on S3, which makes it lower cost, more durable, and one less thing for you to back up on your server.
Reasons to NOT use S3:
More moving parts -- maintaining code to move files to S3 (see the sync sketch after this list)
Not as convenient as using a local file system
Having to merge log files from S3 and your own servers to gather usage information
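That said, the "moving parts" are usually a small script. A minimal one-way sync sketch with boto3 (bucket name and local directory are placeholders; a real sync would also compare timestamps/ETags and delete removed files):

```python
# Upload every file under a local directory to S3, keyed by its relative path.
import mimetypes
from pathlib import Path
import boto3

s3 = boto3.client("s3")
root = Path("static/products")  # placeholder local directory

for path in root.rglob("*"):
    if path.is_file():
        key = path.relative_to(root).as_posix()
        content_type, _ = mimetypes.guess_type(path.name)
        s3.upload_file(
            str(path),
            "my-assets-bucket",  # placeholder bucket
            key,
            ExtraArgs={"ContentType": content_type or "binary/octet-stream"},
        )
        print("uploaded", key)
```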
After reading some AWS documentation, I am wondering what the difference is between these use cases if I want to deliver content (JS, CSS, images, and API requests) in Asia (including China), the US, and the EU.
Store my images and static files in an S3 bucket in a US region and set up cross-region replication to EU and Asia (Japan or Singapore) buckets to keep them in sync with the US bucket.
Store my images and static files in an S3 bucket in a US region and set up the CloudFront CDN to cache my content in different locations after the initial request.
Do both of the above (if there is a significant performance improvement).
What is the most cost-effective solution if I need to achieve global deployment? And how can I make requests from China consistent and stable? (I tried CloudFront + S3 (us-west); it's fast, but the performance is not consistent.)
P.S. In the early stage I don't expect many user requests, but users are spread globally and I want them all to have a similar experience. The majority of my content is panorama images; I'd expect each visit to load ~30 MB (10 high-res images) sequentially.
Cross region replication will copy everything in a bucket in one region to a different bucket in another region. This is really only for extra backup/redundancy in case an entire AWS region goes down. It has nothing to do with performance. Note that it replicates to a different bucket, so you would need to use different URLs to access the files in each bucket.
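If you do decide to try cross-region replication, the configuration itself is fairly small. A rough boto3 sketch (bucket names and role ARN are placeholders; versioning must be enabled on both buckets and the role needs the standard replication permissions):

```python
# Replicate every object from a US bucket to an EU bucket.
import boto3

s3 = boto3.client("s3")

# Replication requires versioning on both source and destination buckets.
for bucket in ("my-assets-us-east-1", "my-assets-eu-central-1"):
    s3.put_bucket_versioning(
        Bucket=bucket,
        VersioningConfiguration={"Status": "Enabled"},
    )

s3.put_bucket_replication(
    Bucket="my-assets-us-east-1",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::123456789012:role/s3-replication-role",  # placeholder
        "Rules": [{
            "ID": "replicate-all-to-eu",
            "Priority": 1,
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix = every object
            "Destination": {"Bucket": "arn:aws:s3:::my-assets-eu-central-1"},
            "DeleteMarkerReplication": {"Status": "Disabled"},
        }],
    },
)
```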
CloudFront is a Content Delivery Network. S3 is simply a file storage service. Serving a file directly from S3 can have performance issues, which is why it is a good idea to put a CDN in front of S3. It sounds like you definitely need a CDN, and it sounds like you have tested CloudFront and are unimpressed. It also sounds like you need a CDN with a larger presence in China.
There is no reason you have to choose CloudFront as your CDN just because you are using other AWS services. You should look at other CDN services and see what their edge networks look like. Given your requirements I would highly recommend you take a look at CloudFlare. They have quite a few edge network locations in China.
Another option might be to use a CDN that you can actually push your files to. I've used this feature in the past with MaxCDN. You would push your files to the CDN via FTP, and the files would automatically be pushed to all edge network locations and cached until you push an update. For your use case of large image downloads, this might provide a more performant caching mechanism. MaxCDN doesn't appear to have a large China presence though, and the bandwidth charges would be more expensive than CloudFlare.
If you want to serve the files in your S3 buckets to users all around the world, then you may want to consider S3 Transfer Acceleration. It can be used both when you upload to and when you download from your S3 bucket. You may also want to look at AWS Global Accelerator.
CloudFront's job is to cache content at hundreds of caches ("edge locations") around the world, making them more quickly accessible to users around the world. By caching content at locations close to users, users can get responses to their requests more quickly than they otherwise would.
S3 Cross-Region Replication (CRR) copies the objects in an S3 bucket to a bucket in another region. This is useful for backing up data, and it can also be used to speed up content delivery for a particular region. Unlike a CloudFront cache, the replicated bucket holds the current bucket data rather than potentially stale cached copies, which may be important in situations where data needs to be current (e.g. a website with frequently changing content). However, it's also more of a hassle to manage than CloudFront, and more expensive at multi-region scale.
If you want to achieve global deployment in a cost-effective way, then CloudFront would probably be the better of the two, except in the special situation outlined in the previous paragraph.
Since the Heroku file system is ephemeral, I am planning on using AWS for static assets for my Django project on Heroku.
I am seeing two conflicting articles, one of which advises using AWS S3. This one says to use S3:
https://devcenter.heroku.com/articles/s3
While another one, below, says S3 has disadvantages and to use the CloudFront CDN instead:
https://devcenter.heroku.com/articles/using-amazon-cloudfront-cdn
Many developers make use of Amazon’s S3 service for serving static assets that have been uploaded previously, either manually or by some form of build process. Whilst this works, this is not recommended as S3 was designed as a file storage service and not for optimal delivery of files under load. Therefore, serving static assets from S3 is not recommended.
Amazon CloudFront is a Content Delivery Network (CDN) that integrates with other Amazon Web Services like S3 and gives us an easy way to distribute content to end users with low latency and high data transfer speeds.
CloudFront makes your static files available from data centers around the world (called edge locations). When a visitor requests a file from your website, he or she is invisibly redirected to a copy of the file at the nearest edge location (AWS currently has around 35 edge locations spread across the world), which results in faster download times than if the visitor had accessed the content from an S3 bucket located in a single region.
So if your user base is spread across the world, CloudFront is the better option; if your users are localized, you won't find much difference between using CloudFront and S3 (but in that case you need to choose the right location for your S3 bucket: US East, US West, Asia Pacific, EU, South America, etc.).
Comparative features of Amazon S3 and CloudFront
My recommendation is to use CloudFront on top of Whitenoise. You will be serving the static assets directly from your Heroku app, but CloudFront as the CDN will take over once you reach scale.
Whitenoise radically simplifies build processes and removes the need for convoluted caching headers.
Read http://whitenoise.evans.io/en/latest/ for the full manifesto.
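As a rough illustration (a sketch, not copied from the Whitenoise docs), the Django settings for this combination might look like the following, where the CloudFront domain is a placeholder and the distribution's origin is assumed to point at the Heroku app itself:

```python
# settings.py sketch: Whitenoise serves the app's static files,
# CloudFront sits in front of the Heroku app as the CDN.
import os
from pathlib import Path

BASE_DIR = Path(__file__).resolve().parent.parent

MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    "whitenoise.middleware.WhiteNoiseMiddleware",  # directly after SecurityMiddleware
    # ... the rest of your middleware ...
]

# Hashed + compressed filenames, so CloudFront can cache them for a long time.
STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"

# e.g. DJANGO_STATIC_HOST=https://dXXXXXXXX.cloudfront.net (placeholder domain)
STATIC_HOST = os.environ.get("DJANGO_STATIC_HOST", "")
STATIC_URL = STATIC_HOST + "/static/"
STATIC_ROOT = BASE_DIR / "staticfiles"
```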
(Note that Whitenoise is relevant only for static assets bundled with your app, not for user-uploaded files, which still require S3 for proper storage. You'd still want to use CF though.)
Actually, you should use both.
CloudFront only acts as a CDN, which basically means it caches resources in edge locations all over the world. In order for this to work, it has to initially download those resources from an origin location, whenever they expire or don't yet exist.
CloudFront distributions pull content from an origin, which can be an S3 bucket or a custom HTTP origin such as an EC2 server. In your case, you should store your assets in S3 and connect the bucket to a CloudFront distribution. Use the CloudFront links for actually serving the assets, and S3 for storage.
This will ensure the best possible performance, as well as correct and scalable load handling.
Hope this helps, let me know if you need additional info in the comments section.
I'd like to set up a separate s3 bucket folder for each of my mobile app users for them to store their files. However, I also want to set up size limits so that they don't use up too much storage. Additionally, if they do go over the limit I'd like to offer them increased space if they sign up for a premium service.
Is there a way I can set folder size limits through S3 configuration or its API? If not, would I have to use the APIs somehow to calculate folder size on every upload? I know Amazon has the DevPay feature, but it might be a hassle for users to sign up with Amazon if they just want to use a small amount of free space.
There does not appear to be a way to do this, probably at least in part because there is actually no such thing as "folders" in S3. There is only the appearance of folders.
Amazon S3 does not have concept of a folder, there are only buckets and objects. The Amazon S3 console supports the folder concept using the object key name prefixes.
— http://docs.aws.amazon.com/AmazonS3/latest/UG/FolderOperations.html
All of the keys in an S3 bucket are actually in a flat namespace, with the / delimiter used as desired to conceptually divide objects into logical groupings that look like folders, but it's only a convenient illusion. It seems impossible that S3 would have a concept of the size of a folder, when it has no actual concept of "folders" at all.
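So the closest you can get to a "folder size" is listing the keys under a prefix and summing their sizes, roughly like this boto3 sketch (bucket and prefix are placeholders):

```python
# Sum the sizes of all objects under a key prefix to approximate a "folder size".
import boto3

def prefix_size_bytes(bucket: str, prefix: str) -> int:
    s3 = boto3.client("s3")
    total = 0
    for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            total += obj["Size"]
    return total

print(prefix_size_bytes("my-app-uploads", "users/12345/"))
```

Note this costs one LIST request per thousand objects, which is one reason doing it on every upload doesn't scale well.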
If you don't maintain an authoritative database of what's been stored by clients (which suggests that all uploads should pass through an app server rather than going directly to S3, which is the only approach that makes sense to me at all), then your only alternative is to poll S3 to discover what's there. An imperfect shortcut would be for your application to read the S3 bucket logs to discover what has been uploaded, but those are only provided on a best-effort basis. They should be reliable but are not guaranteed to be perfect.
This service provides a best effort attempt to log all access of objects within a bucket. Please note that it is possible that the actual usage report at the end of a month will slightly vary.
Your other option is to develop your own service that sits between users and Amazon S3, that monitors all requests to your buckets/objects.
— http://aws.amazon.com/articles/1109#13
Again, having your app server mediate all requests seems to be the logical approach, and would also allow you to detect immediately (as opposed to "discover later") that a user had exceeded a threshold.
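As a rough sketch of that approach (the quota constant, bucket name, and usage-lookup helper are all hypothetical), the app server could check the quota before issuing a presigned upload URL:

```python
# Check the user's quota in your own database before handing out an upload URL.
import boto3

QUOTA_BYTES = 5 * 1024**3  # hypothetical 5 GiB free tier

def get_usage_bytes(user_id: str) -> int:
    # Placeholder: look this up in your own usage database.
    return 0

def request_upload_url(user_id: str, filename: str, size_bytes: int) -> str:
    if get_usage_bytes(user_id) + size_bytes > QUOTA_BYTES:
        raise PermissionError("Storage quota exceeded; offer the premium plan here.")
    s3 = boto3.client("s3")
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": "my-app-uploads", "Key": f"users/{user_id}/{filename}"},
        ExpiresIn=900,
    )
```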
I would maintain a separate database in the cloud to hold each user's total storage usage. It's easy to manage the count via S3 event notifications, which can trigger a Lambda that in turn writes to the database.
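A minimal sketch of such a Lambda (the DynamoDB table name and the users/<user_id>/... key layout are assumptions; delete events don't carry an object size, so handling them would also require storing per-object sizes):

```python
# Lambda handler: add each newly created object's size to the owner's counter
# in DynamoDB, driven by s3:ObjectCreated:* event notifications.
import urllib.parse
import boto3

table = boto3.resource("dynamodb").Table("user-storage-usage")

def handler(event, context):
    for record in event["Records"]:
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        user_id = key.split("/")[1]  # assumes keys like users/<user_id>/<file>
        size = record["s3"]["object"]["size"]
        table.update_item(
            Key={"user_id": user_id},
            UpdateExpression="ADD used_bytes :s",
            ExpressionAttributeValues={":s": size},
        )
```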