I'm trying to design an architecture for a "simple" problem, but so far I have not found a solution.
The problem:
I have an S3 bucket (one in each region, with bucket replication so that every bucket holds the same objects) and I would like to put CloudFront in front of it to cache objects.
My need: to have the lowest latency for each user in the world when displaying an object from S3 bucket.
I wanted to have a CloudFront distribution in front of each S3 bucket and use Route53 latency-based routing to send users to the nearest distribution. The problem is that we cannot have several distributions for the same CNAME.
Below is the architecture I have so far (which is not good).
Any idea how to achieve this?
Thanks.
C.C.
Just keep one of your buckets; CloudFront handles the worldwide distribution for you.
How CloudFront Delivers Content to Your Users
After you configure CloudFront to deliver your content, here's what happens when users request your objects:
1. A user accesses your website or application and requests one or more objects, such as an image file and an HTML file.
2. DNS routes the request to the CloudFront edge location that can best serve the request (typically the nearest CloudFront edge location in terms of latency) and routes the request to that edge location.
3. In the edge location, CloudFront checks its cache for the requested files. If the files are in the cache, CloudFront returns them to the user. If the files are not in the cache, it does the following:
   a. CloudFront compares the request with the specifications in your distribution and forwards the request for the files to the applicable origin server for the corresponding file type, for example, to your Amazon S3 bucket for image files and to your HTTP server for the HTML files.
   b. The origin servers send the files back to the CloudFront edge location.
   c. As soon as the first byte arrives from the origin, CloudFront begins to forward the files to the user. CloudFront also adds the files to the cache in the edge location for the next time someone requests those files.
For more info read the following doc:
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/HowCloudFrontWorks.html
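The cache-miss flow described in the steps above can be sketched as a toy model. This is only an illustration of the idea, not how CloudFront is implemented: the class name `EdgeCacheSketch` and the `origin` function are made up for this example, and real edge caches also handle expiry and eviction, which are left out here.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of a single edge location: a cache sitting in front of an origin.
final class EdgeCacheSketch {
    private final Map<String, String> cache = new HashMap<>();
    // Stands in for the origin (an S3 bucket or an HTTP server); hypothetical.
    private final Function<String, String> origin;
    private int originFetches = 0;

    EdgeCacheSketch(Function<String, String> origin) {
        this.origin = origin;
    }

    // Returns the object for a path, contacting the origin only on a cache miss.
    String get(String path) {
        String cached = cache.get(path);
        if (cached != null) {
            return cached;              // cache hit: served straight from the edge
        }
        originFetches++;                // cache miss: forward the request to the origin
        String body = origin.apply(path);
        cache.put(path, body);          // keep a copy for the next request
        return body;
    }

    // Number of requests that had to go all the way to the origin.
    int originFetches() {
        return originFetches;
    }
}
```

Requesting the same path twice through one `EdgeCacheSketch` instance reaches the origin only once, which is why a single bucket behind CloudFront is usually enough for frequently requested objects.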
To deliver content to end users with lower latency, Amazon CloudFront uses a global network of 138 Points of Presence (127 Edge Locations and 11 Regional Edge Caches) in 63 cities across 29 countries.
One option is to create a single CloudFront distribution and attach a Lambda@Edge function to it that rewrites the Host header of the request. Inside the function you can read all the request headers and rewrite them based on any logic you want. When you rewrite the Host header, the request is sent to a different bucket in another region.
We used this solution to build multi-region active-active delivery from replicated buckets in two regions.
The original idea is from here: https://medium.com/buildit/a-b-testing-on-aws-cloudfront-with-lambda-edge-a22dd82e9d12
This seems to be the same solution for a different problem: https://aws.amazon.com/blogs/apn/using-amazon-cloudfront-with-multi-region-amazon-s3-origins/
We presented our solution at the AWS Summit in Berlin this year, but haven't posted about it anywhere yet.
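To make the idea a bit more concrete, here is a minimal sketch of just the routing decision. Lambda@Edge functions themselves are written in Node.js or Python, so the Java below only illustrates the logic and is not deployable as-is. The bucket endpoints and the country-to-bucket mapping are invented for the example, and it assumes the distribution is set up to pass the `CloudFront-Viewer-Country` header to the function.

```java
import java.util.Locale;
import java.util.Map;

// Picks which replicated bucket should serve a request, based on the viewer's country.
final class OriginSelectorSketch {
    // Hypothetical bucket endpoints; replace with your own replicated buckets.
    static final String DEFAULT_ORIGIN = "assets-us.s3.us-east-1.amazonaws.com";
    static final Map<String, String> ORIGIN_BY_COUNTRY = Map.of(
            "DE", "assets-eu.s3.eu-central-1.amazonaws.com",
            "FR", "assets-eu.s3.eu-central-1.amazonaws.com",
            "SG", "assets-ap.s3.ap-southeast-1.amazonaws.com");

    // viewerCountry is the two-letter code from the CloudFront-Viewer-Country header.
    // Returns the value the Host header would be rewritten to.
    static String originHostFor(String viewerCountry) {
        if (viewerCountry == null) {
            return DEFAULT_ORIGIN;      // header missing: fall back to the default bucket
        }
        String code = viewerCountry.toUpperCase(Locale.ROOT);
        return ORIGIN_BY_COUNTRY.getOrDefault(code, DEFAULT_ORIGIN);
    }
}
```

In a real origin-request function you would probably also have to update the origin's domain name in the request object along with the Host header; the Lambda@Edge event structure documentation is the place to check the exact fields.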
The answer provided by @Reza Mousavi is pretty thorough. The point of a CloudFront distribution is to cache objects at edge locations worldwide (see the options shown while configuring, in the attached snapshot).
Best practice (at least what I do, with no complaints so far) is to configure a single distribution for each application origin. The configuration options let you choose which regions to use based on where your customers are.
AWS Solutions has launched a new solution to address S3 replication across regions.
For example, you can create objects in Oregon, rename them in Singapore, and delete them in Dublin, and the changes are replicated to all other regions. This solution is designed for workloads that can tolerate lost events and variations in replication speed.
https://aws.amazon.com/solutions/multi-region-asynchronous-object-replication-solution/
We keep user-specific downloadable files in AWS S3 buckets in the N. Virginia region. Our clients download the files from these buckets from all over the world. File sizes range from 1 to 20 GB. For larger files, clients in non-US locations complain about slow or interrupted downloads. How can we optimize these downloads?
We are thinking about the following approaches:
Accelerated downloads (higher costs)
Use of CloudFront CDN with an S3 origin (since each file is different and is downloaded just once or twice, will a CDN help, given that the first request still fetches the data from the US bucket?)
Use of Akamai as CDN (same concern as with CloudFront; the only difference is that we have a better price deal with Akamai at org level)
Depending on the user's location (we know where the download will happen), we can keep the file in a bucket created in that AWS region.
So, I want recommendations in terms of cost+download speed. Which may be a better option to explore further?
As each file will only be downloaded a few times, you won't benefit from CloudFront's caching: the likelihood that the download requests all hit the same CloudFront node, and that this node hasn't yet evicted the file from its cache, is probably near zero, especially for such large files.
On the other hand you gain something else by using CloudFront or S3 Transfer Acceleration (the latter one being essentially the same as the first one without caching): The requests enter AWS' network already at the edge, so you can avoid using congested networks from the location of the user to the location of your S3 bucket, which is usually the main reason for slow and interrupted downloads.
Storing the data depending on the user's location would improve the situation as well, although CloudFront edge locations are usually closer to a user than the nearest AWS region with S3. Another reason for not distributing the files to different S3 buckets depending on the user's location is the management overhead: you need to manage multiple S3 buckets, store each file in the correct bucket, and point each user to the correct bucket. While storing could be simplified by using S3 Replication (you could use a filter to replicate objects only to the target bucket intended for them), the overhead of managing multiple endpoints for multiple customers remains. Also, while you state that you know the location of the customers, what happens if a customer changes location and suddenly wants to download an object which is now stored on the other side of the world? You'd have the same problem again.
In your situation I'd probably choose option 2 and set up CloudFront in front of S3. I'd prefer CloudFront over S3 Transfer Acceleration, as it gives you more flexibility: You can use your own domain with HTTPS, you can later on reconfigure origins when the location of the files changes, etc. Depending on how far you want to go you could even combine that with S3 replication and have multiple origins for your CloudFront distribution to direct requests for different files to S3 buckets in different regions.
Which solution to choose depends on your use case and constraints. One constraint seems to be cost for you, another one could for example be the maximum file size of 20GB supported by CloudFront, if you have files to distribute larger than that.
I have searched a lot on this but all I get is using CloudFront (CDN) in collaboration with S3.
I want to do something different.
CloudFront works as a CDN with its Origin set to either my domain where images are, or S3.
If I set it to my domain, there is an issue of having my hosting space used.
If I use it with S3, the question is, how to get my images to S3 without much hassle? In case of CDN, this is automatic, as every call to CloudFront copies the image from my server automatically.
Is it possible that CloudFront works with S3 but if image is not present on S3, it copies it from my server to S3?
Or maybe S3 itself works as a CDN (the best solution). I have seen some sites use S3 URLs for hosting their images, like this:
https://retsimages.s3.amazonaws.com/14/A10363214_6.jpg
How is that possible?
If I set it to my domain, there is an issue of having my hosting space used.
More expensive than the storage space is the cost of having a server sitting there ready to handle the request. Your application logic knows when the images change; that's the time to put them in S3.
how to get my images to S3 without much hassle?
There's an SDK for just about every language, so upload the image as it comes in. Use s3cmd sync to move the images you have. Then you can just turn off your server.
Or maybe S3 itself works as a CDN
CloudFront can use a customer-provided DNS name and matching certificate, so you can use a custom domain with HTTPS. It can integrate with AWS WAF, which S3 cannot do directly. Otherwise, CloudFront behaves similarly to S3. CloudFront should provide better caching and endpoint locality, but you'll see little functional difference at low volumes. S3 now provides strong read-after-write consistency, whereas CloudFront can keep serving a cached copy until it expires or is invalidated. Pricing is unlikely to make CloudFront cheaper for most uses.
Is it possible that CloudFront works with S3 but if image is not present on S3, it copies it from my server to S3?
Close.
CloudFront does have a feature that would help move you in this direction -- origin groups. Create an origin group with S3 as primary and your server as secondary. Any time CloudFront encounters a cache miss, it will first check S3, and only if the image is not there, it will retry by sending the request to your server. It will cache the response, but it will not remember the source of the object -- so subsequent requests on future cache misses for the same object will always try S3 first.
This means something on your server needs to be responsible for ultimately moving images to S3 -- but as long as the image exists in one place or the other, the image will be served by CloudFront and cached in CloudFront in the edge or edges (up to two -- one global/outer, one regional/inner) that handled the request.
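As a rough sketch of the failover behaviour described above: the code below models an origin group as "try the primary, and retry against the secondary only for certain status codes". It is a simplification written for this answer, not CloudFront's implementation; the class and method names are hypothetical, and the set of status codes is an assumption (in CloudFront you choose the failover codes when you configure the origin group).

```java
import java.util.Set;
import java.util.function.Function;

// Simplified model of an origin group: S3 as primary, your own server as secondary.
final class OriginGroupSketch {
    // A response reduced to the two things that matter for this example.
    static final class Response {
        final int status;
        final String body;

        Response(int status, String body) {
            this.status = status;
            this.body = body;
        }
    }

    // Status codes treated as "object not available from the primary" (assumed for this sketch).
    static final Set<Integer> FAILOVER_STATUSES = Set.of(403, 404, 500, 502, 503, 504);

    // Called on every cache miss: the primary is always tried first,
    // because nothing remembers which origin served the object last time.
    static Response fetch(String path,
                          Function<String, Response> primary,
                          Function<String, Response> secondary) {
        Response first = primary.apply(path);
        if (FAILOVER_STATUSES.contains(first.status)) {
            return secondary.apply(path);   // retry the same request against the fallback
        }
        return first;
    }
}
```

Note that nothing in this flow copies the image into S3; as said above, something on your server still has to do that.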
I have following S3 buckets
"client1"
"client2"
...
"clientX"
and our clients upload data to their buckets via a jar app (client1 to bucket client1, etc.). Here is a piece of the code:
BasicAWSCredentials credentials = new BasicAWSCredentials(accessKey, secretKey);
AmazonS3 s3client = new AmazonS3Client(credentials);
s3client.setRegion(Region.getRegion(Regions.US_EAST_1));
File file = new File(OUTPUT_DIRECTORY + '/' + fileName);
s3client.putObject(new PutObjectRequest(bucketName, datasource + '/' + fileName, file));
The problem is that they have a firewall for outbound traffic, so they must allow the URL .amazonaws.com in the firewall. Is it possible to set the endpoint to our own domain, storage.domain.com?
We expect to change region in the future, but all our clients are currently locked to amazonaws.com (the US_EAST_1 region), so all of them would need to change their firewall rules.
If the endpoint were storage.domain.com, everything would be OK :)
Example of expected clients URL
client1 will put data to URL client1.storage.domain.com
client2 will put data to URL client2.storage.domain.com
clientX will put data to URL clientX.storage.domain.com
We know about the CloudFront setting, but it's per bucket. We are looking for a solution with one global AWS setting. How can we do that?
Thank you very much
Not sure if this will be affordable for you (due to extra fee you may have), but this should work:
Create Route53 with your domain and subdomains (client1, client2.. clientX)
Create (or use default) VPC with endpoints (https://aws.amazon.com/blogs/aws/new-vpc-endpoint-for-amazon-s3/)
Route all traffic from Route53 to your VPC through the internet gateway (IGW)
You may need to have Security group and NACL things configured. Let me know if you need further details.
There are numerous factors at play, here, not the least of which is support for SSL.
First, we need to eliminate one obvious option that will not work:
S3 supports naming each bucket after a domain name, then using a CNAME to point at each bucket. So, for example, if you name a bucket client-1.storage.example.com and then point a DNS CNAME (in Route 53) for client-1.storage.example.com CNAME client-1.storage.example.com.s3.amazonaws.com then this bucket becomes accessible on the Internet as client-1.storage.example.com.
This works only if you do not try to use HTTPS. The reason for the limitation is a combination of factors, which are outside the scope of this answer. There is no workaround that uses only S3. Workarounds require additional components.
Even though the scenario above will not work for your application, let's assume for a moment that it will, since it makes another problem easy to illustrate:
We are finding solution with one global AWS setting
This may not be a good idea, even if it is possible. In the above scenario, it would be tempting for you to set up a wildcard CNAME so that *.storage.example.com CNAME s3[-region].amazonaws.com which would give you a magical DNS entry that would work for any bucket with a name matching *.storage.example.com and created in the appropriate region... but there is a serious vulnerability in such a configuration -- I could create a bucket called sqlbot.storage.example.com (assuming no such bucket already existed) and now I have a bucket that you do not control, using a hostname under your domain, and you don't have any way to know about it, or stop it. I can potentially use this to breach your clients' security because now my bucket is accessible from inside your client's firewall, thanks to the wildcard configuration.
No, you really need to automate the steps to deploy each client, regardless of the ultimate solution, rather than relying on a single global setting. All AWS services (S3, Route 53, etc.) lend themselves to automation.
CloudFront seems like it holds the key to the simplest solution, by allowing you to map each client hostname to their own bucket. Yes, this does require a CloudFront distribution to be configured for each client, but this operation can also be automated, and there isn't a charge for each CloudFront distribution. The only charges for CloudFront are usage-related (per request and per GB transferred). Additional advantages here include SSL support (including a wildcard *.storage.example.com certificate from ACM, which can be shared across multiple CloudFront distributions) and the fact that with CloudFront in the path, you do not need the bucket name and the hostname to be the same.
This also gives you the advantage of being able to place each bucket in the most desirable region for that specific bucket. It is, however, limited to files not exceeding 20 GB in size, due to the size limit imposed by CloudFront.
But the problem with using CloudFront for applications with a large number of uploads of course is that you're going to pay bandwidth charges for the uploads. In Europe, Canada, and the US, it's cheap ($0.02/GB) but in India it is much more expensive ($0.16/GB), with other areas varying in price between these extremes. (You pay for downloads with CloudFront, too, but in that case, S3 does not bill you for any bandwidth charges when downloads are pulled through CloudFront... so the consideration is not usually as significant, and adding CloudFront in front of S3 for downloads can actually be slightly cheaper than using S3 alone).
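To put the two quoted rates side by side, here is a small worked example. The rates are the ones mentioned above ($0.02/GB and $0.16/GB) and may not match current pricing; the 500 GB monthly upload volume is just an assumed figure for illustration.

```java
// Back-of-the-envelope upload cost through CloudFront, using integer cents to avoid rounding noise.
final class UploadCostSketch {
    // Per-GB rates quoted in the answer, in US cents; check current pricing before relying on them.
    static final int CENTS_PER_GB_US_EU = 2;
    static final int CENTS_PER_GB_INDIA = 16;

    // Cost in cents of uploading the given number of gigabytes at the given rate.
    static long costInCents(long gigabytes, int centsPerGb) {
        return gigabytes * centsPerGb;
    }
}
```

For an assumed 500 GB of uploads per month, `costInCents(500, CENTS_PER_GB_US_EU)` gives 1000 cents ($10), while `costInCents(500, CENTS_PER_GB_INDIA)` gives 8000 cents ($80), so the client's location can change the bill by a factor of eight.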
So, while CloudFront is probably the official answer, there are a couple of considerations that are still potentially problematic.
S3 transfer acceleration avoids the other problem you mentioned -- the bucket regions. Buckets with transfer acceleration enabled are accessible at https://bucket-name.s3-accelerate.amazonaws.com regardless of the bucket region, so that's a smaller hole to open, but the transfer acceleration feature is only supported for buckets without dots in their bucket names. And transfer acceleration comes with additional bandwidth charges.
So where does this leave you?
There's not a built-in, "serverless" solution that I can see that would be simple, global, automatic, and inexpensive.
It seems unlikely, in my experience, that a client who is so security-conscious as to restrict web access by domain would simultaneously be willing to whitelist what is in effect a wildcard (*.storage.example.com) and could result in trusting traffic that should not be trusted. Granted, it would be better than *.amazonaws.com but it's not clear just how much better.
I'm also reasonably confident that many security configurations rely on static IP address whitelisting, rather than whitelisting by name... filtering by name in an HTTPS environment has implications and complications of its own.
Faced with such a scenario, my solution would revolve around proxy servers deployed in EC2 -- in the same region as the buckets -- which would translate the hostnames in the requests into bucket names and forward the requests to S3. These could be deployed behind ELB or could be deployed on Elastic IP addresses, load balanced using DNS from Route 53, so that you have static endpoint IP addresses for clients that need them.
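The hostname-to-bucket translation such a proxy would perform could look roughly like the sketch below. It is only the mapping step, not a working proxy, and it makes several assumptions: the suffix `.storage.example.com`, the client names, and the rule that the bucket name equals the client label are all made up for illustration. It uses an explicit list of known clients rather than a wildcard, for the reason given earlier about unknown buckets under your domain.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Maps an incoming Host header to the bucket the proxy should forward to.
final class HostToBucketSketch {
    // Hypothetical domain suffix and client list; in practice this would come from configuration.
    static final String SUFFIX = ".storage.example.com";
    static final Set<String> KNOWN_CLIENTS = Set.of("client1", "client2");

    // Returns the bucket name for a known client hostname, or empty if the host is not recognised.
    static Optional<String> bucketFor(String hostHeader) {
        if (hostHeader == null) {
            return Optional.empty();
        }
        String host = hostHeader.toLowerCase(Locale.ROOT);
        int colon = host.indexOf(':');
        if (colon >= 0) {
            host = host.substring(0, colon);    // drop an optional ":port"
        }
        if (!host.endsWith(SUFFIX)) {
            return Optional.empty();            // not under our domain at all
        }
        String label = host.substring(0, host.length() - SUFFIX.length());
        // Only explicitly provisioned clients are accepted; anything else is rejected.
        return KNOWN_CLIENTS.contains(label) ? Optional.of(label) : Optional.empty();
    }
}
```

With this list, `client1.storage.example.com` maps to the bucket `client1`, while `sqlbot.storage.example.com` is rejected because nobody provisioned it.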
Note also that any scenario involving Host: header rewrites for requests authorized by AWS Signature V4 will mean that you have to modify your application's code to sign the requests with the real hostname of the target S3 endpoint, while sending the requests to a different hostname. Sending requests directly to the bucket endpoint (including the transfer acceleration endpoint) is the only way to avoid this.
After reading some AWS documentation, I am wondering what the difference is between these use cases if I want to deliver content (JS, CSS, images, and API requests) in Asia (including China), the US, and the EU.
Store my images and static files in an S3 bucket in a US region and set up cross-region replication to EU and Asia (Japan or Singapore) buckets to keep them in sync with the US bucket.
Store my images and static files in an S3 bucket in a US region and set up the CloudFront CDN to cache my content in different locations after the initial request.
Do both of the above (if there is a significant performance improvement).
What is the most cost-effective solution if I need to achieve global deployment? And how can I make requests from China consistent and stable (I tried CloudFront + S3 (us-west); it's fast, but the performance is not consistent)?
PS. In the early stage, I don't expect many user requests, but the users are spread globally and I want them to have a similar experience. The majority of my content is panorama images, for which I'd expect to load ~30 MB of data (10 high-res images) sequentially in each visit.
Cross region replication will copy everything in a bucket in one region to a different bucket in another region. This is really only for extra backup/redundancy in case an entire AWS region goes down. It has nothing to do with performance. Note that it replicates to a different bucket, so you would need to use different URLs to access the files in each bucket.
CloudFront is a Content Delivery Network. S3 is simply a file storage service. Serving a file directly from S3 can have performance issues, which is why it is a good idea to put a CDN in front of S3. It sounds like you definitely need a CDN, and it sounds like you have tested CloudFront and are unimpressed. It also sounds like you need a CDN with a larger presence in China.
There is no reason you have to choose CloudFront as your CDN just because you are using other AWS services. You should look at other CDN services and see what their edge networks look like. Given your requirements I would highly recommend you take a look at CloudFlare. They have quite a few edge network locations in China.
Another option might be to use a CDN that you can actually push your files to. I've used this feature in the past with MaxCDN. You would push your files to the CDN via FTP, and the files would automatically be pushed to all edge network locations and cached until you push an update. For your use case of large image downloads, this might provide a more performant caching mechanism. MaxCDN doesn't appear to have a large China presence though, and the bandwidth charges would be more expensive than CloudFlare.
If you want to serve the files in your S3 buckets all around the world, you may consider using S3 Transfer Acceleration. It can be used whether you upload to or download from your S3 bucket. You may also try AWS Global Accelerator.
CloudFront's job is to cache content at hundreds of caches ("edge locations") around the world, making them more quickly accessible to users around the world. By caching content at locations close to users, users can get responses to their requests more quickly than they otherwise would.
S3 Cross-Region Replication (CRR) simply copies an S3 bucket from one region to another. This is useful for backing up data, and it also can be used to speed up content delivery for a particular region. Unlike CloudFront, CRR supports real-time updating of bucket data, which may be important in situations where data needs to be current (e.g. a website with frequently-changing content). However, it's also more of a hassle to manage than CloudFront is, and more expensive on a multi-region scale.
If you want to achieve global deployment in a cost-effective way, then CloudFront would probably be the better of the two, except in the special situation outlined in the previous paragraph.
Since the Heroku file system is ephemeral, I am planning on using AWS for the static assets of my Django project on Heroku.
I am seeing two conflicting articles. This one says to use S3:
https://devcenter.heroku.com/articles/s3
The other one, below, says S3 has disadvantages and recommends using the CloudFront CDN instead:
https://devcenter.heroku.com/articles/using-amazon-cloudfront-cdn
Many developers make use of Amazon’s S3 service for serving static
assets that have been uploaded previously, either manually or by some
form of build process. Whilst this works, this is not recommended as
S3 was designed as a file storage service and not for optimal delivery
of files under load. Therefore, serving static assets from S3 is not
recommended.
Amazon CloudFront is a Content Delivery Network (CDN) that integrates with other Amazon Web Services like S3 and gives us an easy way to distribute content to end users with low latency and high data transfer speeds.
CloudFront makes your static files available from data centers around the world (called edge locations). When a visitor requests a file from your website, he or she is invisibly redirected to a copy of the file at the nearest edge location (AWS had around 35 edge locations spread across the world when this was written), which results in faster download times than if the visitor had accessed the content from an S3 bucket located in one particular region.
So if your user base is spread across the world, CloudFront is the better option. If your users are localized, you will not find much difference between CloudFront and S3 (but in that case you need to choose the right location for your S3 bucket: US East, US West, Asia Pacific, EU, South America, etc.).
Comparative features of Amazon S3 and CloudFront
My recommendation is to use CloudFront on top of Whitenoise. You will be serving the static assets directly from your Heroku app, but CloudFront as the CDN will take over once you reach scale.
Whitenoise radically simplifies build processes and the need to use convoluted caching headers.
Read http://whitenoise.evans.io/en/latest/ for the full manifesto.
(Note that Whitenoise is relevant only for static assets bundled with your app, not for user-uploaded files, which still require S3 for proper storage. You'd still want to use CF though.)
Actually, you should use both.
CloudFront only acts as a CDN, which basically means it caches resources in edge locations all over the world. In order for this to work, it has to initially download those resources from an origin location, whenever they expire or don't yet exist.
CloudFront distributions support two kinds of origin: an S3 bucket or a custom HTTP origin (for example, a web server running on EC2). In your case, you should store your assets in S3 and connect the bucket to a CloudFront distribution. Use the CloudFront links for actually serving the assets, and S3 for storage.
This will ensure the best possible performance, as well as correct and scalable load handling.
Hope this helps, let me know if you need additional info in the comments section.