Can a distribution automatically match the subdomain from a request to figure out the origin - amazon-web-services

We're adding a lot of nearly equivalent apps on the same domain. Each app can be accessed through its own subdomain and has a small number of app-specific assets.
Every app refers to the same cdn.mydomain.com to get its assets from CloudFront.
Assets are namespaced. For example:
app1:
Can be reached from app1.mydomain.com
assets url is cdn.mydomain.com/assets/app1
CloudFront origin app1.mydomain.com
cache behavior /assets/app1/* to origin app1.mydomain.com
When CloudFront doesn't have the assets in cache, it downloads them from the right origin.
Currently, we add a new origin and cache behavior to the same distribution each time we add a new app.
We're trying to simplify that process so that CloudFront can fetch the assets from the right origin without us having to specify it. This would also avoid the problem of hitting the limit on the number of origins in one distribution.
Is this possible, and if so, how can we do it?
We're thinking of creating a single origin for mydomain.com with a cache behavior configured to forward the Host header, but we're not sure that this will do the trick.

Origins are tied to Cache Behaviors, which are tied to path patterns. You can't really do what you're thinking about doing.
I would suggest that you should create a distribution for each app and each subdomain. It's very easy to script this using aws-cli, since once you have one set up the way you like it, you can use its configuration output as a template to make more, with minimal changes. (I use a Perl script to build the final JSON to create each distribution, with minimal inputs like alternate domain name and certificate ARN and pipe its output into aws-cli.)
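To illustrate the templating idea, here is a minimal Python sketch (not the Perl script mentioned above). It assumes you have saved the DistributionConfig object from the output of aws cloudfront get-distribution-config for a distribution you are happy with, and that the template has exactly one origin and uses an ACM certificate. The function name and the example hostnames are hypothetical.

```python
import copy
import uuid


def clone_distribution_config(template, alias, origin_domain, certificate_arn):
    """Return a new DistributionConfig dict derived from an existing one.

    `template` is the "DistributionConfig" object from an existing
    distribution; it is left unmodified.
    """
    config = copy.deepcopy(template)
    # Every create request needs its own unique CallerReference.
    config["CallerReference"] = f"{alias}-{uuid.uuid4()}"
    config["Comment"] = f"Assets for {alias}"
    config["Aliases"] = {"Quantity": 1, "Items": [alias]}
    # Assumption: the template has exactly one origin.
    config["Origins"]["Items"][0]["DomainName"] = origin_domain
    # Assumption: the template already uses an ACM certificate.
    config["ViewerCertificate"]["ACMCertificateArn"] = certificate_arn
    return config
```

The result can be written out with json.dump and passed to aws cloudfront create-distribution --distribution-config file://new-config.json.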
I believe this is the right approach, because:
CloudFront cannot select the origin based on the Host header. Only the path pattern is used to select the origin.
Lambda@Edge can rewrite the path and can inspect the Host header, but it cannot rewrite the path before the matching is done that selects the Cache Behavior (and thus the origin). You cannot use Lambda@Edge to cause CloudFront to switch or select origins, unless you generate browser redirects, which you probably don't want to do, for performance reasons. I've submitted a feature request to allow a Lambda trigger to signal CloudFront that it should return to the beginning of processing and re-evaluate the path, but I don't know if it is being considered as a future feature -- AWS tends to keep their plans for future functionality close to the vest, and understandably so.
you don't gain any efficiency or cost savings by combining your sites in a single distribution, since the resources are different
if you decide to whitelist the Host header, that means CloudFront will cache responses, separately, based on the Host header, the same as it would do if you had created multiple distributions. Even if the path is identical, it will still cache separate responses if the Host header differs, as it must to ensure sensible behavior
the default limit for distributions is 200, while the limit for origins and cache behaviors is 25. Both can be raised by request, but the number of distributions they can give you is unlimited, while the other resources are finite because they increase the workload on the system for each request and would eventually have a negative performance impact
separate distributions give you separate logs and reports
provisioning errors have a smaller blast radius when each app has its own distribution
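The first and fourth points above can be modelled in a few lines of Python. This is only a toy model of the behavior described, not CloudFront's implementation: the behavior list and hostnames are made up, and fnmatch is used as a stand-in for CloudFront's path-pattern matching (fnmatch also gives special meaning to [...], which CloudFront patterns do not).

```python
from fnmatch import fnmatchcase

# Hypothetical (path pattern, origin) pairs, checked in order; first match wins.
BEHAVIORS = [
    ("/assets/app1/*", "app1.mydomain.com"),
    ("/assets/app2/*", "app2.mydomain.com"),
]
DEFAULT_ORIGIN = "mydomain.com"


def select_origin(path, host):
    """Pick the origin for a request.

    `host` is accepted but deliberately never consulted: only the path
    pattern decides which behavior (and therefore which origin) applies.
    """
    for pattern, origin in BEHAVIORS:
        if fnmatchcase(path, pattern):
            return origin
    return DEFAULT_ORIGIN


def cache_key(path, host, forward_host_header):
    """Model the cache key: forwarding Host makes it part of the key."""
    return (host, path) if forward_host_header else (path,)
```

With the Host header forwarded, cache_key("/a.css", "app1.mydomain.com", True) and cache_key("/a.css", "app2.mydomain.com", True) differ, so the same path is cached twice, just as it would be with two distributions.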
You can also go into AWS Certificate Manager and create a wildcard certificate for *.cdn.example.com. Then use e.g. app1.cdn.example.com as the alternate domain name for the app1 distribution and attach the wildcard cert. Then reuse the same cert on the app2.cdn.example.com distribution, etc.
Note that you also have an easy migration strategy from your current solution: You can create a single distribution with *.cdn.example.com as its alternate domain name. Code the apps to use their own unique-name-here.cdn.example.com. Point all the DNS records here. Later, when you create a distribution with a specific alternate domain name foo.cdn.example.com, CloudFront will automatically stop routing those requests to the wildcard distribution and start routing them to the one with the specific domain. You will need to change the DNS entry... but CloudFront will actually handle the requests correctly, routing them to the newly-created distribution, before you change the DNS, because it has some internal magic that will match the non-wildcard hostname to the correct distribution regardless of whether the browser connects to the new endpoint or the old... so the migration event should pretty much be a non-event.
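As a rough mental model of that migration (an illustration only, not how CloudFront is implemented internally), think of a lookup where an exact alternate domain name always wins over a wildcard one. The distribution IDs here are placeholders.

```python
def resolve_distribution(hostname, aliases):
    """Return the distribution ID that would serve `hostname`, or None.

    `aliases` maps alternate domain names (exact or "*.parent") to
    distribution IDs. An exact match takes priority over the wildcard.
    """
    if hostname in aliases:
        return aliases[hostname]
    # Swap the leftmost label for "*" and try the wildcard entry.
    _, _, parent = hostname.partition(".")
    return aliases.get(f"*.{parent}")
```

Starting with only {"*.cdn.example.com": "WILDCARD"}, foo.cdn.example.com resolves to WILDCARD. Once an entry for foo.cdn.example.com is added, it resolves to the new distribution while every other subdomain keeps resolving to WILDCARD.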
I'd suggest the wildcard strategy is a good one, anyway, so that your apps are each connecting to a specific endpoint hostname, allowing you much more flexibility in the future.

Related

CloudFlare or AWS CDN links

I have a script that I install on a page and it will load some more JS and CSS from an S3 bucket.
I have versions, so when I do a release on Github for say 1.1.9 it will get deployed to /my-bucket/1.1.9/ on S3.
Question, if I want to have something like a symbolic link /my-bucket/v1 -> /my-bucket/1.1.9, how can I achieve this with AWS or CloudFlare?
The idea is that I want to release a new version by deploying it to my bucket or whatever CDN, and then, when I am ready, switch v1 to the latest 1.x.y version released. I want all websites to point to /v1 and get the latest version when there is a new release.
Is there a CDN or AWS service or configuration that will allow me to create a sort of a linux-like symbolic link like that?
A simple solution with CloudFront requires a slight change in your path design:
Bucket:
/1.1.9/v1/foo
Browser:
/v1/foo
CloudFront Origin Path (on the Origin tab)
/1.1.9
Whatever you configure as the Origin Path is added to the beginning of whatever the browser requested before sending the request to the Origin server.
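In other words, the mapping is plain string concatenation. A one-function sketch of that rule (the function name is mine, not a CloudFront API):

```python
def origin_request_path(origin_path, request_uri):
    """Path sent to the origin: the Origin Path followed by the requested URI."""
    # A trailing slash on the Origin Path is dropped here so the result
    # does not contain a double slash.
    return origin_path.rstrip("/") + request_uri
```

So origin_request_path("/1.1.9", "/v1/foo") gives "/1.1.9/v1/foo", which is why the bucket layout above needs the extra v1 level.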
Note that changing this means you also need to do a cache invalidation, because responses are cached based on what was requested, not what was fetched.
There is a potential race condition here, between the time you change the config and invalidate -- there is no correlation in the order of operations between configuration changes and invalidation requests -- a config change submitted before an invalidation may still complete after it,¹ so you will probably need to invalidate, update the config, invalidate, verify that the distribution has progressed to a stable state, then invalidate once more. You don't need to invalidate objects individually, just /* or /v1*. It would be best if only the resource directly requested is subject to the rewrite, and not its dependencies. Remember, also, that browser caching is a big cost-saver that you can't leverage as fully if you use the same request URI to represent a different object over time.
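That sequence could be scripted roughly as follows. This is a sketch, assuming `cf` is a boto3 CloudFront client (boto3.client("cloudfront")) or any object exposing the same four methods, and that the distribution has a single origin. The function names and the default /v1* pattern are mine.

```python
import uuid


def invalidate(cf, distribution_id, pattern="/v1*"):
    """Submit an invalidation request for one path pattern."""
    return cf.create_invalidation(
        DistributionId=distribution_id,
        InvalidationBatch={
            "Paths": {"Quantity": 1, "Items": [pattern]},
            # Each invalidation request needs a unique CallerReference.
            "CallerReference": str(uuid.uuid4()),
        },
    )


def switch_origin_path(cf, distribution_id, new_origin_path):
    """Invalidate, update config, invalidate, wait for Deployed, invalidate."""
    invalidate(cf, distribution_id)

    current = cf.get_distribution_config(Id=distribution_id)
    config = current["DistributionConfig"]
    # Assumption: the distribution has exactly one origin.
    config["Origins"]["Items"][0]["OriginPath"] = new_origin_path
    cf.update_distribution(
        Id=distribution_id,
        DistributionConfig=config,
        IfMatch=current["ETag"],  # guards against concurrent config edits
    )
    invalidate(cf, distribution_id)

    # Blocks until the distribution is back in the Deployed state.
    cf.get_waiter("distribution_deployed").wait(Id=distribution_id)
    invalidate(cf, distribution_id)
```

Each call invalidates a single path pattern, so one switch counts as three invalidation paths.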
More complicated path rewriting in CloudFront requires a Lambda@Edge Origin Request trigger (or you could use Viewer Request, but these run more often and thus cost more and add to overall latency).
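For the symlink-style case in the question, such an Origin Request trigger could look something like this minimal sketch. The ALIASES table is hypothetical and would have to be redeployed (or looked up elsewhere) on each release. Responses are still cached under the /v1/... URI the viewer asked for, so the invalidation caveats above still apply.

```python
# Hypothetical mapping from a stable alias prefix to the released version.
ALIASES = {"/v1/": "/1.1.9/"}


def handler(event, context):
    """Lambda@Edge Origin Request handler: rewrite /v1/... to /1.1.9/..."""
    request = event["Records"][0]["cf"]["request"]
    uri = request["uri"]
    for alias, target in ALIASES.items():
        if uri.startswith(alias):
            # Replace only the leading alias prefix; keep the rest of the path.
            request["uri"] = target + uri[len(alias):]
            break
    # Returning the request tells CloudFront to forward it to the origin.
    return request
```

With this in place the bucket can keep its original /1.1.9/foo layout, without the extra v1 level the Origin Path approach needs.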
¹ Invalidation requests -- though this is not documented and is strictly anecdotal -- appear to involve a bit of time travel. Invalidations are timestamped, and it appears that they invalidate anything cached before their timestamp, rather than before the time they propagate to the edge locations. Architecturally, it would make sense if CloudFront is designed such that invalidations don't actively purge content, but only serve as directives for the cache to consider any cached object as stale if it pre-dates the timestamp on the invalidation request, allowing the actual purge to take place in the background. Invalidations seem to complete too rapidly for any other explanation. This means creating an invalidation request after the distribution returns to the stable Deployed state would assure that everything old is really purged, and that another invalidation request when the change is initially submitted would catch most of the stragglers that might be served from cache before the change is propagated. Changes and invalidations do appear to propagate to the edges via independent pipelines, based on observed completion timing.

Invalidate Cloudfront's cached data by passing in custom header

I need some resources or general direction.
I am looking into using Cloudfront to help combat latency on calls to my service.
I want to be able to serve cached data, but need to allow the client to be able to specify when they want to bypass cached data and get the latest data instead.
I know that I can send a random value in the query parameter to invalidate the cache. But I want to be able to send a custom header that will do the same thing.
Ideally, I would like to use the Cloudfront that is created behind the scenes with API Gateway. Is this possible? Or would I need to create a new CloudFront to sit in front of API Gateway?
Has anyone done this? Are there any resources you can point me to?
You cannot actually invalidate the CloudFront cache by passing a specific header -- or with a query parameter, for that matter. That is cache busting, and not invalidation.
You can configure CloudFront to include the value of a specific header in the cache key, simply by whitelisting that header for forwarding to the origin -- even if the origin ignores it.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-values-specify.html#DownloadDistValuesForwardHeaders
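To see why whitelisting a header gives cache busting rather than invalidation, here is a toy cache in Python. It is only a model of the cache-key idea, and the header name X-Cache-Buster is made up.

```python
class HeaderAwareCache:
    """Toy cache whose key includes the values of whitelisted headers."""

    def __init__(self, origin, whitelisted_headers):
        self.origin = origin  # callable: path -> response body
        self.whitelist = [name.lower() for name in whitelisted_headers]
        self.store = {}
        self.origin_fetches = 0

    def get(self, path, headers):
        lowered = {name.lower(): value for name, value in headers.items()}
        # The key is the path plus the value (or None) of each whitelisted header.
        key = (path,) + tuple(lowered.get(name) for name in self.whitelist)
        if key not in self.store:
            self.origin_fetches += 1
            self.store[key] = self.origin(path)
        return self.store[key]
```

A request carrying a header value not seen before misses the cache and goes to the origin, but the entries stored under earlier values are untouched and keep being served to clients that send those values.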
However... the need to give your API's consumers a way to bypass your cache suggests there's a problem with your design. Use an adaptive Cache-Control response header and cache the responses in CloudFront for an appropriate amount of time, and this issue goes away.
Otherwise, the clever ones will just bypass it all the time, by continually changing that value.
CloudFront does cache based on headers.
Create a custom header and whitelist on that header.
CloudFront will fetch from origin if the value is not found in the cache.
Hope it helps.
EDIT:
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/header-caching.html
Header based caching.

Best way to handle Cloudfront/S3 website with www redirected to bare domain

I have a website that I would like the www-prefixed version to redirect to the bare domain.
After searching for different solutions, I found this closed topic here with this answer that seems to work great: https://stackoverflow.com/a/42869783/8406990
However, I have a problem where if I update the root object "index.html" in my S3 bucket, it can take over a day before CloudFront serves the new version. I have even manually invalidated the file, and while that updates the "index.html" file correctly, CloudFront still serves the old one.
To better explain, if I type in: http://mywebsite.com/index.html, it will serve the new version. But if I type in http://mywebsite.com/, it serves the old index.html.
I went ahead and added "index.html" in the Default Root Object Property of my Cloudfront distribution (for the bare domain), and it immediately worked as I wanted. Typing in just the domain (without adding /index.html) returned the new version.
However, this is in contrast with the answer in the thread I just linked to, which explicitly states NOT to set a "default root object" when using two distributions to do the redirect. I was hoping to gain a better understanding of this "Default Root Object", and whether there is a better way to make sure the root object updates the cached version correctly?
Thank you.
If you really put index.html/ as the default root object and your CloudFront distribution is pointing to the web site hosting endpoint of the bucket and it worked, then you were almost certainly serving up an object in your bucket called index.html/ which would appear in your bucket as a folder, or an object named index.html inside a folder named index.html. The trailing slash doesn't belong there. This might explain the strange behavior. But that also might be a typo in your question.
Importantly... one purpose of CloudFront is to minimize requests to the back-end and keep copies cached in locations that are geographically near where they are frequently requested. Updating an object in S3 isn't designed to update what CloudFront serves right away, unless you have configured it to do so. One way of doing this is to set (for example) Cache-Control: public, max-age=600 on the object metadata when you save it to S3. This would tell CloudFront never to serve up a cached copy of the object that it obtained from S3 longer than 600 seconds (10 minutes) ago. If you don't set this, CloudFront will not check back for 24 hours, by default (the "Default TTL").
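As a sketch of both halves of that paragraph: the first function sets the header at upload time, assuming `s3` is a boto3 S3 client (boto3.client("s3")) or anything with a compatible put_object method. The second is a deliberately simplified model of the TTL choice that looks only at max-age and ignores s-maxage, Expires, no-cache and the distribution's minimum and maximum TTL settings. Function names are mine.

```python
import re

DEFAULT_TTL = 24 * 60 * 60  # the 24-hour Default TTL mentioned above


def upload_with_max_age(s3, bucket, key, body, max_age=600, content_type="text/html"):
    """Store an object in S3 with Cache-Control set in its metadata."""
    return s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        ContentType=content_type,
        CacheControl=f"public, max-age={max_age}",
    )


def ttl_seconds(cache_control):
    """Simplified model: use max-age if the origin sent one, else the default."""
    match = re.search(r"max-age=(\d+)", cache_control or "")
    return int(match.group(1)) if match else DEFAULT_TTL
```

Here ttl_seconds("public, max-age=600") is 600, while an object uploaded without the header falls back to 86400 seconds.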
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html
This only works in one direction -- it tells CloudFront how long it is permitted to retain a cached copy without checking for updates. It doesn't tell CloudFront that it must wait that long before checking. Objects that are requested infrequently might be released by CloudFront before their max-age expires. The next request fetches a fresh copy from S3.
If you need to wipe an object from CloudFront's cache right away, that's called a cache invalidation. These are billed $0.005 for each path (not each file) that you request be invalidated, but the first 1,000 per month per AWS account are billed at $0.00. You can invalidate all your files by requesting an invalidation for /*. This leaves S3 untouched but CloudFront discards anything it cached before the invalidation request.
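The arithmetic, using the prices quoted above (check current pricing before relying on them):

```python
from decimal import Decimal

FREE_PATHS_PER_MONTH = 1000          # per AWS account, as quoted above
PRICE_PER_PATH = Decimal("0.005")    # USD per path beyond the free tier


def invalidation_cost(paths_this_month):
    """Monthly invalidation charge in USD for the given number of paths."""
    billable = max(0, paths_this_month - FREE_PATHS_PER_MONTH)
    return billable * PRICE_PER_PATH
```

For example, 1,500 paths in a month gives 500 billable paths, or $2.50. Since /* counts as a single path, invalidating everything once per deployment normally stays inside the free allowance.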
The default root object is a legacy feature that is no longer generally needed since S3 introduced static web site hosting buckets. Before that -- and still, if you point CloudFront to the REST endpoint for the bucket -- someone hitting the root of your web site would see a listing of all your objects. Obviously, that's almost always undesirable, so the default root object allowed you to substitute a different page at the root of the site.
With static hosting in S3, you have index documents, which work in any "directory" on your site, making the CloudFront option -- which only works at the root of the site, not anywhere an index document is available -- largely redundant. So it's relatively uncommon to use this feature now.

Uploading various sized Images to AWS Cloudfront versus post processing

We are using AWS CloudFront to serve static content on our site, with an S3 bucket as the origin. As a next step, users will be able to upload images dynamically, which we want to push to the CDN. We need several sizes of each image so that we can use them later on the site. One option is to preprocess the images before pushing them to the S3 bucket, which ends up creating multiple images, one per size. Can we instead do post-processing, something like what http://imageprocessor.org/imageprocessor-web/ does, while still using CloudFront? Any feedback would be helpful.
Regards
Raghav
Well, yes, it is possible to do post-processing and use CloudFront but you need an intermediate layer between CloudFront and S3. I designed a system using the following high-level implementation:
Request arrives at CloudFront, which serves the image from cache if available; otherwise CloudFront sends the request to the origin server.
The origin server is not S3. The origin server is Varnish, on EC2.
Varnish sends the request to S3, where all the resized image results are stored. If S3 returns 200 OK, the image is returned to CloudFront and to the requesting browser and the process is complete. Since the Varnish machine runs in the same AWS region as the S3 bucket, the performance is essentially indistinguishable between CloudFront >> S3 and CloudFront >> Varnish >> S3.
Otherwise, Varnish is configured to retry the failed request by sending it to the resizer platform, which also runs in EC2.
The resizer examines the request to determine what image is being requested, and what size. In my application, the desired size is in the last few characters of the filename, so xxxxx_300_300_.jpg means 300 x 300. The resizer fetches the source image... resizes it... stores the result in S3... and returns the new image to Varnish, which returns it to CloudFront and to the requester. The resizer itself is Imagemagick wrapped in Mojolicious and uses a MySQL database to identify the source URI where the original image can be fetched.
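The flow above, reduced to a Python sketch. This is not the actual Varnish/Mojolicious code: a dict stands in for the S3 bucket, `resize` is whatever callable does the real work, and the accepted file extensions are an assumption.

```python
import re

# Matches names like xxxxx_300_300_.jpg (width and height before the extension).
SIZE_SUFFIX = re.compile(r"_(\d+)_(\d+)_\.(?:jpe?g|png|gif)$", re.IGNORECASE)


def parse_size(filename):
    """Return (width, height) encoded at the end of the filename, or None."""
    match = SIZE_SUFFIX.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def fetch_image(path, store, resize):
    """Check the backing store first; resize and save only on a miss.

    store:  dict standing in for the S3 bucket of resized results
    resize: callable (path, (width, height)) -> image bytes
    """
    if path in store:
        return store[path]
    size = parse_size(path)
    if size is None:
        raise KeyError(path)  # nothing stored and no size to generate from
    image = resize(path, size)
    store[path] = image  # persist so the work is never repeated
    return image
```

parse_size("xxxxx_300_300_.jpg") returns (300, 300), and a second fetch_image call for the same path is served from the store without calling the resizer.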
Storing the results in a backing store, like S3, and checking there, first, on each request, is a critical part of this process, because CloudFront does not work like many people seem to assume. Check your assumptions against the following assertions:
CloudFront has 50+ edge locations. Requests are routed to the edge that is optimal for (usually, geographically close to) the viewer. The edge caches are all independent. If I request an object through CloudFront, and you request the same object, and our requests arrive at different edge locations, then neither of us will be served from cache. If you are generating content on demand, you want to save your results to S3 so that you do not have to repeat the processing effort.
CloudFront honors your Cache-Control: header (or overridden values in configuration) for expiration purposes, but does not guarantee to retain objects in cache until they expire. Caches are volatile and CloudFront is no exception. For this reason, too, your results need to be stored in S3 to avoid duplicate processing.
This is a much more complex solution than pre-processing.
I have a pool of millions of images, a large percentage of which would have a very low probability of being viewed, and this is an appropriate solution, here. It was originally designed as a parallel solution to make up for deficiencies in a poorly-architected preprocessor that sometimes "forgot" to process everything correctly, but it worked so well that it is now the only service providing images.
However, if your motivation revolves around avoiding the storage cost of the preprocessed results, this solution won't entirely solve that.

Multiple distributions without assigning CNAME

From my understanding, the SSL option on CloudFront is a costly one (out of reach for me). Therefore, I am considering using the https://*.cloudfront.net option.
One of the perks of CF over S3 is the ability to assign multiple custom domains to get the benefit of concurrent parallel HTTP connections, i.e. cdn0.domain.com, cdn1.domain.com, etc.
Since custom domain + SSL is not an option, does CF have a wildcard option of the https://*[0,1,2,3].cloudfront.net variant pointing to a single distribution?
The solution in this case would be to have multiple CF distributions: one for images, another for static code (JS, CSS), etc. Typically, if you are already keeping these files in an S3 bucket, use a separate bucket for each type (say, one for images) and make these buckets the origin servers of the CF distributions.
Having said that, the number of concurrent connections in browsers has increased over time, and it is no longer that small. Typically a page needs to load only one JS file (combined, minified), one CSS file (combined, minified) and one image for icons (sprited). That is only 1 + 3 connections, which is not too high. Other images (like large thumbnails) in the page come from another CF distribution anyway. So you don't have to "artificially" create subdomains for performance.
This shows the state of connections now : (What's the maximum number of simultaneous connections a browser will make?)