CloudFlare or AWS CDN links

I have a script that I install on a page and it will load some more JS and CSS from an S3 bucket.
I have versions, so when I do a release on GitHub for, say, 1.1.9, it gets deployed to /my-bucket/1.1.9/ on S3.
Question: if I want to have something like a symbolic link /my-bucket/v1 -> /my-bucket/1.1.9, how can I achieve this with AWS or CloudFlare?
The idea is that I want to release a new version by deploying it to my bucket or whatever CDN, and then, when I am ready, switch v1 to the latest 1.x.y version released. I want all websites to point to /v1 and get the latest when there is a new release.
Is there a CDN or AWS service or configuration that will let me create a sort of Linux-like symbolic link like that?

A simple solution with CloudFront requires a slight change in your path design:
Bucket:
/1.1.9/v1/foo
Browser:
/v1/foo
CloudFront Origin Path (on the Origin tab)
/1.1.9
Whatever you configure as the Origin Path is added to the beginning of whatever the browser requested before sending the request to the Origin server.
Note that changing this means you also need to do a cache invalidation, because responses are cached based on what was requested, not what was fetched.
There is a potential race condition here between the time you change the config and the time you invalidate -- there is no correlation in the order of operations between configuration changes and invalidation requests, so a config change followed by an invalidation may be completed after it.¹ You will probably need to invalidate, update the config, invalidate again, verify that the distribution has progressed to a stable state, then invalidate once more. You don't need to invalidate objects individually, just /* or /v1*. It would be best if only the resource directly requested is subject to the rewrite, and not its dependencies. Remember, also, that browser caching is a big cost-saver that you can't leverage as fully if you use the same request URI to represent a different object over time.
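A condensed sketch of that sequence using boto3; the distribution ID is a placeholder, and I'm assuming a single origin whose Origin Path is being switched:

    # Switch the Origin Path and invalidate around the change; the distribution
    # ID and path pattern below are placeholders.
    import time
    import boto3

    cloudfront = boto3.client("cloudfront")
    DISTRIBUTION_ID = "E1234567890ABC"   # placeholder
    NEW_ORIGIN_PATH = "/1.1.9"           # the version /v1/* should now resolve to

    def invalidate(path):
        cloudfront.create_invalidation(
            DistributionId=DISTRIBUTION_ID,
            InvalidationBatch={
                "Paths": {"Quantity": 1, "Items": [path]},
                "CallerReference": str(time.time()),   # must be unique per request
            },
        )

    invalidate("/v1*")                                  # catch objects cached before the change

    # Update the Origin Path; the ETag from the read must accompany the write.
    response = cloudfront.get_distribution_config(Id=DISTRIBUTION_ID)
    config = response["DistributionConfig"]
    config["Origins"]["Items"][0]["OriginPath"] = NEW_ORIGIN_PATH
    cloudfront.update_distribution(
        Id=DISTRIBUTION_ID, DistributionConfig=config, IfMatch=response["ETag"]
    )

    invalidate("/v1*")                                  # again, while the change propagates

    # Wait for the distribution to return to the stable Deployed state...
    cloudfront.get_waiter("distribution_deployed").wait(Id=DISTRIBUTION_ID)

    invalidate("/v1*")                                  # ...then once more to purge stragglers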
More complicated path rewriting in CloudFront requires a Lambda@Edge Origin Request trigger (or you could use a Viewer Request trigger, but those run more often and thus cost more and add to overall latency); a minimal example of such a trigger follows the footnote below.
¹ Invalidation requests -- though this is not documented and is strictly anecdotal -- appear to involve a bit of time travel. Invalidations are timestamped, and it appears that they invalidate anything cached before their timestamp, rather than before the time they propagate to the edge locations. Architecturally, it would make sense if CloudFront is designed such that invalidations don't actively purge content, but only serve as directives for the cache to consider any cached object as stale if it pre-dates the timestamp on the invalidation request, allowing the actual purge to take place in the background. Invalidations seem to complete too rapidly for any other explanation. This means creating an invalidation request after the distribution returns to the stable Deployed state would assure that everything old is really purged, and that another invalidation request when the change is initially submitted would catch most of the stragglers that might be served from cache before the change is propagated. Changes and invalidations do appear to propagate to the edges via independent pipelines, based on observed completion timing.
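As mentioned above, here is roughly what such an Origin Request trigger could look like in Python (Lambda@Edge also supports Node.js); the alias-to-version table is a hard-coded placeholder, not anything CloudFront provides:

    # Hypothetical Lambda@Edge Origin Request trigger: rewrite /v1/... to the
    # concrete version prefix before CloudFront forwards the request to S3.
    ALIASES = {
        "/v1/": "/1.1.9/",   # placeholder mapping; bake in or look up at deploy time
    }

    def handler(event, context):
        request = event["Records"][0]["cf"]["request"]
        for alias, version in ALIASES.items():
            if request["uri"].startswith(alias):
                request["uri"] = version + request["uri"][len(alias):]
                break
        return request   # returning the (modified) request lets it continue to the origin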

Related

Can a distribution automatically match the subdomain from a request to figure out the origin?

We're adding a lot of nearly equivalent apps on the same domain; each app can be accessed through its specific subdomain. Each app has some specific assets (not a lot).
Every app refers to the same cdn.mydomain.com to get its assets from CloudFront.
Assets are namespaced. For example:
app1:
Can be reached from app1.mydomain.com
assets URL is cdn.mydomain.com/assets/app1
CloudFront origin is app1.mydomain.com
cache behavior /assets/app1/* points to origin app1.mydomain.com
When CloudFront doesn't have the assets in cache, it downloads them from the right origin.
Currently we're creating a new origin and cache behavior on the same distribution each time we add a new app.
We're trying to simplify that process so CloudFront can get the assets from the right origin without us having to specify it each time. This would also avoid the problem of hitting the limit on the number of origins in one distribution.
Is this possible, and how can we do it?
We're thinking of making an origin of mydomain.com with a cache behavior configured to forward the Host header, but we're not sure that this will do the trick.
Origins are tied to Cache Behaviors, which are tied to path patterns. You can't really do what you're thinking about doing.
I would suggest creating a distribution for each app and each subdomain. It's very easy to script this using aws-cli: once you have one set up the way you like it, you can use its configuration output as a template to make more, with minimal changes. (I use a Perl script to build the final JSON for each new distribution, with minimal inputs like the alternate domain name and certificate ARN, and pipe its output into aws-cli.)
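The same template idea, sketched with boto3 instead of aws-cli; the field names are real, but the IDs, domain names, and certificate ARN below are placeholders, and which fields you actually need to change depends on your template:

    # Pull the config of a distribution you already like, adjust the per-app
    # fields, and create a new distribution from it.
    import time
    import boto3

    cloudfront = boto3.client("cloudfront")

    def clone_distribution(template_id, subdomain, origin_domain, cert_arn):
        config = cloudfront.get_distribution_config(Id=template_id)["DistributionConfig"]

        config["CallerReference"] = str(time.time())    # must be unique per distribution
        config["Aliases"] = {"Quantity": 1, "Items": [subdomain]}
        config["Origins"]["Items"][0]["DomainName"] = origin_domain
        config["ViewerCertificate"] = {
            "ACMCertificateArn": cert_arn,
            "SSLSupportMethod": "sni-only",
            "MinimumProtocolVersion": "TLSv1.2_2021",
        }
        return cloudfront.create_distribution(DistributionConfig=config)

    # Hypothetical usage:
    # clone_distribution("E1TEMPLATE123", "app2.cdn.example.com",
    #                    "app2-assets.s3.amazonaws.com",
    #                    "arn:aws:acm:us-east-1:...:certificate/...")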
I believe this is the right approach, because:
CloudFront cannot select the origin based on the Host header. Only the path pattern is used to select the origin.
Lambda@Edge can rewrite the path and can inspect the Host header, but it cannot rewrite the path before the matching is done that selects the Cache Behavior (and thus the origin). You cannot use Lambda@Edge to cause CloudFront to switch or select origins unless you generate browser redirects, which you probably don't want to do, for performance reasons. I've submitted a feature request to allow a Lambda trigger to signal CloudFront that it should return to the beginning of processing and re-evaluate the path, but I don't know if it is being considered as a future feature -- AWS tends to keep their plans for future functionality close to the vest, and understandably so.
you don't gain any efficiency or cost savings by combining your sites in a single distribution, since the resources are different
if you decide to whitelist the Host header, that means CloudFront will cache responses, separately, based on the Host header, the same as it would do if you had created multiple distributions. Even if the path is identical, it will still cache separate responses if the Host header differs, as it must to ensure sensible behavior
the default limit for distributions is 200, while the limit for origins and cache behaviors is 25. Both can be raised by request, but the number of distributions they can give you is unlimited, while the other resources are finite because they increase the workload on the system for each request and would eventually have a negative performance impact
separate distributions give you separate logs and reports
provisioning errors have a smaller blast radius when each app has its own distribution
You can also go into AWS Certificate Manager and request a wildcard certificate for *.cdn.example.com. Then use e.g. app1.cdn.example.com as the alternate domain name for the app1 distribution and attach the wildcard cert. Then reuse the same cert on the app2.cdn.example.com distribution, etc.
Note that you also have an easy migration strategy from your current solution: you can create a single distribution with *.cdn.example.com as its alternate domain name, code the apps to use their own unique-name-here.cdn.example.com, and point all the DNS records at that distribution. Later, when you create a distribution with a specific alternate domain name foo.cdn.example.com, CloudFront will automatically stop routing those requests to the wildcard distribution and start routing them to the one with the specific domain. You will still need to change the DNS entry... but CloudFront will actually handle the requests correctly, routing them to the newly-created distribution, before you change the DNS, because it has some internal magic that matches the non-wildcard hostname to the correct distribution regardless of whether the browser connects to the new endpoint or the old... so the migration should pretty much be a non-event.
I'd suggest the wildcard strategy is a good one, anyway, so that your apps are each connecting to a specific endpoint hostname, allowing you much more flexibility in the future.
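A sketch of requesting that wildcard certificate with boto3; note that a certificate used with CloudFront has to be issued in us-east-1, and DNS validation still has to be completed before the cert can be attached to a distribution:

    # Request one wildcard cert and reuse its ARN on every appN.cdn.example.com
    # distribution; the domain name is a placeholder.
    import boto3

    acm = boto3.client("acm", region_name="us-east-1")   # CloudFront certs live in us-east-1

    response = acm.request_certificate(
        DomainName="*.cdn.example.com",
        ValidationMethod="DNS",
    )
    print(response["CertificateArn"])   # attach this ARN to each per-app distribution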

Best way to handle Cloudfront/S3 website with www redirected to bare domain

I have a website that I would like the www-prefixed version to redirect to the bare domain.
After searching for different solutions, I found this closed topic here with this answer that seems to work great: https://stackoverflow.com/a/42869783/8406990
However, I have a problem where if I update the root object "index/html" in my S3 bucket, it can take over a day before Cloudfront serves the new version. I have even manually invalidated the file, and while that updates the "index.html" file correctly, Cloudfront still serves the old one.
To better explain, if I type in: http://mywebsite.com/index.html, it will serve the new version. But if I type in http://mywebsite.com/, it serves the old index.html.
I went ahead and added "index.html" in the Default Root Object Property of my Cloudfront distribution (for the bare domain), and it immediately worked as I wanted. Typing in just the domain (without adding /index.html) returned the new version.
However, this is in contrast with the answer in the thread I just linked to, which explicitly states NOT to set a "default root object" when using two distributions to do the redirect. I was hoping to gain a better understanding of this "Default Root Object", and whether there is a better way to make sure the root object updates the cached version correctly?
Thank you.
If you really put index.html/ as the default root object, and your CloudFront distribution is pointing to the web site hosting endpoint of the bucket, and it worked, then you were almost certainly serving up an object in your bucket called index.html/ -- which would appear in your bucket as a folder, or as an object named index.html inside a folder named index.html. The trailing slash doesn't belong there. This might explain the strange behavior... but it also might just be a typo in your question.
Importantly... one purpose of CloudFront is to minimize requests to the back-end and keep copies cached in locations that are geographically near where they are frequently requested. Updating an object in S3 isn't designed to update what CloudFront serves right away, unless you have configured it to do so. One way of doing this is to set (for example) Cache-Control: public, max-age=600 on the object metadata when you save it to S3. This would tell CloudFront never to serve up a cached copy of the object that it obtained from S3 longer than 600 seconds (10 minutes) ago. If you don't set this, CloudFront will not check back for 24 hours, by default (the "Default TTL").
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html
This only works in one direction -- it tells CloudFront how long it is permitted to retain a cached copy without checking for updates. It doesn't tell CloudFront that it must wait that long before checking. Objects that are requested infrequently might be released by CloudFront before their max-age expires. The next request fetches a fresh copy from S3.
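For example, a minimal way to set that header at upload time with boto3 (bucket, key, and max-age are placeholders):

    # Upload the object with a Cache-Control header so CloudFront revalidates
    # after 10 minutes instead of the 24-hour default.
    import boto3

    s3 = boto3.client("s3")

    s3.put_object(
        Bucket="my-website-bucket",                # placeholder
        Key="index.html",
        Body=open("index.html", "rb"),
        ContentType="text/html",
        CacheControl="public, max-age=600",        # cache for at most 600 seconds
    )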
If you need to wipe an object from CloudFront's cache right away, that's called a cache invalidation. These are billed $0.005 for each path (not each file) that you request be invalidated, but the first 1,000 per month per AWS account are billed at $0.00. You can invalidate all your files by requesting an invalidation for /*. This leaves S3 untouched but CloudFront discards anything it cached before the invalidation request.
The default root object is a legacy feature that is no longer generally needed since S3 introduced static web site hosting buckets. Before that -- and still, if you point CloudFront to the REST endpoint for the bucket -- someone hitting the root of your web site would see a listing of all your objects. Obviously, that's almost always undesirable, so the default root object allowed you to substitute a different page at the root of the site.
With static hosting in S3, you have index documents, which work in any "directory" on your site -- making the CloudFront default root object, which only works at the root of the site, largely unnecessary anywhere an index document is available. So it's relatively uncommon to use this feature now.

Uploading various-sized images to AWS CloudFront versus post-processing

We are using AWS CloudFront to serve static content on our site, with an S3 bucket as the origin. As a next step, users can dynamically upload images which we want to push to the CDN, but we need them in several different sizes for use around the site. One option is to preprocess the images before pushing them to the S3 bucket, which ends up creating multiple images per upload, one for each size. Can we instead do post-processing, something like http://imageprocessor.org/imageprocessor-web/ does, but still use CloudFront? Any feedback would be helpful.
Regards
Raghav
Well, yes, it is possible to do post-processing and use CloudFront but you need an intermediate layer between CloudFront and S3. I designed a system using the following high-level implementation:
Request arrives at CloudFront, which serves the image from cache if available; otherwise CloudFront sends the request to the origin server.
The origin server is not S3. The origin server is Varnish, on EC2.
Varnish sends the request to S3, where all the resized image results are stored. If S3 returns 200 OK, the image is returned to CloudFront and to the requesting browser, and the process is complete. Since the Varnish machine runs in the same AWS region as the S3 bucket, the performance is essentially indistinguishable between CloudFront >> S3 and CloudFront >> Varnish >> S3.
Otherwise, Varnish is configured to retry the failed request by sending it to the resizer platform, which also runs in EC2.
The resizer examines the request to determine what image is being requested, and at what size. In my application, the desired size is in the last few characters of the filename, so xxxxx_300_300_.jpg means 300 x 300. The resizer fetches the source image... resizes it... stores the result in S3... and returns the new image to Varnish, which returns it to CloudFront and to the requester. The resizer itself is ImageMagick wrapped in Mojolicious, and uses a MySQL database to identify the source URI where the original image can be fetched.
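The resizer's job can be sketched roughly as follows -- this is an illustration of the idea, not my actual implementation, using Pillow in place of ImageMagick and a filename convention in place of the MySQL lookup; bucket names are placeholders:

    # Rough sketch: parse the requested size out of the filename, fetch the
    # original, resize, store the result in S3 for next time, and return it.
    import io
    import re
    import boto3
    from PIL import Image                      # Pillow standing in for ImageMagick

    s3 = boto3.client("s3")
    RESULT_BUCKET = "resized-images-bucket"    # placeholder
    SOURCE_BUCKET = "original-images-bucket"   # placeholder; mine is a MySQL lookup

    def resize_on_demand(key):
        # e.g. "photos/xxxxx_300_300_.jpg" -> width 300, height 300
        match = re.search(r"_(\d+)_(\d+)_\.(\w+)$", key)
        width, height = int(match.group(1)), int(match.group(2))
        source_key = re.sub(r"_\d+_\d+_(?=\.\w+$)", "", key)   # hypothetical source naming

        original = s3.get_object(Bucket=SOURCE_BUCKET, Key=source_key)["Body"].read()
        image = Image.open(io.BytesIO(original))
        image.thumbnail((width, height))       # resize in place, preserving aspect ratio

        buffer = io.BytesIO()
        image.save(buffer, format=image.format or "JPEG")

        # Save the result so the next request for this size is a plain S3 hit.
        s3.put_object(
            Bucket=RESULT_BUCKET,
            Key=key,
            Body=buffer.getvalue(),
            ContentType=f"image/{(image.format or 'jpeg').lower()}",
        )
        return buffer.getvalue()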
Storing the results in a backing store, like S3, and checking there, first, on each request, is a critical part of this process, because CloudFront does not work like many people seem to assume. Check your assumptions against the following assertions:
CloudFront has 50+ edge locations. Requests are routed to the edge that is optimal for (usually, geographically close to) the viewer. The edge caches are all independent. If I request an object through CloudFront, and you request the same object, and our requests arrive at different edge locations, then neither of us will be served from cache. If you are generating content on demand, you want to save your results to S3 so that you do not have to repeat the processing effort.
CloudFront honors your Cache-Control: header (or overridden values in configuration) for expiration purposes, but does not guarantee to retain objects in cache until they expire. Caches are volatile and CloudFront is no exception. For this reason, too, your results need to be stored in S3 to avoid duplicate processing.
This is a much more complex solution than pre-processing.
I have a pool of millions of images, a large percentage of which would have a very low probability of being viewed, and this is an appropriate solution, here. It was originally designed as a parallel solution to make up for deficiencies in a poorly-architected preprocessor that sometimes "forgot" to process everything correctly, but it worked so well that it is now the only service providing images.
However, if your motivation revolves around avoiding the storage cost of the preprocessed results, this solution won't entirely solve that.

GET from CloudFront after PUT returns old data

I am using CloudFront to access my S3 bucket.
I perform both GET and PUT operations to retrieve and update the data. The problem is that after I send a PUT request with new data, a GET request still returns the older data. I do see that the file is updated in the S3 bucket.
I am performing both GET and PUT from an iOS application. However, I tried performing the GET request using regular browsers and I still receive the older data.
Do I need to do anything in addition to make CloudFront refresh its data?
CloudFront caches your data; for how long depends on the headers the origin serves content with and on the distribution settings.
Amazon has a document with the full details of how they interact, but if you haven't set your cache control headers and haven't changed any CloudFront settings, then by default data is cached for up to 24 hours.
You can either:
set headers indicating how long to cache content for (e.g. Cache-Control: max-age=300 to allow caching for up to 5 minutes). How exactly you do this depends on how you are uploading the content; at a pinch you can use the console (see the sketch after this list)
Use the console / API to invalidate content. Beware that only the first 1,000 invalidations a month are free -- beyond that Amazon charges. In addition, invalidations take 10-15 minutes to process.
Change the naming strategy for your S3 data so that new data is served under a different name (perhaps less relevant in your case)
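For the first option, if the object is already sitting in the bucket, one way to attach the header after the fact is a copy-in-place with boto3 (bucket and key are placeholders):

    # Copy the object onto itself with replaced metadata so future fetches carry
    # a Cache-Control header. With MetadataDirective="REPLACE" you must restate
    # any metadata you want to keep, including ContentType.
    import boto3

    s3 = boto3.client("s3")

    s3.copy_object(
        Bucket="my-bucket",                                   # placeholder
        Key="data.json",                                      # placeholder
        CopySource={"Bucket": "my-bucket", "Key": "data.json"},
        MetadataDirective="REPLACE",
        ContentType="application/json",
        CacheControl="max-age=300",                           # allow caching for up to 5 minutes
    )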
When you PUT an object into S3 by sending it through Cloudfront, Cloudfront proxies the PUT request back to S3, without interpreting it within Cloudfront... so the PUT request changes the S3 object, but the old version of the object, if cached, would have no reason to be evicted from the Cloudfront cache, and would continue to be served until it is expired, evicted, or invalidated.
"The" Cloudfront cache is not a single thing. Cloudfront has over 50 global edge locations (reqests are routed to what should be the closest one, using geolocating DNS), and objects are only cached in locations through which they have been requested. Sending an invalidation request to purge an object from cache causes a background process at AWS to contact all of the edge locations and request the object be purged, if it exists.
What's the point of uploading this way, then? The point has to do with the impact of packet loss, latency, and overall network performance on the throughput of a TCP connection.
The Cloudfront edge locations are connected to the S3 regions by high bandwidth, low loss, low latency (within the bounds of the laws of physics) connections... so the connection from the "back side" of Cloudfront towards S3 may be a connection of higher quality than the browser would be able to establish.
Since the Cloudfront edge location is also likely to be closer to the browser than S3 is, the browser connection is likely to be of higher quality and more resilient... thereby improving the net quality of the end-to-end logical connection, by splitting it into two connections. This feature is solely about performance:
http://aws.amazon.com/blogs/aws/amazon-cloudfront-content-uploads-post-put-other-methods/
If you don't have any issues sending directly to S3, then uploads "through" Cloudfront serve little purpose.

CloudFront atomic replication

I want to host a static site via Amazon CloudFront from an S3 bucket. If I update the content of the bucket with a new version of the page, is there a way I can ensure the distribution happens in an atomic way?
What I mean is, if I have assets like a.js and b.js, that the updated version of both is served at the same time, and not e.g. the old a.js and new b.js.
You have a couple of options:
You can request an invalidation; it takes about 15 minutes or so to complete.
You can give your new assets a new name. This is a bit harder to do, but in my opinion it is the preferable route, since it makes it easier to enable long-expiration client-side caching (a sketch of one way to do this follows at the end of this answer).
If you perform object invalidation, there is no guarantee that the two JS files will be invalidated at the same time, so there would definitely be some window during which your site behaves unexpectedly.
Either do it at a time when you expect the fewest users to be visiting your site, or create new resources as "datasage" mentioned and then update all the files that reference them to use the new names.
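A sketch of the second option with boto3 (bucket, naming scheme, and content type are placeholders): name each asset after a hash of its content, upload it with a long max-age, and have the deploy step rewrite the page to reference the new names, so both files switch over together.

    # Name assets by content hash so every release gets brand-new URLs, which can
    # then be cached aggressively; the old a.js and b.js are never overwritten.
    import hashlib
    import pathlib
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-site-bucket"   # placeholder

    def upload_versioned(path):
        data = pathlib.Path(path).read_bytes()
        digest = hashlib.sha256(data).hexdigest()[:12]
        stem, suffix = pathlib.Path(path).stem, pathlib.Path(path).suffix
        key = f"assets/{stem}.{digest}{suffix}"              # e.g. assets/a.3f2a9c1b77de.js
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=data,
            ContentType="application/javascript",            # assumes JS assets
            CacheControl="public, max-age=31536000, immutable",
        )
        return key   # write this into the HTML that references the asset

    # Deploy both files, then publish the HTML referencing the returned keys,
    # so a.js and b.js always switch over together.
    new_a = upload_versioned("a.js")
    new_b = upload_versioned("b.js")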