Changing "Origin Path" in CloudFront takes very long to kick in - amazon-web-services

We have a static site hosted in S3 and delivered with CloudFront. The site works, but rolling out updates takes a long time -- hours or more. Specifically, a change to the Origin Path is not reflected at edge locations nearly as quickly as we would like.
Here is what we are trying to achieve...
Our S3 bucket is configured to host a website. It stores multiple versions of the same site. There is a sub-directory per git tag. For example:
/git-v1
/git-v2
/git-v3
..
The goal is to tell CF to start serving a new version of the site via the Origin Path setting. We don't want to invalidate old objects, just keep advancing the version by creating a new directory and pointing CF at it. The status under CloudFront Distributions has shown "Deployed" for a long time, yet the edge locations continue to ignore the new Origin Path.
Any ideas on how to make CF start serving the new sub-directory more quickly would be greatly appreciated.

The Origin Path setting is applied to the request after the cache is checked... not before. When the object requested in the URI is not in the cache, the object is requested from the Origin server. At that point, Origin Path is prepended to the incoming request path, then sent to the origin. Caching is based on the incoming request path.¹
The setting itself takes effect quickly, often in seconds, but doesn't purge the cache.
If this is just for versioning the root page, you can leave the Origin Path blank, change the Default Root Object to the new root object, and then just invalidate /. Or, you can keep doing what you are doing, and invalidate /* after making the change. The first 1,000 invalidation paths each month are free, and invalidating /* (or any wildcard) counts as only 1 path, no matter how many objects the wildcard matches.
¹ incoming request path also refers to the path as it stands after a Lambda@Edge Viewer Request trigger modifies it, if applicable.
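To tie this back to the original question: after changing the Origin Path to point at the new /git-vN directory, submit a /* invalidation so the edges stop returning objects cached under the previous version. A minimal sketch using boto3, with a placeholder distribution ID:

```python
import time
import boto3

cloudfront = boto3.client("cloudfront")

# Placeholder distribution ID. CallerReference just needs to be unique per
# invalidation request; a timestamp is a common choice.
cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},  # one wildcard path counts as one invalidation path
        "CallerReference": str(time.time()),
    },
)
```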

Related

Difference between CloudFront TTL expiry and Invalidation?

What are the practical differences between when CloudFront expires objects at an origin via the CloudFront TTL setting versus when one calls invalidate?
The general idea is that you use TTLs to set the policy that CloudFront uses to determine the maximum amount of time each individual object can potentially be served from the CloudFront cache with no further interaction with the origin.
Default TTL: the maximum time an object can persist in a CloudFront cache without being considered stale, when no relevant Cache-Control directive is supplied by the origin. No Cache-Control header is added to the response by CloudFront.
Minimum TTL: If the origin supplies a Cache-Control: s-maxage value (or, if not present, a Cache-Control: max-age value) smaller than this, CloudFront ignores it and assumes it may retain the object in the cache for up to this long. For example, if Minimum TTL is set to 900 but the response contains Cache-Control: max-age=300, CloudFront ignores the 300 and may cache the object for up to 900 seconds. The Cache-Control header is not modified, and is returned to the viewer as received.
Maximum TTL: If the origin supplies a Cache-Control directive indicating that the object can be cached longer than this, CloudFront ignores the directive and assumes that the object must not continue to be served from cache for longer than Maximum TTL.
See Specifying How Long Objects Stay in a CloudFront Edge Cache (Expiration) in the Amazon CloudFront Developer Guide.
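As a worked illustration of how those three settings combine with an origin Cache-Control value, here is a small sketch that mirrors the rules above, including the Minimum TTL 900 / max-age=300 example. It is a simplification for illustration, not CloudFront's actual implementation:

```python
def effective_ttl(minimum_ttl, default_ttl, maximum_ttl, origin_max_age=None):
    """Approximate how CloudFront combines its TTL settings with an origin
    Cache-Control s-maxage/max-age value (simplified illustration)."""
    if origin_max_age is None:
        # No relevant Cache-Control directive from the origin: Default TTL applies.
        return default_ttl
    # Origin supplied a value: clamp it between Minimum TTL and Maximum TTL.
    return max(minimum_ttl, min(origin_max_age, maximum_ttl))

# Example from the text: Minimum TTL 900, origin sends Cache-Control: max-age=300
# -> the object may be cached for up to 900 seconds.
print(effective_ttl(minimum_ttl=900, default_ttl=86400,
                    maximum_ttl=31536000, origin_max_age=300))  # 900
```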
So, these three values control what CloudFront uses to determine whether a cached response is still "fresh enough" to be returned to subsequent viewers. It does not mean CloudFront purges the cached object after the TTL expires. Instead, CloudFront may retain the object, but will not serve it beyond the expiration without first sending a request to the origin to see whether the object has changed.
CloudFront does not proactively check the origin for new versions of objects that have expired -- it only checks when they are requested again, are still in the cache, and are found to have expired. When it does this, it usually sends a conditional request, using directives like If-Modified-Since. This gives the origin the option of responding 304 Not Modified, which tells CloudFront that the cached object is still usable.
A misunderstanding that sometimes surfaces is that the TTL directs CloudFront how long to cache the objects. That is not what it does. It tells CloudFront how long it is allowed to cache the response with no revalidation against the origin. Cache storage inside CloudFront has no associated charge, and caches by definition are ephemeral, so, objects that are rarely requested may be purged from the cache before their TTL expires.
If an object in an edge location isn't frequently requested, CloudFront might evict the object—remove the object before its expiration date—to make room for objects that have been requested more recently.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html
On the next request, CloudFront will request the object from the origin again.
Another misunderstanding is that CloudFront's cache is monolithic. It isn't. Each of the global edges has its own independent cache, caching objects in the edges through which they are requested. Each global edge also has an upstream regional cache (in the nearest EC2 region; there may be more than one per region, but this isn't documented) where the object will also be stored, allowing other nearby global edges to find the object in the nearest regional cache. CloudFront does not search any further, internally, for cached objects; for performance, it just goes to the origin on a cache miss.
See How CloudFront Works with Regional Edge Caches.
Invalidation is entirely different, and is intended to be used sparingly -- only the first 1000 invalidation paths submitted each month by an AWS account are free. (A path can match many files, and the path /* matches all files in the distribution).
An invalidation request has a timestamp of when the invalidation was created, and sends a message to all regions, directing them to do something along these lines (the exact algorithm isn't documented, but this accurately describes the net effect):
Delete any files matching ${path} from your cache, if they were cached prior to ${timestamp} and
Meanwhile, since that could take some time, if you get any requests for files matching ${path} that were cached prior to ${timestamp}, don't use the cached files because they are no longer usable.
The invalidation request is considered complete as soon as the entire network has received the message. Invalidations are essentially an idempotent action, in the sense that it is not an error to invalidate files that don't actually exist, because an invalidation is telling the edges to invalidate such files if they exist.
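A toy model of that net effect, just to make the timestamp comparison concrete (this is not CloudFront's actual, undocumented algorithm):

```python
from fnmatch import fnmatch

def usable_from_cache(request_path, cached_at, invalidations):
    """Toy model: a cached copy may not be served if any invalidation whose
    path pattern matches it was created after the copy was cached."""
    return not any(
        fnmatch(request_path, pattern) and cached_at < created_at
        for pattern, created_at in invalidations
    )

# Object cached at t=100; a /* invalidation created at t=200 makes it unusable.
print(usable_from_cache("/css/site.css", cached_at=100, invalidations=[("/*", 200)]))  # False
print(usable_from_cache("/css/site.css", cached_at=300, invalidations=[("/*", 200)]))  # True
```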
From this, it should be apparent that the correct course of action is not to choose one or the other, but to use both as appropriate. Set your TTLs (or select "use origin cache headers" and configure your origin server always to return them with appropriate values) and then use invalidations only as necessary to purge your cache of selected or all content, as might be necessary if you've made an error, or made significant changes to the site.
The best practice, however, is not to count on invalidations but instead to use cache-busting techniques when an object changes. Cache busting means changing the actual object being requested. In the simplest implementation, for example, this might mean you change /pics/cat1.png to /pics/cat2.png in your HTML rather than saving a new image as /pics/cat1.png when you want a new image. The problem with replacing one file with another at the same URL is that the browser also has a cache, and may continue displaying the old image.
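A minimal sketch of that cache-busting idea: derive a content-hashed filename so a changed image gets a brand-new URL instead of overwriting the old one (the paths and hash length here are arbitrary illustration choices):

```python
import hashlib
from pathlib import Path

def versioned_name(path):
    """Return a content-hashed filename, e.g. cat1.png -> cat1.3fa4b2c1.png,
    so a changed file is requested under a new URL."""
    p = Path(path)
    digest = hashlib.md5(p.read_bytes()).hexdigest()[:8]
    return f"{p.stem}.{digest}{p.suffix}"

# Reference the versioned name in your HTML; both CloudFront and the browser
# then treat the updated image as a brand-new object.
print(versioned_name("pics/cat1.png"))  # assumes the file exists locally
```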
See also Invalidating Objects.
Note also that the main TTLs are not used for error responses. By default, responses like 404 Not Found are cached for 5 minutes. This is the Error Caching Minimum TTL, designed to relieve your origin server from receiving requests that are likely to continue to fail, but only for a few minutes.
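If five minutes is not what you want, that value can be tuned per error code. A hedged fragment of the CustomErrorResponses portion of a distribution config, with illustrative values:

```python
# Illustrative fragment of a DistributionConfig: cache 404 responses for 60
# seconds instead of the default 5 minutes (ErrorCachingMinTTL is in seconds).
custom_error_responses = {
    "Quantity": 1,
    "Items": [
        {
            "ErrorCode": 404,
            "ErrorCachingMinTTL": 60,
            "ResponseCode": "",      # leave the status code as-is
            "ResponsePagePath": "",  # no custom error page
        }
    ],
}
```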
If we are looking at practical differences:
CloudFront TTL: You can control how long your objects stay in a CloudFront cache before CloudFront forwards another request to your origin.
Invalidate: Invalidate the object from edge caches. The next time a viewer requests the object, CloudFront returns to the origin to fetch the latest version of the object.
So the main difference is speed. If you deploy a new version of your application you might want to invalidate immediately.

Cloudfront decision to fetch data from cache or server?

From the Amazon CloudFront documentation:
Amazon CloudFront is a web service that speeds up distribution of your static and dynamic web content, such as .html, .css, .php, and image files, to your users. CloudFront delivers your content through a worldwide network of data centers called edge locations.
Per my understanding, CloudFront must be caching the content with the URL as the key. A URL can serve either static or dynamic content. Say I have 100 web URLs, of which 30 serve static content and 70 serve dynamic content (user-specific data). I have one question each on static and dynamic content.
Dynamic content:
Say user_A accesses his data through url_A from the US, and that data gets cached. He then updates his first name. Now the same user accesses the data again, either from the same location in the US or from another location in the UK. Will he see the data prior to the update? If so, how will the edge location know that the data needs to be fetched from the server and not from the cache?
Does the edge location keep serving the data from cache for a configurable amount of time, and fetch it from the server once that time has passed?
Does CloudFront allow configuring specific URLs that should always be fetched from the server instead of the cache?
Static content:
There is a chance that even static content changes with each release. How will CloudFront know that cached static content is stale and needs to be fetched from the server?
Amazon CloudFront uses an expiration period (or Time To Live - TTL) that you specify.
For static content, you can set the default TTL for the distribution or you can specify the TTL as part of the headers. When the TTL has expired, the CloudFront edge location will check to see whether the Last Modified timestamp on the object has changed. If it has changed, it will fetch the updated copy. If it is not changed, it will continue serving the existing copy for the new time period.
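If the origin is an S3 bucket, "specifying the TTL as part of the headers" means setting Cache-Control metadata on the object at upload time. A hedged sketch with boto3 (bucket, key, and file names are made up):

```python
import boto3

s3 = boto3.client("s3")

# The CacheControl argument becomes the Cache-Control response header that
# CloudFront uses when deciding how long to keep the object.
with open("site.css", "rb") as body:
    s3.put_object(
        Bucket="example-bucket",
        Key="css/site.css",
        Body=body,
        ContentType="text/css",
        CacheControl="max-age=31536000",  # static asset: up to one year
    )
```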
If static content has changed, your application must send an Invalidation Request to tell CloudFront to reload the object even when the TTL has not expired.
For dynamic content, your application will normally specify zero as the TTL. Thus, that URL will always be fetched from the origin, allowing the server to modify the content for the user.
A half-and-half method is to use parameters (eg xx.cloudfront.net/info.html?user=foo). When configuring the CloudFront distribution, you can specify whether a different parameter (eg user=fred) should be treated as a separate object or whether it should be ignored.
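For reference, the knob that paragraph describes is the query-string handling on the cache behavior. A hedged fragment of the legacy ForwardedValues settings (newer distributions use cache policies instead); here only the hypothetical user parameter becomes part of the cache key:

```python
# Fragment of a cache behavior's ForwardedValues (values are illustrative):
# forward query strings, but only "user" is used as part of the cache key,
# so info.html?user=foo and info.html?user=fred are cached as separate objects.
forwarded_values = {
    "QueryString": True,
    "QueryStringCacheKeys": {"Quantity": 1, "Items": ["user"]},
    "Cookies": {"Forward": "none"},
}
```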
Also, please note that each CloudFront edge location has its own cache. So, if somebody accessed a page from the USA, that would not cause it to be cached in the UK.
See the documentation: Specifying How Long Objects Stay in a CloudFront Edge Cache (Expiration)

Does Amazon CloudFront check if origin (specifically S3 bucket) has changed via MD5/ETag/Other before retransferring expired objects across regions?

I would like to know if there is a cost incurred for "refetching" an expired CloudFront object from an S3 bucket if the resource has not changed. I.e., is the object retransferred in its entirety to each edge location, or are things like Content-MD5 headers or modified times checked first before retransferring?
I'm trying to calculate the costs incurred and can't find any information on this via Google or through Amazon's documentation.
I would like to set the Cache-Control headers to be as short a time as possible (say a few hours) so that objects can be removed/replaced reasonably quickly in places where filename versioning is not possible, without using Invalidation Requests.
If the objects are indeed retransferred in full, then obviously with hundreds of objects this solution would be too expensive to be acceptable.
On the other hand, there may be a better solution without needing to set a low value in the Cache-Control header. If so please share.
Thanks!
Once an object is "expired", it is removed from the CloudFront edge location. There's therefore no way to do an MD5 or modified-time check, as there is no file to compare it against.
If the file has not expired, it will not be checked against the origin server.
i.e.:
1. File is in the edge location cache - the origin server is not checked.
2. File expires - it is therefore no longer at the edge location - the edge fetches it from the origin server.
A short expiration time will lead to those files being removed from the edge caches, therefore requiring a complete refetch.

Are deleted files accessible if behind cloudfront

I am trying to wrap my head around how CloudFront and CDNs work.
Suppose I have a file, the Cache-Control header is set to 1 year, and I am using Amazon CloudFront as my CDN.
What happens if I delete the file? Would it still be served, since it is cached by the CloudFront servers? Would it be served in all locations of the world, or does it only get cached at an edge server once it has been requested there?
Example: I have a file behind Amazon CloudFront
blue.jpg with cache control headers set for 1 year
I visit the file from a location in New York
I then delete the file.
If I then visit the page which includes the file again from New York, would the file be served since it's cached?
What if someone then visits the page with the file from Moscow, Russia? Would he be able to view the file?
Thanks for your help :)
CloudFront is simply a collection of caches close to your users. Each edge location operates independently.
By default, CloudFront obeys your HTTP cache control headers. If you set your headers so a file does not expire for a year, CloudFront will continue serving that file for a year without checking back with your origin server.
Since each edge location operates independently, in your example, New York will continue serving the file, but Moscow will see the file as deleted (404). As you can imagine, this could lead to different users seeing different content.
There are strategies to avoid this problem.
From the CloudFront docs (http://aws.amazon.com/cloudfront/#details):
Object Versioning and Cache Invalidation
You have two options to update your files cached at the Amazon CloudFront edge locations. You can use object versioning to manage changes to your content. To implement object versioning, you create a unique filename in your origin server for each version of your file and use the file name corresponding to the correct version in your web pages or applications. With this technique, Amazon CloudFront caches the version of the object that you want without needing to wait for an object to expire before you can serve a newer version.
You can also remove copies of a file from all Amazon CloudFront edge locations at any time by calling the invalidation API. This feature removes the file from every Amazon CloudFront edge location regardless of the expiration period you set for that file on your origin server. If you need to remove multiple files at once, you may send a list of files (up to 1,000) in an XML document. The invalidation feature is designed to be used in unexpected circumstances, e.g., to correct an encoding error on a video you uploaded or an unanticipated update to your website’s CSS file. However, if you know beforehand that your files will change frequently, it is recommended that you use object versioning to manage updates to your files. This technique gives you more control over when your changes take effect and also lets you avoid potential charges for invalidating objects.

Pre-caching dynamically generated images for multiple Edge locations on Amazon Cloudfront

We are currently using CloudFront in many Edge Locations to serve product images (close to half a million) which are dynamically resized into different size dimensions.
Our Cloudfront distribution uses an origin EC2 php script to retrieve the original image from S3, transform it dynamically based on supplied querystring criteria (width, height, cropping, etc) and stream it back to Cloudfront which caches it on the edge location.
However, website visitors loading a non-cached image for the first time are hit by this quite heavy transformation.
We would like to have the ability to 'pre-cache' our images (by using a batch job requesting each image url) so that end users aren't the first to hit an image in a particular size, etc.
Unfortunately, since the images are only cached on the Edge Location assigned to the pre-caching service, website visitors using another Edge Location won't get the cached image and are hit with the heavy resizing script on the origin server.
The only solution we've come up with, where every Edge Location can retrieve an image within a reasonable load time, is this:
We have a Cloudfront distribution that points to an origin EC2 php script. But instead of doing the image transformation described above, the origin script forwards the request and querystring parameters to another Cloudfront distribution. This distribution has an origin EC2 php script which performs the image transformation. This way the image is always cached at the Edge Location near our EC2 instance (Ireland), thus avoiding yet another transformation when the image is requested from another Edge Location.
So, for example, a request in Sweden hits /image/stream/id/12345, which the Swedish Edge Location doesn't have cached, so it sends a request to the origin, which is the EC2 machine in Ireland. The EC2 'streaming' page then loads /image/size/id/12345 from another Cloudfront distribution, which hits the Irish Edge Location, which also doesn't have it cached. It then sends a request to the origin, again the same EC2 machine, but to the 'size' page which does the resizing. After this, both the Edge Location in Sweden and in Ireland have the image cached.
Now, a request from France arrives for the same image. The French Edge Location doesn't have it cached, so it calls the origin, which is the EC2 machine in Ireland, which calls the second CF distribution, which again hits the Irish Edge Location. This time it does have the image cached, and can return it immediately. Now the French Edge Location also has the image cached, but without the 'resizing' page having been called -- only the 'streaming' page, serving the image cached in Ireland.
This also means that our "pre-caching" batch service in Ireland can make requests against the Irish Edge Location and pre-cache the images before they're requested by our website visitors.
We know it looks a bit absurd, but given our requirement that the end user should never have to wait a long time while the image is being resized, it seems like the only workable solution.
Have we overlooked another/better solution? Any comments to the above?
I'm not sure that this will reduce loading times (if that was your goal).
Yes, this setup will save some "transformation time", but on the other hand it also creates additional communication between servers.
I.e., the client calls the French POP >> the French POP calls the Ireland POP = twice the download time (and then some), which might be longer than the "transformation time"...
I work for Incapsula, and we've actually developed our own behavior-analyzing heuristic process to handle dynamic content caching (briefly documented here: http://www.incapsula.com/the-incapsula-blog/item/414-advanced-caching-dynamic-through-learning).
Our premise is:
While one website can have millions of dynamic objects, only some of those are subject to repeated requests.
Following this logic, we have an algorithm which learns visiting patterns, finds good "candidates" for caching, and then caches them on redundant servers (thus avoiding the above-mentioned "double download").
The content is then re-scanned every 5 minutes to preserve freshness, and the heuristic system keeps track to make sure that the content is still popular.
This is an over-simplified explanation, but it demonstrates the core idea, which is: find out what your users need most, get it on all the POPs, and keep track to preserve freshness and detect trends.
Hope this helps.
Just a thought...
Run two caches.
One at each edge location,
One on the server (or ElastiCache if you have multiple servers) in Ireland. Objects don't need to be cached for much more than a few minutes.
Have a micro instance running, attached to Data Pipeline or a queue.
When the request comes into the origin server, return the image and cache it on the server. Also drop the URL onto the queue.
Then, have the daemon make requests to each edge location. At this point, your server will get hit again (as the other edge locations won't have the image) - but it'll be served immediately out of the server cache, with no requirement to perform the expensive transform.
If it's not doing the transform and is only serving out of the cache, that shouldn't be a big deal.
So the flow would be like this:
Request -> CloudFront -> EC2 -> add to cache -> Response -> CloudFront cache -> User
        (also) -> queue a new request for each edge location
Request -> CloudFront -> EC2 -> already cached -> Response -> CloudFront -> User
You'd need some form of marker to state that it's been served and cached already, otherwise you'd end up in an infinite loop.
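A rough sketch of what that daemon and marker could look like, assuming an SQS queue and a made-up X-Warmup header (each worker only warms the edge nearest to it, so covering several edge locations would mean running workers in several regions):

```python
import boto3
import requests

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/image-warmup"  # hypothetical queue

def warm_up():
    """Pull freshly generated image URLs off the queue and re-request them so
    the nearest edge location caches them before real visitors arrive."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            url = msg["Body"]
            # Hypothetical marker header so the origin knows this is a warm-up
            # hit and can serve from its own cache instead of re-transforming
            # (and doesn't queue the URL again, avoiding an infinite loop).
            requests.get(url, headers={"X-Warmup": "1"}, timeout=30)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    warm_up()
```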