Correct invalidation path for CloudFront object

I am trying to figure out the correct path for an object to invalidate on CloudFront distribution.
CloudFront is configured with an alternate domain name of *.example.com
The tricky part is that I've set up a custom origin on EC2 that uses HAProxy to do some path rewriting.
So that the request to
mysubdomain.example.com/icon.png
is rewritten to
s3.amazonaws.com/examplebucket/somedirectory/mysubdomain/icon.png
and the result is then returned to CloudFront. (So, both the Path and the Host are being rewritten)
Now, I have trouble figuring out the correct path for this object when sending an invalidation request. (I don't want to use versioning, because I need the filename to remain the same)
I've tried with the following configuration, but it doesn't seem to be working. The invalidation is created and processed, but with no effect.
const invalidationParams = {
  DistributionId: 'MY_DISTRIBUTION_ID',
  InvalidationBatch: {
    CallerReference: 'SOME_RANDOM_STRING',
    Paths: {
      Quantity: 1,
      Items: [
        '/somedirectory/mysubdomain/icon.png'
      ]
    }
  }
};
Since only a path is specified, which is relative to the distribution, and there is no way to specify the full URL in the invalidation request, does that make it impossible to invalidate the object in this configuration?

CloudFront invalidations consider every object matching the path specification, as requested by the browser. To invalidate http://example.com/cat.jpg you specify one of the following:
cat.jpg
/cat.jpg
The leading slash is optional, but implied if absent.
Paths are the only values accepted for invalidation requests.
For each edge location, every copy of an object matching that path -- regardless of the alternate domain name or other attributes associated with it -- will be evicted.
Note that "every copy of an object matching that path" may be confusing to some, since the assumption might be that only one copy would match a given path, but this is not correct. CloudFront caches different copies of the "same" object, depending on which request parameters are forwarded to the origin. If the query string, a cookie, whitelisted headers, etc., are forwarded, then many copies of the "same" object will be cached, because caching requires that the cache assume the response will vary if any if the forwarded request parameters vary. This is why so little forwarded by default -- it helps your hit rate because it reduces the likelihood of any given request seeming "unique" to the cache logic.

Related

AWS CloudFront will only invalidate entire cache

I want to invalidate the CloudFront cache entry for a specific path, say /api/dict/bob/article/1, but it has no effect. I've experimented with different wildcards, for instance:
/*/1
/api/dict/bob/article/*
/api/dict/bob/*
/api/dict/*
But in the end, the only invalidation that actually removes the object from the cache is the catch-all /*
The path's origin is custom (API Gateway)
Cache policy:
Minimum TTL: 60
Maximum TTL: 28800
Default TTL: 10800
Cache key:
query strings: all
cookies: all
headers: whitelist (one header)
Update: I managed to invalidate an SVG file from an S3 origin. The API origin requires an x-api-key HTTP header. Could that make a difference?
I think * can only appear at the end. Are you sure it is not client-side caching, and that you waited for the invalidation to finish? It is a slow process, definitely not instant. You can track its progress either in the console or with the CLI.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html
To invalidate files, you can specify either the path for individual
files or a path that ends with the * wildcard, which might apply to
one file or to many, as shown in the following examples:
/images/image1.jpg
/images/image*
/images/*
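If you want to confirm that the invalidation has actually finished before re-testing (the console and CLI mentioned above work too), here is a small sketch using the AWS SDK for JavaScript v2 with placeholder IDs:

const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

// Poll one invalidation until CloudFront reports it as Completed.
function waitForInvalidation(distributionId, invalidationId) {
  cloudfront.getInvalidation({ DistributionId: distributionId, Id: invalidationId }, (err, data) => {
    if (err) return console.error(err);
    if (data.Invalidation.Status === 'Completed') {
      console.log('Invalidation finished');
    } else {
      console.log('Still in progress, checking again in 15 seconds');
      setTimeout(() => waitForInvalidation(distributionId, invalidationId), 15000);
    }
  });
}

waitForInvalidation('MY_DISTRIBUTION_ID', 'MY_INVALIDATION_ID'); // placeholders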

Don't pass on the pattern path from CloudFront to origin? [duplicate]

I have two S3 buckets that are serving as my CloudFront origin servers:
example-bucket-1
example-bucket-2
The contents of both buckets live in the root of those buckets. I am trying to configure my CloudFront distribution to route or rewrite based on a URL pattern. For example, with these files
example-bucket-1/something.jpg
example-bucket-2/something-else.jpg
I would like to make these URLs point to the respective files
http://example.cloudfront.net/path1/something.jpg
http://example.cloudfront.net/path2/something-else.jpg
I tried setting up cache behaviors that match the path1 and path2 patterns, but it doesn't work. Do the patterns have to actually exist in the S3 bucket?
Update: the original answer, shown below, was accurate when written in 2015, and is still correct for the built-in behavior of CloudFront itself. Originally, the entire request path needed to exist at the origin.
If the URI is /download/images/cat.png but the origin expects only /images/cat.png then the CloudFront Cache Behavior /download/* will not do what you might assume -- the cache behavior's path pattern is only for matching -- the matched prefix isn't removed.
By itself, CloudFront doesn't provide a way to remove elements from the path requested by the browser when sending the request to the origin. The request is always forwarded as it was received, or with extra characters at the beginning, if the origin path is specified.
However, the introduction of Lambda@Edge in 2017 changes the dynamic.
Lambda@Edge allows you to declare trigger hooks in the CloudFront flow and write small JavaScript functions that inspect and can modify the incoming request, either before the CloudFront cache is checked (viewer request), or after the cache is checked (origin request). This allows you to rewrite the path in the request URI. You could, for example, take a browser request path of /download/images/cat.png and remove /download, resulting in a request being sent to S3 (or a custom origin) for /images/cat.png.
This option does not modify which Cache Behavior will actually service the request, because this is always based on the path as requested by the browser -- but you can then modify the path in-flight so that the actual requested object is at a path other than the one requested by the browser. When used in an Origin Request trigger, the response is cached under the path requested by the browser, so subsequent responses don't need to be rewritten -- they can be served from the cache -- and the trigger won't need to fire for every request.
Lambda@Edge functions can be quite simple to implement. Here's an example function that would remove the first path element, whatever it may be.
'use strict';

// Lambda@Edge Origin Request trigger to remove the first path element.
// Compatible with either the Node.js 6.10 or 8.10 Lambda runtime environment.
exports.handler = (event, context, callback) => {
  const request = event.Records[0].cf.request; // extract the request object
  request.uri = request.uri.replace(/^\/[^\/]+\//, '/'); // modify the URI
  return callback(null, request); // return control to CloudFront
};
That's it. In .replace(/^\/[^\/]+\//,'/'), we're matching the URI against a regular expression that matches the leading / followed by one or more characters that must not be /, and then one more /, and replacing the entire match with a single / -- so the path is rewritten from /abc/def/ghi/... to /def/ghi/... regardless of the exact value of abc. This could be made more complex to suit specific requirements without any notable increase in execution time... but remember that a Lambda@Edge function is tied to one or more Cache Behaviors, so you don't need a single function to handle all requests going through the distribution -- just the requests matched by the associated cache behavior's path pattern.
To simply prepend a prefix onto the request from the browser, the Origin Path setting can still be used, as noted below, but to remove or modify path components requires Lambda@Edge, as above.
Original answer.
Yes, the patterns have to exist at the origin.
CloudFront, natively, can prepend to the path for a given origin, but it does not currently have the capability of removing elements of the path (without Lambda@Edge, as noted above).
If your files were in /secret/files/ at the origin, you could have the path pattern /files/* transformed before sending the request to the origin by setting the Origin Path (to /secret, in this example).
The opposite isn't true. If the files were in /files at the origin, there is not a built-in way to serve those files from path pattern /download/files/*.
You can add (prefix) but not take away.
A relatively simple workaround would be a reverse proxy server on an EC2 instance in the same region as the S3 bucket, pointing CloudFront to the proxy and the proxy to S3. The proxy would rewrite the HTTP request on its way to S3 and stream the resulting response back to CloudFront. I use a setup like this and it has never disappointed me with its performance. (The reverse proxy software I developed can actually check multiple buckets in parallel or series and return the first non-error response it receives, to CloudFront and the requester).
Or, if using the S3 Website Endpoints as the custom origins, you could use S3 redirect routing rules to return a redirect to CloudFront, sending the browser back with the unhandled prefix removed. This would mean an extra request for each object, increasing latency and cost somewhat, but S3 redirect rules can be set to fire only when the request doesn't actually match a file in the bucket. This is useful for transitioning from one hierarchical structure to another.
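For illustration, a sketch of the redirect-rule approach using the AWS SDK for JavaScript v2; the bucket name, prefix, and redirect values are placeholders, and the HttpErrorCodeReturnedEquals condition makes the rule fire only when the request doesn't match a real object:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

s3.putBucketWebsite({
  Bucket: 'example-bucket-1', // placeholder
  WebsiteConfiguration: {
    IndexDocument: { Suffix: 'index.html' },
    RoutingRules: [{
      Condition: {
        KeyPrefixEquals: 'download/',        // the unhandled prefix forwarded by CloudFront
        HttpErrorCodeReturnedEquals: '404'   // only fire when no object actually matches
      },
      Redirect: {
        ReplaceKeyPrefixWith: '',            // strip the prefix; adjust to your layout
        HttpRedirectCode: '302'
      }
    }]
  }
}, (err) => {
  if (err) console.error(err);
  else console.log('Routing rules updated');
});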
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-values-specify.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/HowDoIWebsiteConfiguration.html

Difference between CloudFront TTL expiry and Invalidation?

What are the practical differences between when CloudFront expires objects at an origin via the CloudFront TTL setting versus when one calls invalidate?
The general idea is that you use TTLs to set the policy that CloudFront uses to determine the maximum amount of time each individual object can potentially be served from the CloudFront cache with no further interaction with the origin.
Default TTL: the maximum time an object can persist in a CloudFront cache without being considered stale, when no relevant Cache-Control directive is supplied by the origin. No Cache-Control header is added to the response by CloudFront.
Minimum TTL: If the origin supplies a Cache-Control: s-maxage value (or, if not present, then a Cache-Control: max-age value) smaller than this, CloudFront ignores it and assumes it can retain the object in the cache for not longer than this. For example, if Minimum TTL is set to 900, but the response contains Cache-Control: max-age=300, CloudFront ignores the 300 and may cache the object for up to 900 seconds. The Cache-Control header is not modified, and is returned to the viewer as received.
Maximum TTL: If the origin supplies a Cache-Control directive indicating that the object can be cached longer than this, CloudFront ignores the directive and assumes that the object must not continue to be served from cache for longer than Maximum TTL.
See Specifying How Long Objects Stay in a CloudFront Edge Cache (Expiration) in the Amazon CloudFront Developer Guide.
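A rough sketch of how the three settings combine for a cacheable response (a simplification: it ignores s-maxage taking precedence over max-age, as well as directives like no-cache and no-store):

// originMaxAge is the max-age (or s-maxage) supplied by the origin, in seconds,
// or undefined if the response carries no such directive.
function effectiveTtl(originMaxAge, { minTtl, defaultTtl, maxTtl }) {
  if (originMaxAge === undefined) return defaultTtl;        // no Cache-Control: Default TTL applies
  return Math.min(maxTtl, Math.max(minTtl, originMaxAge));  // otherwise clamp between Min and Max
}

console.log(effectiveTtl(300, { minTtl: 900, defaultTtl: 86400, maxTtl: 31536000 })); // 900, as in the example above
console.log(effectiveTtl(undefined, { minTtl: 0, defaultTtl: 86400, maxTtl: 31536000 })); // 86400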
So, these three values control what CloudFront uses to determine whether a cached response is still "fresh enough" to be returned to subsequent viewers. It does not mean CloudFront purges the cached object after the TTL expires. Instead, CloudFront may retain the object, but will not serve it beyond the expiration without first sending a request to the origin to see if the object has changed.
CloudFront does not proactively check the origin for new versions of objects that have expired -- it only checks when they are requested again, found still in the cache, and determined to have expired. When it does this, it usually sends a conditional request, using directives like If-Modified-Since. This gives the origin the option of responding 304 Not Modified, which tells CloudFront that the cached object is still usable.
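To make the revalidation exchange concrete, here is a minimal, hypothetical Node.js origin that honors If-Modified-Since (a real origin such as S3 or a typical web server does this for you):

const http = require('http');

const lastModified = new Date('2024-01-01T00:00:00Z').toUTCString();

http.createServer((req, res) => {
  const since = req.headers['if-modified-since'];
  if (since && new Date(since) >= new Date(lastModified)) {
    // Nothing has changed: tell CloudFront its cached copy is still usable.
    res.writeHead(304, { 'Last-Modified': lastModified });
    return res.end();
  }
  res.writeHead(200, {
    'Content-Type': 'text/plain',
    'Last-Modified': lastModified,
    'Cache-Control': 'max-age=300'
  });
  res.end('hello from the origin\n');
}).listen(8080);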
A misunderstanding that sometimes surfaces is that the TTL directs CloudFront how long to cache the objects. That is not what it does. It tells CloudFront how long it is allowed to cache the response with no revalidation against the origin. Cache storage inside CloudFront has no associated charge, and caches by definition are ephemeral, so, objects that are rarely requested may be purged from the cache before their TTL expires.
If an object in an edge location isn't frequently requested, CloudFront might evict the object—remove the object before its expiration date—to make room for objects that have been requested more recently.
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html
On the next request, CloudFront will request the object from the origin again.
Another misunderstanding is that CloudFront's cache is monolithic. It isn't. Each of the global edges has its own independent cache, caching objects in edges through which they are being requested. Each global edge also has an upstream regional cache (in the nearest EC2 region; there may be more than one per region, but this isn't documented) where the object will also be stored, allowing other nearby global edges to find the object in the nearest regional cache, but CloudFront does not search any further, internally, for cached objects. For performance, it just goes to the origin on a cache miss.
See How CloudFront Works with Regional Edge Caches.
Invalidation is entirely different, and is intended to be used sparingly -- only the first 1000 invalidation paths submitted each month by an AWS account are free. (A path can match many files, and the path /* matches all files in the distribution).
An invalidation request has a timestamp of when the invalidation was created, and sends a message to all regions, directing them to do something along these lines (the exact algorithm isn't documented, but this accurately describes the net effect):
Delete any files matching ${path} from your cache, if they were cached prior to ${timestamp} and
Meanwhile, since that could take some time, if you get any requests for files matching ${path} that were cached prior to ${timestamp}, don't use the cached files because they are no longer usable.
The invalidation request is considered complete as soon as the entire network has received the message. Invalidations are essentially an idempotent action, in the sense that it is not an error to invalidate files that don't actually exist, because an invalidation is telling the edges to invalidate such files if they exist.
From this, it should be apparent that the correct course of action is not to choose one or the other, but to use both as appropriate. Set your TTLs (or select "use origin cache headers" and configure your origin server always to return them with appropriate values) and then use invalidations only as necessary to purge your cache of selected or all content, as might be necessary if you've made an error, or made significant changes to the site.
The best practice, however, is not to count on invalidations but instead to use cache-busting techniques when an object changes. Cache busting means changing the actual object being requested. In the simplest implementation, for example, this might mean you change /pics/cat1.png to /pics/cat2.png in your HTML rather than saving a new image as /pics/cat1.png when you want a new image. The problem with replacing one file with another at the same URL is that the browser also has a cache, and may continue displaying the old image.
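A common way to automate that renaming is to derive the file name from a content hash, so the URL changes exactly when the content does (a sketch; the paths are hypothetical):

const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Return a cache-busted name like pics/cat.3f2a1b9c.png for the given file.
function hashedName(filePath) {
  const hash = crypto.createHash('md5')
    .update(fs.readFileSync(filePath))
    .digest('hex')
    .slice(0, 8);
  const { dir, name, ext } = path.parse(filePath);
  return path.join(dir, `${name}.${hash}${ext}`);
}

console.log(hashedName('pics/cat.png')); // e.g. pics/cat.3f2a1b9c.png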
See also Invalidating Objects.
Note also that the main TTLs are not used for error responses. By default, responses like 404 Not Found are cached for 5 minutes. This is the Error Caching Minimum TTL, designed to relieve your origin server from receiving requests that are likely to continue to fail, but only for a few minutes.
If we are looking at practical differences:
CloudFront TTL: You can control how long your objects stay in a CloudFront cache before CloudFront forwards another request to your origin.
Invalidate: Invalidate the object from edge caches. The next time a viewer requests the object, CloudFront returns to the origin to fetch the latest version of the object.
So the main difference is speed. If you deploy a new version of your application, you might want to invalidate immediately.

Changing "Origin Path" in CloudFront takes very long to kick in

We have a static site hosted in S3 and delivered with CloudFront. The site works, but rolling out updates takes quite a long time -- hours or more. Specifically, changing the Origin Path is not reflected at edge locations nearly as quickly as desired.
Here is what we are trying to achieve...
Our S3 bucket is configured to host a website. It stores multiple versions of the same site. There is a sub-directory per git tag. For example:
/git-v1
/git-v2
/git-v3
..
The goal is to tell CF to start serving a new version of the site per Origin Path setting. We don't want to invalidate old objects, just keep advancing the version by creating a new directory and pointing CF at it. The status under CloudFront Distributions shows "Deployed" for a long time, yet the edge locations continue to ignore the new Origin Path.
Any idea for how to make CF start serving the new sub-directory quicker would be greatly appreciated.
The Origin Path setting is applied to the request after the cache is checked... not before. When the object requested in the URI is not in the cache, the object is requested from the Origin server. At that point, Origin Path is prepended to the incoming request path, then sent to the origin. Caching is based on the incoming request path.¹
The setting itself takes effect quickly, often in seconds, but doesn't purge the cache.
If this is just for versioning the root page, you can leave the origin path blank, change the Default Root Object to the new root object, and then just invalidate /. Or, you can keep doing what you are doing, and invalidate /* after making the change. Free invalidations are limited to 1000 per month, but invalidating /* (or any wildcard) only counts as 1 invalidation, no matter how many objects the wildcard matches.
¹ incoming request path also refers to the path as it stands after a Lambda#Edge Viewer Request trigger modifies it, if applicable.
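If the rollout is scripted, the flow described above can be automated; a sketch assuming the AWS SDK for JavaScript v2 and placeholder values (advance the Origin Path, then invalidate /* so the edges stop serving the old version):

const AWS = require('aws-sdk');
const cloudfront = new AWS.CloudFront();

async function release(distributionId, newOriginPath) {
  // Fetch the current config; the returned ETag is required for the update.
  const { ETag, DistributionConfig } = await cloudfront
    .getDistributionConfig({ Id: distributionId }).promise();

  DistributionConfig.Origins.Items[0].OriginPath = newOriginPath; // e.g. '/git-v4'

  await cloudfront.updateDistribution({
    Id: distributionId,
    IfMatch: ETag,
    DistributionConfig
  }).promise();

  // Changing the setting alone doesn't purge anything already cached under the old path.
  await cloudfront.createInvalidation({
    DistributionId: distributionId,
    InvalidationBatch: {
      CallerReference: `release-${Date.now()}`,
      Paths: { Quantity: 1, Items: ['/*'] }
    }
  }).promise();
}

release('MY_DISTRIBUTION_ID', '/git-v4').catch(console.error);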
