Resize images on the fly in CloudFront and get them in the same URL instantly: AWS CloudFront -> S3 -> Lambda -> CloudFront - amazon-web-services

TLDR: We have to trick CloudFront 307 redirect caching by creating new cache behavior for responses coming from our Lambda function.
You will not believe how close we are to achieve this. We have stucked so badly in the last step.
Business case:
Our application stores images in S3 and serves them with CloudFront in order to avoid any geographic slow downs around the globe.
Now, we want to be really flexible with the design and to be able to request new image dimentions directly in the CouldFront URL!
Each new image size will be created on demand and then stored in S3, so the second time it is requested it will be
served really quickly as it will exist in S3 and also will be cached in CloudFront.
Lets say the user had uploaded the image chucknorris.jpg.
Only the original image will be stored in S3 and wil be served on our page like this:
//xxxxx.cloudfront.net/chucknorris.jpg
We have calculated that we now need to display a thumbnail of 200x200 pixels.
Therefore we put the image src to be in our template:
//xxxxx.cloudfront.net/chucknorris-200x200.jpg
When this new size is requested, the amazon web services have to provide it on the fly in the same bucket and with the requested key.
This way the image will be directly loaded in the same URL of CloudFront.
I made an ugly drawing with the architecture overview and the workflow on how we are doing this in AWS:
Here is how Python Lambda ends:
return {
'statusCode': '301',
'headers': {'location': redirect_url},
'body': ''
}
The problem:
If we make the Lambda function redirect to S3, it works like a charm.
If we redirect to CloudFront, it goes into redirect loop because CloudFront caches 307 (as well as 301, 302 and 303).
As soon as our Lambda function redirects to CloudFront, CloudFront calls the API Getaway URL instead of fetching the image from S3:
I would like to create new cache behavior in CloudFront's Behaviors settings tab.
This behavior should not cache responses from Lambda or S3 (don't know what exactly is happening internally there), but should still cache any followed requests to this very same resized image.
I am trying to set path pattern -\d+x\d+\..+$, add the ARN of the Lambda function in add "Lambda Function Association"
and set Event Type Origin Response.
Next to that, I am setting the "Default TTL" to 0.
But I cannot save the behavior due to some error:
Are we on the right way, or is the idea of this "Lambda Function Association" totally different?

Finally I was able to solve it. Although this is not really a structural solution, it does what we need.
First, thanks to the answer of Michael, I have used path patterns to match all media types. Second, the Cache Behavior page was a bit misleading to me: indeed the Lambda association is for Lambda#Edge, although I did not see this anywhere in all the tooltips of the cache behavior: all you see is just Lambda. This feature cannot help us as we do not want to extend our AWS service scope with Lambda#Edge just because of that particular problem.
Here is the solution approach:
I have defined multiple cache behaviors, one per media type that we support:
For each cache behavior I set the Default TTL to be 0.
And the most important part: In the Lambda function, I have added a Cache-Control header to the resized images when putting them in S3:
s3_resource.Bucket(BUCKET).put_object(Key=new_key,
Body=edited_image_obj,
CacheControl='max-age=12312312',
ContentType=content_type)
To validate that everything works, I see now that the new image dimention is served with the cache header in CloudFront:

You're on the right track... maybe... but there are at least two problems.
The "Lambda Function Association" that you're configuring here is called Lambda#Edge, and it's not yet available. The only users who can access it is users who have applied to be included in the limited preview. The "maximum allowed is 0" error means you are not a preview participant. I have not seen any announcements related to when this will be live for all accounts.
But even once it is available, it's not going to help you, here, in the way you seem to expect, because I don't believe an Origin Response trigger allows you to do anything to trigger CloudFront to try a different destination and follow the redirect. If you see documentation that contradicts this assertion, please bring it to my attention.
However... Lambda#Edge will be useful for setting Cache-Control: no-cache on the 307 so CloudFront won't cache it, but the redirect itself will still need to go all the way back to the browser.
Note also, Lambda#Edge only supports Node, not Python... so maybe this isn't even part of your plan, yet. I can't really tell, from the question.
Read about the Lambda#Edge limited preview.
The second problem:
I am trying to set path pattern -\d+x\d+\..+$
You can't do that. Path patterns are string matches supporting * wildcards. They are not regular expressions. You might get away with /*-*x*.jpg, though, since multiple wildcards appear to be supported.

Related

Google Cloud CDN started ignoring query strings for storage buckets

Some months ago activated Cloud CDN for storage buckets. Our storage data is regularly changed via a backend. So to invalidate the cached version we added a query param with the changedDate to the url that is served to the client.
Back then this worked well.
Sometime in the last months (probably weeks) Google seemed to change that and is now ignoring the query string for caching from storage buckets.
First part: Does anyone know why this is changed and why noone was
notified about it?
Second part: How can you invalidate the Cache for a particular object
in a storage bucket without sending a cache-invalidation request
(which you shouldn't) everytime?
I don't like the idea of deleting the old file and uploading a new file with changed filename everytime something is uploaded...
EDIT:
for clarification: the official docu ( cloud.google.com/cdn/docs/caching ) already states that they now ignore query strings for storage buckets:
For backend buckets, the cache key consists of the URI without the query > string. Thus https://example.com/images/cat.jpg, https://example.com/images/cat.jpg?user=user1, and https://example.com/images/cat.jpg?user=user2 are equivalent.
We were affected by this also. After contacting Google Support, they have confirmed this is a permanent change. The recommended work around is to either use versioning in the object name, or use cache invalidation. The latter sounds a bit odd as the cache invalidation documentation states:
Invalidation is intended for use in exceptional circumstances, not as part of your normal workflow.
For backend buckets, the cache key consists of the URI without the query string, as the official documentation states.1 The bucket is not evaluating the query string but the CDN should still do that. I could reproduce this same scenario and currently is still possible to use a query string as cache buster.
Seems like the reason for the change is that the old behavior resulted in lost caching opportunities, higher costs and higher latency. The only recommended workaround for now is to create the new objects by incorporating the version into the object's name (which seems is not valid options for your case), or using cache invalidation.
Invalidating the cache for a particular object will require to use a particular query. Maybe a Cache-Control header allowing such objects to be cached for a certain time may be your workaround. Cloud CDN cache has an expiration time defined by the "Cache-Control: s-maxage", "Cache-Control: max-age", and/or Expires headers 2.
According to the doc, when using backend bucket as origin for Cloud CDN, query strings in the request URL are not included in the cache key:
For backend buckets, the cache key consists of the URI without the protocol, host, or query string.
Maybe using the query string to identify different versions of cached content is not the best practices promoted by GCP. But for some legacy issues, it has to be.
So, one way to workaround this is make backend bucket to be a static website (do NOT enable CDN here), then use custom origins (Cloud CDN backed by Internet network endpoint groups backend service) which points to that static website.
For backend service, query string IS part of cache key.
For backend services, Cloud CDN defaults to using the complete request URI as the cache key
That's it. Yes, It is tedious but works!

Don't pass on the patternpath from cloudfront to origin? [duplicate]

I have two S3 buckets that are serving as my Cloudfront origin servers:
example-bucket-1
example-bucket-2
The contents of both buckets live in the root of those buckets. I am trying to configure my Cloudfront distribution to route or rewrite based on a URL pattern. For example, with these files
example-bucket-1/something.jpg
example-bucket-2/something-else.jpg
I would like to make these URLs point to the respective files
http://example.cloudfront.net/path1/something.jpg
http://example.cloudfront.net/path2/something-else.jpg
I tried setting up cache behaviors that match the path1 and path2 patterns, but it doesn't work. Do the patterns have to actually exist in the S3 bucket?
Update: the original answer, shown below, is was accurate when written in 2015, and is correct based on the built-in behavior of CloudFront itself. Originally, the entire request path needed to exist at the origin.
If the URI is /download/images/cat.png but the origin expects only /images/cat.png then the CloudFront Cache Behavior /download/* will not do what you might assume -- the cache behavior's path pattern is only for matching -- the matched prefix isn't removed.
By itself, CloudFront doesn't provide a way to remove elements from the path requested by the browser when sending the request to the origin. The request is always forwarded as it was received, or with extra characters at the beginning, if the origin path is specified.
However, the introduction of Lambda#Edge in 2017 changes the dynamic.
Lambda#Edge allows you to declare trigger hooks in the CloudFront flow and write small Javascript functions that inspect and can modify the incoming request, either before the CloudFront cache is checked (viewer request), or after the cache is checked (origin request). This allows you to rewrite the path in the request URI. You could, for example, transform a request path from the browser of /download/images/cat.png to remove /download, resulting in a request being sent to S3 (or a custom orgin) for /images/cat.png.
This option does not modify which Cache Behavior will actually service the request, because this is always based on the path as requested by the browser -- but you can then modify the path in-flight so that the actual requested object is at a path other than the one requested by the browser. When used in an Origin Request trigger, the response is cached under the path requested by the browser, so subsequent responses don't need to be rewritten -- they can be served from the cache -- and the trigger won't need to fire for every request.
Lambda#Edge functions can be quite simple to implement. Here's an example function that would remove the first path element, whatever it may be.
'use strict';
// lambda#edge Origin Request trigger to remove the first path element
// compatible with either Node.js 6.10 or 8.10 Lambda runtime environment
exports.handler = (event, context, callback) => {
const request = event.Records[0].cf.request; // extract the request object
request.uri = request.uri.replace(/^\/[^\/]+\//,'/'); // modify the URI
return callback(null, request); // return control to CloudFront
};
That's it. In .replace(/^\/[^\/]+\//,'/'), we're matching the URI against a regular expression that matches the leading / followed by 1 or more characters that must not be /, and then one more /, and replacing the entire match with a single / -- so the path is rewritten from /abc/def/ghi/... to /def/ghi/... regardless of the exact value of abc. This could be made more complex to suit specific requirements without any notable increase in execution time... but remember that a Lambda#Edge function is tied to one or more Cache Behaviors, so you don't need a single function to handle all requests going through the distribution -- just the request matched by the associated cache behavior's path pattern.
To simply prepend a prefix onto the request from the browser, the Origin Path setting can still be used, as noted below, but to remove or modify path components requires Lambda#Edge, as above.
Original answer.
Yes, the patterns have to exist at the origin.
CloudFront, natively, can prepend to the path for a given origin, but it does not currently have the capability of removing elements of the path (without Lambda#Edge, as noted above).
If your files were in /secret/files/ at the origin, you could have the path pattern /files/* transformed before sending the request to the origin by setting the "origin path."
The opposite isn't true. If the files were in /files at the origin, there is not a built-in way to serve those files from path pattern /download/files/*.
You can add (prefix) but not take away.
A relatively simple workaround would be a reverse proxy server on an EC2 instance in the same region as the S3 bucket, pointing CloudFront to the proxy and the proxy to S3. The proxy would rewrite the HTTP request on its way to S3 and stream the resulting response back to CloudFront. I use a setup like this and it has never disappointed me with its performance. (The reverse proxy software I developed can actually check multiple buckets in parallel or series and return the first non-error response it receives, to CloudFront and the requester).
Or, if using the S3 Website Endpoints as the custom origins, you could use S3 redirect routing rules to return a redirect to CloudFront, sending the browser back with the unhandled prefix removed. This would mean an extra request for each object, increasing latency and cost somewhat, but S3 redirect rules can be set to fire only when the request doesn't actually match a file in the bucket. This is useful for transitioning from one hierarchical structure to another.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-values-specify.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/HowDoIWebsiteConfiguration.html

AWS s3 bucket with CloudFront - Failed to load resource: the server responded with a status of 403 [duplicate]

I am experimenting with AWS S3 and CloudFront for a web application that I am developing.
In the app I'm letting users upload files to the S3 bucket (using the AWS SDK) and make it available via CloudFront CDN, but the issue is even when the files are uploaded and ready in the S3 bucket it takes about a minute or 2 to be available in the CloudFront CDN url, is this normal?
CloudFront attempts to fetch uncached content from the origin server in real time. There is no "replication delay" or similar issue because CloudFront is a pull-through CDN. Each CloudFront edge location knows only about your site's existence and configuration; it doesn't know about your content until it receives requests for it. When that happens, the CloudFront edge fetches the requested content from the origin server, and caches it as appropriate, for serving subsequent requests.
The issue that's occurring here is related to a concept sometimes called "negative caching" -- caching the fact that a request won't work -- which is typically done to avoid hammering the origin of whatever's being cached with requests that are likely to fail anyway.
By default, when your origin returns an HTTP 4xx or 5xx status code, CloudFront caches these error responses for five minutes and then submits the next request for the object to your origin to see whether the problem that caused the error has been resolved and the requested object is now available.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/custom-error-pages.html
If the browser, or anything else, tries to download the file from that particular CloudFront edge before the upload into S3 is complete, S3 will return an error, and CloudFront -- at that edge location -- will cache that error and remember, for the next 5 minutes, not to bother trying again.
Not to worry, though -- this timer is configurable, so if the browser is doing this under the hood and outside your control, you should still be able to fix it.
You can specify the error-caching duration—the Error Caching Minimum TTL—for each 4xx and 5xx status code that CloudFront caches. For a procedure, see Configuring Error Response Behavior.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/custom-error-pages.html
To configure this in the console:
When viewing the distribution configuration, click the Error Pages tab.
For each error where you want to customize the timing, begin by clicking Create Custom Error Response.
Choose the error code you want to modify from the drop-down list, such as 403 (Forbidden) or 404 (Not Found) -- your bucket configuration determines which code S3 returns for missing objects, so if you aren't sure, change 403 then repeat the process and change 404.
Set Error Caching Minimum TTL (seconds) to 0
Leave Customize Error Response set to No (If set to Yes, this option enables custom response content on errors, which is not what you want. Activating this option is outside the scope of this question.)
Click Create. This takes you back to the previous view, where you'll see Error Caching Minimum TTL for the code you just defined.
Repeat these steps for each HTTP response code you want to change from the default behavior (which is the 300 second hold time, discussed above).
When you've made all the changes you want, return to the main CloudFront console screen where the distributions are listed. Wait for the distribution state to change from In Progress to Deployed (formerly, this took quite some time but now requires typically about 5 minutes for the changes to be pushed out to all the edges) and test.
Are these new files being written to S3 for the first time, or are they updates to existing files? S3 provides read-after-write consistency for new objects, and given CloudFront's pull model you should not be having this issue with new files written to S3. If you are, then I would open a ticket with AWS.
If these are updates to existing files, then you have both S3 eventual consistency and CloudFront cache expiration to deal with. Both of which could cause this sort of behavior.
As observed in your comment, it seems that google chrome is messing up with your upload/preview strategy:
Chrome is requesting the URL that currently doesn't have the
content.
the request is cached by cloudfront with invalid response
you upload the file to S3
when preview the uploaded file the cloudfront answers with the cached response (step 2).
after the cloudfront cache expires, cloudfront hits origin and the problem can no longer be reproducible.

Amazon CloudFront Latency

I am experimenting with AWS S3 and CloudFront for a web application that I am developing.
In the app I'm letting users upload files to the S3 bucket (using the AWS SDK) and make it available via CloudFront CDN, but the issue is even when the files are uploaded and ready in the S3 bucket it takes about a minute or 2 to be available in the CloudFront CDN url, is this normal?
CloudFront attempts to fetch uncached content from the origin server in real time. There is no "replication delay" or similar issue because CloudFront is a pull-through CDN. Each CloudFront edge location knows only about your site's existence and configuration; it doesn't know about your content until it receives requests for it. When that happens, the CloudFront edge fetches the requested content from the origin server, and caches it as appropriate, for serving subsequent requests.
The issue that's occurring here is related to a concept sometimes called "negative caching" -- caching the fact that a request won't work -- which is typically done to avoid hammering the origin of whatever's being cached with requests that are likely to fail anyway.
By default, when your origin returns an HTTP 4xx or 5xx status code, CloudFront caches these error responses for five minutes and then submits the next request for the object to your origin to see whether the problem that caused the error has been resolved and the requested object is now available.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/custom-error-pages.html
If the browser, or anything else, tries to download the file from that particular CloudFront edge before the upload into S3 is complete, S3 will return an error, and CloudFront -- at that edge location -- will cache that error and remember, for the next 5 minutes, not to bother trying again.
Not to worry, though -- this timer is configurable, so if the browser is doing this under the hood and outside your control, you should still be able to fix it.
You can specify the error-caching duration—the Error Caching Minimum TTL—for each 4xx and 5xx status code that CloudFront caches. For a procedure, see Configuring Error Response Behavior.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/custom-error-pages.html
To configure this in the console:
When viewing the distribution configuration, click the Error Pages tab.
For each error where you want to customize the timing, begin by clicking Create Custom Error Response.
Choose the error code you want to modify from the drop-down list, such as 403 (Forbidden) or 404 (Not Found) -- your bucket configuration determines which code S3 returns for missing objects, so if you aren't sure, change 403 then repeat the process and change 404.
Set Error Caching Minimum TTL (seconds) to 0
Leave Customize Error Response set to No (If set to Yes, this option enables custom response content on errors, which is not what you want. Activating this option is outside the scope of this question.)
Click Create. This takes you back to the previous view, where you'll see Error Caching Minimum TTL for the code you just defined.
Repeat these steps for each HTTP response code you want to change from the default behavior (which is the 300 second hold time, discussed above).
When you've made all the changes you want, return to the main CloudFront console screen where the distributions are listed. Wait for the distribution state to change from In Progress to Deployed (formerly, this took quite some time but now requires typically about 5 minutes for the changes to be pushed out to all the edges) and test.
Are these new files being written to S3 for the first time, or are they updates to existing files? S3 provides read-after-write consistency for new objects, and given CloudFront's pull model you should not be having this issue with new files written to S3. If you are, then I would open a ticket with AWS.
If these are updates to existing files, then you have both S3 eventual consistency and CloudFront cache expiration to deal with. Both of which could cause this sort of behavior.
As observed in your comment, it seems that google chrome is messing up with your upload/preview strategy:
Chrome is requesting the URL that currently doesn't have the
content.
the request is cached by cloudfront with invalid response
you upload the file to S3
when preview the uploaded file the cloudfront answers with the cached response (step 2).
after the cloudfront cache expires, cloudfront hits origin and the problem can no longer be reproducible.

Multiple Cloudfront Origins with Behavior Path Redirection

I have two S3 buckets that are serving as my Cloudfront origin servers:
example-bucket-1
example-bucket-2
The contents of both buckets live in the root of those buckets. I am trying to configure my Cloudfront distribution to route or rewrite based on a URL pattern. For example, with these files
example-bucket-1/something.jpg
example-bucket-2/something-else.jpg
I would like to make these URLs point to the respective files
http://example.cloudfront.net/path1/something.jpg
http://example.cloudfront.net/path2/something-else.jpg
I tried setting up cache behaviors that match the path1 and path2 patterns, but it doesn't work. Do the patterns have to actually exist in the S3 bucket?
Update: the original answer, shown below, is was accurate when written in 2015, and is correct based on the built-in behavior of CloudFront itself. Originally, the entire request path needed to exist at the origin.
If the URI is /download/images/cat.png but the origin expects only /images/cat.png then the CloudFront Cache Behavior /download/* will not do what you might assume -- the cache behavior's path pattern is only for matching -- the matched prefix isn't removed.
By itself, CloudFront doesn't provide a way to remove elements from the path requested by the browser when sending the request to the origin. The request is always forwarded as it was received, or with extra characters at the beginning, if the origin path is specified.
However, the introduction of Lambda#Edge in 2017 changes the dynamic.
Lambda#Edge allows you to declare trigger hooks in the CloudFront flow and write small Javascript functions that inspect and can modify the incoming request, either before the CloudFront cache is checked (viewer request), or after the cache is checked (origin request). This allows you to rewrite the path in the request URI. You could, for example, transform a request path from the browser of /download/images/cat.png to remove /download, resulting in a request being sent to S3 (or a custom orgin) for /images/cat.png.
This option does not modify which Cache Behavior will actually service the request, because this is always based on the path as requested by the browser -- but you can then modify the path in-flight so that the actual requested object is at a path other than the one requested by the browser. When used in an Origin Request trigger, the response is cached under the path requested by the browser, so subsequent responses don't need to be rewritten -- they can be served from the cache -- and the trigger won't need to fire for every request.
Lambda#Edge functions can be quite simple to implement. Here's an example function that would remove the first path element, whatever it may be.
'use strict';
// lambda#edge Origin Request trigger to remove the first path element
// compatible with either Node.js 6.10 or 8.10 Lambda runtime environment
exports.handler = (event, context, callback) => {
const request = event.Records[0].cf.request; // extract the request object
request.uri = request.uri.replace(/^\/[^\/]+\//,'/'); // modify the URI
return callback(null, request); // return control to CloudFront
};
That's it. In .replace(/^\/[^\/]+\//,'/'), we're matching the URI against a regular expression that matches the leading / followed by 1 or more characters that must not be /, and then one more /, and replacing the entire match with a single / -- so the path is rewritten from /abc/def/ghi/... to /def/ghi/... regardless of the exact value of abc. This could be made more complex to suit specific requirements without any notable increase in execution time... but remember that a Lambda#Edge function is tied to one or more Cache Behaviors, so you don't need a single function to handle all requests going through the distribution -- just the request matched by the associated cache behavior's path pattern.
To simply prepend a prefix onto the request from the browser, the Origin Path setting can still be used, as noted below, but to remove or modify path components requires Lambda#Edge, as above.
Original answer.
Yes, the patterns have to exist at the origin.
CloudFront, natively, can prepend to the path for a given origin, but it does not currently have the capability of removing elements of the path (without Lambda#Edge, as noted above).
If your files were in /secret/files/ at the origin, you could have the path pattern /files/* transformed before sending the request to the origin by setting the "origin path."
The opposite isn't true. If the files were in /files at the origin, there is not a built-in way to serve those files from path pattern /download/files/*.
You can add (prefix) but not take away.
A relatively simple workaround would be a reverse proxy server on an EC2 instance in the same region as the S3 bucket, pointing CloudFront to the proxy and the proxy to S3. The proxy would rewrite the HTTP request on its way to S3 and stream the resulting response back to CloudFront. I use a setup like this and it has never disappointed me with its performance. (The reverse proxy software I developed can actually check multiple buckets in parallel or series and return the first non-error response it receives, to CloudFront and the requester).
Or, if using the S3 Website Endpoints as the custom origins, you could use S3 redirect routing rules to return a redirect to CloudFront, sending the browser back with the unhandled prefix removed. This would mean an extra request for each object, increasing latency and cost somewhat, but S3 redirect rules can be set to fire only when the request doesn't actually match a file in the bucket. This is useful for transitioning from one hierarchical structure to another.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/distribution-web-values-specify.html
http://docs.aws.amazon.com/AmazonS3/latest/dev/HowDoIWebsiteConfiguration.html