Static content on CloudFront is cached incorrectly over time

I have set up a CloudFront distribution on top of multiple S3 buckets (in different regions) to provide a fast, stable version of my webapp. The webapp is implemented with React, which means it is all one single HTML file and one single JavaScript file.
Using React's routing mechanism, all paths in the URL are handled within the code. This means that if I click a link like www.example.com/users, no request is sent to the server; instead, the client code renders the appropriate page without consulting the server (I'm only talking about the HTML here, not the data). It also means that if a user types in that URL directly, the server should return index.html (the only HTML file I have), which will then handle the URL on the client side. In other words, every request sent to the server should return either the HTML file or the JavaScript file I mentioned earlier, even requests that point to non-existent files.
In order to implement this requirement, I asked this question and got an answer along these lines:
I need to set up an error page for my distribution on CloudFront and redirect all the 403 (Forbidden) errors to the /index.html file. This is because when a request points to a nonexistent file on S3, S3 returns 403 to CloudFront due to the lack of listing permission. Alternatively, I can grant the listing permission and handle the 404 error instead (I didn't test this latter option).
Anyway, I set this up and it worked perfectly - for a few hours. But then, for some unknown reason, requests for the JavaScript file started returning the HTML file as well. And of course, everything I get back is coming from CloudFront's cache, which means that no matter how many times I send the request, it keeps returning the same value. That is, until I invalidate the cache on CloudFront, which solves the problem for a few more hours. And around and around we go.
I'm not sure why this happens, but my guess is that at some point the S3 bucket becomes inaccessible to CloudFront, which results in CloudFront caching index.html in its place. What can I do about this?

I think I found the problem:
MAKE SURE YOUR STATIC CONTENT ON ALL THE S3 BUCKETS IS IDENTICAL!!!
In my case, the JavaScript filename is generated automatically by Webpack (it includes a hash), so it can change from build to build. And since the bundles for different regions were "compiled" separately, their filenames differed.
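For context, here is a minimal sketch of the kind of Webpack output configuration that produces these per-build filenames (a hypothetical webpack.config.js, not taken from the question):

// A hash in the output filename changes whenever the bundle changes,
// so separate builds deployed to different buckets can easily end up
// with different filenames.
const path = require('path');

module.exports = {
  entry: './src/index.js',
  output: {
    filename: 'bundle.[contenthash].js', // e.g. bundle.3f7a9c1b2d.js
    path: path.resolve(__dirname, 'dist'),
  },
};

Building once and syncing the same dist/ output to every bucket avoids the mismatch.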

Related

Understanding Server/Client Routing: How Can Amazon(?) Be Redirecting My SPA ... Without a Redirect (or History Entry)?

NOTE: I'm providing details of my setup, but really this is a "how is this possible" question, not a "please debug my setup" question.
I have a "singe page application" (ie. an HTML file that uses the History API to simulate URLs). I'm serving this app on AWS S3, behind an AWS Cloudfront ... front.
I had successfully configured things so that if someone went to www.example.com/foo (let's pretend I own example.com), CloudFront would serve an "error page" of my index.html. My index.html would then see the URL and use its routing to show the user the correct page.
That all worked great ... until it didn't. Now, for some reason, when I go to www.example.com/foo, I get redirected to www.example.com. I'm trying to debug things, but what I can't understand is how I'm going from /foo to the main page.
When I look in the Network panel of my developer tools, I can see the request made to the original (/foo). Then I can see the chain of requests (for images, css files, etc.), and they all have a referrer of www.example.com/foo.
Then all of a sudden I see a request for React Developer Tools (why it needs to make a request is beyond me) ... and it's from referrer www.example.com. After that I get one last image request from /foo, and then all subsequent requests come from www.example.com.
Can anyone explain how this could be working? I know that if a server returns a redirect (either type), that could change my URL ... but every request has a 200 status (i.e. no server redirects).
I know JavaScript could "push" a new URL to my browser ... but that would leave a history entry, right? When I go "back" (either with my browser or history.back()) I go to the page before; I don't go "back" to /foo.
So somehow I'm not making a history entry, but I am switching my URL, and the URL I make requests from, and this all happens within milliseconds on page load ... without any redirects. How?
P.S. When I use my dev tools to add a beforeunload breakpoint, then try to navigate from example.com to example.com/foo, I don't hit that breakpoint (either for going to /foo, or when I'm "redirected" back to example.com).
When I check the box for any Load event, I do see some happen ... after my URL has already switched. In other words, I type example.com/foo, hit enter, and by the time any event fires I'm back on example.com. Whatever mechanism is doing the "redirection" here ... it doesn't trigger any load events.
I figured out my (AWS-specific) problem, thanks to a bit of Gatsby documentation. I'll include the details below in case it helps others, but I won't accept this answer, as I still don't understand how AWS did what it did (and I'd still welcome an answer for that).
What happened was that I had my Cloudfront "Origin Domain Name and Path" pointing to:
example.com.s3.amazonaws.com
However, as explained on https://www.gatsbyjs.com/docs/deploying-to-s3-cloudfront/:
There are two ways that you can connect CloudFront to an S3 origin. The most obvious way, which the AWS Console will suggest, is to type the bucket name in the Origin Domain Name field. This sets up an S3 origin, and allows you to configure CloudFront to use IAM to access your bucket. Unfortunately, it also makes it impossible to perform serverside (301/302) redirects, and it also means that directory indexes (having index.html be served when someone tries to access a directory) will only work in the root directory. You might not initially notice these issues, because Gatsby’s clientside JavaScript compensates for the latter and plugins such as gatsby-plugin-meta-redirect can compensate for the former. But just because you can’t see these issues, doesn’t mean they won’t affect search engines.
In order for all the features of your site to work correctly, you must instead use your S3 bucket’s Static Website Hosting Endpoint as the CloudFront origin. This does (sadly) mean that your bucket will have to be configured for public-read, because when CloudFront is using an S3 Static Website Hosting Endpoint address as the Origin, it’s incapable of authenticating via IAM.
Once I changed my Cloudfront "Origin Domain Name and Path" to the bucket's static hosting URL:
http://example.com.s3-website-us-west-1.amazonaws.com
Everything worked!
But again, I still don't understand how AWS did what it did when I mis-set my "Origin Domain Name and Path". It redirected me to my root domain, seemingly without either a redirect response OR a client-side redirect, and I'd love to hear how that was accomplished.

Performing an internal redirect on Amazon CloudFront in a 4XX error handler

We would like to serve several test domains off a single S3 bucket using CloudFront as a frontend.
Namely, https://test-1.domain.com/index.html goes to bucket-1.s3.amazonaws.com/test-1/index.html, https://test-2.domain.com/index.html to bucket-1.s3.amazonaws.com/test-2/index.html and so on.
The problem is that our web app is an SPA, so when there is no matching content in the S3 bucket we should return 200, not 404. For example, https://test-2.domain.com/some/url should get bucket-1.s3.amazonaws.com/test-2/index.html without modifying the URL (thus, 302 is not an option).
It would be perfectly possible using an Error Pages setting for a CloudFront distribution if we were serving just a single domain, but we need to distinguish between test-1. and test-2. and use index.htmls from different subfolders. Is this still possible anyhow?
I think this is possible using a Lambda@Edge origin request function.
This is how I would do it (in a somewhat complicated way), with a sketch after the link below:
1. Whitelist the Host header (I know we shouldn't do this for S3).
2. Write a Lambda@Edge function that reads the Host header value and, if it is test-1.domain.com, chooses the origin path bucket-1.s3.amazonaws.com/test-1/, otherwise bucket-1.s3.amazonaws.com/test-2/.
https://aws.amazon.com/blogs/networking-and-content-delivery/dynamically-route-viewer-requests-to-any-origin-using-lambdaedge/
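A minimal sketch of such an origin-request handler (Node.js; the bucket layout and test-N subdomains come from the question, everything else is an assumption):

'use strict';

// Origin-request Lambda@Edge handler: route each subdomain to its own
// subfolder in the single S3 bucket.
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;

    // Requires the Host header to be whitelisted in the cache behavior;
    // otherwise it already contains the origin domain at this point.
    const host = request.headers.host[0].value;   // e.g. "test-1.domain.com"
    const subfolder = host.split('.')[0];         // "test-1"

    // Point the S3 origin at the matching subfolder (no trailing slash).
    request.origin.s3.path = '/' + subfolder;

    // Reset the Host header to the bucket endpoint so S3 accepts the request.
    request.headers.host[0].value = request.origin.s3.domainName;

    callback(null, request);
};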

S3 Static Website: Return HTTP 410

Background
I have a static website on S3 with tens of thousands of HTML pages indexed on Google. I'm moving to a new version and I want to remove old pages (which may no longer exist) from Google's index. I've read online that the most efficient way to do that is to return HTTP 410 (Gone).
Problem
According to http://docs.aws.amazon.com/AmazonS3/latest/dev/CustomErrorDocSupport.html, you cannot return an HTTP 410 when using an S3 static website.
API Gateway
I created a mock integration in API Gateway which returns HTTP 410. Then I configured my S3 bucket to automatically redirect a specific prefix to this URL. However, the return code seen is HTTP 301 (for the first redirect). If I GET the API endpoint directly, I receive the 410 successfully; however, if I access the API through an S3 GET, the status code is 301.
What's next
If anyone has an idea on how to return HTTP 410 on a static website hosted on S3, let me know.
Additionally, if you can think of a better alternative to de-index old pages on Google (the manual tool isn't a solution, as I have a large number of pages), let me know :)
I really feel that a better answer would be to put a server in front of the S3 content with a very simple database table. Your real issue is determining a 410 vs. a 404. That is, you know a page is gone, but how do you differentiate that from a typo or other error?
What I would envision is a table indexed by the path name - i.e., /path/to/my/file.html - with a status of some sort. The server takes in a request for the full path, does a lookup in the database, and either serves the page (assuming the page is "active" or "available") or returns a 410 if you know the page is not active. If the page can't be found in the database, it returns a 404.
The two issues I see with this approach are:
The initial population of the database. If you've already removed the pages from S3, how will you know which pages to add with a "not available" flag? I'm not sure how many pages we're talking about, but it could be quite a big job the first time.
Maintenance - you will likely need an administrative interface of some sort down the road for the next time you need to deactivate some number of pages.
There are content management systems that will do some of this for you, or it wouldn't be too bad to write a simple server to do this, pending the issues I've outlined.
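A minimal sketch of such a lookup server (Node.js/Express, entirely hypothetical; in practice the table would live in a real database and the "serve the page" branch would fetch the object from S3):

const express = require('express');
const app = express();

// Path -> status table; this stands in for the database described above.
const pages = {
    '/path/to/my/file.html': 'active',
    '/old/page.html': 'gone',
};

app.use((req, res) => {
    const status = pages[req.path];
    if (status === 'active') {
        res.send('page content');   // in reality, proxy the object from S3
    } else if (status === 'gone') {
        res.sendStatus(410);        // page was deliberately removed
    } else {
        res.sendStatus(404);        // unknown path: typo or bad link
    }
});

app.listen(3000);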

How to add basic logic in AWS S3 or CloudFront?

Let's say I have two files: one for Safari and one for Firefox.
I want to check the User-Agent and return a file based on the User-Agent.
How do I do this without adding an external server?
You can't do this without adding an extra server.
S3 supports static content. It does not¹ vary its response based on request headers.
CloudFront relies on the origin server if content needs to vary based on request headers. Note that by default, CloudFront doesn't forward most headers to the origin, but this can be changed in the cache behavior configuration. If you forward the User-Agent header to the origin, your cache hit rate drops dramatically, since CloudFront has no choice but to assume any and every change in the user agent string could trigger a change in the response, so an object in the cache that was requested by a specific user agent string will only be served to a future browser with an identical user agent string. It will cache each different copy, but this still hurts your hit rate. If you only want to know the general type of browser, CloudFront can inject special headers to tell the origin whether the user agent is desktop, smart-tv, mobile, or tablet, without actually forwarding the user agent string and causing the same negative impact on the cache hit ratio.
So CloudFront will correctly cache the appropriate version of a page for each unique user agent... but the origin server must implement the actual content selection logic. And when the origin is S3, that isn't supported -- unless you have a server between CloudFront and S3. This is a perfectly valid configuration -- I have such a setup, with a server that rewrites the request path received from CloudFront before sending the request to S3, then returns the content from S3 back to CloudFront, which returns the content to the browser.
AWS Lambda would be a potential candidate for an application like this, acting as the necessary server (a serverless server, if you will) between CloudFront and S3... but it does not yet support binary data, so for anything other than text, that isn't an option, either.
¹At least, not in any sense that is relevant, here. Exceptions exist for CORS and when access is granted or denied based on a limited subset of request headers.
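For illustration, here is a minimal sketch of the kind of origin-side selection logic described above (Node.js/Express, hypothetical), using the device-type header that CloudFront can inject instead of forwarding the full User-Agent string:

const express = require('express');
const path = require('path');
const app = express();

app.get('/', (req, res) => {
    // CloudFront adds this header only if it is whitelisted in the cache
    // behavior; the cache then varies on it rather than on the full User-Agent.
    const isMobile = req.header('CloudFront-Is-Mobile-Viewer') === 'true';
    res.sendFile(path.join(__dirname, isMobile ? 'mobile.html' : 'desktop.html'));
});

app.listen(3000);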

Is it possible to set Content-Security-Policy headers in Amazon S3?

I'm trying to set a Content-Security-Policy header for an HTML file I'm serving via S3/CloudFront. I'm using the web-based AWS console. Whenever I try to add the header, it doesn't seem to be respected. What can I do to make sure this header is served?
I'm having the same problem (using S3/CloudFront) and it appears there is currently no way to set this up easily.
S3 has a whitelist of the headers permitted, and Content-Security-Policy is not on it. Whilst it is true you can use the prefixed x-amz-meta-Content-Security-Policy, this is unhelpful as there is no browser support for it.
There are two options I can see.
1) You can serve the HTML content from a web server on an EC2 instance and set that up as another CloudFront origin. Not really a great solution.
2) Include the CSP as a meta tag within your HTML document:
<!doctype html>
<html>
<head>
<meta http-equiv="Content-Security-Policy" content="default-src http://*.foobar.com 'self'">
...
This option is not as widely supported by browsers, but it appears to work with both WebKit and Firefox, so current Chrome, Firefox, and Safari (including iOS 7 Safari) seem to support it.
I chose option 2, as it was the simpler/cheaper/faster solution, and I hope AWS will add support for the CSP header in the future.
CloudFront forwards whatever headers the origin (S3) sets to the client, but S3 won't let you set custom response headers directly.
You can use a Lambda@Edge function to inject security headers through CloudFront.
Here is how the process works (reference: AWS blog):
1. Viewer navigates to the website.
2. Before CloudFront serves content from the cache, it will trigger any Lambda function associated with the Viewer Request trigger for that behavior.
3. CloudFront serves content from the cache if available; otherwise it goes to step 4.
4. Only after a CloudFront cache 'Miss' is the Origin Request trigger fired for that behavior.
5. The S3 origin returns content.
6. After content is returned from S3, but before being cached in CloudFront, the Origin Response trigger is fired.
7. After content is cached in CloudFront, the Viewer Response trigger is fired; this is the final step before the viewer receives content.
8. The viewer receives content.
Below is the blog post from AWS on how to do this step by step:
https://aws.amazon.com/blogs/networking-and-content-delivery/adding-http-security-headers-using-lambdaedge-and-amazon-cloudfront/
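For reference, a minimal sketch of the kind of origin-response Lambda@Edge handler that blog post describes (Node.js; the header values are placeholders to adapt to your own site):

'use strict';

// Origin-response Lambda@Edge handler: add security headers before the
// object is cached by CloudFront.
exports.handler = (event, context, callback) => {
    const response = event.Records[0].cf.response;
    const headers = response.headers;

    // Keys must be lowercase in the CloudFront event structure.
    headers['content-security-policy'] = [{
        key: 'Content-Security-Policy',
        value: "default-src 'self'",
    }];
    headers['strict-transport-security'] = [{
        key: 'Strict-Transport-Security',
        value: 'max-age=63072000; includeSubDomains; preload',
    }];
    headers['x-content-type-options'] = [{
        key: 'X-Content-Type-Options',
        value: 'nosniff',
    }];

    callback(null, response);
};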
If you are testing through CloudFront, have you made sure you have invalidated the cached objects? Can you try to upload a completely new file and then try accessing it via CF and see if the header is still not there?
Update
It seems that custom metadata will not work as expected, per the documentation. Any metadata other than the headers supported by S3 (the ones displayed in the dropdown) has to be prefixed with x-amz-meta-.