S3 Static Website: Return HTTP 410

Background
I have a static website on S3 with tens of thousands of HTML pages indexed on Google. I'm moving to a new version and I want to remove the old pages (which may no longer exist) from Google's index. I've read online that the most efficient way to do that is to return HTTP 410 (Gone).
Problem
According to http://docs.aws.amazon.com/AmazonS3/latest/dev/CustomErrorDocSupport.html, you cannot return an HTTP 410 when using S3 static website hosting.
API Gateway
I created a mock integration in API Gateway that returns HTTP 410. Then I configured my S3 bucket to automatically redirect a specific prefix to this URL. However, the status code the client sees is HTTP 301 (for the first redirect). If I GET the API endpoint directly, I receive the 410 successfully; if I access the API through an S3 GET, the visible error code is 301.
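For concreteness, the redirect rule I set up looks roughly like this (a boto3 sketch; the bucket name, prefix, and API hostname are placeholders, not my real values):

    import boto3

    # Redirect every key under old-pages/ to the API Gateway endpoint
    # that returns the 410. Note that the S3 website redirect itself is
    # what produces the 301 described above.
    s3 = boto3.client("s3")
    s3.put_bucket_website(
        Bucket="my-static-site-bucket",
        WebsiteConfiguration={
            "IndexDocument": {"Suffix": "index.html"},
            "RoutingRules": [
                {
                    "Condition": {"KeyPrefixEquals": "old-pages/"},
                    "Redirect": {
                        "HostName": "abc123.execute-api.us-east-1.amazonaws.com",
                        "Protocol": "https",
                        "ReplaceKeyPrefixWith": "prod/gone/",
                        "HttpRedirectCode": "301",
                    },
                }
            ],
        },
    )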
What's next
If anyone has an idea on how to return HTTP 410 on a static website hosted on S3, let me know.
Additionally, if you can think of a better alternative to de-index old pages on Google (the manual tool isn't a solution, as I have a large number of pages), let me know :)

I really feel that a better answer would be to put a server in front of the S3 content with a very simple database table. Your real issue is distinguishing a 410 from a 404. That is, you know a page is gone, but how do you differentiate that from a typo or other error?
What I would envision is a table indexed by the path name - i.e., /path/to/my/file.html - plus a status of some sort. The server takes in a request for the full path, does a lookup in the database, and either serves the page (assuming the page is "active" or "available") or returns a 410 if you know the page is not active. If the page can't be found in the database, return a 404.
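A minimal sketch of what I have in mind, using Flask and SQLite purely for illustration (the table layout and status values are assumptions):

    import sqlite3
    from flask import Flask, abort, send_file

    app = Flask(__name__)
    DB = "pages.db"  # assumed schema: pages(path TEXT PRIMARY KEY, status TEXT)

    @app.route("/<path:path>")
    def serve(path):
        row = sqlite3.connect(DB).execute(
            "SELECT status FROM pages WHERE path = ?", ("/" + path,)
        ).fetchone()
        if row is None:
            abort(404)   # unknown path: likely a typo or bad link
        if row[0] != "active":
            abort(410)   # known page that has been deliberately removed
        # No path sanitization here; this just serves a local copy of the site.
        return send_file("site/" + path)

    if __name__ == "__main__":
        app.run()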
The two issues I see with this approach are:
The initial population of the database. If you've already removed the pages from S3, then how will you know which paths to insert with a "not available" flag? I'm not sure how many pages we're talking about, but it could be quite a big job the first time.
Maintenance - you will likely need an administrative interface of some sort down the road for the next time you need to deactivate some number of pages.
There are content management systems that will do some of this for you, or it wouldn't be too bad to write a simple server to do this, subject to the issues I've outlined.

Related

AWS CloudFront + Lambda@Edge "The JSON output is not parsable"

I have a Lambda function (a packaged Next.js app) which I'm trying to access via CloudFront. The web app works unless I try to hit the homepage.
When I hit /search or /video/{videoId} the page loads just fine.
When I try to hit the homepage, I get the following error page:
502 ERROR
The request could not be satisfied.
The Lambda function returned invalid JSON: The JSON output is not parsable. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
Generated by cloudfront (CloudFront)
Request ID: {id}
Why would just the homepage be invalid JSON? Where can I see this JSON to determine what is wrong? I created a mock CloudFront request test in the Lambda function, and it just returns successfully.
The problem was due to the 1 MB size limit on CloudFront Lambda@Edge responses. I didn't realize that Next.js's server-side rendering was creating a large <script id="__NEXT_DATA__"> tag on my homepage, with all the fetched info from my API duplicated multiple times over. This resulted in my app's homepage being >2 MB.
I refactored my app to only send one network request, and made sure that data is only put into the __NEXT_DATA__ tag once. The app now works.
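If you want to check this on your own pages, here is a rough diagnostic sketch (the URL is a placeholder):

    import re
    import urllib.request

    # Fetch the rendered page and report how much of it is the
    # __NEXT_DATA__ payload.
    html = urllib.request.urlopen("https://example.com/").read().decode("utf-8")
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>', html, re.DOTALL
    )
    print("page size: %d bytes" % len(html))
    if match:
        print("__NEXT_DATA__ size: %d bytes" % len(match.group(1)))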

Understanding Server/Client Routing: How Can Amazon(?) Be Redirecting My SPA ... Without a Redirect (or History Entry)?

NOTE: I'm providing details of my setup, but really this is a "how is this possible" question, not a "please debug my setup" question.
I have a "singe page application" (ie. an HTML file that uses the History API to simulate URLs). I'm serving this app on AWS S3, behind an AWS Cloudfront ... front.
I had successfully configured things so that if someone went to www.example.com/foo (let's pretend I own example.com), Cloudfront would serve an "error page" of my index.html. My index.html would then see the URL, and use its routing to show the user the correct page.
That all worked great ... until it didn't. Now for some reason when I go to www.example.com/foo, I get redirected to www.example.com. I'm trying to debug things, but what I can't understand is how I'm going from /foo to the main page.
When I look in the Network panel of my developer tools, I can see the request made to the original (/foo). Then I can see the chain of requests (for images, css files, etc.), and they all have a referrer of www.example.com/foo.
Then all of a sudden I see a request for React Developer Tools (why it needs to make a request is beyond me) ... and it's from referrer www.example.com. After that I get one last image request from /foo, and then all subsequent requests come from www.example.com.
Can anyone explain how this could be happening? I know that if a server returns a redirect (of either type) that could change my URL ... but every request has a 200 status (i.e. no server redirects).
I know JavaScript could "push" a new URL to my browser ... but that would leave a history entry, right? When I go "back" (either with my browser or history.back()) I go to the page before; I don't go "back" to /foo.
So somehow I'm not making a history entry, but I am switching my URL, and the URL I make requests from, and this all happens within milliseconds on page load ... without any redirects. How?
P.S. When I use my dev tools to add a beforeunload breakpoint, then try to navigate from example.com to example.com/foo, I don't hit that breakpoint (either when going to /foo, or when I'm "redirected" back to example.com).
When I check the box for any Load event, I do see some happen ... after my URL has already switched. In other words, I type example.com/foo, hit enter, and by the time any event fires I'm back on example.com. Whatever mechanism is doing the "redirection" here ... it doesn't trigger any load events.
I figured out my (AWS-specific) problem, thanks to a bit of Gatsby documentation. I'll include the details below in case it helps others, but I won't accept this answer, as I still don't understand how AWS did what it did (and I'd still welcome an answer for that).
What happened was that I had my CloudFront "Origin Domain Name and Path" pointing to:
example.com.s3.amazonaws.com
However, as explained on https://www.gatsbyjs.com/docs/deploying-to-s3-cloudfront/:
There are two ways that you can connect CloudFront to an S3 origin. The most obvious way, which the AWS Console will suggest, is to type the bucket name in the Origin Domain Name field. This sets up an S3 origin, and allows you to configure CloudFront to use IAM to access your bucket. Unfortunately, it also makes it impossible to perform serverside (301/302) redirects, and it also means that directory indexes (having index.html be served when someone tries to access a directory) will only work in the root directory. You might not initially notice these issues, because Gatsby’s clientside JavaScript compensates for the latter and plugins such as gatsby-plugin-meta-redirect can compensate for the former. But just because you can’t see these issues, doesn’t mean they won’t affect search engines.
In order for all the features of your site to work correctly, you must instead use your S3 bucket’s Static Website Hosting Endpoint as the CloudFront origin. This does (sadly) mean that your bucket will have to be configured for public-read, because when CloudFront is using an S3 Static Website Hosting Endpoint address as the Origin, it’s incapable of authenticating via IAM.
Once I changed my CloudFront "Origin Domain Name and Path" to the bucket's static website hosting URL:
http://example.com.s3-website-us-west-1.amazonaws.com
Everything worked!
But again, I still don't understand how AWS did what it did when I mis-set my "Origin Domain Name and Path". It redirected me to my root domain, seemingly without either a redirect response OR a client-side redirect, and I'd love to hear how that was accomplished.
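For anyone who wants to check which kind of origin their distribution is actually using, here is a rough boto3 sketch (the distribution ID is a placeholder):

    import boto3

    cf = boto3.client("cloudfront")
    config = cf.get_distribution_config(Id="E1234567890ABC")["DistributionConfig"]
    for origin in config["Origins"]["Items"]:
        # REST origins (bucket-name.s3.amazonaws.com) carry an S3OriginConfig;
        # static website hosting endpoints show up as custom origins.
        kind = "S3 REST origin" if "S3OriginConfig" in origin else "custom origin"
        print(origin["DomainName"], "->", kind)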

Static content on CloudFront is cached incorrectly over time

I have set up a CloudFront distribution on top of multiple S3 buckets (in different regions) to provide a fast, stable version of my webapp. The webapp is implemented with React, which means it's all one single HTML file and one single JavaScript file.
Using React's routing mechanism, all the paths in the URL are handled within the code. This means that if I click on a link like www.example.com/users, no request is sent to the server. Instead, the client code renders the appropriate page without any consultation with the server (I'm just talking about the HTML and not considering the data). This means that if some user types in the given URL, the server should return index.html (the only HTML file I have), which will then take care of the URL on the client side. In other words, all requests sent to the server should return either the HTML file or the JavaScript file I mentioned earlier - even requests pointing to non-existent files.
In order to implement this requirement, I asked this question and I got an answer like this:
I need to set up an error page for my distribution on CloudFront and redirect all the 403 (Forbidden) requests to the /index.html file. This is because when the request is pointing to a nonexistent file on S3, S3 will return 403 to CloudFront due to the lack of listing permission. Or I can grant the listing permission and instead handle the 404 error (I didn't test this latter option).
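For reference, the fragment of the CloudFront distribution config that answer describes might look roughly like this (a sketch; it slots into the config under CustomErrorResponses, and the TTL is an arbitrary choice):

    # Rewrite S3's 403 for missing keys into the SPA shell.
    custom_error_responses = {
        "Quantity": 1,
        "Items": [
            {
                "ErrorCode": 403,                   # S3 returns 403 without list permission
                "ResponsePagePath": "/index.html",  # serve the single HTML file
                "ResponseCode": "200",
                "ErrorCachingMinTTL": 300,
            }
        ],
    }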
Anyways, I set this up and it works perfectly - for a few hours. But then, for some unknown reason, the request for the JavaScript file also returns the HTML file. And of course, everything I get back is actually coming from CloudFront's cache, which means that no matter how many times I send the request, it will keep returning the same value - until I invalidate the cache on CloudFront, which solves the problem for a few more hours. And we go around and around.
Even though I'm not sure why this happens, my guess is that at some point the S3 bucket is inaccessible to CloudFront, which results in CloudFront caching the index.html. What can I do about this?
I think I found the problem:
MAKE SURE YOUR STATIC CONTENT ON ALL THE S3 BUCKETS IS IDENTICAL!!!
In my case, the JavaScript filename is automatically generated by Webpack, which means it's random. And since the different regions were "compiled" separately, their filenames differed.
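A quick way to verify the buckets really are identical is to diff their listings; here is a sketch (bucket names are placeholders, and note that ETags are plain MD5s only for non-multipart uploads):

    import boto3

    s3 = boto3.client("s3")

    def listing(bucket):
        # Map every key in the bucket to its ETag.
        objects = {}
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket):
            for obj in page.get("Contents", []):
                objects[obj["Key"]] = obj["ETag"]
        return objects

    a = listing("my-app-us-east-1")
    b = listing("my-app-eu-west-1")
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print("differs:", key)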

Understanding CORS

I've been reading on the web about CORS, and I wanted to confirm whether my understanding of it is correct.
Mentioned below is a totally fictional scenario.
I'll take an example of a normal website. Say my HTML page has a form with a text field, name. On submitting it, the form data is sent to myPage.php. Now, what happens internally is that the browser sends the request to www.mydomain.com/mydirectory/myPage.php along with the text fields, and the server sees that the request was fired off from the same domain/port/protocol.
(Question 1: How does the server know about all these details? Where does it extract all these details from?)
Nonetheless, since the request originated from the same domain, it serves the PHP script and returns whatever is required of it.
Now, for the sake of argument, let's say I don't want to manually fill in the data in the text field, but instead want to do it programmatically. What I do is create an HTML page with JavaScript and fire off a POST request along with the parameters (i.e. the values of the text field). Now, since my request is not from the same domain as such, the server refuses to service my request, and I get a cross-domain error?
Similarly, I could also have written a Java program that makes use of HttpClient/POST requests and does the same thing.
Question 2: Is this what the problem is?
Now, what CORS provides us is that the server can say "anyone can access myPage.php".
enable-cors.org says that:
For simple CORS requests, the server only needs to add the following header to its response:
Access-Control-Allow-Origin: *
Now, what exactly is the client going to do with this header? The client wanted to make the call to the resources on the server anyway, right? It seems it should be up to the server to configure itself with whether it wants to accept the request or not, and act accordingly.
Question 3: What's the use of sending a header back to the client (who has already made a request to the server)?
And finally, what I don't get is this: say I am building some RESTful services for my Android app. Now, say I have one POST service, www.mydomain.com/rest/services/myPost. I've got my Tomcat server hosting these services on my local machine.
In my Android app, I just call this service and get the result back (if any). Where exactly did I use CORS in this case? Does this fall under a different category of server calls? If yes, then how exactly?
Furthermore, I checked Enable CORS for Tomcat and it says that I can add a filter in the web.xml of my dynamic web project, and then it will start accepting cross-origin calls.
Question 4: Is that what enables the calls from my Android device to my web services?
Thanks
First of all, the cross-domain check is performed by the browser, not the server. When JavaScript makes an XmlHttpRequest to a server other than its origin, and the browser supports CORS, it will initiate a CORS process; otherwise the request will result in an error (unless the user has deliberately reduced browser security).
When the server sees the Origin HTTP header, it will decide whether that origin is in its list of allowed domains. If it is not in the list, the request will fail (i.e. the server will send an error response).
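A minimal sketch of that server-side check, using Python's standard library (the allowlist is an assumption):

    from http.server import BaseHTTPRequestHandler, HTTPServer

    ALLOWED_ORIGINS = {"https://www.mydomain.com"}  # assumed allowlist

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            origin = self.headers.get("Origin")
            self.send_response(200)
            # Echo the origin back only if it is allowed; without this
            # header the browser blocks the cross-origin response.
            if origin in ALLOWED_ORIGINS:
                self.send_header("Access-Control-Allow-Origin", origin)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(b'{"ok": true}')

    HTTPServer(("localhost", 8080), Handler).serve_forever()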
For numbers 3 and 4, I think you should ask separate questions. Otherwise this question will become too broad. And I think it will quickly get closed if you do not remove them.
For an explanation of CORS, please see this answer from programmers: https://softwareengineering.stackexchange.com/a/253043/139479
NOTE: CORS is more of a convention; it does not guarantee security. You can write a malicious browser that disregards the same-origin policy, and it will execute JavaScript fetched from any site. You can also craft HTTP requests with arbitrary Origin headers and get information from any third-party server that implements CORS. CORS only works if you trust your browser.
For question 3, you need to understand the relationship between the two sites and the client's browser. As Krumia alluded to in their answer, it's more of a convention between the three participants in the request.
I recently posted an article which goes into a bit more detail about how CORS handshakes are designed to work.
Well, I am not a security expert, but I hope I can answer this question in one line.
If CORS is enabled, then the server will just ask the browser: "Are you calling this request from [xyz.com]?" If the browser says yes, it will show the result, and if the browser says "no, it is from [abc.com]", it will throw an error.
So CORS depends on the browser, and that's why browsers send a preflight request before the actual request.
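You can watch a preflight happen by sending the same kind of OPTIONS request a browser would; here is a rough sketch with the requests library (the URL and Origin are placeholders, and the server must actually implement CORS for the headers to come back):

    import requests

    resp = requests.options(
        "http://www.mydomain.com/rest/services/myPost",
        headers={
            "Origin": "http://www.anotherdomain.com",
            "Access-Control-Request-Method": "POST",
            "Access-Control-Request-Headers": "Content-Type",
        },
    )
    print(resp.status_code)
    for name in ("Access-Control-Allow-Origin", "Access-Control-Allow-Methods"):
        print(name, "=", resp.headers.get(name))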
In my case I just added
.authorizeRequests().antMatchers(HttpMethod.OPTIONS, "/**").permitAll()
to my WebSecurityConfiguration file, and the issue was resolved.

ASP.NET MVC 3 - RESTful web service for consuming on multiple platforms

I want to expose a RESTful web service for posting and retrieving data; it may be consumed by mobile devices or a web site.
Now, the actual creation of the service isn't a problem. What does seem to be a problem is communicating from a different domain.
I have made a simple example service deployed on the ASP.NET development server, which just exposes a simple POST action for sending a request with JSON content. Then I created a simple web page using jQuery AJAX to send some dummy data over, yet I believe I am getting stung by the same-origin policy.
Is this a common thing, and how do you get around it? Some places have mentioned having a proxy on your own domain that you always send GET requests to, but then you cannot use the service in a RESTful manner...
So is this a common issue with a simple fix? There seem to be plenty of RESTful services out there that allow third parties to use them...
How exactly are you "getting stung by the same-origin policy"? From your description, I don't see how it could be relevant. If yourdomain.com/some-path/defined-request.json returns a certain JSON response, then it will return that response regardless of what is requesting the file, unless you have specifically defined required credentials that are not satisfied.
Here is an example of such a web service. It will return the same JSON object regardless of from where the request is made: http://maps.googleapis.com/maps/api/geocode/json?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true
Unless I am misunderstanding you (in which case you should clarify your actual problem), the same-origin policy doesn't really seem to apply here.
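To see that for yourself, fetch the same URL from a plain script, where no browser (and hence no same-origin policy) is involved. The response may be an error object these days, since the API now expects a key, but you still get JSON back regardless of where the request originates:

    import urllib.request

    # Server-to-server requests are not subject to the same-origin policy.
    url = ("http://maps.googleapis.com/maps/api/geocode/json"
           "?address=1600+Amphitheatre+Parkway,+Mountain+View,+CA&sensor=true")
    print(urllib.request.urlopen(url).read()[:200])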
Update Re: Comment
"I make a simple HTML page and load it as file://myhtmlfilelocation/myhtmlfile.html and try to make an ajax request"
The cause of your problem is that you are using the file:// URL scheme, instead of the http:// protocol scheme. You can find information about this scheme in Section 3.10 of RFC 1738. Here is an excerpt:
The file URL scheme is used to designate files accessible on a particular host computer. This scheme, unlike most other URL schemes, does not designate a resource that is universally accessible over the Internet.
You should be able to resolve your issue by using the http:// scheme instead of the file:// scheme when you make your asynchronous HTTP request.
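For local testing, one easy way to do that is to serve the file from a local HTTP server instead of opening it directly; for example, with Python's standard library (run from the directory containing your HTML file, then browse to http://localhost:8000/myhtmlfile.html):

    # Equivalent to running `python -m http.server 8000` in that directory.
    from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer

    ThreadingHTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()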