My question came out when I experienced two different behaviors in object URL from json files stored in a s3 bucket.
Consider a json file: mydata.json
If I upload this file using s3 dashboard from AWS website, I am able to see data in browser: //s3-us-west-2.amazonaws.com/bucket/folder/mydata.json. I am also able to read this data from a different application if I create a specific configuration in s3 bucket.
For the other hand, if I use boto3 library for python and upload the same file in the same bucket (making file public in the process), when I click object URL it downloads the file, but it doesn't open data in browser.
This is the code I used:
# upload json file
bucket.upload_file(path, jsonkey)
object_acl = s3.ObjectAcl('bucket_name', jsonkey)
bucket_response = object_acl.put(ACL='public-read')
I explored file properties such as metadata. When I upload file via dashboard, the metadata assigned is Content-Type: application/json, and via boto3 is Content-Type: binary/octet-stream. I don't really know if metadata affects the object URL behavior.
In this context, how can I properly configure files in json format to be downloaded or to be read? I mean, what is the main configuration that affects object URL behavior?
I couldn't find a significant difference between both methods (dashboard and boto3) in properties or permissions, besides Content-Type in metadata. However, when I tried to change Content-Type, behavior was the same.
Any other information I can provide to clarify this question, be free to ask. Thanks in advance.
The documentation for the S3 bucket resource's upload_file() method is not ideal as it simply refers you to the equivalent S3Transfer docs for how extra arguments can be used.
Try the following:
bucket.upload_file(path, jsonkey, ExtraArgs={'ContentType': "application/json"})
I will be hosting a static web site on S3. The problem is that the web engine behind S3-as-a-web-server does not transform http://example.com/hello/ into http://example.com/hello/index.html.
When configuring the web site, there is a provision for the root document (the one which will be displayed when calling http://example.com), but not any deeper URLS (such as my example).
Is it possible to use the redirect rules to achieve that?
I actually have a solution for this problem, but is is really convoluted:
host the web site on an S3 bucket
deploy a CloudFront instance which origins in that bucket
use a Lambda#Edge which will rewrite the call once it hits CloudFront
I hope there is something more straightforward (I have hope in the redirect rules, though "redirect" suggests that something was already attained, which is not the case in my problem as S3 does not seem to understand what http://example.com/hello/ is.
When you specify the default index file and wants to serve index.html in a subpath,
You need to have the index.html in every level.
The documentation for S3 specifies the following
If you create such a folder structure in your bucket, you must have an
index document at each level. When a user specifies a URL that
resembles a folder lookup, the presence or absence of a trailing slash
determines the behavior of the website. For example, the following
URL, with a trailing slash, returns the photos/index.html index
document.
http://example-bucket.s3-website-region.amazonaws.com/photos/ However,
if you exclude the trailing slash from the preceding URL, Amazon S3
first looks for an object photos in the bucket. If the photos object
is not found, then it searches for an index document,
photos/index.html. If that document is found, Amazon S3 returns a 302
Found message and points to the photos/ key. For subsequent requests
to photos/, Amazon S3 returns photos/index.html.
Alternatively, If you want ALL paths to server index.html, this thread might be useful
We are using an onPrem S3 compatible storage server in an intranet network and we want to expose this intranet url to internet so we used a ReverseProxy with a mapping to the intranet url. When we test the intranet url it works perfectly but when we test the internet url we get the 403 error:
The request signature we calculated does not match the signature you provided. Check your Secret Access Key and signing method. For more information, see REST Authentication and SOAP Authentication for details. (Service: Amazon S3; Status Code: 403; Error Code: SignatureDoesNotMatch; Request ID: 0a440c7f:15cc604b1e2:12d3af:24d; S3 Extended Request ID: null), S3 Extended Request ID: null
After debugging, we found that the proxy modifies the host header used to calculate the signature in order to redirect the request to the intranet url...
So my question is how to supress some headers from the V4 signature calculation using AWS SDK or Boto3 client. or is there a better architecture to expose an onPrem S3 service.
Thanks in advance.
Amir.
There are essentially two solutions to this.
The first one is easier: sign the request for the internal URL, then just use simple string prefix replacement to rewrite the host part of the signed URL to point it to the hostname of the external proxy. When the proxy rewrites the Host header, it will end up rewriting it back to exactly what you signed.
It is, I assume, common knowledge that signed URLs are immune to tampering, for all practical purposes: you can't change anything about a signed URL without invalidating it... but that's not what this is. The change is temporary, and the proxy's net effect is to undo the change.
The alternate solution requires the proxy or another service in the chain (before the storage service) to know the signing keys and secrets, so that it can first validate the incoming request, and if valid, modify the request and then generate a new signature that the service will accept. I once wrote a service to do this so that when a request was for HEAD, the proxy would use the same key and secret (which it knew) to generate a signature for the same request, but with GET. If it matched the signature in the incoming request, the proxy would replace the existing signature with a signature for a HEAD request -- thus allowing the client to use a URL originally signed for a GET request, to make either a GET or a HEAD request -- something S3 does not natively support, since a GET and a HEAD for the same object require two different signed URLs. The concept is the same, though -- generate a signature in the proxy for what the client is requesting, to validate the incoming signature, and then re-sign the request as needed. The solution I built used HAProxy's Lua integration to examine and modify the request in flight.
As in the title, I can't seem to get it to work, i'm following the high level guide detailed here but any images uploaded seem to be blank.
What i've set up:
/images/{object} - PUT
> Integration Request
AWS Region: ap-southeast-2
AWS Service: S3
AWS Subdomain [bucket name here]
HTTP method: PUT
Path override: /{object}
Execution Role [I have one set up]
> URL Path Paramaters
object -> method.request.path.object
I'm trying to use Postman to send a PUT request with Content-Type: image/png and the body is a binary upload of a png file.
I've also tried using curl:
curl -X PUT -H "Authorization: Bearer [token]" -H "Content-Type: image/gif" --upload-file ~/Pictures/bart.gif https://[api-url]/dev/images/cool.gif
It creates the file on the server and the size seems to be double what ever was uploaded, when viewed I just get "image has an error".
When I try with .txt files (content-type: text/plain) it seems to work though.
Any ideas?
After reading alot and chatting to AWS technical support, the problem seems to be that you can't do binary uploads through API Gateway as anything that passes through automatically goes through a UTF-8 encode.
There are a few workarounds for this I can think of, my solution will be to base64 the files before upload and trigger a lambda when they hit the bucket to decode them
This is a old post, but I got a solution.
AWS now support binary upload through APIGateway READ.
In general, go to your API settings, and add a Binary Media type.
After that, you can handle the file in base64
How do you set a default root object for subdirectories on a statically hosted website on Cloudfront? Specifically, I'd like www.example.com/subdir/index.html to be served whenever the user asks for www.example.com/subdir. Note, this is for delivering a static website held in an S3 bucket. In addition, I would like to use an origin access identity to restrict access to the S3 bucket to only Cloudfront.
Now, I am aware that Cloudfront works differently than S3 and amazon states specifically:
The behavior of CloudFront default root objects is different from the
behavior of Amazon S3 index documents. When you configure an Amazon S3
bucket as a website and specify the index document, Amazon S3 returns
the index document even if a user requests a subdirectory in the
bucket. (A copy of the index document must appear in every
subdirectory.) For more information about configuring Amazon S3
buckets as websites and about index documents, see the Hosting
Websites on Amazon S3 chapter in the Amazon Simple Storage Service
Developer Guide.
As such, even though Cloudfront allows us to specify a default root object, this only works for www.example.com and not for www.example.com/subdir. In order to get around this difficulty, we can change the origin domain name to point to the website endpoint given by S3. This works great and allows the root objects to be specified uniformly. Unfortunately, this doesn't appear to be compatable with origin access identities. Specifically, the above links states:
Change to edit mode:
Web distributions β Click the Origins tab, click the origin that you want to edit, and click Edit. You can only create an origin access
identity for origins for which Origin Type is S3 Origin.
Basically, in order to set the correct default root object, we use the S3 website endpoint and not the website bucket itself. This is not compatible with using origin access identity. As such, my questions boils down to either
Is it possible to specify a default root object for all subdirectories for a statically hosted website on Cloudfront?
Is it possible to setup an origin access identity for content served from Cloudfront where the origin is an S3 website endpoint and not an S3 bucket?
There IS a way to do this. Instead of pointing it to your bucket by selecting it in the dropdown (www.example.com.s3.amazonaws.com), point it to the static domain of your bucket (eg. www.example.com.s3-website-us-west-2.amazonaws.com):
Thanks to This AWS Forum thread
(New Feature May 2021) CloudFront Function
Create a simple JavaScript function below
function handler(event) {
var request = event.request;
var uri = request.uri;
// Check whether the URI is missing a file name.
if (uri.endsWith('/')) {
request.uri += 'index.html';
}
// Check whether the URI is missing a file extension.
else if (!uri.includes('.')) {
request.uri += '/index.html';
}
return request;
}
Read here for more info
Activating S3 hosting means you have to open the bucket to the world. In my case, I needed to keep the bucket private and use the origin access identity functionality to restrict access to Cloudfront only. Like #Juissi suggested, a Lambda function can fix the redirects:
'use strict';
/**
* Redirects URLs to default document. Examples:
*
* /blog -> /blog/index.html
* /blog/july/ -> /blog/july/index.html
* /blog/header.png -> /blog/header.png
*
*/
let defaultDocument = 'index.html';
exports.handler = (event, context, callback) => {
const request = event.Records[0].cf.request;
if(request.uri != "/") {
let paths = request.uri.split('/');
let lastPath = paths[paths.length - 1];
let isFile = lastPath.split('.').length > 1;
if(!isFile) {
if(lastPath != "") {
request.uri += "/";
}
request.uri += defaultDocument;
}
console.log(request.uri);
}
callback(null, request);
};
After you publish your function, go to your cloudfront distribution in the AWS console. Go to Behaviors, then chooseOrigin Request under Lambda Function Associations, and finally paste the ARN to your new function.
I totally agree that it's a ridiculous problem! The fact that CloudFront knows about serving index.html as Default Root Object AND STILL they say it doesn't work for subdirectories (source) is totally strange!
The behavior of CloudFront default root objects is different from the behavior of Amazon S3 index documents. When you configure an Amazon S3 bucket as a website and specify the index document, Amazon S3 returns the index document even if a user requests a subdirectory in the bucket.
I, personally, believe that AWS has made it this way so CloudFront becomes a CDN only (loading assets, with no logic in it whatsoever) and every request to a path in your website should be served from a "Server" (e.g. EC2 Node/Php server, or a Lambda function.)
Whether this limitation exists to enhance security, or keep things apart (i.e. logic and storage separated), or make more money (to enforce people to have a dedicated server, even for static content) is up to debate.
Anyhow, I'm summarizing the possible solutions workarounds here, with their pros and cons.
1) S3 can be Public - Use Custom Origin.
It's the easiest one, originally posted by #JBaczuk answer as well as in this github gist. Since S3 already supports serving index.html in subdirectories via Static Website Hosting, all you need to do is:
Go to S3, enable Static Website Hosting
Grab the URL in the form of http://<bucket-name>.s3-website-us-west-2.amazonaws.com
Create a new Origin in CloudFront and enter this as a Custom Origin (and NOT S3 ORIGIN), so CloudFront treats this as an external website when getting the content.
Pros:
Very easy to set up.
It supports /about/, /about, and /about/index.html and redirect the last two to the first one, properly.
Cons:
If your files in the S3 bucket are not in the root of S3 (say in /artifacts/* then going to www.domain.com/about (without the trailing /) will redirect you to www.domain.com/artifacts/about which is something you don't want at all! Basically the /about to /about/ redirect in S3 breaks if you serve from CloudFront and the path to files (from the root) don't match.
Security and Functionality: You cannot make S3 Private. It's because CloudFront's Origin Access Identity is not going to be supported, clearly, because CloudFront is instructed to take this Origin as a random website. It means that users can potentially get the files from S3 directly, which might not be what you ever what due to security/WAF concerns, as well as the website actually working if you have JS/html that relies on the path being your domain only.
[maybe an issue] The communication between CloudFront and S3 is not the way it's recommended to optimize stuff.
[maybe?] someone has complained that it doesn't work smoothly for more than one Origin in the Distribution (i.e. wanting /blog to go somewhere)
[maybe?] someone has complained that it doesn't preserve the original query params as expected.
2) Official solution - Use a Lambda Function.
It's the official solution (though the doc is from 2017). There is a ready-to-launch 3rd-party Application (JavaScript source in github) and example Python Lambda function (this answer) for it, too.
Technically, by doing this, you create a mini-server (they call it serverless!) that only serves CloudFront's Origin Requests to S3 (so, it basically sits between CloudFront and S3.)
Pros:
Hey, it's the official solution, so probably lasts longer and is the most optimized one.
You can customize the Lambda Function if you want and have control over it. You can support further redirect in it.
If implemented correctly, (like the 3rd party JS one, and I don't think the official one) it supports /about/ and /about both (with a redirect from the latter without trailing / to the former).
Cons:
It's one more thing to set up.
It's one more thing to have an eye, so it doesn't break.
It's one more thing to check when something breaks.
It's one more thing to maintain -- e.g. the third-party one here has open PRs since Jan 2021 (it's April 2021 now.)
The 3rd party JS solution doesn't preserve the query params. So /about?foo=bar is 301 redirected to /about/ and NOT /about/?foo=bar. You need to make changes to that lambda function to make it work.
The 3rd party JS solution keeps /about/ as the canonical version. If you want /about to be the canonical version (i.e. other formats get redirected to it via 301), you have to make changes to the script.
[minor] It only works in us-east-1 (open issue in Github since 2020, still open and an actual problem in April 2021).
[minor] It has its own cost, although given CloudFront's caching, shouldn't be significant.
3) Create fake "Folder File"s in S3 - Use a manual Script.
It's a solution between the first two -- It supports OAI (private S3) and it doesn't require a server. It's a bit nasty though!
What you do here is, you run a script that for each subdirectory of /about/index.html it creates an object in S3 named (has key of) /about and copy that HTML file (the content and the content-type) into this object.
Example scripts can be found in this Reddit answer and this answer using AWS CLI.
Pros:
Secure: Supports S3 Private and CloudFront OAI.
No additional live piece: The script runs pre-upload to S3 (or one-time) and then the system remains intact with the two pieces of S3 and CF only.
Cons:
[Needs Confirmation] It supports /about but not /about/ with trailing / I believe.
Technically you have two different files being stored. Might look confusing and make your deploys expensive if there are tons of HTML files.
Your script has to manually find all the subdirectories and create a dummy object out of them in S3. That has the potential to break in the future.
PS. Other Tricks)
Dirty trick using Javascript on Custom Error
While it doesn't look like a real thing, this answer deserves some credit, IMO!
You let the Access Denied (404s turning into 403) go through, then catch them, and manually, via a JS, redirect them to the right place.
Pros
Again, easy to set up.
Cons
It relies on JavaScript in Client-Side.
It messes up with SEO -- especially if the crawler doesn't run JS.
It messes up with the user's browser history. (i.e. back button) and possibly could be improved (and get more complicated!) via HTML5 history.replace.
There is an "official" guide published on AWS blog that recommends setting up a Lambda#Edge function triggered by your CloudFront distribution:
Of course, it is a bad user experience to expect users to always type index.html at the end of every URL (or even know that it should be there). Until now, there has not been an easy way to provide these simpler URLs (equivalent to the DirectoryIndex Directive in an Apache Web Server configuration) to users through CloudFront. Not if you still want to be able to restrict access to the S3 origin using an OAI. However, with the release of Lambda#Edge, you can use a JavaScript function running on the CloudFront edge nodes to look for these patterns and request the appropriate object key from the S3 origin.
Solution
In this example, you use the compute power at the CloudFront edge to inspect the request as itβs coming in from the client. Then re-write the request so that CloudFront requests a default index object (index.html in this case) for any request URI that ends in β/β.
When a request is made against a web server, the client specifies the object to obtain in the request. You can use this URI and apply a regular expression to it so that these URIs get resolved to a default index object before CloudFront requests the object from the origin. Use the following code:
'use strict';
exports.handler = (event, context, callback) => {
// Extract the request from the CloudFront event that is sent to Lambda#Edge
var request = event.Records[0].cf.request;
// Extract the URI from the request
var olduri = request.uri;
// Match any '/' that occurs at the end of a URI. Replace it with a default index
var newuri = olduri.replace(/\/$/, '\/index.html');
// Log the URI as received by CloudFront and the new URI to be used to fetch from origin
console.log("Old URI: " + olduri);
console.log("New URI: " + newuri);
// Replace the received URI with the URI that includes the index page
request.uri = newuri;
// Return to CloudFront
return callback(null, request);
};
Follow the guide linked above to see all steps required to set this up, including S3 bucket, CloudFront distribution and Lambda#Edge function creation.
There is one other way to get a default file served in a subdirectory, like example.com/subdir/. You can actually (programatically) store a file with the key subdir/ in the bucket. This file will not show up in the S3 management console, but it actually exists, and CloudFront will serve it.
Johan Gorter and Jeremie indicated index.html can be stored as an object with key subdir/.
I validated this approach works and an alternative easy way to do this with awscli's s3api copy-object
aws s3api copy-object --copy-source bucket_name/subdir/index.html --key subdir/ --bucket bucket_name
Workaround for the issue is to utilize lambda#edge for rewriting the requests. One just needs to setup the lambda for the CloudFront distribution's viewer request event and to rewrite everything that ends with '/' AND is not equal to '/' with default root document e.g. index.html.
UPDATE: It looks like I was incorrect! See JBaczuk's answer, which should be the accepted answer on this thread.
Unfortunately, the answer to both your questions is no.
1. Is it possible to specify a default root object for all subdirectories for a statically hosted website on Cloudfront?
No. As stated in the AWS CloudFront docs...
... If you define a default root object, an end-user request for a subdirectory of your distribution does not return the default root object. For example, suppose index.html is your default root object and that CloudFront receives an end-user request for the install directory under your CloudFront distribution:
http://d111111abcdef8.cloudfront.net/install/
CloudFront will not return the default root object even if a copy of index.html appears in the install directory.
...
The behavior of CloudFront default root objects is different from the behavior of Amazon S3 index documents. When you configure an Amazon S3 bucket as a website and specify the index document, Amazon S3 returns the index document even if a user requests a subdirectory in the bucket. (A copy of the index document must appear in every subdirectory.)
2. Is it possible to setup an origin access identity for content served from Cloudfront where the origin is an S3 website endpoint and not an S3 bucket?
Not directly. Your options for origins with CloudFront are S3 buckets or your own server.
It's that second option that does open up some interesting possibilities, though. This probably defeats the purpose of what you're trying to do, but you could setup your own server whose sole job is to be a CloudFront origin server.
When a request comes in for http://d111111abcdef8.cloudfront.net/install/, CloudFront will forward this request to your origin server, asking for /install. You can configure your origin server however you want, including to serve index.html in this case.
Or you could write a little web app that just takes this call and gets it directly from S3 anyway.
But I realize that setting up your own server and worrying about scaling it may defeat the purpose of what you're trying to do in the first place.
Another alternative to using lambda#edge is to use CloudFront's error pages. Set up a Custom Error Response to send all 403's to a specific file. Then add javascript to that file to append index.html to urls that end in a /. Sample code:
if ((window.location.href.endsWith("/") && !window.location.href.endsWith(".com/"))) {
window.location.href = window.location.href + "index.html";
}
else {
document.write("<Your 403 error message here>");
}
One can use newly released cloudfront functions and here is sample code.
Note: If you are using static website hosting, then you do not need any function!
I know this is an old question, but I just struggled through this myself. Ultimately my goal was less to set a default file in a directory, and more to have the the end result of a file that was served without .html at the end of it
I ended up removing .html from the filename and programatically/manually set the mime type to text/html. It is not the traditional way, but it does seem to work, and satisfies my requirements for the pretty urls without sacrificing the benefits of cloudformation. Setting the mime type is annoying, but a small price to pay for the benefits in my opinion
#johan-gorter indicated above that CloudFront serves file with keys ending by /
After investigation, it appears that this option works, and that one can create this type of files in S3 programatically. Therefore, I wrote a small lambda that is triggered when a file is created on S3, with a suffix index.html or index.htm
What it does is copying an object dir/subdir/index.html into an object dir/subdir/
import json
import boto3
s3_client = boto3.client("s3")
def lambda_handler(event, context):
for f in event['Records']:
bucket_name = f['s3']['bucket']['name']
key_name = f['s3']['object']['key']
source_object = {'Bucket': bucket_name, 'Key': key_name}
file_key_name = False
if key_name[-10:].lower() == "index.html" and key_name.lower() != "index.html":
file_key_name = key_name[0:-10]
elif key_name[-9:].lower() == "index.htm" and key_name.lower() != "index.htm":
file_key_name = key_name[0:-9]
if file_key_name:
s3_client.copy_object(CopySource=source_object, Bucket=bucket_name, Key=file_key_name)