Limit Size Of Objects While Uploading To Amazon S3 Using Pre-Signed URL - amazon-web-services

I know of limiting the upload size of an object using this method: http://doc.s3.amazonaws.com/proposals/post.html#Limiting_Uploaded_Content
But I would like to know how it can be done when generating a pre-signed URL with the S3 SDK on the server side as an IAM user.
The SDK documentation for putObject has no such option in its parameters: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#putObject-property
Neither in this:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getSignedUrl-property
Please note: I already know of this answer: AWS S3 Pre-signed URL content-length, and it is NOT what I am looking for.

The V4 signing protocol offers the option to include arbitrary headers in the signature. See:
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
So, if you know the exact Content-Length in advance, you can include it in the signed URL. Based on some experiments with curl, S3 will truncate the file if you send more than the Content-Length header specifies. Here is an example V4 signature with multiple headers in the signature:
http://docs.aws.amazon.com/general/latest/gr/sigv4-add-signature-to-request.html
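To make the idea concrete, here is a minimal sketch of a query-string (pre-signed) V4 URL whose signature covers Content-Length, built by hand with Node's crypto module. It assumes a virtual-hosted bucket endpoint, an unsigned payload, and a simple object key; the function name presignPutWithLength and all of its parameter names are made up for illustration, and an SDK would normally do this work for you. Because content-length is listed in X-Amz-SignedHeaders, the request has to carry that same Content-Length value for the signature to validate.

```javascript
const crypto = require('crypto');

const hmac = (key, data) => crypto.createHmac('sha256', key).update(data, 'utf8').digest();
const sha256Hex = (data) => crypto.createHash('sha256').update(data, 'utf8').digest('hex');

// Hypothetical helper: pre-signs a PUT and includes Content-Length in the signed headers.
// Assumes the key has no characters that encodeURIComponent leaves unescaped (!'()*).
function presignPutWithLength({ accessKeyId, secretAccessKey, region, bucket, key,
                                contentLength, expiresSeconds, now }) {
  const amzDate = now.toISOString().replace(/[-:]/g, '').replace(/\.\d{3}/, ''); // e.g. 20130524T000000Z
  const dateStamp = amzDate.slice(0, 8);
  const scope = `${dateStamp}/${region}/s3/aws4_request`;
  const host = `${bucket}.s3.${region}.amazonaws.com`; // assumed endpoint style
  const canonicalUri = '/' + key.split('/').map(encodeURIComponent).join('/');
  const signedHeaders = 'content-length;host';

  const query = {
    'X-Amz-Algorithm': 'AWS4-HMAC-SHA256',
    'X-Amz-Credential': `${accessKeyId}/${scope}`,
    'X-Amz-Date': amzDate,
    'X-Amz-Expires': String(expiresSeconds),
    'X-Amz-SignedHeaders': signedHeaders,
  };
  const canonicalQuery = Object.keys(query).sort()
    .map((k) => `${encodeURIComponent(k)}=${encodeURIComponent(query[k])}`)
    .join('&');

  // Canonical request: method, URI, query, headers (each ending in \n), signed header list, payload hash.
  const canonicalRequest = [
    'PUT',
    canonicalUri,
    canonicalQuery,
    `content-length:${contentLength}\nhost:${host}\n`,
    signedHeaders,
    'UNSIGNED-PAYLOAD',
  ].join('\n');

  const stringToSign = ['AWS4-HMAC-SHA256', amzDate, scope, sha256Hex(canonicalRequest)].join('\n');

  // Standard V4 signing key derivation.
  const kDate = hmac('AWS4' + secretAccessKey, dateStamp);
  const kRegion = hmac(kDate, region);
  const kService = hmac(kRegion, 's3');
  const kSigning = hmac(kService, 'aws4_request');
  const signature = hmac(kSigning, stringToSign).toString('hex');

  return `https://${host}${canonicalUri}?${canonicalQuery}&X-Amz-Signature=${signature}`;
}
```

Note that this only pins the length to one exact value known in advance; it does not express an upper bound.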

You may not be able to limit content upload size ex-ante, especially considering POST and multipart uploads. You could use AWS Lambda to create an ex-post solution: set up a Lambda function to receive notifications from the S3 bucket, have it check the object size, and have it delete the object or take some other action.
Here's some documentation on Handling Amazon S3 Events Using the AWS Lambda.
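As a rough sketch of the size check such a function could perform, the snippet below picks the oversized objects out of an S3 event notification. The 10 MB limit and the function name findOversizedObjects are assumptions for illustration only; the record fields used (s3.bucket.name, s3.object.key, s3.object.size) are those of the S3 notification event.

```javascript
// Hypothetical limit; pick whatever your application expects.
const MAX_BYTES = 10 * 1024 * 1024;

// Returns the objects in an S3 event notification that exceed maxBytes.
function findOversizedObjects(event, maxBytes = MAX_BYTES) {
  return (event.Records || [])
    .filter((record) => record.s3 && record.s3.object.size > maxBytes)
    .map((record) => ({
      Bucket: record.s3.bucket.name,
      // Keys in S3 event notifications are URL-encoded, with spaces sent as '+'.
      Key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
      size: record.s3.object.size,
    }));
}
```

Inside the Lambda handler, each returned Bucket/Key pair could then be passed to s3.deleteObject() (or logged, moved, and so on).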

For any other wanderers that end up on this thread - if you set the Content-Length attribute when sending the request from your client, there are a few possibilities:
The Content-Length is calculated automatically, and S3 will store up to 5GB per file
The Content-Length is manually set by your client, which means one of these three scenarios will occur:
The Content-Length matches your actual file size and S3 stores it.
The Content-Length is less than your actual file size, so S3 will truncate your file to fit it.
The Content-Length is larger than your actual file size, and you will receive a 400 Bad Request
In any case, a malicious user can override your client and manually send an HTTP request with whatever headers they want, including a much larger Content-Length than you may be expecting. Signed URLs do not protect against this! The only way is to set up a POST policy. Official docs here: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
More details here: https://janac.medium.com/sending-files-directly-from-client-to-amazon-s3-signed-urls-4bf2cb81ddc3?postPublishedType=initial
Alternatively, you can have a Lambda that automatically deletes files that are larger than expected.
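For reference, a POST policy that caps the upload size uses a content-length-range condition. The sketch below only builds and base64-encodes the policy document; the function name buildPostPolicy and its parameters are hypothetical, and the encoded policy still has to be signed with your V4 signing key and sent along with the other form fields.

```javascript
// Hypothetical helper: builds the base64 policy document for a browser POST upload.
function buildPostPolicy({ bucket, key, maxBytes, expiration, credential, amzDate }) {
  const policy = {
    expiration, // ISO 8601 timestamp after which the policy is rejected
    conditions: [
      { bucket },
      { key },
      // S3 rejects the upload if the content is outside this byte range.
      ['content-length-range', 1, maxBytes],
      { 'x-amz-algorithm': 'AWS4-HMAC-SHA256' },
      { 'x-amz-credential': credential },
      { 'x-amz-date': amzDate },
    ],
  };
  return Buffer.from(JSON.stringify(policy), 'utf8').toString('base64');
}
```

Unlike a signed Content-Length, the range is enforced by S3 itself, so the client cannot pick a larger size.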

Related

How to specify custom metadata in resumable upload (via XML API)?

I am following steps for resumable upload outlined here.
According to documentation custom metadata has to be specified in first POST and is to be passed via x-goog-meta-* headers. I.e.:
x-goog-meta-header1: value1
x-goog-meta-header2: value2
... etc
But in my testing all these values disappear. After the final PUT, the object shows up in the bucket with the proper content-type but without a single piece of custom metadata.
What am I doing wrong?
P.S. It is rather suspicious that JSON API in resumable upload takes metadata as payload of first POST...
P.P.S. I am performing resumable upload via XML API described here (only using C++ code instead of curl utility). Adding x-goog-meta-mykey: myvalue header has no effect on object's custom metadata.
If you replace AWS4-HMAC-SHA256 in the Authorization header with GOOG4-HMAC-SHA256, it works. GCS uses this bit as a "should I expect x-amz- or x-goog- headers?" switch. The problem is that with resumable upload you have to specify x-goog-resumable, and adding x-amz-meta-* headers causes the request to fail with a message about mixing x-goog- and x-amz- headers.
I also went ahead and changed a few other aspects of the signature, namely:
request type: aws4_request -> goog4_request
signing key: AWS4 -> GOOG4 (GOOG1 works too)
service name: s3 -> storage (even though in some errors GCS asks for either s3 or storage to be specified here, it also takes gcs and maybe other values)
... this isn't necessary, it seems. I've done it just for consistency.

AWS service to verify data integrity of file in S3 via checksum?

One method of ensuring a file in S3 is what it claims to be is to download it, get its checksum, and match the result against the checksum you were expecting.
Does AWS provide any service that allows this to happen without the user needing to first download the file? (i.e. ideally a simple request/url that provides the checksum of an S3 file, so that it can be verified before the file is downloaded)
What I've tried so far
I can think of a DIY solution along the lines of
Create an API endpoint that accepts a POST request with the S3 file url
Have the API run a lambda that generates the checksum of the file
Respond with the checksum value
This may work, but is already a little complicated and would have further considerations, e.g. large files may take a long time to generate a checksum (e.g. > 60 seconds)
I'm hoping AWS have some simple way of validating S3 files?
There is an ETag created against each object, which is an MD5 of the object contents.
However, there seem to be some exceptions.
From Common Response Headers - Amazon Simple Storage Service:
ETag: The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
Also, the calculation of an ETag for a multi-part upload can be complex. See: s3cmd - What is the algorithm to compute the Amazon-S3 Etag for a file larger than 5GB? - Stack Overflow

AWS Lambda@Edge. How to read HTML file from S3 and put content in response body

Specifically, in an origin-response-triggered function (e.g. with a 404 status), how can I read an HTML file stored in S3 and use its content for the response body?
(I would like to manually return a custom error page just as CloudFront does, but choosing it based on cookies).
NOTE: The HTML file in S3 is stored in the same bucket of my website. OAI Enabled.
Thank you very much!
Lambda@Edge functions don't currently¹ have direct access to any body content from the origin.
You will need to grant your Lambda Execution Role the necessary privileges to read from the bucket, and then use s3.getObject() from the JavaScript SDK to fetch the object from the bucket, then use its body.
The SDK is already in the environment,² so you don't need to bundle it with your code. You can just require it, and create the S3 client globally, outside the handler, which saves time on subsequent invocations.
'use strict';
const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'us-east-2' }); // use the correct region for your bucket
exports.handler ...
Note that one of the perceived hassles of updating a Lambda@Edge function is that the Lambda console gives the impression that redeploying it is annoyingly complicated... but you don't have to use the Lambda console to do this. The wording of the "enable trigger and replicate" checkbox gives you the impression that it's doing something important, but it turns out... it isn't. Changing the version number in the CloudFront configuration and saving changes accomplishes the same purpose.
After you create a new version of the function, you can simply go to the Cache Behavior in the CloudFront console and edit the trigger ARN to use the new version number, then save changes.
¹currently but I have submitted this as a feature request; this could potentially allow a response trigger to receive a copy of the response body and rewrite it. It would necessarily be limited to the maximum size of the Lambda API (or smaller, as generated responses are currently limited), and might not be applicable in this case, since I assume you may be fetching a language-specific response.
²already in the environment. If I remember right, long ago, Lambda@Edge didn't include the SDK, but it is always there, now.
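To illustrate the two pieces around the s3.getObject() call, here is a small sketch: one function chooses which error page to fetch based on a cookie, and another puts the fetched HTML into the origin response. The cookie name lang, the ERROR_PAGES keys, and both function names are assumptions for illustration; the header layout (lowercase header name mapping to an array of key/value pairs) is the one Lambda@Edge uses in its event objects.

```javascript
// Hypothetical mapping from a "lang" cookie value to an object key in the bucket.
const ERROR_PAGES = { en: 'errors/404-en.html', fr: 'errors/404-fr.html' };

// Picks the S3 key of the error page from the request's Cookie headers.
function pickErrorPageKey(request, fallback = ERROR_PAGES.en) {
  const cookieHeaders = (request.headers && request.headers.cookie) || [];
  for (const { value } of cookieHeaders) {
    for (const pair of value.split(';')) {
      const [name, val] = pair.trim().split('=');
      if (name === 'lang' && ERROR_PAGES[val]) return ERROR_PAGES[val];
    }
  }
  return fallback;
}

// Returns a copy of the origin response with the given HTML as its body.
function withHtmlBody(response, html) {
  const headers = { ...response.headers };
  // Assumption: the origin's Content-Length describes the old body, so drop it.
  delete headers['content-length'];
  headers['content-type'] = [{ key: 'Content-Type', value: 'text/html; charset=utf-8' }];
  return { ...response, headers, body: html };
}
```

In the handler, the request and response would come from event.Records[0].cf, the key from pickErrorPageKey(request) would go to s3.getObject(), and the object's Body (converted to a string) would be passed to withHtmlBody(response, html) before returning the result.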

Amazon S3 Object Lifecycle Management via header

I've been searching for an answer to this question for quite some time but apparently I'm missing something.
I use s3cmd heavily to automate document uploads to AWS S3, via script. One of the parameters that can be used in s3cmd is --add-header, which I assume allows for lifecycle rules to be added.
My objective is to add this parameter and specify a +X (where X is days) for the upload. In the event of ... --add-header=...1 ... the lifecycle rule would delete this file after 24h.
I know this can be easily done via the console, but I would like to have a more detailed control over individual files/scripts.
I've read the parameters that can be passed to S3 via s3cmd, but I somehow can't understand how to put all of those together to get the intended result.
Thank you very much for any help or assistance!
The S3 API itself does not implement support for any request header that triggers lifecycle management at the object level.
The --add-header option for s3cmd can add headers that S3 understands, such as Content-Type, but there is no lifecycle header you can send using any tool.
You might be thinking of this:
If you make a GET or a HEAD request on an object that has been scheduled for expiration, the response will include an x-amz-expiration header that includes this expiration date and the corresponding rule Id
https://aws.amazon.com/blogs/aws/amazon-s3-object-expiration/
This is a response header, and is read-only.
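If you want to read that header from a script, a small parser could look like the sketch below. It assumes the header value has the form expiry-date="...", rule-id="..."; the function name parseExpirationHeader is made up for illustration, and the rule id is returned exactly as received.

```javascript
// Hypothetical helper: parses the value of an x-amz-expiration response header.
function parseExpirationHeader(headerValue) {
  const match = /expiry-date="([^"]+)",\s*rule-id="([^"]*)"/.exec(headerValue || '');
  if (!match) return null; // header missing or in an unexpected format
  return { expiryDate: new Date(match[1]), ruleId: match[2] };
}
```

The header only reports what an existing bucket-level lifecycle rule will do; it cannot be used to set an expiration.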

AWS S3 Upload Integrity

I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability but I couldn't find any information talking about integrity and/or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
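As a quick sketch of both encodings, the snippet below computes the Content-MD5 value to send and the ETag you would expect back for a plain single-part upload. The function name md5ForUpload and its return shape are illustrative only.

```javascript
const crypto = require('crypto');

// Hypothetical helper: derives both MD5 encodings from the request body.
function md5ForUpload(body) {
  const digest = crypto.createHash('md5').update(body).digest();
  return {
    // Value for the Content-MD5 request header (base64 of the raw digest).
    contentMd5: digest.toString('base64'),
    // ETag expected for a single-part upload without SSE-C/SSE-KMS (hex, quoted).
    expectedETag: `"${digest.toString('hex')}"`,
  };
}
```

Comparing expectedETag with the ETag in the upload response gives a second check on top of S3's own Content-MD5 validation.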
For multipart uploads, the Content-MD5 header is set to the same value, but for each part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenated binary-encoded (raw bytes) MD5 hashes of each part, plus - plus the number of parts.
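That calculation can be reproduced locally. The sketch below assumes the whole object is available as a non-empty Buffer and that every part except the last used the same part size; the function name multipartETag is made up for illustration.

```javascript
const crypto = require('crypto');

// Hypothetical helper: recomputes a multipart ETag from the object data and part size.
function multipartETag(data, partSize) {
  const partDigests = [];
  for (let offset = 0; offset < data.length; offset += partSize) {
    const part = data.subarray(offset, offset + partSize);
    // Raw (binary) MD5 of each part, not the hex string.
    partDigests.push(crypto.createHash('md5').update(part).digest());
  }
  const combined = crypto.createHash('md5').update(Buffer.concat(partDigests)).digest('hex');
  return `${combined}-${partDigests.length}`;
}
```

You need to know the part size the uploader actually used; a different part size yields a different ETag for the same data.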
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 about the parts you've uploaded, and some lazy developers will just ask S3 for this list and then send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must have a request header x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
Note that Content-MD5 and x-amz-content-sha256 are request headers. They are not returned with downloads. If you want to keep this information, save it in the object metadata, as I described here.
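A rough sketch of that idea for a gzip-compressed upload is below. This is not the actual utility; the metadata header names and the function name integrityMetadata are invented for illustration, since the real names are not given here.

```javascript
const crypto = require('crypto');
const zlib = require('zlib');

const sha256Hex = (buf) => crypto.createHash('sha256').update(buf).digest('hex');

// Hypothetical helper: gzips the data and builds metadata headers describing both forms.
function integrityMetadata(uncompressed) {
  const compressed = zlib.gzipSync(uncompressed);
  return {
    body: compressed, // what would actually be uploaded
    headers: {
      // Header names below are assumptions, not the original utility's names.
      'x-amz-meta-sha256': sha256Hex(compressed),
      'x-amz-meta-uncompressed-sha256': sha256Hex(uncompressed),
      'x-amz-meta-size': String(compressed.length),
      'x-amz-meta-uncompressed-size': String(uncompressed.length),
    },
  };
}
```

After a download, recomputing the same hashes and comparing them with the stored metadata verifies both the transfer and the decompression.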
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer if you use an instance with a public IPv4 or IPv6 address, a NAT instance, an S3 VPC endpoint, or through an IPv6 egress gateway. (You'll be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway). Obviously there are ways to automate this, but manually, if you select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can perform an MD5 checksum locally, and then verify that against the MD5 checksum of the object on S3 to ensure data integrity. Here is a guide