gzip with AWS CloudFront and S3

CloudFront offers compression (gzip) for certain file types coming from the origin. My architecture is S3 as the origin behind a CloudFront distribution.
So, the requirements for files to get compressed by CloudFront are:
1. Enable the Compress Objects Automatically option in CloudFront's cache behavior settings.
2. Content-Type and Content-Length have to be returned by S3. S3 sends these headers by default; I have cross-checked this.
3. The file type must be one of the file types CloudFront compresses. In my case I want to compress app.bundle.js, which is served as application/javascript (Content-Type), and that type is in CloudFront's supported list.
As far as I can tell, those are the only requirements for the browser to receive a gzipped version of the files. Even with all of the above in place, gzip does not work for me. Any ideas what I am missing?
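One thing worth checking, as an aside: CloudFront only returns a compressed copy when the viewer request advertises gzip support via Accept-Encoding, so it helps to inspect the request and response headers directly. A minimal Python sketch (the distribution URL is a placeholder):

import urllib.request

# Placeholder distribution URL; substitute your own CloudFront domain and path.
url = "https://d111111abcdef8.cloudfront.net/app.bundle.js"

# Ask for gzip explicitly; CloudFront will not compress without this header.
req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
with urllib.request.urlopen(req) as resp:
    print("Content-Encoding:", resp.headers.get("Content-Encoding"))  # expect "gzip"
    print("Content-Length:", resp.headers.get("Content-Length"))
    print("X-Cache:", resp.headers.get("X-Cache"))  # Hit/Miss from cloudfront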

Related

Aws::Transfer::TransferManager::UploadDirectory and content-type

I'm attempting to use Aws::Transfer::TransferManager::UploadDirectory to upload a directory of files to S3. These files will later be served to web clients via CloudFront, so I need to set several headers such as Content-Type and Content-Encoding, and they will differ from file to file.
At first glance, there does not appear to be a way to specify this information as part of the UploadDirectory call. There is an Aws::Map<Aws::String, Aws::String> metadata parameter that feels like it should be what I want, but it is undocumented and I'm not sure how a string-to-string mapping could do what I need.
Is UploadDirectory the wrong approach here? Would I be better off re-implementing my own version so that I can do more per-file operations?
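For what it's worth, if you do fall back to per-file uploads, the loop is short. A rough sketch of that approach using boto3 rather than the C++ SDK from the question, with hypothetical bucket and directory names:

import mimetypes
import os

import boto3

s3 = boto3.client("s3")

def upload_directory(local_dir, bucket, prefix=""):
    # Walk the directory and upload each file with its own headers.
    for root, _, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            key = os.path.join(prefix, os.path.relpath(path, local_dir))
            content_type, _ = mimetypes.guess_type(name)
            extra = {"ContentType": content_type or "application/octet-stream"}
            # Example of per-file logic: pre-gzipped assets get Content-Encoding.
            if name.endswith(".gz"):
                extra["ContentEncoding"] = "gzip"
            s3.upload_file(path, bucket, key, ExtraArgs=extra)

upload_directory("./dist", "my-website-bucket", "assets")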

How to enable gzip compression on AWS CloudFront

I'm trying to gzip-compress the image I'm serving through CloudFront. My origin is S3.
Based on several articles/blogs about AWS, what I did is:
1) Set the "Content-Length" header for the object I want to compress. I set the value equal to the size shown in the object's size property box.
2) Set Compress Objects Automatically to Yes in the cache behavior of my CloudFront distribution.
3) Invalidated my object to get a fresh copy from S3.
Still, I'm not able to make CloudFront gzip my object. Any ideas?
I'm trying to gzip compress the [image]
You don't typically need to gzip images -- doing so saves very little bandwidth, if any, since virtually all image formats used on the web are already compressed.
Also, CloudFront doesn't support it.
See File Types that CloudFront Compresses for the supported file formats. They are text-based formats, which tend to benefit substantially from gzip compression.
If you really want the files served gzipped, you can store the files in S3, already gzipped.
$ gzip -9 myfile.png
This will create a gzipped file myfile.png.gz.
Upload the file to S3 without the .gz on the end. Set the Content-Encoding: header to gzip and set the Content-Type: header to the normal, correct value for the file, such as image/png.
This breaks any browser that doesn't understand Content-Encoding: gzip, but there should be no browsers in use that have that limitation.
Note that the -9, above, means maximum compression.
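If you script the upload, the two headers can be set at the same time. A minimal boto3 sketch of the approach described above (bucket and file names are placeholders):

import gzip

import boto3

s3 = boto3.client("s3")

# Compress locally at maximum compression, equivalent to gzip -9.
with open("myfile.png", "rb") as f:
    body = gzip.compress(f.read(), compresslevel=9)

s3.put_object(
    Bucket="my-bucket",
    Key="myfile.png",           # no .gz suffix on the key
    Body=body,
    ContentType="image/png",    # the real type of the underlying file
    ContentEncoding="gzip",     # tells the browser to gunzip on the fly
)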
If you're trying to gzip jpegs/pngs, I would suggest that you first compress them online with a tool such as https://tinyjpg.com/
Ideally you will not need to compress the images any further. Image-optimization tools work better than gzip -9 because they take the image's textures, colors, and patterns into account.
Also, make sure you save your files in the appropriate formats (photographic images as JPEG, graphics and line art as PNG); this also helps reduce image size.

AWS S3 Upload Integrity

I'm using S3 to backup large files that are critical to my business. Can I be confident that once uploaded, these files are verified for integrity and are intact?
There is a lot of documentation around scalability and availability, but I couldn't find any information about integrity or checksums.
When uploading to S3, there's an optional request header (which in my opinion should not be optional, but I digress), Content-MD5. If you set this value to the base64 encoding of the MD5 hash of the request body, S3 will outright reject your upload in the event of a mismatch, thus preventing the upload of corrupt data.
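With boto3, for example, supplying Content-MD5 looks roughly like this (bucket and key are placeholders); the header expects the base64-encoded binary digest, not the hex string:

import base64
import hashlib

import boto3

s3 = boto3.client("s3")

with open("backup.tar.xz", "rb") as f:
    body = f.read()

# Base64 of the raw 128-bit MD5 digest, as required by the Content-MD5 header.
content_md5 = base64.b64encode(hashlib.md5(body).digest()).decode("ascii")

# S3 rejects the upload if the body does not match the declared MD5.
s3.put_object(
    Bucket="my-backup-bucket",
    Key="backups/backup.tar.xz",
    Body=body,
    ContentMD5=content_md5,
)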
The ETag header will be set to the hex-encoded MD5 hash of the object, for single part uploads (with an exception for some types of server-side encryption).
For multipart uploads, Content-MD5 works the same way, but it is validated per part.
When S3 combines the parts of a multipart upload into the final object, the ETag header is set to the hex-encoded MD5 hash of the concatenated binary (raw bytes) MD5 hashes of each part, followed by a - and the number of parts.
When you ask S3 to do that final step of combining the parts of a multipart upload, you have to give it back the ETags it gave you during the uploads of the original parts, which is supposed to assure that what S3 is combining is what you think it is combining. Unfortunately, there's an API request you can make to ask S3 for the list of parts you've uploaded, and some lazy developers will just ask S3 for this list and send it right back, which the documentation warns against, but hey, it "seems to work," right?
Multipart uploads are required for objects over 5GB and optional for uploads over 5MB.
Correctly used, these features provide assurance of intact uploads.
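To illustrate the ETag rule above, here is a sketch that computes the ETag you would expect S3 to report, assuming you know the part size the uploader used (8 MB here as an example):

import hashlib

def expected_etag(path, part_size=8 * 1024 * 1024):
    # Hash each part exactly as it would have been uploaded.
    part_digests = []
    with open(path, "rb") as f:
        while True:
            chunk = f.read(part_size)
            if not chunk:
                break
            part_digests.append(hashlib.md5(chunk).digest())
    if len(part_digests) == 1:
        # Single-part upload: the ETag is simply the hex MD5 of the object.
        return part_digests[0].hex()
    # Multipart: hex MD5 of the concatenated raw part digests, plus "-" and the part count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return "{}-{}".format(combined, len(part_digests))

print(expected_etag("backup.tar.xz"))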
If you are using Signature Version 4, which is also optional in older regions, there is an additional integrity mechanism, and this one isn't optional (if you're actually using V4): uploads must include a request header, x-amz-content-sha256, set to the hex-encoded SHA-256 hash of the payload, and the request will be denied if there's a mismatch here, too.
My take: Since some of these features are optional, you can't trust that any tools are doing this right unless you audit their code.
I don't trust anybody with my data, so for my own purposes, I wrote my own utility, internally called "pedantic uploader," which uses no SDK and speaks directly to the REST API. It calculates the sha256 of the file and adds it as x-amz-meta-... metadata so it can be fetched with the object for comparison. When I upload compressed files (gzip/bzip2/xz) I store the sha of both compressed and uncompressed in the metadata, and I store the compressed and uncompressed size in octets in the metadata as well.
Note that Content-MD5 and x-amz-content-sha256 are request headers; they are not returned with downloads. If you want this information available later, you have to save it yourself in the object metadata, as I described above.
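A stripped-down sketch of that idea, using boto3 rather than raw REST (names are placeholders): compute the SHA-256 up front, send it as user metadata, and compare it on download.

import hashlib

import boto3

s3 = boto3.client("s3")

def upload_with_sha256(path, bucket, key):
    with open(path, "rb") as f:
        body = f.read()
    digest = hashlib.sha256(body).hexdigest()
    # Stored and returned as the x-amz-meta-sha256 header.
    s3.put_object(Bucket=bucket, Key=key, Body=body, Metadata={"sha256": digest})
    return digest

def verify_sha256(bucket, key):
    obj = s3.get_object(Bucket=bucket, Key=key)
    body = obj["Body"].read()
    return hashlib.sha256(body).hexdigest() == obj["Metadata"]["sha256"]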
Within EC2, you can easily download an object without actually saving it to disk, just to verify its integrity. If the EC2 instance is in the same region as the bucket, you won't be billed for data transfer if you use an instance with a public IPv4 or IPv6 address, a NAT instance, an S3 VPC endpoint, or through an IPv6 egress gateway. (You'll be billed for NAT Gateway data throughput if you access S3 over IPv4 through a NAT Gateway). Obviously there are ways to automate this, but manually, if you select the object in the console, choose Download, right-click and copy the resulting URL, then do this:
$ curl -v '<url from console>' | md5sum # or sha256sum etc.
Just wrap the URL from the console in single ' quotes since it will be pre-signed and will include & in the query string, which you don't want the shell to interpret.
You can perform an MD5 checksum locally, and then verify that against the MD5 checksum of the object on S3 to ensure data integrity. Here is a guide

Limit Size Of Objects While Uploading To Amazon S3 Using Pre-Signed URL

I know of limiting the upload size of an object using this method: http://doc.s3.amazonaws.com/proposals/post.html#Limiting_Uploaded_Content
But I would like to know how it can be done while generating a pre-signed URL using the S3 SDK on the server side as an IAM user.
This SDK method has no such option in its parameters: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#putObject-property
Nor does this one: http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getSignedUrl-property
Please note: I already know of this answer: AWS S3 Pre-signed URL content-length, and it is NOT what I am looking for.
The V4 signing protocol offers the option to include arbitrary headers in the signature. See:
http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html
So, if you know the exact Content-Length in advance, you can include that header in the signed URL. Based on some experiments with curl, S3 will truncate the file if you send more than the size specified in the Content-Length header. Here is an example of a V4 signature with multiple headers included in the signature:
http://docs.aws.amazon.com/general/latest/gr/sigv4-add-signature-to-request.html
You may not be able to limit the upload size ex-ante, especially considering POST and multipart uploads. You could use AWS Lambda to create an ex-post solution: set up a Lambda function to receive notifications from the S3 bucket, have the function check the object size, and have it delete the object or take some other action.
Here's some documentation on Handling Amazon S3 Events Using AWS Lambda.
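A bare-bones sketch of such a function, triggered by s3:ObjectCreated notifications (the 10 MB limit is just an example):

from urllib.parse import unquote_plus

import boto3

s3 = boto3.client("s3")
MAX_BYTES = 10 * 1024 * 1024  # example limit

def handler(event, context):
    # One notification can carry several records.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        if size > MAX_BYTES:
            # Ex-post enforcement: the object was stored, but we remove it.
            s3.delete_object(Bucket=bucket, Key=key)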
For any other wanderers who end up on this thread: if you set the Content-Length attribute when sending the request from your client, there are a few possibilities:
The Content-Length is calculated automatically, and S3 will store up to 5GB per file
The Content-Length is manually set by your client, which means one of these three scenarios will occur:
The Content-Length matches your actual file size and S3 stores it.
The Content-Length is less than your actual file size, so S3 will truncate your file to fit it.
The Content-Length is larger than your actual file size, and you will receive a 400 Bad Request
In any case, a malicious user can bypass your client and manually send an HTTP request with whatever headers they want, including a much larger Content-Length than you may be expecting. Signed URLs do not protect against this! The only way is to set up a POST policy. Official docs here: https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-HTTPPOSTConstructPolicy.html
More details here: https://janac.medium.com/sending-files-directly-from-client-to-amazon-s3-signed-urls-4bf2cb81ddc3?postPublishedType=initial
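With boto3, generating such a policy looks roughly like this (bucket, key, and the 10 MB ceiling are placeholders); the content-length-range condition is enforced by S3 itself when the form is posted:

import boto3

s3 = boto3.client("s3")

post = s3.generate_presigned_post(
    Bucket="my-upload-bucket",
    Key="uploads/example.bin",
    Conditions=[["content-length-range", 0, 10 * 1024 * 1024]],
    ExpiresIn=3600,
)
# The client POSTs a multipart/form-data request to post["url"] with
# post["fields"] plus the file; S3 rejects anything outside the size range.
print(post["url"])
print(post["fields"])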
Alternatively, you can have a Lambda that automatically deletes files that are larger than expected.

Force AWS EMR to unzip files in S3

I have a bucket in AWS's S3 service that contains gzipped CSV files; however, when they were stored they were all saved with a Content-Type of text/csv in the metadata.
Now I am using AWS EMR, which will not recognize them as compressed files and unzip them. I've looked through the configuration options for EMR but don't see anything that would work... I have almost a million files, so changing their metadata would require a Boto script that cycles through all of the files and updates the value.
Am I missing something easy? Thanks!
The Content-Type isn't the problem... that's correct if the files are CSV, but since you stored them gzipped, you also needed to set Content-Encoding: gzip in the header metadata. Doing that "should" trigger the user agent that's fetching them to gunzip them on the fly when they are downloaded... so had you done that, it should have "just worked."
(I store gzipped log files this way, with Content-Type: text/plain and Content-Encoding: gzip and when you download them with a web browser, the file you get is no longer gzipped because the browser untwizzles the compression on the fly due to the Content-Encoding header.)
But, since you've already uploaded the files, I did find this in the google machine, which might help:
GZipped input. A lot of my input data had already been gzipped, but luckily if you pass -jobconf stream.recordreader.compression=gzip in the extra arguments section Hadoop will decompress them on the fly before passing the data to your mapper.
http://petewarden.typepad.com/searchbrowser/2010/01/elastic-mapreduce-tips.html
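If you do end up backfilling the metadata with the Boto script mentioned in the question, it can be done with an in-place copy rather than a re-upload. A rough boto3 sketch (bucket and prefix are placeholders; note that copy_object only handles objects up to 5 GB):

import boto3

s3 = boto3.client("s3")
bucket = "my-data-bucket"

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="csv/"):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        # Copy the object onto itself, replacing the metadata so that
        # Content-Encoding: gzip is recorded alongside the existing Content-Type.
        s3.copy_object(
            Bucket=bucket,
            Key=key,
            CopySource={"Bucket": bucket, "Key": key},
            ContentType="text/csv",
            ContentEncoding="gzip",
            MetadataDirective="REPLACE",
        )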