Gzip compression with CloudFront doesn't work - amazon-web-services

I have an Angular app which, even when built in production mode, has multiple large files (more than 1 MB).
I want to compress them with the gzip compression feature available in CloudFront.
I activated the "Compress Objects Automatically" option in the CloudFront console. The origin of my distribution is an S3 bucket.
However, the bundles downloaded when I load the page in my browser are not gzip-compressed.
Here's an example of a request/response.
Request headers:
:authority:dev.test.com
:method:GET
:path:/vendor.cc93ad5b987bea0611e1.bundle.js
:scheme:https
accept:*/*
accept-encoding:gzip, deflate, br
accept-language:fr-FR,fr;q=0.8,en-US;q=0.6,en;q=0.4
cache-control:no-cache
pragma:no-cache
referer:https://dev.test.com/console/projects
user-agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
Response headers:
accept-ranges:bytes
age:17979
content-length:5233622
content-type:text/javascript
date:Tue, 07 Nov 2017 08:42:08 GMT
etag:"6dfe6e16901c5ee5c387407203829bec"
last-modified:Thu, 26 Oct 2017 09:57:15 GMT
server:AmazonS3
status:200
via:1.1 9b307acf1eed524f97301fa1d3a44753.cloudfront.net (CloudFront)
x-amz-cf-id:9RpiXSuSGszUaX7hBA4ZaEO949g76UDoCaxzwFtiWo7C-wla-PyBsA==
x-cache:Hit from cloudfront
According to the AWS documentation, everything should be OK:
Accept-Encoding: gzip is present in the request
Content-Length is present in the response
the file is between 1,000 and 10,000,000 bytes
...
Do you have any idea why CloudFront doesn't compress my files?

This response was cached several hours ago.
age:17979
CloudFront won't go back and gzip what has already been cached.
CloudFront compresses files in each edge location when it gets the files from your origin. When you configure CloudFront to compress your content, it doesn't compress files that are already in edge locations. In addition, when a file expires in an edge location and CloudFront forwards another request for the file to your origin, CloudFront doesn't compress the file if your origin returns an HTTP status code 304, which means that the edge location already has the latest version of the file. If you want CloudFront to compress the files that are already in edge locations, you'll need to invalidate those files.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
Do a cache invalidation, wait for it to complete, and try again.
http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Invalidation.html
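If you'd rather script the invalidation than click through the console, here is a minimal sketch using boto3 (the distribution ID is a placeholder for your own, and invalidating /* simply wipes every cached path):

# Minimal sketch: trigger a CloudFront invalidation with boto3.
# The distribution ID below is a placeholder for your own.
import time
import boto3

cloudfront = boto3.client("cloudfront")

response = cloudfront.create_invalidation(
    DistributionId="E1234567890ABC",                 # hypothetical ID
    InvalidationBatch={
        "Paths": {"Quantity": 1, "Items": ["/*"]},   # invalidate everything
        "CallerReference": str(time.time()),         # must be unique per call
    },
)
print(response["Invalidation"]["Id"], response["Invalidation"]["Status"])

Once the invalidation reports Completed, the next request should miss the edge cache and, all other conditions being met, come back gzip-compressed.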

Dynamic gzip compression is handled by CloudFront on a best-effort basis, depending on the capacity and availability of the edge locations.
To get predictable compression, gzip the files yourself before uploading them to S3.
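For example, here is a hedged sketch using boto3 and Python's gzip module: compress the bundle locally and upload it with Content-Encoding: gzip, so the object is always served compressed no matter what CloudFront decides (bucket and file names are placeholders):

# Sketch: pre-compress a bundle and upload it to S3 with matching headers.
# The bucket name and file name are placeholders.
import gzip
import boto3

s3 = boto3.client("s3")

with open("vendor.bundle.js", "rb") as f:       # hypothetical local bundle
    compressed = gzip.compress(f.read())

s3.put_object(
    Bucket="my-app-bucket",                     # hypothetical bucket
    Key="vendor.bundle.js",
    Body=compressed,
    ContentEncoding="gzip",                     # so browsers decompress it
    ContentType="application/javascript",
    CacheControl="public, max-age=31536000",
)

The trade-off is that every client receives the compressed body, including the rare client that does not send Accept-Encoding: gzip.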

Related

Do Amazon CloudFront or Azure CDN support dynamic compression for HTTP range requests?

AWS CloudFront and Azure CDN can dynamically compress files under certain circumstances. But do they also support dynamic compression for HTTP range requests?
I couldn't find any hints in their documentation, only in the Google Cloud Storage docs.
Azure:
Range requests may be compressed into different sizes. Azure Front Door requires the content-length values to be the same for any GET HTTP request. If clients send byte range requests with the accept-encoding header that leads to the Origin responding with different content lengths, then Azure Front Door will return a 503 error. You can either disable compression on Origin/Azure Front Door or create a Rules Set rule to remove accept-encoding from the request for byte range requests.
See: https://learn.microsoft.com/en-us/azure/frontdoor/standard-premium/how-to-compression
AWS:
HTTP status code of the response
CloudFront compresses objects only when the HTTP status code of the response is 200, 403, or 404.
--> A range request has status code 206.
See:
https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/206
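If you want to verify this yourself, a small sketch with the Python requests library can probe the behavior (the URL is a placeholder for an object behind your distribution):

# Sketch: send a byte-range request and check whether the response is compressed.
# The URL is a placeholder.
import requests

url = "https://dxxxxxxxxxxxx.cloudfront.net/vendor.bundle.js"
resp = requests.get(url, headers={"Range": "bytes=0-1023", "Accept-Encoding": "gzip"})

print(resp.status_code)                       # expect 206 for a partial response
print(resp.headers.get("Content-Encoding"))   # None if the range is not compressed
print(resp.headers.get("Content-Range"))      # e.g. "bytes 0-1023/5233622"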
• Yes, Azure CDN also supports dynamic compression for HTTP range requests, where it is known as 'object chunking'. Object chunking means dividing the file to be retrieved from the origin server into smaller chunks of 8 MB. When a large file is requested, the CDN retrieves smaller pieces of the file from the origin. After the CDN POP server receives a full or byte-range file request, the CDN edge server requests the file from the origin in chunks of 8 MB.
• After the chunk arrives at the CDN edge, it's cached and immediately served to the user. The CDN then prefetches the next chunk in parallel. This prefetch ensures that the content stays one chunk ahead of the user, which reduces latency. This process continues until the entire file is downloaded (if requested), all byte ranges are available (if requested), or the client terminates the connection.
Also, this capability of object chunking relies on the origin server's ability to support byte-range requests; if the origin server doesn't support byte-range requests, requests to download data greater than 8 MB in size will fail.
See the following link for more details on object chunking:
https://learn.microsoft.com/en-us/azure/cdn/cdn-large-file-optimization#object-chunking
Also, see the following link for details on the types of compression supported by the different Azure CDN profiles:
https://learn.microsoft.com/en-us/azure/cdn/cdn-improve-performance#azure-cdn-standard-from-microsoft-profiles
Some tests have shown that when dynamic compression is enabled in AWS CloudFront, range support is disabled: the Range and If-Range headers are removed from all requests.

How to identify the object that consumes the most bandwidth in an AWS S3 bucket?

What is the best way to identify the object that consumes the most bandwidth in an S3 bucket containing thousands of other objects?
By "bandwidth" I will assume that you mean the bandwidth consumed by delivering files from S3 to some place on the Internet (as when you use S3 to serve static assets).
To track this, you'll need to enable S3 access logs, which creates logfiles in a different bucket that show all of the operations against your primary bucket (or a path in it).
Here are two examples of logged GET operations. The first is from anonymous Internet access using a public S3 URL, while the second uses the AWS CLI to download the file. I've redacted or modified any identifying fields, but you should be able to figure out the format from what remains.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx com-example-mybucket [04/Feb/2020:15:50:00 +0000] 3.4.5.6 - XXXXXXXXXXXXXXXX REST.GET.OBJECT index.html "GET /index.html HTTP/1.1" 200 - 90 90 9 8 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0" - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx - ECDHE-RSA-AES128-GCM-SHA256 - com-example-mybucket.s3.amazonaws.com TLSv1.2
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx com-example-mybucket [05/Feb/2020:14:51:44 +0000] 3.4.5.6 arn:aws:iam::123456789012:user/me XXXXXXXXXXXXXXXX REST.GET.OBJECT index.html "GET /index.html HTTP/1.1" 200 - 90 90 29 29 "-" "aws-cli/1.17.7 Python/3.6.9 Linux/4.15.0-76-generic botocore/1.14.7" - xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx SigV4 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader com-example-mybucket.s3.amazonaws.com TLSv1.2
So, to get what you want:
Enable logging
Wait for a representative amount of data to be logged. At least 24 hours unless you're a high-volume website (and note that it can take up to an hour for log records to appear).
Extract all the lines that contain REST.GET.OBJECT
From these, extract the filename and the number of bytes (in this case, the file is 90 bytes).
For each file, multiply the number of bytes by the number of times that it appears in a given period (a parsing sketch follows at the end of this answer).
Beware: because every access is logged, the logfiles can grow quite large, quite fast, and you will pay for storage charges. You should create a life-cycle rule on the destination bucket to delete old logs.
Update: you could also use Athena to query this data. Here's an AWS blog post that describes the process.
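To make the extraction and aggregation steps concrete, here is a minimal parsing sketch. It assumes the log files have already been downloaded into a local logs/ directory and that the fields follow the documented order; the field indices are an assumption you should double-check against your own log lines:

# Sketch: sum the logged "bytes sent" per object key from S3 access logs.
# Assumes logs are in ./logs/ and follow the documented field order;
# verify the field indices against your own lines before relying on this.
import shlex
from collections import Counter
from pathlib import Path

bytes_per_key = Counter()

for logfile in Path("logs").iterdir():
    for line in logfile.read_text().splitlines():
        if "REST.GET.OBJECT" not in line:
            continue
        fields = shlex.split(line)       # quoted fields become single tokens
        key = fields[8]                  # object key
        bytes_sent = fields[12]          # "-" means zero bytes sent
        if bytes_sent.isdigit():
            bytes_per_key[key] += int(bytes_sent)

# Top 10 objects by total bytes delivered.
for key, total in bytes_per_key.most_common(10):
    print(f"{total:>15}  {key}")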

AWS S3 upload API call returning 411 status

I have been trying to perform an AWS S3 REST API call to upload a document to an S3 bucket. The document is in the form of a byte array.
PUT /Test.pdf HTTP/1.1
Host: mybucket.s3.amazonaws.com
Authorization: **********
Content-Type: application/pdf
Content-Length: 5039151
x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD
x-amz-date: 20180301T055442Z
When we perform the API call, it returns status 411, i.e. Length Required. We have already added the Content-Length header with the byte array length as its value, but the issue persists. Please help to resolve the issue.
x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD is only used with the non-standards-based chunk upload API. This is a custom encoding that allows you to write chunks of data to the wire. This is not the same thing as the Multipart Upload API, and is not the same thing as Transfer-Encoding: chunked (which S3 doesn't support for uploads).
It's not clear why this would result in 411 Length Required but the error suggests that S3 is not happy with the format of the upload.
For a standard PUT upload, x-amz-content-sha256 must be set to the hex-encoded SHA-256 hash of the request body, or the string UNSIGNED-PAYLOAD. The former is recommended, because it provides an integrity check. If for any reason your data were to become corrupted on the wire in a way that TCP failed to detect, S3 would automatically reject the corrupt upload and not create the object.
See also https://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-auth-using-authorization-header.html
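For completeness, here is a hedged sketch in Python of both approaches: computing the hex-encoded payload hash for a hand-rolled request, and simply letting boto3 build a standard PUT (bucket name, key, and file name are placeholders):

# Sketch: standard (non-chunked) PUT upload. Bucket, key, and file name
# are placeholders.
import hashlib
import boto3

with open("Test.pdf", "rb") as f:
    body = f.read()                      # the byte array from the question

# If you hand-roll the request, send this hex digest as x-amz-content-sha256
# instead of STREAMING-AWS4-HMAC-SHA256-PAYLOAD.
payload_hash = hashlib.sha256(body).hexdigest()
print(payload_hash)

# Or let an SDK build the request: boto3 takes care of SigV4 signing,
# a valid x-amz-content-sha256 value, and the Content-Length header.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="mybucket",
    Key="Test.pdf",
    Body=body,
    ContentType="application/pdf",
)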

Amazon S3 access log files: incorrect value »bytes sent«

Analyzing our S3 access log files, I have noticed that the value of the »data transfer out per month« in the S3 access log files (S3stat and our own log file analysis) differs greatly from the values on our bills.
I have now run a test downloading files from one of our buckets, and it looks like the access log files are incorrect.
On 03/02/2015 I uploaded a zip file to our bucket and then successfully downloaded the complete file over two different internet connections.
One day later, on 04/02/2015, I analyzed the log files. Unfortunately, both entries have the value "-" for "Bytes Sent".
Amazon's »Server Access Log Format« (http://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html) says:
»The number of response bytes sent, excluding HTTP protocol overhead, or "-" if zero.«
The corresponding entries look like this:
BucketOwner Bucket [03/Feb/2015:10:28:41 +0000] RemoteIP - RequestID REST.GET.OBJECT Download.zip "GET /Bucket/Download.zip HTTP/1.1" 200 - - 760542 2228865159 58 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0" -
BucketOwner Bucket [03/Feb/2015:10:28:57 +0000] RemoteIP - RequestID REST.GET.OBJECT Download.zip "GET /Bucket/Download.zip HTTP/1.1" 200 - - 860028 2228865159 23 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0" -
As you can see, both log entries show quite a long connection duration (»Total Time«): 0:12:40 and 0:14:20.
Based on these findings, I then checked the log files of our main buckets for December 2014. Among 2,332 relevant entries (all ZIP files in our bucket) I found 860 entries with this error.
Thus, the Amazon S3 access log files seem flawed and useless for our analysis.
Can anybody help me? Am I making a mistake, and if so, how can these log files be reliably evaluated?
Thanks
Peter
After two months of inquiries with Amazon, it looks like Amazon has fixed this issue. My first test, for the period 13.03. to 16.03., shows no such errors anymore, and our S3stat analysis shows a massive (now correct) leap in »Daily Bandwidth« since 12.03.2015.
For more information you can look here:
https://forums.aws.amazon.com/thread.jspa?messageID=606654
Peter

Cloudfront TTL not working

I'm having a problem and have tried to follow answers here in the forum, but with no success whatsoever.
In order to generate thumbnails, I have set up the following schema:
S3 Account for original images
Ubuntu Server using NGINX and Thumbor
Cloudfront
The user uploads original images to S3, which are pulled through the Ubuntu server, with CloudFront in front of the request:
http://cloudfront.account/thumbor-server/http://s3.aws...
The problem is that we often lose objects in CloudFront; I want them to stay in the cache for 360 days.
I get following response through Cloudfront URL:
Cache-Control:max-age=31536000
Connection:keep-alive
Content-Length:4362
Content-Type:image/jpeg
Date:Sun, 26 Oct 2014 09:18:31 GMT
ETag:"cc095261a9340535996fad26a9a882e9fdfc6b47"
Expires:Mon, 26 Oct 2015 09:18:31 GMT
Server:nginx/1.4.6 (Ubuntu)
Via:1.1 5e0a3a528dab62c5edfcdd8b8e4af060.cloudfront.net (CloudFront)
X-Amz-Cf-Id:B43x2w80SzQqvH-pDmLAmCZl2CY1AjBtHLjN4kG0_XmEIPk4AdiIOw==
X-Cache:Miss from cloudfront
After a new refresh, I get:
Age:50
Cache-Control:max-age=31536000
Connection:keep-alive
Date:Sun, 26 Oct 2014 09:19:21 GMT
ETag:"cc095261a9340535996fad26a9a882e9fdfc6b47"
Expires:Mon, 26 Oct 2015 09:18:31 GMT
Server:nginx/1.4.6 (Ubuntu)
Via:1.1 5e0a3a528dab62c5edfcdd8b8e4af060.cloudfront.net (CloudFront)
X-Amz-Cf-Id:slWyJ95Cw2F5LQr7hQFhgonG6oEsu4jdIo1KBkTjM5fitj-4kCtL3w==
X-Cache:Hit from cloudfront
My Nginx responses as following:
Cache-Control:max-age=31536000
Content-Length:4362
Content-Type:image/jpeg
Date:Sun, 26 Oct 2014 09:18:11 GMT
Etag:"cc095261a9340535996fad26a9a882e9fdfc6b47"
Expires:Mon, 26 Oct 2015 09:18:11 GMT
Server:nginx/1.4.6 (Ubuntu)
Why does CloudFront not store my objects as indicated, even though max-age is set?
Many thanks in advance.
Your second request shows that the object was indeed cached. I assume you see that, but the question doesn't make it clear.
The Cache-Control: max-age only specifies the maximum age of your objects in the Cloudfront Cache at any particular edge location. There is no minimum time interval for which your objects are guaranteed to persist... after all, Cloudfront is a cache, which is volatile by definition.
If an object in an edge location isn't frequently requested, CloudFront might evict the object—remove the object before its expiration date—to make room for objects that are more popular.
— http://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Expiration.html
Additionally, there is no concept of Cloudfront as a whole having a copy of your object. Each edge location's cache appears to operate independently of the others, so it's not uncommon to see multiple requests for relatively popular objects coming from different Cloudfront edge locations.
If you are trying to manage the load on your back-end server, it might make sense to place some kind of cache that you control in front of it, such as Varnish, Squid, another Nginx, or a custom solution, which is how I'm accomplishing this in my systems.
Alternately, you could store every result in S3 after processing, and then configure your existing server to check S3, first, before attempting the work of resizing the object again.
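As a rough sketch of that second idea (all names are placeholders and generate_thumbnail stands in for your Thumbor/resize step, so treat this as an outline rather than a drop-in implementation):

# Sketch: serve a previously generated thumbnail from S3 if it exists,
# otherwise generate it and store it for next time. Names are placeholders
# and generate_thumbnail() is a hypothetical stand-in for the resize step.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-thumbnail-cache"            # hypothetical bucket

def get_or_create_thumbnail(key):
    try:
        obj = s3.get_object(Bucket=BUCKET, Key=key)
        return obj["Body"].read()        # already generated earlier
    except ClientError as err:
        if err.response["Error"]["Code"] != "NoSuchKey":
            raise                        # real error, not just a missing key

    data = generate_thumbnail(key)       # hypothetical resize function
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=data,
        ContentType="image/jpeg",
        CacheControl="max-age=31536000",
    )
    return data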
Then why is there a documented "minimum" TTL?
On the same page quoted above, you'll also find this:
For web distributions, if you add Cache-Control or Expires headers to your objects, you can also specify the minimum amount of time that CloudFront keeps an object in the cache before forwarding another request to the origin.
I can see why this, and the tip phrase cited in the comment below...
The minimum amount of time (in seconds) that an object is in a CloudFront cache before CloudFront forwards another request to your origin to determine whether an updated version is available. 
...would seem to contradict my answer. There is no contradiction, however.
The minimum ttl, in simple terms, establishes a lower boundary for the internal interpretation of Cache-Control: max-age, overriding -- within Cloudfront -- any smaller value sent by the origin server. Server says cache it for 1 day, max, but configured minimum ttl is 2 days? Cloudfront forgets about what it saw in the max-age header and may not check the origin again on subsequent requests for the next 2 days, rather than checking again after 1 day.
The nature of a cache dictates the correct interpretation of all of the apparent ambiguity:
Your configuration limits how long Cloudfront MAY serve up cached copies of an object, and the point after which it SHOULD NOT continue to return the object from its cache. They do not mandate how long Cloudfront MUST maintain the cached copy, because Cloudfront MAY evict an object at any time.
If you set the Cache-Control: header correctly, Cloudfront will consider the larger of max-age or your Minimum TTL as the longest amount of time you want them to serve up the cached copy without consulting the origin server again.
As your site traffic increases, this should become less of an issue, since your objects will be more "popular," but fundamentally there is no way to mandate that Cloudfront maintain a copy of an object.
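As a rough illustration of the rule described above (my reading of the docs, not CloudFront internals):

# Rough illustration (not CloudFront internals): the ceiling on how long an
# edge location may serve a cached copy before revalidating with the origin.
def cache_upper_bound(origin_max_age, minimum_ttl):
    # CloudFront may still evict the object earlier; this is only the longest
    # it will go without consulting the origin again.
    return max(minimum_ttl, origin_max_age)

# Example from the answer: origin says 1 day, configured minimum TTL is 2 days.
print(cache_upper_bound(origin_max_age=86400, minimum_ttl=172800))  # 172800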