Cloudfront, Compression, HTTPS Load error - amazon-web-services

I've been working on a project, and we recently switched from HTTP to HTTPS.
So here are a few things. We're hosting our website on S3 with static website hosting enabled. For the server, we created an instance through Elastic Beanstalk. Using ACM, we obtained a certificate and successfully attached it to our frontend through CloudFront and to our server through the Elastic Beanstalk load balancer.
Now our site is finally live and says secure! For some reason, on the first load the page sometimes takes upwards of 10 seconds and usually ends up timing out. This happens every couple of hours, or whenever we clear our cache. But as soon as we refresh, the site takes 2-3 seconds to load and everything works perfectly. We think this is because we're using Angular 5, and ng build produces a large file called vendor.bundle.js that's around 13 MB.
We want to gzip this to hopefully solve the problem, because we think the website is timing out before that vendor file has even loaded. We went into CloudFront and enabled compression, then went into our S3 bucket and added "Content-Length" as an allowed header. I looked up the request and response headers, and they are as follows:
Request
:method: GET
:scheme: https
:authority: riftapp.io
:path: /vendor.bundle.js
Host: riftapp.io
If-None-Match: "de078424bf26a6b1e2873009a37924ee-2"
Accept: */*
Connection: keep-alive
Accept-Language: en-us
Accept-Encoding: br, gzip, deflate
Cookie: __stripe_mid=333bb2df-0b1a-4757-a193-0596baeabd7c; __stripe_sid=bff74933-d3fd-4ee2-a7a8-e5e73a224fdb
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.1 Safari/605.1.15
If-Modified-Since: Wed, 25 Jul 2018 03:02:42 GMT
Referer: https://riftapp.io/home
Response
:status: 304
Via: 1.1 e3a844a9e0d478ce4d12c6d1f3a2d892.cloudfront.net (CloudFront)
ETag: "de078424bf26a6b1e2873009a37924ee-2"
Age: 3021
Date: Wed, 25 Jul 2018 03:55:47 GMT
Server: AmazonS3
x-amz-cf-id: JHS0o-e-tBh1wfTy4AgKHALzHEEvdbUOkO0mgzSc56LAmnx515WS9A==
x-cache: Hit from cloudfront
We think it might be because the request's Host header is riftapp.io instead of abc.cloudfront.net or something, but when I tried to delete the origin, I couldn't because our S3 bucket is linked to it.
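One workaround, if CloudFront's on-the-fly compression never kicks in, is to pre-compress the bundle and upload it to S3 with an explicit Content-Encoding, so the object is served gzipped regardless of edge compression. A minimal sketch with the AWS CLI, assuming the build output sits in dist/ and using riftapp-frontend as a placeholder bucket name:
$ gzip -9 -k dist/vendor.bundle.js
$ aws s3 cp dist/vendor.bundle.js.gz s3://riftapp-frontend/vendor.bundle.js \
    --content-encoding gzip \
    --content-type application/javascript
Note that serving a pre-gzipped object this way assumes every client accepts gzip, which virtually all browsers do.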

Related

Google cloud CDN backend service load balancer not caching any resources

We have a requirement to generate images on the fly and cache them using a CDN. For this, we have configured a backend service behind a load balancer with Cloud CDN enabled. We are using an Nginx proxy server. We have added the headers specified in the Google Cloud CDN docs, but unfortunately it is not caching.
Request:
GET /resize?size=l&url=https://example.com/image.jpeg HTTP/1.1
Host: resize.example.com
Request Headers:
Host: resize.example.com
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:70.0) Gecko/20100101 Firefox/70.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Response headers:
HTTP/1.1 200 OK
Server: nginx/1.17.2
Date: Wed, 15 Jan 2020 15:01:14 GMT
Content-Type: image/jpeg
Content-Length: 62771
cache-control: max-age=86400, public, s-maxage=86400
Via: 1.1 google
I suggest a couple of pages that could help you.
a) Not all HTTP responses are cacheable. Cloud CDN caches only those responses that meet all the requirements in this section. Some of these requirements are specified by RFC 7234, and others are specific to Cloud CDN.
Cacheability for HTTP responses
Responses aren't being cached (Troubleshooting)
The following example demonstrates using curl to check the HTTP response headers for http://example.com/style.css:
$ curl -s -D - -o /dev/null http://example.com/style.css
HTTP/1.1 200 OK
Date: Tue, 16 Feb 2016 12:00:00 GMT
Content-Type: text/css
Content-Length: 1977
Via: 1.1 google
Although, given the response headers you added, you may have already read these.
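A quick way to confirm whether Cloud CDN is serving from cache is to request the URL twice and look for an Age header on the second response; a sketch using the same curl pattern as above (the URL is the one from the question):
$ curl -s -D - -o /dev/null "https://resize.example.com/resize?size=l&url=https://example.com/image.jpeg"
$ curl -s -D - -o /dev/null "https://resize.example.com/resize?size=l&url=https://example.com/image.jpeg"
A cache hit on the second request should add an Age header alongside Via: 1.1 google.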

Amazon CloudFront: Identify if there have been any updates with the file - Reload file (and re-cache)

I have recently set up CloudFront to cache files (images, CSS, and JavaScript).
Sometimes it requires me to clear the cache manually from within the browser to get the new file.
Caching files with CloudFront is really tough. A week ago, I tried enabling gzip compression on CloudFront and noticed that until I added my file to the "Invalidation" table, the updated file wasn't displayed on my website.
(Clearing the cache from my browser didn't help either, by the way.)
Is there any "smart" way of caching on Amazon CloudFront?
Something like: "if the file has been updated - send the new file (as gzip, of course)."
Here are my headers:
Request URL:https://abc.cloudfront.net/live/static/rcss/bootstrap3.min.css
Request Method:GET
Status Code:200 OK
Remote Address:77.77.77.77:443
Referrer Policy:no-referrer-when-downgrade
Response Headers
HTTP/1.1 200 OK
Content-Type: text/css
Connection: keep-alive
Date: Sun, 07 May 2017 08:14:04 GMT
Last-Modified: Wed, 26 Apr 2017 08:43:05 GMT
Server: AmazonS3
Content-Encoding: gzip
Vary: Accept-Encoding
X-Cache: Miss from cloudfront
Via: 1.1 abc.cloudfront.net (CloudFront)
X-Amz-Cf-Id: abc-iuxzccRYcxce8LjxzcDeYdasdehHqFJGj80iczxcz8Q4asdDpWg==
Request Headers
Accept:text/css,*/*;q=0.1
Accept-Encoding:gzip, deflate, sdch, br
Accept-Language:en-US,en;q=0.8
Cache-Control:no-cache
Connection:keep-alive
Host:abc.cloudfront.net
Pragma:no-cache
Referer:https://example.com/
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36
If there is anything else I can add to my post - please do ask.
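For reference, the manual invalidation mentioned above can also be scripted; a minimal sketch with the AWS CLI (the distribution ID below is the placeholder from the AWS docs):
$ aws cloudfront create-invalidation \
    --distribution-id EDFDVBD6EXAMPLE \
    --paths "/live/static/rcss/bootstrap3.min.css"
The usual alternative is to version the file names (e.g. bootstrap3.min.v2.css), so the URL changes whenever the content does and no invalidation is needed.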

AWS Cloudfront behaviors not working as expected

I have a PHP app on AWS Elastic Beanstalk, and I have created an assets bucket on S3. I'm trying to set up a CloudFront distribution with behaviors that send requests for assets/* to S3, with a default behavior that sends requests to EB. The domain points to CloudFront.
All requests are going to EB, which returns a 404 since there is no assets directory in the EB environment.
I have created two CloudFront origins, one for EB and one for the S3 bucket. This is what my behaviors look like:
Precedence | Path Pattern | Origin                                        | Protocol Policy | Fwd Query Strings
0          | assets/*     | S3-example-bucket                             | HTTP and HTTPS  | No
1          | Default (*)  | Custom-example.us-east-1.elasticbeanstalk.com | HTTP and HTTPS  | Yes
It seems as though this should be pretty straightforward, so I assume I'm missing something basic. Any help is greatly appreciated.
Edit:
Request header:
GET /assets/images/10waysaudiobook.png HTTP/1.1
Host: example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: wordpress_logged_in_8a27500b7747be1e4fbad7f473f238e5=snickerspixy%7C1466021823%7Cr7rE5moINjanjHEqb1TGbsSkn9F7OCZLfX69IbcnGJu%7C28fc452885f3fe6e954243abab585a188f6511cdd6eeec6fa5ec5c50b9f3d393; wp-settings-7674=m4%3Do%26m5%3Do%26m9%3Do%26m6%3Do%26editor%3Dhtml%26m10%3Do%26m0%3Do%26m3%3Do%26hidetb%3D1%26m2%3Dc%26m1%3Do%26m8%3Do%26m12%3Do%26m7%3Do%26m11%3Do%26urlbutton%3Dnone%26m13%3Do%26tml1%3D1%26imgsize%3Dfull%26align%3Dcenter%26libraryContent%3Dbrowse%26ed_size%3D569%26unfold%3D1%26wplink%3D1%26mfold%3Do%26post_dfw%3Doff%26advImgDetails%3Dshow%26posts_list_mode%3Dlist; wp-settings-time-7674=1464816549; AWSELB=1FCB85F51606EBAFF15FEADB01C8069AEDE17E2A043407E615EF1A0E1ABF24607545A45D3DC206631F7AAE4503ADA423788B5E6B5B48FAE93EE916DE068509E64F92AC10FF; PHPSESSID=cpi2su7s967phu87rlpjgneel6; wordpress_test_cookie=WP+Cookie+check
Connection: keep-alive
Response header:
HTTP/1.1 404 Not Found
Cache-Control: no-cache, must-revalidate, max-age=0
Content-Type: text/html; charset=UTF-8
Date: Sun, 05 Jun 2016 00:54:23 GMT
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Link: <http://example.com/wp-json/>; rel="https://api.w.org/"
Pragma: no-cache
Server: Apache
Transfer-Encoding: chunked
Connection: keep-alive
The response headers indicate that this request wasn't served by CloudFront at all, because there are headers that should be present... but are absent.
CloudFront adds Via:, X-Cache:, and x-amz-cf-id: headers to every response, and sometimes Age: (on cache hits and errors) or Vary: if you're forwarding the CloudFront-Is-*-Viewer: headers to the origin.
The absence of these headers suggests that the DNS for the site hasn't been pointed to CloudFront and may still be pointing directly to the EB environment, or, if the change was recent, that the former TTL for the DNS entry may not yet have expired.
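One way to check where the DNS currently points is to query it directly; a sketch (example.com stands in for the real domain):
$ dig +short example.com
If CloudFront is in front, you would expect the answer to resolve through a name like d111111abcdef8.cloudfront.net (the placeholder from the AWS docs); a *.elasticbeanstalk.com name or a bare EB IP means requests are still going straight to the environment.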

URL forbidden 403 when using a tool but fine from browser

I have some images for which I need to do an HttpRequestMethod.HEAD request in order to find out some details of the image.
When I go to the image URL in a browser, it loads without a problem.
When I attempt to get the header info via my code or via online tools, it fails.
An example URL is http://www.adorama.com/images/large/CHHB74P.JPG
As mentioned, I have used the online tool Hurl.it to try to obtain the HEAD response, but I am getting the same 403 Forbidden message that I get in my code.
I have tried adding many various headers to the Head request (User-Agent, Accept, Accept-Encoding, Accept-Language, Cache-Control, Connection, Host, Pragma, Upgrade-Insecure-Requests) but none of this seems to work.
It also fails to do a normal GET request via Hurl.it. Same 403 error.
If it is relevant, my code is a C# web service running on the AWS cloud (just in case the Adorama servers have something against AWS that I don't know about). To test this, I have also spun up an EC2 instance (Linux box) and run curl, which also returned the 403 error. Running curl locally on my personal computer returns the binary image, which is presumably just the image data.
And just to remove the obvious thoughts: my code works successfully for many other websites; it is just this one where there is an issue.
Any idea what is required for me to download the image headers and not get the 403?
Same problem here.
Locally it works smoothly; doing it from an AWS instance, I get the very same problem.
I thought it was a DNS resolution problem (redirecting to a malfunctioning node). I therefore tried to specify the same IP address as resolved by my client, but that didn't fix the problem.
My guess is that Akamai (the service is provided by an Akamai CDN in this case) is blocking AWS. It is somewhat understandable: customers pay for CDN traffic, and by abusing it, people can generate huge bills.
Connecting to www.adorama.com (www.adorama.com)|104.86.164.205|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 403 Forbidden
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 301
Cache-Control: max-age=604800
Date: Wed, 23 Mar 2016 09:34:20 GMT
Connection: close
2016-03-23 09:34:20 ERROR 403: Forbidden.
I tried that URL from Amazon and it didn't work for me. wget did work from other servers that weren't on Amazon EC2, however. Here is the wget output on EC2:
wget -S http://www.adorama.com/images/large/CHHB74P.JPG
--2016-03-23 08:42:33-- http://www.adorama.com/images/large/CHHB74P.JPG
Resolving www.adorama.com... 23.40.219.79
Connecting to www.adorama.com|23.40.219.79|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 403 Forbidden
Server: AkamaiGHost
Mime-Version: 1.0
Content-Type: text/html
Content-Length: 299
Cache-Control: max-age=604800
Date: Wed, 23 Mar 2016 08:42:33 GMT
Connection: close
2016-03-23 08:42:33 ERROR 403: Forbidden.
But from another Linux host it did work. Here is output
wget -S http://www.adorama.com/images/large/CHHB74P.JPG
--2016-03-23 08:43:11-- http://www.adorama.com/images/large/CHHB74P.JPG
Resolving www.adorama.com... 23.45.139.71
Connecting to www.adorama.com|23.45.139.71|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Content-Type: image/jpeg
Last-Modified: Wed, 23 Mar 2016 08:41:57 GMT
Server: Microsoft-IIS/8.5
X-AspNet-Version: 2.0.50727
X-Powered-By: ASP.NET
ServerID: C01
Content-Length: 15131
Cache-Control: private, max-age=604800
Date: Wed, 23 Mar 2016 08:43:11 GMT
Connection: keep-alive
Set-Cookie: 1YDT=CT; expires=Wed, 20-Apr-2016 08:43:11 GMT; path=/; domain=.adorama.com
P3P: CP="NON DSP ADM DEV PSD OUR IND STP PHY PRE NAV UNI"
Length: 15131 (15K) [image/jpeg]
Saving to: “CHHB74P.JPG”
100%[=====================================>] 15,131 --.-K/s in 0s
2016-03-23 08:43:11 (460 MB/s) - “CHHB74P.JPG” saved [15131/15131]
I would guess that the image provider is deliberately blocking requests from EC2 address ranges.
The reason the outgoing IP address is different in the two wget examples is DNS resolution by the CDN provider that Adorama is using.
Web servers may check particular fingerprint attributes to block automated bots. Here are a few things they can check:
GeoIP / IP address
Browser headers
User agents
Plugin info
Browser fonts
You can simulate browser headers and learn about some fingerprinting "attributes" here: https://panopticlick.eff.org
You can try to replicate how a browser behaves and inject similar headers and user agent. Plain curl/wget are unlikely to satisfy those conditions; even tools like PhantomJS occasionally get blocked. There is a reason some prefer tools like Selenium WebDriver, which launches an actual browser.
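As a starting point, a minimal sketch of a curl call that mimics a browser request (the header values are illustrative, borrowed from the browser requests earlier in this thread, and there is no guarantee they pass Akamai's checks):
$ curl -I http://www.adorama.com/images/large/CHHB74P.JPG \
    -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36' \
    -H 'Accept: image/webp,image/*,*/*;q=0.8' \
    -H 'Accept-Language: en-US,en;q=0.8' \
    -H 'Referer: http://www.adorama.com/'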
I found that another URL, also protected by AkamaiGHost, was blocking requests due to certain parts of the user agent. In particular, a user agent containing a link with a protocol was blocked:
Using curl -H 'User-Agent: some-user-agent' https://some.website I found the following results for different user agents:
Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:70.0) Gecko/20100101 Firefox/70.0: okay
facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php): 403
https ://bar: okay
https://bar: 403
All I could find for now is this (downvoted) answer, https://stackoverflow.com/a/48137940/230422, stating that colons (:) are not allowed in header values. That is clearly not the only thing happening here, as the Mozilla example also has a colon, only not in a link.
I guess that at least most web servers don't care and allow Facebook's bot and other bots that have a contact URL in their user agent. But apparently AkamaiGHost does block it.

MISS from Cloudfront after HIT from Cloudfront

I am switching to Amazon CloudFront for serving images on my website. To reduce load when we finally make it live, I thought of warming up the cache by hitting image URLs (I am making these requests from India and expect the majority of users to request from the same region, so there is no need to have a copy of the object on all edge locations worldwide).
The problem is that the script uses curl to request the image, and when I access the same URL in a browser I get a MISS from CloudFront. So CloudFront is keeping two copies of the object for these two requests.
My current CloudFront configuration forwards the Content-Type request header to the origin.
How should I configure CloudFront so that it doesn't care about request headers at all, and once I have made a request (whether via curl or a browser), it serves all future requests for the same resource from the edge and not the origin?
Request/Response headers-
I'm afraid the CloudFront URL won't be accessible from outside (until we go live), but I am posting the request/response headers, which should give you a fair idea. You can also check out the caching headers at the origin: https://origin.ixigo.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
Response after two successive requests using a browser:
Remote Address:54.230.156.66:443
Request URL:https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
Request Method:GET
Status Code:200 OK
Response Headers
Accept-Ranges:bytes
Age:23
Cache-Control:public, max-age=31557600
Connection:keep-alive
Content-Length:8708
Content-Type:image/jpg
Date:Fri, 27 Nov 2015 09:16:03 GMT
ETag:"-170562206"
Last-Modified:Sun, 29 Jun 2014 03:44:59 GMT
Vary:Accept-Encoding
Via:1.1 7968275877e438c758292828c0593684.cloudfront.net (CloudFront)
X-Amz-Cf-Id:fcbGLv8uBOP89qfR52OWa-NlqWkEREJPpZpy9ix0jdq8-a4oTx7lNw==
X-Backend:image6_40
X-Cache:Hit from cloudfront
X-Cache-Hits:0
X-Device:pc
X-DeviceType:pc
X-Powered-By:xyz
Now the same URL requested using curl, which gave me a miss:
manu-mdc:cache manuc$ curl -I https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
HTTP/1.1 200 OK
Content-Type: image/jpg
Content-Length: 8708
Connection: keep-alive
Age: 0
Cache-Control: public, max-age=31557600
Date: Fri, 27 Nov 2015 09:16:47 GMT
ETag: "-170562206"
Last-Modified: Sun, 29 Jun 2014 03:44:59 GMT
X-Backend: image6_40
X-Cache-Hits: 0
X-Device: pc
X-DeviceType: pc
X-Powered-By: xyz
Vary: Accept-Encoding
X-Cache: Miss from cloudfront
Via: 1.1 4d42171c56a4c8b5c627040e6aa0938d.cloudfront.net (CloudFront)
X-Amz-Cf-Id: fY0LXhp7NlqB-I8F5-1TIMnA6bONjPD3CEp7dsyVdykP-7N2mbffvw==
Now this will give a HIT:
manu-mdc:cache manuc$ curl -I https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
HTTP/1.1 200 OK
Content-Type: image/jpg
Content-Length: 8708
Connection: keep-alive
Cache-Control: public, max-age=31557600
Date: Fri, 27 Nov 2015 09:16:47 GMT
ETag: "-170562206"
Last-Modified: Sun, 29 Jun 2014 03:44:59 GMT
X-Backend: image6_40
X-Cache-Hits: 0
X-Device: pc
X-DeviceType: pc
X-Powered-By: xyz
Age: 3
Vary: Accept-Encoding
X-Cache: Hit from cloudfront
Via: 1.1 6877899d48ba844a34ea4378ce336f06.cloudfront.net (CloudFront)
X-Amz-Cf-Id: qpPhbLX_5t2Xj0XZuZdjWD2w-BI80DUVyL496meQkLfSEn3ikt7hNg==
This is similar to this issue: Why are two requests with different clients from the same computer cache misses on cloudfront?
Depending on whether or not you provide the "Accept-Encoding: gzip" header, the CloudFront edge server caches the object separately. Since browsers provide this header by default, and your site is likely to be accessed mostly via a browser, I suggest changing your curl call to include this header.
I was facing the same problem; after making the change in my curl call, I started to get a hit from the browser on my first try via browser (after making a curl call).
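For example, a cache-warming call that matches what browsers send might look like this (using the placeholder URL from the question):
$ curl -I -H 'Accept-Encoding: gzip' https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg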
Another thing I noticed is that CloudFront requires the full requested object to be downloaded before it will be cached. If you try to download the file partially by specifying a byte range in curl, the intended object does not get cached; only the downloaded part gets cached as a different object. The same goes for a curl that was terminated midway. The other option I tried was a wget call with the spider option, but that internally does a HEAD call only and thus does not get the content cached on the edge server.
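In other words, sketches like these would not warm the cache, per the above (same placeholder URL):
$ curl -r 0-1023 -o /dev/null https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
$ wget --spider https://youcannotaccess.com/image/upload/t_thumb,f_auto/r7y6ykuajvlumkp4lk2a.jpg
The range request caches only the downloaded part as a separate object, and the spider call issues a HEAD, so neither populates the edge cache with the full image.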