How best to manage CloudFront/Nginx 502 Bad Gateway errors in AWS

We have a website which is served over CloudFront. Sometime this week the origin EC2 (ECS) server crashed and for a short time it started returning 502 errors:
502 Bad Gateway | Nginx
This issue was resolved quickly, but a couple of users are still seeing the errors in their browsers. They are both using Google Chrome, and the problem seems to be persistent (as if the browser or CloudFront has cached the error). One user fixed the issue by entering Incognito mode; the other sees the issue every time they click a link from our newsletter. Some other users have fixed the issue only by switching to a different browser.
I am unsure how to start debugging this. I'd imagine that if the browser receives a 502 error it wouldn't cache the page content, and I'm unable to replicate the issue from my end.
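One quick check from any machine: CloudFront stamps its responses with Via and X-Cache headers ("Hit from cloudfront" vs. "Miss from cloudfront"), so you can at least confirm whether a URL is being served out of its cache. A sketch, with a placeholder URL:
# Inspect the caching-related headers CloudFront attaches (URL is a placeholder)
curl -sI https://www.example.com/page-from-the-newsletter | grep -iE 'via|x-cache|cache-control'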
To add extra information to the question:
I'm not looking for advice on how to stop or manage 502 Bad Gateway errors. We know why these happen(ed); this question is purely about how to fix cached 502 errors after they have been delivered to the user.
From the feedback so far, it looks like we can make CloudFront stop caching 502 errors after 10 seconds. This was enabled, but the issue still persists.
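(For reference, the setting in question is the Error Caching Minimum TTL on a CloudFront custom error response. A sketch of how to inspect it from the AWS CLI, where the distribution ID is a placeholder:)
# Show the custom error responses configured on the distribution
aws cloudfront get-distribution-config --id E1234EXAMPLE \
    --query 'DistributionConfig.CustomErrorResponses'
# Expected shape for a 10-second 502 TTL:
# { "Quantity": 1, "Items": [ { "ErrorCode": 502, "ErrorCachingMinTTL": 10 } ] }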
My feeling here is that the user's browser has cached the 502 error page and isn't requesting an update from the server. Without getting them to clear their cache, is there a way to set CloudFront or their browser to cache a 502 error only for a short period before requesting an updated page from the server?
Also, thinking about this again: the error is '502 Bad Gateway | Nginx', so is this even coming from CloudFront? Could my server be sending long Cache-Control headers with its 502 errors?
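If that turns out to be the case, one way to stop it on the nginx side is to serve error pages with an explicit no-store header. A minimal sketch, where the error page location and root path are illustrative, and the "always" parameter makes nginx attach the header to non-2xx/3xx responses as well:
error_page 502 /502.html;
location = /502.html {
    internal;
    root /usr/share/nginx/html;  # illustrative path to a static error page
    add_header Cache-Control "no-store, max-age=0" always;
}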

After going down a lot of dead ends, I finally found a solution to this issue. Apologies: the initial question was incorrect in its assumptions, but thanks for everyone's input anyway. My previous experience of 502 errors was limited to instances where the origin server went down, so when a small number of our users started receiving constant 502 errors while the server was functioning correctly, I immediately assumed it was a CloudFront caching issue: the origin server had crashed, and the 502 error was being cached for these unfortunate users.
After more debugging, the actual issue turned out to be a large (and growing) cookie that was set when the user came to the website from our emails. If the user wasn't logged in, the cookie would accumulate more data over time and grow in size. It was capped at the maximum size of a cookie, but we hadn't accounted for Nginx's header limits, so it triggered an 'upstream sent too big header' error, hence the 502. Removing the cookie and increasing the header limits fixed the issue. We will lower the limits again over time, once the cookie has been deleted or has expired for our users.
The Nginx error log showed:
upstream sent too big header while reading response header from upstream
and the fix was to update:
fastcgi_buffers 8 16k;
to:
fastcgi_buffers 16 16k;
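For reference, a minimal sketch of where these directives sit, assuming a PHP-FPM-style FastCGI backend (the socket path and sizes are illustrative; fastcgi_buffer_size is the directive that specifically bounds the first part of the response, including its headers):
location ~ \.php$ {
    include fastcgi_params;
    fastcgi_pass unix:/run/php/php-fpm.sock;  # illustrative upstream socket

    # Buffer holding the first chunk of the upstream response,
    # which includes the response headers (and any large cookies).
    fastcgi_buffer_size 32k;

    # Count and size of the buffers for the rest of the response.
    fastcgi_buffers 16 16k;
}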

If you have a cached 502 error, run an invalidation; this clears the cache for all your users.
CloudFront -> Distributions -> Your Distribution -> Invalidations tab -> Create Invalidation -> enter "/*" (without quotation marks) in the text box -> Invalidate
And that's all.
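The same thing can be done from the AWS CLI (a sketch; the distribution ID is a placeholder):
# Invalidate every cached path in the distribution
aws cloudfront create-invalidation \
    --distribution-id E1234EXAMPLE \
    --paths "/*"
Note that a path containing the * wildcard counts as a single path against CloudFront's monthly invalidation quota, however many files it matches.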
I also suggest you research why you got the Bad Gateway in the first place (maybe scale on a specific day of the week) and schedule more containers for that day and hour. :)

Related

Canceling multiple HTTP/2 requests leads to unrecoverable connection timeouts

I have a native iOS app that loads multiple videos from AWS CloudFront via HTTP/2 requests. If the user skips to the next video, I cancel the request and start the new one.
After some cancels, I get timeouts for the following requests, and the connection seems to be unrecoverably broken.
Edit: CloudFront monitoring shows only 200 responses, so no errors/timeouts are reported. Using Charles Proxy for debugging shows the new requests, but they never receive any data.
To check whether this is an iOS problem, I rebuilt the same logic in Node.js (using got) and ran into the same problems, so it's not iOS-related.
When using axios (which only supports HTTP/1.1) for the actual requests in Node, everything worked as expected.
So I tried disabling HTTP/2 for my CloudFront distribution, and after that the iOS implementation also worked.
Is this a known problem with HTTP/2, that canceling requests can lead to timeouts? I tried searching the web/SO but couldn't find anything helpful.
How can I get this to work with HTTP/2? Or should I just keep using HTTP/1.1?
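For anyone who wants to reproduce this, here is a minimal sketch of the Node.js logic described above, assuming got v11 (which supports HTTP/2 via the http2: true option); the CloudFront URL is a placeholder:
// repro-sketch.js: start and cancel HTTP/2 downloads, like a user skipping videos
const got = require('got');

const url = 'https://dxxxxxxxxxxxxx.cloudfront.net/videos/sample.mp4';

async function skipThroughVideos(cancels) {
    for (let i = 0; i < cancels; i++) {
        const request = got(url, { http2: true });
        setTimeout(() => request.cancel(), 100); // cancel mid-flight
        try {
            await request;
        } catch (error) {
            if (!request.isCanceled) throw error; // only swallow our own cancels
        }
    }
    // After several cancels, this follow-up request is the one that times out
    // when the HTTP/2 connection is left in the broken state.
    await got(url, { http2: true, timeout: { request: 5000 } });
}

skipThroughVideos(10)
    .then(() => console.log('connection survived'))
    .catch((error) => console.error('follow-up request failed:', error.message));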

CloudFront 403 bypass

To give some context: I am trying to participate in a bug bounty program for the first time, and I found out which CloudFront URL is serving content to a website; the content being served is in JSON format.
Each time I try to access the information through the website's URL, I get the following message:
{"error":"You need to sign in or sign up before continuing."}
If I try to access the CloudFront URL (xxxxxxx.cloudfront.net) directly, I get:
403 ERROR
The request could not be satisfied.
Bad request. We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.
If you provide content to customers through CloudFront, you can find steps to troubleshoot and help prevent this error by reviewing the CloudFront documentation.
This is followed by "Generated by cloudfront (CloudFront)" and a request ID.
I would like to know if somebody knows of a good article that explains how to bypass those 403 or 401 responses and obtain the JSON output.
Thank you very much in advance for any information provided.

Single Sign On in FF or Chrome creates 502 NGINX Error while IE works

I have a Django/NGINX setup that integrates with single sign-on (SSO). Recently we had to change domain names, and we are using Akamai to spoof the new URL while the old domain still resolves to our load balancer.
SSO login attempts are successful in IE, but in Chrome or Firefox there is a 502 error instead.
When IE logs in, there is a POST from oktapreview.com that generates a 302.
With Firefox or Chrome, there are 3 consecutive POSTs from oktapreview.com, each of which produces a 502. The first 2 POSTs have identical timestamps and the 3rd is 3-4 seconds later. In both Firefox and Chrome, upon refreshing, the user finds they are actually logged in.
Any advice on what is causing this? Why are there 3 logged POSTs from the SSO server? Why would IE (not Edge, but IE) work while Chrome and FF fail?
For future seekers here is the resolution:
SSO creates a huge URL string, which first hits the server's HTTP buffer. My NGINX and uWSGI HTTP buffers were at their default levels, around 4 KB each, but the Okta SSO was creating URLs that were something like 20 KB in their own right. I had to expand the HTTP buffers for both pieces of software to prevent that string from being chopped into bits. The error message was an unhelpful 502, but it was resolved with expanded buffers. In short, keep in mind that SSO adds a lot to an HTTP header.
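Also for future seekers, a minimal sketch of the buffer settings involved on both sides (the sizes and the uWSGI socket path are illustrative; large_client_header_buffers is the nginx directive that bounds the request line and header size):
# nginx (http or server context): allow large request URLs/headers,
# e.g. a ~20 KB SSO redirect (default is 4 buffers of 8k)
large_client_header_buffers 4 32k;

location / {
    include uwsgi_params;
    uwsgi_pass unix:/run/uwsgi/app.sock;  # illustrative socket

    # buffers for the response coming back from uWSGI
    uwsgi_buffer_size 32k;
    uwsgi_buffers 8 32k;
}

# uWSGI side (uwsgi.ini): raise the request buffer above the 4096-byte default
# [uwsgi]
# buffer-size = 32768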

Django add exemption to Same Origin Policy (only for one site)

I am getting errors thrown from googleads.g.doubleclick.net when I try to load Google ads on my site through plain HTML.
Blocked a frame with origin "https://googleads.g.doubleclick.net" from accessing a frame with origin "https://example.com". Protocols, domains, and ports must match.
Oddly enough, I have a section of my site where I add some ads through JavaScript, and that section does not throw any errors.
I read about adding a crossdomain.xml to the site root and tried that (and also tried serving it with NGINX), but that does not work either.
Is there any way to add an exception to Django's CSRF rules, or any other way to get around this? It is driving me nuts. This error is only thrown in Safari (I've only tried Safari and Chrome), but it adds a LOT to the data transfer for loading the page, and I do not want things to be slowed down.
This has nothing to do with CSRF; rather, it is caused by the same-origin policy security restriction, which you can address by implementing CORS and sending the appropriate headers.
You can use django-cors-headers to help with this.
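A minimal sketch of the settings involved, assuming a recent release of django-cors-headers (older releases name the last setting CORS_ORIGIN_WHITELIST; the origin is taken from the error message above):
# settings.py (sketch) -- pip install django-cors-headers
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.auth",
    "corsheaders",
]

MIDDLEWARE = [
    "corsheaders.middleware.CorsMiddleware",  # should sit above CommonMiddleware
    "django.middleware.common.CommonMiddleware",
]

# Origins allowed to make cross-origin requests to this site
CORS_ALLOWED_ORIGINS = [
    "https://googleads.g.doubleclick.net",
]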

OpenGraph Debugger reporting bad HTTP response codes

For a number of sites that are functioning normally, when I run them through the OpenGraph debugger at developers.facebook.com/tools/debug, Facebook reports that the server returned a 502 or 503 response code.
These sites are clearly working fine on servers that are not under heavy load. URLs I've tried include but are not limited to:
http://ac.mediatemple.net
http://freespeechforpeople.org
These are in fact all sites hosted by MediaTemple. After talking to people at MediaTemple, though, they've insisted that it must be a bug in the API and not an issue on their end. Is anyone else getting unexpected 500/502/503 HTTP response codes from the Facebook debug tool, with sites hosted by MediaTemple or anyone else? Is there a fix?
Note that I've reviewed the Apache logs on one of these sites and could find no evidence of Apache receiving the request from Facebook, or of a 502 response, etc.
Got this response from them:
At this time, it would appear that (mt) Media Temple servers are returning 200 response codes to all requests from Facebook, including the debugger. This can be confirmed by searching your access logs for hits from the debugger. For additional information regarding viewing access logs, please review the following KnowledgeBase article:
Where are the access_log and error_log files for my server?
http://kb.mediatemple.net/questions/732/Where+are+the+access_log+and+error_log+files+for+my+server%3F#gs
You can check your access logs for hits from Facebook by using the following command:
cat <name of access log> | grep 'facebook'
This will return all hits from Facebook. In general, the debugger will specify the user-agent 'facebookplatform/1.0 (+http://developers.facebook.com),' while general hits from Facebook will specify 'facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php).'
Using this information, you can perform even further testing by using 'curl' to emulate a request from Facebook, like so:
curl -Iv -A "facebookplatform/1.0 (+http://developers.facebook.com)" http://domain.com
This should return a 200 or 206 response code.
In summary, all indications are that our servers are returning 200 response codes, so it would seem that the issue is with the way that the debugger is interpreting this response code. Bug reports have been filed with Facebook, and we are still working to obtain more information regarding this issue. We will be sure to update you as more information becomes available.
So the good news is that they are busy solving it. The bad news is that it's out of our control.
There's a forum post on the matter here:
https://forum.mediatemple.net/topic/6759-facebook-503-502-same-html-different-servers-different-results/
With more than 800 views and recent activity, it indicates that they are working hard on it.
I noticed that HTTPS MT sites don't even give a return code:
Error parsing input URL, no data was scraped.
RESOLUTION
MT admitted it was their fault and fixed it:
During our investigation of the Facebook debugger issue, we have found that multiple IPs used by this tool were being filtered by our firewall due to malformed requests. We have whitelisted the range of IP addresses used by the Facebook debugger tool at this time, as listed on their website, which should prevent this from occurring again.
We believe our auto-banning system has been blocking several Facebook IP addresses. This was not immediately clear upon our initial investigation and we apologize this was not caught earlier.
The reason API requests may intermittently fail is because only a handful of the many Facebook IP addresses were blocked. The API is load-balanced across several IP ranges. When our system picks up abusive patterns, like HTTP requests resulting in 404 responses or invalid PUT requests, a global firewall rule is added to mitigate the behavior. More often than not, this system works wonderfully and protects our customers from constant threats.
So, that being said, we've been in the process of whitelisting the Facebook API ranges today and confirming that our system is no longer blocking these requests. We'd still like those affected to confirm whether the issue persists. If for any reason you're still having problems, please open a new support request or respond to your existing one.