Single Sign-On in Firefox or Chrome causes a 502 NGINX error while IE works - django

I have a Django + NGINX setup that integrates with single sign-on (SSO). Recently we had to change domain names; we are using Akamai to spoof the new URL while the old domain still resolves to our load balancer.
SSO login attempts succeed in IE, but in Chrome or Firefox they produce a 502 error instead.
When IE logs in, there is a POST from oktapreview.com that generates a 302.
When it's Firefox or Chrome, there are three consecutive POSTs from oktapreview.com, each of which produces a 502. The first two POSTs have identical timestamps and the third arrives 3-4 seconds later. In both Firefox and Chrome, the user finds upon refreshing that they are actually logged in.
Any advice on what is causing this? Why are there three logged POSTs from the SSO server? And why would IE (not Edge, but IE) work while Chrome and Firefox fail?

For future seekers, here is the resolution:
SSO produces a huge URL string, which first hits the server's HTTP buffers. My NGINX and uWSGI buffers were at their default sizes, around 4 KB each, but Okta's SSO was creating URLs on the order of 20 KB in their own right. I had to enlarge the HTTP buffers in both pieces of software to stop that string from being chopped into bits. The error surfaced as nothing more helpful than the generic 502, but expanded buffers resolved it. In short, keep in mind that SSO adds a lot to an HTTP header.
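A minimal sketch of the kind of change involved, assuming nginx in front of uWSGI (the directive names are real; the sizes are illustrative and should be tuned to your actual header sizes):

# nginx.conf, http or server context: allow request headers up to 32 KB
large_client_header_buffers 4 32k;

# uwsgi.ini: raise uWSGI's request buffer from its 4 KB default
buffer-size = 32768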

Related

Problem handling cookies for Blazor Server using OpenID server (Keycloak)

I have a baffling issue with cookie handling in a Blazor Server app (.NET Core 6) using OpenID (Keycloak). Actually, more than a couple of issues, which may or may not be linked. It's a typical (?) reverse-proxy architecture:
A central nginx receives queries for services like Jenkins, JupyterHub, SonarQube, Discourse, etc. These are mapped through aliases to internal IPs where the nginx can access them. This nginx intercepts URLs like: https://hub.domain.eu
A second reverse proxy, reached at https://dsc.domain.eu, which forwards requests to a Blazor app running in Kestrel on port 5001. Both Kestrel and nginx run under SSL – required to get the websockets working.
Some required background: the Blazor app is essentially a 'hub' whose various Razor pages 'host' the above-mentioned services in iframe-like fashion. How it works: when the user asks for the root path (https://hub.domain.eu), it opens the root page of the Blazor app (/).
The nav menu contains links to the Razor pages that contain the iframes for the above-mentioned services.
Each such relative path is intercepted by the 'central' nginx, which loads the corresponding service (e.g. Jenkins). Everything is under the same Keycloak OpenID server. Note that everything works fine without the Blazor app.
Scenarios that cause the same problem
Assume the user logs into my app using the login page of Keycloak (NOT the REST API) through redirection, then proceeds to a link and is indeed logged in there as well. The controls in the app change accordingly to indicate that the user is authenticated. If you close the tab and open a new one, the Blazor app acts as if it's not logged in, while the other services (e.g. Jenkins) still show the logged-in user from before. When you press the Login link, you're greeted with a 502 nginx error. If you clear the cookies from the browser (or use private/stealth mode), everything works again; so does simply logging off, e.g. from Jenkins.
Assume the user is now in a service such as Jenkins or SonarQube. If you press F5, there are two problems: you get a 404 error, but only on SOME services such as SonarQube and not others (a side problem for another post). The main thing is that the Blazor app again appears not logged in after pressing Back / Refresh.
The critical part of Program.cs sets up the cookie and OpenID Connect authentication, and a separate class handles login / logoff. [Code omitted from this excerpt.]
Side notes:
SaveTokens = false still causes large-header errors and results in an empty token (the code logs the warning 'Token received was null'). I'm still able to obtain user details from HttpContext, though.
No errors show up in the reverse proxy's error.log or in Kestrel (all deployed on Linux).
MOST important: if I copy-paste the failed login link (the one that produced the 502 error) into a "clean" browser, it works fine.
There are lots of properties affecting OpenID Connect, and it could also be an nginx issue, but I've run out of ideas over the last five days. The nginx config has been accommodated for large headers and websockets.
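That accommodation usually follows this pattern on the proxy in front of Kestrel (a sketch, not the poster's actual file; the upstream address and buffer sizes are illustrative):

# forward Blazor's websocket (SignalR) traffic and allow larger headers
location / {
    proxy_pass https://localhost:5001;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
    proxy_buffer_size 16k;
    proxy_buffers 8 16k;
}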
Any clues as to where I should at least focus my research to track down the error?
The 502 error points to NGINX's side. The reverse proxy had the proper configuration but, as it turned out, the front one did not. Once we set the header buffer sizes on the front nginx to the suggested values, everything worked.
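For reference, the header-size accommodation on the front nginx typically looks like this (a sketch with illustrative sizes; tune them to the size of your Keycloak cookies and headers):

# front nginx: enlarge buffers so large OIDC cookies/headers fit
large_client_header_buffers 4 32k;
proxy_buffer_size 16k;
proxy_buffers 8 16k;
proxy_busy_buffers_size 32k;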

How best to manage Cloudfront/Nginx 502 Bad Gateway errors in AWS

We have a website which is served over CloudFront. Sometime this week the origin EC2 (ECS) server crashed, and for a short time it started returning 502 errors:
502 Bad Gateway | Nginx
This issue was resolved quickly, but a couple of users have still been seeing the errors in their browsers. They are both using Google Chrome, and the problem seems to be constant (as if the browser/CloudFront has cached the error). One user fixed the issue by entering Incognito mode; the other sees the issue every time they click a link from our newsletter. Some other users have fixed the issue only by using a different browser.
I am unsure how to start debugging this. I'd also imagine that if the browser receives a 502 error it wouldn't cache the page content. Also, I'm unable to replicate the problem from my end.
To add extra information to the question:
I'm not looking for advice on how to stop or manage 502 Bad Gateway errors. We know why these happen(ed); this question is purely asking for advice on fixing cached 502 errors after they have been delivered to the user.
From the feedback so far, it looks like we can make CloudFront stop caching 502 errors after 10 seconds (the Error Caching Minimum TTL). This was enabled, but the issue still persists.
My feeling here is that the user's browser has cached the 502 error page and isn't requesting an update from the server. Without getting them to clear their cache, is there a way to set CloudFront or their browser to cache a 502 error only for a short period before requesting an updated page from the server?
Also, thinking about this again: the error is '502 Bad Gateway | Nginx', so is this even coming from CloudFront? Could my server be sending long Cache-Control headers with its 502 errors?
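If the origin turns out to be the culprit, one idea would be to serve error pages with an explicit no-store header so that neither CloudFront nor the browser holds on to them (a sketch; the error-page path is illustrative):

# nginx: serve 50x errors with a header that forbids caching downstream
error_page 502 503 /50x.html;
location = /50x.html {
    internal;
    add_header Cache-Control "no-store" always;
}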
After going down a lot of dead ends, I finally found a solution to this issue. Apologies: the initial question was incorrect in its assumptions, but thanks for everyone's input anyway. My previous experience of 502 errors was limited to instances where the origin server went down, so when a small number of our users started receiving constant 502 errors while the server was functioning correctly, I immediately assumed a CloudFront caching issue: the origin server had crashed, and the 502 error was being cached for these unfortunate users.
After more debugging, the actual issue turned out to be a large (and growing) cookie that was set when the user came to the website from our emails. If the user wasn't logged in, the cookie would accumulate more data over time and grow in size. It was capped by the maximum size of a cookie, but that didn't account for Nginx's header limits, so it triggered an 'upstream sent too big header' error, hence the 502. Removing the cookie and increasing the header limits fixed the issue. We will lower the limits again over time, once the cookie has been deleted or has expired for our users.
The nginx error log showed:
upstream sent too big header while reading response header from upstream
and the fix was to increase the FastCGI buffers from:
fastcgi_buffers 8 16k;
to:
fastcgi_buffers 16 16k;
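Worth noting: nginx reads the first part of the upstream response, including its headers, into a single buffer controlled by fastcgi_buffer_size (or proxy_buffer_size behind proxy_pass), so that directive often needs raising together with the buffer count (a sketch with illustrative sizes):

fastcgi_buffer_size 32k;
fastcgi_buffers 16 16k;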
If you have a cached 502 error, do an invalidation; this cleans the cache for all your users.
Cloudfront -> Distributions -> Your Distribution -> Invalidations Tab -> Create Invalidation -> Put in the textbox "/*" without quotation marks -> Invalidate
And that's all.
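The same invalidation can be issued from the AWS CLI (the distribution ID below is a placeholder):

aws cloudfront create-invalidation --distribution-id EDFDVBD6EXAMPLE --paths "/*"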
I also suggest you research why you got the Bad Gateway in the first place (maybe the load on a specific day of the week) and schedule more containers for that day at that specific hour. :)

nginx API cross origin calls not working only from some browsers

TLDR: the React app's API calls return status code 200, but with no body in the response; this happens only when the web app is accessed from some browsers.
I have a React + Django application deployed using nginx and uWSGI on a single CentOS 7 VM.
The React app is served by nginx on the main domain, and when users log in on the JavaScript app, REST API requests are made to the same nginx on a subdomain (i.e. backend.mydomain.com) for things like validating the token and fetching data.
This works on all recent versions of Firefox, Chrome, Safari, and Edge. However, some users have complained that they could not log in from their work network. They can visit the site, so obviously the JavaScript application is served to them, but when they log in, all of the requests come back with status 200, except the response has an empty body (and logging in requires a few pieces of information to be sent back with the login response in order to work).
For example, when I log in from where I am, I get a response with status=200 and a JSON object with a few parameters in its body.
But when one of the users showed me the same flow from their browser, they got status=200 back with an empty response. They are using the same browser versions as I have, and they tried both Firefox and Chrome with the same behaviour.
After finally getting hold of one of the users and having them send me some screenshots, I found the problem. In my browser, where the site works, the API calls to the backend had a Referrer Policy of strict-origin-when-cross-origin in the headers. In their browser, the same calls showed no-referrer-when-downgrade.
I had not explicitly set a referrer policy, so each browser was using its own default value, and that default differs between browser versions (https://developers.google.com/web/updates/2020/07/referrer-policy-new-chrome-default)
To fix this, I added add_header 'Referrer-Policy' 'strict-origin-when-cross-origin'; to the nginx.conf file and restarted the server. More details here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Referrer-Policy
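In context, the directive sits in the server block that serves the site; a minimal sketch (paths and names are placeholders, not the poster's actual config):

server {
    listen 443 ssl;
    server_name mydomain.com;

    # pin the referrer policy instead of relying on shifting browser defaults
    add_header 'Referrer-Policy' 'strict-origin-when-cross-origin';

    location / {
        root /var/www/react-app;
        try_files $uri /index.html;
    }
}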
The users who had trouble before can now access the site's API resources after clearing their browser caches.
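A quick way to verify the header is being sent (the URL is a placeholder):

curl -sI https://mydomain.com | grep -i referrer-policy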

Ember App: Looking for a workaround for a Safari bug that ignores the '#' in my URL on HTTPS redirects by CloudFront

UPDATE 2:
I found the reason, but not a solution yet. This has to do with how Safari deals with the '#' when CloudFront redirects HTTP to HTTPS: Safari drops the fragment, and apparently this is a bug that has existed in Safari for years. I'm not 100% sure this is my issue, but it seems to be. Still looking for a solution.
END UPDATE 2
For some reason I'm having trouble figuring out, Safari (mobile and desktop) acts differently from Chrome and Firefox when I refresh a page or try to access a route directly in my app.
I have a playlists route:
Router.map(function() {
  ...
  this.resource("playlists", function () {});
  ...
});
I can hit the playlists route directly with rooturl.com/playlists in Chrome and Firefox, and in the console logs I see this:
Attempting URL transition to /playlists
When I try to hit the playlists route directly in Safari, I see this:
Attempting URL transition to /
Another strange thing: when I use my localhost, the transition is correct in all browsers, including Safari (mobile and desktop). This makes me think it has something to do with the production environment. I'm using AWS S3 and CloudFront, but I'm not sure whether that has anything to do with it.
I can provide more information here if asked.
UPDATE:
When the route is given as a fragment (after a '#') in the URL, Safari redirects correctly. So this redirects correctly:
example.com/#/playlists
But this does not:
example.com/playlists
Again, this problem only occurs in production, on AWS S3/CloudFront. On localhost, Safari works as expected.
In my case, this was fixed by making sure the protocol is never switched from HTTPS to HTTP at any point. Specifically, S3 was issuing a redirect over HTTP while the original request was over HTTPS, which leads to Safari stripping the request of any (potentially) sensitive data, including the fragment.
I fixed it by adding HTTPS to our S3 redirect setup.
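For illustration, forcing the redirect protocol in an S3 website configuration can be done like this (bucket and host names are placeholders):

aws s3api put-bucket-website --bucket example-redirect-bucket --website-configuration '{"RedirectAllRequestsTo": {"HostName": "example.com", "Protocol": "https"}}'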

OpenGraph Debugger reporting bad HTTP response codes

For a number of sites that are functioning normally, when I run them through the OpenGraph debugger at developers.facebook.com/tools/debug, Facebook reports that the server returned a 502 or 503 response code.
These sites are clearly working fine on servers that are not under heavy load. URLs I've tried include but are not limited to:
http://ac.mediatemple.net
http://freespeechforpeople.org
These are in fact all sites hosted by MediaTemple. After talking to people at MediaTemple, though, they've insisted that it must be a bug in the API and is not an issue on their end. Is anyone else getting unexpected 500/502/503 HTTP response codes from the Facebook Debug tool, whether for sites hosted by MediaTemple or anyone else? Is there a fix?
Note that I've reviewed the Apache logs on one of these sites and could find no evidence of Apache receiving the request from Facebook, or of a 502 response, etc.
I got this response from them:
At this time, it would appear that (mt) Media Temple servers are returning 200 response codes to all requests from Facebook, including the debugger. This can be confirmed by searching your access logs for hits from the debugger. For additional information regarding viewing access logs, please review the following KnowledgeBase article:
Where are the access_log and error_log files for my server?
http://kb.mediatemple.net/questions/732/Where+are+the+access_log+and+error_log+files+for+my+server%3F#gs
You can check your access logs for hits from Facebook by using the following command:
cat <name of access log> | grep 'facebook'
This will return all hits from Facebook. In general, the debugger will specify the user-agent 'facebookplatform/1.0 (+http://developers.facebook.com),' while general hits from Facebook will specify 'facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php).'
Using this information, you can perform even further testing by using 'curl' to emulate a request from Facebook, like so:
curl -Iv -A "facebookplatform/1.0 (+http://developers.facebook.com)" http://domain.com
This should return a 200 or 206 response code.
In summary, all indications are that our servers are returning 200 response codes, so it would seem that the issue is with the way that the debugger is interpreting this response code. Bug reports have been filed with Facebook, and we are still working to obtain more information regarding this issue. We will be sure to update you as more information becomes available.
So the good news is that they are busy solving it. The bad news is that it's out of our control.
There's a forum post on the matter here:
https://forum.mediatemple.net/topic/6759-facebook-503-502-same-html-different-servers-different-results/
With more than 800 views and recent activity, it states that they are working hard on it.
I also noticed that HTTPS MT sites don't even give a return code:
Error parsing input URL, no data was scraped.
RESOLUTION
MT admitted it was their fault and fixed it:
During our investigation of the Facebook debugger issue, we have found that multiple IPs used by this tool were being filtered by our firewall due to malformed requests. We have whitelisted the range of IP addresses used by the Facebook debugger tool at this time, as listed on their website, which should prevent this from occurring again.
We believe our auto-banning system has been blocking several Facebook IP addresses. This was not immediately clear upon our initial investigation and we apologize this was not caught earlier.
The reason API requests may intermittently fail is because only a handful of the many Facebook IP addresses were blocked. The API is load-balanced across several IP ranges. When our system picks up abusive patterns, like HTTP requests resulting in 404 responses or invalid PUT requests, a global firewall rule is added to mitigate the behavior. More often than not, this system works wonderfully and protects our customers from constant threats.
So, that being said, we've been in the process of whitelisting the Facebook API ranges today and confirming that our system is no longer blocking these requests. We'd still like those affected to confirm whether the issue persists. If for any reason you're still having problems, please open up or respond to your existing support request.