Match browsers set to Scandinavian languages based on "Accept-Language" - regex

Question
I am trying to match browsers set to Scandinavian languages based on HTTP header "Accept-Language".
My regex is:
^(nb|nn|no|sv|se|da|dk).*
My question is whether this is sufficient, and whether anyone knows of any other odd Scandinavian (but "valid") language codes or obscure browser bugs that cause false positives?
Used for
The regex is used for displaying an English link at the top of the Norwegian web pages (Norwegian is the primary language and lives at the root of the domain and sub-domains) that takes you to the English web pages (the secondary language, in a folder under root) when the browser language is not Scandinavian. The link can be closed / "opted out" of, with a hash stored in JavaScript localStorage, if the user doesn't want to see it again. We decided not to use IP geolocation because of limited time to implement.

Depending on the language you are working in, there may be code in place you can use to parse this easily, e.g. this post: Parse Accept-Language header in Java, which also provides a good code example.
Further: are you sure you want to anchor your regex to the start of the string? Several languages can be provided, and the first is intended as "I prefer x but also accept the following": http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
Otherwise your regex should work fine based on what you were asking, and here is a list of all browser language codes: http://www.metamodpro.com/browser-language-codes
In your shoes, I would also make the "switch to X language" link easy to find for all users until they have opted not to see it again. I would expect many people have a preference set by default in their browser but find a site actually using it unexpected, i.e. a user experience like:
"I prefer English but don't know enough to change this setting, and have never had a reason to before, as so few sites make use of it."

That regular expression is enough if you are testing each item in Accept-Language individually.
If not individually, there are 2 problems:
One of the expected languages might appear not at the beginning of the header, but later on.
Some of the expected language abbreviations could appear as a qualifier of a completely different language (for example, "SE" as the region subtag in "en-SE").
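A minimal sketch of that per-item test in Python (the helper name and header values are invented for illustration):

```python
import re

# Match Scandinavian primary language subtags at the start of a tag,
# e.g. "nb", "nb-NO", "sv", but not "en-SE" or "nl".
SCANDINAVIAN = re.compile(r"^(nb|nn|no|sv|se|da|dk)\b", re.IGNORECASE)

def is_scandinavian(accept_language):
    """Return True if any language tag in the header is Scandinavian."""
    for part in accept_language.split(","):
        # Strip quality values like ";q=0.8" and surrounding whitespace.
        tag = part.split(";")[0].strip()
        if SCANDINAVIAN.match(tag):
            return True
    return False

print(is_scandinavian("en-US,en;q=0.8,nb;q=0.6"))  # True: "nb" appears later
print(is_scandinavian("en-GB,en;q=0.9"))           # False
```

Splitting on "," and testing each tag avoids both problems above: the Scandinavian tag is found even when it is not first, and a region subtag such as the "SE" in "en-SE" never reaches the anchored match.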

Lowercasing only the file extension in a requested URL

THE ISSUE
I believe I need to rewrite only the file extension of PDF files requested by end users to be lowercase, but don't want to lowercase the entire URL.
BACKGROUND
I have a web application which includes links to a few PDF files. These files have names in mixed case, but the file extensions are lowercase (Bobs_Your_Mom.pdf).
An older version of this application was just static web content. In that version, the files had uppercase file extensions (Bobs_Your_Mom.PDF).
For the sake of this example, I have no access to change the names of the PDF files, or to force all URLs in the application to be lowercase.
In front of this application is an Apache webserver acting as a ReverseProxy. Traffic coming in on :80 gets redirected to :443 and the proxy redirects the traffic through the internal firewall to the backend server, etc etc.
Presently, there is no manipulation of the URL requested by the end user via their web browser. However, though the old 'site' and new 'application' are obviously different from a technical perspective… roughly two months after the switch I am still getting requests for the relevant PDF files with uppercase file extensions (.PDF).
The web application actually doesn't expose the PDF files directly the way the old site did and the user has to take some special action to even make that request now.
I had been hoping this issue would settle down, as the 404 errors would alert people that changes had been made. But it has not, and I continue to receive 404s for the uppercase file names, even as of a few minutes ago.
TRIED
Check Code for Errors
I have validated with the developers and manually myself that no reference exists to PDF files in all caps in the application. This is actually how I discovered the old site did have it this way.
Ask Devs to Change App / Lowercase Entire URI / Business to Change File Names
Developers have indicated no ability/time to alter the application to force all URIs lowercase within the application itself, and the business has indicated the client doesn't want us to actually alter the file names in any way.
301/302 REDIRECTS
This change isn't really for SEO; redirecting the old file names to the new ones would be fine if I knew the 404s were coming from the same set of users with bookmarks (the old site was live for two months before switching to the web app). But requests are coming from entirely new users in entirely different geographic regions, and I cannot make sense of how so many random users would have a bookmark to a URL which existed for only two months without much publicity.
DUDE DUPLICATE ISSUE / CHECK OTHER POSTS
From what I have seen, others have needed help rewriting whole URIs, or parts of URIs that aren't the file extension, or simply hiding the extension altogether.
I am not sufficiently skilled with regex to figure this one out on my own (regex is a life struggle for me). I can't really make heads or tails of the expressions in the other posts which makes understanding what I change and why as confusing for me as regex looks to my grandmother.
YOUR HELP
However, with dozens of what I believe to be unnecessarily negative user experiences each day, I am hoping mod_rewrite and Apache can come to my rescue. (ALERT: I am regex illiterate.)
Normally on the stacks I like to ask just to be pointed in the right direction. I believe users (including myself) should be able to piece things together and get things working with only some guidance.
In this case, I have no-one around me sufficiently talented with regular expressions to assist in this quest of mine to simply convert .PDF to .pdf whenever requested in-flight.
If I can get help to convert:
Im_An_Example.PDF
to
Im_An_Example.pdf
You will be my savior this day and win 25 whole internets.
FINAL SOLUTION
The final solution, suggested by #signal2013 is as follows:
RewriteEngine on
RewriteRule ^(.*)\.PDF$ http://example.com/$1.pdf [R=301,L]
The solution is simple, and I acknowledge I was making it much more complex in my mind when trying to solve this on my own.
Yep, like #marekful said, your question is a bit too long. Are you looking for something like this? It can go in a .htaccess file.
RewriteEngine on
RewriteRule ^(.*)\.PDF$ http://example.com/$1.pdf [R=301,L]
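If requests might also arrive with other case variants such as .Pdf, a case-insensitive variant is a small extension. This is a sketch only; the RewriteCond guards against a redirect loop on names that are already lowercase:

```apache
RewriteEngine on
# Skip URLs that already end in lowercase ".pdf" (RewriteCond is
# case-sensitive by default; the rule below is not, thanks to [NC]).
RewriteCond %{REQUEST_URI} !\.pdf$
RewriteRule ^(.*)\.pdf$ /$1.pdf [NC,R=301,L]
```

Note the escaped dot in `\.pdf`: an unescaped `.` matches any character, so `^(.*).PDF$` would also match a file named "Im_An_ExampleXPDF".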

How does an XSS script get executed?

For example,
http://testsite.test/<script>alert("TEST");</script>
I know that the browser sends a request for the URL if it contains only a domain and resource path; if there is a query string, it gets sent via the GET method. But how exactly is the script executed in the client's browser?
And why would anyone "enable" XSS?
I'm learning XSS, so please help me out!
For a very basic example, say you have a form where you ask for your user's name. Then on the next page, you (as the application developer) write "Hello, [anything the user entered]". The problem is that if the user entered something like <script>alert(1)</script> as their name, a vulnerable page would print it as-is, and it would run in the browser. This is called reflected XSS, and it is only the very tip of the iceberg. For example, your users might store their real names in a database, and a query on a different page might list user names, which may also contain JavaScript that would then run in a different user's browser (stored XSS).
The solution, btw, is output encoding, which in practice means replacing certain characters with safe ones, like < and > with &lt; and &gt;, so that when the script tag is written out it is displayed rather than executed (but this is just a small portion of HTML encoding; sometimes a different encoding is needed).
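As an illustration of output encoding (not tied to any particular stack in the question), Python's standard library does this replacement with html.escape:

```python
from html import escape

# A hostile "name" entered into the form from the example above.
user_input = '<script>alert(1)</script>'

# escape() replaces &, <, > (and quotes) with HTML entities, so the
# browser renders the text literally instead of executing it.
safe = escape(user_input)
print(safe)  # &lt;script&gt;alert(1)&lt;/script&gt;
```

Writing `safe` into the page produces the visible text `<script>alert(1)</script>` with no script execution.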
So the point is that XSS is not deliberately enabled by a developer; quite the opposite, everybody wants to avoid it. It's a vulnerability in the code. However, sometimes it is not that straightforward, especially for people who are not aware of secure coding practices.
Please note that there is much more to XSS than I have mentioned here, both in how it can manifest itself and in how you can prevent it.

Language agnostic cookie encoding / decoding standards

I'm having difficulty figuring out what the standard is (or whether there is one) for encoding/decoding cookie values, regardless of backend platform.
According to RFC 2109:
The VALUE is opaque to the user agent and may be anything the origin server chooses to send, possibly in a server-selected printable ASCII encoding. "Opaque" implies that the content is of interest and relevance only to the origin server. The content may, in fact, be readable by anyone that examines the Set-Cookie header.
which sounds like "the server is the boss" and it decides whatever encoding will apply. This makes it quite difficult to set a cookie from, say, a PHP backend and read it from Python or Java or whatever, without writing manual encode/decode handling on both sides.
Let's say we have a value that needs to be encoded: the Russian /"печенье (*} значения"/, meaning "cookie value", with some additional non-alphanumeric chars in it.
Python:
Almost every WSGI server does the same and uses Python's SimpleCookie class, which encodes to / decodes from octal escape sequences, even though many say that octal literals are deprecated in ECMA-262 strict mode. Wtf?
So, our raw cookie value becomes "/\"\320\277\320\265\321\207\320\265\320\275\321\214\320\265 (*} \320\267\320\275\320\260\321\207\320\265\320\275\320\270\321\217\"/"
Node.js:
Haven't tested at all, but I'm guessing a JavaScript backend would do it with the native encodeURIComponent and decodeURIComponent functions, which use hexadecimal escaping/unescaping?
PHP:
PHP applies urlencode to cookie values, which is similar to encodeURIComponent but not exactly the same.
So the raw value becomes %2F%22%D0%BF%D0%B5%D1%87%D0%B5%D0%BD%D1%8C%D0%B5+%28%2A%7D+%D0%B7%D0%BD%D0%B0%D1%87%D0%B5%D0%BD%D0%B8%D1%8F%22%2F, which is not even wrapped in double quotes.
However, if a JavaScript variable holds the PHP-encoded value above, decodeURIComponent(value) gives /"печенье+(*}+значения"/ (note the "+" chars instead of spaces).
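The "+" mismatch is the difference between form encoding (PHP's urlencode, where a space becomes "+") and RFC 3986 percent-encoding (encodeURIComponent, where a space becomes %20). Python's standard library exposes both flavours, which makes the mismatch easy to reproduce:

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

value = 'печенье (*} значения'

# Percent-encoding, like JavaScript's encodeURIComponent: space -> %20
print(quote(value))
# Form encoding, like PHP's urlencode: space -> "+"
print(quote_plus(value))

# Decoding form-encoded data with the percent-decoder leaves "+" behind:
print(unquote(quote_plus(value)))       # 'печенье+(*}+значения'
# The matching decoder restores the spaces:
print(unquote_plus(quote_plus(value)))  # 'печенье (*} значения'
```

In other words, the data is fine; the two sides simply have to agree on which of the two conventions they are using.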
What is the situation in Java, Ruby, Perl and .NET? Which language is following (or closest) to the desired behaviour. Actually, is there any standard for this defined by W3?
I think you've got things a bit mixed up here. The server's encoding does not matter to the client, and it shouldn't. That is what RFC 2109 is trying to say here.
The concept of cookies in http is similar to this in real life: Upon paying the entrance fee to a club you get an ink stamp on your wrist. This allows you to leave and reenter the club without paying again. All you have to do is show your wrist to the bouncer. In this real life example, you don't care what it looks like, it might even be invisible in normal light - all that is important is that the bouncer recognises the thing. If you were to wash it off, you'll lose the privilege of reentering the club without paying again.
In HTTP the same thing is happening. The server sets a cookie with the browser. When the browser comes back to the server (read: the next HTTP request), it shows the cookie to the server. The server recognises the cookie, and acts accordingly. Such a cookie could be something as simple as a "WasHereBefore" marker. Again, it's not important that the browser understands what it is. If you delete your cookie, the server will just act as if it has never seen you before, just like the bouncer in that club would if you washed off that ink stamp.
Today, a lot of cookies store just one important piece of information: a session identifier. Everything else is stored server-side and associated with that session identifier. The advantage of this system is that the actual data never leaves the server and as such can be trusted. Everything that is stored client-side can be tampered with and shouldn't be trusted.
Edit: After reading your comment and reading your question yet again, I think I finally understood your situation, and why you're interested in the cookie's actual encoding rather than just leaving it to your programming language: If you have two different software environments on the same server (e.g.: Perl and PHP), you may want to decode a cookie that was set by the other language. In the above example, PHP has to decode the Perl cookie or vice versa.
There is no standard in how data is stored in a cookie. The standard only says that a browser will send the cookie back exactly as it was received. The encoding scheme used is whatever your programming language sees fit.
Going back to the real life example, you now have two bouncers one speaking English, the other speaking Russian. The two will have to agree on one type of ink stamp. More likely than not this will involve at least one of them learning the other's language.
Since the browser behaviour is standardized, you can either imitate one language's encoding scheme in all other languages used on your server, or simply create your own standardized encoding scheme in all languages being used. You may have to use lower-level routines, such as PHP's header(), instead of higher-level routines, such as session_start(), to achieve this.
BTW: In the same manner, it is the server side programming language that decides how to store server side session data. You cannot access Perl's CGI::Session by using PHP's $_SESSION array.
Regardless of the cookie being opaque to the client, it still needs to conform to the HTTP spec. RFC 2616 specifies that HTTP header content should be in ISO-8859-1. RFC 5987 extends that to support other character sets, but I don't know how widely supported it is.
I prefer to encode into UTF8 and wrap with base64 encoding. It's fast, ubiquitous, and will never mangle your data at either end.
You will need to ensure an explicit conversion into UTF8 even when wrapping it. Other languages & runtimes, while supporting Unicode, may not store strings as UTF8 internally... like many Windows APIs. Python 2.x, in my experience, rarely gets Unicode strings right without explicit conversion.
ENCODE: nativeString -> utfEncode() -> base64Encode()
DECODE: base64Decode() -> utfDecode() -> nativeString
Almost every language I know of, these days, supports this. You can look for a universal single-function encode, but I err on the side of caution and choose the two-step approach... especially with foreign character sets.
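A sketch of that ENCODE/DECODE pipeline in Python (the function names are made up; only the str/bytes methods and the base64 module are standard library):

```python
import base64

def encode_cookie_value(native_string):
    """nativeString -> utfEncode() -> base64Encode()"""
    return base64.b64encode(native_string.encode('utf-8')).decode('ascii')

def decode_cookie_value(cookie_value):
    """base64Decode() -> utfDecode() -> nativeString"""
    return base64.b64decode(cookie_value).decode('utf-8')

raw = '/"печенье (*} значения"/'
encoded = encode_cookie_value(raw)
print(encoded)                              # plain ASCII, header-safe
print(decode_cookie_value(encoded) == raw)  # True
```

One caveat: standard base64 output can contain "+", "/" and "=", which some strict cookie parsers treat specially; base64.urlsafe_b64encode sidesteps the first two if that matters in your setup.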

Allowing code snippets in form input while preventing XSS and SQL injection attacks

How can one allow code snippets to be entered into an editor (as stackoverflow does) like FCKeditor or any other editor while preventing XSS, SQL injection, and related attacks.
Part of the problem here is that you want to allow certain kinds of HTML, right? Links for example. But you need to sanitize out just those HTML tags that might contain XSS attacks like script tags or for that matter even event handler attributes or an href or other attribute starting with "javascript:". And so a complete answer to your question needs to be something more sophisticated than "replace special characters" because that won't allow links.
Preventing SQL injection may be somewhat dependent on your platform choice. My preferred web platform has a built-in syntax for parameterizing queries that will mostly prevent SQL injection (called cfqueryparam). If you're using PHP and MySQL, there is a similar native mysql_real_escape_string() function. (I'm not sure the PHP function technically creates a parameterized query, but it has worked well for me in preventing SQL injection attempts thus far, since I've seen a few that were safely stored in the db.)
On the XSS protection, I used to use regular expressions to sanitize input for this kind of reason, but have since moved away from that method because of the difficulty involved in both allowing things like links while also removing the dangerous code. What I've moved to as an alternative is XSLT. Again, how you execute an XSL transformation may vary dependent upon your platform. I wrote an article for the ColdFusion Developer's Journal a while ago about how to do this, which includes both a boilerplate XSL sheet you can use and shows how to make it work with CF using the native XmlTransform() function.
The reason why I've chosen to move to XSLT for this is two fold.
First validating that the input is well-formed XML eliminates the possibility of an XSS attack using certain string-concatenation tricks.
Second it's then easier to manipulate the XHTML packet using XSL and XPath selectors than it is with regular expressions because they're designed specifically to work with a structured XML document, compared to regular expressions which were designed for raw string-manipulation. So it's a lot cleaner and easier, I'm less likely to make mistakes and if I do find that I've made a mistake, it's easier to fix.
Also, when I tested them, I found that WYSIWYG editors like CKEditor (the F was dropped from the name) preserve well-formed XML, so you shouldn't have to worry about that as a potential issue.
The same rules apply for protection: filter input, escape output.
In the case of input containing code, filtering just means that the string must contain printable characters, and maybe you have a length limit.
When storing text into the database, either use query parameters, or else escape the string to ensure you don't have characters that create SQL injection vulnerabilities. Code may contain more symbols and non-alpha characters, but the ones you have to watch out for with respect to SQL injection are the same as for normal text.
Don't try to duplicate the correct escaping function. Most database libraries already contain a function that does correct escaping for all characters that need escaping (e.g. this may be database-specific). It should also handle special issues with character sets. Just use the function provided by your library.
I don't understand why people say "use stored procedures!" Stored procs give no special protection against SQL injection. If you interpolate unescaped values into SQL strings and execute the result, this is vulnerable to SQL injection. It doesn't matter if you are doing it in application code versus in a stored proc.
When outputting to the web presentation, escape HTML-special characters, just as you would with any text.
The best thing that you can do to prevent SQL injection attacks is to make sure that you use parameterized queries or stored procedures when making database calls. Normally, I would also recommend performing some basic input sanitization as well, but since you need to accept code from the user, that might not be an option.
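As a minimal illustration of a parameterized query, here is a sketch using Python's built-in sqlite3 (the table and the snippet are invented for the example); the placeholder keeps the user's code snippet as data rather than as SQL:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE snippets (id INTEGER PRIMARY KEY, body TEXT)')

# Even a hostile-looking snippet is stored verbatim, not executed as SQL,
# because it is bound as a parameter instead of concatenated into the query.
snippet = "'); DROP TABLE snippets;--"
conn.execute('INSERT INTO snippets (body) VALUES (?)', (snippet,))

row = conn.execute('SELECT body FROM snippets').fetchone()
print(row[0] == snippet)  # True: stored exactly as entered
```

The same pattern (a placeholder plus a separately bound value) is what cfqueryparam and prepared statements in other platforms give you.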
On the other end (when rendering the user's input to the browser), HTML encoding the data will cause any malicious JavaScript or the like to be rendered as literal text rather than executed in the client's browser. Any decent web application server framework should have the capability.
I'd say one could replace all < by &lt;, etc. (using htmlentities in PHP, for example), and then pick the safe tags back out with some sort of whitelist. The problem is that the whitelist may be a little too strict.
Here is a PHP example
$code = getTheCodeSnippet();
$code = htmlentities($code);
$code = str_ireplace("&lt;br&gt;", "<br>", $code); // example: whitelist <br> tags
// One could also use regular expressions for these tags
To prevent SQL injections, you could replace all ' and \ chars with an "inoffensive" equivalent, like \' and \\, so that the following C line
#include <stdio.h>//'); Some SQL command--
wouldn't have any negative results in the database.

Regular expressions: Differences between browsers

I'm increasingly becoming aware that there must be major differences in the ways that regular expressions will be interpreted by browsers.
As an example, a co-worker had written this regular expression, to validate that a file being uploaded would have a PDF extension:
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$
This works in Internet Explorer, and in Google Chrome, but does NOT work in Firefox. The test always fails, even for an actual PDF. So I decided that the extra stuff was irrelevant and simplified it to:
^.+\.pdf$
and now it works fine in Firefox, as well as continuing to work in IE and Chrome.
Is this a quirk specific to the asp:FileUpload and RegularExpressionValidator controls in ASP.NET, or is it simply due to different browsers supporting regex in different ways? Either way, what browser differences have you encountered?
Regarding the actual question: The original regex requires the value to start with a drive letter or UNC device name. It's quite possible that Firefox simply doesn't include that with the filename. Note also that, if you have any intention of being cross-platform, that regex would fail on any non-Windows system, regardless of browser, as they don't use drive letters or UNC paths. Your simplified regex ("accept anything, so long as it ends with .pdf") is about as good a filename check as you're going to get.
However, Jonathan's comment to the original question cannot be overemphasized. Never, ever, ever trust the filename as an adequate means of determining its contents. Or the MIME type, for that matter. The client software talking to your web server (which might not even be a browser) can lie to you about anything and you'll never know unless you verify it. In this case, that means feeding the received file into some code that understands the PDF format and having that code tell you whether it's a valid PDF or not. Checking the filename may help to prevent people from trying to submit obviously incorrect files, but it is not a sufficient test of the files that are received.
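One cheap first content check, beyond the filename: a PDF file begins with the bytes "%PDF-". A sketch in Python (this only rejects obvious non-PDFs; a real validator should hand the received file to an actual PDF parser):

```python
def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check on uploaded bytes; not a substitute for parsing."""
    return data.startswith(b'%PDF-')

print(looks_like_pdf(b'%PDF-1.7\n...'))         # True
print(looks_like_pdf(b'MZ\x90\x00executable'))  # False: not a PDF header
```

Because the check inspects the received bytes rather than the client-supplied filename or MIME type, it cannot be spoofed by simply renaming a file.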
(I realize that you may know about the need for additional validation, but the next person who has a similar situation and finds your question may not.)
As far as I know, Firefox doesn't let you have the full path of an upload. Interpretation of regular expressions seems irrelevant in this case. I have yet to see any difference between modern browsers in regular expression execution.
If you're using JavaScript, not enclosing the regex in slashes causes an error in Firefox.
Try doing var regex = /^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$/;
As Dave mentioned, Firefox does not give the path, only the file name. Also as he mentioned, it doesn't account for differences between operating systems. I think the best check you could do would be to check if the file name ends with PDF. Also, this doesn't ensure it's a valid PDF, just that the file name ends with PDF. Depending on your needs, you may want to verify that it's actually a PDF by checking the content.
I have not noticed a difference between browsers in regards to the pattern syntax. However, I have noticed a difference between C# and Javascript as C#'s implementation allows back references and Javascript's implementation does not.
I believe JavaScript REs are defined by the ECMA standard, and I doubt there are many differences between JS interpreters. I haven't found any, in my programs, or seen mentioned in an article.
Your message is actually a bit confusing, since you throw ASP stuff in there. I don't see how you conclude it is the browser's fault when you talk about server-side technology or generated code. Actually, we don't even know if you are talking about JS in the browser, validation of the upload field (you can no longer do that, at least in a simple way, with FF3), or the server side (neither FF nor Opera nor Safari uploads the full path of the uploaded file; I am surprised to learn that Chrome behaves like IE...).