Regular expressions: Differences between browsers - regex

I'm increasingly becoming aware that there must be major differences in the ways that regular expressions will be interpreted by browsers.
As an example, a co-worker had written this regular expression, to validate that a file being uploaded would have a PDF extension:
^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$
This works in Internet Explorer, and in Google Chrome, but does NOT work in Firefox. The test always fails, even for an actual PDF. So I decided that the extra stuff was irrelevant and simplified it to:
^.+\.pdf$
and now it works fine in Firefox, as well as continuing to work in IE and Chrome.
Is this a quirk specific to asp:FileUpload and RegularExpressionValidator controls in ASP.NET, or is it simply due to different browsers supporting regex in different ways? Either way, what are some of the latter that you've encountered?

Regarding the actual question: The original regex requires the value to start with a drive letter or UNC device name. It's quite possible that Firefox simply doesn't include that with the filename. Note also that, if you have any intention of being cross-platform, that regex would fail on any non-Windows system, regardless of browser, as they don't use drive letters or UNC paths. Your simplified regex ("accept anything, so long as it ends with .pdf") is about as good of a filename check as you're going to get.
However, Jonathan's comment to the original question cannot be overemphasized. Never, ever, ever trust the filename as an adequate means of determining its contents. Or the MIME type, for that matter. The client software talking to your web server (which might not even be a browser) can lie to you about anything and you'll never know unless you verify it. In this case, that means feeding the received file into some code that understands the PDF format and having that code tell you whether it's a valid PDF or not. Checking the filename may help to prevent people from trying to submit obviously incorrect files, but it is not a sufficient test of the files that are received.
(I realize that you may know about the need for additional validation, but the next person who has a similar situation and finds your question may not.)

As far as I know firefox doesn't let you have the full path of an upload. Interpretation of regular expressions seems irrelevant in this case. I have yet to see any difference between modern browsers in regular expression execution.

If you're using javascript, not enclosing the regex with slashes causes error in Firefox.
Try doing var regex = /^(([a-zA-Z]:)|(\\{2}\w+)\$?)(\\(\w[\w].*))(.pdf)$/;

As Dave mentioned, Firefox does not give the path, only the file name. Also as he mentioned, it doesn't account for differences between operating systems. I think the best check you could do would be to check if the file name ends with PDF. Also, this doesn't ensure it's a valid PDF, just that the file name ends with PDF. Depending on your needs, you may want to verify that it's actually a PDF by checking the content.

I have not noticed a difference between browsers in regards to the pattern syntax. However, I have noticed a difference between C# and Javascript as C#'s implementation allows back references and Javascript's implementation does not.

I believe JavaScript REs are defined by the ECMA standard, and I doubt there are many differences between JS interpreters. I haven't found any, in my programs, or seen mentioned in an article.
Your message is actually a bit confusing, since you throw ASP stuff in there. I don't see how you conclude it is the browser's fault when you talk about server-side technology or generated code. Actually, we don't even know if you are talking about JS on the browser, validation of upload field (you can no longer do it, at least in a simple way, with FF3) or on the server side (neither FF nor Opera nor Safari upload the full path of the uploaded file. I am surprised to learn that Chrome does like IE...).

Related

Coldfusion Form.getPartsArray function

These two links reference the ability, in ColdFusion, to get the name of an uploaded file using form.getPartsArray(). However I can not find ColdFusion documentation on it. I would like to use this but not if it has been deprecated or will be. Does anyone have more information on the origin and fate of this function?
ColdFusion: get the name of a file before uploading
http://www.stillnetstudios.com/get-filename-before-calling-cffile/
ColdFusion: get the name of a file before uploading
Ignoring the main question for a moment, can you elaborate on why you want to use it? Reason for asking is the title of the first question might give you a mistaken impression about what that method actually does. Form.getPartsArray() does not provide access to file information before the file is uploaded. The file is already on the server at that point, so in later versions of CF (with additional functionality) it does not necessarily buy you much over just using cffile action=upload.
Does anyone have more information on the origin and fate of this
function?
However, to answer your other question - it is an undocumented feature last I checked. (It was more useful in earlier versions of CF, which lacked some of the newer features relating to form fields and uploads.)
Internally, most form data can be handled using standard request objects, ie HttpServletRequest. However, those do not support multipart requests, ie file uploads. So a special handler is needed. Macromedia/Adobe chose to use the com.oreilly.servlet library for their internal implementation. That is what you are accessing when using FORM.getPartsArray().
The O'Reilly stuff has been bundled with CF since (at least) CF8, which is a good indicator. However, using any internal feature always comes with the risk the implementation will change and break your application. Also, if you move to another engine, the code may not be supported/compatible. So "You pays your money, you takes your choice".
CF8 / Form Scope

Is there any browsers that sends multipart/form-data sub-parts?

I am writing a webserver in C++. I am looking at the POST documentation on w3:
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4
I see that a POST is supposed to support the full multi-parts scheme: parts and sub-parts (and obviously, sub-sub-parts...) just like for email attachments.
Is there any browser and/or tool that do that on a normal basis? In other words, is it really important for a server to support parts and sub-parts?
The obvious problem with that is the fact that it could mean that two files are uploaded under the same name. That's quite a problem if you ask me. Also, from what I can see in PHP it is not supported at all in that realm. Am I correct?
Ah! I guess I should have searched a little more and to tell you the truth I had not thought of looking at HTML5 for the answer.
The following paragraph actually includes the answer:
http://www.w3.org/html/wg/drafts/html/master/forms.html#multipart-form-data
Note: In particular, this means that multiple files submitted as
part of a single element will
result in each file having its own field; the "sets of
files" feature ("multipart/mixed") of RFC 2388 is not used.
So it is clear that sub-parts (multipart/mixed) are not to be supported.

Match browsers set to Scandinavian languages based on "Accept-Language"

Question
I am trying to match browsers set to Scandinavian languages based on HTTP header "Accept-Language".
My regex is:
^(nb|nn|no|sv|se|da|dk).*
My question is if this is sufficient, and if anyone know about any other odd scandinavian (but "valid") language codes or obscure browser bugs causing false positives?
Used for
The regex is used for displaying a english link in the top of the Norwegian web pages (which is the primary language and the root of the domain and sub-domains) that takes you to the English web pages (secondary language and folder under root) when the browser language is not Scandinavian. The link can be closed / "opted-out" with hash stored in JavaScript localStorage if the user don't want to see the link again. We decided not to use IP geo-location because of limited time to implement.
Depending on the language you are working in there may be code in place you can use to parse this easily, e.g. this post: Parse Accept-Language header in Java <-- Also provides a good code example
Further - are you sure you want to limit your regex to the start of the string, as several lanaguages can be provided (the first is intended to be "I prefer x but also accept the following") : http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.4
Otherwise your regex should work fine based on the what you were asking and here is a list of all browser language codes: http://www.metamodpro.com/browser-language-codes
I would also - in your shoes, make the "switch to X language" link easy to find for all users until they had opted not to see it again. I would expect many people may have a preference set by default in their browser but find a site actually using it to be unexpected i.e. a user experience like:
I prefer english but don't know enough to change this setting and have never had a reason to before as so few sites make use of it.
That regular expression is enough if you are testing each item in accept-language individually.
If not individually, there are 2 problems:
One of the expected languages could not appear at the beginning of the header, but after.
Some of the expected languages abbreviations could appear as qualifier of a completely different language.

Everything inside < > lost, not seen in html?

I have many source/text file, say file.cpp or file.txt . Now, I want to see all my code/text in browser, so that it will be easy for me to navigate many files.
My main motive for doing all this is, I am learning C++ myself, so whenever I learn something new, I create some sample code and then compile and run it. Also, along these codes, there are comments/tips for me to be aware of. And then I create links for each file for easy navigation purpose. Since, there are many such files, I thought it would be easy to navigate it if I use this html method. I am not sure if it is OK or good approach, I would like to have some feedback.
What I did was save file.cpp/file.txt into file.html and then use pre and code html tag for formatting. And, also some more necessare html tags for viewing html files.
But when I use it, everything inside < > is lost
eg. #include <iostream> is just seen as #include, and <iostream> is lost.
Is there any way to see it, is there any tag or method that I can use ?
I can use regular HTML escape code < and > for this, to see < > but since I have many include files and changing it for all of them is bit time-consuming, so I want to know if there is any other idea ??
So is there any other solution than s/</< and s/>/>
I would also like to know if there any other ideas/tips than just converting cpp file into html.
What I want to have is,
in my main page something like this,
tip1 Do this
tip2 Do that
When I click tip1, it will open tip1.html which has my codes for that tip. And also there is back link in tip1.html, which will take me back to main page on clicking it. Everything is OK just that everything inside < > is lost,not seen.
Thanks.
You might want to take a look at online tools such as CodeHtmler, which allows you to copy into the browser, select the appropriate language, and it'll convert to HTML for you, together with keyword colourisation etc.
Or, do like many other people and put your documentation in Doxygen format (/** */) with code samples in #verbatim/#endverbatim tags. Doxygen is good stuff.
A few ideas:
If you serve the files as mimetype text/plain, the browser should display the text for you.
You could also possibly configure your browser to assume .cpp is text/plain.
Instead of opening the files directly in the browser, you could serve them with a web server than can change the characters for you.
You could also use SyntaxHighlighter to display the code on the client side using JavaScript.
It is pretty much essential that somewhere along the line you use a program to prevent the characters '<>&' from being (mis-)interpreted by your browser (and expand significant repeated blanks into '` '). You have a couple of options for when/how to do that. You could use static HTML, simply converting each file once before putting it into the web server document hierarchy. This has the least conversion overhead if the files are looked at more often than they are modified. Alternatively, you can configure your web server to server the pages via a filter program (CGI, or something more sophisticated) and serve the output of that in lieu of the file. The advantage is that files are only converted when needed; the disadvantage is that the files are converted each time they are needed. You could get fancy and consider a caching solution - convert the file on first demand but retain the converted file for future use. The main downside there is that the web server needs to be able to write to where the converted file is cached - not necessarily a good idea for security reasons. (A minimalist approach to security requires the document hierarchy to be owned by and only writable by one user, say webmaster, and the web server runs as another user, say webserver. Now the web server cannot do any damage because it cannot write anywhere in the document hierarchy. Simple; effective; restrictive.)
The program can be a simple Perl script or a simple C program (the C source for webcode 1.3 is available here).

Allowing code snippets in form input while preventing XSS and SQL injection attacks

How can one allow code snippets to be entered into an editor (as stackoverflow does) like FCKeditor or any other editor while preventing XSS, SQL injection, and related attacks.
Part of the problem here is that you want to allow certain kinds of HTML, right? Links for example. But you need to sanitize out just those HTML tags that might contain XSS attacks like script tags or for that matter even event handler attributes or an href or other attribute starting with "javascript:". And so a complete answer to your question needs to be something more sophisticated than "replace special characters" because that won't allow links.
Preventing SQL injection may be somewhat dependent upon your platform choice. My preferred web platform has a built-in syntax for parameterizing queries that will mostly prevent SQL-Injection (called cfqueryparam). If you're using PHP and MySQL there is a similar native mysql_escape() function. (I'm not sure the PHP function technically creates a parameterized query, but it's worked well for me in preventing sql-injection attempts thus far since I've seen a few that were safely stored in the db.)
On the XSS protection, I used to use regular expressions to sanitize input for this kind of reason, but have since moved away from that method because of the difficulty involved in both allowing things like links while also removing the dangerous code. What I've moved to as an alternative is XSLT. Again, how you execute an XSL transformation may vary dependent upon your platform. I wrote an article for the ColdFusion Developer's Journal a while ago about how to do this, which includes both a boilerplate XSL sheet you can use and shows how to make it work with CF using the native XmlTransform() function.
The reason why I've chosen to move to XSLT for this is two fold.
First validating that the input is well-formed XML eliminates the possibility of an XSS attack using certain string-concatenation tricks.
Second it's then easier to manipulate the XHTML packet using XSL and XPath selectors than it is with regular expressions because they're designed specifically to work with a structured XML document, compared to regular expressions which were designed for raw string-manipulation. So it's a lot cleaner and easier, I'm less likely to make mistakes and if I do find that I've made a mistake, it's easier to fix.
Also when I tested them I found that WYSIWYG editors like CKEditor (he removed the F) preserve well-formed XML, so you shouldn't have to worry about that as a potential issue.
The same rules apply for protection: filter input, escape output.
In the case of input containing code, filtering just means that the string must contain printable characters, and maybe you have a length limit.
When storing text into the database, either use query parameters, or else escape the string to ensure you don't have characters that create SQL injection vulnerabilities. Code may contain more symbols and non-alpha characters, but the ones you have to watch out for with respect to SQL injection are the same as for normal text.
Don't try to duplicate the correct escaping function. Most database libraries already contain a function that does correct escaping for all characters that need escaping (e.g. this may be database-specific). It should also handle special issues with character sets. Just use the function provided by your library.
I don't understand why people say "use stored procedures!" Stored procs give no special protection against SQL injection. If you interpolate unescaped values into SQL strings and execute the result, this is vulnerable to SQL injection. It doesn't matter if you are doing it in application code versus in a stored proc.
When outputting to the web presentation, escape HTML-special characters, just as you would with any text.
The best thing that you can do to prevent SQL injection attacks is to make sure that you use parameterized queries or stored procedures when making database calls. Normally, I would also recommend performing some basic input sanitization as well, but since you need to accept code from the user, that might not be an option.
On the other end (when rendering the user's input to the browser), HTML encoding the data will cause any malicious JavaScript or the like to be rendered as literal text rather than executed in the client's browser. Any decent web application server framework should have the capability.
I'd say one could replace all < by <, etc. (using htmlentities on PHP, for example), and then pick the safe tags with some sort of whitelist. The problem is that the whitelist may be a little too strict.
Here is a PHP example
$code = getTheCodeSnippet();
$code = htmlentities($code);
$code = str_ireplace("<br>", "<br>", $code); //example to whitelist <br> tags
//One could also use Regular expressions for these tags
To prevent SQL injections, you could replace all ' and \ chars by an "innofensive" equivalent, like \' and \, so that the following C line
#include <stdio.h>//'); Some SQL command--
Wouldn't have any negative results in the database.