What's an example of something dangerous that would not be caught by the code below?
EDIT: After some of the comments I added another line, commented below. See Vinko's comment in David Grant's answer. So far only Vinko has answered the question, which asks for specific examples that would slip through this function. Vinko provided one, but I've edited the code to close that hole. If another of you can think of another specific example, you'll have my vote!
public static string strip_dangerous_tags(string text_with_tags)
{
string s = Regex.Replace(text_with_tags, @"<script", "<scrSAFEipt", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"</script", "</scrSAFEipt", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"<object", "<objSAFEct", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"</object", "</objSAFEct", RegexOptions.IgnoreCase);
// ADDED AFTER THIS QUESTION WAS POSTED
s = Regex.Replace(s, @"javascript", "javaSAFEscript", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onabort", "onSAFEabort", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onblur", "onSAFEblur", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onchange", "onSAFEchange", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onclick", "onSAFEclick", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"ondblclick", "onSAFEdblclick", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onerror", "onSAFEerror", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onfocus", "onSAFEfocus", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onkeydown", "onSAFEkeydown", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onkeypress", "onSAFEkeypress", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onkeyup", "onSAFEkeyup", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onload", "onSAFEload", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onmousedown", "onSAFEmousedown", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onmousemove", "onSAFEmousemove", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onmouseout", "onSAFEmouseout", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onmouseup", "onSAFEmouseup", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onreset", "onSAFEreset", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onresize", "onSAFEresize", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onselect", "onSAFEselect", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onsubmit", "onSAFEsubmit", RegexOptions.IgnoreCase);
s = Regex.Replace(s, @"onunload", "onSAFEunload", RegexOptions.IgnoreCase);
return s;
}
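To make the gap concrete, here is a rough Java sketch of the same blacklist idea, reduced to two of its rules; the class and method names are mine, and the `<iframe>` payload is my example, not from the question. A tag the blacklist never mentions sails through unchanged:

```java
import java.util.regex.Pattern;

public class BlacklistGap {
    // Mirrors two of the C# blacklist rules above (illustrative subset only).
    static String stripDangerousTags(String s) {
        s = Pattern.compile("<script", Pattern.CASE_INSENSITIVE)
                   .matcher(s).replaceAll("<scrSAFEipt");
        s = Pattern.compile("javascript", Pattern.CASE_INSENSITIVE)
                   .matcher(s).replaceAll("javaSAFEscript");
        return s;
    }

    public static void main(String[] args) {
        // <iframe> is not on the blacklist at all, so the tag survives verbatim.
        String attack = "<iframe src=\"http://evil.example/page\"></iframe>";
        System.out.println(stripDangerousTags(attack));
    }
}
```

The point is not the two rules it does apply but the unbounded set of rules it doesn't.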
It's never enough – whitelist, don't blacklist
For example, the javascript: pseudo-URL can be obfuscated with HTML entities, you've forgotten about <embed>, and there are dangerous CSS properties like behavior and expression in IE.
There are countless ways to evade filters, and such an approach is bound to fail. Even if you find and block every exploit possible today, new unsafe elements and attributes may be added in the future.
There are only two good ways to secure HTML:

1. Convert it to text by replacing every < with &lt;. If you want to let users enter formatted text, you can use your own markup (e.g. Markdown, like SO does).

2. Parse the HTML into a DOM, check every element and attribute, and remove everything that is not whitelisted. You will also need to check the contents of allowed attributes like href (make sure URLs use a safe protocol; block all unknown protocols). Once you've cleaned up the DOM, generate new, valid HTML from it. Never work on HTML as if it were text, because invalid markup, comments, entities, etc. can easily fool your filter.

Also make sure your page declares its encoding, because there are exploits that take advantage of browsers auto-detecting the wrong encoding.
You're much better off turning all < into &lt; and all > into &gt;, then converting acceptable tags back. In other words, whitelist, don't blacklist.
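A minimal Java sketch of that escape-then-whitelist idea. The allowed tags here (<b> and <i>) and the helper names are placeholders of mine, and this simplified version is case-sensitive; a real implementation should parse the markup rather than pattern-match it:

```java
public class WhitelistEscape {
    // Escape-first: every character that could open markup becomes inert text.
    static String sanitize(String input) {
        String s = input.replace("&", "&amp;")
                        .replace("<", "&lt;")
                        .replace(">", "&gt;");
        // Then re-enable only an approved subset (just <b> and <i> for this sketch).
        s = s.replaceAll("&lt;(/?)(b|i)&gt;", "<$1$2>");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sanitize("<b>bold</b> but <script>alert(1)</script> stays inert"));
    }
}
```

Anything not explicitly converted back stays harmless text, which is exactly the whitelist property the blacklist version lacks.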
As David shows, there's no easy way to protect with just a few regexes; you can always forget something, like javascript: in your case. You're better off escaping the HTML entities on output. There is a lot of discussion about the best way to do this, depending on what you actually need to allow, but what's certain is that your function is not enough.
Jeff has talked a bit about this here.
Any time you can write a string to the document, a big door swings open.
There are myriad places to inject malicious things into HTML/JavaScript. For this reason, Facebook didn't initially allow JavaScript in their applications platform. Their solution was to later implement a markup/script compiler that allows them to seriously filter out the bad stuff.
As said already, whitelist a few tags and attributes and strip out everything else. Don't blacklist a few known malicious attributes and allow everything else.
Although I can't provide a specific example of why not, I am going to go ahead and outright say no. This is more on principle. Regexes are an amazing tool, but they should only be used for certain problems. They are fantastic for data matching and searching.
They are not, however, a good tool for security. It is too easy to mess up a regex and have it be only partially correct. Hackers can find lots of wiggle room inside a poorly, or even well, constructed regex. I would try another avenue to prevent cross-site scripting.
Take a look at the XSS cheat sheet at http://ha.ckers.org/xss.html; it's not a complete list, but a good start.
One that comes to mind is <img src="http://badsite.com/javascriptfile" />
You also forgot onmouseover, and the style tag.
The easiest thing to do really is entity escaping. If the vector can't render properly in the first place, an incomplete blacklist won't matter.
As an example of an attack that makes it through this:
<div style="color: expression('alert(4)')">
Shameless plug:
The Caja project defines whitelists of HTML elements and attributes so that it can control how and when scripts in HTML get executed.
See the project at http://code.google.com/p/google-caja/
and the whitelists are the JSON files in
http://code.google.com/p/google-caja/source/browse/#svn/trunk/src/com/google/caja/lang/html
and
http://code.google.com/p/google-caja/source/browse/#svn/trunk/src/com/google/caja/lang/css
I still have not figured out why developers want to massage bad input into good input with a regular expression replace. Unless your site is a blog and needs to allow embedded html or javascript or any other sort of code, reject the bad input and return an error. The old saying is Garbage In - Garbage Out, why would you want to take in a nice steaming pile of poo and make it edible?
If your site is not internationalized, why accept any unicode?
If your site only does POST, why accept any URL encoded values?
Why accept any hex? Why accept HTML entities? What user inputs '&#039;' or '&quot;'?
As for regular expressions, using them is fine, however, you do not have to code a separate regular expression for the full attack string. You can reject many different attack signatures with just a few well constructed regex patterns:
patterns.put("xssAttack1", Pattern.compile("<script",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack2", Pattern.compile("SRC=",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack3", Pattern.compile("pt:al",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack4", Pattern.compile("xss",Pattern.CASE_INSENSITIVE) );
<FRAMESET><FRAME SRC="javascript:alert('XSS');"></FRAMESET>
<DIV STYLE="width: expression(alert('XSS'));">
<LINK REL="stylesheet" HREF="javascript:alert('XSS');">
<IMG SRC="jav	ascript:alert('XSS');"> // HTML allows embedded tabs...
<IMG SRC="jav
ascript:alert('XSS');"> // HTML allows embedded newlines...
<IMG SRC="jav
ascript:alert('XSS');"> // HTML allows embedded carriage returns...
Notice that my patterns are not the full attack signature, just enough to detect whether the value is malicious. It is unlikely that a user would legitimately enter 'SRC=' or 'pt:al'. This allows my regex patterns to detect unknown attacks that contain any of these tokens.
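A sketch of this token-based scanning in Java, reusing three of the signatures above; the detect method, its return convention, and the class name are mine, not part of the original answer:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class TokenScanner {
    static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put("xssAttack1", Pattern.compile("<script", Pattern.CASE_INSENSITIVE));
        PATTERNS.put("xssAttack2", Pattern.compile("SRC=", Pattern.CASE_INSENSITIVE));
        PATTERNS.put("xssAttack3", Pattern.compile("pt:al", Pattern.CASE_INSENSITIVE));
    }

    // Returns the name of the first matching signature, or null if the value looks clean.
    static String detect(String value) {
        for (Map.Entry<String, Pattern> e : PATTERNS.entrySet()) {
            if (e.getValue().matcher(value).find()) return e.getKey();
        }
        return null;
    }

    public static void main(String[] args) {
        // The tab-obfuscated IMG vector above still carries the SRC= token.
        System.out.println(detect("<IMG SRC=\"jav\tascript:alert('XSS');\">"));
    }
}
```

Because the tokens are fragments rather than full signatures, the obfuscated variants (embedded tabs, newlines, carriage returns) are still caught.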
Many developers will tell you that you cannot protect a site with a blacklist. Since the set of attacks is infinite, that is basically true. However, if you parse the entire request (params, param values, headers, cookies) with a blacklist constructed from tokens, you will be able to figure out what is an attack and what is valid. Remember, the attacker will most likely be shotgunning exploits at you from a tool. If you have properly hardened your server, he will not know what environment you are running and will have to blast you with lists of exploits. If he pesters you enough, put the attacker, or his IP, on a quarantine list. If he has a tool with 50k exploits ready to hit your site, how long will it take him if you quarantine his ID or IP for 30 minutes for each violation? Admittedly, there is still exposure if the attacker uses a botnet to multiplex his attack. Still, your site ends up being a much tougher nugget to crack.
Now, having checked the entire request for malicious content, you can use whitelist-type checks against length, referential/logical constraints, and naming to determine the validity of the request.
Don't forget to implement some sort of CSRF protection. Maybe a honey token, and check the user-agent string from previous requests to see if it has changed.
Whitespace makes you vulnerable. Read this.
Another vote for whitelisting. But it looks like you're going about this the wrong way. The way I do it, is to parse the HTML into a tag tree. If the tag you're parsing is in the whitelist, give it a tree node, and parse on. Same goes for its attributes.
Dropped attributes are just dropped. Everything else is HTML-escaped literal content.
And the bonus of this route is because you're effectively regenerating all the markup, it's all completely valid markup! (I hate it when people leave comments and they screw up the validation/design.)
Re "I can't whitelist" (paraphrased): blacklisting is a maintenance-heavy approach. You'll have to keep an eye on new exploits and make sure you're covered. It's a miserable existence. Just do it right once and you'll never need to touch it again.
From a different point of view, what happens when someone wants to have 'javascript' or 'functionload' or 'visionblurred' in what they submit? This can happen in most places for any number of reasons... From what I understand, those will become 'javaSAFEscript', 'functionSAFEload' and 'visionSAFEblurred'(!!).
If this might apply to you, and you're stuck with the blacklist approach, be sure to use the exact matching regexes to avoid annoying the user. In other words, be at the optimum point between security and usability, compromising either as little as possible.
Related
What's the best way to remove a query string (the question-mark variables) from an image URL?
Say I got a good image such as
http://i.ebayimg.com/00/s/MTYwMFgxNjAw/z/zoMAAOSwMpZUniWv/$_12.JPG?set_id=880000500F
But I can't really save it properly without adding a bunch of useless checking code because of the query string crap after it.
I just need
http://i.ebayimg.com/00/s/MTYwMFgxNjAw/z/zoMAAOSwMpZUniWv/$_12.JPG
Looking for the proper regular expression that handles this so I could replace it with blank.
It might be simple enough not to worry about regex.
This would work, guarded for URLs that have no query string (IndexOf returns -1 in that case, which would break Substring):
Dim idx = url.IndexOf("?"c)
Dim cleaned = If(idx >= 0, url.Substring(0, idx), url)
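For comparison, the same index check sketched in Java (class and method names are mine), keeping the URL unchanged when there is no query string:

```java
public class QueryStrip {
    // Remove everything from the first '?' onward; return the URL unchanged if there is none.
    static String stripQuery(String url) {
        int idx = url.indexOf('?');
        return idx >= 0 ? url.substring(0, idx) : url;
    }

    public static void main(String[] args) {
        System.out.println(stripQuery(
            "http://i.ebayimg.com/00/s/MTYwMFgxNjAw/z/zoMAAOSwMpZUniWv/$_12.JPG?set_id=880000500F"));
    }
}
```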
I am attempting to write an MVC model validation that verifies that there are 10 or more words in a string. The string is being populated correctly, so I did not include the HTML. I have done a fair bit of research, and it seems that something along the lines of what I have tried should work, but, for whatever reason, mine always seems to fail. Any ideas as to what I am doing wrong here?
(using System.ComponentModel.DataAnnotations, in a mvc 4 vb.net environment)
Have tried ([\w]+){10,}, ((\\S+)\s?){10,}, [\b]{20,}, [\w+\w?]{10,}, (\b(\w+?)\b){10,}, ([\w]+?\s){10}, ([\w]+?\s){9}[\w], ([\S]+\s){9}[\S], ([a-zA-Z0-9,.'":;$-]+\s+){10,} and several more variations on the same basic idea.
<Required(ErrorMessage:="The Description of Operations field is required"), RegularExpression("([\w]+){20,}", ErrorMessage:="ERROZ")>
Public Property DescOfOperations As String = String.Empty
Correct Solution was ([\S]+\s+){9}[\S\s]+
EDIT Moved accepted version to the top, removing unused versions. Unless I am wrong and the whole sequence needs to match, then something like (also accounting for double spaces):
([\S]+\s+){9}[\S\s]+
Or:
([\w]+?\s+){9}[\w]+
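Assuming the goal is simply "at least ten whitespace-separated words", the accepted pattern can be exercised in Java, whose Pattern syntax agrees with .NET for this expression (the class and method names here are mine):

```java
import java.util.regex.Pattern;

public class WordCountCheck {
    // Nine "word + whitespace" groups, then at least one more character.
    static final Pattern TEN_WORDS = Pattern.compile("([\\S]+\\s+){9}[\\S\\s]+");

    static boolean hasTenWords(String s) {
        return TEN_WORDS.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(hasTenWords("one two three four five six seven eight nine ten"));
        System.out.println(hasTenWords("too short"));
    }
}
```

Note that matches() anchors the whole string, just as the RegularExpression attribute does server-side in MVC.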
Give this a try:
([a-zA-Z0-9,.'":;$-]+\s){10,}
To me, regular expression validation seems more straightforward and meaningful than validating everything with ASP.NET validation controls. I am learning ASP.NET and do not want to memorize all the validation controls when any form input can simply be validated with a regular expression. Am I thinking right, or should I use validation controls?
Example: RequiredFieldValidator vs regex solution in C#
if(TextBox1.Text == ""){
    Label1.Text = "Name Field is required, please try again";
    return;
}
CompareValidator vs regex solution
if(Regex.IsMatch(TextBox1.Text, @"^[0-9]+$")){
    if(Convert.ToInt32(TextBox1.Text) > 18){
        output.InnerHtml = @"some code";
    } else {
        Label1.Text = "You should be old enough to express your political views";
        return;
    }
} else {
    Label1.Text = "You should be old enough to express your political views";
    return;
}
I'm thinking: would it not be better to do everything in C#, rather than remembering all those validation controls?
The major advantage to the validation controls is that in most cases they will output JavaScript validation for the client side that matches the server-side validation. This can reduce round trips to the server which is always a benefit. However, if you're good with JavaScript, you can probably code the client side piece more efficiently than the control would output anyway.
One other thing to consider, when using the control you can turn the validation on/off on both client and server using one flag on the control, if using your own code, you have to handle those separately.
You are right, regular expression validator can replace a lot of other validators, provided you can write a validation expression that works well on client and server side.
You can do a lot of validation work in regular expressions, but there are some areas where regexes are not ideal:
date validation: Either you get a terribly unwieldy regex, or you'll miss lots of plausible but illegal dates (like Feb 29, 2000).
email validation. Same thing here - you either reject some valid addresses, or you allow invalid addresses (and in either case, you'll allow addresses that are syntactically OK but don't correspond to an actual mailbox).
number validation in general - regular expressions are good for matching textual data. Using them to validate numbers is cumbersome and error-prone. Have you thought of exponential notation, locale-dependent decimal separators, thousands separators, leading zeroes, etc...?
Apart from that, the JavaScript regex engine has some limitations (e. g., lack of lookbehind assertions) that you need to know about when trying to write regexes that have to work both on the client and the server side.
And finally, do you realize that there's an error in your example regex? Maybe using a validator is safer unless you really know how to build a regex that does exactly what you intend it to do...
I have a peculiar problem. I have an email group that pipes emails to a message board. The word wrap of the emails varies. In yahoo, the messages tend to fill the entire container on the message board. But in all other mail clients, only part of the container width is filled, because the original mail was wrapped. I want all of the email messages to fill the entire width of the container. I've thought of two possible solutions: CSS, or a Regex that eliminates line breaks. Because I am only a garage mechanic (at these sorts of things), I simply cannot get the job done. Any help out there?
Here is a link that shows the issue: http://seanwilson.org/forum/index.php?t=msg&th=1729&start=0&S=171399e41f2c10c4357dd9b217caaa3f
(compare the message of "sean" with that of "rob." One fills the container, the other not).
Can any of you suggest how to get all the mail to fill the container?
You gave too little information: what programming language are you using? PHP, JavaScript, something else?
I think you only need to replace \r\n, \r and \n with a space. Note the double quotes: in PHP, '\n' in single quotes is a literal backslash and n, not a newline. PHP code for that:
$nowrap = str_replace("\r\n", ' ', $nowrap);
$nowrap = str_replace("\r", ' ', $nowrap);
$nowrap = str_replace("\n", ' ', $nowrap);
You can do that analogically in other languages (for JS see string.replace method: http://www.tizag.com/javascriptT/javascript-string-replace.php).
Depending on the situation (people always seem to add 2 linebreaks between paragraphs), you could say the problem is: replace all newlines not directly preceded or followed by a newline with a space.
//just to be sure, remove \r's
$string = str_replace("\r",'',$string);
$string = preg_replace('/(?<!\n)\n(?!\n)/',' ',$string);
While allowing \r's:
$string = preg_replace('/(?<!\r|\n)\r?\n(?!\r|\n)/',' ',$string);
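The same lookarounds work in Java too, for anyone piping the board through a JVM tool; this is purely an illustration of the regex's behavior (class and method names are mine):

```java
public class Unwrap {
    // Replace a lone newline (not adjacent to another newline) with a space,
    // leaving blank lines between paragraphs intact. Same lookarounds as the PHP version.
    static String unwrap(String text) {
        return text.replaceAll("(?<!\n)\n(?!\n)", " ");
    }

    public static void main(String[] args) {
        String mail = "This line was\nhard-wrapped.\n\nNew paragraph.";
        System.out.println(unwrap(mail));
    }
}
```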
Edit: never mind, do not use this: since people tend to write their email text in paragraphs, you will break their signature / sign-off with this regex. One could fiddle around with a minimum line length before deeming a line 'breakable' (I chose 63), but fiddly it will be:
$string = preg_replace('/([^\r\n]{63,})\r?\n(?!\r|\n)/','$1 ',$string);
The problem is: there is no assurance the line break wasn't intended. With a fiddleable line length you could base it on average users, but the question is: what do they mind more, the difference between breaking and non-breaking paragraphs, or the breaking of their signatures?
Thanks for getting back so quickly!
The discussion board uses PHP (and also CSS). The only trouble is that I am somewhat limited in my ability to tinker with its programming. If I am to do this at my current level of skill, I have only one of two options.
using a preg-replace in php. The discussion board allows us to do this from a control panel. So If I could do it with one preg-replace statement, it should work.
Would Wrikken's solution work if I do not remove \r's? Because that seems to be spot on. (could the \r's be added to the preg-replace?)
I had hoped the solution could come through a css property of some sort. I guess that isn't possible.
Thanks so much for your help!
[NOTE: thanks so much for your help! The solution worked!!! I changed the number to 53 or so; it needed to be a little smaller. I don't care that a rare, long signature line may lose its carriage return. That's a small price to pay for a full message box! You easily saved me several days of learning something that was bound to be moderately frustrating. Thanks so much for that quick fix. I am joyous at the help I received here.]
Although this seems like a trivial question, I am quite sure it is not :)
I need to validate names and surnames of people from all over the world. Imagine a huge list of millions of names and surnames where I need to remove, as well as possible, any cruft I identify. How can I do that with a regular expression? If it were only English ones, I think this would cut it:
^[a-z -']+$
However, I need to support also these cases:
other punctuation symbols as they might be used in different countries (no idea which, but maybe you do!)
different Unicode letter sets (accented letter, greek, japanese, chinese, and so on)
no numbers or symbols or unnecessary punctuation or runes, etc.
titles, middle initials, suffixes are not part of this data
names are already separated by surnames.
we are prepared to force ultra rare names to be simplified (there's a person named '#' in existence, but it doesn't make sense to allow that character everywhere. Use pragmatism and good sense.)
note that many countries have laws about names so there are standards to follow
Is there a standard way of validating these fields I can implement to make sure that our website users have a great experience and can actually use their name when registering in the list?
I would be looking for something similar to the many "email address" regexes that you can find on google.
I sympathize with the need to constrain input in this situation, but I don't believe it is possible - Unicode is vast, expanding, and so is the subset used in names throughout the world.
Unlike email, there's no universally agreed-upon standard for the names people may use, or even which representations they may register as official with their respective governments. I suspect that any regex will eventually fail to pass a name considered valid by someone, somewhere in the world.
Of course, you do need to sanitize or escape input, to avoid the Little Bobby Tables problem. And there may be other constraints on which input you allow as well, such as the underlying systems used to store, render or manipulate names. As such, I recommend that you determine first the restrictions necessitated by the system your validation belongs to, and create a validation expression based on those alone. This may still cause inconvenience in some scenarios, but they should be rare.
I'll try to give a proper answer myself:
The only punctuation that should be allowed in a name is the full stop, the apostrophe and the hyphen. I haven't seen any other case in the list of corner cases.
Regarding numbers, there's only one case with an 8. I think I can safely disallow that.
Regarding letters, any letter is valid.
I also want to include space.
This would sum up to this regex:
^[\p{L} \.'\-]+$
This presents one problem, i.e. the apostrophe can be used as an attack vector. It should be encoded.
So the validation code should be something like this (untested):
var name = nameParam.Trim();
if (!Regex.IsMatch(name, @"^[\p{L}' \.\-]+$"))
    throw new ArgumentException("nameParam");
name = name.Replace("'", "&#39;"); // &apos; does not work in IE
Can anyone think of a reason why a name should not pass this test or a XSS or SQL Injection that could pass?
complete tested solution
using System;
using System.Text.RegularExpressions;
namespace test
{
class MainClass
{
public static void Main(string[] args)
{
var names = new string[]{"Hello World",
"John",
"João",
"タロウ",
"やまだ",
"山田",
"先生",
"мыхаыл",
"Θεοκλεια",
"आकाङ्क्षा",
"علاء الدين",
"אַבְרָהָם",
"മലയാളം",
"상",
"D'Addario",
"John-Doe",
"P.A.M.",
"' --",
"<xss>",
"\""
};
foreach (var nameParam in names)
{
Console.Write(nameParam+" ");
var name = nameParam.Trim();
if (!Regex.IsMatch(name, @"^[\p{L}\p{M}' \.\-]+$"))
{
Console.WriteLine("fail");
continue;
}
name = name.Replace("'", "&#39;");
Console.WriteLine(name);
}
}
}
}
I would just allow everything (except an empty string) and assume the user knows what his name is.
There are 2 common cases:
You care that the name is accurate and are validating against a real paper passport or other identity document, or against a credit card.
You don't care that much and the user will be able to register as "Fred Smith" (or "Jane Doe") anyway.
In case (1), you can allow all characters because you're checking against a paper document.
In case (2), you may as well allow all characters because "123 456" is really no worse a pseudonym than "Abc Def".
I would think you would be better off excluding the characters you don't want with a regex. Trying to cover every umlaut, accented e, hyphen, etc. will be pretty insane. Just exclude digits (but then what about a guy named "George Foreman the 4th"?) and symbols you know you don't want, like @#$%^ or what have you. But even then, a regex will only guarantee that the input matches the regex; it will not tell you that it is a valid name.
EDIT after clarifying that this is trying to prevent XSS: A regex on a name field is obviously not going to stop XSS on its own. However, this article has a section on filtering that is a starting point if you want to go that route:
s/[\<\>\"\'\%\;\(\)\&\+]//g;
"Secure Programming for Linux and Unix HOWTO" by David A. Wheeler, v3.010 Edition (2003)
v3.72, 2015-09-19 is a more recent version.
BTW, do you plan to only permit the Latin alphabet, or do you also plan to try to validate Chinese, Arabic, Hindi, etc.?
As others have said, don't even try to do this. Step back and ask yourself what you are actually trying to accomplish. Then try to accomplish it without making any assumptions about what people's names are, or what they mean.
I don’t think that’s a good idea. Even if you find an appropriate regular expression (maybe using Unicode character properties), this wouldn’t prevent users from entering pseudo-names like John Doe, Max Mustermann (there even is a person with that name), Abcde Fghijk or Ababa Bebebe.
You could validate two names separated by a space with the following regex:
^[A-Za-zÀ-ú]+ [A-Za-zÀ-ú]+$
or just use:
[[:lower:]] = [a-zà-ú]
[[:upper:]] =[A-ZÀ-Ú]
[[:alpha:]] = [A-Za-zÀ-ú]
[[:alnum:]] = [A-Za-zÀ-ú0-9]
It's a very difficult problem to validate something like a name due to all the corner cases possible.
Sanitize the inputs and let them enter whatever they want for a name, because deciding what is a valid name and what is not is probably way outside the scope of whatever you're doing, given that the range of potentially strange yet legal names is nearly infinite.
If they want to call themselves Tricyclopltz^2-Glockenschpiel, that's their problem, not yours.
A very contentious subject that I seem to have stumbled along here. However, sometimes it's nice to head dear little Bobby Tables off at the pass and send little Robert to the headmaster's office along with his semicolons and SQL comment lines (--).
This regex in VB.NET includes regular alphabetic characters and various circumflexed European characters. However, poor old James Mc'Tristan-Smythe the 3rd will have to input his pedigree as Jim the Third.
<asp:RegularExpressionValidator ID="RegExValid1" Runat="server"
ErrorMessage="ERROR: Please enter a valid surname<br/>" SetFocusOnError="true" Display="Dynamic"
ControlToValidate="txtSurname" ValidationGroup="MandatoryContent"
ValidationExpression="^[A-Za-z'\-\p{L}\p{Zs}\p{Lu}\p{Ll}\']+$" />
This one worked perfectly for me in JavaScript:
^[a-zA-Z]+[\s|-]?[a-zA-Z]+[\s|-]?[a-zA-Z]+$
Here is the method:
function isValidName(name) {
var found = name.search(/^[a-zA-Z]+[\s|-]?[a-zA-Z]+[\s|-]?[a-zA-Z]+$/);
return found > -1;
}
Steps:
first remove all accents
apply the regular expression
To strip the accents (this needs System.Text and System.Globalization):
private static string RemoveAccents(string s)
{
s = s.Normalize(NormalizationForm.FormD);
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.Length; i++)
{
if (CharUnicodeInfo.GetUnicodeCategory(s[i]) != UnicodeCategory.NonSpacingMark) sb.Append(s[i]);
}
return sb.ToString();
}
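If anyone needs the same trick outside .NET, Java's java.text.Normalizer gives an equivalent decompose-and-strip. This is a sketch of mine, with the \p{M} category standing in for the NonSpacingMark check above:

```java
import java.text.Normalizer;

public class Accents {
    // Decompose to NFD, then drop the combining marks (the Java analogue of the C# code above).
    static String removeAccents(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(removeAccents("João Günther"));
    }
}
```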
This somewhat helps:
^[a-zA-Z]'?([a-zA-Z]|\.| |-)+$
This one should work
^([A-Z]{1}+[a-z\-\.\']*+[\s]?)*
Add some special characters if you need them.