ReqEx expression for form validation - regex

I am trying to add form validation to my html site in order to prevent xss injection attacks.
I am using a simple java form validator genvalidator_v4.js that allows me to use regex expressions to determine what is allowed in a text box. I am trying to write one that would prevent "<" or ">" or any other tags that could be used in this kind of attack, but still allow alphanumeric, punctuation, and other special characters.
Any ideas? Also open to other methods of preventing xss attacks but I am very inexperienced in this area so please keep it as simple as possible.

You are trying to blacklist dangerous input. That's very tricky, it's very easy to get it wrong because of the sheer number of tokens that could be dangerous.
Thus, the following two practices are recommended instead:
Escape everything read from the database before outputting it on a web page. If you correctly HtmlEncode everything (your language of choice surely has a library method for that), it doesn't matter if a user entered <script>/* do something evil */</script> and that code got stored in your database. Correctly encoded, this will just be printed verbatim and do no harm.
If you still want to filter input (which might be useful as an additional layer of security), whitelists are generally safer than blacklists. So, instead of saying that < is harmful, you say that letters, digits, punctuation, etc. are safe. What exactly is safe depends on what type of field you are filtering.

Related

How do I select src between <> if img exists?

I need to select src=" using a regular expression in the form: //, but only if it is within an image tag.
This should return true:
<img alt="Alt text" src="/directory/Images/my-image.jpg" />
This to return false:
<script type="text/javascript" async="" src="https://www.google-analytics.com/analytics.js"></script>
The end result will be replacing the scr=", which the application I am using performs, I need the regex for the find.
First, the standard disclaimer: if you are using regexes to parse a HTML DOM, you are DOING IT WRONG. With all structured data (XML, JSON, and so forth), the right way to parse HTML is to use something built for that purpose, and query it using its querying system.
That said, it is often the case that what you want is a quick hack on the commandline or the search field of an editor or whatever, and you don't want or need to faff with writing an application that loads in DOM-parsing libraries.
In that case, if you're not actually writing a program, and you don't mind that there are edge-cases where any regex you try will break, then consider something like this:
/<img\b[^<>]+\bsrc\s*=\s*"([^"]+)"/i ... maybe replacing the leading / and trailing /i with whatever other thing your language uses to denote a case-insensitive regular expression.
Note that this makes assumptions, that the url is quoted with doublequotes, the tag is correctly formed, there are no extraneous <img strings in the document, there are no doublequotes in the URL, and countless others that I didn't think of, but a proper parser would. These assumptions are a large part of why using a parser is so important: it makes no such assumptions, and if fed garbage, will correctly let you know that you did so, rather than trying to digest it and giving you pain later on.
<img\b - an img tag. The word boundary ensures this isn't an imgur tag or whatever.
[^<>]+ - one or more characters, with no closing tag, and for safety, no opening tags either.
\bsrc\s*=\s* - 'src=', but with optional whitespace, and another word-boundary check.
"([^"]+)" - some URL consisting of non-quote characters, within quotes.
Now, be aware that since we're doing NO security checking on the URL, you could be grabbing anything, such as javascript:...something malicious..., or it could be 6GB long - you just don't know. You could add in checking for such things, but you'll always miss something, unless you control the input and know exactly what you're parsing.
Your mention of "my application" does mean that I must reiterate: the above is almost certainly the wrong way to do it if you are writing an application, and the question you should be asking is probably closer to "how do I get the value of the src attribute of an img tag from a HTML page, in my chosen programming language?" rather than "how do I use regexes to extract this token from this HTML tag?"
When I say this, I don't mean "ivory-tower computer scientists will look down their nose at you" - though I admit there can be a lot of that kind of snootiness in programming :D
I mean something more like... "you're setting yourself up for pain as you run into edge-case after edge-case, and spiral down into a deep rabbit-hole of infinitely refining your regex. And you can likely avoid the pain with a simple one-liner, infinitely nicer than regex, perhaps document.querySelector('img[src^="/directory/Images"]') as #LGSon suggests in a comment.
People will say this because they've had this pain, and they're wincing at the idea that you might suffer it too.
There are several ways to match that. This RegEx is just an example and it is not certainly the best expression:
(src=")(.+)(.jpg|.JPG|.PNG|.png|.JPEG)"
You can wrap your target image URLs with a capturing group (), maybe similar to this expression:
(src=")((.+)(.jpg|.JPG|.PNG|.png|.JPEG))"
and simply call it using $2 (group #2).
You can also simplify it as you wish by adding ignore flag such as this expression:
src="((.+)(\.[a-rt-z]+))"

HTML5 Input Pattern vs. Non-Latin Letters

I want to make pre-validation of some input form with new HTML5 pattern attirbute. My dataset is "Domain Name", so <input type="url"> regex preset isn't applied.
But there is a problem, I wont use A-Za-z , because of damned IDN's (Internationalized domain name).
So question: is there any way to use <input pattern=""> for random non-english letters validation ?
I tried \w ofcource but it works only for latin...
Maybe someone has a set of some \xNN-\xNN which guarantees entering of ALL unicode alpha characters, or some another way?
edit: "This question may already have an answer here:" - no, there is no answer.
Based on my testing, HTML5 pattern attributes supports Unicode character code points in the exact same way that JavaScript does and does not:
It only supports \u notation for unicode code points so \u00a1 will match 'ยก'.
Because these define characters, you can use them in character ranges like [\u00a1-\uffff]
. will match Unicode characters as well.
You don't really specify how you want to pre-validate so I can't really help you more than that, but by looking up the unicode character values, you should be able to work out what you need in your regex.
Keep in mind that the pattern regex execution is rather dumb overall and isn't universally supported. I recommend progressive enhancement with some javascript on top of the pattern value (you can even re-use the regex more or less).
As always, never trust user input - It doesn't take a genius to make a request to your form endpoint and pass more or less whatever data they like. Your server-side validation should necessarily be more explicit. Your client-side validation can be more generous, depending upon whether false positives or false negatives are more problematic to your use case.
I know this isn't what you want to hear, but...
The HTML5 pattern attribute isn't really for the programmer so much as it's for the user. So, considering the unfortunate limitations of pattern, you are best off providing a "loose" pattern--one that doesn't give false negatives but allows for a few false positives. When I've run into this problem, I found that the best thing to do was a pattern consisting of a blacklist + a couple minimum requirements. Hopefully, that can be done in your case.

regex expressions prevent sql/script injection

I am trying to create a regex expression for client side validation (before server side validation which will also take place) to prevent sql/script injection i.e something like this - which does not work
(script)|(<)|(>)|(%3c)|(%3e)|(SELECT) |(UPDATE) |(INSERT) |(DELETE)|(GRANT) |(REVOKE)|(UNION)|(&lt;)|(&gt;)
What is the correct format for this (above) expression so I can get it to work?
e.g. my EMail checker is like this
(/^[^\\W][a-zA-Z0-9\\_\\-\\.]+([a-zA-Z0-9\\_\\-\\.]+)*\\#[a-zA-Z0-9_]+(\\.[a-zA-Z0-9_]+)*\\.[a-zA-Z]{2,4}$/))
Oh and if you can think of anything else to add please "shout".
You cannot in any way even hinder SQL injection attempts on the client side. It is a terrible, terrible idea which cannot help you but may cause a ball-ache for genuine users. It will not stop anyone who has a chance of actually exploiting an SQLi.
As far as the regex goes, you need to add the / at the beginning and end, like in your mail example, to denote it is a regex. Also, I think the regex design is flawed as it still allows many injection vectors. For example it allows the dreaded single quote ', -- comments and other. It doesn't even start to cover all the builtin functions of your RDBMS that might be knocking around. An attacker will often make use of, e.g. SELECT statements already on your server side, so removing them probably wouldn't help either.
Your best defense is to use parametrized queries on the server side (e.g. pg_prepare for php & postgres)
Generally Sql Injection occurs in the strings passed to the parameters of a sql command such as insert, update, delete, or select. This regular expression validates whether there is any inline or block comment in the sql command.
/[\t\r\n]|(--[^\r\n]*)|(\/\*[\w\W]*?(?=\*)\*\/)/gi
Only a-z or A-Z or 0-9 between 4-8 characters:
^([a-z]|[A-Z]|[0-9]){4,8}$
SQL injection and escaping sound magical to many people, something like shield against some mysterious danger, but: don't be scared of it - it is nothing magical. It is just the way to enable special characters being processed by the query.
So, don't invent new magial shields and ways how to protect the magical injection danger! Instead, try to understand how escaping of the input works.
It's more common to escape the control characters like `and ' that way one can still enter SQL code into the database, say it is on a CMS and I'm adding an article about SQL injection. I want to use those words and characters without triggering an injection. Looking at it, it seems to be for something with HTML base so convert the < and > to < and >, that will sanitize any and all html tags while still allowing HTML demo content to be displayed.
As already said, this should all be server side, as it comes into the system.

XSS regular expression

What is an regular expression that can be used to determine if a string is an XSS (cross site scripting) security risk?
That depends on the context in which that string is being used.
For instance, if the string is being printed out as part of an HTML page, then the special HTML characters <, >, ", and ' can potentially be XSS risks.
If it's being passed around via JSON, then ' and " could potentially be XSS risks.
If it's being included in SQL statements (which it really shouldn't be, at least not directly - use parameterized queries), then things like ; and backticks may be an issue.
Et cetera.
There can never be a bullet proof function to stop all of xss and a regular expression isn't the best choice. XSS is highly dependent on where on the page and limiting charicters such as " ' < > is a good start, but by no means a comprehensive solution. Even with stopping these characters there are MANY other ways of exploiting XSS. To name a few there are malicious href's: javascript:alert(/xss/) and injection of event handlers: onload=alert(/xss/), nether of which will be stopped if you filter for the 4 characters listed.
HTMLPurifier is made up of literally thousands of regular expressions, and it gets bypassed all the time.
Look for any unencoded < characters in html generated from user data. Without any < characters, there can be no nasty html injected into your site.
If you want to allow for user-generated formatting, then limit the allowed html to a subset. It's going to be impossible to check this with regular expressions, so I recommend a good html parser instead.

How to reject names (people and companies) using whitelists with C# regex's?

I've run into a few problems using a C# regex to implement a whitelist of allowed characters on web inputs. I am trying to avoid SQL injection and XSS attacks. I've read that whitelists of the allowable characters are the way to go.
The inputs are people names and company names.
Some of the problems are:
Company names that have ampersands. Like "Jim & Sons". The ampersand is important, but it is risky.
Unicode characters in names (we have asian customers for example), that enter their names using their character sets. I need to whitelist all these.
Company names can have all kinds of slashes, like "S/A" and "S\A". Are those risky?
I find myself wanting to allow almost every character after seeing all the data that is in the DB already (and being entered by new users).
Any suggestions for a good whitelist that will handle these (and other) issues?
NOTE: It's a legacy system, so I don't have control of all the code. I was hoping to reduce the number of attacks by preventing bad data from getting into the system in the first place.
This SO thread has a lot of good discussion on protecting yourself from injection attacks.
In short:
Filter your input as best as you can
Escape your strings using framework based methods
Parameterize your sql statements
In your case, you can limit the name field to a small character set. The company field will be more difficult, and you need to consider and balance your users need for freedom of entry with your need for site security. As others have said, trying to write your own custom sanitation methods is tricky and risky. Keep it simple and protect yourself through your architecture - don't simply rely on strings being "safe", even after sanitization.
EDIT:
To clarify - if you're trying to develop a whitelist, it's not something that the community can hand out, since it's entirely dependent on the data you want. But let's look at a example of a regex whitelist, perhaps for names. Say I've whitelisted A-Z and a-z and space.
Regex reWhiteList = new Regex("^[A-Za-z ]+$")
That checks to see if the entire string is composed of those characters. Note that a string with a number, a period, a quote, or anything else would NOT match this regex and thus would fail the whitelist.
if (reWhiteList.IsMatch(strInput))
// it's ok, proceed to step 2
else
// it's not ok, inform user they've entered invalid characters and try again
Hopefully this helps some more! With names and company names you'll have a tough-to-impossible time developing a rigorous pattern to check against, but you can do a simple allowable character list, as I showed here.
Do not try to sanitize names, especially with regex!
Just make sure that you are properly escaping the values and saving them safely in your DB, and them escaping them back when presenting in HTML
Company names might have almost any kind of symbol in them, so I don't know how well this is going to work for you. I'd concentrate on shielding yourself directly from various attacks, not hoping that your strings are "naturally" safe.
(Certainly they can have ampersands, colons, semicolons, exclamation points, hyphens, percent signs, and all kinds of other things that could be "unsafe" in a host of contexts.)
Why filter or regex the data at all, or even escape it, you should be using bind variables to access the database.
This way, the customer could enter something like: anything' OR 'x'='x
And your application doesn't care because your SQL code doesn't parse the variable because it's not set when you prepare the statement. I.e.
'SELECT count(username) FROM usertable WHERE username = ? and password = ?'
then you execute that code with those variables set.
This works in PHP, PERL, J2EE applications, and so on.
I think writing your own regexp is not a good idea: it would be very hard. Try leveraging existing functions of your web framework, there is lots of resources on the net. If you say C#, I assume you are using ASP.NET, try the following article:
How To: Protect From Injection Attacks in ASP.NET
This is my current regex WHITELIST for a company name. Any input outside of these characters is rejected:
"^[0-9\p{L} '\-\.,\/\&]{0,50}$"
The \p{L} matches any unicode "letter". So, the accents and asian characters are whitelisted.
The \& is a bit problematic because it potentially allows javascript special characters.
The \' is problematic if not using parameterized queries, because of SQL injection.
The \- could allow "--", also a potential for SQL injection if not using parameterized queries.
Also, the \p{L} won't work client-side, so you can't use it in the ASP.NET regular expression validator without disabling clientside validation:
EnableClientScript="False"