regex expressions prevent sql/script injection - regex

I am trying to create a regex for client-side validation (before server-side validation, which will also take place) to prevent SQL/script injection, i.e. something like this, which does not work:
(script)|(<)|(>)|(%3c)|(%3e)|(SELECT) |(UPDATE) |(INSERT) |(DELETE)|(GRANT) |(REVOKE)|(UNION)|(&lt;)|(&gt;)
What is the correct format for this (above) expression so I can get it to work?
e.g. my EMail checker is like this
(/^[^\\W][a-zA-Z0-9\\_\\-\\.]+([a-zA-Z0-9\\_\\-\\.]+)*\\@[a-zA-Z0-9_]+(\\.[a-zA-Z0-9_]+)*\\.[a-zA-Z]{2,4}$/))
Oh and if you can think of anything else to add please "shout".

You cannot in any way even hinder SQL injection attempts on the client side. It is a terrible, terrible idea which cannot help you but may cause a ball-ache for genuine users. It will not stop anyone who has a chance of actually exploiting an SQLi.
As far as the regex goes, you need to add the / at the beginning and end, like in your mail example, to denote that it is a regex. Also, I think the regex design is flawed, as it still allows many injection vectors. For example, it allows the dreaded single quote ', -- comments, and others. It doesn't even start to cover all the built-in functions of your RDBMS that might be knocking around. An attacker will often make use of, e.g., SELECT statements already on your server side, so removing them probably wouldn't help either.
Your best defense is to use parameterized queries on the server side (e.g. pg_prepare for PHP and PostgreSQL).
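To illustrate why the blacklist is a weak filter, here is a minimal sketch that wraps the pattern from the question in / delimiters with the i flag and runs it against a few inputs. The function name and the sample inputs are assumptions made up for this illustration.

```javascript
// The pattern from the question, wrapped in /.../ and made case-insensitive.
const blacklist = /(script)|(<)|(>)|(%3c)|(%3e)|(SELECT) |(UPDATE) |(INSERT) |(DELETE)|(GRANT) |(REVOKE)|(UNION)|(&lt;)|(&gt;)/i;

// Hypothetical helper: true when the blacklist flags the input.
function looksSuspicious(input) {
  return blacklist.test(input);
}

console.log(looksSuspicious("<script>alert(1)</script>"));   // true
console.log(looksSuspicious("' OR '1'='1"));                 // false: injection passes
console.log(looksSuspicious("admin'--"));                    // false: comment trick passes
console.log(looksSuspicious("I need to update my address")); // true: genuine user blocked
```

The quote-based and comment-based inputs pass untouched, while an ordinary sentence is rejected.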

Generally, SQL injection occurs in the strings passed as parameters to a SQL command such as INSERT, UPDATE, DELETE, or SELECT. This regular expression detects inline (--) and block (/* */) comments, as well as tab and newline characters, in the input:
/[\t\r\n]|(--[^\r\n]*)|(\/\*[\w\W]*?(?=\*)\*\/)/gi
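A short sketch of what this pattern does and does not catch. The helper name and the sample strings are assumptions for illustration. Because the pattern carries the g flag, the sketch uses String.prototype.match rather than repeated test() calls, which would be stateful.

```javascript
const sqlCommentPattern = /[\t\r\n]|(--[^\r\n]*)|(\/\*[\w\W]*?(?=\*)\*\/)/gi;

// Hypothetical helper: returns every match as an array (empty if none).
function findSqlComments(text) {
  return text.match(sqlCommentPattern) || [];
}

console.log(findSqlComments("admin'-- ignore the rest")); // ["-- ignore the rest"]
console.log(findSqlComments("1 /* hidden */ OR 1=1"));    // ["/* hidden */"]
console.log(findSqlComments("' OR '1'='1"));              // []: no comment, still an injection
```

The last input shows that a comment check alone does not detect a quote-based injection.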

Only a-z, A-Z, or 0-9, between 4 and 8 characters:
^([a-z]|[A-Z]|[0-9]){4,8}$

SQL injection and escaping sound magical to many people, something like a shield against some mysterious danger, but don't be scared of it: it is nothing magical. It is just a way to let special characters be processed by the query.
So, don't invent new magical shields and ways to protect against the magical injection danger! Instead, try to understand how escaping of the input works.

It's more common to escape control characters like ` and '. That way one can still enter SQL code into the database. Say it is a CMS and I'm adding an article about SQL injection: I want to use those words and characters without triggering an injection. Looking at it, it seems to be for something HTML-based, so convert < and > to &lt; and &gt;. That will sanitize any and all HTML tags while still allowing HTML demo content to be displayed.
As already said, this should all be server side, as it comes into the system.
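The conversion of < and > to entities can be sketched as follows. The function name is an assumption, and the sketch also escapes &, " and ', which the answer does not mention but which are commonly escaped for HTML output. A real application would normally use its framework's encoder instead.

```javascript
// Hypothetical helper: replaces HTML-significant characters with entities.
function escapeHtml(text) {
  const replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return String(text).replace(/[&<>"']/g, (ch) => replacements[ch]);
}

console.log(escapeHtml('<script>alert("x")</script>'));
// &lt;script&gt;alert(&quot;x&quot;)&lt;/script&gt;
```

The text is stored unchanged and only encoded when it is written into a page, so an article about SQL or HTML still displays as typed.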

Related

Validate Regex Input, preferably using Regex

I'm looking to have the (admin) user enter some pattern matching string, to give different users of my website access to different database rows, depending on if the text in a particular field of the row matches the pattern matching string against that user.
I decided on Regex because it is trivial to integrate into the MySQL statements directly.
I don't really know where to start with validating that a string is a valid regular expression, with a regular expression.
I did some searching for similar questions, couldn't see one. Google produced the comical answer, sadly not so helpful.
Do people do this in the wild, or avoid it?
Is it able to be done with a simple regex, or will the set of all valid regex need to be limited to a usable subset?
Validating a regex is an incredibly complex task. A regex would not be able to do it.
A simple approach would be to catch any errors that occur when you try to run the SQL statement, then report an appropriate error back to the user.
I am assuming that the 'admin' is a trusted user here. It is quite dangerous to give a non-trusted user the ability to enter regexes, because it is easy to attack your system with a regex that is constructed to take a really long time to execute. And that is before you even start to worry about the Bobby Tables problems.
In JavaScript:
const input = "hello**";
try {
  new RegExp(input);
  // submit the regex
} catch (err) {
  // regex is not valid
}
You cannot validate that a string contains a valid regular expression with a regular expression. But you might be able to compromise.
If you only need to know that only characters which are valid in regular expressions were used in the string, you can use the regex:
^[\d\w \-\}\{\)\(\+\*\?\|\.\$\^\[\]\\]*$
This might be enough depending on the application.
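A sketch contrasting the character check with an actual compile attempt. The helper name is an assumption, and JavaScript's regex syntax is not identical to MySQL's, so this only approximates what the database would accept.

```javascript
// Character-only check from the answer above.
const regexCharsOnly = /^[\d\w \-\}\{\)\(\+\*\?\|\.\$\^\[\]\\]*$/;

// Hypothetical helper: true when the JavaScript engine can compile the pattern.
function compiles(pattern) {
  try {
    new RegExp(pattern);
    return true;
  } catch (err) {
    return false;
  }
}

console.log(regexCharsOnly.test("hello**"), compiles("hello**")); // true false
console.log(regexCharsOnly.test("a{1,3}"), compiles("a{1,3}"));   // false true
```

The first input passes the character check but is not a valid pattern. The second is a valid pattern but fails the check because the comma is not in the allowed set, so the character list may need extending for your use.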

RegEx expression for form validation

I am trying to add form validation to my html site in order to prevent xss injection attacks.
I am using a simple JavaScript form validator, genvalidator_v4.js, that allows me to use regex expressions to determine what is allowed in a text box. I am trying to write one that would prevent "<" or ">" or any other tags that could be used in this kind of attack, but still allow alphanumeric, punctuation, and other special characters.
Any ideas? Also open to other methods of preventing xss attacks but I am very inexperienced in this area so please keep it as simple as possible.
You are trying to blacklist dangerous input. That's very tricky: it's easy to get wrong because of the sheer number of tokens that could be dangerous.
Thus, the following two practices are recommended instead:
Escape everything read from the database before outputting it on a web page. If you correctly HtmlEncode everything (your language of choice surely has a library method for that), it doesn't matter if a user entered <script>/* do something evil */</script> and that code got stored in your database. Correctly encoded, this will just be printed verbatim and do no harm.
If you still want to filter input (which might be useful as an additional layer of security), whitelists are generally safer than blacklists. So, instead of saying that < is harmful, you say that letters, digits, punctuation, etc. are safe. What exactly is safe depends on what type of field you are filtering.
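A per-field whitelist might look like the sketch below. The field names, patterns, and length limits are all assumptions chosen for illustration, and what counts as safe depends on your own fields.

```javascript
// Hypothetical whitelist: one pattern per field type.
const fieldWhitelists = new Map([
  ["username", /^[A-Za-z0-9_]{3,20}$/],
  ["zipCode", /^\d{5}$/],
  ["comment", /^[A-Za-z0-9 .,!?'()\-]{1,200}$/],
]);

// Unknown field types are rejected rather than allowed by default.
function isAllowed(fieldType, value) {
  const pattern = fieldWhitelists.get(fieldType);
  return pattern ? pattern.test(value) : false;
}

console.log(isAllowed("username", "alice_01"));          // true
console.log(isAllowed("comment", "Nice post, thanks!")); // true
console.log(isAllowed("comment", "<script>"));           // false
```

This is an extra layer only. The output encoding described in the first practice is still required.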

Create a valid CSV with regular expressions

I have a horribly formatted, tab-delimited "CSV" that I'm trying to clean up.
I would like to quote all the fields; currently only some of them are. I'm trying to go through, tab by tab, and add quotes if necessary.
This RegEx will give me all the tabs.
\t
This RegEx will give me the tabs that do not END with a ".
\t(?!")
How do I get the tabs that do not start with a "?
Generally for these kinds of problems, if it's a one-time occurrence, I will use Excel's capabilities or other applications (SSIS? T-SQL?) to produce the desired output.
A general-purpose regex will usually run into bizarre exceptions, and getting it just right will often take longer and is prone to missing cases your regex didn't catch.
If this is going to happen regularly, try to fix the problem at the source and/or create a special utility program to do it.
Use negative lookbehind: (?<!")\t
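A small sketch comparing the lookahead from the question with this lookbehind, assuming a regex engine that supports lookbehind (recent JavaScript engines do). The sample line and the helper name are made up for illustration.

```javascript
// Sample line (an assumption): the middle field is already quoted.
const line = 'a\t"b"\tc';

// Tabs NOT followed by a quote (lookahead, from the question).
const notFollowedByQuote = /\t(?!")/g;
// Tabs NOT preceded by a quote (lookbehind, from the answer).
const notPrecededByQuote = /(?<!")\t/g;

// Hypothetical helper: the index of every match in the text.
function matchPositions(text, pattern) {
  return [...text.matchAll(pattern)].map((m) => m.index);
}

console.log(matchPositions(line, notFollowedByQuote)); // [5]
console.log(matchPositions(line, notPrecededByQuote)); // [1]
```

The tab at index 1 has an unquoted field on its left and the tab at index 5 has one on its right, so the two patterns together locate every place where a quote is missing.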
For one-shots like this I usually just write a little program to clean up the data; that way I can also add some validation to make sure it really has converted properly after the run. I have nothing against regex, but often in my case it takes longer for me to figure out the regex than to write a small program. :)
edit: come to think about it, the main motivator is that it is more fun - for me at least :)
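Such a little program could look like this sketch. It assumes that no field contains a tab or a line break and that every line should have as many fields as the first. The function names are made up for illustration.

```javascript
// Quote a field unless it is already wrapped in double quotes.
function quoteField(field) {
  const isQuoted =
    field.length >= 2 && field.startsWith('"') && field.endsWith('"');
  return isQuoted ? field : '"' + field.replace(/"/g, '""') + '"';
}

// Quote every field and validate the field count of each line.
function cleanTsv(text) {
  const lines = text.split(/\r?\n/);
  const expected = lines[0].split("\t").length;
  return lines
    .map((line, i) => {
      const fields = line.split("\t");
      if (fields.length !== expected) {
        throw new Error(
          `line ${i + 1}: expected ${expected} fields, found ${fields.length}`
        );
      }
      return fields.map(quoteField).join("\t");
    })
    .join("\n");
}

console.log(cleanTsv('a\t"b"\tc\n1\t2\t"3"'));
```

The field-count check is the kind of validation a single regex replace cannot give you.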

How to reject names (people and companies) using whitelists with C# regex's?

I've run into a few problems using a C# regex to implement a whitelist of allowed characters on web inputs. I am trying to avoid SQL injection and XSS attacks. I've read that whitelists of the allowable characters are the way to go.
The inputs are people names and company names.
Some of the problems are:
Company names that have ampersands. Like "Jim & Sons". The ampersand is important, but it is risky.
Unicode characters in names (we have Asian customers, for example, who enter their names using their own character sets). I need to whitelist all of these.
Company names can have all kinds of slashes, like "S/A" and "S\A". Are those risky?
I find myself wanting to allow almost every character after seeing all the data that is in the DB already (and being entered by new users).
Any suggestions for a good whitelist that will handle these (and other) issues?
NOTE: It's a legacy system, so I don't have control of all the code. I was hoping to reduce the number of attacks by preventing bad data from getting into the system in the first place.
This SO thread has a lot of good discussion on protecting yourself from injection attacks.
In short:
Filter your input as best as you can
Escape your strings using framework based methods
Parameterize your sql statements
In your case, you can limit the name field to a small character set. The company field will be more difficult, and you need to balance your users' need for freedom of entry with your need for site security. As others have said, trying to write your own custom sanitization methods is tricky and risky. Keep it simple and protect yourself through your architecture: don't simply rely on strings being "safe", even after sanitization.
EDIT:
To clarify - if you're trying to develop a whitelist, it's not something that the community can hand out, since it's entirely dependent on the data you want. But let's look at an example of a regex whitelist, perhaps for names. Say I've whitelisted A-Z and a-z and space.
Regex reWhiteList = new Regex("^[A-Za-z ]+$");
That checks to see if the entire string is composed of those characters. Note that a string with a number, a period, a quote, or anything else would NOT match this regex and thus would fail the whitelist.
if (reWhiteList.IsMatch(strInput))
{
    // it's ok, proceed to step 2
}
else
{
    // it's not ok, inform user they've entered invalid characters and try again
}
Hopefully this helps some more! With names and company names you'll have a tough-to-impossible time developing a rigorous pattern to check against, but you can do a simple allowable character list, as I showed here.
Do not try to sanitize names, especially with regex!
Just make sure that you are properly escaping the values and saving them safely in your DB, and then escaping them again when presenting them in HTML.
Company names might have almost any kind of symbol in them, so I don't know how well this is going to work for you. I'd concentrate on shielding yourself directly from various attacks, not hoping that your strings are "naturally" safe.
(Certainly they can have ampersands, colons, semicolons, exclamation points, hyphens, percent signs, and all kinds of other things that could be "unsafe" in a host of contexts.)
Why filter or regex the data at all, or even escape it? You should be using bind variables to access the database.
This way, the customer could enter something like: anything' OR 'x'='x
And your application doesn't care because your SQL code doesn't parse the variable because it's not set when you prepare the statement. I.e.
'SELECT count(username) FROM usertable WHERE username = ? and password = ?'
then you execute that code with those variables set.
This works in PHP, PERL, J2EE applications, and so on.
I think writing your own regexp is not a good idea: it would be very hard. Try leveraging the existing functions of your web framework; there are lots of resources on the net. Since you say C#, I assume you are using ASP.NET, so try the following article:
How To: Protect From Injection Attacks in ASP.NET
This is my current regex WHITELIST for a company name. Any input outside of these characters is rejected:
"^[0-9\p{L} '\-\.,\/\&]{0,50}$"
The \p{L} matches any unicode "letter". So, the accents and asian characters are whitelisted.
The \& is a bit problematic because it potentially allows javascript special characters.
The \' is problematic if not using parameterized queries, because of SQL injection.
The \- could allow "--", also a potential for SQL injection if not using parameterized queries.
Also, the \p{L} won't work client-side, so you can't use it in the ASP.NET regular expression validator without disabling clientside validation:
EnableClientScript="False"
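For reference, JavaScript engines that implement ES2018 Unicode property escapes do accept \p{L} when the u flag is set, so a client-side version of the same whitelist can be sketched as below. Whether a particular validator control passes the u flag is not something this sketch can confirm. Note that in u mode the ampersand must not be escaped, so \& becomes &. The function name is an assumption.

```javascript
// Same whitelist as above, adapted to JavaScript's u (Unicode) mode.
const companyNamePattern = /^[0-9\p{L} '\-.,\/&]{0,50}$/u;

// Hypothetical helper for a client-side pre-check.
function isAllowedCompanyName(name) {
  return companyNamePattern.test(name);
}

console.log(isAllowedCompanyName("Jim & Sons")); // true
console.log(isAllowedCompanyName("株式会社"));    // true
console.log(isAllowedCompanyName("S/A"));        // true
console.log(isAllowedCompanyName("S\\A"));       // false: backslash is not whitelisted
console.log(isAllowedCompanyName("<script>"));   // false
```

The caveats above about ' and - still apply. The check narrows input and does not replace parameterized queries or output encoding.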

Under what situations are regular expressions really the best way to solve the problem?

I'm not sure if Jeff coined it but it's the joke/saying that people who say "oh, I know I'll use regular expressions!" now have two problems. I've always taken this to mean that people use regular expressions in very inappropriate contexts.
However, under what circumstances are regular expressions really the best answer? What problems are they really the best or maybe only way to solve a situation?
RegExps are good for:
Text Format Validations (email, url, numbers)
Text searches/substitution.
Mappings (e.g. url pattern to function call)
Filtering some texts (related to substitution)
Lexical analysis during parsing.
They can be used to validate anything that has a pattern, like:
Social Security Number
Telephone Number (555-555-5555)
Email Address (something@example.com)
IP Address (but it's more complex to make sure it's valid)
All those have patterns and are easily verifiable by RegEx.
They are harder to use for input that follows logic rather than a pattern, like a credit card number, but they can still be used to do some client-side validation.
So the best ways?
To sanitize data entry on the client side before sanitizing it on the server.
To "Search and Replace" some strings that contain a pattern.
I'm sure I am missing a lot of other cases.
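The difference between a pattern and a logic check can be sketched like this. The phone format follows the 555-555-5555 example above. The card check uses the Luhn checksum, which the answer does not name and which is brought in here as an assumed example of "logic". The function and variable names are made up.

```javascript
// Pattern check: a regex alone is enough.
const phonePattern = /^\d{3}-\d{3}-\d{4}$/;

// Logic check: the regex only validates the shape, the checksum needs code.
function passesLuhn(cardNumber) {
  const digits = cardNumber.replace(/[ -]/g, "");
  if (!/^\d{12,19}$/.test(digits)) return false;
  let sum = 0;
  let doubleIt = false;
  // Walk from the rightmost digit, doubling every second one.
  for (let i = digits.length - 1; i >= 0; i--) {
    let d = Number(digits[i]);
    if (doubleIt) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
    doubleIt = !doubleIt;
  }
  return sum % 10 === 0;
}

console.log(phonePattern.test("555-555-5555"));  // true
console.log(passesLuhn("4111 1111 1111 1111"));  // true
console.log(passesLuhn("4111 1111 1111 1112"));  // false
```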
Regular expressions are a great way to parse text that doesn't already have a parser (i.e. XML). I have used them to create a parser for the mod_rewrite syntax in the .htaccess file, or in my URL Rewriter project http://www.codeplex.com/urlrewriter, for example.
They are really good when you want to be more specific than "*" or "?", like "3 letters then 2 numbers then a $ sign then a period".
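That description translates almost word for word into a pattern. A minimal sketch, assuming "letters" means ASCII letters and "numbers" means digits:

```javascript
// 3 letters, then 2 digits, then a literal $ sign, then a literal period.
const pattern = /^[A-Za-z]{3}\d{2}\$\.$/;

console.log(pattern.test("abc12$.")); // true
console.log(pattern.test("ab12$."));  // false: only 2 letters
console.log(pattern.test("abc12$"));  // false: missing period
```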
The quote is from an anti-Perl rant from Jamie Zawinski. I think Perl used to do regex really badly but now it seems to be a standard engine for a lot of programs.
But the same sentiment still applies. If you don't know how to use regex, you'd better not try something really fancy, otherwise you'll get one of these tags too (see bronze list) ;o)
https://stackoverflow.com/users/730/keng
They are good for matching or finding text that takes a very specific and simple format. By "simple" I mean not nested and smaller than the entire html spec, for example.
They are primarily of value for highly structured text parsing. If you use named groups (an option in most mature regex systems), you have a phenomenally powerful and crisp way to handle the strings.
Here's an example. Consider that netstat, in its various iterations on different Linux OSes and versions, can return different results. Sometimes there is an extra column, sometimes there is a shift in the date/time format. Regexes give you a powerful way to handle that with a single expression. Couple that with named groups, and you can retrieve the data without hacks like:
1) split on spaces
2) ok, the netstat version is X so I need to add 1 to all array references past column 5.
3) ok, the netstat version is Y so I need to make sure that I use multiple array references for the date info.
YUCK. Simple to fix in a Regex :-)
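A sketch of the named-group approach, assuming a JavaScript engine with named capture groups. The sample lines imitate one common netstat layout and are assumptions, as are the group and function names. Real output varies by platform and version, which is the point of the answer.

```javascript
// The state column is optional, so UDP lines without it still parse.
const netstatLine =
  /^(?<proto>\w+)\s+(?<recvq>\d+)\s+(?<sendq>\d+)\s+(?<local>\S+)\s+(?<foreign>\S+)(?:\s+(?<state>\S+))?\s*$/;

// Hypothetical helper: named fields for one line, or null if it does not match.
function parseNetstatLine(line) {
  const match = netstatLine.exec(line);
  return match ? { ...match.groups } : null;
}

const tcp = parseNetstatLine(
  "tcp        0      0 127.0.0.1:5432          0.0.0.0:*               LISTEN"
);
const udp = parseNetstatLine(
  "udp        0      0 0.0.0.0:68              0.0.0.0:*"
);

console.log(tcp.local, tcp.state); // 127.0.0.1:5432 LISTEN
console.log(udp.local, udp.state); // 0.0.0.0:68 undefined
```

Fields are read by name, so a missing or extra column changes the pattern in one place instead of shifting every array index.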