Is there any way to put malicious code into a regular expression? - regex

I want to add regular expression search capability to my public web page. Other than HTML encoding the output, do I need to do anything to guard against malicious user input?
Google searches are swamped by people solving the converse problem-- using regular expressions to detect malicious input--which I'm not interested in. In my scenario, the user input is a regular expression.
I'll be using the Regex library in .NET (C#).

Denial‐of‐Service Concerns
The most common concern with regexes is a denial‐of‐service attack through pathological patterns that go exponential — or even super‐exponential! — and so appear to take forever to solve. These may only show up on particular input data, but one can generally create one wherein this doesn’t matter.
Which ones these are will depend somewhat on how smart the regex compiler you’re using happens to be, because some of these can be detected during compilation time. Regex compilers that implement recursion usually have a built‐in recursion‐depth counter for checking non‐progression.
Russ Cox’s excellent 2007 paper on Regular Expression Matching Can Be Simple And Fast
(but is slow in Java, Perl, PHP, Python, Ruby, ...) talks about ways that most modern NFAs, which all seem to derive from Henry Spencer’s code, suffer severe performance degradation, but where a Thompson‐style NFA has no such problems.
If you only admit patterns that can be solved by DFAs, you can compile them up as such, and they will run faster, possibly much faster. However, it takes time to do this. The Cox paper mentions this approach and its attendant issues. It all comes down to a classic time–space trade‐off.
With a DFA, you spend more time building it (and allocating more states), whereas with an NFA you spend more time executing it, since it can be multiple states at the same time, and backtracking can eat your lunch — and your CPU.
Denial‐of‐Service Solutions
Probably the most reasonable way to address these patterns that are on the losing end of a race with the heat‐death of the universe is to wrap them with a timer that effectively places a maximum amount of time allowed for their execution. Usually this will be much, much less than the default timeout that most HTTP servers provide.
There are various ways to implement these, ranging form a simple alarm(N) at the C level, to some sort of try {} block the catches alarm‐type exceptions, all the way to spawning off a new thread that’s specially created with a timing constraint built right into it.
Code Callouts
In regex languages that admit code callouts, some mechanism for allowing or disallowing these from the string you’re going to compile should be provided. Even if code callouts are only to code in the language you are using, you should restrict them; they don’t have to be able to call external code, although if they can, you’ve got much bigger problems.
For example, in Perl one cannot have code callouts in regexes created from string interpolation (as these would be, as they’re compiled during run‐time) unless the special lexically‐scoped pragma use re "eval"; in active in the current scope.
That way nobody can sneak in a code callout to run system programs like rm -rf *, for example. Because code callouts are so security‐sensitive, Perl disables them by default on all interpolated strings, and you have to go out of your way to re‐enable them.
User‐Defined \P{roperties}
There remains one more security‐sensitive issue related to Unicode-style properties — like \pM, \p{Pd}, \p{Pattern_Syntax}, or \p{Script=Greek} — that may exist in some regex compilers that support that notation.
The issue is that in some of these, the set of possible properties is user‐extensible. That means you can have custom properties that are actual code callouts to named functions in some particular namepace, like \p{GoodChars} or \p{Class::Good_Characters}. How your language handles those might be worth looking at.
Sandboxing
In Perl, a sandboxed compartment via the Safe module would give control over namespace visibility. Other languages offer similar sandboxing technologies. If such devices are available, you might want to look into them, because they are specifically designed for limited execution of untrusted code.

Adding to tchrist's excellent answer: the same Russ Cox who wrote the "Regular Expression" page has also released code! re2 is a C++ library which guarantees O(length_of_regex) runtime and configurable memory-use limit. It's used within Google so that you can type a regex into google code search -- meaning that it's been battle tested.

Yes.
Regexes can be used to perform DOS attacks.
There is no simple solution.

You'll want to read this paper:
Insecure Context Switching: Inoculating regular expressions for survivability The paper is more about what can go wrong with regular expression engines (e.g. PCRE), but it may help you understand what you're up against.

You have to not only worry about the matching itself, but how you do the matching. For example, if your input goes through some sort of eval phase or command substitution on its way to the regular expression engine there could be code that gets executed inside the pattern. Or, if your regular expression syntax allows for embedded commands you have to be wary of that, too. Since you didn't specify the language in your question it's hard to say for sure what all the security implications are.

A good way to test your RegEx's for security issues (at least for Windows) is the SDL RegEx fuzzing tool released by Microsoft recently. This can help avoid pathologically bad RegEx construction.

Related

Optimization techniques for backtracking regex implementations

I'm trying to implement a regular expression matcher based on the backtracking approach sketched in Exploring Ruby’s Regular Expression Algorithm. The compiled regex is translated into an array of virtual machine commands; for the backtracking the current command and input string indices as well as capturing group information is maintained on a stack.
In Regular Expression Matching: the Virtual Machine Approach Cox gives more detailed information about how to compile certain regex components into VM commands, though the discussed implementations are a bit different. Based on thoese articles my implementation works quite well for the standard grouping, character classes and repetition components.
Now I would like to see what extensions and optimization options there are for this type of implementation. Cox gives in his article a lot of useful information on the DFA/NFA approach, but the information about extensions or optimization techniques for the backtracking approach is a bit sparse.
For example, about backreferences he states
Backreferences are trivial in backtracking implementations.
and gives an idea for the DFA approach. But it's not clear to me how this can be "trivially" done with the VM approach. When the backreference command is reached, you'd have to compile the previously matched string from the corresponding group into another list of VM commands and somehow incorporate those commands into the current VM, or maintain a second VM and switch execution temporarily to that one.
He also mentions a possible optimization in repetitions by using look-aheads, but doesn't elaborate on how that would work. It seems to me this could be used to reduce the number items on the backtracking stack.
tl;dr What general optimization techniques exist for VM-based backtracking regex implementations and how do they work? Note that I'm not looking for optimizations specific to a certain programming language, but general techniques for this type of regex implemenations.
Edit: As mentioned in the first link, the Oniguruma library implements a regex matcher with exactly that stack-based backtracking approach. Perhaps someone can explain the optimizations done by that library which can be generalized to other implementations. Unfortunately, the library doesn't seem to provide any documentation on the source code and the code also lacks comments.
Edit 2: When reading about parsing expression grammars (PEGs), I stumbled upon a paper on a Lua PEG implementation which makes use of a similar VM-based approach. The paper mentions several optimization options to reduce the number of executed VM commands and an unnecessary growth of the backtracking stack.
I suggest you to watch full lection, it is very interesting, but here is outline:
Complexity explosion in backtracking. This happens then pattern have
ambiguity ([a-x]*[a-x0-9]*z in video, as an example) in it, so engine have to backtrack and test all conditions, until it become certain the pattern did (or didn't) match.
It can take up to O(Nᵖ), where p is "measure of ambiguity" of pattern.
To get O(pN), we need to avoid evaluating equivalent threads again and again.
...
Solution:
At one step ajust all threads by one character, "Breadth-first" execution results in linear comlexity.
Tricks to save every bit of performance
Inside std::regex
Hope this helps!
P.S Lector's repository

How to figure out if a regex implementation uses DFA or NFA?

I'm facing the question, whether a certain regex implementation is based on a DFA or NFA.
What are the starting points for me to figure this out. One could also ask: What am I looking for? What are the basic patterns and / or characteristics? A good and explanatory link or a little comparisons (even if not directly dedicated to regex) is perfectly fine.
If it's a black box, then give it some input and measure its time characteristics with a pathological case, with reference to the graphs in this discussion of NFS vs backtracking regex implementations. (note the NFS graph is microseconds not seconds).
Also, if it's a pure NFA, then it won't have some non-regular features which are found is some 'regular expression' parsers, which require backtracking.
Alternatively, look at the documentation of the RxParser class; documentation appears to be unavailable on the web and requires a squeak runtime to browse.
I think you mean "regex implementation" rather than algorithm (in the usual sense).
You could test with know expressions that are known to cause problems with one approach or the other. Also looking for features that are easier to implement in one or the other (this is not a reliable approach – the developers of regex engines find new ways to implement previously hard things).
Normally the answer is to read the documentation, or look in a known reference ("Mastering Regular Expressions" documents many popular cases). Finally why not ask the authors?

Are regular expressions worth the hassle?

It strikes me that regular expressions are not understood well by the majority of developers. It also strikes me that for a lot of problems where regular expressions are used, a lump of code could be used instead. Granted, it might be slower and be 20 lines for something like email validation, but if performance of the code is not desperately important, is it reasonable to assume that not using regular expressions might be better practise?
I'm thinking in terms of maintenance of the code rather that straight line execution time.
Maintaining one regular expression is a lot less effort than maintaining 20 lines of code. And you underestimate the amount of code needed - for a regex of any complexity, the replacement code could easily be 200 rather than 20 lines.
Professional developers should be familiar with basic syntax
At the very least. In all the years long I've been a professional developer I haven't come across a developer that wouldn't know what Regular Expressions are. It's true, not everybody likes using them or is very good at knowing its syntax, but that doesn't mean one shouldn't use them. Developers should learn the syntax and regular expressions should be used.
It's like: "Ok. We have Lambda expressions, but who cares, I can still do it the old fashioned way."
Not learning key aspects of professional development is pure laziness and shouldn't be tolerated for too long.
Whenever i use a Regex i always try to leave a comment explaining exactly how it's structured because I agree with you that not all developers understand them and going back to a regex, even if you've written it yourself, can be a headache to understand again.
That said, they definitely have their uses. Try stripping out all html elements from a box of text without it!
I'm thinking in terms of maintenance of the code rather that straight line execution time.
Code size is the single most important factor in reducing maintainability.
And while Regexps can be very hard to decipher, so are 50 line string processing methods - and the latter are more likely to contain bugs in rare corner cases.
The thing is: any non-trivial regexp must be commented just as thoroughly as you'd comment a 50 line method.
Regular expressions are a domain-specific language: no generic programming language is quite as expressive or quite as efficient at doing what regular expressions do with string matching. The sheer size of the lump of code you will have to write in a standard programming language (even one with a good string library) will make it harder to maintain. It is also a good separation-of-concerns to make sure that the regular expression only does the matching. Having a code blob that basically does matching, but does something else in-between can produce some surprising bugs.
Also note that there are mechanisms to make regular expressions more readable. In Python you can enable verbose mode, which allows you to write things like this:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
Another possibility is to build the regular expression up from strings, by line and comment each line, like this:
a = re.compile("\d+" # the integral part
"\." # the decimal point
"\d *" # fraction digits
)
This is possible in different ways in most programming languages. My advice is to keep using regular expressions where appropriate, but treat them like you do other code. Write them as clear as possible, comment them and test them.
You raise a very good point with regards to maintainability. Regular expressions can require some deciphering to understand but I doubt the code which would replace them would be easier to maintain. Regular Expressions are VERY powerful and a valuable tool. Use them but use them carefully, and think about how to make it clear what the intent of the regular expression is.
Regards
With great power comes great responsibility!
Regular expressions are great, but there can be a tendancy to over-use them! There are not suitable in all cases!
In my opinion, it might make more sense to enforce better practices with using regular expressesions other than forgoing it all together.
Always comment your regular expressions. You might know what it does now, but someone else might not and even you might not remember in two weeks. Moreover, descriptive comments should be used, stating exactly what the regular expression is meant to do.
Use unit testing. Create unit tests for your regular expressions. So can have a degree of assurance as to the reliability and correctness of your regular expression statement. And if the regex is being maintained, it would ensure that any code changes does not break existing functionality.
Using regular expression has some advantages:
Time. You don't have to write your own code to do exactly what is built in.
Maintainability. You have to maintain only a couple of lines as opposed to 30 or 300
Performance. The code is optimized
Reliability. If your regex statement is correct, it should function correctly.
Flexibility. Regex gives you a lot of power which is very useful if used properly
Think of regular expressions as the lingua Franca of string processing. You simply need to know them if you are going tocode in a professional capacity. Unless you just write SQL maybe.
I would just like to add that unit testing is the ideal way to make your regular expressions maintainable. I consider Regex an essential developer skill that is always a practical alternative to writing many lines of string manipulation code.
The most hassle I see is when people try to parse non-regular languages with regular expressions (yes, that includes all programming and many markup languages, yes, also HTML). I sometimes wish all coders had to demonstrate that they have understood at least the difference between context-free and regular languages before they are allowed to use regular expressions. Alternatively, they could get their regex license revoked when they are caught trying to parse non-regular languages with them. Yes, I'm joking, but only half.
The next problem arises when people try to do more than character matching in a regular expression, for example, checking for a valid date, perhaps even including leap year considerations (this could also lead to regex license revokation).
Regular expressions really are just a convenient shorthand for a finite state automaton (You know what that is, don't you? Where is your regex license, please?). The problems come from people expecting some kind of magic from them, not from the regular expressions themselves.
I see regex as a fast, readable and preferable way to perform pattern matching on string data. So many languages support regex for this reason. If you wanted to write string manipulation code to match say, a Canadian zip code, be my guest, but the regex equivalent is so much more succinct. Definitely worth it.
In .NET regex'es you can have comments, and break them up into multiple lines, use indenting etc. (I don't know about other dialects...)
Use the "ignore pattern whitespace" setting, and either # for commenting out the rest of the line, or "(#comments)" in your pattern...
So if you wanted to, you can actually make them sort of readable/maintainable...
I just ran into this issue. I built a regular expression to pull out groups of data from a long string of numbers and some other noise. The regex was quite long, though concise, and it got even bigger when i tried to add it to the C# app i was writing. In total the reg ex was 3 lines of code.
However it was painful to look at after i escaped it for C# and the other developers i work with don't under stand regular expressions. I ended up stripping out most of the noise characters and splitting on space to get the groups of data. Very simple code and only 5 lines.
Which is better? My ego says Regular Expressions. Any new hire would say character stripping.
I would never wish for fewer options in programming. Regular expressions can be very powerful, but do require skill. I like problems that can be solved in a few lines of code. It is really cool how many elements of validation can be accomplished. As long as the code is commented on what the expression checks for, I do not see a problem. I also have never seen a professional programmer not know what a regex was. It is another tool in the tool box.
Regex is one tool among many. But as many craftsmen will attest, the more tools you have at your disposal, and the more skilled you are at using them, the more likely you will become a Master Craftsman.
Is Regex worth the hassle to you? Dunno. Depends how seriously you take what you do.
It's a lot easier to see at first glance that a regex is probably correct. Why would I write a long state machine in code (probably containing bugs at first) when I could write a simple one line regex?
Regexes may be considered "write only", but I think that is sometimes a benefit. When writing a relatively simple regex from scratch, it's pretty easy to get it right.
True, learning to decipher regexes is difficult -- but so is learning to decipher the hosting program code in the first place. But is that so difficult, that we would rather write out manual instruction for a person to perform? No -- because that would be ridiculously longer and complicated. Same thing for not using a properly-formed regex.
I've found with reg ex it's easier to maintain, but fine tuning someone else's reg ex is a bit of a pain. I think you underestimate the developers by saying most people don't understand it. Usually what I found is that over time, requirements adjust, and the regex that used to validate something is no longer effective and attempting to remove portions that are no longer valid is harder than to just rewrite the entire thing.
Also, imagine if you were validating phone numbers, and you decided to use code instead of reg ex. So it amounts to let's say 20 lines. Over time, your company decides to expand to other regions where now the phone validation is no longer totally true. So you have to adjust it to fit the other requirements. It could be possible that the code would be harder to maintain because you have to adjust over 20 lines of code rather than simply removing the old reg ex, and replacing it with a new one.
However, I think code can be used in certain cases along with regex. For example, let's say you want to validate US phone numbers, in every case, it has 10 digits numbers, but there are literally a ton of ways to write it out. For example (xxx) xxx-xxxx, or xxx-xxx-xxxx, or xxx xxx xxxx, etc, etc, etc. So if you write reg ex, you'd have to account for each of the cases. However, if you just strip all non-numerics and spaces with a regex replace, then go for a second pass and check if it has 10 digits, you'd find it easier than accounting each and every possible way to write a phone number.
One thing that doesn't seem mentioned (from a quick scan of the answers above) is that regular expressions are useful outside of code too. That means they are worth the hassle for a coder, or even for end users.
For example, I just wrote a bunch of unit tests for a formatter. I then made a copy of the test, and used a single regex in my editor to invert values and resulting strings (changing the method name too), giving expected value to a string to parse...
Another example: in our product, we allow using regular expressions for searching or filtering columns of data: sometime it is useful to get only names starting with something, ending with something, with letters followed by digits, or similar: no need to be a master of regexes to use them.
In these cases, writing code isn't an option (well, I could have made a small Lua script in the first case, but it would have been longer) and performance isn't a major issue.
But even in code, I often find easier and more readable to use a simple regular expression than a bunch of substring() with complex offsets and whatnot. Beside, they shine to validate user input where, again, performance isn't an issue.
Due to the type of apps I build, the only RegEx's I regularly use are for email validation, html stripping, and character stripping to remove the garbage around phone numbers.
It's rare that I need to do very much string manipulation other than concatenation.
Incidentally, the apps are typically CRM's.
So the hassle for me is limited to googling for a regex in the event I find myself in need. ;)
Read the section under "Using Benchmarks" at JavaWorld.
Sure regular expressions are a very helpful tool, but I agree that they are overused and over complicate what can easily be a simple solution.
That being said, you should use regular expressions whenever the situation calls for it. Some things, such as searching for text in a string, can just as easily be done with an iterative search (or using the API searches), but for more complex situations you need regular expressions.
Surly all code needs to be optimized where possible!
In the context where code need not be optimized, and the logic will need to be maintained then it is down to the skill set of the team.
If the bulk of the team responsible for the code is regEX savvy then do it with a regEX. Else write it in the way the team is likely to be most comfortable with.
VB.net is best, No, C# is, No F# is the best. It's really more a matter of what will be the people maintaining be better suited to handle, in my opinion. That's more a flame question, than something that is absolutely answerable.
Personally I'd choose regex whenever there's complex string validation (phone numbers,emails, ss#, ip addresses) where there are well known regex's out there. Get it from regex.org, give attribution with a comment and/or get the authors permission whichever is appropriate, and be done with it.
Also, for extracting pieces of a string, or complex splitting of strings, regex can be a great time saver.
But if you're writing your own, rather than using someone else's, using something like regex buddy or sells brothers regexdesigner is a must for testing and validation.
It always depends on where it's used. If by doing the same task using a lump of code is being too complex and hard to maintain which can be a 1 liner less complex regex, then go with regex. Other wise use the lump of code.
Also I encountered problems which can I believe can only be answered by regex effective and concisely. Such question like this which can only be answered by another regex effectively: Dart regex for capturing groups but ignoring certain similar patterns

Should I avoid regular expressions? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 13 years ago.
Someone I know has been telling me that RegEx should be avoided, as it is heavyweight or involves heavy processing. Is this true? This made a clap in my ears, ringing my eardrums up until now.
I don't know why he told me that. Could it have been from experience or merely 3rd-hand information (you know what I mean...)?
So, stated plainly, why should I avoid regular expressions?
If you can easily do the same thing with common string operations, then you should avoid using a regular expression.
In most situations regular expressions are used where the same operation would require a substantial amount of common string operations, then there is of course no point in avoiding regular expressions.
Don't avoid them. They're an excellent tool, and when used appropriately can save you a lot of time and effort. Moreover, a good implementation used carefully should not be particularly CPU-intensive.
Overhyped? No. They're extremely powerful and flexible.
Overused? Absolutely. Particularly when it comes to parsing HTML (which frequently comes up here).
This is another of those "right tool for the job" scenarios. Some go too far and try to use it for everything.
You are right though in that you can do many things with substring and/or split. You will often reach a point with those where what you're doing will get so complicated that you have to change method or you just end up writing too much fragile code. Regexes are (relatively) easy to expand.
But hand written code will nearly always be faster. A good example of this is Putting char into a java string for each N characters. The regex solution is terser but has some issues that a hand written loop doesn't and is much slower.
You can substitute "regex" in your question with pretty much any technology and you'll find people who poorly understand the technology or too lazy to learn the technology making such claims.
There is nothing heavy-weight about regular expressions. The most common way that programmers get themselves into trouble using regular expressions is that they try to do too much with a single regular expression. If you use regular expressions for what they're intended (simple pattern matching), you'll be hard-pressed to write procedural code that's more efficient than the equivalent regular expression. Given decent proficiency with regular expressions, the regular expression takes much less time to write, is easier to read, and can be pasted into tools such as RegexBuddy for visualization.
As a basic parser or validator, use a regular expression unless the parsing or validation code you would otherwise write would be easier to read.
For complex parsers (i.e. recursive descent parsers) use regex only to validate lexical elements, not to find them.
The bottom line is, the best regex engines are well tuned for validation work, and in some cases may be more efficient than the code you yourself could write, and in others your code would perform better. Write your code using handwritten state machines or regex as you see fit, but change from regex to handwritten code if performance tests show you that regex is significantly inefficient.
"When you have a hammer, everything looks like a nail."
Regular expressions are a very useful tool; but I agree that they're not necessary for every single place they're used. One positive factor to them is that because they tend to be complex and very heavily used where they are, the algorithms to apply regular expressions tend to be rather well optimized. That said, the overhead involved in learning the regular expressions can be... high. Very high.
Are regular expressions the best tool to use for every applicable situation? Probably not, but on the other hand, if you work with string validation and search all the time, you probably use regular expressions a lot; and once you do, you already have the knowledge necessary to use the tool probably more efficiently and quickly than any other tool. But if you don't have that experience, learning it is effectively a drag on your productivity for that implementation. So I think it depends on the amount of time you're willing to put into learning a new paradigm, and the level of rush involved in your project. Overall, I think regular expressions are very worth learning, but at the same time, that learning process can, frankly, suck.
I think that if you learn programming in language that speaks regular expressions natively you'll gravitate toward them because they just solve so many problems. IE, you may never learn to use split because regexec() can solve a wider set of problems and once you get used to it, why look anywhere else?
On the other hand, I bet C and C++ programmers will for the most part look at other options first, since it's not built into the language.
You know, given the fact that I'm what many people call "young", I've heard too much criticism about RegEx. You know, "he had a problem and tried to use regex, now he has two problems".
Seriously, I don't get it. It is a tool like any other. If you need a simple website with some text, you don't need PHP/ASP.NET/STG44. Still, no discussion on whether any of those should be avoided. How odd.
In my experience, RegEx is probably the most useful tool I've ever encountered as a developer. It's the most useful tool when it comes to #1 security issue: parsing user input. I has saved me hours if not days of coding and creating potentially buggy (read: crappy) code.
With modern CPUs, I don't see what's the performance issue here. I'm quite willing to sacrifice some cycles for some quality and security. (It's not always the case, though, but I think those cases are rare.)
Still, RegEx is very powerful. With great power, comes great responsibility. It doesn't mean you'll use it whenever you can. Only where it's power is worth using.
As someone mentioned above, HTML parsing with RegEx is like a Russian roulette with a fully loaded gun. Don't overdo anything, RegEx included.
You should also avoid floating-point numbers at all cost. That is when you're programming in an embedded-environment.
Seriously: if you're in normal software-development you should actually use regex if you need to do something that can't be achieved with simpler string-operations. I'd say that any normal programmer won't be able to implement something that's best done using regexps in a way that is faster than the correspondig regular expression. Once compiled, a regular expression works as a state-maschine that is optimized to near perfection.
Overhyped? No
Under-Utilized Properly? Yes
If more people knew how to use a decent parser generator, there would be fewer people using regular expressions.
In my belief, they are overused by people quite a bit (I've had this discussion a number of times on SO).
But they are a very useful construct because they deliver a lot of expressive power in a very small piece of code.
You only have to look at an example such as a Western Australian car registration number. The RE would be
re.match("[1-9] [A-Z]{3} [0-9]{3}")
whilst the code to check this would be substantially longer, in either a simple 9-if-statement or slightly better looping version.
I hardly ever use complex REs in my code because:
I know how the RE engines work and I can use domain knowledge to code up faster solutions (that 9-if variant would almost certainly be faster than a one-shot RE compile/execute cycle); and
I find code more readable if it's broken up logically and commented. This isn't easy with most REs (although I have seen one that allows inline comments).
I have seen people suggest the use of REs for extracting a fixed-size substring at a fixed location. Why these people don't just use substring() is beyond me. My personal thought is that they're just trying to show how clever they are (but it rarely works).
Don't avoid it, but ask youself if they're the best tool for the task you have to solve. Maybe sometimes regex are difficult to use or debug, but they're really usefull in some situations. The question is to use the apropiate tool for each task, and usually this is not obvious.
There is a very good reason to use regular expressions in scripting languages (such as Ruby, Python, Perl, JavaScript and Lua): parsing a string with carefully optimized regular expression executes faster than the equivalent custom while loop which scans the string character-by-character. For compiled languages (such as C and C++, and also C# and Java most of the time) usually the opposite is true: the custom while loop executes faster.
One more reason why regular expressions are so popular: they express the programmer's intention in an extremely compact way: a single-line regexp can do as much as a 10- or 20-line while loop.
Overhyped? No, if you have ever taken a Parsing or Compiler course, you would understand that this is like saying addition and multiplication is overhyped for math problems.
It is a system for solving parsing problems.
some problems are simpler and don't require regular expressions, some are harder and require better tools.
I've seen so many people argue about whether a given regex is correct or not that I'm starting to think the best way to write one is to ask how to do it on StackOverflow and then let the regex gurus fight it out.
I think they're especially useful in JavaScript. JavaScript is transmitted (so should be small) and interpreted from text (although this is changing in the new browsers with V8 and JIT compilation), so a good internal regex engine has a chance to be faster than an algorithm.
I'd say if there is a clear and easy way to do it with string operations, use the string operations. But if you can do a nice regex instead of writing your own state machine interpreter, use the regex.
Regular Expressions are one of the most useful things programmers can learn, they allow to speed up and minimize your code if you know how to handle them.
Regular expressions are often easier to understand than the non-regex equivalent, especially in a language with native regular expressions, especially in a code section where other things that need to be done with regexes are present.
That doesn't meant they're not overused. The only time string.match(/\?/) is better than string.contains('?') is if it's significantly more readable with the surrounding code, or if you know that .contains is implemented with regexes anyway
I often use regex in my IDE to quick fix code. Try to do the following without regex.
glVector3f( -1.0f, 1.0f, 1.0f ); -> glVector3f( center.x - 1.0f, center.y + 1.0f, center.z + 1.0f );
Without regex, it's a pain, but WITH regex...
s/glVector3f\((.*?),(.*?),(.*?)\)/glVector3f(point.x+$1,point.y+$2,point.z+$3)/g
Awesome.
I'd agree that regular expressions are sometimes used inappropriately. Certainly for very simple cases like what you're describing, but also for cases where a more powerful parser is needed.
One consideration is that sometimes you have a condition that needs to do something simple like test for presence of a question mark character. But it's often true that the condition becomes more complex. For example, to find a question mark character that isn't preceded by a space or beginning-of-line, and isn't followed by an alphanumeric character. Or the character may be either a question mark or the Spanish "¿" (which may appear at the start of a word). You get the idea.
If conditions are expected to evolve into something that's less simple to do with a plain call to String.contains("?"), then it could be easier to code it using a very simple regular expression from the start.
It comes down to the right tool for the job.
I usually hear two arguments against regular expressions:
1) They're computationally inefficient, and
2) They're hard to understand.
Honestly, I can't understand how either are legitimate claims.
1) This may be true in an academic sense. A complex expression can double back on itself may times over. Does it really matter though? How many millions of computations a second can a server processor do these days? I've dealt with some crazy expressions, and I've never seen a regexp be the bottle neck. By far it's DB interaction, followed by bandwidth.
2) Hard for about a week. The most complicated regexp is no more complex than HTML - it's just a familiarity problem. If you needed HTML once every 3 months, would you get it 100% each time? Work with them on a daily basis and they're just as clear as any other language syntax.
I write validation software. REGEXP's are second nature. Every fifth line of code has a regexp, and for the life of me I can't understand why people make a big deal about them. I've never seen a regexp slow down processing, and I've seen even the most dull 'programmers' pick up the syntax.
Regexp's are powerful, efficient, and useful. Why avoid them?
I wouldn't say avoid them entirely, as they are QUITE handy at times. However, it is important to realize the fundamental mechanisms underneath. Depending on your implementation, you could have up to exponential run-time for a search, but since searches are usually bounded by some constant number of backtraces, you can end up with the slowest linear run-time you ever saw.
If you want the best answer, you'll have to examine your particular implementation as well as the data you intend to search on.
From memory, wikipedia has a decent article on regular expressions and the underlying algorithms.

Are regex tools (like RegexBuddy) a good idea?

One of my developers has started using RegexBuddy for help in interpreting legacy code, which is a usage I fully understand and support. What concerns me is using a regex tool for writing new code. I have actually discouraged its use for new code in my team. Two quotes come to mind:
Some people, when confronted with a
problem, think "I know, I’ll use
regular expressions." Now they have
two problems. - Jamie Zawinski
And:
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as
cleverly as possible, you are, by
definition, not smart enough to debug
it. - Brian Kernighan
My concerns are (respectively:)
That the tool may make it possible to solve a problem using a complicated regular expression that really doesn't need it. (See also this question).
That my one developer, using regex tools, will start writing regular expressions which (even with comments) can't be maintained by anyone who doesn't have (and know how to use) regex tools.
Should I encourage or discourage the use of regex tools, specifically with regard to producing new code? Are my concerns justified? Or am I being paranoid?
Poor programming is rarely the fault of the tool. It is the fault of the developer not understanding the tool. To me, this is like saying a carpenter should not own a screwdriver because he might use a screw where a nail would have been more appropriate.
Regular expressions are just one of the many tools available to you. I don't generally agree with the oft-cited Zawinski quote, as with any technology or technique, there are both good and bad ways to apply them.
Personally, I see things like RegexBuddy and the free Regex Coach primarily as learning tools. There are certainly times when they can be helpful to debug or understand existing regexes, but generally speaking, if you've written your regex using a tool, then it's going to be very hard to maintain it.
As a Perl programmer, I'm very familiar with both good and bad regular expressions, and have been using even complicated ones in production code successfully for many years. Here are a few of the guidelines I like to stick to that have been gathered from various places:
Don't use a regex when a string match will do. I often see code where people use regular expressions in order to match a string case-insensitively. Simply lower- or upper-case the string and perform a standard string comparison.
Don't use a regex to see if a string is one of several possible values. This is unnecessarily hard to maintain. Instead place the possible values in an array, hash (whatever your language provides) and test the string against those.
Write tests! Having a set of tests that specifically target your regular expression makes development significantly easier, particularly if it's a vaguely complicated one. Plus, a few tests can often answer many of the questions a maintenance programmer is likely to have about your regex.
Construct your regex out of smaller parts. If you really need a big complicated regex, build it out of smaller, testable sections. This not only makes development easier (as you can get each smaller section right individually), but it also makes the code more readable, flexible and allows for thorough commenting.
Build your regular expression into a dedicated subroutine/function/method. This makes it very easy to write tests for the regex (and only the regex). it also makes the code in which your regex is used easier to read (a nicely named function call is considerably less scary than a block of random punctuation!). Dropping huge regular expressions into the middle of a block of code (where they can't easily be tested in isolation) is extremely common, and usually very easy to avoid.
You should encourage the use of tools that make your developers more efficient. Having said that, it is important to make sure they're using the right tool for the job. You'll need to educate all of your team members on when it is appropriate to use a regular expression, and when (less|more) powerful methods are called for. Finally, any regular expression (IMHO) should be thoroughly commented to ensure that the next generation of developers can maintain it.
I'm not sure why there is so much diffidence against regex.
Yes, they can become messy and obscure, exactly as any other piece of code somebody may write but they have an advantage over code: they represent the set of strings one is interested to in a formally specified way (at least by your language if there are extensions). Understanding which set of strings is accepted by a piece of code will require "reverse engineering" the code.
Sure, you could discurage the use of regex as has already been done with recursion and goto's but this would be justifed to me only if there's a good alternative.
I would prefer maintain a single line regex code than a convoluted hand-made functions that tries to capture a set of strings.
On using a tool to understand a regex (or write a new one) I think it's perfectly fine! If somebody wrote it with the tool, somebody else could understand it with a tool! Actually, if you are worried about this, I would see tools like RegexBuddy your best insurance that the code will not be unmaintainable just because of the regex's
Regex testing tools are invaluable. I use them all the time. My job isn't even particularly regex heavy, so having a program to guide me through the nuances as I build my knowledge base is crucial.
Regular expressions are a great tool for a lot of text handling problems. If you have someone on your team who is writing regexes that the rest of the team don't understand, why not get them to teach the rest of you how they are working? Rather than a threat, you could be seeing this as an opportunity. That way you wouldn't have to feel threatened by the unknown and you'll have another very valuable tool in your arsenal.
Zawinski's comments, though entertainingly glib, are fundamentally a display of ignorance and writing Regular Expressions is not the whole of coding so I wouldn't worry about those quotes. Nobody ever got the whole of an argument into a one-liner anyways.
If you came across a Regular Expression that was too complicated to understand even with comments, then probably a regex wasn't a good solution for that particular problem, but that doesn't mean they have no use. I'd be willing to bet that if you've deliberately avoided them, there will be places in your codebase where you have many lines of code and a single, simple, Regex would have done the same job.
Regexbuddy is a useful shortcut, to make sure that the regular expressions you are writing do what you expect- it certainly makes life easier, but it's the matter of using them at all that is what seems important to me about your question.
Like others have said, I think using or not using such a tool is a neutral issue. More to the point: If a regular expression is so complicated that it needs inline comments, it is too complicated. I never comment my regexps. I approach large or complex matching problems by breaking it down into several steps of matching, either with multiple match statements (=~), or by building up a regexp with sub regexps.
Having said all that, I think any developer worth his salt should be reasonably proficient in regular expression writing and reading. I've been using regular expressions for years and have never encountered a time where I needed to write or read one that was terrifically complex. But a moderately sized one may be the most elegant and concise way to do a validation or match, and regexps should not be shied away from only because an inexperienced developer may not be able to read it -- better to educate that developer.
What you should be doing is getting your other devs hooked up with RB.
Don't worry about that whole "2 probs" quote; it seems that may have been a blast on Perl (said back in 1997) not regex.
I prefer not to use regex tools. If I can't write it by hand, then it means the output of the tool is something I don't understand and thus can't maintain. I'd much rather spend the time reading up on some regex feature than learning the regex tool. I don't understand the attitude of many programmers that regexes are a black art to be avoided/insulated from. It's just another programming language to be learned.
It's entirely possible that a regex tool would save me some time implementing regex features that I do know, but I doubt it... I can type pretty fast, and if you understand the syntax well (using a text editor where regexes are idiomatic really helps -- I use gVim), most regexes really aren't that complex. I think you're nearly always better served by learning a technology better rather than learning a crutch, unless the tool is something where you can put in simple info and get out a lot of boilerplate code.
Well, it sounds like the cure for that is for some smart person to introduce a regex tool that annotates itself as it matches. That would suggest that using a tool is not as much the issue as whether there is a big gap between what the tool understands and what the programmer understands.
So, documentation can help.
This is a real trivial example is a table like the following (just a suggestion)
Expression Match Reason
^ Pos 0 Start of input
\s+ " " At least one space
(abs|floor|ceil) ceil One of "abs", "floor", or "ceil"
...
I see the issue, though. You probably want to discourage people from building more complex regular expression than they can parse. I think standards can address this, by always requiring expanded REs and check that the annotation is proper.
However, if they just want to debug an RE, to make sure it's acting as they think it's acting, then it's not really much different from writing code you have to debug.
It's relative.
A couple of regex tools (for Node/JS, PHP and Python) i made (for some other projects) are available online to play and experiment.
regex-analyzer and regex-composer
github repo