Are regular expressions worth the hassle? - regex

It strikes me that regular expressions are not understood well by the majority of developers. It also strikes me that for a lot of problems where regular expressions are used, a lump of code could be used instead. Granted, it might be slower and be 20 lines for something like email validation, but if performance of the code is not desperately important, is it reasonable to assume that not using regular expressions might be better practise?
I'm thinking in terms of maintenance of the code rather that straight line execution time.

Maintaining one regular expression is a lot less effort than maintaining 20 lines of code. And you underestimate the amount of code needed - for a regex of any complexity, the replacement code could easily be 200 rather than 20 lines.

Professional developers should be familiar with basic syntax
At the very least. In all the years long I've been a professional developer I haven't come across a developer that wouldn't know what Regular Expressions are. It's true, not everybody likes using them or is very good at knowing its syntax, but that doesn't mean one shouldn't use them. Developers should learn the syntax and regular expressions should be used.
It's like: "Ok. We have Lambda expressions, but who cares, I can still do it the old fashioned way."
Not learning key aspects of professional development is pure laziness and shouldn't be tolerated for too long.

Whenever i use a Regex i always try to leave a comment explaining exactly how it's structured because I agree with you that not all developers understand them and going back to a regex, even if you've written it yourself, can be a headache to understand again.
That said, they definitely have their uses. Try stripping out all html elements from a box of text without it!

I'm thinking in terms of maintenance of the code rather that straight line execution time.
Code size is the single most important factor in reducing maintainability.
And while Regexps can be very hard to decipher, so are 50 line string processing methods - and the latter are more likely to contain bugs in rare corner cases.
The thing is: any non-trivial regexp must be commented just as thoroughly as you'd comment a 50 line method.

Regular expressions are a domain-specific language: no generic programming language is quite as expressive or quite as efficient at doing what regular expressions do with string matching. The sheer size of the lump of code you will have to write in a standard programming language (even one with a good string library) will make it harder to maintain. It is also a good separation-of-concerns to make sure that the regular expression only does the matching. Having a code blob that basically does matching, but does something else in-between can produce some surprising bugs.
Also note that there are mechanisms to make regular expressions more readable. In Python you can enable verbose mode, which allows you to write things like this:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
Another possibility is to build the regular expression up from strings, by line and comment each line, like this:
a = re.compile("\d+" # the integral part
"\." # the decimal point
"\d *" # fraction digits
)
This is possible in different ways in most programming languages. My advice is to keep using regular expressions where appropriate, but treat them like you do other code. Write them as clear as possible, comment them and test them.

You raise a very good point with regards to maintainability. Regular expressions can require some deciphering to understand but I doubt the code which would replace them would be easier to maintain. Regular Expressions are VERY powerful and a valuable tool. Use them but use them carefully, and think about how to make it clear what the intent of the regular expression is.
Regards

With great power comes great responsibility!
Regular expressions are great, but there can be a tendancy to over-use them! There are not suitable in all cases!

In my opinion, it might make more sense to enforce better practices with using regular expressesions other than forgoing it all together.
Always comment your regular expressions. You might know what it does now, but someone else might not and even you might not remember in two weeks. Moreover, descriptive comments should be used, stating exactly what the regular expression is meant to do.
Use unit testing. Create unit tests for your regular expressions. So can have a degree of assurance as to the reliability and correctness of your regular expression statement. And if the regex is being maintained, it would ensure that any code changes does not break existing functionality.
Using regular expression has some advantages:
Time. You don't have to write your own code to do exactly what is built in.
Maintainability. You have to maintain only a couple of lines as opposed to 30 or 300
Performance. The code is optimized
Reliability. If your regex statement is correct, it should function correctly.
Flexibility. Regex gives you a lot of power which is very useful if used properly

Think of regular expressions as the lingua Franca of string processing. You simply need to know them if you are going tocode in a professional capacity. Unless you just write SQL maybe.

I would just like to add that unit testing is the ideal way to make your regular expressions maintainable. I consider Regex an essential developer skill that is always a practical alternative to writing many lines of string manipulation code.

The most hassle I see is when people try to parse non-regular languages with regular expressions (yes, that includes all programming and many markup languages, yes, also HTML). I sometimes wish all coders had to demonstrate that they have understood at least the difference between context-free and regular languages before they are allowed to use regular expressions. Alternatively, they could get their regex license revoked when they are caught trying to parse non-regular languages with them. Yes, I'm joking, but only half.
The next problem arises when people try to do more than character matching in a regular expression, for example, checking for a valid date, perhaps even including leap year considerations (this could also lead to regex license revokation).
Regular expressions really are just a convenient shorthand for a finite state automaton (You know what that is, don't you? Where is your regex license, please?). The problems come from people expecting some kind of magic from them, not from the regular expressions themselves.

I see regex as a fast, readable and preferable way to perform pattern matching on string data. So many languages support regex for this reason. If you wanted to write string manipulation code to match say, a Canadian zip code, be my guest, but the regex equivalent is so much more succinct. Definitely worth it.

In .NET regex'es you can have comments, and break them up into multiple lines, use indenting etc. (I don't know about other dialects...)
Use the "ignore pattern whitespace" setting, and either # for commenting out the rest of the line, or "(#comments)" in your pattern...
So if you wanted to, you can actually make them sort of readable/maintainable...

I just ran into this issue. I built a regular expression to pull out groups of data from a long string of numbers and some other noise. The regex was quite long, though concise, and it got even bigger when i tried to add it to the C# app i was writing. In total the reg ex was 3 lines of code.
However it was painful to look at after i escaped it for C# and the other developers i work with don't under stand regular expressions. I ended up stripping out most of the noise characters and splitting on space to get the groups of data. Very simple code and only 5 lines.
Which is better? My ego says Regular Expressions. Any new hire would say character stripping.

I would never wish for fewer options in programming. Regular expressions can be very powerful, but do require skill. I like problems that can be solved in a few lines of code. It is really cool how many elements of validation can be accomplished. As long as the code is commented on what the expression checks for, I do not see a problem. I also have never seen a professional programmer not know what a regex was. It is another tool in the tool box.

Regex is one tool among many. But as many craftsmen will attest, the more tools you have at your disposal, and the more skilled you are at using them, the more likely you will become a Master Craftsman.
Is Regex worth the hassle to you? Dunno. Depends how seriously you take what you do.

It's a lot easier to see at first glance that a regex is probably correct. Why would I write a long state machine in code (probably containing bugs at first) when I could write a simple one line regex?
Regexes may be considered "write only", but I think that is sometimes a benefit. When writing a relatively simple regex from scratch, it's pretty easy to get it right.

True, learning to decipher regexes is difficult -- but so is learning to decipher the hosting program code in the first place. But is that so difficult, that we would rather write out manual instruction for a person to perform? No -- because that would be ridiculously longer and complicated. Same thing for not using a properly-formed regex.

I've found with reg ex it's easier to maintain, but fine tuning someone else's reg ex is a bit of a pain. I think you underestimate the developers by saying most people don't understand it. Usually what I found is that over time, requirements adjust, and the regex that used to validate something is no longer effective and attempting to remove portions that are no longer valid is harder than to just rewrite the entire thing.
Also, imagine if you were validating phone numbers, and you decided to use code instead of reg ex. So it amounts to let's say 20 lines. Over time, your company decides to expand to other regions where now the phone validation is no longer totally true. So you have to adjust it to fit the other requirements. It could be possible that the code would be harder to maintain because you have to adjust over 20 lines of code rather than simply removing the old reg ex, and replacing it with a new one.
However, I think code can be used in certain cases along with regex. For example, let's say you want to validate US phone numbers, in every case, it has 10 digits numbers, but there are literally a ton of ways to write it out. For example (xxx) xxx-xxxx, or xxx-xxx-xxxx, or xxx xxx xxxx, etc, etc, etc. So if you write reg ex, you'd have to account for each of the cases. However, if you just strip all non-numerics and spaces with a regex replace, then go for a second pass and check if it has 10 digits, you'd find it easier than accounting each and every possible way to write a phone number.

One thing that doesn't seem mentioned (from a quick scan of the answers above) is that regular expressions are useful outside of code too. That means they are worth the hassle for a coder, or even for end users.
For example, I just wrote a bunch of unit tests for a formatter. I then made a copy of the test, and used a single regex in my editor to invert values and resulting strings (changing the method name too), giving expected value to a string to parse...
Another example: in our product, we allow using regular expressions for searching or filtering columns of data: sometime it is useful to get only names starting with something, ending with something, with letters followed by digits, or similar: no need to be a master of regexes to use them.
In these cases, writing code isn't an option (well, I could have made a small Lua script in the first case, but it would have been longer) and performance isn't a major issue.
But even in code, I often find easier and more readable to use a simple regular expression than a bunch of substring() with complex offsets and whatnot. Beside, they shine to validate user input where, again, performance isn't an issue.

Due to the type of apps I build, the only RegEx's I regularly use are for email validation, html stripping, and character stripping to remove the garbage around phone numbers.
It's rare that I need to do very much string manipulation other than concatenation.
Incidentally, the apps are typically CRM's.
So the hassle for me is limited to googling for a regex in the event I find myself in need. ;)

Read the section under "Using Benchmarks" at JavaWorld.
Sure regular expressions are a very helpful tool, but I agree that they are overused and over complicate what can easily be a simple solution.
That being said, you should use regular expressions whenever the situation calls for it. Some things, such as searching for text in a string, can just as easily be done with an iterative search (or using the API searches), but for more complex situations you need regular expressions.

Surly all code needs to be optimized where possible!
In the context where code need not be optimized, and the logic will need to be maintained then it is down to the skill set of the team.
If the bulk of the team responsible for the code is regEX savvy then do it with a regEX. Else write it in the way the team is likely to be most comfortable with.

VB.net is best, No, C# is, No F# is the best. It's really more a matter of what will be the people maintaining be better suited to handle, in my opinion. That's more a flame question, than something that is absolutely answerable.
Personally I'd choose regex whenever there's complex string validation (phone numbers,emails, ss#, ip addresses) where there are well known regex's out there. Get it from regex.org, give attribution with a comment and/or get the authors permission whichever is appropriate, and be done with it.
Also, for extracting pieces of a string, or complex splitting of strings, regex can be a great time saver.
But if you're writing your own, rather than using someone else's, using something like regex buddy or sells brothers regexdesigner is a must for testing and validation.

It always depends on where it's used. If by doing the same task using a lump of code is being too complex and hard to maintain which can be a 1 liner less complex regex, then go with regex. Other wise use the lump of code.
Also I encountered problems which can I believe can only be answered by regex effective and concisely. Such question like this which can only be answered by another regex effectively: Dart regex for capturing groups but ignoring certain similar patterns

Related

When should I prefer regex over built-in string functions?

Some say I should use regex whenever possible, others say I should use it at least as possible. Is there something like a "Perl Etiquette" about that matter or just TIMTOWTDI?
The level of complexity generally dictates whether I use a regex or not. Some of the questions I ask when deciding whether or not to use a regex are:
Is there no built string function that handles this relatively easily?
Do I need to capture substring groups?
Do I need complex features like look behind or negative sets?
Am I going to make use of character sets?
Will using a regex make my code more readable?
If I answer yes to any of these, I generally use a regex.
I think a lot of the answers you got already are good. I want to address the etiquette part because I think there is some.
Summed up: if there is a robust parser available, use it instead of regular expressions; 100% of the time. Never recommend anything else to a novice. So–
Don'ts
Don't split or match against commas for CSV, use Text::CSV/Text::CSV_XS.
Don't write regexes against HTML or XML, use XML::LibXML, XML::Twig, HTML::TreeBuilder, HTML::TokeParser::Simple, et cetera.
Don't write regexes for things that are trivial to split or unpack.
Dos
Do use substr, index, and rindex where appropriate but recognize they can come off "unperly" so they are best used when benchmarking shows them superior to regular expressions; regexes can be surprisingly fast in many cases.
Do use regular expressions when there is no good parser available and writing a Parse::RecDescent grammar is overkill, too much work, or will be too slow.
Do use regular expressions for throw-away code like one-liners on well-known/predictable data including the HTML/CSV previously banned from regular expression use.
Do be aware of alternatives for bigger problems like P::RecD, Parse::Yapp, and Marpa.
Do keep your own council. Perl is supposed to be fun. Do whatever you like; just be prepared to get bashed if you complain when not following advice and it goes sideways. :P
I don't know of any "etiquette" about this.
Perl regex are highly optimized (that's one of the things the language is known for, although there are engines that are faster), and in the end, if your regex is so simple that it could be replaced by a string function, I don't believe that the regex will be any significantly less performant. If the problem you are trying to resolve is so time sensitive that you might look into other possibilities of optimization.
Another important aspect is readability. And I think that handling all string transformations through regex also add to this, insteas of mixing and matching different approaches.
Just my two cents.
Though I would classify this as too opinionated for SO, I'll give my point of view.
Use regex when the string is:
"Too Dynamic" (The string could have a lot of variation to it, that making use of the string library(ies) would be cumbersome.
"Contains patterns" if there is a genuine pattern to the string (and may be as simple as 1 character or a group of characters) this is where (i feel) regex excels.
"Too Complex" If you find yourself declaring a whole function block just to do what a single pattern can do, I can see it being worthwhile just to use regex. (However, see "Too Complex" below, too).
Do not use regex to be:
"Fast" Consider the overhead involved in spinning up a regex library over grabbing information directly from a string.
"Too Complex" Good code isn't always short. If you begin making a huge pattern to circumvent several lines of code, that's fine, but keep in mind it's at the risk of readability. Coming back to that piece and trying to wrap your head around it again may not be worth just doing the plain-jane method.
I'd say, if you need more than one or two string function calls to do it, use a regex. ;)
For things that are not too complex that the regex becomes bloated, affects the readability of code and cause performance issues. You can do it via a serious of steps, using builtin functions and other means. You may not have a cool single line regex, but your code will be readable and maintanable.
And also not too simple problems because, again, regexes are heavy weight and there are usually built-in functions that handled the simple scenarios.
It is going to depend on what you are going to do. Ofcourse, please don't use regex for parsing ( especially HTML etc. )
Perl is a great language for regex. It honestly has one of the greatest parsers of any language, so that is why you see so many "use regex" answers. I am not sure what the aversion to regex is, however.
My answer would be: can you sum up the work in a single pattern easier than using the string function, or do you need to use multiple string functions versus a single regex? In either case, I would aim for regex. Otherwise, do what feels comfortable for you.

does regex comparisons consume lots of resources?

i dunno, but will your machine suffer great slowdown if you use a very complex regex?
like for example the famous email validation module proposed just recently? which can be found here RFC822
update: sorry i had to ask this question in a hurry anyway i posted the link to the email regex i was talking about
It highly depends on the individual regex: features like look-behind or look-ahead can get very expensive, while simple regular expressions are fine for most situations.
Tutorials on http://www.regular-expressions.info/ offer performance advice, so that can be a good start.
Regexes are usually implemented as one of two algorithms (NFA or DFA) that correspond to two different FSMs. Different languages and even different versions of the same language may have a different type of regex. Naturally, some regexes work faster in one and some work faster in the other. If it's really critical, you might want to find what type of regex FSM is implemented.
I'm no expert here. I got all this from reading Mastering Regular Expressions by Jeffrey E. F. Friedl. You might want to look that up.
Depends also on how well you optimise your query, and knowing the internal working of regex.
Using the negated character class, for example, saves the cost of having the engine backtracking characters (i.e. /<[^>]+>/ instead of /<.+?>/)(*).Trivial in small matches, but saves a lot of cycles when you have to match inside a big chunk of text.
And there are many other ways to save resources in regex operations, so performance can vary wildly.
example taken from http://www.regular-expressions.info/repeat.html
You might be interested by articles like: Regular Expression Matching Can Be Simple And Fast or Understanding Regular Expressions.
It is, alas, easy to write inefficient REs, which can match quite quickly on success but can look for hours if no match is found, because the engine stupidly try a long match on every position of a long string!
There are a few recipes for this, like anchoring whenever it is possible, avoiding greediness if possible, etc.
Note that the giant e-mail expression isn't recent, and not necessarily slow: a short, simple expression can be slower than a more convoluted one!
Note also that in some situations (like e-mail, precisely), it can be more efficient (and maintainable!) to use a mix of regexes and code to handle cases, like splitting at #, handling different cases (first part starts with " or not, second part is IP address or domain, etc.).
Regexes are not the ultimate tool able to do everything, but it is a very useful tool well worth to master!
It depends on your regexp engine. As explained here (Regular Expression Matching Can Be Simple And Fast) there may be some important difference in the performance depending on the implementation.
You can't talk about regexes in general any more than you can talk about code in general.
Regular expressions are little programs on their own. Just as any given program may be fast or slow, any given regex may be fast or slow.
One thing to remember, however, is that the regular expression handler is is very well optimized to do its job and run the regex quickly.
I once made a program that analyzed a lot of text (a big code base, >300k lines). First I used regex but when I switched to using regular string functions it got a lot faster, like taking 40% of the time of the regex version. So while of course it depends, my thing got a lot faster.
Once I had written a greedy - accidentally, of course :-) - a multi-line regex and had it search/replace on 10 * 200 GB of text files. It was damn slow... So it depends what you write, and what you check.
Depends on the complexity of the expression and the language the expression is used with.
In JavaScript; you have to optimize everything. In C#; not so much.

How important is knowing Regexs?

My personal experience is that regexs solve problems that can't be efficiently solved any other way, and are so frequently required in a world where strings are as important as they are that not having a firm grasp of the subject would be sufficient reason for me to consider not hiring you as a senior programmer (a junior is always allowed the leeway of training).
However.
A number of responses on the recurrent "What's the regex for this?" type-questions suggest that a great deal of coders find them somewhere between unintelligible and opaque.
This is not about whether a simple indexOf or substring is a better solution, that's a technical matter, and sometimes the simple way is correct, sometimes a regex is, and sometimes neither (looking at you html parser questions).
This is about how important it is to understand Regexs and whether the anti-Regex opinion (that trite "...now they have two problems" thing) is merited or FUD.
Should a programmer should be expected to understand Regexs? Is this a required skill?
edit: just in case it isn't clear, I'm not asking whether I need to learn them (I'm a defender of the faith) but whether the anti-camp have are an evolutionary dead end or whether it's an unnecessary niche skill like InstallShield.
REs let you solve relatively complex problems that would otherwise require you to code up full parsers with backtracking and all that messy sort of stuff. I liken the use of REs to using chainsaws to chop down a tree instead of trying to do it with a piece of celery.
Once you've learned how to use the chainsaw safely, you'll never go back. People who continue to spout anti-RE propaganda will never be as productive as those of us who have learned to love them.
So yes, you should know how to use REs, even if you understand only the basic constructs. They're a tool just like any other.
There are some tasks where regular expressions are the best tool to use.
There are some tasks where regular expressions are pointlessly obscure.
There are some tasks where they're reasonably appropriate, but a different approach may be more readable.
In general, I think of using a regular expression when an actual pattern is involved. If you're just looking for a specific string, I wouldn't generally use a regex. As an example of a grey area, someone once asked on a newsgroup the best way to check whether one string contained any of a number of other strings. The two ways which came up were:
Build a regex with alternatives and perform a single match.
Test each string in turn with string.Contains.
Personally I think the latter way is much simpler - it doesn't require any thought about escaping the strings you're looking for, or any other knowledge of regular expressions (and their different flavours across different platforms).
As an example of somewhere that regular expressions are quite clearly the wrong choice, someone seriously proposed using a regular expression to test whether or not a string three characters long. Their regular expression didn't even work, despite them claiming that the reason they thought of regular expressions first is because they'd been using them for so long, and that they naturally sort of "thought" in regular expressions.
There are, however, plenty of examples where regular expressions really do make life easier - as I say, when you're actually matching patterns: "I want one letter, then three digits, then another letter" or whatever. I don't find myself using regular expressions very often, but when I do use them, they save a lot of work.
In short, I believe it's good to know regular expressions - but equally to be careful about when to use them. It's easy to end up with write-only code which could be made simpler to understand by rewriting with simple string operations, even if the resulting code is slightly longer.
EDIT: In response to the edit of the question...
I don't think it's a good idea to be evangelical about them - in my experience, that tends to lead to using them where an alternative would be simpler, and that just makes you look bad. On the other hand, if you come across someone writing complicated code to avoid using a regular expression, it's fine to point out that a regex would make the code simpler.
Personally I like to comment my regular expressions in quite a detailed way, splitting them up onto several lines with a comment between each line. That way they're easier to maintain, and it doesn't look like you're just trying to be "hard core" geeky (which can be the impression, even if it's not the actual intended aim).
I think the most important thing is to remember that short != readable. Never claim that using a regex is better because it requires less code - claim that it's better when it's genuinely simpler and easier to understand (or where there's a significant performance benefit, of course).
As a developer you should know the pros and cons of as many tools as possible that could provide pre-made solutions for your problems. Every developer should know how to work with regular expressions and have a feeling when they should be used and when it is besser to use simple string functions to achieve a goal.
Rejecting them outright because they are hard to read is no option in my opinion. A developer who thinks so strips himself of a valuable tool for searching and validating complex string patterns.
I have really mixed feelings. I have used them and know the bones of the syntax and something in me loves their conciseness. However they are not commonly understood and are a highly obfuscated form of code. I too would like to see performance comparisons against similar operations in plain code. There is no question that the exploded code will be more maintainable and more easily and widely understood, which is a serious consideration in any commercial software project.
Even if they turn out to be more performant, the argument for them taken to its logical conclusion would see us all embedding assembler into our code for important loops - perhaps we should. Neat and concise and very fast, but almost un-maintainable.
On balance I think that until the regex syntax becomes mainstream they probably cause more trouble than they solve and should be used only very carefully.
In the Steve Yegge's article, Five Essential Phone Screen Questions, you should read the section "Area Number Three: Scripting and Regular Expressions".
Steve Yegge has some interesting points. He gives real world problems he has encountered with clients having to parse 50,000 files for a particular pattern of a phone number. The applicants who know regular expressions tear through the problem in a few minutes while those who don't write monster multi-hundred line programs that are very unwieldy. This article convinced me I should learn regular expressions.
Not a brilliant answer but everywhere I've worked the following holds true
0 < Number of people who (fully) understand regex < 1
If I knew how to do it I'd write that previous expression as a regex, but I can't. The best I could come up with on the fly is s/fully/a little/g - that's my limit (and that's probably not a regex).
A more serious answer is that the right regex will solve all kinds of problems, with one(ish) line of code. But you'll have real problems debugging it if it goes wrong. Therefore IMHO a complex regex however 'clean/clever' is a liability, if it takes ten lines of code to replicate it, why's that a problem, is memory/disk space suddenly expensive again?
BTW I'd love to know if regexs are fast compared to code equivalent.
It is not clear what kind of answer you are expecting.
I can imagine roughly three kinds of answer to this question:
Regexen are essential to the education of professional programmers. They enable the use the powerful unix shell tools, and regex-based search-replace can dramatically cut down on text-munging handiwork that is a part of a programmer's life. Programmers that do not know regexen are just intelectually lazy which is a very bad trait for a programmer.
Regexps are kinda useful depending on the application domain. Surely, knowing how to write regexps is a valuable tool a programmer's chest, but most of the time you can do fine without using them. Also, regexps tend to be very hard to read, so abuse must be strongly discouraged.
Some nutcases like to put regexs everything (I'm looking at you, the perl guy who implemented a regex-based tetris in perl). But really, they are just a bit of computer science trivia whose only practical use is in writing parsers. They are widely taught because they make a good teaching topic on which to evaluate students, and like most such topics it can forgotten the second you step out of the exam room.
You will notice the careful use of the plural forms "regexen" (pro), "regexps" (careful neutral) and "regexs" (con).
Personally, I am of the first kind. Good programmers like to learn new languages, and they hate repetitive handiwork.
When you have to parse something (ranging from simple date strings to programming languages) you should know your tools and regular expressions are one of them.
But you should also know what you can do with regexes and what not. At this point it comes in handy if you know the Chomsky hierarchy
hierarchy. Otherwise you end up trying to use regular expressions to parse context-sensitive languages and wonder why you can't get your regex right.
The fact that all languages support regexs should mean something !
I think knowing a regex is a quite important skill. While the usage of regex in a programming environment/language is question of maintainable code, I find the knowledge of regex to be useful with some commands (say egrep), editors (vim, emacs etc.). Using a regex to do a find and replace in vim is very handy when you have a text file and you want to do some formatting once in a while.
I find it very useful to know regular expressions. They are a very powerful tool, and in my opinion there are problems that you simply can't solve without these.
I would however not take regular expressions as a killing criterion for "hiring you as a senior programmer". They are like the wealth of other tools in the world. You should really known them in a problem domain where you need them, but you cannot presume that someone already knows all of these.
"a junior is always allowed the leeway
of training"
If a senior isn't, then I would not hire him!
To the ones that argue how complex and unreadable a regular expression is: If the regexp solution to a problem is complex and unreadable, then probably the problem itself is! Good luck in solving it in an other way...
What does the following do?
"([A-Za-z][A-Za-z0-9+.-]{1,120}:A-Za-z0-9/{1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:#&~=%-]{0,1000}))?)"
How long did it take you to figure out? to debug?
Regexs are awesome for single-use throwaway programs, but long hairy regexps are not the best choice for programs that other people will need to maintain over the years.
I find that regex's can be very helpful depending on the type of programming that you do. However I probably write less than one regex a month, and because of this long interval between requiring regex's I forget alot about how they work.
I should probably go through mastering regular expressions or something similar someday.
Knowing when to use a regexp and the basics of how they work and what their limitations are is important. But filling your head with a lot of syntax rules that you probably won't need very often is just a pointless academic exercise.
A regexp crib sheet can be written on one sheet of A4 paper or a couple of pages in a textbook - no need to know this stuff by heart, If you use it every day it will stick. If you don't use it very often then the brain cells are probably better used for something else.
A developer thought he had one problem and tried to solve it using regex. Now he has 2 problems.
I agree with pretty much everything said here, and just need to include the mandatory quip:
Some people, when confronted with a
problem, think "I know, I'll use
regular expressions." Now they have
two problems.
(attributed to Jamie Zawinski)
Like most jokes, it contains a kernel of truth.

When is it best to use Regular Expressions over basic string splitting / substring'ing?

It seems that the choice to use string parsing vs. regular expressions comes up on a regular basis for me anytime a situation arises that I need part of a string, information about said string, etc.
The reason that this comes up is that we're evaluating a soap header's action, after it has been parsed into something manageable via the OperationContext object for WCF and then making decisions on that. Right now, the simple solution seems to be basic substring'ing to keep the implementation simple, but part of me wonders if RegEx would be better or more robust. The other part of me wonders if it'd be like using a shotgun to kill a fly in our particular scenario.
So I have to ask, what's the typical threshold that people use when trying to decide to use RegEx over typical string parsing. Note that I'm not very strong in Regular Expressions, and because of this, I try to shy away unless it's absolutely vital to avoid introducing more complication than I need.
If you couldn't tell by my choice of abbreviations, this is in .NET land (C#), but I believe that doesn't have much bearing on the question.
EDIT: It seems as per my typical Raybell charm, I've been too wordy or misleading in my question. I want to apologize. I was giving some background to help give clues as to what I was doing, not mislead people.
I'm basically looking for a guideline as to when to use substring, and variations thereof, over Regular Expressions and vice versa. And while some of the answers may have missed this (and again, my fault), I've genuinely appreciated them and up-voted as accordingly.
My main guideline is to use regular expressions for throwaway code, and for user-input validation. Or when I'm trying to find a specific pattern within a big glob of text. For most other purposes, I'll write a grammar and implement a simple parser.
One important guideline (that's really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language's grammar is recursive.
For example, consider a tiny "expression language" for evaluating parenthetized arithmetic expressions. Examples of "programs" in this language would look like this:
1 + 2
5 * (10 - 6)
((1 + 1) / (2 + 2)) / 3
A grammar is easy to write, and looks something like this:
DIGIT := ["0"-"9"]
NUMBER := (DIGIT)+
OPERATOR := ("+" | "-" | "*" | "/" )
EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
GROUP := "(" EXPRESSION ")"
With that grammar, you can build a recursive descent parser in a jiffy.
An equivalent regular expression is REALLY hard to write, because regular expressions don't usually have very good support for recursion.
Another good example is JSON ingestion. I've seen people try to consume JSON with regular expressions, and it's INSANE. JSON objects are recursive, so they're just begging for regular grammars and recursive descent parsers.
Hmmmmmmm... Looking at other people's responses, I think I may have answered the wrong question.
I interpreted it as "when should use use a simple regex, rather than a full-blown parser?" whereas most people seem to have interpreted the question as "when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?"
Given that interpretation, my answer is: never.
Okay.... one more edit.
I'll be a little more forgiving of the roll-your-own scheme. Just... don't call it "parsing" :o)
I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:
if (str.equals("DooWahDiddy")) // No problemo.
if (str.contains("destroy the earth")) // Okay.
if (str.indexOf(";") < str.length / 2) // Not bad.
Once your conditions contain multiple predicates, then you've started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.
if (str.startsWith("I") && str.endsWith("Widget") &&
(!str.contains("Monkey") || !str.contains("Pox"))) // Madness.
Regular expressions really aren't that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).
Here's a great reference:
http://www.regular-expressions.info/
PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.
The regex can be
easier to understand
express more clearly the intent
much shorter
easier to change/adapt
In some situations all of those advantages would be achieved by using a regex, in others only some are achieved (the regex is not really easy to understand for example) and in yet other situations the regex is harder to understand, obfuscates the intent, longer and hard to change.
The more of those (and possibly other) advantages I gain from the regex, the more likely I am to use them.
Possible rule of thumb: if understanding the regex would take minutes for someone who is somewhat familiar with regular expressions, then you don't want to use it (unless the "normal" code is even more convoluted ;-).
Hm ... still no simple rule-of-thumb, sorry.
[W]e're evaluating a soap header's
action and making decisions on that
Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even it if does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
To answer your question, in general the usage of regular expressions should be minimized as they're not very readable. Oftentimes you can combine string parsing and regular expressions (perhaps in a loop) to create a much simpler solution than regular expressions alone.
I would agree with what benjismith said, but want to elaborate just a bit. For very simple syntaxes, basic string parsing can work well, but so can regexes. I wouldn't call them overkill. If it works, it works - go with what you find simplest. And for moderate to intermediate string parsing, a regex is usually the way to go.
As soon as you start finding yourself needing to define a grammar however, i.e. complex string parsing, get back to using some sort of finite state machine or the likes as quickly as you can. Regexes simply don't scale well, to use the term loosely. They get complex, hard to interpret, and even incapable.
I've seen at least one project where the use of regexes kept growing and growing and soon they had trouble inserting new functionality. When it finally came time to do a new major release, they dumped all the regexes and went the route of a grammar parser.
When your required transformation isn't basic -- but is still conceptually simple.
no reason to pull out Regex if you're doing a straight string replacement, for example... its easier to just use the string.Replace
on the other hand, a complex rule with many conditionals or special cases that would take more than 50 characters of regex can be a nightmare to maintain later on if you don't explicitly write it out
I would always use a regex unless it's something very simple such as splitting a comma-separated string. If I think there's a chance the strings might one day get more complicated, I'll probably start with a regex.
I don't subscribe to the view that regexes are hard or complicated. It's one tool that every developer should learn and learn well. They have a myriad of uses, and once learned, this is exactly the sort of thing you never have to worry about ever again.
Regexes are rarely overkill - if the match is simple, so is the regex.
I would think the easiest way to know when to use regular expressions and when not to, is when your string search requires an IF/THEN statement or anything resembling this or that logic, then you need something better than a simple string comparison which is where regex shines.

Are regex tools (like RegexBuddy) a good idea?

One of my developers has started using RegexBuddy for help in interpreting legacy code, which is a usage I fully understand and support. What concerns me is using a regex tool for writing new code. I have actually discouraged its use for new code in my team. Two quotes come to mind:
Some people, when confronted with a
problem, think "I know, I’ll use
regular expressions." Now they have
two problems. - Jamie Zawinski
And:
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as
cleverly as possible, you are, by
definition, not smart enough to debug
it. - Brian Kernighan
My concerns are (respectively:)
That the tool may make it possible to solve a problem using a complicated regular expression that really doesn't need it. (See also this question).
That my one developer, using regex tools, will start writing regular expressions which (even with comments) can't be maintained by anyone who doesn't have (and know how to use) regex tools.
Should I encourage or discourage the use of regex tools, specifically with regard to producing new code? Are my concerns justified? Or am I being paranoid?
Poor programming is rarely the fault of the tool. It is the fault of the developer not understanding the tool. To me, this is like saying a carpenter should not own a screwdriver because he might use a screw where a nail would have been more appropriate.
Regular expressions are just one of the many tools available to you. I don't generally agree with the oft-cited Zawinski quote, as with any technology or technique, there are both good and bad ways to apply them.
Personally, I see things like RegexBuddy and the free Regex Coach primarily as learning tools. There are certainly times when they can be helpful to debug or understand existing regexes, but generally speaking, if you've written your regex using a tool, then it's going to be very hard to maintain it.
As a Perl programmer, I'm very familiar with both good and bad regular expressions, and have been using even complicated ones in production code successfully for many years. Here are a few of the guidelines I like to stick to that have been gathered from various places:
Don't use a regex when a string match will do. I often see code where people use regular expressions in order to match a string case-insensitively. Simply lower- or upper-case the string and perform a standard string comparison.
Don't use a regex to see if a string is one of several possible values. This is unnecessarily hard to maintain. Instead place the possible values in an array, hash (whatever your language provides) and test the string against those.
Write tests! Having a set of tests that specifically target your regular expression makes development significantly easier, particularly if it's a vaguely complicated one. Plus, a few tests can often answer many of the questions a maintenance programmer is likely to have about your regex.
Construct your regex out of smaller parts. If you really need a big complicated regex, build it out of smaller, testable sections. This not only makes development easier (as you can get each smaller section right individually), but it also makes the code more readable, flexible and allows for thorough commenting.
Build your regular expression into a dedicated subroutine/function/method. This makes it very easy to write tests for the regex (and only the regex). it also makes the code in which your regex is used easier to read (a nicely named function call is considerably less scary than a block of random punctuation!). Dropping huge regular expressions into the middle of a block of code (where they can't easily be tested in isolation) is extremely common, and usually very easy to avoid.
You should encourage the use of tools that make your developers more efficient. Having said that, it is important to make sure they're using the right tool for the job. You'll need to educate all of your team members on when it is appropriate to use a regular expression, and when (less|more) powerful methods are called for. Finally, any regular expression (IMHO) should be thoroughly commented to ensure that the next generation of developers can maintain it.
I'm not sure why there is so much diffidence against regex.
Yes, they can become messy and obscure, exactly as any other piece of code somebody may write but they have an advantage over code: they represent the set of strings one is interested to in a formally specified way (at least by your language if there are extensions). Understanding which set of strings is accepted by a piece of code will require "reverse engineering" the code.
Sure, you could discurage the use of regex as has already been done with recursion and goto's but this would be justifed to me only if there's a good alternative.
I would prefer maintain a single line regex code than a convoluted hand-made functions that tries to capture a set of strings.
On using a tool to understand a regex (or write a new one) I think it's perfectly fine! If somebody wrote it with the tool, somebody else could understand it with a tool! Actually, if you are worried about this, I would see tools like RegexBuddy your best insurance that the code will not be unmaintainable just because of the regex's
Regex testing tools are invaluable. I use them all the time. My job isn't even particularly regex heavy, so having a program to guide me through the nuances as I build my knowledge base is crucial.
Regular expressions are a great tool for a lot of text handling problems. If you have someone on your team who is writing regexes that the rest of the team don't understand, why not get them to teach the rest of you how they are working? Rather than a threat, you could be seeing this as an opportunity. That way you wouldn't have to feel threatened by the unknown and you'll have another very valuable tool in your arsenal.
Zawinski's comments, though entertainingly glib, are fundamentally a display of ignorance and writing Regular Expressions is not the whole of coding so I wouldn't worry about those quotes. Nobody ever got the whole of an argument into a one-liner anyways.
If you came across a Regular Expression that was too complicated to understand even with comments, then probably a regex wasn't a good solution for that particular problem, but that doesn't mean they have no use. I'd be willing to bet that if you've deliberately avoided them, there will be places in your codebase where you have many lines of code and a single, simple, Regex would have done the same job.
Regexbuddy is a useful shortcut, to make sure that the regular expressions you are writing do what you expect- it certainly makes life easier, but it's the matter of using them at all that is what seems important to me about your question.
Like others have said, I think using or not using such a tool is a neutral issue. More to the point: If a regular expression is so complicated that it needs inline comments, it is too complicated. I never comment my regexps. I approach large or complex matching problems by breaking it down into several steps of matching, either with multiple match statements (=~), or by building up a regexp with sub regexps.
Having said all that, I think any developer worth his salt should be reasonably proficient in regular expression writing and reading. I've been using regular expressions for years and have never encountered a time where I needed to write or read one that was terrifically complex. But a moderately sized one may be the most elegant and concise way to do a validation or match, and regexps should not be shied away from only because an inexperienced developer may not be able to read it -- better to educate that developer.
What you should be doing is getting your other devs hooked up with RB.
Don't worry about that whole "2 probs" quote; it seems that may have been a blast on Perl (said back in 1997) not regex.
I prefer not to use regex tools. If I can't write it by hand, then it means the output of the tool is something I don't understand and thus can't maintain. I'd much rather spend the time reading up on some regex feature than learning the regex tool. I don't understand the attitude of many programmers that regexes are a black art to be avoided/insulated from. It's just another programming language to be learned.
It's entirely possible that a regex tool would save me some time implementing regex features that I do know, but I doubt it... I can type pretty fast, and if you understand the syntax well (using a text editor where regexes are idiomatic really helps -- I use gVim), most regexes really aren't that complex. I think you're nearly always better served by learning a technology better rather than learning a crutch, unless the tool is something where you can put in simple info and get out a lot of boilerplate code.
Well, it sounds like the cure for that is for some smart person to introduce a regex tool that annotates itself as it matches. That would suggest that using a tool is not as much the issue as whether there is a big gap between what the tool understands and what the programmer understands.
So, documentation can help.
This is a real trivial example is a table like the following (just a suggestion)
Expression Match Reason
^ Pos 0 Start of input
\s+ " " At least one space
(abs|floor|ceil) ceil One of "abs", "floor", or "ceil"
...
I see the issue, though. You probably want to discourage people from building more complex regular expression than they can parse. I think standards can address this, by always requiring expanded REs and check that the annotation is proper.
However, if they just want to debug an RE, to make sure it's acting as they think it's acting, then it's not really much different from writing code you have to debug.
It's relative.
A couple of regex tools (for Node/JS, PHP and Python) i made (for some other projects) are available online to play and experiment.
regex-analyzer and regex-composer
github repo