My personal experience is that regexs solve problems that can't be efficiently solved any other way, and are so frequently required in a world where strings are as important as they are that not having a firm grasp of the subject would be sufficient reason for me to consider not hiring you as a senior programmer (a junior is always allowed the leeway of training).
However.
A number of responses on the recurrent "What's the regex for this?" type-questions suggest that a great deal of coders find them somewhere between unintelligible and opaque.
This is not about whether a simple indexOf or substring is a better solution, that's a technical matter, and sometimes the simple way is correct, sometimes a regex is, and sometimes neither (looking at you html parser questions).
This is about how important it is to understand Regexs and whether the anti-Regex opinion (that trite "...now they have two problems" thing) is merited or FUD.
Should a programmer should be expected to understand Regexs? Is this a required skill?
edit: just in case it isn't clear, I'm not asking whether I need to learn them (I'm a defender of the faith) but whether the anti-camp have are an evolutionary dead end or whether it's an unnecessary niche skill like InstallShield.
REs let you solve relatively complex problems that would otherwise require you to code up full parsers with backtracking and all that messy sort of stuff. I liken the use of REs to using chainsaws to chop down a tree instead of trying to do it with a piece of celery.
Once you've learned how to use the chainsaw safely, you'll never go back. People who continue to spout anti-RE propaganda will never be as productive as those of us who have learned to love them.
So yes, you should know how to use REs, even if you understand only the basic constructs. They're a tool just like any other.
There are some tasks where regular expressions are the best tool to use.
There are some tasks where regular expressions are pointlessly obscure.
There are some tasks where they're reasonably appropriate, but a different approach may be more readable.
In general, I think of using a regular expression when an actual pattern is involved. If you're just looking for a specific string, I wouldn't generally use a regex. As an example of a grey area, someone once asked on a newsgroup the best way to check whether one string contained any of a number of other strings. The two ways which came up were:
Build a regex with alternatives and perform a single match.
Test each string in turn with string.Contains.
Personally I think the latter way is much simpler - it doesn't require any thought about escaping the strings you're looking for, or any other knowledge of regular expressions (and their different flavours across different platforms).
As an example of somewhere that regular expressions are quite clearly the wrong choice, someone seriously proposed using a regular expression to test whether or not a string three characters long. Their regular expression didn't even work, despite them claiming that the reason they thought of regular expressions first is because they'd been using them for so long, and that they naturally sort of "thought" in regular expressions.
There are, however, plenty of examples where regular expressions really do make life easier - as I say, when you're actually matching patterns: "I want one letter, then three digits, then another letter" or whatever. I don't find myself using regular expressions very often, but when I do use them, they save a lot of work.
In short, I believe it's good to know regular expressions - but equally to be careful about when to use them. It's easy to end up with write-only code which could be made simpler to understand by rewriting with simple string operations, even if the resulting code is slightly longer.
EDIT: In response to the edit of the question...
I don't think it's a good idea to be evangelical about them - in my experience, that tends to lead to using them where an alternative would be simpler, and that just makes you look bad. On the other hand, if you come across someone writing complicated code to avoid using a regular expression, it's fine to point out that a regex would make the code simpler.
Personally I like to comment my regular expressions in quite a detailed way, splitting them up onto several lines with a comment between each line. That way they're easier to maintain, and it doesn't look like you're just trying to be "hard core" geeky (which can be the impression, even if it's not the actual intended aim).
I think the most important thing is to remember that short != readable. Never claim that using a regex is better because it requires less code - claim that it's better when it's genuinely simpler and easier to understand (or where there's a significant performance benefit, of course).
As a developer you should know the pros and cons of as many tools as possible that could provide pre-made solutions for your problems. Every developer should know how to work with regular expressions and have a feeling when they should be used and when it is besser to use simple string functions to achieve a goal.
Rejecting them outright because they are hard to read is no option in my opinion. A developer who thinks so strips himself of a valuable tool for searching and validating complex string patterns.
I have really mixed feelings. I have used them and know the bones of the syntax and something in me loves their conciseness. However they are not commonly understood and are a highly obfuscated form of code. I too would like to see performance comparisons against similar operations in plain code. There is no question that the exploded code will be more maintainable and more easily and widely understood, which is a serious consideration in any commercial software project.
Even if they turn out to be more performant, the argument for them taken to its logical conclusion would see us all embedding assembler into our code for important loops - perhaps we should. Neat and concise and very fast, but almost un-maintainable.
On balance I think that until the regex syntax becomes mainstream they probably cause more trouble than they solve and should be used only very carefully.
In the Steve Yegge's article, Five Essential Phone Screen Questions, you should read the section "Area Number Three: Scripting and Regular Expressions".
Steve Yegge has some interesting points. He gives real world problems he has encountered with clients having to parse 50,000 files for a particular pattern of a phone number. The applicants who know regular expressions tear through the problem in a few minutes while those who don't write monster multi-hundred line programs that are very unwieldy. This article convinced me I should learn regular expressions.
Not a brilliant answer but everywhere I've worked the following holds true
0 < Number of people who (fully) understand regex < 1
If I knew how to do it I'd write that previous expression as a regex, but I can't. The best I could come up with on the fly is s/fully/a little/g - that's my limit (and that's probably not a regex).
A more serious answer is that the right regex will solve all kinds of problems, with one(ish) line of code. But you'll have real problems debugging it if it goes wrong. Therefore IMHO a complex regex however 'clean/clever' is a liability, if it takes ten lines of code to replicate it, why's that a problem, is memory/disk space suddenly expensive again?
BTW I'd love to know if regexs are fast compared to code equivalent.
It is not clear what kind of answer you are expecting.
I can imagine roughly three kinds of answer to this question:
Regexen are essential to the education of professional programmers. They enable the use the powerful unix shell tools, and regex-based search-replace can dramatically cut down on text-munging handiwork that is a part of a programmer's life. Programmers that do not know regexen are just intelectually lazy which is a very bad trait for a programmer.
Regexps are kinda useful depending on the application domain. Surely, knowing how to write regexps is a valuable tool a programmer's chest, but most of the time you can do fine without using them. Also, regexps tend to be very hard to read, so abuse must be strongly discouraged.
Some nutcases like to put regexs everything (I'm looking at you, the perl guy who implemented a regex-based tetris in perl). But really, they are just a bit of computer science trivia whose only practical use is in writing parsers. They are widely taught because they make a good teaching topic on which to evaluate students, and like most such topics it can forgotten the second you step out of the exam room.
You will notice the careful use of the plural forms "regexen" (pro), "regexps" (careful neutral) and "regexs" (con).
Personally, I am of the first kind. Good programmers like to learn new languages, and they hate repetitive handiwork.
When you have to parse something (ranging from simple date strings to programming languages) you should know your tools and regular expressions are one of them.
But you should also know what you can do with regexes and what not. At this point it comes in handy if you know the Chomsky hierarchy
hierarchy. Otherwise you end up trying to use regular expressions to parse context-sensitive languages and wonder why you can't get your regex right.
The fact that all languages support regexs should mean something !
I think knowing a regex is a quite important skill. While the usage of regex in a programming environment/language is question of maintainable code, I find the knowledge of regex to be useful with some commands (say egrep), editors (vim, emacs etc.). Using a regex to do a find and replace in vim is very handy when you have a text file and you want to do some formatting once in a while.
I find it very useful to know regular expressions. They are a very powerful tool, and in my opinion there are problems that you simply can't solve without these.
I would however not take regular expressions as a killing criterion for "hiring you as a senior programmer". They are like the wealth of other tools in the world. You should really known them in a problem domain where you need them, but you cannot presume that someone already knows all of these.
"a junior is always allowed the leeway
of training"
If a senior isn't, then I would not hire him!
To the ones that argue how complex and unreadable a regular expression is: If the regexp solution to a problem is complex and unreadable, then probably the problem itself is! Good luck in solving it in an other way...
What does the following do?
"([A-Za-z][A-Za-z0-9+.-]{1,120}:A-Za-z0-9/{1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:#&~=%-]{0,1000}))?)"
How long did it take you to figure out? to debug?
Regexs are awesome for single-use throwaway programs, but long hairy regexps are not the best choice for programs that other people will need to maintain over the years.
I find that regex's can be very helpful depending on the type of programming that you do. However I probably write less than one regex a month, and because of this long interval between requiring regex's I forget alot about how they work.
I should probably go through mastering regular expressions or something similar someday.
Knowing when to use a regexp and the basics of how they work and what their limitations are is important. But filling your head with a lot of syntax rules that you probably won't need very often is just a pointless academic exercise.
A regexp crib sheet can be written on one sheet of A4 paper or a couple of pages in a textbook - no need to know this stuff by heart, If you use it every day it will stick. If you don't use it very often then the brain cells are probably better used for something else.
A developer thought he had one problem and tried to solve it using regex. Now he has 2 problems.
I agree with pretty much everything said here, and just need to include the mandatory quip:
Some people, when confronted with a
problem, think "I know, I'll use
regular expressions." Now they have
two problems.
(attributed to Jamie Zawinski)
Like most jokes, it contains a kernel of truth.
Related
I know what you're thinking - "oh my god, seriously, not again" - but please bear with me, my question is more than the title. Before we begin, I promise I will never try to parse arbitrary HTML with a regex, or ask anyone else how.
All of the many, many answers here explaining why you cannot do this rely on the formal definition of regular expressions. They parse regular languages, HTML is context-free but not regular, so you can't do it. But I have also heard that many regex implementations in various languages are not strictly regular; they come with extra tricks that break outside the bounds of formal regular expressions.
Since I don't know the details of any particular implementations, such as perl, my questions are:
Which features of regex tools are non-regular? Is it the back references? And in which languages are they found?
Are any of these extra tricks sufficient to parse all context-free languages?
If "no" to #2, then is there a formal category or class of languages that these extra features cover exactly? How can we quickly know whether the problem we are trying to solve is within the power of our not-necessarily-regular expressions?
The answer to your question is that yes, so-called “extended regexes” — which are perhaps more properly called patterns than regular expressions in the formal sense — such as those found in Perl and PCRE are indeed capable of recursive descent parsing of context-free grammars.
This posting’s pair of approaches illustrate not so much theoretical as rather practical limits to applying regexes to X/HTML. The first approach given there, the one labelled naïve, is more like the sort you are apt to find in most programs that make such an attempt. This can be made to work on well-defined, non-generic X/HTML, often with very little effort. That is its best application, just as open-ended X/HTML is its worst.
The second approach, labelled wizardly, uses an actual grammar for parsing. As such, it is fully as powerful as any other grammatical approach. However, it is also far beyond the powers of the overwhelming majority of casual programmers. It also risks re-creating a perfectly fine wheel for negative benefit. I wrote it to show what can be done, but which under virtually no circumstances whatsoever ever should be done. I wanted to show people why they want to use a parser on open-ended X/HTML by showing them how devilishly hard it is to come even close to getting right even using some of the most powerful of pattern-matching facilities currently available.
Many have misread my posting as somehow advocating the opposite of what I am actually saying. Please make no mistake: I’m saying that it is far too complicated to use. It is a proof by counter-example. I had hoped that by showing how to do it with regexes, people would realize why they did not want to go down that road. While all things are possible, not all are expedient.
My personal rule of thumb is that if the required regex is of only the first category, I may well use it, but that if it requires the fully grammatical treatment of the second category, I use someone else’s already-written parser. So even though I can write a parser, I see no reason to do so, and plenty not to.
When carefully crafted for that explicit purpose, patterns can be more resisilient to malformed X/HTML than off-the-shelf parsers tend to be, particularly if you have no real opportunity to hack on said parsers to make them more resilient to the common failure cases that web browsers tend to tolerate but validators do not. However, the grammatical patterns I provide above were designed for only well-formed but reasonably generic HTML (albeit without entity replacement, which is easily enough added). Error recovery in parsers is a separate issue altogether, and by no means a pleasant one.
Patterns, especially the far more commonplace non-grammatical ones most people are used to seeing and using, are much better suited for grabbing up discrete chunks one at a time than they are for producing a full syntactic analysys. In other words, regexes usually work better for lexing than they do for parsing. Without grammatical regexes, you should not try parsing grammars.
But don’t take that too far. I certainly do not mean to imply that you should immediately turn to a full-blown parser just because you want to tackle something that is recursively defined. The easiest and perhaps most commonly seen example of this sort of thing is a pattern to detect nested items, like parentheses. It’s extremely common for me to just plop down something simple like this in my code, and be done with it:
# delete all nested parens
s/\((?:[^()]*+|(?0))*\)//g;
Yes, the extensions in questions are backreferences, and they technically make "regexps" NP-complete, see the Wikipedia paragraph.
It strikes me that regular expressions are not understood well by the majority of developers. It also strikes me that for a lot of problems where regular expressions are used, a lump of code could be used instead. Granted, it might be slower and be 20 lines for something like email validation, but if performance of the code is not desperately important, is it reasonable to assume that not using regular expressions might be better practise?
I'm thinking in terms of maintenance of the code rather that straight line execution time.
Maintaining one regular expression is a lot less effort than maintaining 20 lines of code. And you underestimate the amount of code needed - for a regex of any complexity, the replacement code could easily be 200 rather than 20 lines.
Professional developers should be familiar with basic syntax
At the very least. In all the years long I've been a professional developer I haven't come across a developer that wouldn't know what Regular Expressions are. It's true, not everybody likes using them or is very good at knowing its syntax, but that doesn't mean one shouldn't use them. Developers should learn the syntax and regular expressions should be used.
It's like: "Ok. We have Lambda expressions, but who cares, I can still do it the old fashioned way."
Not learning key aspects of professional development is pure laziness and shouldn't be tolerated for too long.
Whenever i use a Regex i always try to leave a comment explaining exactly how it's structured because I agree with you that not all developers understand them and going back to a regex, even if you've written it yourself, can be a headache to understand again.
That said, they definitely have their uses. Try stripping out all html elements from a box of text without it!
I'm thinking in terms of maintenance of the code rather that straight line execution time.
Code size is the single most important factor in reducing maintainability.
And while Regexps can be very hard to decipher, so are 50 line string processing methods - and the latter are more likely to contain bugs in rare corner cases.
The thing is: any non-trivial regexp must be commented just as thoroughly as you'd comment a 50 line method.
Regular expressions are a domain-specific language: no generic programming language is quite as expressive or quite as efficient at doing what regular expressions do with string matching. The sheer size of the lump of code you will have to write in a standard programming language (even one with a good string library) will make it harder to maintain. It is also a good separation-of-concerns to make sure that the regular expression only does the matching. Having a code blob that basically does matching, but does something else in-between can produce some surprising bugs.
Also note that there are mechanisms to make regular expressions more readable. In Python you can enable verbose mode, which allows you to write things like this:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
Another possibility is to build the regular expression up from strings, by line and comment each line, like this:
a = re.compile("\d+" # the integral part
"\." # the decimal point
"\d *" # fraction digits
)
This is possible in different ways in most programming languages. My advice is to keep using regular expressions where appropriate, but treat them like you do other code. Write them as clear as possible, comment them and test them.
You raise a very good point with regards to maintainability. Regular expressions can require some deciphering to understand but I doubt the code which would replace them would be easier to maintain. Regular Expressions are VERY powerful and a valuable tool. Use them but use them carefully, and think about how to make it clear what the intent of the regular expression is.
Regards
With great power comes great responsibility!
Regular expressions are great, but there can be a tendancy to over-use them! There are not suitable in all cases!
In my opinion, it might make more sense to enforce better practices with using regular expressesions other than forgoing it all together.
Always comment your regular expressions. You might know what it does now, but someone else might not and even you might not remember in two weeks. Moreover, descriptive comments should be used, stating exactly what the regular expression is meant to do.
Use unit testing. Create unit tests for your regular expressions. So can have a degree of assurance as to the reliability and correctness of your regular expression statement. And if the regex is being maintained, it would ensure that any code changes does not break existing functionality.
Using regular expression has some advantages:
Time. You don't have to write your own code to do exactly what is built in.
Maintainability. You have to maintain only a couple of lines as opposed to 30 or 300
Performance. The code is optimized
Reliability. If your regex statement is correct, it should function correctly.
Flexibility. Regex gives you a lot of power which is very useful if used properly
Think of regular expressions as the lingua Franca of string processing. You simply need to know them if you are going tocode in a professional capacity. Unless you just write SQL maybe.
I would just like to add that unit testing is the ideal way to make your regular expressions maintainable. I consider Regex an essential developer skill that is always a practical alternative to writing many lines of string manipulation code.
The most hassle I see is when people try to parse non-regular languages with regular expressions (yes, that includes all programming and many markup languages, yes, also HTML). I sometimes wish all coders had to demonstrate that they have understood at least the difference between context-free and regular languages before they are allowed to use regular expressions. Alternatively, they could get their regex license revoked when they are caught trying to parse non-regular languages with them. Yes, I'm joking, but only half.
The next problem arises when people try to do more than character matching in a regular expression, for example, checking for a valid date, perhaps even including leap year considerations (this could also lead to regex license revokation).
Regular expressions really are just a convenient shorthand for a finite state automaton (You know what that is, don't you? Where is your regex license, please?). The problems come from people expecting some kind of magic from them, not from the regular expressions themselves.
I see regex as a fast, readable and preferable way to perform pattern matching on string data. So many languages support regex for this reason. If you wanted to write string manipulation code to match say, a Canadian zip code, be my guest, but the regex equivalent is so much more succinct. Definitely worth it.
In .NET regex'es you can have comments, and break them up into multiple lines, use indenting etc. (I don't know about other dialects...)
Use the "ignore pattern whitespace" setting, and either # for commenting out the rest of the line, or "(#comments)" in your pattern...
So if you wanted to, you can actually make them sort of readable/maintainable...
I just ran into this issue. I built a regular expression to pull out groups of data from a long string of numbers and some other noise. The regex was quite long, though concise, and it got even bigger when i tried to add it to the C# app i was writing. In total the reg ex was 3 lines of code.
However it was painful to look at after i escaped it for C# and the other developers i work with don't under stand regular expressions. I ended up stripping out most of the noise characters and splitting on space to get the groups of data. Very simple code and only 5 lines.
Which is better? My ego says Regular Expressions. Any new hire would say character stripping.
I would never wish for fewer options in programming. Regular expressions can be very powerful, but do require skill. I like problems that can be solved in a few lines of code. It is really cool how many elements of validation can be accomplished. As long as the code is commented on what the expression checks for, I do not see a problem. I also have never seen a professional programmer not know what a regex was. It is another tool in the tool box.
Regex is one tool among many. But as many craftsmen will attest, the more tools you have at your disposal, and the more skilled you are at using them, the more likely you will become a Master Craftsman.
Is Regex worth the hassle to you? Dunno. Depends how seriously you take what you do.
It's a lot easier to see at first glance that a regex is probably correct. Why would I write a long state machine in code (probably containing bugs at first) when I could write a simple one line regex?
Regexes may be considered "write only", but I think that is sometimes a benefit. When writing a relatively simple regex from scratch, it's pretty easy to get it right.
True, learning to decipher regexes is difficult -- but so is learning to decipher the hosting program code in the first place. But is that so difficult, that we would rather write out manual instruction for a person to perform? No -- because that would be ridiculously longer and complicated. Same thing for not using a properly-formed regex.
I've found with reg ex it's easier to maintain, but fine tuning someone else's reg ex is a bit of a pain. I think you underestimate the developers by saying most people don't understand it. Usually what I found is that over time, requirements adjust, and the regex that used to validate something is no longer effective and attempting to remove portions that are no longer valid is harder than to just rewrite the entire thing.
Also, imagine if you were validating phone numbers, and you decided to use code instead of reg ex. So it amounts to let's say 20 lines. Over time, your company decides to expand to other regions where now the phone validation is no longer totally true. So you have to adjust it to fit the other requirements. It could be possible that the code would be harder to maintain because you have to adjust over 20 lines of code rather than simply removing the old reg ex, and replacing it with a new one.
However, I think code can be used in certain cases along with regex. For example, let's say you want to validate US phone numbers, in every case, it has 10 digits numbers, but there are literally a ton of ways to write it out. For example (xxx) xxx-xxxx, or xxx-xxx-xxxx, or xxx xxx xxxx, etc, etc, etc. So if you write reg ex, you'd have to account for each of the cases. However, if you just strip all non-numerics and spaces with a regex replace, then go for a second pass and check if it has 10 digits, you'd find it easier than accounting each and every possible way to write a phone number.
One thing that doesn't seem mentioned (from a quick scan of the answers above) is that regular expressions are useful outside of code too. That means they are worth the hassle for a coder, or even for end users.
For example, I just wrote a bunch of unit tests for a formatter. I then made a copy of the test, and used a single regex in my editor to invert values and resulting strings (changing the method name too), giving expected value to a string to parse...
Another example: in our product, we allow using regular expressions for searching or filtering columns of data: sometime it is useful to get only names starting with something, ending with something, with letters followed by digits, or similar: no need to be a master of regexes to use them.
In these cases, writing code isn't an option (well, I could have made a small Lua script in the first case, but it would have been longer) and performance isn't a major issue.
But even in code, I often find easier and more readable to use a simple regular expression than a bunch of substring() with complex offsets and whatnot. Beside, they shine to validate user input where, again, performance isn't an issue.
Due to the type of apps I build, the only RegEx's I regularly use are for email validation, html stripping, and character stripping to remove the garbage around phone numbers.
It's rare that I need to do very much string manipulation other than concatenation.
Incidentally, the apps are typically CRM's.
So the hassle for me is limited to googling for a regex in the event I find myself in need. ;)
Read the section under "Using Benchmarks" at JavaWorld.
Sure regular expressions are a very helpful tool, but I agree that they are overused and over complicate what can easily be a simple solution.
That being said, you should use regular expressions whenever the situation calls for it. Some things, such as searching for text in a string, can just as easily be done with an iterative search (or using the API searches), but for more complex situations you need regular expressions.
Surly all code needs to be optimized where possible!
In the context where code need not be optimized, and the logic will need to be maintained then it is down to the skill set of the team.
If the bulk of the team responsible for the code is regEX savvy then do it with a regEX. Else write it in the way the team is likely to be most comfortable with.
VB.net is best, No, C# is, No F# is the best. It's really more a matter of what will be the people maintaining be better suited to handle, in my opinion. That's more a flame question, than something that is absolutely answerable.
Personally I'd choose regex whenever there's complex string validation (phone numbers,emails, ss#, ip addresses) where there are well known regex's out there. Get it from regex.org, give attribution with a comment and/or get the authors permission whichever is appropriate, and be done with it.
Also, for extracting pieces of a string, or complex splitting of strings, regex can be a great time saver.
But if you're writing your own, rather than using someone else's, using something like regex buddy or sells brothers regexdesigner is a must for testing and validation.
It always depends on where it's used. If by doing the same task using a lump of code is being too complex and hard to maintain which can be a 1 liner less complex regex, then go with regex. Other wise use the lump of code.
Also I encountered problems which can I believe can only be answered by regex effective and concisely. Such question like this which can only be answered by another regex effectively: Dart regex for capturing groups but ignoring certain similar patterns
As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 13 years ago.
Someone I know has been telling me that RegEx should be avoided, as it is heavyweight or involves heavy processing. Is this true? This made a clap in my ears, ringing my eardrums up until now.
I don't know why he told me that. Could it have been from experience or merely 3rd-hand information (you know what I mean...)?
So, stated plainly, why should I avoid regular expressions?
If you can easily do the same thing with common string operations, then you should avoid using a regular expression.
In most situations regular expressions are used where the same operation would require a substantial amount of common string operations, then there is of course no point in avoiding regular expressions.
Don't avoid them. They're an excellent tool, and when used appropriately can save you a lot of time and effort. Moreover, a good implementation used carefully should not be particularly CPU-intensive.
Overhyped? No. They're extremely powerful and flexible.
Overused? Absolutely. Particularly when it comes to parsing HTML (which frequently comes up here).
This is another of those "right tool for the job" scenarios. Some go too far and try to use it for everything.
You are right though in that you can do many things with substring and/or split. You will often reach a point with those where what you're doing will get so complicated that you have to change method or you just end up writing too much fragile code. Regexes are (relatively) easy to expand.
But hand written code will nearly always be faster. A good example of this is Putting char into a java string for each N characters. The regex solution is terser but has some issues that a hand written loop doesn't and is much slower.
You can substitute "regex" in your question with pretty much any technology and you'll find people who poorly understand the technology or too lazy to learn the technology making such claims.
There is nothing heavy-weight about regular expressions. The most common way that programmers get themselves into trouble using regular expressions is that they try to do too much with a single regular expression. If you use regular expressions for what they're intended (simple pattern matching), you'll be hard-pressed to write procedural code that's more efficient than the equivalent regular expression. Given decent proficiency with regular expressions, the regular expression takes much less time to write, is easier to read, and can be pasted into tools such as RegexBuddy for visualization.
As a basic parser or validator, use a regular expression unless the parsing or validation code you would otherwise write would be easier to read.
For complex parsers (i.e. recursive descent parsers) use regex only to validate lexical elements, not to find them.
The bottom line is, the best regex engines are well tuned for validation work, and in some cases may be more efficient than the code you yourself could write, and in others your code would perform better. Write your code using handwritten state machines or regex as you see fit, but change from regex to handwritten code if performance tests show you that regex is significantly inefficient.
"When you have a hammer, everything looks like a nail."
Regular expressions are a very useful tool; but I agree that they're not necessary for every single place they're used. One positive factor to them is that because they tend to be complex and very heavily used where they are, the algorithms to apply regular expressions tend to be rather well optimized. That said, the overhead involved in learning the regular expressions can be... high. Very high.
Are regular expressions the best tool to use for every applicable situation? Probably not, but on the other hand, if you work with string validation and search all the time, you probably use regular expressions a lot; and once you do, you already have the knowledge necessary to use the tool probably more efficiently and quickly than any other tool. But if you don't have that experience, learning it is effectively a drag on your productivity for that implementation. So I think it depends on the amount of time you're willing to put into learning a new paradigm, and the level of rush involved in your project. Overall, I think regular expressions are very worth learning, but at the same time, that learning process can, frankly, suck.
I think that if you learn programming in language that speaks regular expressions natively you'll gravitate toward them because they just solve so many problems. IE, you may never learn to use split because regexec() can solve a wider set of problems and once you get used to it, why look anywhere else?
On the other hand, I bet C and C++ programmers will for the most part look at other options first, since it's not built into the language.
You know, given the fact that I'm what many people call "young", I've heard too much criticism about RegEx. You know, "he had a problem and tried to use regex, now he has two problems".
Seriously, I don't get it. It is a tool like any other. If you need a simple website with some text, you don't need PHP/ASP.NET/STG44. Still, no discussion on whether any of those should be avoided. How odd.
In my experience, RegEx is probably the most useful tool I've ever encountered as a developer. It's the most useful tool when it comes to #1 security issue: parsing user input. I has saved me hours if not days of coding and creating potentially buggy (read: crappy) code.
With modern CPUs, I don't see what's the performance issue here. I'm quite willing to sacrifice some cycles for some quality and security. (It's not always the case, though, but I think those cases are rare.)
Still, RegEx is very powerful. With great power, comes great responsibility. It doesn't mean you'll use it whenever you can. Only where it's power is worth using.
As someone mentioned above, HTML parsing with RegEx is like a Russian roulette with a fully loaded gun. Don't overdo anything, RegEx included.
You should also avoid floating-point numbers at all cost. That is when you're programming in an embedded-environment.
Seriously: if you're in normal software-development you should actually use regex if you need to do something that can't be achieved with simpler string-operations. I'd say that any normal programmer won't be able to implement something that's best done using regexps in a way that is faster than the correspondig regular expression. Once compiled, a regular expression works as a state-maschine that is optimized to near perfection.
Overhyped? No
Under-Utilized Properly? Yes
If more people knew how to use a decent parser generator, there would be fewer people using regular expressions.
In my belief, they are overused by people quite a bit (I've had this discussion a number of times on SO).
But they are a very useful construct because they deliver a lot of expressive power in a very small piece of code.
You only have to look at an example such as a Western Australian car registration number. The RE would be
re.match("[1-9] [A-Z]{3} [0-9]{3}")
whilst the code to check this would be substantially longer, in either a simple 9-if-statement or slightly better looping version.
I hardly ever use complex REs in my code because:
I know how the RE engines work and I can use domain knowledge to code up faster solutions (that 9-if variant would almost certainly be faster than a one-shot RE compile/execute cycle); and
I find code more readable if it's broken up logically and commented. This isn't easy with most REs (although I have seen one that allows inline comments).
I have seen people suggest the use of REs for extracting a fixed-size substring at a fixed location. Why these people don't just use substring() is beyond me. My personal thought is that they're just trying to show how clever they are (but it rarely works).
Don't avoid it, but ask youself if they're the best tool for the task you have to solve. Maybe sometimes regex are difficult to use or debug, but they're really usefull in some situations. The question is to use the apropiate tool for each task, and usually this is not obvious.
There is a very good reason to use regular expressions in scripting languages (such as Ruby, Python, Perl, JavaScript and Lua): parsing a string with carefully optimized regular expression executes faster than the equivalent custom while loop which scans the string character-by-character. For compiled languages (such as C and C++, and also C# and Java most of the time) usually the opposite is true: the custom while loop executes faster.
One more reason why regular expressions are so popular: they express the programmer's intention in an extremely compact way: a single-line regexp can do as much as a 10- or 20-line while loop.
Overhyped? No, if you have ever taken a Parsing or Compiler course, you would understand that this is like saying addition and multiplication is overhyped for math problems.
It is a system for solving parsing problems.
some problems are simpler and don't require regular expressions, some are harder and require better tools.
I've seen so many people argue about whether a given regex is correct or not that I'm starting to think the best way to write one is to ask how to do it on StackOverflow and then let the regex gurus fight it out.
I think they're especially useful in JavaScript. JavaScript is transmitted (so should be small) and interpreted from text (although this is changing in the new browsers with V8 and JIT compilation), so a good internal regex engine has a chance to be faster than an algorithm.
I'd say if there is a clear and easy way to do it with string operations, use the string operations. But if you can do a nice regex instead of writing your own state machine interpreter, use the regex.
Regular Expressions are one of the most useful things programmers can learn, they allow to speed up and minimize your code if you know how to handle them.
Regular expressions are often easier to understand than the non-regex equivalent, especially in a language with native regular expressions, especially in a code section where other things that need to be done with regexes are present.
That doesn't meant they're not overused. The only time string.match(/\?/) is better than string.contains('?') is if it's significantly more readable with the surrounding code, or if you know that .contains is implemented with regexes anyway
I often use regex in my IDE to quick fix code. Try to do the following without regex.
glVector3f( -1.0f, 1.0f, 1.0f ); -> glVector3f( center.x - 1.0f, center.y + 1.0f, center.z + 1.0f );
Without regex, it's a pain, but WITH regex...
s/glVector3f\((.*?),(.*?),(.*?)\)/glVector3f(point.x+$1,point.y+$2,point.z+$3)/g
Awesome.
I'd agree that regular expressions are sometimes used inappropriately. Certainly for very simple cases like what you're describing, but also for cases where a more powerful parser is needed.
One consideration is that sometimes you have a condition that needs to do something simple like test for presence of a question mark character. But it's often true that the condition becomes more complex. For example, to find a question mark character that isn't preceded by a space or beginning-of-line, and isn't followed by an alphanumeric character. Or the character may be either a question mark or the Spanish "¿" (which may appear at the start of a word). You get the idea.
If conditions are expected to evolve into something that's less simple to do with a plain call to String.contains("?"), then it could be easier to code it using a very simple regular expression from the start.
It comes down to the right tool for the job.
I usually hear two arguments against regular expressions:
1) They're computationally inefficient, and
2) They're hard to understand.
Honestly, I can't understand how either are legitimate claims.
1) This may be true in an academic sense. A complex expression can double back on itself may times over. Does it really matter though? How many millions of computations a second can a server processor do these days? I've dealt with some crazy expressions, and I've never seen a regexp be the bottle neck. By far it's DB interaction, followed by bandwidth.
2) Hard for about a week. The most complicated regexp is no more complex than HTML - it's just a familiarity problem. If you needed HTML once every 3 months, would you get it 100% each time? Work with them on a daily basis and they're just as clear as any other language syntax.
I write validation software. REGEXP's are second nature. Every fifth line of code has a regexp, and for the life of me I can't understand why people make a big deal about them. I've never seen a regexp slow down processing, and I've seen even the most dull 'programmers' pick up the syntax.
Regexp's are powerful, efficient, and useful. Why avoid them?
I wouldn't say avoid them entirely, as they are QUITE handy at times. However, it is important to realize the fundamental mechanisms underneath. Depending on your implementation, you could have up to exponential run-time for a search, but since searches are usually bounded by some constant number of backtraces, you can end up with the slowest linear run-time you ever saw.
If you want the best answer, you'll have to examine your particular implementation as well as the data you intend to search on.
From memory, wikipedia has a decent article on regular expressions and the underlying algorithms.
It seems that the choice to use string parsing vs. regular expressions comes up on a regular basis for me anytime a situation arises that I need part of a string, information about said string, etc.
The reason that this comes up is that we're evaluating a soap header's action, after it has been parsed into something manageable via the OperationContext object for WCF and then making decisions on that. Right now, the simple solution seems to be basic substring'ing to keep the implementation simple, but part of me wonders if RegEx would be better or more robust. The other part of me wonders if it'd be like using a shotgun to kill a fly in our particular scenario.
So I have to ask, what's the typical threshold that people use when trying to decide to use RegEx over typical string parsing. Note that I'm not very strong in Regular Expressions, and because of this, I try to shy away unless it's absolutely vital to avoid introducing more complication than I need.
If you couldn't tell by my choice of abbreviations, this is in .NET land (C#), but I believe that doesn't have much bearing on the question.
EDIT: It seems as per my typical Raybell charm, I've been too wordy or misleading in my question. I want to apologize. I was giving some background to help give clues as to what I was doing, not mislead people.
I'm basically looking for a guideline as to when to use substring, and variations thereof, over Regular Expressions and vice versa. And while some of the answers may have missed this (and again, my fault), I've genuinely appreciated them and up-voted as accordingly.
My main guideline is to use regular expressions for throwaway code, and for user-input validation. Or when I'm trying to find a specific pattern within a big glob of text. For most other purposes, I'll write a grammar and implement a simple parser.
One important guideline (that's really hard to sidestep, though I see people try all the time) is to always use a parser in cases where the target language's grammar is recursive.
For example, consider a tiny "expression language" for evaluating parenthetized arithmetic expressions. Examples of "programs" in this language would look like this:
1 + 2
5 * (10 - 6)
((1 + 1) / (2 + 2)) / 3
A grammar is easy to write, and looks something like this:
DIGIT := ["0"-"9"]
NUMBER := (DIGIT)+
OPERATOR := ("+" | "-" | "*" | "/" )
EXPRESSION := (NUMBER | GROUP) (OPERATOR EXPRESSION)?
GROUP := "(" EXPRESSION ")"
With that grammar, you can build a recursive descent parser in a jiffy.
An equivalent regular expression is REALLY hard to write, because regular expressions don't usually have very good support for recursion.
Another good example is JSON ingestion. I've seen people try to consume JSON with regular expressions, and it's INSANE. JSON objects are recursive, so they're just begging for regular grammars and recursive descent parsers.
Hmmmmmmm... Looking at other people's responses, I think I may have answered the wrong question.
I interpreted it as "when should use use a simple regex, rather than a full-blown parser?" whereas most people seem to have interpreted the question as "when should you roll your own clumsy ad-hoc character-by-character validation scheme, rather than using a regular expression?"
Given that interpretation, my answer is: never.
Okay.... one more edit.
I'll be a little more forgiving of the roll-your-own scheme. Just... don't call it "parsing" :o)
I think a good rule of thumb is that you should only use string-matching primitives if you can implement ALL of your logic using a single predicate. Like this:
if (str.equals("DooWahDiddy")) // No problemo.
if (str.contains("destroy the earth")) // Okay.
if (str.indexOf(";") < str.length / 2) // Not bad.
Once your conditions contain multiple predicates, then you've started inventing your own ad hoc string validation language, and you should probably just man up and study some regular expressions.
if (str.startsWith("I") && str.endsWith("Widget") &&
(!str.contains("Monkey") || !str.contains("Pox"))) // Madness.
Regular expressions really aren't that hard to learn. Compared to a huuuuge full-featured language like C# with dozens of keywords, primitive types, and operators, and a standard library with thousands of classes, regular expressions are absolutely dirt simple. Most regex implementations support about a dozen or so operations (give or take).
Here's a great reference:
http://www.regular-expressions.info/
PS: As a bonus, if you ever do want to learn about writing your own parsers (with lex/yacc, ANTLR, JavaCC, or other similar tools), learning regular expressions is a great preparation, because parser-generator tools use many of the same principles.
The regex can be
easier to understand
express more clearly the intent
much shorter
easier to change/adapt
In some situations all of those advantages would be achieved by using a regex, in others only some are achieved (the regex is not really easy to understand for example) and in yet other situations the regex is harder to understand, obfuscates the intent, longer and hard to change.
The more of those (and possibly other) advantages I gain from the regex, the more likely I am to use them.
Possible rule of thumb: if understanding the regex would take minutes for someone who is somewhat familiar with regular expressions, then you don't want to use it (unless the "normal" code is even more convoluted ;-).
Hm ... still no simple rule-of-thumb, sorry.
[W]e're evaluating a soap header's
action and making decisions on that
Never use regular expressions or basic string parsing to process XML. Every language in common usage right now has perfectly good XML support. XML is a deceptively complex standard and it's unlikely your code will be correct in the sense that it will properly parse all well-formed XML input, and even it if does, you're wasting your time because (as just mentioned) every language in common usage has XML support. It is unprofessional to use regular expressions to parse XML.
To answer your question, in general the usage of regular expressions should be minimized as they're not very readable. Oftentimes you can combine string parsing and regular expressions (perhaps in a loop) to create a much simpler solution than regular expressions alone.
I would agree with what benjismith said, but want to elaborate just a bit. For very simple syntaxes, basic string parsing can work well, but so can regexes. I wouldn't call them overkill. If it works, it works - go with what you find simplest. And for moderate to intermediate string parsing, a regex is usually the way to go.
As soon as you start finding yourself needing to define a grammar however, i.e. complex string parsing, get back to using some sort of finite state machine or the likes as quickly as you can. Regexes simply don't scale well, to use the term loosely. They get complex, hard to interpret, and even incapable.
I've seen at least one project where the use of regexes kept growing and growing and soon they had trouble inserting new functionality. When it finally came time to do a new major release, they dumped all the regexes and went the route of a grammar parser.
When your required transformation isn't basic -- but is still conceptually simple.
no reason to pull out Regex if you're doing a straight string replacement, for example... its easier to just use the string.Replace
on the other hand, a complex rule with many conditionals or special cases that would take more than 50 characters of regex can be a nightmare to maintain later on if you don't explicitly write it out
I would always use a regex unless it's something very simple such as splitting a comma-separated string. If I think there's a chance the strings might one day get more complicated, I'll probably start with a regex.
I don't subscribe to the view that regexes are hard or complicated. It's one tool that every developer should learn and learn well. They have a myriad of uses, and once learned, this is exactly the sort of thing you never have to worry about ever again.
Regexes are rarely overkill - if the match is simple, so is the regex.
I would think the easiest way to know when to use regular expressions and when not to, is when your string search requires an IF/THEN statement or anything resembling this or that logic, then you need something better than a simple string comparison which is where regex shines.
One of my developers has started using RegexBuddy for help in interpreting legacy code, which is a usage I fully understand and support. What concerns me is using a regex tool for writing new code. I have actually discouraged its use for new code in my team. Two quotes come to mind:
Some people, when confronted with a
problem, think "I know, I’ll use
regular expressions." Now they have
two problems. - Jamie Zawinski
And:
Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as
cleverly as possible, you are, by
definition, not smart enough to debug
it. - Brian Kernighan
My concerns are (respectively:)
That the tool may make it possible to solve a problem using a complicated regular expression that really doesn't need it. (See also this question).
That my one developer, using regex tools, will start writing regular expressions which (even with comments) can't be maintained by anyone who doesn't have (and know how to use) regex tools.
Should I encourage or discourage the use of regex tools, specifically with regard to producing new code? Are my concerns justified? Or am I being paranoid?
Poor programming is rarely the fault of the tool. It is the fault of the developer not understanding the tool. To me, this is like saying a carpenter should not own a screwdriver because he might use a screw where a nail would have been more appropriate.
Regular expressions are just one of the many tools available to you. I don't generally agree with the oft-cited Zawinski quote, as with any technology or technique, there are both good and bad ways to apply them.
Personally, I see things like RegexBuddy and the free Regex Coach primarily as learning tools. There are certainly times when they can be helpful to debug or understand existing regexes, but generally speaking, if you've written your regex using a tool, then it's going to be very hard to maintain it.
As a Perl programmer, I'm very familiar with both good and bad regular expressions, and have been using even complicated ones in production code successfully for many years. Here are a few of the guidelines I like to stick to that have been gathered from various places:
Don't use a regex when a string match will do. I often see code where people use regular expressions in order to match a string case-insensitively. Simply lower- or upper-case the string and perform a standard string comparison.
Don't use a regex to see if a string is one of several possible values. This is unnecessarily hard to maintain. Instead place the possible values in an array, hash (whatever your language provides) and test the string against those.
Write tests! Having a set of tests that specifically target your regular expression makes development significantly easier, particularly if it's a vaguely complicated one. Plus, a few tests can often answer many of the questions a maintenance programmer is likely to have about your regex.
Construct your regex out of smaller parts. If you really need a big complicated regex, build it out of smaller, testable sections. This not only makes development easier (as you can get each smaller section right individually), but it also makes the code more readable, flexible and allows for thorough commenting.
Build your regular expression into a dedicated subroutine/function/method. This makes it very easy to write tests for the regex (and only the regex). it also makes the code in which your regex is used easier to read (a nicely named function call is considerably less scary than a block of random punctuation!). Dropping huge regular expressions into the middle of a block of code (where they can't easily be tested in isolation) is extremely common, and usually very easy to avoid.
You should encourage the use of tools that make your developers more efficient. Having said that, it is important to make sure they're using the right tool for the job. You'll need to educate all of your team members on when it is appropriate to use a regular expression, and when (less|more) powerful methods are called for. Finally, any regular expression (IMHO) should be thoroughly commented to ensure that the next generation of developers can maintain it.
I'm not sure why there is so much diffidence against regex.
Yes, they can become messy and obscure, exactly as any other piece of code somebody may write but they have an advantage over code: they represent the set of strings one is interested to in a formally specified way (at least by your language if there are extensions). Understanding which set of strings is accepted by a piece of code will require "reverse engineering" the code.
Sure, you could discurage the use of regex as has already been done with recursion and goto's but this would be justifed to me only if there's a good alternative.
I would prefer maintain a single line regex code than a convoluted hand-made functions that tries to capture a set of strings.
On using a tool to understand a regex (or write a new one) I think it's perfectly fine! If somebody wrote it with the tool, somebody else could understand it with a tool! Actually, if you are worried about this, I would see tools like RegexBuddy your best insurance that the code will not be unmaintainable just because of the regex's
Regex testing tools are invaluable. I use them all the time. My job isn't even particularly regex heavy, so having a program to guide me through the nuances as I build my knowledge base is crucial.
Regular expressions are a great tool for a lot of text handling problems. If you have someone on your team who is writing regexes that the rest of the team don't understand, why not get them to teach the rest of you how they are working? Rather than a threat, you could be seeing this as an opportunity. That way you wouldn't have to feel threatened by the unknown and you'll have another very valuable tool in your arsenal.
Zawinski's comments, though entertainingly glib, are fundamentally a display of ignorance and writing Regular Expressions is not the whole of coding so I wouldn't worry about those quotes. Nobody ever got the whole of an argument into a one-liner anyways.
If you came across a Regular Expression that was too complicated to understand even with comments, then probably a regex wasn't a good solution for that particular problem, but that doesn't mean they have no use. I'd be willing to bet that if you've deliberately avoided them, there will be places in your codebase where you have many lines of code and a single, simple, Regex would have done the same job.
Regexbuddy is a useful shortcut, to make sure that the regular expressions you are writing do what you expect- it certainly makes life easier, but it's the matter of using them at all that is what seems important to me about your question.
Like others have said, I think using or not using such a tool is a neutral issue. More to the point: If a regular expression is so complicated that it needs inline comments, it is too complicated. I never comment my regexps. I approach large or complex matching problems by breaking it down into several steps of matching, either with multiple match statements (=~), or by building up a regexp with sub regexps.
Having said all that, I think any developer worth his salt should be reasonably proficient in regular expression writing and reading. I've been using regular expressions for years and have never encountered a time where I needed to write or read one that was terrifically complex. But a moderately sized one may be the most elegant and concise way to do a validation or match, and regexps should not be shied away from only because an inexperienced developer may not be able to read it -- better to educate that developer.
What you should be doing is getting your other devs hooked up with RB.
Don't worry about that whole "2 probs" quote; it seems that may have been a blast on Perl (said back in 1997) not regex.
I prefer not to use regex tools. If I can't write it by hand, then it means the output of the tool is something I don't understand and thus can't maintain. I'd much rather spend the time reading up on some regex feature than learning the regex tool. I don't understand the attitude of many programmers that regexes are a black art to be avoided/insulated from. It's just another programming language to be learned.
It's entirely possible that a regex tool would save me some time implementing regex features that I do know, but I doubt it... I can type pretty fast, and if you understand the syntax well (using a text editor where regexes are idiomatic really helps -- I use gVim), most regexes really aren't that complex. I think you're nearly always better served by learning a technology better rather than learning a crutch, unless the tool is something where you can put in simple info and get out a lot of boilerplate code.
Well, it sounds like the cure for that is for some smart person to introduce a regex tool that annotates itself as it matches. That would suggest that using a tool is not as much the issue as whether there is a big gap between what the tool understands and what the programmer understands.
So, documentation can help.
This is a real trivial example is a table like the following (just a suggestion)
Expression Match Reason
^ Pos 0 Start of input
\s+ " " At least one space
(abs|floor|ceil) ceil One of "abs", "floor", or "ceil"
...
I see the issue, though. You probably want to discourage people from building more complex regular expression than they can parse. I think standards can address this, by always requiring expanded REs and check that the annotation is proper.
However, if they just want to debug an RE, to make sure it's acting as they think it's acting, then it's not really much different from writing code you have to debug.
It's relative.
A couple of regex tools (for Node/JS, PHP and Python) i made (for some other projects) are available online to play and experiment.
regex-analyzer and regex-composer
github repo