does regex comparisons consume lots of resources? - regex

i dunno, but will your machine suffer great slowdown if you use a very complex regex?
like for example the famous email validation module proposed just recently? which can be found here RFC822
update: sorry i had to ask this question in a hurry anyway i posted the link to the email regex i was talking about

It highly depends on the individual regex: features like look-behind or look-ahead can get very expensive, while simple regular expressions are fine for most situations.
Tutorials on http://www.regular-expressions.info/ offer performance advice, so that can be a good start.

Regexes are usually implemented as one of two algorithms (NFA or DFA) that correspond to two different FSMs. Different languages and even different versions of the same language may have a different type of regex. Naturally, some regexes work faster in one and some work faster in the other. If it's really critical, you might want to find what type of regex FSM is implemented.
I'm no expert here. I got all this from reading Mastering Regular Expressions by Jeffrey E. F. Friedl. You might want to look that up.

Depends also on how well you optimise your query, and knowing the internal working of regex.
Using the negated character class, for example, saves the cost of having the engine backtracking characters (i.e. /<[^>]+>/ instead of /<.+?>/)(*).Trivial in small matches, but saves a lot of cycles when you have to match inside a big chunk of text.
And there are many other ways to save resources in regex operations, so performance can vary wildly.
example taken from http://www.regular-expressions.info/repeat.html

You might be interested by articles like: Regular Expression Matching Can Be Simple And Fast or Understanding Regular Expressions.
It is, alas, easy to write inefficient REs, which can match quite quickly on success but can look for hours if no match is found, because the engine stupidly try a long match on every position of a long string!
There are a few recipes for this, like anchoring whenever it is possible, avoiding greediness if possible, etc.
Note that the giant e-mail expression isn't recent, and not necessarily slow: a short, simple expression can be slower than a more convoluted one!
Note also that in some situations (like e-mail, precisely), it can be more efficient (and maintainable!) to use a mix of regexes and code to handle cases, like splitting at #, handling different cases (first part starts with " or not, second part is IP address or domain, etc.).
Regexes are not the ultimate tool able to do everything, but it is a very useful tool well worth to master!

It depends on your regexp engine. As explained here (Regular Expression Matching Can Be Simple And Fast) there may be some important difference in the performance depending on the implementation.

You can't talk about regexes in general any more than you can talk about code in general.
Regular expressions are little programs on their own. Just as any given program may be fast or slow, any given regex may be fast or slow.
One thing to remember, however, is that the regular expression handler is is very well optimized to do its job and run the regex quickly.

I once made a program that analyzed a lot of text (a big code base, >300k lines). First I used regex but when I switched to using regular string functions it got a lot faster, like taking 40% of the time of the regex version. So while of course it depends, my thing got a lot faster.

Once I had written a greedy - accidentally, of course :-) - a multi-line regex and had it search/replace on 10 * 200 GB of text files. It was damn slow... So it depends what you write, and what you check.

Depends on the complexity of the expression and the language the expression is used with.
In JavaScript; you have to optimize everything. In C#; not so much.

Related

TCL string match vs regexps

Is it right that we should avoid using regexp as it is slow. Instead we should use string operations. Are there cases that both can be used but regexp is better?
You should use the appropriate tool for the job. That means, you should not avoid regex, you should use it when it is necessary.
If you are just searching for a fixed sequence of characters, use string operations.
If you are searching for a pattern, then use regular expressions.
Example
Search for the word "Foo". use string operations it will also find
"Foobar", is this OK? NO, well then maybe search for "Foo ", but then
it will not find "Foo," and "Foo."
With regex no problem, you can match for a word boundary /\mFoo\M/ and
this regex will not be slow.
I think this negative image comes from special problems like catastrophic backtracking.
There has been a recent example (catastrophic-backtracking-shouldnt-be-happening-on-this-regex) where this behaviour was unexpected.
Conclusion
A regex has to be well designed, if it isn't then the performance can be catastrophic. But the same can also happen to your normal code if you use a bad algorithm.
For a small job it should nearly never be a problem to use a regex, if your task is bigger and has to be repeated often, do a benchmark.
From my own experience, I am analyzing really big text files (some hundred MB) and use regexes to find the rows I am interested in and I don't experience performance problems because of regex.
Here an interesting read about code optimization
Regular expressions (REs) are a marvelous hammer. They can solve some problems elegantly, and many more with brute force, but it won't be pretty. And some problems can be solved with REs if you hit them enough, but there are much better solutions available (for example, things that are a good fit for string map)
string match - or globbing - can be thought of as a simplified version of regular expressions. The glob pattern will usually be shorter than the equivalent regular expression (character classes are an exception - ERs support them, with globs you need to spell them out). I don't know offhand how the performance differs; I'd expect string match to be slightly faster on equivalent patterns because of the simpler logic, but time is much more reliable than expectations.
For a specific case where REs are easier to use, extracting a substring contextually vs. by simple character position is a good example. Or for matching one of several alternatives.
My rule of thumb is to use the simplest thing that works. If that's string match, then great. If it seems like the pattern is too complex for that, go to a regexp and be happy you have the choice.
The best advice I can give, and the advice I use myself is, use regular expressions only when a simpler solution won't work.
If you can use simple string matching, or use glob patterns, use them. It's only when those cannot work that you should be using regular expressions.
To address your specific question I would say that, no, there is no time when you can use either but that regular expressions are the better choice. Maybe there's an edge case I'm not thinking of, but generally speaking, simpler solutions are always better.
I don't know about Tcl in particular, but generally it can be said that if you're looking for exact text matches (e. g. find all lines that start with #define) then string operations are faster. But if you're looking for patterns (e. g. all lines that contain a word that starts with c and ends with t) then regular expressions are the right tool for this (\bc\w*t\b would be a good regex for this - compare this to the program logic you'd need if you had to write this yourself.
And even if regex is slower in a case like this, chances are high that it won't matter in terms of execution speed, but it'll matter a lot when changes to the matching logic are required (oh, now we need to look for a word that starts with c and ends with t but contains at least two as and no x --> \bc(?=\w*a\w*a)(?!\w*x)\w*t\b).
A place where most regex engines don't want to go is recursion (matching nested tags, nested parentheses and all that). That's where parsers enter the picture.
Regular expression matching is a kind of string operation. While it's not as fast as some of the more basic operations, it is enormously more capable too. It's also more difficult to use, especially if you don't already know the basic syntax of REs, but that's not a reason to avoid them. However, replacing a regular expression with a collection of basic string operations can just lead to the program getting enormously longer: sometimes, you simply need complex manipulations.
Tcl does a number of things to make RE operations more efficient. Notably, it detects particularly simple REs and converts them into glob-like matches (as in string match) which are faster but much less powerful, and it does a number of things to cache the compiled form of REs so that matching has less overhead. It also uses an automata-theoretic matching engine that has fewer surprises during match time (at a cost of more time to compile the RE in the first place).
In short, don't avoid them. Use them where appropriate. (And time if you're in doubt about speed.)
regexp aka regular expressions are used to match many different strings and can be very complex or even to validate a specific input.
string match only allows wildcards such as * and ? and basic character grouping with [] as in regexp.
You can read about it here: http://www.tcl.tk/man/tcl8.5/TclCmd/string.htm#M40
A basic guide what regexp can do also with some examples are explained here: http://www.regular-expressions.info/
So in short: If you don't need regexp or even don't know much about it, i recommand you to not use it. If you just want to compare two strings for their equality use string equal.

When should I prefer regex over built-in string functions?

Some say I should use regex whenever possible, others say I should use it at least as possible. Is there something like a "Perl Etiquette" about that matter or just TIMTOWTDI?
The level of complexity generally dictates whether I use a regex or not. Some of the questions I ask when deciding whether or not to use a regex are:
Is there no built string function that handles this relatively easily?
Do I need to capture substring groups?
Do I need complex features like look behind or negative sets?
Am I going to make use of character sets?
Will using a regex make my code more readable?
If I answer yes to any of these, I generally use a regex.
I think a lot of the answers you got already are good. I want to address the etiquette part because I think there is some.
Summed up: if there is a robust parser available, use it instead of regular expressions; 100% of the time. Never recommend anything else to a novice. So–
Don'ts
Don't split or match against commas for CSV, use Text::CSV/Text::CSV_XS.
Don't write regexes against HTML or XML, use XML::LibXML, XML::Twig, HTML::TreeBuilder, HTML::TokeParser::Simple, et cetera.
Don't write regexes for things that are trivial to split or unpack.
Dos
Do use substr, index, and rindex where appropriate but recognize they can come off "unperly" so they are best used when benchmarking shows them superior to regular expressions; regexes can be surprisingly fast in many cases.
Do use regular expressions when there is no good parser available and writing a Parse::RecDescent grammar is overkill, too much work, or will be too slow.
Do use regular expressions for throw-away code like one-liners on well-known/predictable data including the HTML/CSV previously banned from regular expression use.
Do be aware of alternatives for bigger problems like P::RecD, Parse::Yapp, and Marpa.
Do keep your own council. Perl is supposed to be fun. Do whatever you like; just be prepared to get bashed if you complain when not following advice and it goes sideways. :P
I don't know of any "etiquette" about this.
Perl regex are highly optimized (that's one of the things the language is known for, although there are engines that are faster), and in the end, if your regex is so simple that it could be replaced by a string function, I don't believe that the regex will be any significantly less performant. If the problem you are trying to resolve is so time sensitive that you might look into other possibilities of optimization.
Another important aspect is readability. And I think that handling all string transformations through regex also add to this, insteas of mixing and matching different approaches.
Just my two cents.
Though I would classify this as too opinionated for SO, I'll give my point of view.
Use regex when the string is:
"Too Dynamic" (The string could have a lot of variation to it, that making use of the string library(ies) would be cumbersome.
"Contains patterns" if there is a genuine pattern to the string (and may be as simple as 1 character or a group of characters) this is where (i feel) regex excels.
"Too Complex" If you find yourself declaring a whole function block just to do what a single pattern can do, I can see it being worthwhile just to use regex. (However, see "Too Complex" below, too).
Do not use regex to be:
"Fast" Consider the overhead involved in spinning up a regex library over grabbing information directly from a string.
"Too Complex" Good code isn't always short. If you begin making a huge pattern to circumvent several lines of code, that's fine, but keep in mind it's at the risk of readability. Coming back to that piece and trying to wrap your head around it again may not be worth just doing the plain-jane method.
I'd say, if you need more than one or two string function calls to do it, use a regex. ;)
For things that are not too complex that the regex becomes bloated, affects the readability of code and cause performance issues. You can do it via a serious of steps, using builtin functions and other means. You may not have a cool single line regex, but your code will be readable and maintanable.
And also not too simple problems because, again, regexes are heavy weight and there are usually built-in functions that handled the simple scenarios.
It is going to depend on what you are going to do. Ofcourse, please don't use regex for parsing ( especially HTML etc. )
Perl is a great language for regex. It honestly has one of the greatest parsers of any language, so that is why you see so many "use regex" answers. I am not sure what the aversion to regex is, however.
My answer would be: can you sum up the work in a single pattern easier than using the string function, or do you need to use multiple string functions versus a single regex? In either case, I would aim for regex. Otherwise, do what feels comfortable for you.

Are regular expressions worth the hassle?

It strikes me that regular expressions are not understood well by the majority of developers. It also strikes me that for a lot of problems where regular expressions are used, a lump of code could be used instead. Granted, it might be slower and be 20 lines for something like email validation, but if performance of the code is not desperately important, is it reasonable to assume that not using regular expressions might be better practise?
I'm thinking in terms of maintenance of the code rather that straight line execution time.
Maintaining one regular expression is a lot less effort than maintaining 20 lines of code. And you underestimate the amount of code needed - for a regex of any complexity, the replacement code could easily be 200 rather than 20 lines.
Professional developers should be familiar with basic syntax
At the very least. In all the years long I've been a professional developer I haven't come across a developer that wouldn't know what Regular Expressions are. It's true, not everybody likes using them or is very good at knowing its syntax, but that doesn't mean one shouldn't use them. Developers should learn the syntax and regular expressions should be used.
It's like: "Ok. We have Lambda expressions, but who cares, I can still do it the old fashioned way."
Not learning key aspects of professional development is pure laziness and shouldn't be tolerated for too long.
Whenever i use a Regex i always try to leave a comment explaining exactly how it's structured because I agree with you that not all developers understand them and going back to a regex, even if you've written it yourself, can be a headache to understand again.
That said, they definitely have their uses. Try stripping out all html elements from a box of text without it!
I'm thinking in terms of maintenance of the code rather that straight line execution time.
Code size is the single most important factor in reducing maintainability.
And while Regexps can be very hard to decipher, so are 50 line string processing methods - and the latter are more likely to contain bugs in rare corner cases.
The thing is: any non-trivial regexp must be commented just as thoroughly as you'd comment a 50 line method.
Regular expressions are a domain-specific language: no generic programming language is quite as expressive or quite as efficient at doing what regular expressions do with string matching. The sheer size of the lump of code you will have to write in a standard programming language (even one with a good string library) will make it harder to maintain. It is also a good separation-of-concerns to make sure that the regular expression only does the matching. Having a code blob that basically does matching, but does something else in-between can produce some surprising bugs.
Also note that there are mechanisms to make regular expressions more readable. In Python you can enable verbose mode, which allows you to write things like this:
a = re.compile(r"""\d + # the integral part
\. # the decimal point
\d * # some fractional digits""", re.X)
Another possibility is to build the regular expression up from strings, by line and comment each line, like this:
a = re.compile("\d+" # the integral part
"\." # the decimal point
"\d *" # fraction digits
)
This is possible in different ways in most programming languages. My advice is to keep using regular expressions where appropriate, but treat them like you do other code. Write them as clear as possible, comment them and test them.
You raise a very good point with regards to maintainability. Regular expressions can require some deciphering to understand but I doubt the code which would replace them would be easier to maintain. Regular Expressions are VERY powerful and a valuable tool. Use them but use them carefully, and think about how to make it clear what the intent of the regular expression is.
Regards
With great power comes great responsibility!
Regular expressions are great, but there can be a tendancy to over-use them! There are not suitable in all cases!
In my opinion, it might make more sense to enforce better practices with using regular expressesions other than forgoing it all together.
Always comment your regular expressions. You might know what it does now, but someone else might not and even you might not remember in two weeks. Moreover, descriptive comments should be used, stating exactly what the regular expression is meant to do.
Use unit testing. Create unit tests for your regular expressions. So can have a degree of assurance as to the reliability and correctness of your regular expression statement. And if the regex is being maintained, it would ensure that any code changes does not break existing functionality.
Using regular expression has some advantages:
Time. You don't have to write your own code to do exactly what is built in.
Maintainability. You have to maintain only a couple of lines as opposed to 30 or 300
Performance. The code is optimized
Reliability. If your regex statement is correct, it should function correctly.
Flexibility. Regex gives you a lot of power which is very useful if used properly
Think of regular expressions as the lingua Franca of string processing. You simply need to know them if you are going tocode in a professional capacity. Unless you just write SQL maybe.
I would just like to add that unit testing is the ideal way to make your regular expressions maintainable. I consider Regex an essential developer skill that is always a practical alternative to writing many lines of string manipulation code.
The most hassle I see is when people try to parse non-regular languages with regular expressions (yes, that includes all programming and many markup languages, yes, also HTML). I sometimes wish all coders had to demonstrate that they have understood at least the difference between context-free and regular languages before they are allowed to use regular expressions. Alternatively, they could get their regex license revoked when they are caught trying to parse non-regular languages with them. Yes, I'm joking, but only half.
The next problem arises when people try to do more than character matching in a regular expression, for example, checking for a valid date, perhaps even including leap year considerations (this could also lead to regex license revokation).
Regular expressions really are just a convenient shorthand for a finite state automaton (You know what that is, don't you? Where is your regex license, please?). The problems come from people expecting some kind of magic from them, not from the regular expressions themselves.
I see regex as a fast, readable and preferable way to perform pattern matching on string data. So many languages support regex for this reason. If you wanted to write string manipulation code to match say, a Canadian zip code, be my guest, but the regex equivalent is so much more succinct. Definitely worth it.
In .NET regex'es you can have comments, and break them up into multiple lines, use indenting etc. (I don't know about other dialects...)
Use the "ignore pattern whitespace" setting, and either # for commenting out the rest of the line, or "(#comments)" in your pattern...
So if you wanted to, you can actually make them sort of readable/maintainable...
I just ran into this issue. I built a regular expression to pull out groups of data from a long string of numbers and some other noise. The regex was quite long, though concise, and it got even bigger when i tried to add it to the C# app i was writing. In total the reg ex was 3 lines of code.
However it was painful to look at after i escaped it for C# and the other developers i work with don't under stand regular expressions. I ended up stripping out most of the noise characters and splitting on space to get the groups of data. Very simple code and only 5 lines.
Which is better? My ego says Regular Expressions. Any new hire would say character stripping.
I would never wish for fewer options in programming. Regular expressions can be very powerful, but do require skill. I like problems that can be solved in a few lines of code. It is really cool how many elements of validation can be accomplished. As long as the code is commented on what the expression checks for, I do not see a problem. I also have never seen a professional programmer not know what a regex was. It is another tool in the tool box.
Regex is one tool among many. But as many craftsmen will attest, the more tools you have at your disposal, and the more skilled you are at using them, the more likely you will become a Master Craftsman.
Is Regex worth the hassle to you? Dunno. Depends how seriously you take what you do.
It's a lot easier to see at first glance that a regex is probably correct. Why would I write a long state machine in code (probably containing bugs at first) when I could write a simple one line regex?
Regexes may be considered "write only", but I think that is sometimes a benefit. When writing a relatively simple regex from scratch, it's pretty easy to get it right.
True, learning to decipher regexes is difficult -- but so is learning to decipher the hosting program code in the first place. But is that so difficult, that we would rather write out manual instruction for a person to perform? No -- because that would be ridiculously longer and complicated. Same thing for not using a properly-formed regex.
I've found with reg ex it's easier to maintain, but fine tuning someone else's reg ex is a bit of a pain. I think you underestimate the developers by saying most people don't understand it. Usually what I found is that over time, requirements adjust, and the regex that used to validate something is no longer effective and attempting to remove portions that are no longer valid is harder than to just rewrite the entire thing.
Also, imagine if you were validating phone numbers, and you decided to use code instead of reg ex. So it amounts to let's say 20 lines. Over time, your company decides to expand to other regions where now the phone validation is no longer totally true. So you have to adjust it to fit the other requirements. It could be possible that the code would be harder to maintain because you have to adjust over 20 lines of code rather than simply removing the old reg ex, and replacing it with a new one.
However, I think code can be used in certain cases along with regex. For example, let's say you want to validate US phone numbers, in every case, it has 10 digits numbers, but there are literally a ton of ways to write it out. For example (xxx) xxx-xxxx, or xxx-xxx-xxxx, or xxx xxx xxxx, etc, etc, etc. So if you write reg ex, you'd have to account for each of the cases. However, if you just strip all non-numerics and spaces with a regex replace, then go for a second pass and check if it has 10 digits, you'd find it easier than accounting each and every possible way to write a phone number.
One thing that doesn't seem mentioned (from a quick scan of the answers above) is that regular expressions are useful outside of code too. That means they are worth the hassle for a coder, or even for end users.
For example, I just wrote a bunch of unit tests for a formatter. I then made a copy of the test, and used a single regex in my editor to invert values and resulting strings (changing the method name too), giving expected value to a string to parse...
Another example: in our product, we allow using regular expressions for searching or filtering columns of data: sometime it is useful to get only names starting with something, ending with something, with letters followed by digits, or similar: no need to be a master of regexes to use them.
In these cases, writing code isn't an option (well, I could have made a small Lua script in the first case, but it would have been longer) and performance isn't a major issue.
But even in code, I often find easier and more readable to use a simple regular expression than a bunch of substring() with complex offsets and whatnot. Beside, they shine to validate user input where, again, performance isn't an issue.
Due to the type of apps I build, the only RegEx's I regularly use are for email validation, html stripping, and character stripping to remove the garbage around phone numbers.
It's rare that I need to do very much string manipulation other than concatenation.
Incidentally, the apps are typically CRM's.
So the hassle for me is limited to googling for a regex in the event I find myself in need. ;)
Read the section under "Using Benchmarks" at JavaWorld.
Sure regular expressions are a very helpful tool, but I agree that they are overused and over complicate what can easily be a simple solution.
That being said, you should use regular expressions whenever the situation calls for it. Some things, such as searching for text in a string, can just as easily be done with an iterative search (or using the API searches), but for more complex situations you need regular expressions.
Surly all code needs to be optimized where possible!
In the context where code need not be optimized, and the logic will need to be maintained then it is down to the skill set of the team.
If the bulk of the team responsible for the code is regEX savvy then do it with a regEX. Else write it in the way the team is likely to be most comfortable with.
VB.net is best, No, C# is, No F# is the best. It's really more a matter of what will be the people maintaining be better suited to handle, in my opinion. That's more a flame question, than something that is absolutely answerable.
Personally I'd choose regex whenever there's complex string validation (phone numbers,emails, ss#, ip addresses) where there are well known regex's out there. Get it from regex.org, give attribution with a comment and/or get the authors permission whichever is appropriate, and be done with it.
Also, for extracting pieces of a string, or complex splitting of strings, regex can be a great time saver.
But if you're writing your own, rather than using someone else's, using something like regex buddy or sells brothers regexdesigner is a must for testing and validation.
It always depends on where it's used. If by doing the same task using a lump of code is being too complex and hard to maintain which can be a 1 liner less complex regex, then go with regex. Other wise use the lump of code.
Also I encountered problems which can I believe can only be answered by regex effective and concisely. Such question like this which can only be answered by another regex effectively: Dart regex for capturing groups but ignoring certain similar patterns

Constructing regex

I use regex buddy which takes in a regex and then gives out the meaning of it from which one gets what it could be doing? On similar lines is it possible to have some engine which takes natural language input describing about the pattern one needs to match/replace and gives out the correct(almost correct) regex for that description?
e.g. Match the whole word 'dio' in some file
So regex for that could be : <dio>
or
\bdio\b
-AD.
P.S. = I think few guys here might think this as a 'subjective' 'not-related-to-programming' question, but i just need to ask this question nonetheless. For myself. - Thanks.
This would be complicated to program, because you need a natural language parser able to derive meaning. Unless you limit it to a strict subset -- in which case, you're reinventing an expression language, and you'll eventually wind up back at regular expressions -- only with bigger symbols. so what's the gain?
Regexes were developed for a reason -- they're the simplest, most accurate representation possible.
There is a Symbolix Regular Expression Builder package for Emacs, but looking at it, I think that regular expressions are easier to work with.
Short answer: no, not until artificial intelligence improves A LOT.
If you wrote something like this, you'd have a very limited syntax. For someone to know "Match the whole word 'dio' in some file", they would basically need to have significant knowledge of regular expressions. At that point, just use regular expressions.
For non-technical users, this will never work unless you limit it to basic "find this phrase" or, maybe, "find lines starting/ending with ??". They're never going to come up with something like this:
Find lines containing a less-than symbol followed by the string 'img' followed by one or more groupings of: some whitespace followed by one or more letters followed by either a double-quoted string or a single-quoted string, and those groupings are followed by any length of whitespace then a slash and a greater-than sign.
That's my attempt at a plain-language version of this relatively simple regex:
/<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*/>/i
Yeah, I agree with you that it is subjective. But I will answer your question because I think that you have asked a wrong question.
The answer is "YES". Almost anything can be coded and this would be a rather simple application to code. Will it work perfectly? No, it wouldn't because natural language is quite complex to parse and interpret. But it is possible to write such an engine with some constraints.
Generating a regex via the use of a natural language processor is quite possible. Prolog is supposed to be a good language choice for this kind of problem. In practice, however, what you'd be doing, in effect, is designing your own input language which provides a regex as output. If your goal is to produce regexs for a specific task, this might in fact be useful. Perhaps the task you are doing tends to require certain formulations that are doable but not built into regular expressions. Though whether this will be more effective than just creating the regexs one at a time depends on your project. Usually this is probably not the case, since your own language is not going to be as well-known or as well-documented as regex. If your goal is to produce a replacement for regex whose output will be parsed as a regex, I think you're asking a lot. Not to say people haven't done the same sort of thing before (e.g. the C++ language as an 'improvement' that runs, originally, on C++).
try the open source mac application Ruby Regexp Machine, at http://www.rubyregexp.sf.net. It is written in ruby, so you can use some of the code even if you are not on mac. You can describe a lot of simple regular expresions in an easy english grammar. As a disclosure, i did make this tool.

When is a issue too complex for a regular expression?

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
To truely match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses vs the monsterous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmers toolbox but they aren't a solution to all your parsing problems. If you find your RegEx solution starting to become extremely complex you need to either attempt to logically break it up into smaller regular expressions to match portions of your text or you need to start looking at other methods to solve your problem. Similarly, there are simply problems that Regular Expressions, due to their nature, can't solve (as one poster said, not adhering to Regular Language).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
Essentially, you're going to far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver.
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
Sure sign to stop using regexps is this: if you have many grouping braces '()' and many alternatives '|' then it is a sure sign that you try to do a (complex) parsing with regular expressions.
Add to the mix Perl extensions, backreferences, etc and soon you have yourself a parser that is hard to read, hard to modify, and hard to reason about it's properties (e.g. is there an input on which this parser will work in a exponential time).
This is a time to stop regexing and start parsing (with hand-made parser, parser generators or parser combinators).
Along with tremendous expressions, there are principal limitations on the words, which can be handled by regexp.
For instance you can not not write regexp for word described by n chars a, then n chars b, where n can be any, more strictly .
In different languages regexp is a extension of Regular language, but time of parsing can be extremely large and this code is non-portable.
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions), the maintainability and readability limit is surpassed a lot earlier than the technical limit in most cases.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid but I often lament not being able to do database type of queries using regular expression. Now especially more then before because I am entering those types of search string all the time on search engines. its very difficult, if not impossible to search for +complex AND +"regular expression"
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer