What are some of the most useful regular expressions for programmers? - regex

I am new to regular expressions and have just started learning them. I was wondering which regular expressions programmers use most often. To put it another way: what are regular expressions most useful for, and how can they help me in my everyday tasks? I would prefer to know regular expressions that are useful for everyday programming, not occasionally used ones such as email address matching.
Anyone? Thanks.
Edit: Most of the answers include regular expressions to match email addresses, URLs, dates, phone numbers, etc. Please note that not all programmers have to worry about these things in their everyday tasks. I would like to know some more generic uses of regular expressions, if there are any, which programmers in general (may) use regardless of what language or domain they are working in.

Regular expression examples for
Decimals input
Positive Integers ^\d+$
Negative Integers ^-\d+$
Integer ^-?\d+$
Positive Number ^\d*\.?\d+$
Negative Number ^-\d*\.?\d+$
Positive Number or Negative Number ^-?\d*\.?\d+$
Phone number ^\+?[\d\s]{3,}$
Phone with code ^\+?[\d\s]+\(?[\d\s]{10,}$
Year 1900-2099 ^(19|20)\d{2}$
Date (dd mm yyyy, d/m/yyyy, etc.)
^([1-9]|0[1-9]|[12][0-9]|3[01])\D([1-9]|0[1-9]|1[012])\D(19[0-9][0-9]|20[0-9][0-9])$
IP v4:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])(\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])){3}$
Alphabetic input
Personal Name ^[\w.']{2,}(\s[\w.']{2,})+$
Username ^[\w\d_.]{4,}$
Password at least 6 symbols ^.{6,}$
Password or empty input ^.{6,}$|^$
email ^[_]*([a-z0-9]+(\.|_*)?)+@([a-z][a-z0-9-]+(\.|-*\.))+[a-z]{2,6}$
domain ^([a-z][a-z0-9-]+(\.|-*\.))+[a-z]{2,6}$
Other regular expressions
- Match no input ^$
- Match blank input ^\s*$
- Match New line [\r\n]|$
- Match white Space ^\s+$
- Match Url = ^http\:\/\/[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}$
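A quick way to check patterns like the ones listed above is to run them against a few sample strings. This is a minimal sketch using Python's re module; the PATTERNS dictionary, the matches helper and the sample inputs are made up for illustration. The IPv4 pattern here groups the dot together with each following octet so that the whole "dot plus octet" unit repeats three times.

```python
import re

# Patterns in the spirit of the list above; sample strings are illustrative only.
OCTET = r"(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"
PATTERNS = {
    "integer": r"^-?\d+$",
    "number": r"^-?\d*\.?\d+$",
    "year": r"^(19|20)\d{2}$",
    # The dot is grouped with each following octet so that it repeats too.
    "ipv4": r"^" + OCTET + r"(\." + OCTET + r"){3}$",
}


def matches(name, text):
    """Return True if `text` matches the named pattern."""
    return re.search(PATTERNS[name], text) is not None


for name, sample in [("integer", "-42"), ("number", ".5"),
                     ("year", "1850"), ("ipv4", "192.168.0.256")]:
    print(name, repr(sample), matches(name, sample))
```

Testing the rejected cases (a year outside the range, an octet above 255) is usually more informative than testing the accepted ones.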

I would take a different angle on this and say that it's most helpful to know when to use regular expressions and when NOT to use them.
For example, imagine this problem: "Figure out if a string ends with a whitespace character." A regular expression could be used here, but if you're using C#, this code is much faster:
bool EndsWithWhitespace(string s)
{
return !string.IsNullOrEmpty(s) && char.IsWhiteSpace(s[s.Length - 1]);
}
Regular expressions are powerful, and it's important to know when they're too powerful for the problem you're trying to solve.
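The same comparison can be sketched in Python. Both function names below are invented for this example, and no timing is claimed here; the point is only that the plain string check says exactly what it does without a pattern to decode.

```python
import re

# Whitespace character at the very end of the string.
TRAILING_WS = re.compile(r"\s\Z")


def ends_with_whitespace_regex(s):
    """Regex version: search for a whitespace character at the end."""
    return TRAILING_WS.search(s) is not None


def ends_with_whitespace_plain(s):
    """Plain version: look at the last character, if there is one."""
    return bool(s) and s[-1].isspace()


for sample in ["abc ", "abc", "", "tab\t"]:
    print(repr(sample), ends_with_whitespace_regex(sample),
          ends_with_whitespace_plain(sample))
```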

Think about input fields that require validation, such as zip codes, telephone numbers, et cetera. Regular expressions are widely used to validate those. Also, take a look at this site, which contains many tutorials and many more examples, some of which I present next:
Numeric Ranges. Since regular expressions work with text rather than numbers, matching specific numeric ranges requires a bit of extra care.
Matching a Floating Point Number. Also illustrates the common mistake of making everything in a regular expression optional.
Matching an Email Address. There's a lot of controversy about what is a proper regex to match email addresses. It's a perfect example showing that you need to know exactly what you're trying to match (and what not), and that there's always a trade-off between regex complexity and accuracy.
Matching Valid Dates. A regular expression that matches 31-12-1999 but not 31-13-1999.
Finding or Verifying Credit Card Numbers. Validate credit card numbers entered on your order form. Find credit card numbers in documents for a security audit.
And many, many, many more possible applications.

E-mail address
Website
File-Paths
Phone-numbers/Fax/ZIP and other numbers used in business (chemistry numbers, etc.)
file content (check if the file can be a valid XML-file,...)
code modification and formatting (with replacement)
data types (GUID, parsing of integers,...)
...

Up to closing tag
([^<]*)
Seriously. I use combinations of that way too often for comfort... We should all ditch regexes for PEG parsers, especially since there's a nice regex-like grammar style for them.

Well... I kind of think your question is wrong. It sounds like you're asking about regular expressions that could/should be as much a part of one's coding, or nearly so, as things like mathematical operators. Really, if your code depends that pervasively on regular expressions, you're probably doing something very wrong. For pervasive use throughout code, you want to use data structures that are better defined and more efficient to work with than regular-expression-managed strings.
The closest thing to what you're asking for that would make much sense to me would be something like /\s+/ used for splitting strings on arbitrary amounts of whitespace.
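As a small illustration of that whitespace split, here is a sketch in Python (the sample line is made up). For this particular job Python's built-in str.split() with no arguments gives the same result, which fits the point above about not reaching for a regex everywhere.

```python
import re

line = "alpha   beta\tgamma \n delta"

# Split on any run of whitespace; strip first to avoid empty leading/trailing fields.
fields = re.split(r"\s+", line.strip())
print(fields)  # ['alpha', 'beta', 'gamma', 'delta']
```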

This is a little like asking 'what are the most useful words for programmers?'
It depends what you're going to use them for, and it depends which language. And you didn't say.
Some programmers never need to worry about matching email addresses, phone numbers, ZIP codes and IP addresses.
My copy of
Mastering Regular Expressions, O'Reilly, 3rd Edition, 2006
devotes a lot of space to the flavours of regex used by different languages.
It's a great reference, but I found the 2nd edition more readable.

How can they help me in my every day tasks?
A daily use for programmers could include
search/replace of sample data for testing purposes
searching through log files for String patterns (Exceptions, for example)
searching a directory structure for files of a certain type (as simple as dir *.txt does this)
to name just a few
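For instance, the log-file search could look something like the sketch below. The log lines, the EXCEPTION pattern and the find_exceptions helper are all invented for illustration; a real log format would likely need a different pattern.

```python
import re

# Hypothetical log lines, made up for this example.
LOG = """\
2024-01-05 10:00:01 INFO  Starting worker
2024-01-05 10:00:02 ERROR java.lang.NullPointerException: id was null
2024-01-05 10:00:03 WARN  Retrying request
2024-01-05 10:00:04 ERROR java.io.IOException: connection reset
"""

# A dotted, Java-style class name that ends in "Exception".
EXCEPTION = re.compile(r"\b(?:\w+\.)*\w*Exception\b")


def find_exceptions(text):
    """Return every exception class name mentioned in `text`, in order."""
    return EXCEPTION.findall(text)


print(find_exceptions(LOG))
```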

E-mail
Website URL
Phone-numbers
ZIP Code
Alphanumeric (e.g. a user name that consists of letters and digits and must start with a letter)
IP Address

This will be completely dependent on what domain you work in. For some it will be phone numbers and SSN's and others it will be email addresses, IP addresses, URLs. The most important thing is knowing when you need a regex and when you don't. For example, if you're trying to parse data from an XML or HTML file, it's usually better to use a library specifically designed to parse that content than to try and write something yourself.

Related

How to learn regular expressions

I.e., I get a list of words and I want to construct a simple regular expression from that which matches at least all of the words (but maybe more).
I want to have an algorithm for that, i.e. the input of that algorithm is a list of words and the output is a regular expression. Obviously, there will be some restrictions. Either the regular expression will always match more words than I gave it (if it should match an infinite number of words and I only give it a finite number), or I will need some more compact representation of the input. I am also thinking about giving it some regular expression as input plus a list of additional words, and getting back a regular expression which matches all of them together (and maybe more). In any case, it should try to construct a regular expression which is as simple as possible.
What techniques are available which can do that?
I was quite misunderstood. I know the general principles behind regular expressions. I know what they are. And in most cases I can quite easily come up with a regular expression for some language by hand. But I am searching for algorithms which do that.
Again formulated a bit different:
Let L be a regular language. Let M_n be a finite subset of L with n elements. Let M_n be a subset of M_(n+1).
I want to have an algorithm LRE which gets a finite set of words and outputs a regular expression. And I want to have the property:
lim_n->infinity | diff( LRE(M_n), L ) | = 0
See this website to learn the general principles: http://www.regular-expressions.info/
If all you have is a list of words such as dog, cat, cow, mouse, the simplest regex to match any of these would be: dog|cat|cow|mouse, but note that it will also match doggone, scatological, etc. It may or may not match DOGGONE, COWPATTY, etc., depending on whether or not you are doing case-sensitive matching. Better patterns can be given if more particulars about your problem are given.
It's also a good idea to get a regex testing tool. I like Expresso; it is good for .NET patterns. Since regex capabilities may vary between platforms, make sure your tool supports your platform.
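A minimal sketch of that word-list alternation in Python follows. The words_to_regex helper is hypothetical; it escapes each word and, optionally, wraps the alternation in word boundaries so that doggone no longer matches.

```python
import re


def words_to_regex(words, whole_words=True):
    """Build an alternation that matches any of `words` (hypothetical helper)."""
    # Longest words first, so a shorter word never shadows a longer one.
    ordered = sorted(words, key=len, reverse=True)
    alternation = "|".join(re.escape(w) for w in ordered)
    if whole_words:
        return r"\b(?:" + alternation + r")\b"
    return "(?:" + alternation + ")"


WORDS = ["dog", "cat", "cow", "mouse"]
loose = re.compile(words_to_regex(WORDS, whole_words=False))
strict = re.compile(words_to_regex(WORDS))
strict_nocase = re.compile(words_to_regex(WORDS), re.IGNORECASE)

print(bool(loose.search("doggone")))         # substring match
print(bool(strict.search("doggone")))        # whole words only
print(bool(strict_nocase.search("a DOG")))   # case-insensitive
```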
This problem has been studied over the last decade. You might want to google DFA learning, and download a couple of papers to get a sense of the state of the art.
Once you have the DFA, generating a regular expression is trivial. To avoid the problems @FrustratedWithDesign mentions, some conditions are introduced, such as generating the DFA with the fewest nodes; from a machine learning point of view this is similar to having a regularization condition that favors the simplest hypothesis.
Use this site to learn the basics and use rubular for live testing.
If you have a list of distinct words that you want to match -- it doesn't sound like you're matching on something that a regular expression is best at.
As FrustratedWithFormsDesigner pointed out -- your regex is going to be mapped to the items in the list in the worst case; best case you can find common prefixes. And if you automate the regex construction, why bother with the regex? What is the use-case?
But if your list is beyond a trivial size, you'd probably be better off looping through it.
http://www.regular-expressions.info is a fantastic site for Regex Reference.
When building a complex regex, I typically use Expresso. It's a free app that helps you build Regular expressions. It breaks them down into a tree view so that it is easy to see what all parts are doing. http://www.ultrapico.com/Expresso.htm It is made to work with .NET languages, but there are plenty of tools like this available for different languages.
To build my Regex, I'll usually start with an acceptable value and start replacing characters with Regex syntax.
For example, if I was trying to match a URL I would start with
http://www.mydomain.com
I would then escape anything that needs escaping
http://www\.mydomain\.com
then I would start replacing characters
http://www\.\w+\.\w+\.\w+
obviously this expression needs some more work, but you get the idea
Here is a site for Perl regex:
http://perldoc.perl.org/perlre.html

In Which Cases Is Better To Use Regular Expressions?

I'm starting to learn Regular Expressions and I want to know: In which cases is better to use them?
Regular expressions are a form of pattern matching that you can apply to textual content. Take for example the DOS wildcards ? and * which you can use when you're searching for a file. That is a kind of very limited subset of RegExp. For instance, if you want to find all files beginning with "fn", followed by 1 to 4 random characters, and ending with "ht.txt", you can't do that with the usual DOS wildcards. RegExp, on the other hand, could handle that and much more complicated patterns.
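The "fn, then 1 to 4 characters, then ht.txt" example could be written like this in Python. The file names are made up, and the pattern is one reasonable reading of the description, not the only one.

```python
import re

# "fn", then 1 to 4 arbitrary characters, then the literal "ht.txt".
PATTERN = re.compile(r"fn.{1,4}ht\.txt")

# Hypothetical file names for illustration.
names = ["fnabht.txt", "fnht.txt", "fnabcdht.txt", "fnabcdeht.txt", "xfnabht.txt"]

# fullmatch requires the whole name to fit the pattern.
selected = [n for n in names if PATTERN.fullmatch(n)]
print(selected)
```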
Regular expressions are, in short, a way to effectively
handle data
search and replace strings
provide extended string handling.
Often a regular expression can in itself provide string handling that other functionalities such as the built-in string methods and properties can only do if you use them in a complicated function or loop.
When you are trying to find/replace/validate complicated string patterns.
I use regular expressions when comparing strings (preg_match), replacing substrings (sed,preg_replace), replacing characters (sed,preg_replace), searching for strings in files (grep), splitting strings (preg_split) etc.
It is a very flexible and widespread pattern expression language and it is very useful to know.
BUT! It's like they say about poker, it's very easy to learn, but very hard to master.
I just came across a question that i thought was perfect for a RegEx, have a look and decide for yourself.
There are some cases where, if you need better performance, you should avoid regular expressions in favor of writing code. An example of this is parsing very large CSV files.
Regular expressions are a DSL (domain-specific language) for parsing text, just like XPath is a DSL for traversing XML. It is essentially a mini language inside of a general-purpose language. You can accomplish quite a bit in a very small amount of code because it is specialized for a narrow purpose. One very common use for regular expressions is checking if a string is an email address, phone number, SSN, etc.
There are also cases where regular expressions are >>NOT<< appropriate (in general; there are always exceptions).
Parsing HTML
Parsing XML
In the above cases a DOM parser is almost always a better choice. The grammars are complex and there are too many edge cases, such as nested tags.
Also be sure to consider future maintenance programmers (which may be you). Comments and/or well-chosen method/constant/variable names can make a world of difference, especially for developers not fluent in regular expressions.
Regular expressions can be especially useful for validating the format of free text input. Of course they can't validate the correctness of data, just its format. And you have to keep in mind regional variations for certain types of values (phone numbers or postal codes for example). But for cases where valid input can be defined as a text pattern, regexes make quick work of the validation.

Is it possible for a computer to "learn" a regular expression by user-provided examples?

Is it possible for a computer to "learn" a regular expression by user-provided examples?
To clarify:
I do not want to learn regular expressions.
I want to create a program which "learns" a regular expression from examples which are interactively provided by a user, perhaps by selecting parts from a text or selecting begin or end markers.
Is it possible? Are there algorithms, keywords, etc. which I can Google for?
EDIT: Thank you for the answers, but I'm not interested in tools which provide this feature. I'm looking for theoretical information, like papers, tutorials, source code, names of algorithms, so I can create something for myself.
Yes,
it is possible,
we can generate regexes from examples (text -> desired extractions).
This is a working online tool which does the job: http://regex.inginf.units.it/
Regex Generator++ online tool generates a regex from provided examples using a GP search algorithm.
The GP algorithm is driven by a multiobjective fitness which leads to higher performance and simpler solution structure (Occam's Razor).
This tool is a demonstration application by the Machine Learning Lab, Trieste University (Università degli studi di Trieste).
Please look at the video tutorial here.
This is a research project so you can read about used algorithms here.
Behold! :-)
Finding a meaningful regex/solution from examples is possible if and only if the provided examples describe the problem well.
Consider these examples that describe an extraction task, we are looking for particular item codes; the examples are text/extraction pairs:
"The product code is 467-345A" -> "467-345A"
"The item 789-345B is broken" -> "789-345B"
A human, looking at the examples, may say: "the item codes are things like \d++-345[AB]"
When the item code is actually more permissive but we have not provided other examples, we do not have enough evidence to understand the problem well.
When applying the human generated solution \d++-345[AB] to the following text, it fails:
"On the back of the item there is a code: 966-347Z"
You have to provide other examples, in order to better describe what is a match and what is not a desired match:
--i.e:
"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"
The phone number is not a product id, this may be an important proof.
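The effect of too few examples can be reproduced in a few lines. This is only a sketch: NARROW is the human guess from above written with \d+ (possessive \d++ needs Python 3.11 or later), and BROADER is my own assumption about what a better-informed guess might look like, not something the tool produced.

```python
import re

# Human guess from the first two examples (non-possessive form).
NARROW = re.compile(r"\d+-345[AB]")
# An assumed broader guess: three digits, dash, three digits, one capital letter.
BROADER = re.compile(r"\d{3}-\d{3}[A-Z]")

TEXTS = [
    "The product code is 467-345A",
    "The item 789-345B is broken",
    "On the back of the item there is a code: 966-347Z",
    "My phone is +39-128-3905 , and the phone product id is 966-347Z",
]


def extract(pattern, text):
    """Return the first extraction, or None if the pattern finds nothing."""
    m = pattern.search(text)
    return m.group(0) if m else None


for text in TEXTS:
    print(extract(NARROW, text), extract(BROADER, text))
```

The narrow pattern finds nothing in the last two texts, while the broader one skips the phone number and still extracts the product id.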
The book An Introduction to Computational Learning Theory contains an algorithm for learning a finite automaton. As every regular language is equivalent to a finite automaton, it is possible to learn some regular expressions by a program. Kearns and Valiant show some cases where it is not possible to learn a finite automaton. A related problem is learning hidden Markov Models, which are probabilistic automata that can describe a character sequence. Note that most modern "regular expressions" used in programming languages are actually stronger than regular languages, and therefore sometimes harder to learn.
No computer program will ever be able to generate a meaningful regular expression based solely on a list of valid matches. Let me show you why.
Suppose you provide the examples 111111 and 999999, should the computer generate:
A regex matching exactly those two examples: (111111|999999)
A regex matching 6 identical digits (\d)\1{5}
A regex matching 6 ones and nines [19]{6}
A regex matching any 6 digits \d{6}
Any of the above three, with word boundaries, e.g. \b\d{6}\b
Any of the first three, not preceded or followed by a digit, e.g.
(?<!\d)\d{6}(?!\d)
As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches.
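To make the ambiguity concrete, the sketch below runs four of those candidate patterns against a handful of probe strings. The PROBES list and the accepted_by helper are made up for this illustration.

```python
import re

CANDIDATES = {
    "exactly the two examples": r"(111111|999999)",
    "six identical digits": r"(\d)\1{5}",
    "six ones and nines": r"[19]{6}",
    "any six digits": r"\d{6}",
}

# Probe strings chosen to tell the candidates apart.
PROBES = ["111111", "999999", "555555", "191919", "123456"]


def accepted_by(pattern):
    """Return the probes that the pattern matches in full."""
    return [p for p in PROBES if re.fullmatch(pattern, p)]


for name, pattern in CANDIDATES.items():
    print(f"{name:26} {accepted_by(pattern)}")
```

All four patterns accept both original examples, yet each accepts a different set of other strings, so the examples alone cannot pick a winner.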
If you don't want to list all possible matches, you need a higher-level description. That's exactly what regular expressions are designed to provide. Instead of providing a long list of 6-digit numbers, you simply tell the program to match "any six digits". In regular expression syntax, this becomes \d{6}.
Any method of providing a higher-level description that is as flexible as regular expressions will also be as complex as regular expressions. All tools like RegexBuddy can do is to make it easier to create and test the high-level description. Instead of using the terse regular expression syntax directly, RegexBuddy enables you to use plain English building blocks. But it can't create the high-level description for you, since it can't magically know when it should generalize your examples and when it should not.
It is certainly possible to create a tool that uses sample text along with guidelines provided by the user to generate a regular expression. The hard part in designing such a tool is how does it ask the user for the guiding information that it needs, without making the tool harder to learn than regular expressions themselves, and without restricting the tool to common regex jobs or to simple regular expressions.
Yes, it's certainly "possible"; Here's the pseudo-code:
string MakeRegexFromExamples(<listOfPosExamples>, <listOfNegExamples>)
{
if HasIntersection(<listOfPosExamples>, <listOfNegExamples>)
return <IntersectionError>
string regex = "";
foreach(string example in <listOfPosExamples>)
{
if(regex != "")
{
regex += "|";
}
regex += DoRegexEscaping(example);
}
regex = "^(" + regex + ")$";
// Ignore <listOfNegExamples>; they're excluded by definition
return regex;
}
The problem is that there are an infinite number of regexs that will match a list of examples. This code provides the simplest/stupidest regex in the set, basically matching anything in the list of positive examples (and nothing else, including any of the negative examples).
I suppose the real challenge would be to find the shortest regex that matches all of the examples, but even then, the user would have to provide very good inputs to make sure the resulting expression was "the right one".
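For reference, a runnable rendering of that pseudo-code in Python might look like this. It is a sketch that keeps the same deliberately naive strategy; the function name and the choice to raise ValueError on overlapping examples are my own assumptions.

```python
import re


def make_regex_from_examples(positives, negatives=()):
    """Return a regex matching exactly the positive examples (naive approach)."""
    overlap = set(positives) & set(negatives)
    if overlap:
        raise ValueError(f"examples are both positive and negative: {sorted(overlap)}")
    # Escape each example so characters like '.' are matched literally.
    alternation = "|".join(re.escape(p) for p in positives)
    # Negative examples need no handling: nothing outside the list can match.
    return "^(" + alternation + ")$"


pattern = make_regex_from_examples(["a.b", "111111"], ["axb"])
print(pattern)
```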
I believe the term is "induction". You want to induce a regular grammar.
I don't think it is possible with a finite set of examples (positive or negative). But, if I recall correctly, it can be done if there is an Oracle which can be consulted. (Basically you'd have to let the program ask the user yes/no questions until it was content.)
You might want to play with this site a bit, it's quite cool and sounds like it does something similar to what you're talking about: http://txt2re.com
There's a language dedicated to problems like this, based on prolog. It's called progol.
As others have mentioned, the basic idea is inductive learning, often called ILP (inductive logic programming) in AI circles.
Second link is the wiki article on ILP, which contains a lot of useful source material if you're interested in learning more about the topic.
@Yuval is correct. You're looking at computational learning theory, or "inductive inference."
The question is more complicated than you think, as the definition of "learn" is non-trivial. One common definition is that the learner can spit out answers whenever it wants, but eventually, it must either stop spitting out answers, or always spit out the same answer. This assumes an infinite number of inputs, and gives absolutely no guarantee on when the program will reach its decision. Also, you can't tell when it HAS reached its decision because it might still output something different later.
By this definition, I'm pretty sure that regular languages are learnable. By other definitions, not so much...
I've done some research on Google and CiteSeer and found these techniques/papers:
Language identification in the limit
Probably approximately correct learning
Also Dana Angluin's "Learning regular sets from queries and counterexamples" seems promising, but I wasn't able to find a PS or PDF version, only cites and seminar papers.
It seems that this is a tricky problem even on the theoretical level.
If it's possible for a person to learn a regular expression, then it is fundamentally possible for a program. However, that program will need to be correctly programmed to be able to learn. Luckily this is a fairly finite space of logic, so it wouldn't be as complex as teaching a program to be able to see objects or something like that.

I don’t get regular expressions

I don’t understand or see the need for regular expressions.
Can some explain them in simple terms and provide some basic examples where they could be useful, or even critical.
Use them where you need to use/manipulate patterns. For instance, suppose you need to recognise the following pattern:
Any letter, A-Z, either upper or lower case, 5 or 6 times
3 digits
a single letter a-z (definitely lower case)
(Things like this crop up for zip code, credit card, social security number validation etc.)
That's not really hard to write in code - but it becomes harder as the pattern becomes more complicated. With a regular expression, you describe the pattern (rather than the code to validate it) and let the regex engine do the work for you.
The pattern here would be something like
[A-Za-z]{5,6}[0-9]{3}[a-z]
(There are other ways of expressing it too.) Grouping constructs make it easy to match a whole pattern and grab (or replace) different bits of it, too.
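Here is a small sketch of that pattern in Python, with parentheses added so each part can be pulled out as a group. The split_code helper and the sample strings are invented for illustration.

```python
import re

# 5 or 6 letters, then 3 digits, then one lower-case letter; each part is a group.
CODE = re.compile(r"([A-Za-z]{5,6})([0-9]{3})([a-z])")


def split_code(text):
    """Return (letters, digits, suffix) if `text` fits the pattern, else None."""
    m = CODE.fullmatch(text)
    return m.groups() if m else None


for sample in ["Hello123x", "abcdef456z", "abcd123x", "Hello123X"]:
    print(sample, split_code(sample))
```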
A few downsides though:
Regexes can become complicated and hard to read quite quickly. Document thoroughly!
There are variations in behaviour between different regex engines
The complexity can be hard to judge if you're not an expert (which I'm certainly not!); there are "gotchas" which can make the patterns really slow against particular input, and these gotchas aren't obvious at all
Some people overuse regular expressions massively (and some underuse them, of course). The worst example I've seen was where someone asked (on a C# group) how to check whether a string was length 3 - this is clearly a job for using String.Length, but someone seriously suggested matching a regex. Madness. (They also got the regex wrong, which kinda proves the point.)
Regexes use backslashes to escape various things (e.g. use \. to mean "a dot" rather than just "any character"). In many languages the backslash itself needs escaping.
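The escaping point can be seen in a short Python sketch. Python is used here only as an example language; its raw string literals (the r prefix) avoid doubling the backslash.

```python
import re

# The same pattern written two ways: raw string vs. doubled backslash.
raw = r"a\.b"
doubled = "a\\.b"
print(raw == doubled)

# Escaped dot matches only a literal dot; an unescaped dot matches any character.
print(bool(re.fullmatch(r"a\.b", "a.b")), bool(re.fullmatch(r"a\.b", "axb")))
print(bool(re.fullmatch(r"a.b", "axb")))
```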
What regular expressions are used for:
Regular expressions are a language in themselves that allows you to perform complex validation of string inputs, i.e. you pass it a string and it will return true or false depending on whether it is a match or not.
How regular expressions are used:
Form validation, determine if what the user entered is of the format you want
Finding the position of a certain pattern in a block of text
Search and replace where the search term is a regex and what to replace is a normal string.
Some regular expression language features:
Alternation: allows you to select one thing or another. Example match only yes or no.
yes|no
Grouping: You can define scope and have precedence using parentheses. For example match 3 color shades.
gr(a|e)y|black|white
Quantification: You can quantify how much of something you want. ? means 1 or 0, * means 0 or more. + means at least one. Example: Accept a binary string that is not empty:
(0|1)+
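The three features above can be tried out directly. This sketch uses Python's re.fullmatch so that the whole string has to fit the pattern; the fits helper and the sample strings are made up.

```python
import re


def fits(pattern, text):
    """True if the entire text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Alternation: one thing or another.
print(fits(r"yes|no", "yes"), fits(r"yes|no", "maybe"))

# Grouping: the parentheses limit the alternation to a single letter.
print([w for w in ["gray", "grey", "groy", "black", "white"]
       if fits(r"gr(a|e)y|black|white", w)])

# Quantification: + demands at least one binary digit.
print(fits(r"(0|1)+", "0101"), fits(r"(0|1)+", ""), fits(r"(0|1)+", "012"))
```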
Why regular expressions?
Regular expressions make it easy to match strings; a small regular expression can often replace several dozen lines of source code.
Not for all types of matching:
To understand how something is useful, you should also understand how it is not useful. Regular expressions are bad for certain tasks for example when you need to guarantee that a string has an equal number of parentheses.
Available in just about all languages:
Regular expressions are available in just about any programming language.
Formal language:
Any regular expression can be converted to a deterministic finite state machine. And in this same way you can figure out how to make source code that will validate your regular expression.
Example:
[hc]+at
matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at"
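To illustrate the finite state machine point, here is a hand-written deterministic automaton for [hc]+at in Python, checked against the regex itself. The state numbering and the TRANSITIONS table are my own construction for this sketch.

```python
import re

# Transition table for [hc]+at. State 0 is the start, state 3 the only accepting state.
TRANSITIONS = {
    (0, "h"): 1, (0, "c"): 1,
    (1, "h"): 1, (1, "c"): 1, (1, "a"): 2,
    (2, "t"): 3,
}
ACCEPTING = {3}


def dfa_accepts(text):
    """Run the automaton over `text`; a missing transition means rejection."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


for word in ["hat", "cat", "hhat", "chat", "hcat", "ccchat", "at", "hats"]:
    print(word, dfa_accepts(word), re.fullmatch(r"[hc]+at", word) is not None)
```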
Source, further reading
They look a bit cryptic but they provide a very powerful tool for finding patterns in text. Anything from href tags in HTML pages to validating email addresses.
And they can be processed into a very efficient data structure (FSA) that finds matches very fast.
They are a bit tricky, but extremely powerful and worth learning. The web is full of tutorial and examples, start for example from here and look at the examples here.
If I could direct the OP to some of the answers/comments on one of my own questions: How important is knowing Regexs?
Regular expressions are a very concise way to specify most pattern-matching and -replacement problems, and regexp engines can be very highly optimized.
If you wanted to do the same job as even a relatively simple regexp, you'd have to write a lot of code, which probably would contain a number of bugs, be hard to understand and perform badly.
Whereas doing the same with a regexp is much shorter, almost certainly performs as well as is technically possible, and is easier to understand to anyone familiar with regexpes (though it should be commented in either case)
The email example is actually a bad example for regular expressions. Regexes can be used, but the resulting expression (for example this one which doesn't handle "John Doe " style addresses) is hugely complicated - take a look at the email address specification and you'll see why...
However regexes are very useful in a host of other situations, extracting ip addresses from text, tags from html etc. Finding all versioned files would be another example. Something along the lines of:
my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt
will match any filenames of the format my_versioned_file_2009-02-26.txt and pull out the date as a captured group (the part wrapped in "()") for you to further analyse.
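A sketch of that capture in Python follows. The dot before txt is escaped so it only matches a literal dot, and the version_date helper is hypothetical; note that date.fromisoformat raises ValueError for a well-shaped but impossible date such as 2009-13-40.

```python
import re
from datetime import date

# The parenthesised part is the captured group holding the date.
VERSIONED = re.compile(r"my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt")


def version_date(filename):
    """Return the date embedded in the file name, or None if it does not fit."""
    m = VERSIONED.fullmatch(filename)
    return date.fromisoformat(m.group(1)) if m else None


print(version_date("my_versioned_file_2009-02-26.txt"))
print(version_date("my_other_file.txt"))
```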
No, regexes are not necessary, but they can save a world of time compared with writing a hand-rolled parser for something a regex can easily achieve.
Whenever you've got some pattern to find in a lot of textual data or if you want to check that a string is in a certain format.
For example an email address...
The code for checking for an at symbol and the presence of a valid domain will look quite big where you could just use a regular expression and have an answer in 2 lines of code.
Regex r = new Regex("<An Email Address Regex>");
bool isValidEmail = r.IsMatch(MyInput);
Other examples would be for checking numbers are in the correct format before parsing them into integers etc.
Jon and Sqook gave a fine explanation and definition of Regular Expressions, and for simple problems it is pretty understandable, but if you use it for complex problems regular expressions can be a &$#( (at least for me ;-))
I use Expresso a lot to help me build complex regular expression code.
http://www.ultrapico.com/Expresso.htm
It has a built-in library with expressions you can use, a design mode where you can build your code and a test mode where you can test and validate the code. It helped me build and understand complex expressions better!
Good luck!
Some practical real world usages:
Finding abstract classes that extend JUnit's TestCase:
abstract\s+class\s+\w+\s+extends\s+TestCase
This is useful for finding test cases that cannot be instantiated and will need excluding from an ant build script that runs test cases. You cannot search for regular text because you don't know the class names in advance. hence the \w+ (At least one word character).
Finding running bash or bourne shell scripts:
ps -e | grep -E " sh| bash"
This is useful if you want to kill them all or something. If you did a search for just sh, you'd not get the bash ones and would have to run the command again for bash scripts. Again, more serviceable than perfect, but nearly no regex you write on the fly will be.
It's not perfect, but most regexes won't be, or they'll take so long to write they're not worth it. The ones you perfect are the ones you commit as part of some sort of validation or built application.
Example of critical use is JavaScript:
If you need to do search or replace on a string, the only matching you can do is a regular expression. It's in the JavaScript API on those string methods...
Personally, I mostly use regular expressions only when I need some advanced matching in some automated find/replace in a text editor (TextPad or Visual Studio). The most powerful feature in my view is the ability to match a pattern that can be inserted in the replace.
To give you some examples:
Email Address
Password requires at least 1 alphabet and 1 digit
How can you achieve these requirements?
The best way is to use regular expression.
Read the following links to learn more:
How To: Use Regular Expressions to Constrain Input in ASP.NET
http://msdn.microsoft.com/en-us/library/ms998267.aspx

When is an issue too complex for a regular expression?

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
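For comparison, the check is a few lines of ordinary code. The sketch below uses a depth counter, which is enough when there is only one kind of bracket; several bracket kinds would need an explicit stack. The function name is made up.

```python
def parens_balanced(text):
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A closing paren appeared before its opening paren.
                return False
    return depth == 0


for sample in ["(())()", "(()", ")("]:
    print(sample, parens_balanced(sample))
```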
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
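A small Python sketch of how the short pattern behaves, assuming it is written with `@` as the separator and compiled case-insensitively (the pattern itself only lists upper-case letters). The helper name `looks_like_email` and the sample addresses are invented for illustration.

```python
import re

# The simple pattern from above; IGNORECASE lets it accept lower-case input.
SIMPLE_EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
    re.IGNORECASE,
)


def looks_like_email(text: str) -> bool:
    """Return True if the whole string matches the simple pattern."""
    return SIMPLE_EMAIL.fullmatch(text) is not None


print(looks_like_email("john.doe@example.com"))   # True
print(looks_like_email("curator@louvre.museum"))  # False: TLD longer than 4 letters
```

The second call shows the limitation mentioned above: the `{2,4}` quantifier rejects any top-level domain longer than four letters.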
To truly match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses than the monstrous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmer's toolbox, but they aren't a solution to all your parsing problems. If you find your regex solution becoming extremely complex, you need either to break it up logically into smaller regular expressions that match portions of your text or to start looking at other methods to solve your problem. Similarly, there are problems that regular expressions, due to their nature, simply can't solve (as one poster said, those not defined by a regular language).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
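The split described here can be sketched in Python as follows: a regex does the tokenizing, and a few lines of ordinary code decide how the tokens fit together. This is only an illustration under simplifying assumptions (double-quoted strings, `//` line comments, curly braces); the token names and the function `braces_balanced` are hypothetical.

```python
import re

# One alternation of named groups. Strings and comments come first so that
# braces inside them are consumed as part of those tokens.
TOKEN_RE = re.compile(r"""
    (?P<STRING>"(?:\\.|[^"\\])*")   # double-quoted string with escapes
  | (?P<COMMENT>//[^\n]*)           # line comment
  | (?P<OPEN>\{)
  | (?P<CLOSE>\})
  | (?P<OTHER>.)                    # anything else, one character at a time
""", re.VERBOSE | re.DOTALL)


def braces_balanced(source: str) -> bool:
    """Tokenize with the regex, then pair braces with a depth counter."""
    depth = 0
    for match in TOKEN_RE.finditer(source):
        kind = match.lastgroup  # name of the alternative that matched
        if kind == "OPEN":
            depth += 1
        elif kind == "CLOSE":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```

For example, `braces_balanced('int main() { if (x) { puts("}"); } }')` returns `True`, because the `}` inside the string literal is swallowed by the `STRING` token and never reaches the brace-pairing logic.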
Essentially, you're going too far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
A sure sign to stop using regexps is this: if you have many grouping parentheses '()' and many alternatives '|', you are trying to do (complex) parsing with regular expressions.
Add Perl extensions, backreferences, etc. to the mix and soon you have yourself a parser that is hard to read, hard to modify, and hard to reason about (e.g. is there an input on which this parser will take exponential time?).
This is the time to stop regexing and start parsing (with a hand-made parser, parser generators or parser combinators).
Besides expressions that grow huge, there are fundamental limits on the languages that a regexp can handle.
For instance, you cannot write a regexp for the words made of n characters a followed by n characters b, where n can be any number; more strictly, words of the form a^n b^n.
In many programming languages, regexps are an extension of regular languages, but matching time can become extremely large and such patterns are not portable.
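For the "n a's followed by n b's" example, one pragmatic route is to let the regex check only the shape and do the counting in ordinary code. A minimal Python sketch; the function name `is_an_bn` is invented for illustration.

```python
import re

# The regex only verifies the shape: some a's followed by some b's.
SHAPE = re.compile(r"(a*)(b*)")


def is_an_bn(word: str) -> bool:
    """Return True if word is n a's followed by exactly n b's."""
    match = SHAPE.fullmatch(word)
    if match is None:
        return False
    # The counting, which the regex cannot express, happens here.
    return len(match.group(1)) == len(match.group(2))


print(is_an_bn("aaabbb"))  # True
print(is_an_bn("aabbb"))   # False
```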
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions): in most cases the maintainability and readability limit is reached long before the technical limit.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid, but I often lament not being able to do database-style queries using regular expressions, now even more than before because I enter those kinds of search strings all the time in search engines. It's very difficult, if not impossible, to search for +complex AND +"regular expression".
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer
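In regex engines that support lookahead assertions, an "AND" of two words can be written as a single pattern. The sketch below uses Python's `re` module, not Emacs; the command names are only sample strings, and engines without lookahead still need the two-pattern approach described above.

```python
import re

# Each lookahead scans the whole name from the start, so order does not matter.
BOTH = re.compile(r"(?=.*buffer)(?=.*window)", re.IGNORECASE)


def has_both(name: str) -> bool:
    """Return True if name contains both 'buffer' and 'window', in any order."""
    return BOTH.match(name) is not None


print(has_both("switch-to-buffer-other-window"))  # True
print(has_both("kill-buffer"))                    # False
```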