How to learn regular expressions - regex

I.e., I get a list of words and I want to construct a simple regular expression from that which matches at least all of the words (but maybe more).
I want to have an algorithm for that, i.e. the input of the algorithm is a list of words and the output is a regular expression. Obviously, there will be some restrictions. Either the regular expression will always match more words than I gave it (if it should match an infinite number of words and I only give it a finite number), or I will need some more compact representation of the input. I am also thinking about giving it a regular expression as input together with a list of additional words, and getting back a regular expression which matches all of them together (and maybe more). In any case, it should try to construct a regular expression which is as simple as possible.
What techniques are available which can do that?
I was quite misunderstood. I know the general principles behind regular expressions. I know what they are. And in most cases I can quite easily come up with a regular expression for some language by hand. But I am searching for algorithms which do that.
Again formulated a bit different:
Let L be a regular language. Let M_n be a finite subset of L with n elements. Let M_n be a subset of M_(n+1).
I want to have an algorithm LRE which gets a finite set of words and outputs a regular expression. And I want to have the property:
lim_n->infinity | diff( LRE(M_n), L ) | = 0

See this website to learn the general principles: http://www.regular-expressions.info/
If all you have is a list of words such as dog, cat, cow, mouse, the simplest regex to match any of these would be: dog|cat|cow|mouse, but note that it will also match doggone, scatological, etc. It may or may not match DOGGONE, COWPATTY, etc., depending on whether or not you are doing case-sensitive matching. Better patterns can be given if more particulars about your problem are given.
It's also a good idea to get a regex testing tool. I like Expresso; it is good for .NET patterns. Since regex capabilities may vary between platforms, make sure your tool supports your platform.

This problem has been looked at over the last decade. You might want to google "DFA learning" and download a couple of papers to get a sense of the state of the art.
Once you have the DFA, generating a regular expression is trivial. To avoid the problems @FrustratedWithDesign mentions, some conditions are introduced, such as generating the DFA with the fewest states. From a machine learning point of view, this is similar to having a regularization condition that favors the simplest hypothesis.

Use this site to learn the basics and use rubular for live testing.

If you have a list of distinct words that you want to match -- it doesn't sound like you're matching on something that a regular expression is best at.
As FrustratedWithFormsDesigner pointed out -- your regex is going to be mapped to the items in the list in the worst case; best case you can find common prefixes. And if you automate the regex construction, why bother with the regex? What is the use-case?
But if your list is beyond a trivial size, you'd probably be better off looping through it.
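The "common prefixes" idea can be sketched in a few lines of Python. This is only an illustration, not a learning algorithm: it builds a trie from the word list and emits a regex that matches exactly those words, sharing prefixes where possible. The function name words_to_regex and the sample words are made up for this example.

```python
import re

def words_to_regex(words):
    """Build a regex matching exactly the given words, sharing common prefixes."""
    # Build a trie of nested dicts keyed by character; the key "" marks end of word.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}

    def emit(node):
        # Turn one trie node into a regex fragment for everything below it.
        end_here = "" in node
        branches = [re.escape(ch) + emit(child)
                    for ch, child in sorted(node.items()) if ch != ""]
        if not branches:
            return ""
        if len(branches) == 1 and not end_here:
            return branches[0]
        body = "(?:" + "|".join(branches) + ")"
        # If a word may also end at this node, the rest is optional.
        return body + "?" if end_here else body

    return emit(trie)

pattern = words_to_regex(["cat", "cow", "cowboy", "dog"])
print(pattern)
print(bool(re.fullmatch(pattern, "cowboy")), bool(re.fullmatch(pattern, "cowb")))
```

The result never generalizes beyond the input list, which is exactly the limitation discussed above.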

http://www.regular-expressions.info is a fantastic site for Regex Reference.
When building a complex regex, I typically use Expresso. It's a free app that helps you build Regular expressions. It breaks them down into a tree view so that it is easy to see what all parts are doing. http://www.ultrapico.com/Expresso.htm It is made to work with .NET languages, but there are plenty of tools like this available for different languages.
To build my Regex, I'll usually start with an acceptable value and start replacing characters with Regex syntax.
For example, if I was trying to match a URL I would start with
http://www.mydomain.com
I would then escape anything that needs escaping
http://www\.mydomain\.com
then I would start replacing characters
http://www\.\w+\.\w+
obviously this expression needs some more work, but you get the idea
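The same workflow, sketched in Python as an illustration (the author works with .NET and Expresso; Python's re module is only assumed here for convenience). The sketch uses a two-label generalization, http://www\.\w+\.\w+, so that it still matches the starting value.

```python
import re

# Step 0: start from a value that should be accepted.
literal = "http://www.mydomain.com"

# Step 1: escape everything that has a special meaning in a regex.
escaped = re.escape(literal)
print(escaped)

# Step 2: replace the concrete labels with \w+ (one or more word characters).
generalized = r"http://www\.\w+\.\w+"

for candidate in ["http://www.mydomain.com", "http://www.example.org", "http://www.example"]:
    print(candidate, bool(re.fullmatch(generalized, candidate)))
```

As the answer says, the pattern still needs work: it rejects hosts without www, https URLs, paths and hyphenated names.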

Here is a site for Perl regex:
http://perldoc.perl.org/perlre.html

Related

TCL string match vs regexps

Is it right that we should avoid using regexp because it is slow, and use string operations instead? Are there cases where both can be used but regexp is better?
You should use the appropriate tool for the job. That means, you should not avoid regex, you should use it when it is necessary.
If you are just searching for a fixed sequence of characters, use string operations.
If you are searching for a pattern, then use regular expressions.
Example
Search for the word "Foo". If you use string operations, it will also find "Foobar". Is this OK? No? Well, then maybe search for "Foo ", but then it will not find "Foo," and "Foo.".
With a regex this is no problem: you can match on word boundaries with /\mFoo\M/, and this regex will not be slow.
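A small sketch of the difference, written in Python rather than Tcl. Note the assumption: Python's re module has no \m and \M, so \b (any word boundary) stands in for them here. The sample text and the helper name whole_word_hits are made up.

```python
import re

text = "Foobar met Foo, then Foo."

def whole_word_hits(word, s):
    """Return all occurrences of word that stand alone as a whole word."""
    return re.findall(r"\b" + re.escape(word) + r"\b", s)

# Plain substring search also counts the "Foo" inside "Foobar".
print(text.count("Foo"))
# Searching for "Foo " (with a trailing space) misses "Foo," and "Foo.".
print(text.count("Foo "))
# The word-boundary regex finds only the standalone words.
print(whole_word_hits("Foo", text))
```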
I think this negative image comes from special problems like catastrophic backtracking.
There has been a recent example (catastrophic-backtracking-shouldnt-be-happening-on-this-regex) where this behaviour was unexpected.
Conclusion
A regex has to be well designed, if it isn't then the performance can be catastrophic. But the same can also happen to your normal code if you use a bad algorithm.
For a small job it should nearly never be a problem to use a regex, if your task is bigger and has to be repeated often, do a benchmark.
From my own experience, I am analyzing really big text files (some hundred MB) and use regexes to find the rows I am interested in and I don't experience performance problems because of regex.
Here is an interesting read about code optimization
Regular expressions (REs) are a marvelous hammer. They can solve some problems elegantly, and many more with brute force, but it won't be pretty. And some problems can be solved with REs if you hit them enough, but there are much better solutions available (for example, things that are a good fit for string map)
string match - or globbing - can be thought of as a simplified version of regular expressions. The glob pattern will usually be shorter than the equivalent regular expression (character classes are an exception - REs support them, with globs you need to spell them out). I don't know offhand how the performance differs; I'd expect string match to be slightly faster on equivalent patterns because of the simpler logic, but measuring with time is much more reliable than expectations.
For a specific case where REs are easier to use, extracting a substring contextually vs. by simple character position is a good example. Or for matching one of several alternatives.
My rule of thumb is to use the simplest thing that works. If that's string match, then great. If it seems like the pattern is too complex for that, go to a regexp and be happy you have the choice.
The best advice I can give, and the advice I use myself is, use regular expressions only when a simpler solution won't work.
If you can use simple string matching, or use glob patterns, use them. It's only when those cannot work that you should be using regular expressions.
To address your specific question I would say that, no, there is no time when you can use either but that regular expressions are the better choice. Maybe there's an edge case I'm not thinking of, but generally speaking, simpler solutions are always better.
I don't know about Tcl in particular, but generally it can be said that if you're looking for exact text matches (e.g. find all lines that start with #define) then string operations are faster. But if you're looking for patterns (e.g. all lines that contain a word that starts with c and ends with t) then regular expressions are the right tool for this (\bc\w*t\b would be a good regex for this - compare this to the program logic you'd need if you had to write this yourself).
And even if regex is slower in a case like this, chances are high that it won't matter in terms of execution speed, but it'll matter a lot when changes to the matching logic are required (oh, now we need to look for a word that starts with c and ends with t but contains at least two as and no x --> \bc(?=\w*a\w*a)(?!\w*x)\w*t\b).
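Both patterns from this answer can be tried out directly. The sketch below uses Python's re module (an assumption; the answer does not name an engine) and a made-up sample sentence.

```python
import re

# A word that starts with c and ends with t.
simple = re.compile(r"\bc\w*t\b")
# Same, but the word must contain at least two "a"s and no "x".
strict = re.compile(r"\bc(?=\w*a\w*a)(?!\w*x)\w*t\b")

text = "cat cabaret cataract caxtant coat"

print(simple.findall(text))
print(strict.findall(text))
```

Changing the requirement meant changing one pattern string; the surrounding code stayed the same.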
A place where most regex engines don't want to go is recursion (matching nested tags, nested parentheses and all that). That's where parsers enter the picture.
Regular expression matching is a kind of string operation. While it's not as fast as some of the more basic operations, it is enormously more capable too. It's also more difficult to use, especially if you don't already know the basic syntax of REs, but that's not a reason to avoid them. However, replacing a regular expression with a collection of basic string operations can just lead to the program getting enormously longer: sometimes, you simply need complex manipulations.
Tcl does a number of things to make RE operations more efficient. Notably, it detects particularly simple REs and converts them into glob-like matches (as in string match) which are faster but much less powerful, and it does a number of things to cache the compiled form of REs so that matching has less overhead. It also uses an automata-theoretic matching engine that has fewer surprises during match time (at a cost of more time to compile the RE in the first place).
In short, don't avoid them. Use them where appropriate. (And time if you're in doubt about speed.)
regexp (regular expressions) can match many different strings with a single pattern, or validate a specific input, and the patterns can become very complex.
string match only allows wildcards such as * and ? and basic character grouping with [] as in regexp.
You can read about it here: http://www.tcl.tk/man/tcl8.5/TclCmd/string.htm#M40
A basic guide to what regexp can do, with some examples, is here: http://www.regular-expressions.info/
So in short: if you don't need regexp, or don't know much about it, I recommend you not use it. If you just want to compare two strings for equality, use string equal.

Is it possible for a computer to "learn" a regular expression by user-provided examples?

Is it possible for a computer to "learn" a regular expression by user-provided examples?
To clarify:
I do not want to learn regular expressions.
I want to create a program which "learns" a regular expression from examples which are interactively provided by a user, perhaps by selecting parts from a text or selecting begin or end markers.
Is it possible? Are there algorithms, keywords, etc. which I can Google for?
EDIT: Thank you for the answers, but I'm not interested in tools which provide this feature. I'm looking for theoretical information, like papers, tutorials, source code, names of algorithms, so I can create something for myself.
Yes, it is possible: we can generate regexes from examples (text -> desired extractions).
This is a working online tool which does the job: http://regex.inginf.units.it/
Regex Generator++ online tool generates a regex from provided examples using a GP search algorithm.
The GP algorithm is driven by a multiobjective fitness which leads to higher performance and simpler solution structure (Occam's Razor).
This tool is a demonstrative application by the Machine Learning Lab, Trieste University (Università degli studi di Trieste).
Please look at the video tutorial here.
This is a research project so you can read about used algorithms here.
Behold! :-)
Finding a meaningful regex/solution from examples is possible if and only if the provided examples describe the problem well.
Consider these examples that describe an extraction task, we are looking for particular item codes; the examples are text/extraction pairs:
"The product code is 467-345A" -> "467-345A"
"The item 789-345B is broken" -> "789-345B"
A human, looking at the examples, may say: "the item codes are things like \d++-345[AB]".
If the real item code format is more permissive but we have not provided other examples, we do not have enough evidence to understand the problem well.
When applying the human generated solution \d++-345[AB] to the following text, it fails:
"On the back of the item there is a code: 966-347Z"
You have to provide other examples, in order to better describe what is a match and what is not a desired match, e.g.:
"My phone is +39-128-3905 , and the phone product id is 966-347Z" -> "966-347Z"
The phone number is not a product id; this may be an important piece of evidence.
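The examples above can be replayed in a few lines of Python. Two things here are assumptions made for the illustration: the possessive \d++ is written as plain \d+ (possessive quantifiers need Python 3.11 or later), and the BROADER pattern is a hypothetical guess at a more permissive format, not something the tool produced.

```python
import re

# The pattern a human might guess from the first two examples.
HAND_WRITTEN = re.compile(r"\d+-345[AB]")
# A hypothetical, more permissive guess after seeing the extra example.
BROADER = re.compile(r"\d{3}-\d{3}[A-Z]")

def extract(pattern, text):
    """Return the first match of pattern in text, or None."""
    m = pattern.search(text)
    return m.group(0) if m else None

print(extract(HAND_WRITTEN, "The product code is 467-345A"))
print(extract(HAND_WRITTEN, "On the back of the item there is a code: 966-347Z"))
phone_text = "My phone is +39-128-3905 , and the phone product id is 966-347Z"
print(extract(BROADER, phone_text))
```

The broader pattern skips the phone number only because of the trailing letter; with different negative examples another guess might be needed.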
The book An Introduction to Computational Learning Theory contains an algorithm for learning a finite automaton. As every regular language is equivalent to a finite automaton, it is possible to learn some regular expressions by a program. Kearns and Valiant show some cases where it is not possible to learn a finite automaton. A related problem is learning hidden Markov Models, which are probabilistic automata that can describe a character sequence. Note that most modern "regular expressions" used in programming languages are actually stronger than regular languages, and therefore sometimes harder to learn.
No computer program will ever be able to generate a meaningful regular expression based solely on a list of valid matches. Let me show you why.
Suppose you provide the examples 111111 and 999999, should the computer generate:
A regex matching exactly those two examples: (111111|999999)
A regex matching 6 identical digits (\d)\1{5}
A regex matching 6 ones and nines [19]{6}
A regex matching any 6 digits \d{6}
Any of the above four, with word boundaries, e.g. \b\d{6}\b
Any of the first four, not preceded or followed by a digit, e.g. (?<!\d)\d{6}(?!\d)
As you can see, there are many ways in which examples can be generalized into a regular expression. The only way for the computer to build a predictable regular expression is to require you to list all possible matches. Then it could generate a search pattern that matches exactly those matches.
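To make the ambiguity concrete, here is a short Python sketch that checks the first four candidate patterns against the two examples and against a few probe strings. The probe strings and the dictionary labels are made up for the illustration.

```python
import re

CANDIDATES = {
    "exact": r"(111111|999999)",
    "six identical digits": r"(\d)\1{5}",
    "six ones and nines": r"[19]{6}",
    "any six digits": r"\d{6}",
}

def accepted_by(text):
    """Names of all candidate patterns that match the whole text."""
    return [name for name, pat in CANDIDATES.items() if re.fullmatch(pat, text)]

# Every candidate agrees on the two given examples...
for example in ["111111", "999999"]:
    print(example, accepted_by(example))
# ...but they disagree on everything else.
for probe in ["222222", "191919", "123456"]:
    print(probe, accepted_by(probe))
```

The two examples alone cannot tell these patterns apart; only additional examples or guidance from the user can.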
If you don't want to list all possible matches, you need a higher-level description. That's exactly what regular expressions are designed to provide. Instead of providing a long list of 6-digit numbers, you simply tell the program to match "any six digits". In regular expression syntax, this becomes \d{6}.
Any method of providing a higher-level description that is as flexible as regular expressions will also be as complex as regular expressions. All tools like RegexBuddy can do is to make it easier to create and test the high-level description. Instead of using the terse regular expression syntax directly, RegexBuddy enables you to use plain English building blocks. But it can't create the high-level description for you, since it can't magically know when it should generalize your examples and when it should not.
It is certainly possible to create a tool that uses sample text along with guidelines provided by the user to generate a regular expression. The hard part in designing such a tool is how does it ask the user for the guiding information that it needs, without making the tool harder to learn than regular expressions themselves, and without restricting the tool to common regex jobs or to simple regular expressions.
Yes, it's certainly "possible"; Here's the pseudo-code:
string MakeRegexFromExamples(<listOfPosExamples>, <listOfNegExamples>)
{
    if HasIntersection(<listOfPosExamples>, <listOfNegExamples>)
        return <IntersectionError>
    string regex = "";
    foreach(string example in <listOfPosExamples>)
    {
        if(regex != "")
        {
            regex += "|";
        }
        regex += DoRegexEscaping(example);
    }
    regex = "^(" + regex + ")$";
    // Ignore <listOfNegExamples>; they're excluded by definition
    return regex;
}
The problem is that there are an infinite number of regexes that will match a list of examples. This code provides the simplest/stupidest regex in the set, basically matching anything in the list of positive examples (and nothing else, including any of the negative examples).
I suppose the real challenge would be to find the shortest regex that matches all of the examples, but even then, the user would have to provide very good inputs to make sure the resulting expression was "the right one".
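For reference, the pseudo-code translates almost line for line into runnable Python. This is a sketch of the same trivial approach, not a smarter algorithm; the function name and the choice to raise ValueError on an intersection are assumptions made here.

```python
import re

def make_regex_from_examples(positives, negatives=()):
    """Return a regex matching exactly the positive examples and nothing else."""
    if set(positives) & set(negatives):
        raise ValueError("an example is listed as both positive and negative")
    # Escape each example so that characters like "." are taken literally.
    alternatives = "|".join(re.escape(example) for example in positives)
    # Negatives need no handling: anything not listed is rejected anyway.
    return "^(" + alternatives + ")$"

regex = make_regex_from_examples(["a.b", "cat"], ["axb"])
print(regex)
print(bool(re.search(regex, "a.b")), bool(re.search(regex, "axb")))
```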
I believe the term is "induction". You want to induce a regular grammar.
I don't think it is possible with a finite set of examples (positive or negative). But, if I recall correctly, it can be done if there is an Oracle which can be consulted. (Basically you'd have to let the program ask the user yes/no questions until it was content.)
You might want to play with this site a bit, it's quite cool and sounds like it does something similar to what you're talking about: http://txt2re.com
There's a language dedicated to problems like this, based on prolog. It's called progol.
As others have mentioned, the basic idea is inductive learning, often called ILP (inductive logic programming) in AI circles.
Second link is the wiki article on ILP, which contains a lot of useful source material if you're interested in learning more about the topic.
@Yuval is correct. You're looking at computational learning theory, or "inductive inference."
The question is more complicated than you think, as the definition of "learn" is non-trivial. One common definition is that the learner can spit out answers whenever it wants, but eventually, it must either stop spitting out answers, or always spit out the same answer. This assumes an infinite number of inputs, and gives absolutely no guarantee on when the program will reach its decision. Also, you can't tell when it HAS reached its decision because it might still output something different later.
By this definition, I'm pretty sure that regular languages are learnable. By other definitions, not so much...
I've done some research on Google and CiteSeer and found these techniques/papers:
Language identification in the limit
Probably approximately correct learning
Also Dana Angluin's "Learning regular sets from queries and counterexamples" seems promising, but I wasn't able to find a PS or PDF version, only cites and seminar papers.
It seems that this is a tricky problem even on the theoretical level.
If it's possible for a person to learn a regular expression, then it is fundamentally possible for a program. However, that program will need to be correctly programmed to be able to learn. Luckily this is a fairly finite space of logic, so it wouldn't be as complex as teaching a program to be able to see objects or something like that.

I don’t get regular expressions

I don’t understand or see the need for regular expressions.
Can someone explain them in simple terms and provide some basic examples where they could be useful, or even critical?
Use them where you need to use/manipulate patterns. For instance, suppose you need to recognise the following pattern:
Any letter, A-Z, either upper or lower case, 5 or 6 times
3 digits
a single letter a-z (definitely lower case)
(Things like this crop up for zip code, credit card, social security number validation etc.)
That's not really hard to write in code - but it becomes harder as the pattern becomes more complicated. With a regular expression, you describe the pattern (rather than the code to validate it) and let the regex engine do the work for you.
The pattern here would be something like
[A-Za-z]{5,6}[0-9]{3}[a-z]
(There are other ways of expressing it too.) Grouping constructs make it easy to match a whole pattern and grab (or replace) different bits of it, too.
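As a quick check, the pattern can be tried in Python (the answer itself is language-neutral, so the use of Python's re module and the helper name is_valid are assumptions for this sketch).

```python
import re

# 5 or 6 letters of either case, then 3 digits, then one lower-case letter.
CODE = re.compile(r"[A-Za-z]{5,6}[0-9]{3}[a-z]")

def is_valid(s):
    """True if the whole string has the required shape."""
    return CODE.fullmatch(s) is not None

for candidate in ["Hello123x", "Helloo123x", "Hell123x", "Hello123X", "Hello12x"]:
    print(candidate, is_valid(candidate))
```

fullmatch is used so that extra characters before or after the pattern cause a rejection.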
A few downsides though:
Regexes can become complicated and hard to read quite quickly. Document thoroughly!
There are variations in behaviour between different regex engines
The complexity can be hard to judge if you're not an expert (which I'm certainly not!); there are "gotchas" which can make the patterns really slow against particular input, and these gotchas aren't obvious at all
Some people overuse regular expressions massively (and some underuse them, of course). The worst example I've seen was where someone asked (on a C# group) how to check whether a string was length 3 - this is clearly a job for using String.Length, but someone seriously suggested matching a regex. Madness. (They also got the regex wrong, which kinda proves the point.)
Regexes use backslashes to escape various things (e.g. use \. to mean "a dot" rather than just "any character"). In many languages the backslash itself needs escaping.
What regular expressions are used for:
Regular expressions is a language in itself that allows you to perform complex validation of string inputs. I.e. you pass it a string and it will return true or false if it is a match or not.
How regular expressions are used:
Form validation, determine if what the user entered is of the format you want
Finding the position of a certain pattern in a block of text
Search and replace where the search term is a regex and what to replace is a normal string.
Some regular expression language features:
Alternation: allows you to select one thing or another. Example match only yes or no.
yes|no
Grouping: You can define scope and have precedence using parentheses. For example match 3 color shades.
gr(a|e)y|black|white
Quantification: You can quantify how much of something you want. ? means 1 or 0, * means 0 or more. + means at least one. Example: Accept a binary string that is not empty:
(0|1)+
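The three features can be tried out with a few lines of Python (chosen here only for illustration; the patterns themselves work the same way in most engines). The helper name whole is made up.

```python
import re

def whole(pattern, s):
    """True if pattern matches the entire string s."""
    return re.fullmatch(pattern, s) is not None

# Alternation
print(whole(r"yes|no", "yes"), whole(r"yes|no", "maybe"))
# Grouping
print(whole(r"gr(a|e)y|black|white", "grey"), whole(r"gr(a|e)y|black|white", "gry"))
# Quantification: a non-empty binary string
print(whole(r"(0|1)+", "0110"), whole(r"(0|1)+", ""))
```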
Why regular expressions?
Regular expressions make it easy to match strings; a single small regular expression can often replace several dozen lines of source code.
Not for all types of matching:
To understand how something is useful, you should also understand how it is not useful. Regular expressions are bad for certain tasks for example when you need to guarantee that a string has an equal number of parentheses.
Available in just about all languages:
Regular expressions are available in just about any programming language.
Formal language:
Any regular expression can be converted to a deterministic finite state machine. And in this same way you can figure out how to make source code that will validate your regular expression.
Example:
[hc]+at
matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at"
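To illustrate the finite-state-machine point, here is a hand-written automaton for [hc]+at in Python. It is a sketch that checks the whole string; the state numbering and function name are arbitrary choices made for this example.

```python
import re

def matches_hc_at(s):
    """Hand-coded DFA accepting exactly the strings matched by [hc]+at."""
    # State 0: start, 1: seen one or more of h/c, 2: seen "a", 3: seen "t" (accept).
    transitions = {
        (0, "h"): 1, (0, "c"): 1,
        (1, "h"): 1, (1, "c"): 1, (1, "a"): 2,
        (2, "t"): 3,
    }
    state = 0
    for ch in s:
        state = transitions.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state == 3

for word in ["hat", "cat", "hhat", "chat", "hcat", "ccchat", "at"]:
    # Compare the hand-written automaton with the regex engine.
    print(word, matches_hc_at(word), re.fullmatch(r"[hc]+at", word) is not None)
```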
Source, further reading
They look a bit cryptic but they provide a very powerful tool for finding patterns in text. Anything from href tags in HTML pages to validating email addresses.
And they can be processed into a very efficient data structure (FSA) that finds matches very fast.
They are a bit tricky, but extremely powerful and worth learning. The web is full of tutorial and examples, start for example from here and look at the examples here.
If I could direct the OP to some of the answers/comments on one of my own questions: How important is knowing Regexs?
Regular expressions are a very concise way to specify most pattern-matching and -replacement problems, and regexp engines can be very highly optimized.
If you wanted to do the same job as even a relatively simple regexp, you'd have to write a lot of code, which probably would contain a number of bugs, be hard to understand and perform badly.
Whereas doing the same with a regexp is much shorter, almost certainly performs as well as is technically possible, and is easier to understand for anyone familiar with regexes (though it should be commented in either case).
The email example is actually a bad example for regular expressions. Regexes can be used, but the resulting expression (for example this one which doesn't handle "John Doe " style addresses) is hugely complicated - take a look at the email address specification and you'll see why...
However regexes are very useful in a host of other situations, extracting ip addresses from text, tags from html etc. Finding all versioned files would be another example. Something along the lines of:
my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt
will match any filenames of the format my_versioned_file_2009-02-26.txt and pull out the date as a captured group (the part wrapped in "()") for you to further analyse.
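A short Python sketch of using the captured group (Python is an assumption here; the helper name extract_date is made up). The dot before txt is escaped in this version so that it only matches a literal dot.

```python
import re

VERSIONED = re.compile(r"my_versioned_file_(\d{4}-\d{2}-\d{2})\.txt")

def extract_date(filename):
    """Return the date part of a versioned file name, or None if it does not fit."""
    m = VERSIONED.fullmatch(filename)
    return m.group(1) if m else None

print(extract_date("my_versioned_file_2009-02-26.txt"))
print(extract_date("some_other_file.txt"))
```

Note that the pattern only checks the shape of the date; a name containing 2009-99-99 would also be accepted.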
No, regexes are not necessary, but they can save a world of time compared with writing a hand-rolled parser for something a regex can easily achieve.
Whenever you've got some pattern to find in a lot of textual data or if you want to check that a string is in a certain format.
For example an email address...
The code for checking for an at symbol and the presence of a valid domain will look quite big where you could just use a regular expression and have an answer in 2 lines of code.
Regex r = new Regex("<An Email Address Regex>");
bool isValidEmail = r.IsMatch(MyInput);
Other examples would be for checking numbers are in the correct format before parsing them into integers etc.
Jon and Sqook gave a fine explanation and definition of Regular Expressions, and for simple problems it is pretty understandable, but if you use it for complex problems regular expressions can be a &$#( (at least for me ;-))
I use Expresso a lot to help me build complex regular expression code.
http://www.ultrapico.com/Expresso.htm
It has a built-in library with expressions you can use, a design mode where you can build your code and a test mode where you can test and validate the code. It helped me build and understand complex expressions better!
Good luck!
Some practical real world usages:
Finding abstract classes that extend JUnit's TestCase:
abstract\s+class\s+\w+\s+extends\s+TestCase
This is useful for finding test cases that cannot be instantiated and will need excluding from an ant build script that runs test cases. You cannot search for regular text because you don't know the class names in advance. hence the \w+ (At least one word character).
Finding running bash or bourne shell scripts:
ps -e | grep -E " sh| bash"
this is useful if you want to kill them all or something, if you did a search for just sh you'd not get the bash ones and have to run the command again for bash scripts. Again, more serviceable than perfect, but nearly no regex you write on the fly will be.
It's not perfect, but most regexes won't be, or they'll take so long to write they're not worth it. The ones you perfect are the ones you commit as part of some sort of validation or built application.
Example of critical use is JavaScript:
If you need to do search or replace on a string, the only matching you can do is a regular expression. It's in the JavaScript API on those string methods...
Personally, I mostly use regular expressions only when I need some advanced matching in some automated find/replace in a text editor (TextPad or Visual Studio). The most powerful feature in my view is the ability to match a pattern that can be inserted in the replace.
To give you some examples:
Email Address
Password requires at least 1 letter and 1 digit
How can you achieve these requirements?
The best way is to use regular expression.
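One possible way to express the password rule is with two lookaheads. The sketch below is in Python rather than ASP.NET, and the pattern is just one option, assuming that "letter" means ASCII A-Z or a-z.

```python
import re

# (?=.*[A-Za-z]) requires a letter somewhere, (?=.*\d) requires a digit somewhere.
PASSWORD = re.compile(r"(?=.*[A-Za-z])(?=.*\d).+")

def is_acceptable(password):
    """True if the password contains at least one letter and one digit."""
    return PASSWORD.fullmatch(password) is not None

for candidate in ["abc123", "abcdef", "123456", ""]:
    print(repr(candidate), is_acceptable(candidate))
```

A real policy would probably add length limits; they are left out because the requirement above does not mention them.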
Read the following links to learn more:
How To: Use Regular Expressions to Constrain Input in ASP.NET
http://msdn.microsoft.com/en-us/library/ms998267.aspx

Constructing regex

I use RegexBuddy, which takes in a regex and explains its meaning, so one can work out what it does. Along similar lines, is it possible to have some engine which takes natural language input describing the pattern one needs to match/replace and gives out the correct (or almost correct) regex for that description?
e.g. Match the whole word 'dio' in some file
So regex for that could be : \<dio\>
or
\bdio\b
-AD.
P.S. = I think few guys here might think this as a 'subjective' 'not-related-to-programming' question, but i just need to ask this question nonetheless. For myself. - Thanks.
This would be complicated to program, because you need a natural language parser able to derive meaning. Unless you limit it to a strict subset -- in which case, you're reinventing an expression language, and you'll eventually wind up back at regular expressions -- only with bigger symbols. so what's the gain?
Regexes were developed for a reason -- they're the simplest, most accurate representation possible.
There is a Symbolix Regular Expression Builder package for Emacs, but looking at it, I think that regular expressions are easier to work with.
Short answer: no, not until artificial intelligence improves A LOT.
If you wrote something like this, you'd have a very limited syntax. For someone to know "Match the whole word 'dio' in some file", they would basically need to have significant knowledge of regular expressions. At that point, just use regular expressions.
For non-technical users, this will never work unless you limit it to basic "find this phrase" or, maybe, "find lines starting/ending with ??". They're never going to come up with something like this:
Find lines containing a less-than symbol followed by the string 'img' followed by one or more groupings of: some whitespace followed by one or more letters followed by either a double-quoted string or a single-quoted string, and those groupings are followed by any length of whitespace then a slash and a greater-than sign.
That's my attempt at a plain-language version of this relatively simple regex:
/<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*\/>/i
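For comparison with the plain-language description, here is the same regex exercised in Python. Python has no /.../i literal syntax, so the delimiters are dropped and the i flag becomes re.IGNORECASE; the sample tags are made up.

```python
import re

# Same pattern as above, without the surrounding slashes.
IMG_TAG = re.compile(r'''<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*/>''', re.IGNORECASE)

samples = [
    '<img src="a.png" alt=\'logo\' />',  # two quoted attributes
    '<IMG SRC="a.png"/>',                # upper case, no space before />
    '<img />',                           # no attributes at all
    '<img src=a.png />',                 # unquoted attribute value
]
for tag in samples:
    print(tag, IMG_TAG.search(tag) is not None)
```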
Yeah, I agree with you that it is subjective. But I will answer your question because I think that you have asked a wrong question.
The answer is "YES". Almost anything can be coded and this would be a rather simple application to code. Will it work perfectly? No, it wouldn't because natural language is quite complex to parse and interpret. But it is possible to write such an engine with some constraints.
Generating a regex via the use of a natural language processor is quite possible. Prolog is supposed to be a good language choice for this kind of problem. In practice, however, what you'd be doing, in effect, is designing your own input language which provides a regex as output. If your goal is to produce regexes for a specific task, this might in fact be useful. Perhaps the task you are doing tends to require certain formulations that are doable but not built into regular expressions. Though whether this will be more effective than just creating the regexes one at a time depends on your project. Usually this is probably not the case, since your own language is not going to be as well-known or as well-documented as regex. If your goal is to produce a replacement for regex whose output will be parsed as a regex, I think you're asking a lot. Not to say people haven't done the same sort of thing before (e.g. the C++ language as an 'improvement' that ran, originally, on C).
Try the open source Mac application Ruby Regexp Machine, at http://www.rubyregexp.sf.net. It is written in Ruby, so you can use some of the code even if you are not on a Mac. You can describe a lot of simple regular expressions in an easy English grammar. As a disclosure, I made this tool.

When is an issue too complex for a regular expression?

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
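For this particular problem the "stack" can shrink to a single counter, since there is only one kind of bracket. A minimal Python sketch (the function name is made up):

```python
def parens_balanced(s):
    """True if every "(" in s is closed by a later ")" and vice versa."""
    depth = 0  # number of currently open parentheses
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ")" with no matching "(" before it
                return False
    return depth == 0

for sample in ["(())()", "(()", ")("]:
    print(sample, parens_balanced(sample))
```

The counter can grow without bound, which is exactly what a finite automaton cannot do. With several bracket types, an explicit stack would be needed instead.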
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegexBuddy) matches 99% of all email addresses out there:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegexBuddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
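To see both the usefulness and the ".museum" limitation, here is a small Python sketch of the short pattern, written with a literal @ between the local part and the domain. The helper name and the sample addresses are made up for illustration; case-insensitive matching is assumed because the pattern only lists upper-case letters.

```python
import re

# The short pattern, compiled case-insensitively since it only lists A-Z.
SIMPLE_EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b",
    re.IGNORECASE,
)


def looks_like_email(text: str) -> bool:
    """True if the whole string matches the short pattern."""
    return SIMPLE_EMAIL.fullmatch(text) is not None


print(looks_like_email("john@example.com"))        # True
print(looks_like_email("curator@example.museum"))  # False: TLD longer than 4 letters
```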
To truly match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses than the monstrous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmer's toolbox, but they aren't a solution to all your parsing problems. If you find your regex solution starting to become extremely complex, you need to either break it up logically into smaller regular expressions that match portions of your text, or start looking at other methods to solve your problem. Similarly, there are problems that regular expressions, due to their nature, simply can't solve (as one poster said, those not defined by a regular language).
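One way to read "break it up into smaller regular expressions" is sketched below in Python: split the address at the last @ with ordinary string code, then check each piece with a tiny pattern. The names and rules are assumptions made for this illustration; it is deliberately loose and does not implement RFC 2822.

```python
import re

# Two small patterns instead of one large one (illustrative rules only).
LOCAL_PART = re.compile(r"[A-Za-z0-9._%+-]+")
DOMAIN_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def plausible_email(address: str) -> bool:
    """Rough plausibility check built from small, readable pieces."""
    local, sep, domain = address.rpartition("@")
    if not sep or not LOCAL_PART.fullmatch(local):
        return False
    labels = domain.split(".")
    if len(labels) < 2:  # require at least "name.tld"
        return False
    # Each dot-separated label must start and end with a letter or digit.
    return all(DOMAIN_LABEL.fullmatch(label) for label in labels)


print(plausible_email("curator@example.museum"))  # True
print(plausible_email("john@-bad-.com"))          # False
```

Each piece can be read, tested and changed on its own, which is the maintainability argument made above.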
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
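A rough Python sketch of that division of labour, with token names invented for this example: the regex only recognises individual tokens (string literals, braces, everything else), and a few lines of ordinary code decide how they fit together. It assumes a C-like syntax with double-quoted strings and ignores comments and character literals.

```python
import re

# Tokenizer: the regex only identifies individual bits of text.
TOKEN = re.compile(r'''
    (?P<string>"(?:\\.|[^"\\])*")   # double-quoted string literal
  | (?P<open>\{)                    # opening brace
  | (?P<close>\})                   # closing brace
  | (?P<other>[^{}"]+)              # any other run of text
''', re.VERBOSE)


def braces_balanced(source: str) -> bool:
    """Parser part: track nesting depth across the token stream."""
    depth = 0
    for match in TOKEN.finditer(source):
        kind = match.lastgroup
        if kind == "open":
            depth += 1
        elif kind == "close":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# The brace inside the string literal is a 'string' token, so it is not counted.
print(braces_balanced('f() { if (x) { s = "}"; } }'))  # True
```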
Essentially, you're going too far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
A sure sign to stop using regexps is this: if you have many grouping parentheses '()' and many alternatives '|', you are trying to do (complex) parsing with regular expressions.
Add Perl extensions, backreferences, etc. to the mix and soon you have yourself a parser that is hard to read, hard to modify, and hard to reason about (e.g. is there an input on which this parser will run in exponential time?).
This is the time to stop regexing and start parsing (with a hand-made parser, parser generators or parser combinators).
Besides expressions that grow tremendously large, there are fundamental limitations on the words that can be handled by a regexp.
For instance, you cannot write a regexp for words made of n characters a followed by n characters b, where n can be any number; more strictly, the language { a^n b^n : n >= 0 }.
In various programming languages regexps are an extension of regular languages, but the matching time can be extremely large and such code is non-portable.
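For the "n characters a, then n characters b" case, a small Python sketch (function name chosen for illustration) shows the usual way out: let a regex check the part that is regular, the shape "some a's followed by some b's", and do the counting in ordinary code.

```python
import re

# The regular part: zero or more a's followed by zero or more b's.
SHAPE = re.compile(r"(a*)(b*)")


def is_an_bn(word: str) -> bool:
    """True if word is n a's followed by exactly n b's (n may be 0)."""
    m = SHAPE.fullmatch(word)
    # The equal-count condition is the non-regular part, so it is plain code.
    return m is not None and len(m.group(1)) == len(m.group(2))


print(is_an_bn("aaabbb"))  # True
print(is_an_bn("aabbb"))   # False
```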
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions): in most cases the maintainability and readability limit is passed long before the technical limit.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid, but I often lament not being able to do database-style queries using regular expressions. Now more than before, because I am entering those types of search strings all the time on search engines. It's very difficult, if not impossible, to search for +complex AND +"regular expression"
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer
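In regex flavours that support lookahead, an AND of two words in either order can be written as one pattern. The sketch below uses Python's re module, not Emacs syntax (as far as I know, Emacs's own regexps have no lookahead, so this does not solve the Emacs case directly). The list of names is only illustrative, and case-insensitive matching is an assumption.

```python
import re

# Each lookahead asserts that its word occurs somewhere in the string,
# so the order of the two words no longer matters.
BOTH = re.compile(r"^(?=.*buffer)(?=.*window)", re.IGNORECASE)

names = [
    "switch-to-buffer-other-window",
    "window-buffer",
    "kill-buffer",
    "delete-window",
]

matches = [name for name in names if BOTH.search(name)]
print(matches)  # ['switch-to-buffer-other-window', 'window-buffer']
```

Without lookahead, the fallback is what the post describes: two alternatives, or simply two separate searches combined in code.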