Check if a given regex will match anything - regex

Is it possible to check if a given regular expression will match any string? Specifically, I'm looking for a function matchesEverything($regex) that returns true iff $regex will match any string.
I suppose that this is equivalent to asking, "given a regex r, does there exist a string that doesn't match r?" and I don't think this is solvable without placing bounds on the set of "all strings". I.e., if I assume the strings will never contain "blahblah", then I can simply check if r matches "blahblah". But what if there are no such bounds? I'm wondering if this problem can be solved checking if the regex r is equivalent to .*.

This doesn't exactly answer your question, but hopefully explains a little why a simple answer is hard to come by:
First, the term 'regex' is a bit murky, so just to clarify, we have:
"Strict" regular expressions, which are equivalent to deterministic finite automatons (DFAs).
Perl-compatible regular expressions (PCREs), which add a bunch of bells and whistles such as lookaheads, backreferences, etc. These are implemented in other languages too, such as Python and Java.
Actual Perl regular expressions, which can get even more crazy, including arbitrary Perl code, via the ?{...} construct.
I think this problem is solvable for strict regular expressions. You just construct the corresponding DFA and search that graph to see if there's any path to a non-accept state. But that doesn't help for 'real world' regex, which is usually PCRE.
I don't think PCRE is Turing-complete (though I don't know - see this question, too: Are Perl regexes turing complete?). If it were, then I think as Jim Garrison commented, this is basically the halting problem.
That said, it's not easy to transform them into a DFA, either, making the above method useless...
I don't have an answer for PCREs, but be aware that the aforementioned constructs (backreferences, etc) would make it pretty hard, I imagine. Though I hesitate to say "impossible."
A genuine Perl regex with ?{...} in it is definitely Turing-complete, so there be dragons, and I think you're out of luck.

Related

How to learn regular expressions

I.e., I get a list of words and I want to construct a simple regular expression from that which matches at least all of the words (but maybe more).
I want to have an algorithm for that. I.e. input of that algorithm is a list of words and output is a regular expression. Obviously, there will be some restrictions. Like either the regular expression will always match more words if it should match an infinite amounts of words and I only give it a finite number of words. Or I will need some more compact representation of the input. Or I am also thinking about giving me some regular expression as input and a list of additional words and I want to get a regular expression which matches all of them together (and maybe more). In any case, it should try to construct a regular expression which is as simple as possible.
What techniques are availalbe which can do that?
I was quite misunderstood. I know the general principles behind regular expressions. I know what it is. And in most cases I can come up quite easily with a regular expression to some language by hand. But I am searching for algorithms which does that.
Again formulated a bit different:
Let L be a regular language. Let M_n be a finite subset of L with n elements. Let M_n be a subset of M_(n+1).
I want to have an algorithm LRE which gets a finite set of words and outputs a regular expression. And I want to have the property:
lim_n->infinity | diff( LRE(M_n), L ) | = 0
See this website to learn the general principles: http://www.regular-expressions.info/
If all you have is a list of words such as dog, cat, cow, mouse, the simplest regex to match any of these would be: dog|cat|cow|mouse, but note that it will also match doggone, scatological, etc... It may or may not match DOGGONE, COWPATTY, etc... depending on whether or not your are doing case-sensitive matching. Better patterns can be given if more particulars about your problem are given.
It's also a good idea to get a regex testing tool. I like Expresso, it is good for .NET patterns. Since regex capabilties may vary between platforms, make sure your tool supports your platform.
This problem has been looked at the last decade. You might want to google DFA learning, and download a couple of papers to get a sense of the state of the art.
Once you have the DFA generating a regular expression is trivial. To avoid the problems #FrustratedWithDesign mentions some conditions such as generating the DFA with the least amount of nodes is introduced, from a machine learning point of view this is similar to having a regularization condition for the simplest hypothesis.
Use this site to learn the basics and use rubular for live testing.
If you have a list of distinct words that you want to match -- it doesn't sound like you're matching on something that a regular expression is best at.
As FrustratedWithFormsDesigner pointed out -- your regex is going to be mapped to the items in the list in the worst case; best case you can find common prefixes. And if you automate the regex construction, why bother with the regex? What is the use-case?
But if your list is beyond a trivial size, you'd probably be better off looping through it.
http://www.regular-expressions.info is a fantastic site for Regex Reference.
When building a complex regex, I typically use Expresso. It's a free app that helps you build Regular expressions. It breaks them down into a tree view so that it is easy to see what all parts are doing. http://www.ultrapico.com/Expresso.htm It is made to work with .NET languages, but there are plenty of tools like this available for different languages.
To build my Regex, I'll usually start with an acceptable value and start replacing characters with Regex syntax.
For example, if I was trying to match a URL I would start with
http://www.mydomain.com
I would then escape anything that needs escaping
http://www\.mydomain\.com
then I would start replacing characters
http://www\.\w+\.\w+\.\w+
obviously this expression needs some more work, but you get the idea
Here is a site for Perl regex:
http://perldoc.perl.org/perlre.html

I don’t get regular expressions

I don’t understand or see the need for regular expressions.
Can some explain them in simple terms and provide some basic examples where they could be useful, or even critical.
Use them where you need to use/manipulate patterns. For instance, suppose you need to recognise the following pattern:
Any letter, A-Z, either upper or lower case, 5 or 6 times
3 digits
a single letter a-z (definitely lower case)
(Things like this crop up for zip code, credit card, social security number validation etc.)
That's not really hard to write in code - but it becomes harder as the pattern becomes more complicated. With a regular expression, you describe the pattern (rather than the code to validate it) and let the regex engine do the work for you.
The pattern here would be something like
[A-Za-z]{5,6}[0-9]{3}[a-z]
(There are other ways of expressing it too.) Grouping constructs make it easy to match a whole pattern and grab (or replace) different bits of it, too.
A few downsides though:
Regexes can become complicated and hard to read quite quickly. Document thoroughly!
There are variations in behaviour between different regex engines
The complexity can be hard to judge if you're not an expert (which I'm certainly not!); there are "gotchas" which can make the patterns really slow against particular input, and these gotchas aren't obvious at all
Some people overuse regular expressions massively (and some underuse them, of course). The worst example I've seen was where someone asked (on a C# group) how to check whether a string was length 3 - this is clearly a job for using String.Length, but someone seriously suggested matching a regex. Madness. (They also got the regex wrong, which kinda proves the point.)
Regexes use backslashes to escape various things (e.g. use . to mean "a dot" rather than just "any character". In many languages the backslash itself needs escaping.
What regular expressions are used for:
Regular expressions is a language in itself that allows you to perform complex validation of string inputs. I.e. you pass it a string and it will return true or false if it is a match or not.
How regular expressions are used:
Form validation, determine if what the user entered is of the format you want
Finding the position of a certain pattern in a block of text
Search and replace where the search term is a regex and what to replace is a normal string.
Some regular expression language features:
Alternation: allows you to select one thing or another. Example match only yes or no.
yes|no
Grouping: You can define scope and have precedence using parentheses. For example match 3 color shades.
gr(a|e)y|black|white
Quantification: You can quantify how much of something you want. ? means 1 or 0, * means 0 or more. + means at least one. Example: Accept a binary string that is not empty:
(0|1)+
Why regular expressions?
Regular expressions make it easy to match strings, it can often replace several dozen lines of source code with a simple small regular expression string.
Not for all types of matching:
To understand how something is useful, you should also understand how it is not useful. Regular expressions are bad for certain tasks for example when you need to guarantee that a string has an equal number of parentheses.
Available in just about all languages:
Regular expressions are available in just about any programming language.
Formal language:
Any regular expression can be converted to a deterministic finite state machine. And in this same way you can figure out how to make source code that will validate your regular expression.
Example:
[hc]+at
matches "hat", "cat", "hhat", "chat", "hcat", "ccchat", and so on, but not "at"
Source, further reading
They look a bit cryptic but they provide a very powerful tool for finding patterns in text. Anything from href tags in HTML pages to validating email addresses.
And they can be processed into a very efficient data structure (FSA) that finds matches very fast.
They are a bit tricky, but extremely powerful and worth learning. The web is full of tutorial and examples, start for example from here and look at the examples here.
If I could direct the OP to some of the answers/comments on one of my own questions: How important is knowing Regexs?
Regular expressions are a very concise way to specify most pattern-matching and -replacement problems, and regexp engines can be very highly optimized.
If you wanted to do the same job as even a relatively simple regexp, you'd have to write a lot of code, which probably would contain a number of bugs, be hard to understand and perform badly.
Whereas doing the same with a regexp is much shorter, almost certainly performs as well as is technically possible, and is easier to understand to anyone familiar with regexpes (though it should be commented in either case)
The email example is actually a bad example for regular expressions. Regexes can be used, but the resulting expression (for example this one which doesn't handle "John Doe " style addresses) is hugely complicated - take a look at the email address specification and you'll see why...
However regexes are very useful in a host of other situations, extracting ip addresses from text, tags from html etc. Finding all versioned files would be another example. Something along the lines of:
my_versioned_file_(\d{4}-\d{2}-\d{2}).txt
will match any filenames of the format my_versioned_file_2009-02-26.txt and pull out the date as a captured group (the part wrapped in "()") for you to further analyse.
No regexes are not necessary, but they can save a world of time in writing a hand rolled parser for something a regex can easily achieve.
Whenever you've got some pattern to find in a lot of textual data or if you want to check that a string is in a certain format.
For example an email address...
The code for checking for an at symbol and the presence of a valid domain will look quite big where you could just use a regular expression and have an answer in 2 lines of code.
Regex r = new Regex("<An Email Address Regex>");
bool isValidEmail = r.IsMatch(MyInput);
Other examples would be for checking numbers are in the correct format before parsing them into integers etc.
Jon and Sqook gave a fine explanation and definition of Regular Expressions, and for simple problems it is pretty understandable, but if you use it for complex problems regular expressions can be a &$#( (at least for me ;-))
I use Expresso a lot to help me build complex regular expression code.
http://www.ultrapico.com/Expresso.htm
It has a build in library with expressions you can use, a design mode where you can build your code and a test mode where you can test and validate the code. It helped me build and understand complex expressions better!
Goodluck!
Some practical real world usages:
Finding abstract classes that extend JUnit's TestCase:
abstract\s+class\s+\w+\s+extends\s+TestCase
This is useful for finding test cases that cannot be instantiated and will need excluding from an ant build script that runs test cases. You cannot search for regular text because you don't know the class names in advance. hence the \w+ (At least one word character).
Finding running bash or bourne shell scripts:
ps -e | grep -e " sh| bash"
this is useful if you want to kill them all or something, if you did a search for just sh you'd not get the bash ones and have to run the command again for bash scripts. Again, more serviceable than perfect, but nearly no regex you write on the fly will be.
It's not perfect, but most regexes won't be, or they'll take so long to write they're not worth it. The ones you perfect are the ones you commit as part of some sort of validation or built application.
Example of critical use is JavaScript:
If you need to do search or replace on a string, the only matching you can do is a regular expression. It's in the JavaScript API on those string methods...
Personally, I mostly use regular expressions only when I need some advanced matching in some automated find/replace in a text editor (TextPad or Visual Studio). The most powerful feature in my view is the ability to match a pattern that can be inserted in the replace.
To give you some examples:
Email Address
Password requires at least 1 alphabet and 1 digit
How can you acheive these requirements?
The best way is to use regular expression.
Read the following links to learn more:
How To: Use Regular Expressions to Constrain Input in ASP.NET
http://msdn.microsoft.com/en-us/library/ms998267.aspx

Constructing regex

I use regex buddy which takes in a regex and then gives out the meaning of it from which one gets what it could be doing? On similar lines is it possible to have some engine which takes natural language input describing about the pattern one needs to match/replace and gives out the correct(almost correct) regex for that description?
e.g. Match the whole word 'dio' in some file
So regex for that could be : <dio>
or
\bdio\b
-AD.
P.S. = I think few guys here might think this as a 'subjective' 'not-related-to-programming' question, but i just need to ask this question nonetheless. For myself. - Thanks.
This would be complicated to program, because you need a natural language parser able to derive meaning. Unless you limit it to a strict subset -- in which case, you're reinventing an expression language, and you'll eventually wind up back at regular expressions -- only with bigger symbols. so what's the gain?
Regexes were developed for a reason -- they're the simplest, most accurate representation possible.
There is a Symbolix Regular Expression Builder package for Emacs, but looking at it, I think that regular expressions are easier to work with.
Short answer: no, not until artificial intelligence improves A LOT.
If you wrote something like this, you'd have a very limited syntax. For someone to know "Match the whole word 'dio' in some file", they would basically need to have significant knowledge of regular expressions. At that point, just use regular expressions.
For non-technical users, this will never work unless you limit it to basic "find this phrase" or, maybe, "find lines starting/ending with ??". They're never going to come up with something like this:
Find lines containing a less-than symbol followed by the string 'img' followed by one or more groupings of: some whitespace followed by one or more letters followed by either a double-quoted string or a single-quoted string, and those groupings are followed by any length of whitespace then a slash and a greater-than sign.
That's my attempt at a plain-language version of this relatively simple regex:
/<img(\s+[a-z]+=("[^"]*"|'[^']*'))+\s*/>/i
Yeah, I agree with you that it is subjective. But I will answer your question because I think that you have asked a wrong question.
The answer is "YES". Almost anything can be coded and this would be a rather simple application to code. Will it work perfectly? No, it wouldn't because natural language is quite complex to parse and interpret. But it is possible to write such an engine with some constraints.
Generating a regex via the use of a natural language processor is quite possible. Prolog is supposed to be a good language choice for this kind of problem. In practice, however, what you'd be doing, in effect, is designing your own input language which provides a regex as output. If your goal is to produce regexs for a specific task, this might in fact be useful. Perhaps the task you are doing tends to require certain formulations that are doable but not built into regular expressions. Though whether this will be more effective than just creating the regexs one at a time depends on your project. Usually this is probably not the case, since your own language is not going to be as well-known or as well-documented as regex. If your goal is to produce a replacement for regex whose output will be parsed as a regex, I think you're asking a lot. Not to say people haven't done the same sort of thing before (e.g. the C++ language as an 'improvement' that runs, originally, on C++).
try the open source mac application Ruby Regexp Machine, at http://www.rubyregexp.sf.net. It is written in ruby, so you can use some of the code even if you are not on mac. You can describe a lot of simple regular expresions in an easy english grammar. As a disclosure, i did make this tool.

When is a issue too complex for a regular expression?

Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+#[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
To truely match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"
(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x
0c\x0e-\x7f])*")#(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9]
(?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.)
{3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08
\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses vs the monsterous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmers toolbox but they aren't a solution to all your parsing problems. If you find your RegEx solution starting to become extremely complex you need to either attempt to logically break it up into smaller regular expressions to match portions of your text or you need to start looking at other methods to solve your problem. Similarly, there are simply problems that Regular Expressions, due to their nature, can't solve (as one poster said, not adhering to Regular Language).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
Essentially, you're going to far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver.
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
Sure sign to stop using regexps is this: if you have many grouping braces '()' and many alternatives '|' then it is a sure sign that you try to do a (complex) parsing with regular expressions.
Add to the mix Perl extensions, backreferences, etc and soon you have yourself a parser that is hard to read, hard to modify, and hard to reason about it's properties (e.g. is there an input on which this parser will work in a exponential time).
This is a time to stop regexing and start parsing (with hand-made parser, parser generators or parser combinators).
Along with tremendous expressions, there are principal limitations on the words, which can be handled by regexp.
For instance you can not not write regexp for word described by n chars a, then n chars b, where n can be any, more strictly .
In different languages regexp is a extension of Regular language, but time of parsing can be extremely large and this code is non-portable.
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions), the maintainability and readability limit is surpassed a lot earlier than the technical limit in most cases.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid but I often lament not being able to do database type of queries using regular expression. Now especially more then before because I am entering those types of search string all the time on search engines. its very difficult, if not impossible to search for +complex AND +"regular expression"
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer

Regexp that matches valid regexps

Is there a regular expression that matches valid regular expressions?
(I know there are several flavors of regexps. One would do.)
Is there a regular expression that matches valid regular expressions?
By definion, it's quite simple: No.
The language of all regexes is no regular language (just look at nested parentheses) and therefore there can't be a regular expression to parse it.
If you merely want to check whether a regular expression is valid or not, simply try to compile it with whichever programming language or regular expression library you're working with.
Parsing regular expressions is far from trivial. As the author of RegexBuddy, I have been around that block a few times. If you really want to do it, use a regex to tokenize the input, and leave the parsing logic to procedural code. That is, your regex would match one regex token (^, $, \w, (, ), etc.) at a time, and your procedural code would check if they're in the right order.
Unfortunately, most invalid regular expressions are invalid due to parentheses nesting errors. This is exactly the type of strings that regular expressions can't match. (Okay, some fancy regular expression systems have recursion extensions, but that's rare)
As already said, you cannot describe regular expressions with a regular expression due to their recursive nature. You'll need a context free grammar for that.
But what would be the point of having such a regular expression, anyway? If you just want to check whether a regular expression is correct, you can simply try to use it (Pattern.compile(regexp) in Java) and if it screams it is not valid.
You probably need a parser, not a regex. Regexes are powerful tools, but are not parsing tools. They are not well suited to nested grammars, for example.
From Douglas Crockford's The JavaScript Programming Language video 4 (of 4):
/\/(\\[^\x00-\x1f]|\[(\\[^\x00-\x1f]|[^\x00-\x1f\\\/])*\]|[^\x00-\x1f\\\/\[])+\/[gim]*/
http://video.yahoo.com/watch/111596/1710658 at approximately -17.20.
Depending on your goal I would say definately maybe.
If you want to filter regexps out from somewhere, it might prove difficult as regular expressions come in all sizes and shapes and they don't all start and end with slashes.
If you just need to know wether or not a regexp is valid there is another way. Depending on the language you're using you could try/catch
If you can be more specific I could try and give a better answer, the question is intruiging.