Regexp that matches valid regexps - regex

Is there a regular expression that matches valid regular expressions?
(I know there are several flavors of regexps. One would do.)

Is there a regular expression that matches valid regular expressions?
By definion, it's quite simple: No.
The language of all regexes is no regular language (just look at nested parentheses) and therefore there can't be a regular expression to parse it.

If you merely want to check whether a regular expression is valid or not, simply try to compile it with whichever programming language or regular expression library you're working with.
Parsing regular expressions is far from trivial. As the author of RegexBuddy, I have been around that block a few times. If you really want to do it, use a regex to tokenize the input, and leave the parsing logic to procedural code. That is, your regex would match one regex token (^, $, \w, (, ), etc.) at a time, and your procedural code would check if they're in the right order.

Unfortunately, most invalid regular expressions are invalid due to parentheses nesting errors. This is exactly the type of strings that regular expressions can't match. (Okay, some fancy regular expression systems have recursion extensions, but that's rare)

As already said, you cannot describe regular expressions with a regular expression due to their recursive nature. You'll need a context free grammar for that.
But what would be the point of having such a regular expression, anyway? If you just want to check whether a regular expression is correct, you can simply try to use it (Pattern.compile(regexp) in Java) and if it screams it is not valid.

You probably need a parser, not a regex. Regexes are powerful tools, but are not parsing tools. They are not well suited to nested grammars, for example.

From Douglas Crockford's The JavaScript Programming Language video 4 (of 4):
/\/(\\[^\x00-\x1f]|\[(\\[^\x00-\x1f]|[^\x00-\x1f\\\/])*\]|[^\x00-\x1f\\\/\[])+\/[gim]*/
http://video.yahoo.com/watch/111596/1710658 at approximately -17.20.

Depending on your goal I would say definately maybe.
If you want to filter regexps out from somewhere, it might prove difficult as regular expressions come in all sizes and shapes and they don't all start and end with slashes.
If you just need to know wether or not a regexp is valid there is another way. Depending on the language you're using you could try/catch
If you can be more specific I could try and give a better answer, the question is intruiging.

Related

Regular expression that matches regular expressions

Has anyone ever tried to describe regular expression that would match regular expressions?
This topic is almost impossible to find on the web due to the repeating keywords.
It is probably unusable in real-world applications, since the languages that support regular expressions usually have a method of parsing them, which we can use for validation, and a method of delimiting the regular expressions in code, which can be used for searching purposes.
But still I am wondering how would a regular expression that matches all regular expressions look like. It should be possible to write one.
I don't have a formal proof for this, but I strongly suspect that the language of regular expressions is not itself regular, and therefore not subject to regular expressions¹. This would make a proper regex to represent it impossible.
Why? Well, it can be shown that a language that requires balanced parentheses such as Lisp (or, more famously, HTML) is not regular using the pumping lemma:
The proof that the language of balanced (i.e., properly nested) parentheses is not regular follows the same idea. Given p, there is a string of balanced parentheses that begins with more than p left parentheses, so that y will consist entirely of left parentheses. By repeating y, we can produce a string that does not contain the same number of left and right parentheses, and so they cannot be balanced.
Regular expressions permit nested capture groups, which seem to fall into this category:
Take the example from the previous lesson, if we wanted to capture the image file number along with the filename, I can write the expression ^(IMG(\d+))\.png$.
In any case, this may be a better question for the Computer Science Stack Exchange site.
Edit:
¹tomp points out that PCRE-based regular expression engines (and likely others) are actually able to match all context-free grammars and at least some context-sensitive grammars! That represents a massive difference in expressive power. Assuming the article is correct, pretty cool!
(Of course, whether these extended implementations are still "regular expressions" is up for debate. Since we're on a programming site I'll take the position that they are. On a CS site I'd probably take the opposite position!)
So it may be technically possible to represent regular expressions as a regular expression.
Even so, the task of writing a regex representing all regexes is enormously complex. Consider for comparison the task of validating an email address. Many resources boil this down to something akin to [^#]+#[^#]+, or "as long as there is only one at symbol and at least one character before and one character after it, we're good".
But have a look at this apparently complete regex to validate RFC 822. Is it correct? Who knows. I'm certainly not going to check it.
Having seen this, I wouldn't want to try to write a regex to validate regular expressions.
I just coded this in a couple of minutes, so don't expect too much...still, it can match a regex in a string.
^([igsmx]{1,})?\/(?=.*?(\\w|\\d|\[.*?\]|\(.*?\))).*?\/([igsmx]{1,})?$
It can be extended, a looooooot...

Check if a given regex will match anything

Is it possible to check if a given regular expression will match any string? Specifically, I'm looking for a function matchesEverything($regex) that returns true iff $regex will match any string.
I suppose that this is equivalent to asking, "given a regex r, does there exist a string that doesn't match r?" and I don't think this is solvable without placing bounds on the set of "all strings". I.e., if I assume the strings will never contain "blahblah", then I can simply check if r matches "blahblah". But what if there are no such bounds? I'm wondering if this problem can be solved checking if the regex r is equivalent to .*.
This doesn't exactly answer your question, but hopefully explains a little why a simple answer is hard to come by:
First, the term 'regex' is a bit murky, so just to clarify, we have:
"Strict" regular expressions, which are equivalent to deterministic finite automatons (DFAs).
Perl-compatible regular expressions (PCREs), which add a bunch of bells and whistles such as lookaheads, backreferences, etc. These are implemented in other languages too, such as Python and Java.
Actual Perl regular expressions, which can get even more crazy, including arbitrary Perl code, via the ?{...} construct.
I think this problem is solvable for strict regular expressions. You just construct the corresponding DFA and search that graph to see if there's any path to a non-accept state. But that doesn't help for 'real world' regex, which is usually PCRE.
I don't think PCRE is Turing-complete (though I don't know - see this question, too: Are Perl regexes turing complete?). If it were, then I think as Jim Garrison commented, this is basically the halting problem.
That said, it's not easy to transform them into a DFA, either, making the above method useless...
I don't have an answer for PCREs, but be aware that the aforementioned constructs (backreferences, etc) would make it pretty hard, I imagine. Though I hesitate to say "impossible."
A genuine Perl regex with ?{...} in it is definitely Turing-complete, so there be dragons, and I think you're out of luck.

what does regular in regex/"regular expression" mean?

What does the "regular" in the phrase "regular expression" mean?
I have heard that regexes were regular at one time, but no more
The regular in regular expression comes from that it matches a regular language.
The concept of regular expressions used in formal language theory is quite different from what engines like PCRE call regular expressions. PCRE and other similar engines have features like lookahead, conditionals and recursion, which make them able to match non-regular languages.
It comes from regular language. This is part of formal language theory. Check out the Chomsky hierarchy for other formal languages.
It's signifying that it's a regular language.
Regexes are still popular. Some people frown on them but they remain a quick and easy (if you know how to use them) way of matching certain types of strings. The alternative is often a good few lines of code looping through strings and extracting the bits that you need which is much nastier!
I still use them on a regular (pun fully intended) basis, to give you a use case I used one the other day to match lines of guitar chords as oppose to lyrics. They're also commonly used for things like basic validation of email addresses and the like.
They're certainly not dead.
I think it comes from the term for the class of grammars that regular expressions describe: regular grammars (or "regular" languages). Where that term comes from is likely answered by a trip to Wikipedia.
Modern regex engines that implement all those fancy look-ahead, pattern re-match, and subexpression counting features, well, those are recognizing a class of grammar that's a superset of regular grammars. "Classical" regular expressions correspond in mechanical ways to theoretical machines called "finite automata". That's a really fun subject in and of itself.

RegExp to validate a formula (math expression with matched parentheses)?

I have used regExp quit a bit of times but still far from being an expert. This time I want to validate a formula (or math expression) by regExp. The difficult part here is to validate proper starting and ending parentheses with in the formula.
I believe, there would be some sample on the web but I could not find it. Can somebody post a link for such example? or help me by some other means?
Languages with matched nested parentheses are not regular languages and can therefore not be recognized by regular expressions. Some implementations of regular expression (for example in the .NET framework) have extensions to deal with this but that is really no fun to work with. So I suggest to use an available parser or implement a simple parser yourself (for the sake of fun).
For the extension in the .NET implementation see the MSDN on balancing groups.
If your mathematical expression involves matched nested parenthesis, then it's not a regular grammar but a context-free one and as such, can not be parsed using regex.

BNF to regular Expressions

How can I describe the language
A → AA | ( A ) | ε
generates using regular expressions?
Regular expressions accept strings from regular languages. Regular languages can also be accepted by an FSM.
There's an potentially infinite number of brackets in your language that you have to match up. That means you need an infinite state, obviously impossible in any Finite State Machine. Therefore, your language isn't regular and can't be matched with a regex.
Regular expressions cannot match nesting brackets.
I'm not sure about how you could notate that language, but that language isn't regular, as can be shown using the pumping lemma for regular languages (and thus, it can't be noted by a regex). An intuitive explanation is that accepting words from that language would require the FDA to 'remember' the number of opening parenthesis that it just read every time it begins to read closing parenthesis, and that isn't possible for them as they have no 'memory'. A push-down automaton, on the other hand...
Could that language be noted as {(n)n}*, for any n?
You can't - regular expressions can only recognise a small subset of possible languages. In particular, informally, any language that requires an unbounded amount of memory potentially to recognise is not RE recognisable.
Here, you'd need an unbounded amount of memory to remember how many opening parentheses you've seen in order to make sure the number of closing parentheses are the same.
You'll need some mechanism that is capable of parsing Context-Free Grammars to be able to recognise languages described by BNF in general. Modern parsers are very good at this!
As others have said you cannot do this with a single regular expression, however you can tokenize it with two "\(" and "\)". Seeing that your language can only have brackets in it though I'm not sure this is going to be very useful.
Note: You will also need a passer to ensure the brackets are paired up correctly. So “(()()” will tokenize but will not parse.