Is it possible to have a regexp that matches all valid regular expressions?

Is it possible to detect whether a given string is a valid regular expression, using just regular expressions?
Say I have some strings that may or may not be valid regular expressions. I'd like a regular expression that matches exactly those strings that are valid regular expressions. Is that possible? Or do I have to use a higher-level grammar (i.e. a context-free language) to detect this? Does it matter if I am using an extended version of regexps, such as Perl regexps?
If it is possible, what is the regexp that matches regexps?

No, it is not possible. This is because valid regular expressions involve grouping, which requires balanced parentheses.
Balanced delimiters cannot be matched by a regular expression in the formal sense; they must instead be matched with a context-free grammar. (The first example in that article deals with balanced parentheses.)

See an excellent write-up here:
Regular expression for regular expressions?
The answer is that regexes are NOT written using a regular grammar, but a context-free one.

If your question had been "match all valid regular expressions", the answer would be (perhaps surprisingly) 'yes'. The regular expression .* matches all valid (and invalid) regular expressions, but it is pretty useless for determining whether you're looking at a valid one.
However, as the question is "match all and only valid regular expressions", the answer is (as DVK and Platinum Azure have said) 'no'.

Related

Check if regex is valid via regex

Out of curiosity, is it possible to write a regex that checks whether other regexes are valid?
No, it is not possible: the regular expression execution model is not powerful enough for that.
In order for a regular expression string to be valid, all parentheses in the string must be balanced. Since it is not theoretically possible to write a regular expression to verify if all parentheses in a string are balanced, it is also not possible to write a regular expression to check validity of a regular expression string.

Construction of a pattern that doesn't contain a binary string

I was trying to write a pattern that matches strings which don't contain a given binary string (let's assume 101). I know that such expressions cannot be written using Regular Expression considering http://en.wikipedia.org/wiki/Regular_language.
I tried writing the pattern for the above problem using Regular Expression though and it seems to be working.
\b(?!101)\w+\b
What I wanted to ask is: can a regular expression be written for my problem, and why? And if yes, is my regular expression correct?
To match a whole string that doesn't contain 101:
^(?!.*101).*$
Lookaheads are indeed an easy way to check a condition on a string through regex, but your regex will only match alphanumeric words that do not start with 101.
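A quick way to see the difference between the two patterns is to run both against a few inputs. This is a minimal sketch using Python's re module; the helper names word_ok and string_ok and the sample strings are made up for illustration.

```python
import re

# The asker's pattern: the lookahead is checked only at the start of a word,
# so it rejects words that START with 101.
starts_with = re.compile(r"\b(?!101)\w+\b")

# The whole-string pattern: the lookahead scans ahead with .*,
# so it rejects any string that CONTAINS 101.
contains = re.compile(r"^(?!.*101).*$")


def word_ok(s):
    """True if the asker's pattern accepts s as a single word."""
    return starts_with.fullmatch(s) is not None


def string_ok(s):
    """True if the whole-string pattern accepts s."""
    return contains.match(s) is not None


for s in ["1001", "0101", "1010", "1100"]:
    print(s, word_ok(s), string_ok(s))
```

The interesting input is 0101: it contains 101 but does not start with it, so the first pattern accepts it and the second rejects it.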
You wrote
I know that such expressions cannot be written using Regular
Expression considering http://en.wikipedia.org/wiki/Regular_language.
In that Wikipedia article, you seem to have missed this note:
Note that the "regular expression" features provided with many
programming languages are augmented with features that make them
capable of recognizing languages that can not be expressed by the
formal regular expressions (as formally defined below).
The negative lookahead construct is such a feature.
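For this particular problem the lookahead is a convenience rather than a necessity: regular languages are closed under complement, so the set of binary strings that do not contain 101 is itself regular and can be written with only concatenation, alternation and repetition. The sketch below shows one such pattern, worked out by hand (so treat it as an assumption to be checked, not a canonical form), together with a brute-force comparison against a plain substring test. The names PLAIN, avoids_101 and check are illustrative.

```python
import re
from itertools import product

# Leading zeros, then optionally: runs of 1s separated by at least two 0s,
# followed by any number of trailing zeros. A single 0 between two 1s
# (which would spell 101) can never be produced.
PLAIN = re.compile(r"0*(1+(00+1+)*0*)?")


def avoids_101(s):
    """True if the binary string s is accepted by the lookahead-free pattern."""
    return PLAIN.fullmatch(s) is not None


def check(max_len=10):
    """Return the first binary string where the pattern and the substring
    test disagree, or None if they agree on every string up to max_len."""
    for n in range(max_len + 1):
        for bits in product("01", repeat=n):
            s = "".join(bits)
            if avoids_101(s) != ("101" not in s):
                return s
    return None


print(check())
```

The exhaustive check is only evidence for short strings, not a proof, but the run-based argument in the comment covers the general case.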

Regular expression to match nested braces

I need a regular expression that matches braces correctly, i.e. one closing brace for every opening one.
For abc{abc{bc}xyz} I need it to match all of {abc{bc}xyz}, not just {abc{bc}.
I tried using (\{.*?})
This is not possible with regular expressions. A context-free grammar would be necessary for this, and regular expressions can only describe regular languages.
According to this link there is an extension available for the regular expressions in .NET that can do this, but this just means that .NET regular expressions are more than just regular expressions.
This is not a task for a regular expression. What you're looking for at that point is a parser. Which means a language grammar, LL(1), LALR, recursive-descent, the dragon book, and generally a splitting migraine.
Balanced parentheses of arbitrary nesting depth are not a regular language. They form a context-free language.
That said, many "regular expression" implementations actually recognize more than regular languages, so this is possible with some implementations but not others.
Wikipedia
Regular language
Pumping lemma for regular languages
Context-free language
Regular expression
Many features found in modern regular expression libraries provide an expressive power that far exceeds the regular languages.
As Bryan said, regular expressions might not be the right tool here, but if you're using PHP, the manual gives an example of how you might be able to use regular expressions in a recursive/nested fashion:
$input = "plain [indent] deep [indent] deeper [/indent] deep [/indent] plain";

function parseTagsRecursive($input)
{
    // (?R) recurses into the whole pattern, so a nested [indent] block is matched as a unit.
    $regex = '#\[indent]((?:[^[]|\[(?!/?indent])|(?R))+)\[/indent]#';

    // When invoked by preg_replace_callback, $input is the array of matches.
    if (is_array($input)) {
        $input = '<div style="margin-left: 10px">'.$input[1].'</div>';
    }

    return preg_replace_callback($regex, 'parseTagsRecursive', $input);
}

$output = parseTagsRecursive($input);
echo $output;
I'm not sure if that'll be helpful to you or not.
This is not possible in the "standard" regular expression language. However, a few different implementations have extensions that allow you to implement it. For example, here's a blog post that explains how to do it with .NET's regex library.
Generally speaking though, this is a task that regular expressions are not really suited to.
Assuming what you want to do is select a maximal substring between { and }:
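.*? is a lazy quantifier. That is, it will match the fewest characters possible, so the match stops at the first }. If you change your expression to {.*}, the greedy quantifier runs to the last } in the input, and you should find it will work as long as there is only one outer group.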
If what you want to do is to verify that the braces are matched correctly, then as the other answers have stated, this is not possible with a (single) regular expression. You can do it by scanning the string with a stack though. Or with some voodoo of iterating your regular expression over the previous maximal match. Yikes.
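To make the lazy/greedy difference and the scanning approach concrete, here is a minimal sketch in Python. The function name outer_groups is made up for illustration, and because there is only one kind of bracket, a depth counter stands in for the stack.

```python
import re

text = "abc{abc{bc}xyz}"

# Lazy: stops at the first closing brace.
lazy = re.search(r"\{.*?\}", text).group()    # "{abc{bc}"
# Greedy: runs to the last closing brace in the input.
greedy = re.search(r"\{.*\}", text).group()   # "{abc{bc}xyz}"


def outer_groups(s):
    """Return every outermost {...} group in s.

    Raises ValueError if the braces are not balanced.
    """
    groups = []
    depth = 0
    start = 0
    for i, ch in enumerate(s):
        if ch == "{":
            if depth == 0:
                start = i          # remember where this outer group began
            depth += 1
        elif ch == "}":
            if depth == 0:
                raise ValueError(f"unmatched }} at index {i}")
            depth -= 1
            if depth == 0:
                groups.append(s[start:i + 1])
    if depth:
        raise ValueError("unclosed {")
    return groups


print(lazy)
print(greedy)
print(outer_groups("a{b}c{d{e}}"))
```

On an input with two separate groups such as a{b}c{d{e}}, the greedy pattern would return {b}c{d{e}} as a single match, whereas the scan returns the two groups separately and also reports unbalanced input.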

Regular expression which matches regular expressions

Is it possible to write a regular expression which matches regular expressions? Does anyone have examples? If there is some theoretical obstruction, does anyone know of a regex which will match at least the most common regex patterns?
Regular expressions are not a regular language, and thus cannot be described by a regular expression!
Update: More useful practical answer
You cannot detect valid regular expressions using any regular expression. To detect its validity, you should just parse the string using the regex library and it would fail if it is an invalid regular expression. For example, in Java, it would be something like:
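import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

boolean isValidRegexp(String s) {
    try {
        // Pattern.compile throws PatternSyntaxException on a malformed pattern.
        Pattern.compile(s);
        return true;
    } catch (PatternSyntaxException e) {
        return false;
    }
}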
This technique should work with almost any language.
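As an illustration of the same idea in another language, here is a minimal Python sketch. The function name is_valid_regex is made up; re.compile and re.error are the standard library's own. Note that "valid" here means valid for Python's regex dialect, which differs in places from Java's or Perl's.

```python
import re


def is_valid_regex(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Raised for malformed patterns such as an unclosed group.
        return False


print(is_valid_regex(r"(a(b(c(d))))"))  # balanced groups
print(is_valid_regex(r"(a(b"))          # unclosed groups
```

Letting the engine's own parser decide avoids having to re-implement its grammar, which is exactly the part a regular expression cannot do.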
You're all wrong! In my secret laboratories, my evil scientists have discovered the regular expression that can match any regular expression:
.*
It will even match the null expression. Let's see you try to match that!
As an added benefit, it will even match strings that are not regular expressions.
It is not possible using standard regular expressions.
Regular expressions can be nested indefinitely (eg, /(a(b(c(d))))/), which is impossible to match using standard regex.
According to Crockford, this is a regex that matches a regular expression literal (at least in JavaScript):
/\/(\\[^\x00-\x1f]|\[(\\[^\x00-\x1f]|[^\x00-\x1f\\\/])*\]|[^\x00-\x1f\\\/\[])+\/[gim]*/
Yes. Example: This regex ^[a-z][+*]$ will match this regex z+ and this a* and this c+ and so on.
This is not possible. Regular expressions can only match regular languages. Regular expressions are not a regular language. If memory serves I believe they are a context-free language and require a context-free grammar to match.
Here we go:
m{/([^\\/]++|\\.)+/}
Should match a regular expression delimited by //.
Of course, it won't ensure that the regular expression parses correctly - it just identifies where it is (say, for a tokenizer).

Meaning of "match" as related to Regular Expressions

I'm writing a term paper on regular expressions and I'm a bit confused regarding the way one uses the word "match" when referring to regexes. Which of the following is the correct wording to use:
"The regular expression matches the string"
or
"The string matches the regular expression"
Or are they both correct? All opinions on this are welcome! I really want to get this right and I think it would help my understanding greatly to get this clarified.
I think both are correct. It depends on what you're focusing on. If your focus is on the regular expression itself, to see whether it works on a given string or set of strings, then you use the first sentence. Conversely, if you are more interested in the set of strings that match certain criteria, the second one is applicable. A match implies some equivalence under certain conditions, so both sentences sound equivalent to me.
The string is being matched to the regular expression pattern, therefore I would say the latter is more accurate.
When two things match, it is (from a logical perspective at least) irrelevant in which order you mention them.
So it depends on what you want to put focus on.
The string matches the regular expression: Focus is on the string.
The regular expression matches the string: Focus is on the regex.
The latter sounds better to me. The regex specifies a pattern that the string may match. But there's nothing really wrong with either.
If you said either one to me, I would understand what you're saying. I'm sure people have said both to me, and I never thought either one needed to be corrected.
I agree that the string matches (or not) the regular expression. To make it clear why I'd say: the regular expression defines a grammar, and a given string is either well-formed according to that grammar or not.
"The regular expression matches the string"
True if the RE matches the whole string (eg. using ^ $ or just happening to match everything). Otherwise, I would write: the regular expression has match(es) in the string.
"The string matches the regular expression"
Again, true if the regex matches everything, otherwise it sounds a bit odd.
But indeed, in the case of a whole match, the two sentences are equivalent.
Since you're looking for a regular expression within a string, it's more correct to say that you've found the regular expression since that's a one-way relationship.
But as to which matches which, that's a two-way relationship and it doesn't really matter (in English, anyway - I can't vouch for other languages), so either would be correct.
My preference would be to say that the string matches the regular expression, since the RE is the invariant part and the string changes. But that's a personal preference and is unlikely to have any bearing on reality :-)
"The string matches the regular expression" seems to be shorthand for "the string is in the language defined by and isomorphic to the regular expression."
"The regular expression matches the string" seems to be shorthand for "a parser automaton compiled from the regular expression will parse the string and halt in a final state."
I'd say:
At design time a user/developer creates a regular expression that matches a string.
At run time a regular expression engine finds a string that matches the regular expression.
(Not intended to be a definition, just an example of common usage.)
Since a regular expression represents a possibly infinite set of finite strings, I would say that it is most correct to write that "string s matches regular expression r". You could also say that "string s is a member of the set generated by regular expression r".
Also, you should consider using the words accept and reject, especially if you intend to discuss finite automata in your paper.