Check if regex is valid via regex - regex

Out of curiosity, is it possible to write a regex that checks whether other regexes are valid?

No, it is not possible: the regular expression execution model is not powerful enough for that.
In order for a regular expression string to be valid, all parentheses in the string must be balanced. Since it is not theoretically possible to write a regular expression to verify if all parentheses in a string are balanced, it is also not possible to write a regular expression to check validity of a regular expression string.
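To illustrate why balance checking needs more than a finite automaton, here is a minimal sketch of the counter that ordinary code can keep and a classical regular expression cannot. The function name `parens_balanced` is hypothetical, and the sketch is deliberately simplified: it skips backslash-escaped characters but does not treat parentheses inside character classes such as `[(]` specially.

```python
def parens_balanced(pattern: str) -> bool:
    """Return True if every unescaped '(' has a matching ')'."""
    depth = 0  # unbounded counter: exactly what a finite automaton lacks
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False  # a ')' appeared before its '('
        i += 1
    return depth == 0
```

Balanced parentheses are only one of the conditions for validity, so this check is necessary rather than sufficient.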

Related

Boost regular expression syntax validation

A simple description of the problem is that I need to receive a regular expression as input and check if any given string matches it.
My question: Is there a way to verify that the given regex input has valid syntax? I am using boost and POSIX regular expressions (not sure if it is important whether basic or extended regular expressions are used, the problem remains the same.) Is there even a "wrong" syntax for regular expressions?
http://www.boost.org/doc/libs/1_61_0/libs/regex/doc/html/boost_regex/ref/basic_regex.html#boost_regex.basic_regex.construct3
Throws: bad_expression if [p1,p2) is not a valid regular expression, unless the flag no_except is set in f.

How to match Regular Expression with String containing a wildcard character?

Regular expression:
/Hello .*, what's up?/i
String which may contain any number of wildcard characters (%):
"% world, what's up?" (matches)
"Hello world, %?" (matches)
"Hello %, what's up?" (matches)
"Hey world, what's up?" (no match)
"Hello %, blabla." (no match)
I have thought of a solution myself, but I'd like to see what you are able to come up with (considering performance is a high priority). A requirement is the ability to use any regular expression; I only used .* in the example, but any valid regular expression should work.
A little automata theory might help you here. You say
this is a simplified version of matching a regular expression with a regular expression[1]
Actually, that does not seem to be the case. Instead of matching the text of a regular expression, you want to find regular expressions that can match the same string as a given regular expression.
Luckily, this problem is solvable :-) To see whether such a string exists, you would need to compute the intersection of the two regular languages and test whether the result is not the empty language. This might be a non-trivial problem and solving it efficiently [enough] may be hard, but standard algorithms for this already exist. Basically, you would translate each expression into an NFA, convert that into a DFA, and then intersect the two DFAs.
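The emptiness test can be sketched with a product construction: walk both DFAs in lockstep and look for a reachable pair of states that is accepting in both, which means some string is accepted by both automata (their intersection is non-empty). This is only a sketch under assumptions: the `DFA` type, the function name and the hand-built example automata are all hypothetical, and the regex-to-NFA-to-DFA translation is assumed to have been done already.

```python
from collections import deque
from typing import NamedTuple


class DFA(NamedTuple):
    start: int
    accepting: frozenset
    delta: dict  # maps (state, symbol) -> next state; a missing entry rejects


def intersection_nonempty(a: DFA, b: DFA, alphabet: str) -> bool:
    """Breadth-first search over pairs of states of the product automaton."""
    start = (a.start, b.start)
    seen = {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        if p in a.accepting and q in b.accepting:
            return True  # some string drives both DFAs to acceptance
        for sym in alphabet:
            if (p, sym) in a.delta and (q, sym) in b.delta:
                nxt = (a.delta[(p, sym)], b.delta[(q, sym)])
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False


# Hand-built example DFAs over the alphabet "ab" (illustrative only)
ab_star = DFA(0, frozenset({1}), {(0, "a"): 1, (1, "b"): 1})       # ab*
ends_in_b = DFA(0, frozenset({1}), {(0, "a"): 0, (0, "b"): 1,
                                    (1, "a"): 0, (1, "b"): 1})     # .*b
b_plus = DFA(0, frozenset({1}), {(0, "b"): 1, (1, "b"): 1})        # b+
```

Here `intersection_nonempty(ab_star, ends_in_b, "ab")` succeeds because both accept "ab", while `ab_star` and `b_plus` share no string. The search visits at most one pair per combination of states, so the cost is bounded by the product of the two state counts.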
[1]: Indeed, the wildcard strings you're using in the question build some kind of regular language, and can be translated to corresponding regular expressions
Not sure that I fully understand your question, but if you're looking for performance, avoid regular expressions. Instead you can split the string on %. Then, take a look at the first and last matches:
// Anything before % should match at the start of the string
targetString.startsWith(splits[0]);
// Anything after % should match at the end of the string
// (indexOf would find the first occurrence, not necessarily the one at the end)
targetString.endsWith(splits[1]);
If you can use % multiple times within the string, then the first and last splits should follow the above rules. The remaining pieces just need to appear in the string, in order, between those two, and .indexOf with a start position is how you can check that.
I came to realize that this is impossible with a regular language, and therefore the only solution to this problem is to replace the wildcard symbol % with .* and then match the two regular expressions against each other. This cannot, however, be done with traditional regular expressions; look at this SO question and its answers for details.
Or perhaps you should edit the underlying regular expression engine to support wildcard-based strings. Any answer that solves this by extending the default implementation will be accepted ;-)
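The split-based idea, extended to several wildcards, might look like the sketch below. Note the assumption baked into this answer: it compares a wildcard string against a plain literal string, not against a regular expression, and it is case-sensitive. The function name `wildcard_match` is hypothetical.

```python
def wildcard_match(pattern: str, target: str, wildcard: str = "%") -> bool:
    """Match a literal target against a pattern whose wildcard spans any text."""
    parts = pattern.split(wildcard)
    if len(parts) == 1:
        return pattern == target  # no wildcard: plain equality
    first, *middle, last = parts
    if not target.startswith(first) or not target.endswith(last):
        return False
    pos = len(first)                # search window starts after the prefix
    end = len(target) - len(last)   # and stops before the suffix
    if end < pos:
        return False                # prefix and suffix would overlap
    for piece in middle:
        idx = target.find(piece, pos, end)  # leftmost fit inside the window
        if idx == -1:
            return False
        pos = idx + len(piece)
    return True
```

Taking the leftmost occurrence of each middle piece leaves the most room for the pieces that follow, which is why a single left-to-right pass is enough.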

Is it possible to have regexp that matches all valid regular expressions?

Is it possible to detect if a given string is valid regular expression, using just regular expressions?
Say I have some strings that may or may not be valid regular expressions. I'd like to have a regular expression that matches exactly those strings that are valid regular expressions. Is that possible? Or do I have to use some higher-level grammar (i.e. a context-free language) to detect this? Does it make a difference if I am using some extended version of regexps, like Perl regexps?
If it is possible, what is the regexp that matches regexps?
No, it is not possible. This is because valid regular expressions involve grouping, which requires balanced parentheses.
Balanced delimiters cannot be matched by a regular expression; they must instead be matched with a context-free grammar. (The first example on that article deals with balanced parentheses.)
See an excellent write-up here:
Regular expression for regular expressions?
The answer is that regexes are NOT written using a regular grammar, but a context-free one.
If your question had been "match all valid regular expressions", the answer is (perhaps surprisingly) 'yes'. The regular expression .* matches all valid (and non-valid) regular expressions, but is pretty useless for determining if you're looking at a valid one.
However, as the question is "match all and only valid regular expressions", the answer is (as DVK and Platinum Azure have said) 'no'.

Regular expression which matches regular expressions

Is it possible to write a regular expression which matches regular expressions? Does anyone have examples? If there is some theoretical obstruction, does anyone know of a regex which will match at least the most common regex patterns?
Regular expressions are not a regular language, and thus cannot be described by a regular expression!
Update: More useful practical answer
You cannot detect valid regular expressions using any regular expression. To check validity, just try to compile the string with the regex library; compilation fails if the string is not a valid regular expression. For example, in Java, it would be something like:
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

boolean isValidRegexp(String s) {
    try {
        // Compilation throws if s is not a valid pattern
        Pattern.compile(s);
        return true;
    } catch (PatternSyntaxException e) {
        return false;
    }
}
This technique should work with almost any language.
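As one illustration of the same technique in another language, a Python sketch could rely on `re.compile` raising `re.error` for an invalid pattern. The function name `is_valid_regex` is hypothetical, and "valid" here means valid for Python's `re` dialect only, which differs from Java's or POSIX syntax.

```python
import re


def is_valid_regex(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True
```

For instance, `is_valid_regex("(a|b)*")` succeeds, while an unclosed group such as `"(a"` is rejected.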
You're all wrong! In my secret laboratories, my evil scientists have discovered the regular expression that can match any regular expression:
.*
It will even match the null expression. Let's see you try to match that!
As an added benefit, it will even match strings that are not regular expressions.
It is not possible using standard regular expressions.
Regular expressions can be nested to arbitrary depth (e.g. /(a(b(c(d))))/), which is impossible to match using a standard regex.
According to Crockford, this is a regex that matches regular expressions (at least in JavaScript):
/\/(\\[^\x00-\x1f]|\[(\\[^\x00-\x1f]|[^\x00-\x1f\\\/])*\]|[^\x00-\x1f\\\/\[])+\/[gim]*/
Yes. Example: This regex ^[a-z][+*]$ will match this regex z+ and this a* and this c+ and so on.
This is not possible. Regular expressions can only match regular languages. Regular expressions are not a regular language. If memory serves I believe they are a context-free language and require a context-free grammar to match.
Here we go:
m{/([^\\/]++|\\.)+/}
Should match a regular expression delimited by //.
Of course, it won't ensure that the regular expression parses correctly - it just identifies where it is (say, for a tokenizer).

Meaning of "match" as related to Regular Expressions

I'm writing a term paper on regular expressions and I'm a bit confused regarding the way one uses the word "match" when referring to regexes. Which of the following is the correct wording to use:
"The regular expression matches the string"
or
"The string matches the regular expression"
Or are they both correct? All opinions on this are welcome! I really want to get this right and I think it would help my understanding greatly to get this clarified.
I think both are correct. It depends on what you're focusing on. If your focus is on the regular expression itself, to see whether it works on a given string or set of strings, then you use the first sentence. Conversely, if you are more interested in a set of strings that match certain criteria, the second one is applicable. A match means some equivalence under certain conditions, so both sentences sound equivalent to me.
The string is being matched to the regular expression pattern, therefore I would say the latter is more accurate.
When two things match, it is (from a logical perspective at least) irrelevant in which order you mention them.
So it depends on what you want to put focus on.
The string matches the regular expression: Focus is on the string.
The regular expression matches the string: Focus is on the regex.
The latter sounds better to me. The regex specifies a pattern that the string may match. But there's nothing really wrong with either.
If you said either one to me, I would understand what you're saying. I'm sure people have said both to me, and I never thought either one needed to be corrected.
I agree that the string matches (or not) the regular expression. To make clear why I'd say so: the regular expression defines a grammar, and a given string is either well-formed according to that grammar or not.
"The regular expression matches the string"
True if the RE matches the whole string (e.g. using ^ $ or just happening to match everything). Otherwise, I would write: the regular expression has match(es) in the string.
"The string matches the regular expression"
Again, true if the regex matches everything, otherwise it sounds a bit odd.
But indeed, in the case of a whole match, the two sentences are equivalent.
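The distinction between a whole match and a match somewhere inside the string shows up directly in many regex APIs. A small sketch using Python's `re` module, with a pattern and sample text chosen purely for illustration:

```python
import re

pattern = re.compile(r"\d+")
text = "order 42 shipped"

# search: the regular expression has a match *in* the string
partial = pattern.search(text)

# fullmatch: the regular expression must cover the whole string
whole = pattern.fullmatch(text)
whole_digits = pattern.fullmatch("42")
```

Here `partial` finds "42" inside the text, `whole` is `None` because the pattern does not cover the entire string, and `whole_digits` is a match object.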
Since you're looking for a regular expression within a string, it's more correct to say that you've found the regular expression since that's a one-way relationship.
But as to which matches which, that's a two-way relationship and it doesn't really matter (in English, anyway - I can't vouch for other languages), so either would be correct.
My preference would be to say that the string matches the regular expression, since the RE is the invariant part and the string changes. But that's a personal preference and is unlikely to have any bearing on reality :-)
"The string matches the regular expression" seems to be shorthand for "the string is in the language defined by and isomorphic to the regular expression."
"The regular expression matches the string" seems to be shorthand for "a parser automaton compiled from the regular expression will parse the string and halt in a final state."
I'd say:
At design time a user/developer creates a regular expression that matches a string.
At run time a regular expression engine finds a string that matches the regular expression.
(Not intended to be a definition, just an example of common usage.)
Since a regular expression represents a possibly infinite set of finite strings, I would say that it is most correct to write that "string s matches regular expression r". You could also say that "string s is a member of the set generated by regular expression r".
Also, you should consider using the words accept and reject, especially if you intend to discuss finite automata in your paper.