Boost regular expression syntax validation - c++

A simple description of the problem is that I need to receive a regular expression as input and check if any given string matches it.
My question: Is there a way to verify that the given regex input has valid syntax? I am using boost and POSIX regular expressions (not sure if it is important whether basic or extended regular expressions are used, the problem remains the same.) Is there even a "wrong" syntax for regular expressions?

http://www.boost.org/doc/libs/1_61_0/libs/regex/doc/html/boost_regex/ref/basic_regex.html#boost_regex.basic_regex.construct3
Throws: bad_expression if [p1,p2) is not a valid regular expression, unless the flag no_except is set in f.

Related

Is there a regular language to represent regular expressions?

Specifically, I noticed that the language of regular expressions itself isn't regular. So, I can't use a regular expression to parse a given regular expression. I need to use a parser since the language of the regular expression itself is context free.
Is there any way regular expressions can be represented in a way that the resulting string can be parsed using a regular expression?
Note: My question isn't about whether there is a regexp to match the current syntax of regexes, but whether there exists a "representation" for regular expressions as we know it today (maybe not a neat as what we know them as today) that can be parsed using regular expressions. Also, please could someone remove the dup since it isn't a dup. I'm asking something completely different. I already know that the current language of regular expressions isn't regular (it is how I started my original question).
Depending on what you mean by "represent", the answer is "yes" or "no":
If you want a language that (homomorphically) maps 1:1 to the usual basic regular expression language, the answer is no, because a regular language cannot be isomorphic to a non-regular language, and the standard regular expression language is non-regular. This is because the syntax requires matching opening and closing parentheses of arbitrary depth.
If "represent" only means another method of specifying regular languages, the answer is yes, and right now I can think of at least three ways to achieve this:
The "dumbest" and easiest way is to define some surjective mapping f : ℕ -> RegEx from the natural numbers onto the set of all valid standard regular expressions. You can define the natural numbers using the regular expression 0|1[01]*, and the regular language denoted by a (string representing the) natural number n is the regular language denoted by f(n).
Of course, the meaning attached to a natural number would not be obvious to a human reader at all, so this "regular expression language" would be utterly useless.
As parentheses are the only non-regular part in simple regular expressions, the easiest human-interpretable method would be to extend the standard simple regular expression syntax to allow dangling parentheses and defining semantics for dangling parentheses.
The obvious choice would be to ignore non-matching opening parentheses and interpreting non-matching closing parentheses as matching the beginning of the regex. This essentially amounts to implicitly inserting as many opening parentheses at the beginning and as many closing parentheses at the end of the regex as necessary. Additionally, (* would have to be interpreted as repetition of the empty string. If I didn't miss anything, this definition should turn any string into a "regular expression" with a specified meaning, so .* defines this "regular expression language".
This variant even has the same abstract syntax as standard regular expressions.
Another variant would be to specify the NFA that recognizes the language directly using a regular language, e.g.: ([a-z]+,([^,]|\\,|\\\\)+,[a-z]+\$?;)*.
The idea is that [a-z]+ is used as a label for states, and the expression is a list of transition triples (s, c, t) from source state s to target state t consuming character c, and a $ indicating accepting transitions (cf. note below). In c, backslashes are used to escape commas or backslashes - I assumed that you use the same alphabet for standard regular expressions, but of course you can replace the middle component with any other regular language of symbols denotating characters of any alphabet you wish.
The first source state mentioned is the (single) initial state. An empty expression defines the empty language.
Above, I wrote "accepting transition", not "accepting state" because that would make the regex above a bit more complex. You can interpret a triple containing a $ as two transitions, namely one transition consuming c from s to a new, unique state, and an ε-transition from that state to t. This should allow any NFA to be represented, by replacing each transition to an accepting state with a $ triple and each transition to a non-accepting state with a non-$ triple.
One note that might make the "yes" part look more intuitive: Assembly languages are regular, and those are even Turing-complete, so it would be unexpected if it wasn't possible to specify "mere" regular languages using a regular language.
The answer is probably NO.
As you have pointed out, set of all possible regular expressions itself is not a regular set. Any TRUE regular expression (not those extended) can be converted into finite automata (FA). If regular expression can be represented in a form that can be parsed by itself, then FA can be parsed by regular expression as well.
But that's not possible as far as I know. RE itself can be reduced into three basic operation(According to the Dragon Book):
concatenation: e.g. ab
alternation: e.g. a|b
kleen closure: e.g. a*
The kleen closure can match infinite number of characters, but it cannot know how many characters to match.
Just think such case: you want to match 3 consecutive as. Then the corresponding regular expression is /aaa/. But what if you want match 4, 5, 6... as? Parser with only one RE cannot know the exact number of as. So it fails to give the right matching to arbitrary expressions. However, the RE parser has to match infinite different forms of REs. According to your expression, a regular expression cannot match all the possibilities.
Well, the only difference of a RE parser is that it does not need a tokenizer.(probably that's why RE is used in lexical analysis) Every character in RE is a token (excluding those escape charcters). But to parse RE, whatever it is converted,one has to face up with NFA/DFA/TREE... all equivalent structures that cannot be parsed by RE itself.

Check is regex is valid via regex

Out of curiousity, is it possible to write a regex, which checks if other regexs are valid.
No, it is not possible: regular expression execution model is not powerful enough for that.
In order for a regular expression string to be valid, all parentheses in the string must be balanced. Since it is not theoretically possible to write a regular expression to verify if all parentheses in a string are balanced, it is also not possible to write a regular expression to check validity of a regular expression string.

Construction of pattern that doesn't contain binary string

I was trying to write a pattern which doesn't contain binary string (let's assume 101). I know that such expressions cannot be written using Regular Expression considering http://en.wikipedia.org/wiki/Regular_language.
I tried writing the pattern for the above problem using Regular Expression though and it seems to be working.
\b(?!101)\w+\b
What I wanted to ask is that can a regular expression be written for my problem and why? And if yes, then is my regular expression correct?
To match a whole string that doesn't contain 101:
^(?!.*101).*$
Look-ahead are indeed an easy way to check a condition on a string through regex, but your regex will only match alphanumeric words that do not start with 101.
You wrote
I know that such expressions cannot be written using Regular
Expression considering http://en.wikipedia.org/wiki/Regular_language.
In that Wikipedia article, you seem to have missed the
Note that the "regular expression" features provided with many
programming languages are augmented with features that make them
capable of recognizing languages that can not be expressed by the
formal regular expressions (as formally defined below).
The negative lookahead construct is such a feature.

Can RPN (Reverse Polish Notation) or Postfix Notation be derived by Regular Expression

I was wondering if I can define a Regular Expression to check whether a given input matches the RPN expression , i.e. whether the given input is valuid or not?
I am not very familiar with Regex unfortunately, so I was wondering if it is possible to define a regular expression to validate the input for postfix.
Many thanks
Taz
Formally, no; RPN would require a context-free grammar, which regular expressions cannot express. However, it might be possible to do so with a "regular expression" package or library, since they can contain features that are outside of the formal definition of regular expressions.

Is it possible to have regexp that matches all valid regular expressions?

Is it possible to detect if a given string is valid regular expression, using just regular expressions?
Say I have some strings, that may or may not be a valid regular expressions. I'd like to have a regular expression matches those string that correspond to valid regular expression. Is that possible? Or do I have use some higher level grammar (i.e. context free language) to detect this? Does it affect if I am using some extended version of regexps like Perl regexps?
If that is possible, what the regexp matching regexp is?
No, it is not possible. This is because valid regular expressions involve grouping, which requires balanced parentheses.
Balanced delimiters cannot be matched by a regular expression; they must instead be matched with a context-free grammar. (The first example on that article deals with balanced parentheses.)
See an excellent write-up here:
Regular expression for regular expressions?
The answer is that regexes are NOT written using a regular grammar, but a context-free one.
If your question had been "match all valid regular expressions", the answer is (perhaps surprisingly) 'yes'. The regular expression .* match all valid (and non-valid) regular expressions, but is pretty useless for determining if you're looking at a valid one.
However, as the question is "match all and only valid regular expressions", the answer is (as DVK and Platinum Azure" have said 'no'.