Is there a regular language to represent regular expressions? - regex

Specifically, I noticed that the language of regular expressions itself isn't regular. So, I can't use a regular expression to parse a given regular expression. I need to use a parser since the language of the regular expression itself is context free.
Is there any way regular expressions can be represented in a way that the resulting string can be parsed using a regular expression?
Note: My question isn't about whether there is a regexp to match the current syntax of regexes, but whether there exists a "representation" for regular expressions as we know it today (maybe not a neat as what we know them as today) that can be parsed using regular expressions. Also, please could someone remove the dup since it isn't a dup. I'm asking something completely different. I already know that the current language of regular expressions isn't regular (it is how I started my original question).

Depending on what you mean by "represent", the answer is "yes" or "no":
If you want a language that (homomorphically) maps 1:1 to the usual basic regular expression language, the answer is no, because a regular language cannot be isomorphic to a non-regular language, and the standard regular expression language is non-regular. This is because the syntax requires matching opening and closing parentheses of arbitrary depth.
If "represent" only means another method of specifying regular languages, the answer is yes, and right now I can think of at least three ways to achieve this:
The "dumbest" and easiest way is to define some surjective mapping f : ℕ -> RegEx from the natural numbers onto the set of all valid standard regular expressions. You can define the natural numbers using the regular expression 0|1[01]*, and the regular language denoted by a (string representing the) natural number n is the regular language denoted by f(n).
Of course, the meaning attached to a natural number would not be obvious to a human reader at all, so this "regular expression language" would be utterly useless.
As parentheses are the only non-regular part in simple regular expressions, the easiest human-interpretable method would be to extend the standard simple regular expression syntax to allow dangling parentheses and defining semantics for dangling parentheses.
The obvious choice would be to ignore non-matching opening parentheses and interpreting non-matching closing parentheses as matching the beginning of the regex. This essentially amounts to implicitly inserting as many opening parentheses at the beginning and as many closing parentheses at the end of the regex as necessary. Additionally, (* would have to be interpreted as repetition of the empty string. If I didn't miss anything, this definition should turn any string into a "regular expression" with a specified meaning, so .* defines this "regular expression language".
This variant even has the same abstract syntax as standard regular expressions.
Another variant would be to specify the NFA that recognizes the language directly using a regular language, e.g.: ([a-z]+,([^,]|\\,|\\\\)+,[a-z]+\$?;)*.
The idea is that [a-z]+ is used as a label for states, and the expression is a list of transition triples (s, c, t) from source state s to target state t consuming character c, and a $ indicating accepting transitions (cf. note below). In c, backslashes are used to escape commas or backslashes - I assumed that you use the same alphabet for standard regular expressions, but of course you can replace the middle component with any other regular language of symbols denotating characters of any alphabet you wish.
The first source state mentioned is the (single) initial state. An empty expression defines the empty language.
Above, I wrote "accepting transition", not "accepting state" because that would make the regex above a bit more complex. You can interpret a triple containing a $ as two transitions, namely one transition consuming c from s to a new, unique state, and an ε-transition from that state to t. This should allow any NFA to be represented, by replacing each transition to an accepting state with a $ triple and each transition to a non-accepting state with a non-$ triple.
One note that might make the "yes" part look more intuitive: Assembly languages are regular, and those are even Turing-complete, so it would be unexpected if it wasn't possible to specify "mere" regular languages using a regular language.

The answer is probably NO.
As you have pointed out, set of all possible regular expressions itself is not a regular set. Any TRUE regular expression (not those extended) can be converted into finite automata (FA). If regular expression can be represented in a form that can be parsed by itself, then FA can be parsed by regular expression as well.
But that's not possible as far as I know. RE itself can be reduced into three basic operation(According to the Dragon Book):
concatenation: e.g. ab
alternation: e.g. a|b
kleen closure: e.g. a*
The kleen closure can match infinite number of characters, but it cannot know how many characters to match.
Just think such case: you want to match 3 consecutive as. Then the corresponding regular expression is /aaa/. But what if you want match 4, 5, 6... as? Parser with only one RE cannot know the exact number of as. So it fails to give the right matching to arbitrary expressions. However, the RE parser has to match infinite different forms of REs. According to your expression, a regular expression cannot match all the possibilities.
Well, the only difference of a RE parser is that it does not need a tokenizer.(probably that's why RE is used in lexical analysis) Every character in RE is a token (excluding those escape charcters). But to parse RE, whatever it is converted,one has to face up with NFA/DFA/TREE... all equivalent structures that cannot be parsed by RE itself.

Related

Counterpart of regular expressions for parsing nested strucures

Regular expressions are a standard tool used for parsing strings across many languages. However their scope of applicability is limited. Regular expressions can only match a list. There is no way to describe arbitrary deep nested structures using regular expressions. Question: what is a technology/framework as widely used/spread and as standatd as regular expessions are that can match tree structures (produce AST).
Regular expressions describe a finite-state automaton.
Since the late 1960's, the "bread and butter" of parsing (though not necessarily the "state of the art") has been push-down automata generated by parser generators according to "LR" algorithms like LALR(1).
The connection to regular expressions is this: the parsing machine does in fact use rules very similar to regular expressions in order to recognize viable prefixes. The "shift" state transitions among the "core LR(0) items" constitute a finite automaton, and can be described by a regular expression. The recursion is is handled thanks to the semantic action of pushing symbols onto a stack when doing the "shifts", and removing them ("reducing"). Reductions rewrite a portion of the stack, and perform a "goto" to another state. This type of goto, together with the stack, is absent in the regular expression automaton.
Parse Expression Grammars are also related to regular expressions. Regular expressions themselves can be endowed with recursion. Firstly, we can take pieces of regular expressions and give them names, and then construct bigger regular expressions by writing expressions which invoke these names. (Such as feature is found in the lex tool where you can define a named expressions like letters [A-Za-z]+ and refer to it later as {letters}. Now suppose you allow circular references, like letters [A-Za-z]{letters}?. You now have recursion; the only problem is to adjust the model in order to implement it.
Implementations of so-called "regular expressions" in various modern languages and environments in fact support recursion. Perl-compatible regular expressions (PCRE) support it, for instance.
Expressions that feature recursion or backreferencing are not handled by the classic NFA compilation route (possibly converted to a DFA); they cannot be.
How the above letters recursion can be handled is with actual recursion. The ? operator can be implemented as a function which tries to match its respective argument object. If it succeeds, then it consumes whatever it has matched, otherwise it consumes nothing. That is to say, the regular expression can be converted to a syntax tree, and interpreted "as is" rather than compiled to a state machine (or trivially compiled to functions corresponding to the nodes of the tree), and such interpretation can naturally handle recursion. The interpretation then constitutes, effectively, a syntax-directed recursive-descent parser. (Note how I avoided left recursion in defining letters to make that example compatible with this approach).
Example: parenthesis-matching regex:
par-match := ({par-match})|
This gets compiled to a tree:
branch-op <-- "par-match" name points at this node
/ \
catenate-op <empty>
/ \
"(" catenate-op
/ \
{par-match} ")"
This can then converted to a recursive descent parser, or interpreted directly.
Pattern matching starts by invoking the top-level "branch-op". This operator simply tries all of the alternatives. Suppose the input is empty. Then the left alternative will fail: it demands an open parenthesis. So then the right alternative will succeed: empty matches empty. (The operators either "fail" or indicate "success" and consume input.)
But suppose your input is (()). The left catenate-op will in turn invoke its left subtree, which matches and consumes the left parenthesis, leaving ()). It will then invoke its right subtree, another catenate-op. This catenate-op matches its left subtree, which triggers recursion into the top level via the named par-match references. That recursion will match and consume (), leaving ). The catenate-op then invokes its right subtree which matches ). Control returns up to branch-op. (Though the left side of branch-op matched something, branch-op must still try the other alternative; more than one branch can match, and some can match longer than others.)
This is closely related to Parsing Expression Grammars work.
Practically speaking, the recursive definition could be encoded into the regex syntax somehow. Say we invent some new operator like (?name:definition) which means "match definition which is allowed to contain invocations of itself via name. The invocation syntax could be (*name), so that we can write the par-match example as (?par-match:\((*par-match)\)|). The combinations (? and (* are invalid under "classic" regex syntax and so we can use them for extension.
As a final note, regexes correspond to grammars. That is the fundamental connection btween regexes and parsing. That is to say, regexes correspond to a particular subset of grammars describe only regular languages. An example of a grammar which describes a regular language:
S -> A | B
B -> b
A -> A a | c
Although there is A -> A ... recursion, this is still regular, and corresponds to the regex ac*|b, which is just a more compact way to denote the same language. The grammar lets us notate languages that aren't regular and for which we can't write a regex, but as we have seen, we can extend the regex notation and semantics to express some of these things. Regular expressions aren't separate from grammars. The two aren't counterparts, but rather one is a special case or subset of the other.
Parser generators like Yacc, Bison, and derivatives are what you're after. They aren't as widespread as regular expressions because they generate actual C code. There are translations like Jison for example which implement the Yacc/Bison syntax using javascript. I know there are similar tools for other languages.
I get the impression Parsing expression grammar systems are up and coming though.

Construction of pattern that doesn't contain binary string

I was trying to write a pattern which doesn't contain binary string (let's assume 101). I know that such expressions cannot be written using Regular Expression considering http://en.wikipedia.org/wiki/Regular_language.
I tried writing the pattern for the above problem using Regular Expression though and it seems to be working.
\b(?!101)\w+\b
What I wanted to ask is that can a regular expression be written for my problem and why? And if yes, then is my regular expression correct?
To match a whole string that doesn't contain 101:
^(?!.*101).*$
Look-ahead are indeed an easy way to check a condition on a string through regex, but your regex will only match alphanumeric words that do not start with 101.
You wrote
I know that such expressions cannot be written using Regular
Expression considering http://en.wikipedia.org/wiki/Regular_language.
In that Wikipedia article, you seem to have missed the
Note that the "regular expression" features provided with many
programming languages are augmented with features that make them
capable of recognizing languages that can not be expressed by the
formal regular expressions (as formally defined below).
The negative lookahead construct is such a feature.

Aren't modern regular expression dialects regular?

I've seen a few comments here that mention that modern regular expressions go beyond what can be represented in a regular language. How is this so?
What features of modern regular expressions are not regular? Examples would be helpful.
The first thing that comes to mind is backreferences:
(\w*)\s\1
(matches a group of word characters, followed by a space character and then the same group previously matched) eg: hello hello matches, hello world doesn't.
This construct is not regular (ie: can't be generated by a regular grammar).
Another feature supported by Perl Compatible RegExp (PCRE) that is not regular are recursive patterns:
\((a*|(?R))*\)
This can be used to match any combination of balanced parentheses and "a"s (from wikipedia)
A couple of examples:
Regular expressions support grouping. E.g. in Ruby: /my (group)/.match("my group")[1] will output "group". storing something in a group requires an external storage, which a finite automaton does not have.
Many languages, e.g. C#, support captures, i.e. that each match will be captured on a stack - for example the pattern (?<MYGROUP>.)* could perform multiple captures of "." in the same group.
Grouping are used for backreferencing as pointed out by the user NullUserException above. Backreferencing requires one or more external stacks with the power of a push-down-automaton (you have to be able to push something on a stack and peek or pop it afterwards.
Some engines have the possibility of seperately pushing and popping external stacks and checking whether the stack is empty. In .NET, actually (?<MYGROUP>test) pushes a stack, while (?<-MYGROUP>) pops a stack.
Some engines like the .NET engine have a balanced grouping concept - where an external stack can be both pushed and popped at the same time. Balanced grouping syntax is (?<FIRSTGROUP-LASTGROUP>) which pops the LASTGROUP and pushes the capture since the LASTGROUP index on the FIRSTGROUP stack. This can actually be used to match infinitely nested constructions which is definitely beyond the power of a finite automaton.
Probably other good examples exist :-) If you are further interessted in some of the implementation details of external stacks in combination with Regex's and balanced grouping and thus higher order automata than finite automata, I once wrote two short articles on this (http://www.codeproject.com/KB/recipes/Nested_RegEx_explained.aspx and http://www.codeproject.com/KB/recipes/RegEx_Balanced_Grouping.aspx).
Anyway - finitieness or not - I blieve that the power that this extra stuff brings to the regular languages is great :-)
Br. Morten
A deterministic or nondeterministic finite automaton recognizes just the regular languages, which are described by regular expressions. The definition of a regular expression is simple. Let S be an alphabet. Then the empty set, the empty string, and every element of S are regular expressions (over S). Let u and v be regular expressions. Then the union (u | v), concatenation (uv), and closure (u*) of u and v are regular expressions over S. This definition is easily extended to the regular languages. No other expression is a regular expression. As pointed out, some back-references are an example. The Wikipedia pages on regular languages and expressions are good references.
In essence, certain "regular expressions" are not regular because no automaton of a particular type can be constructed to recognize them. For example, the the language
{ a^ i b^ i : i <= 0 }
is not regular. This is because the accepting automaton would require infinitely many states, but an automaton accepting regular languages must have a finite number of states.

Regular expressions Lexical Analysis

Why repeated strings such as
[wcw|w is a string of a's and b's]
cannot be denoted by regular expressions?
pls. give me detailed answer as i m new to lexical analysis.
Thanks ...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain nested structures as those languages can be described by a simple finite state machine. Simplified you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is, that no information whatsoever from previous states can be carried over to later states (a few characters further in the input). So if you have your symbol w you can't specify that the input must have exactly the same string w later in the sequence. Similarly you can't ensure that each opening paranthesis needs a closin paren as well (so regular expressions themselves are not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically only consisting of sequence, alternative (|) and repetition (*), everything else can be described with those operations.
However, usually regex engines allow grouping of certain sub-patterns into matches which can then be referenced or extracted later. Some engines even allow to use such a backreference in the search expression string itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly such use of backreferences can even yield languages that are not context-free.
Additional pointers:
This StackOverflow question
Wikipedia
It can be, you just can't assure that it's the same string of "a"s and "b"s because there's no way to retain the information acquired in traversing the first half for use in traversing the second.

Meaning of "match" as related to Regular Expressions

I'm writing a term paper on regular expressions and I'm a bit confused regarding the way one uses the word "match" when referring to regexes. Which of the following is the correct wording to use:
"The regular expression matches the string"
or
"The string matches the regular expression"
Or are they both correct? All opinions on this are welcome! I really want to get this right and I think it would help my understanding greatly to get this clarified.
I think both are correct. It depends on what you're focusing on. If your focus is in the regular expression itself to see if it serves to work on a given string or set of strings, then you use the first sentence. In the contrary, if you are more interested in looking at a set of strings that match certain criteria, the second one is applicable. You know, a match has the meaning of some equivalence under certain conditions, so both sentences sound equivalent to me.
The string is being matched to the regular expression pattern, therefore I would say the latter is more accurate
When two things match, it is (from a logical perspective at least) irrelevant in which order you mention them.
So it depends on what you want to put focus on.
The string matches the regular expression: Focus is on the string.
The regular expression matches the string: Focus is on the regex.
The latter sounds better to me. The regex specifies a pattern that the string may match. But there's nothing really wrong with either.
If you said either one to me, I would understand what you're saying. I'm sure people have said both to me, and I never thought either one needed to be corrected.
I agree that the string matches (or not) the regular expression. To make it clear why I'd say: the regular expression defines a grammar, and a given string is either well-formed according to that grammar or not.
"The regular expression matches the string"
True if the RE matches the whole string (eg. using ^ $ or just happening to match everything). Otherwise, I would write: the regular expression has match(es) in the string.
"The string matches the regular expression"
Again, true if the regex matches everything, otherwise it sounds a bit odd.
But indeed, in the case of a whole match, the two sentences are equivalent.
Since you're looking for a regular expression within a string, it's more correct to say that you've found the regular expression since that's a one-way relationship.
But as to which matches which, that's a two way relationship and it doesn't really matter (in English, anyway - I can't vouch for other languages ), so either would be correct.
My preference would be to say that the string matches the regular expression, since the RE is the invariant part and the string changes. But that's a personal preference and is unlikely to have any bearing on reality :-)
"The string matches the regular expression" seems to be shorthand for "the string is in the language defined by and isomorphic to the regular expression."
"The regular expression matches the string" seems to be shorthand for "a parser automaton compiled from the regular expression will parse the string and halt in a final state."
I'd say:
At design time a user/develper creates a regular expression that matches a string.
At run time a regular expression engine finds a string that matches the regular expression.
(Not intended to be a definition, just an example of common usage.)
Since a regular expression represents a possibly infinite set of finite strings, I would say that it is most correct to write that "string s matches regular expression r". You could also say that "string s is member of the set generated by regular expression r".
Also, you should consider using the words accept and reject, especially if you intend to discuss finite automata in your paper.