Is there a way to negate a regular expression? - regex

Given a regular expression R that describes a regular language (no fancy backreferences). Is there an algorithmic way to construct a regular expression R* that describes the language of all words except those described by R? It should be possible as Wikipedia says:
The regular languages are closed under the various operations, that is, if the languages K and L are regular, so is the result of the following operations: […] the complement ¬L
For example, given the alphabet {a,b,c}, the inverse of the language (abc*)+ is (a|(ac|b|c).*)?
As DPenner has already pointed out in the comments, the inverse of a regular expresion can be exponentially larger than the original expression. This makes inversing regular expressions unsuitable to implement negative partial expression syntax for searching purposes. Is there an algorithm that preserves the O(n*m) runtime characteristic (where n is the size of the regex and m is the length of the input) of regular expression matching and allows for negated subexpressions?

Unfortunately, the answer given by nhahdtdh in the comments is as good as we can do (so far). Whether a given regular expression generates all strings is PSPACE-complete. Since all problems in NP are in PSPACE-complete, an efficient solution to the universality problem would imply that P=NP.
If there were an efficient solution to your problem, would you be able to resolve the universality problem? Sure you would.
Use your efficient algorithm to generate a regular expression for the negation;
Determine whether the resulting regular expression generates the empty set.
Note that the problem "given a regular expression, does it generate the empty set" is fairly straightforward:
The regular expression {} generates the empty set.
(r + s) generates the empty set iff both r and s generate the empty set.
(rs) generates the empty set iff either r or s generates the empty set.
Nothing else generates the empty set.
Basically, it's pretty easy to tell whether a regular expression generates the empty set: just start evaluating the regular expression.
(Note that while the above procedure is efficient in terms of the output length, it might not be efficient in terms of the input length, if the output length is more than polynomially faster than the input length. However, if that were the case, we'd have the same result anyway, i.e., that your algorithm isn't really efficient, since it would take exponentially many steps to generate an exponentially longer output from a given input).

Wikipedia says: ... if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions. We can deduct from this statement that there is an infinite number of expressions that describe the language of all words except those described by R.
Again, (as also #nhahtdh tried to explain) the simplest algorithm to address this question is to extend the scope of evaluation outside the context of the regular expression language itself. That is: match the strings you want to exclude (which represent a finite subset to work with) by using the original regular expression and then treat any failure to match as an actual match (out of an infinite set of other possibilities). So, if the result of the match is negative, your candidate strings are a subset of the valid solutions.

Related

Compiler matching string to regex, using NFA

I was reading about compilers, the chapter about lexical analyzers(scanners) and I'm puzzled by the following statement:
For an input string X and a regular expression R, the complexity for finding a match using non-deterministic finite automata is:
O(len R * len X)
How can the complexity be polynomial in len R?
I'm under the impression that it depends exponentially on len R, because whenever we have a character which may appear a variable number of times(ie followed by the * symbol), we must test all possible number of occurences. If we have multiple characters which appear a variable number of times, we must check all possibilities(by backtracking).
Where am I wrong?
we must check all possibilities(by backtracking).
Not necessarily by backtracking. There are many ways to implement an NFA. By moving through the input linearly, and transitioning to multiple states at the same time (storing the set of active states in an O(1)-lookup structure), you will get exactly the mentioned runtime - number of states in NFA is linear to length of regex.
See also the popular articel Regular Expression Matching Can Be Simple And Fast.

Is there a regular language to represent regular expressions?

Specifically, I noticed that the language of regular expressions itself isn't regular. So, I can't use a regular expression to parse a given regular expression. I need to use a parser since the language of the regular expression itself is context free.
Is there any way regular expressions can be represented in a way that the resulting string can be parsed using a regular expression?
Note: My question isn't about whether there is a regexp to match the current syntax of regexes, but whether there exists a "representation" for regular expressions as we know it today (maybe not a neat as what we know them as today) that can be parsed using regular expressions. Also, please could someone remove the dup since it isn't a dup. I'm asking something completely different. I already know that the current language of regular expressions isn't regular (it is how I started my original question).
Depending on what you mean by "represent", the answer is "yes" or "no":
If you want a language that (homomorphically) maps 1:1 to the usual basic regular expression language, the answer is no, because a regular language cannot be isomorphic to a non-regular language, and the standard regular expression language is non-regular. This is because the syntax requires matching opening and closing parentheses of arbitrary depth.
If "represent" only means another method of specifying regular languages, the answer is yes, and right now I can think of at least three ways to achieve this:
The "dumbest" and easiest way is to define some surjective mapping f : ℕ -> RegEx from the natural numbers onto the set of all valid standard regular expressions. You can define the natural numbers using the regular expression 0|1[01]*, and the regular language denoted by a (string representing the) natural number n is the regular language denoted by f(n).
Of course, the meaning attached to a natural number would not be obvious to a human reader at all, so this "regular expression language" would be utterly useless.
As parentheses are the only non-regular part in simple regular expressions, the easiest human-interpretable method would be to extend the standard simple regular expression syntax to allow dangling parentheses and defining semantics for dangling parentheses.
The obvious choice would be to ignore non-matching opening parentheses and interpreting non-matching closing parentheses as matching the beginning of the regex. This essentially amounts to implicitly inserting as many opening parentheses at the beginning and as many closing parentheses at the end of the regex as necessary. Additionally, (* would have to be interpreted as repetition of the empty string. If I didn't miss anything, this definition should turn any string into a "regular expression" with a specified meaning, so .* defines this "regular expression language".
This variant even has the same abstract syntax as standard regular expressions.
Another variant would be to specify the NFA that recognizes the language directly using a regular language, e.g.: ([a-z]+,([^,]|\\,|\\\\)+,[a-z]+\$?;)*.
The idea is that [a-z]+ is used as a label for states, and the expression is a list of transition triples (s, c, t) from source state s to target state t consuming character c, and a $ indicating accepting transitions (cf. note below). In c, backslashes are used to escape commas or backslashes - I assumed that you use the same alphabet for standard regular expressions, but of course you can replace the middle component with any other regular language of symbols denotating characters of any alphabet you wish.
The first source state mentioned is the (single) initial state. An empty expression defines the empty language.
Above, I wrote "accepting transition", not "accepting state" because that would make the regex above a bit more complex. You can interpret a triple containing a $ as two transitions, namely one transition consuming c from s to a new, unique state, and an ε-transition from that state to t. This should allow any NFA to be represented, by replacing each transition to an accepting state with a $ triple and each transition to a non-accepting state with a non-$ triple.
One note that might make the "yes" part look more intuitive: Assembly languages are regular, and those are even Turing-complete, so it would be unexpected if it wasn't possible to specify "mere" regular languages using a regular language.
The answer is probably NO.
As you have pointed out, set of all possible regular expressions itself is not a regular set. Any TRUE regular expression (not those extended) can be converted into finite automata (FA). If regular expression can be represented in a form that can be parsed by itself, then FA can be parsed by regular expression as well.
But that's not possible as far as I know. RE itself can be reduced into three basic operation(According to the Dragon Book):
concatenation: e.g. ab
alternation: e.g. a|b
kleen closure: e.g. a*
The kleen closure can match infinite number of characters, but it cannot know how many characters to match.
Just think such case: you want to match 3 consecutive as. Then the corresponding regular expression is /aaa/. But what if you want match 4, 5, 6... as? Parser with only one RE cannot know the exact number of as. So it fails to give the right matching to arbitrary expressions. However, the RE parser has to match infinite different forms of REs. According to your expression, a regular expression cannot match all the possibilities.
Well, the only difference of a RE parser is that it does not need a tokenizer.(probably that's why RE is used in lexical analysis) Every character in RE is a token (excluding those escape charcters). But to parse RE, whatever it is converted,one has to face up with NFA/DFA/TREE... all equivalent structures that cannot be parsed by RE itself.

How do I find the shortest possible reg exp that accepts a sequence?

I'm looking for a way to find the smallest possible regular-expression that accepts a sequence.
To make it interesting I don't want any stars(Kleene stars) and preferably no wildcards?
For instance the sequence : 'aaaaaaaa' would be accepted by 'a^8' and a^8 would be the shortest possible expression to accept the sequence.
Does anyone body know how to generate such an expression?
The search space for what you are after will most likely grow exponentially as the string grows, since there is usually a large amount of regular patterns that can match a given string.
I think that in your case you could try using some search heuristic to try and approximate or even manage to find the optimal solution. I do not think that there is a straightforward solution for that (albeit that is just my opinion).
Given that regular expressions and deterministic finite automata are equivalent, you can minimise a given regular expression using any of the algorithms for the minimisation of DFAs. You would of course still need to come up with a regular expression to start with, but if you only need it to accept one string, then the characters of that string are the states. You can then minimise that DFA and convert it to a regular expression.

Algorithm to match list of regular expressions

I have two algorithmic questions for a project I am working on. I have thought about these, and have some suspicions, but I would love to hear the community's input as well.
Suppose I have a string, and a list of N regular expressions (actually they are wildcard patterns representing a subset of full regex functionality). I want to know whether the string matches at least one of the regular expressions in the list. Is there a data structure that can allow me to match the string against the list of regular expressions in sublinear (presumably logarithmic) time?
This is an extension of the previous problem. Suppose I have the same situation: a string and a list of N regular expressions, only now each of the regular expressions is paired with an offset within the string at which the match must begin (or, if you prefer, each of the regular expressions must match a substring of the given string beginning at the given offset).
To give an example, suppose I had the string:
This is a test string
and the regex patterns and offsets:
(a) his.* at offset 0
(b) his.* at offset 1
The algorithm should return true. Although regex (a) does not match the string beginning at offset 0, regex (b) does match the substring beginning at offset 1 ("his is a test string").
Is there a data structure that can allow me to solve this problem in sublinear time?
One possibly useful piece of information is that often, many of the offsets in the list of regular expressions are the same (i.e. often we are matching the substring at offset X many times). This may be useful to leverage the solution to problem #1 above.
Thank you very much in advance for any suggestions you may have!
I will assume that you really have the power of regular expressions.
To determine whether a string is matched by one of expressions e_1, e_2, ..., e_n, just match against the expression e_1 + e_2 + ... + e_n (sometimes the + operator is written as |).
Given expression-offset pairs (e_1, o_1), ..., (e_n, o_n) and a string, you can check whether there is i such that the string is matched by expression e_i at offset o_i by matching against the expression .{o_1}e_1 + ... + .{o_n}e_n.
Depending on the form of the individual expressions, you can get sublinear performance (not in general though).
If your expressions are sufficiently simple (wildcard patterns are), AND your set of expressions are predetermined, i.e. only the input to be matched changes, THEN you may construct a finite state machine that matches the union of your expressions, i.e., the expression "(r1)|(r2)|...".
Constructing that machine takes time and space at least O(N) (but I guess it is not exponential, which is the worst case for regular expressions in general). Matching is then O(length(input)), independent of N.
OTOH, if your set of expressions is to be part of the program's input, then there is no sublinear algorithm, simply because each expression must be considered.
(1) Combine all the regular expressions as a big union: (r1)|(r2)|(r3)|...
(2) For each regex with offset n add n dots to the beginning plus an anchor. So his.* at offset 6 becomes ^......his.*. Or if your regex syntax supports it, ^.{6}his.*.

Regular expressions Lexical Analysis

Why repeated strings such as
[wcw|w is a string of a's and b's]
cannot be denoted by regular expressions?
pls. give me detailed answer as i m new to lexical analysis.
Thanks ...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain nested structures as those languages can be described by a simple finite state machine. Simplified you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is, that no information whatsoever from previous states can be carried over to later states (a few characters further in the input). So if you have your symbol w you can't specify that the input must have exactly the same string w later in the sequence. Similarly you can't ensure that each opening paranthesis needs a closin paren as well (so regular expressions themselves are not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically only consisting of sequence, alternative (|) and repetition (*), everything else can be described with those operations.
However, usually regex engines allow grouping of certain sub-patterns into matches which can then be referenced or extracted later. Some engines even allow to use such a backreference in the search expression string itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly such use of backreferences can even yield languages that are not context-free.
Additional pointers:
This StackOverflow question
Wikipedia
It can be, you just can't assure that it's the same string of "a"s and "b"s because there's no way to retain the information acquired in traversing the first half for use in traversing the second.