Algorithm to match list of regular expressions - c++

I have two algorithmic questions for a project I am working on. I have thought about these, and have some suspicions, but I would love to hear the community's input as well.
(1) Suppose I have a string, and a list of N regular expressions (actually they are wildcard patterns representing a subset of full regex functionality). I want to know whether the string matches at least one of the regular expressions in the list. Is there a data structure that can allow me to match the string against the list of regular expressions in sublinear (presumably logarithmic) time?
(2) This is an extension of the previous problem. Suppose I have the same situation: a string and a list of N regular expressions, only now each of the regular expressions is paired with an offset within the string at which the match must begin (or, if you prefer, each of the regular expressions must match a substring of the given string beginning at the given offset).
To give an example, suppose I had the string:
This is a test string
and the regex patterns and offsets:
(a) his.* at offset 0
(b) his.* at offset 1
The algorithm should return true. Although regex (a) does not match the string beginning at offset 0, regex (b) does match the substring beginning at offset 1 ("his is a test string").
Is there a data structure that can allow me to solve this problem in sublinear time?
One possibly useful piece of information is that often, many of the offsets in the list of regular expressions are the same (i.e. often we are matching the substring at offset X many times). This may be useful to leverage the solution to problem #1 above.
Thank you very much in advance for any suggestions you may have!

I will assume that you really have the power of regular expressions.
To determine whether a string is matched by one of expressions e_1, e_2, ..., e_n, just match against the expression e_1 + e_2 + ... + e_n (sometimes the + operator is written as |).
Given expression-offset pairs (e_1, o_1), ..., (e_n, o_n) and a string, you can check whether there is i such that the string is matched by expression e_i at offset o_i by matching against the expression .{o_1}e_1 + ... + .{o_n}e_n.
Depending on the form of the individual expressions, you can get sublinear performance (not in general though).
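The construction above can be sketched in a few lines. This is only an illustration in Python, assuming the patterns are valid in Python's `re` syntax; the helper names `build_offset_matcher` and `matches_any` are made up for this example. Note that `re` is a backtracking engine, so it shows the combined expression, not the sublinear behaviour.

```python
import re

def build_offset_matcher(pairs):
    """Combine (pattern, offset) pairs into one alternation.

    Each pattern e_i with offset o_i becomes .{o_i}(e_i); the pieces
    are then joined with '|', as described above.
    """
    parts = [f"(?:.{{{off}}}(?:{pat}))" for pat, off in pairs]
    if not parts:
        return re.compile(r"(?!)")  # empty list: a pattern that never matches
    # DOTALL so that the skipped prefix may contain newlines too.
    return re.compile("|".join(parts), re.DOTALL)

def matches_any(matcher, text):
    # re.match anchors at the start of the string, so the .{o} prefix
    # forces each pattern to begin exactly at its offset.
    return matcher.match(text) is not None

matcher = build_offset_matcher([("his.*", 0), ("his.*", 1)])
print(matches_any(matcher, "This is a test string"))
```

With the two pairs from the question, the second alternative succeeds because "his is a test string" starts at offset 1.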

If your expressions are sufficiently simple (wildcard patterns are), AND your set of expressions are predetermined, i.e. only the input to be matched changes, THEN you may construct a finite state machine that matches the union of your expressions, i.e., the expression "(r1)|(r2)|...".
Constructing that machine takes time and space at least O(N) (but I guess it is not exponential, which is the worst case for regular expressions in general). Matching is then O(length(input)), independent of N.
OTOH, if your set of expressions is to be part of the program's input, then there is no sublinear algorithm, simply because each expression must be considered.

(1) Combine all the regular expressions as a big union: (r1)|(r2)|(r3)|...
(2) For each regex with offset n add n dots to the beginning plus an anchor. So his.* at offset 6 becomes ^......his.*. Or if your regex syntax supports it, ^.{6}his.*.

Related

Can I write a regular expression that checks two lengths are equal?

I want to match strings with two numbers of equal length, like: 42-42, 0-2, 12345-54321.
I don't want to match strings where the two numbers have different lengths, like: 42-1, 000-0000.
The two parts (separated by the hyphen) must have the same length.
I wonder if it is possible to do a regexp like [0-9]{n}-[0-9]{n} with n variable but equal?
If there is no clean way to do that in one pattern (I must put that in the pattern attribute of an HTML form input), I will do something like /\d-\d|\d{2}-\d{2}|\d{3}-\d{3}|<etc>/ up to the maximum length (16 in my case).
This is not possible with regular expressions in the strict sense, because the language is not regular (type-3 grammar): a finite automaton cannot remember how many digits came before the hyphen. It is a context-free language (type-2 grammar), which is out of reach unless the regex engine supports recursion.
The higher grammar levels (type-1 and type-0) call for more powerful machines still, up to a Turing machine (or something equivalent, like your programming language).
More about this can be found here:
https://en.wikipedia.org/wiki/Chomsky_hierarchy#The_hierarchy
Using a programming language, you need to count the first sequence of digits, check for the minus and then check whether the same number of digits follows.
With an engine that supports recursion (PCRE, for example), a recursive pattern such as ^(\d(?1)\d|-)$ can do it: each level of recursion consumes one digit on each side, and the innermost level consumes the minus. The pattern attribute of an HTML input uses JavaScript regular expression syntax, which has no recursion, so that does not help here.
So you need to write your own, non-regular-expression check code.
You should probably split the String around the separator and compare the length of both parts.
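A minimal sketch of that split-and-compare check, written in Python for illustration (the function name `equal_length_pair` is hypothetical, and it assumes only ASCII digits should be accepted):

```python
def equal_length_pair(text):
    """Return True for strings like '42-42': digits, one hyphen, same number of digits."""
    left, sep, right = text.partition("-")
    if sep != "-":
        return False  # no separator at all
    # isascii() rules out non-ASCII digit characters that isdigit() would accept.
    both_numeric = (left.isascii() and left.isdigit()
                    and right.isascii() and right.isdigit())
    return both_numeric and len(left) == len(right)

for candidate in ["42-42", "0-2", "12345-54321", "42-1", "000-0000"]:
    print(candidate, equal_length_pair(candidate))
```

A second hyphen ends up in the right-hand part and fails the digit test, so inputs such as 1-2-3 are rejected as well.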
The tool of choice in regex for specifying "the same thing as before" is the back-reference; however, back-references refer to the matched value rather than the matching pattern: there is no way of using a back-reference to .{3} to match any 3 characters.
However, if you only need to validate a finite number of lengths, it can be (painfully) done with alternation:
\d-\d will match up to 1 character on both sides of the separator
\d-\d|\d{2}-\d{2} will match up to 2 characters on both sides of the separator
...
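Writing the alternation out by hand is tedious, so it may be easier to generate it. The following Python sketch builds the pattern string for lengths 1 up to a maximum (16 in the question); the helper name `equal_length_pattern` is an assumption for this example, and [0-9] is used in place of \d to keep the match to ASCII digits.

```python
import re

def equal_length_pattern(max_len):
    """Build [0-9]{1}-[0-9]{1}|[0-9]{2}-[0-9]{2}|... up to max_len."""
    return "|".join(f"[0-9]{{{n}}}-[0-9]{{{n}}}" for n in range(1, max_len + 1))

PATTERN = equal_length_pattern(16)

# fullmatch mirrors the whole-value matching of an HTML pattern attribute.
for candidate in ["42-42", "0-2", "12345-54321", "42-1", "000-0000"]:
    print(candidate, re.fullmatch(PATTERN, candidate) is not None)
```

The generated string can then be pasted into the pattern attribute.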

Compiler matching string to regex, using NFA

I was reading about compilers, the chapter about lexical analyzers (scanners), and I'm puzzled by the following statement:
For an input string X and a regular expression R, the complexity for finding a match using non-deterministic finite automata is:
O(len R * len X)
How can the complexity be polynomial in len R?
I'm under the impression that it depends exponentially on len R, because whenever we have a character which may appear a variable number of times (i.e. followed by the * symbol), we must test all possible numbers of occurrences. If we have multiple characters which appear a variable number of times, we must check all possibilities (by backtracking).
Where am I wrong?
"we must check all possibilities (by backtracking)."
Not necessarily by backtracking. There are many ways to implement an NFA. By moving through the input linearly and transitioning to multiple states at the same time (storing the set of active states in an O(1)-lookup structure), you get exactly the mentioned runtime, since the number of states in the NFA is linear in the length of the regex.
See also the popular article Regular Expression Matching Can Be Simple And Fast.
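To make the state-set idea concrete, here is a small Python sketch. It is not a full regex engine: it assumes a deliberately tiny pattern language (literal characters, '.', and a postfix '*' on a single character), matches the whole input, and the function names are invented for this example. Each input character is processed once against at most len R states, which gives the O(len R * len X) bound.

```python
def parse(pattern):
    """Split the pattern into (symbol, starred) items."""
    items = []
    i = 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        items.append((pattern[i], starred))
        i += 2 if starred else 1
    return items

def closure(items, states):
    """Add the states reachable without input, i.e. by skipping starred items."""
    result = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        if s < len(items) and items[s][1] and s + 1 not in result:
            result.add(s + 1)
            stack.append(s + 1)
    return result

def nfa_match(pattern, text):
    """State s means 'items[:s] are done'; state len(items) is accepting."""
    items = parse(pattern)
    current = closure(items, {0})
    for ch in text:
        following = set()
        for s in current:
            if s == len(items):
                continue  # accepting state has no outgoing transitions
            symbol, starred = items[s]
            if symbol == "." or symbol == ch:
                # A starred item may repeat, so it stays in place.
                following.add(s if starred else s + 1)
        current = closure(items, following)
        if not current:
            return False  # no live state left
    return len(items) in current

print(nfa_match("a*a*a*a*a*b", "a" * 30))
```

The last call is the kind of input that makes a naive backtracking matcher try exponentially many splits; with the state set it is rejected after a single pass.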

Is there a way to negate a regular expression?

Given a regular expression R that describes a regular language (no fancy backreferences), is there an algorithmic way to construct a regular expression R* that describes the language of all words except those described by R? It should be possible, as Wikipedia says:
The regular languages are closed under the various operations, that is, if the languages K and L are regular, so is the result of the following operations: […] the complement ¬L
For example, given the alphabet {a,b,c}, the inverse of the language (abc*)+ is (a|(ac|b|c).*)?
As DPenner has already pointed out in the comments, the inverse of a regular expression can be exponentially larger than the original expression. This makes inverting regular expressions unsuitable for implementing negative partial expression syntax for searching purposes. Is there an algorithm that preserves the O(n*m) runtime characteristic (where n is the size of the regex and m is the length of the input) of regular expression matching and allows for negated subexpressions?
Unfortunately, the answer given by nhahtdh in the comments is as good as we can do (so far). Deciding whether a given regular expression generates all strings (the universality problem) is PSPACE-complete. Since every problem in NP is also in PSPACE, an efficient solution to the universality problem would imply that P=NP.
If there were an efficient solution to your problem, would you be able to resolve the universality problem? Sure you would.
Use your efficient algorithm to generate a regular expression for the negation;
Determine whether the resulting regular expression generates the empty set.
Note that the problem "given a regular expression, does it generate the empty set" is fairly straightforward:
The regular expression {} generates the empty set.
(r + s) generates the empty set iff both r and s generate the empty set.
(rs) generates the empty set iff either r or s generates the empty set.
Nothing else generates the empty set.
Basically, it's pretty easy to tell whether a regular expression generates the empty set: just start evaluating the regular expression.
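The rules above translate almost directly into a recursive function. This is a minimal sketch in Python, assuming a hypothetical tuple encoding of regular expressions; the tag names ("empty", "eps", "sym", "alt", "cat", "star") are chosen just for this example.

```python
def generates_empty_set(r):
    """Decide whether the regular expression r denotes the empty language.

    r is a nested tuple:
      ("empty",)       the empty set {}
      ("eps",)         the empty string
      ("sym", c)       a single symbol
      ("alt", r1, r2)  union, written (r + s) above
      ("cat", r1, r2)  concatenation, written (rs) above
      ("star", r1)     Kleene star
    """
    kind = r[0]
    if kind == "empty":
        return True
    if kind == "alt":
        return generates_empty_set(r[1]) and generates_empty_set(r[2])
    if kind == "cat":
        return generates_empty_set(r[1]) or generates_empty_set(r[2])
    # "eps", "sym" and "star" always generate at least one string.
    return False

example = ("cat", ("sym", "a"), ("alt", ("empty",), ("empty",)))
print(generates_empty_set(example))
```

The function visits each node of the expression once, so it runs in time linear in the size of the expression.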
(Note that while the above procedure is efficient in terms of the output length, it might not be efficient in terms of the input length, if the output length grows more than polynomially in the input length. However, if that were the case, we'd have the same result anyway, i.e., that your algorithm isn't really efficient, since it would take exponentially many steps to generate an exponentially longer output from a given input).
Wikipedia says: ... if there exists at least one regex that matches a particular set then there exist an infinite number of such expressions. We can deduce from this statement that there is an infinite number of expressions that describe the language of all words except those described by R.
Again (as @nhahtdh also tried to explain), the simplest algorithm to address this question is to extend the scope of evaluation outside the context of the regular expression language itself. That is: match the strings you want to exclude (which represent a finite subset to work with) by using the original regular expression and then treat any failure to match as an actual match (out of an infinite set of other possibilities). So, if the result of the match is negative, your candidate strings are a subset of the valid solutions.
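In code, this amounts to negating the result of an ordinary match instead of constructing a complement expression. A minimal Python sketch, where the helper name `in_complement` is made up for this example:

```python
import re

def in_complement(pattern, text):
    """True when text is NOT described by pattern (whole-string match)."""
    return re.fullmatch(pattern, text) is None

for word in ["ab", "abcc", "abab", "aba", ""]:
    print(repr(word), in_complement(r"(abc*)+", word))
```

The cost is that of one normal match, but the negation applies only to the expression as a whole, not to a subexpression inside a larger pattern.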

Java tool for matching multiple regular expressions with priorities to multiple strings

I have an unlimited sequence of strings and numerous regular expressions ordered by priorities. For each string in the sequence I have to find the first matching regular expression and the matched substring. Strings are not very long (<1Kb) while the number of regular expressions may vary from hundreds to thousands.
I'm looking for a Java tool that would do this job efficiently. I guess the technique should be building DFA ahead.
My current option is JFLEX. The problem I can't work around in JFLEX is that its rules have no priorities and JFLEX looks for the rule matching the longest part of text.
My question is whether my problem could be solved with JFLEX? If not, can you suggest another Java tool/technique that would do?
You could use Java regexps. Build up the alternatives into an RE string with each alternative surrounded with '(' and ')+?' and separated by '|', with the highest priority REs first. The ')+?' suffix is a reluctant quantifier, so each sub-RE stops after its first successful repetition, and '|' alternatives are evaluated left-to-right at each starting position, so the highest priority REs will be tried first.
For example, given a string of "zeroonetwothreefour"
'(one)+?|(onetwo)+?' will match 'one'
'(onetwo)+?|(one)+?' will match 'onetwo'
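'(twothree)+?|(onetwothree)+?' will match 'onetwothree'
Note especially that in the last example the lower-priority 'onetwothree' wins, because the engine reports the leftmost match and 'twothree' only occurs later in the target string. The ordering of alternatives decides only between matches that start at the same position.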
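The same construction can be tried out quickly. The sketch below uses Python's `re` as a stand-in for a backtracking engine with ordered alternation (`java.util.regex` behaves the same way for these patterns). The helper names are hypothetical, and it assumes that none of the input patterns contains capturing groups of its own, so that the group number identifies the pattern.

```python
import re

def build_prioritized(patterns):
    """Wrap each pattern as (p)+? and join with '|', highest priority first."""
    return re.compile("|".join(f"({p})+?" for p in patterns))

def first_match(compiled, text):
    """Return (priority index, matched substring), or None if nothing matches."""
    m = compiled.search(text)
    if m is None:
        return None
    # lastindex is the number of the group that took part in the match;
    # groups are numbered from 1 in priority order.
    return m.lastindex - 1, m.group(0)

text = "zeroonetwothreefour"
print(first_match(build_prioritized(["one", "onetwo"]), text))
print(first_match(build_prioritized(["onetwo", "one"]), text))
```

Returning the index alongside the substring gives both pieces of information the question asks for in a single pass over the string.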

Regular expressions Lexical Analysis

Why repeated strings such as
{wcw | w is a string of a's and b's}
cannot be denoted by regular expressions?
Please give me a detailed answer, as I am new to lexical analysis.
Thanks ...
Regular expressions in their original form describe regular languages/grammars. Those cannot contain nested structures as those languages can be described by a simple finite state machine. Simplified you can picture that as if each word of the language grows strictly from left to right (or right to left), where repeating structures have to be explicitly defined and are static.
What this means is that no information whatsoever from previous states can be carried over to later states (a few characters further in the input). So if you have your symbol w you can't specify that the input must have exactly the same string w later in the sequence. Similarly you can't ensure that each opening parenthesis has a matching closing parenthesis as well (so regular expressions themselves are not even a regular language and thus cannot be described by regular expressions :-)).
In theoretical computer science we worked with a very restricted set of regex operators, basically only consisting of sequence, alternative (|) and repetition (*), everything else can be described with those operations.
However, usually regex engines allow grouping of certain sub-patterns into matches which can then be referenced or extracted later. Some engines even allow to use such a backreference in the search expression string itself, thereby allowing the expression to describe more than just a regular language. If I remember correctly such use of backreferences can even yield languages that are not context-free.
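As an illustration of such a backreference, here is a short Python sketch (Python's `re` module supports `\1`). The pattern captures w as a run of a's and b's and then requires the identical text again after the c; the name `is_wcw` is made up for this example.

```python
import re

# ([ab]*) captures w; \1 must then repeat exactly the captured text.
WCW = re.compile(r"([ab]*)c\1")

def is_wcw(text):
    return WCW.fullmatch(text) is not None

for word in ["abcab", "abcba", "c", "abc"]:
    print(word, is_wcw(word))
```

This works only because the engine remembers the captured substring, which is exactly the kind of memory a finite state machine lacks.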
Additional pointers:
This StackOverflow question
Wikipedia
It can be; you just can't ensure that it's the same string of "a"s and "b"s, because there's no way to retain the information acquired in traversing the first half for use in traversing the second.