Producing all possible matches of a regular expression - regex

Given a regular expression, I want to produce the set of strings that that regular expression would match. It is important to note that this set would not be infinite, because there would be a maximum length for each string. Are there any well-known algorithms to do this? Are there any research papers I could read to gain insight into this problem?
Thanks.
P.S. Would this sort of question be more appropriate on the Theoretical CS Stack Exchange?

Are there any well known algorithms in place to do this?
In the Perl eco-system the Regexp::Genex CPAN module does this.
In Python, the sre_yield module generates the matching words. Regex inverter also does this.
A recursive algorithm is described here (link1, link2), and several libraries that do this in Java are mentioned here.
For generation of random words/strings that match a given regex, there is xeger (Python).
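For a quick taste, here is a minimal sketch using the sre_yield package mentioned above (assuming it is installed via pip install sre_yield); its AllStrings class behaves like a lazy sequence of every matching string:

import sre_yield

# AllStrings lazily enumerates all strings matched by the pattern.
values = sre_yield.AllStrings(r'[a-c]{1,2}')
print(len(values))       # 12: three 1-letter strings plus nine 2-letter ones
print(list(values)[:5])  # the first few matches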
Are there any research papers I could read to gain insight into this problem?
Yes, the following papers are available on counting the strings that would match a regex (or obtaining generating functions for them):
Counting occurrences for a finite set of words: an inclusion-exclusion approach by F. Bassino, J. Clément, J. Fayolle, and P. Nicodème (2007): paper, slides
Regexpcount, a symbolic package for counting problems on regular expressions and words by Pierre Nicodème (2003): paper, link, link, code

Related

Minimization and comparison of regular expression patterns

There are many ways to write a regular expression (regex) pattern that would answer the same question and yield the same result. For instance, the patterns ^aa$|^aaa$ and ^a{2,3}$ are equivalent in their results for the question: match all inputs that start and end with the character a, where a appears either 2 or 3 times.
Is there an algorithm for shortening a regular expression (regex) pattern? E.g., the algorithm accepts ^aa$|^aaa$ and prints ^a{2,3}$?
From a theoretical perspective, when given 2 regex patterns that yield exactly the same result, would their minimized deterministic finite automata be the same?
From a practical/computational perspective, when given 2 regexes that yield exactly the same result, should one regex be favored over the other?
There exist algorithms to minimize a regex (depending on the metric you intend to minimize, e.g. regex length), but from what I've seen here, you're looking at a PSPACE-hard problem with no general solution: this and this site give different results, for example.
Yes, that is exactly the purpose of minimizing an automaton.
Yes, since from an implementation point of view regexes are not tied to their associated minimal automata. Here's an interesting set of examples and explanations.
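If you just want to compare two patterns in practice, a brute-force sanity check is easy to write. A minimal sketch in Python (agreement up to a length bound proves nothing, but any disagreement disproves equivalence):

import itertools
import re

def agree_up_to(p1, p2, alphabet='ab', max_len=6):
    # Compare the two patterns on every string over `alphabet` up to
    # length `max_len`; return the first witness of a mismatch, if any.
    r1, r2 = re.compile(p1), re.compile(p2)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = ''.join(chars)
            if bool(r1.fullmatch(s)) != bool(r2.fullmatch(s)):
                return False, s
    return True, None

print(agree_up_to('aa|aaa', 'a{2,3}'))  # (True, None)
print(agree_up_to('aa|aaa', 'a{2,4}'))  # (False, 'aaaa')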

Is there any way to optimize a generic regular expression?

I code in Eclipse, and when I do a Ctrl-F to find some string, I see that apart from the standard options of whole word and case sensitive, there is also an option for regular expression search (it is there in Notepad++ too).
I have tried it once or twice, and generally the results are almost instantaneous. But then, the code files are not humongous; the biggest ones are no more than 500 lines long, with most lines less than half full. Is there any way to optimize such that any user-supplied regex will run much faster on a large piece of text, say 10-15 MB in size?
I can't think of any method, because no standard search algorithm like Rabin-Karp or suffix trees would apply here!
I have no idea how regular expressions are implemented in Eclipse or why they are so slow there. Here are just some thoughts:
First of all, there are a few concepts you should know: nondeterministic finite automata (NFA) and deterministic finite automata (DFA). In theory, regular expressions, NFAs, and DFAs are equivalent, which means they have exactly the same power to describe languages (sets of character sequences). This implies that any one of them can be converted to another (see this site).
A regular expression can be implemented by converting it to a DFA, and using a DFA to match text takes only linear time (many string matching algorithms, e.g. KMP, are actually special DFAs). However, the trouble is that most modern regular expression implementations introduce features like backreferences, which make it impossible to rely on a DFA alone.
So, if discarding those complex features is acceptable, implementing a fast regular expression engine that does the matching in linear time is feasible. You may find more in this article.
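To make the linear-time claim concrete, here is a minimal sketch (a hand-built toy, not Eclipse's implementation): once the pattern has been compiled into an explicit DFA transition table, matching costs one table lookup per input character.

# Hand-built DFA for the toy language a(a|b)* as a transition table.
# Matching is a single dictionary lookup per character: O(n) overall.
dfa = {
    ('start', 'a'): 'accept',
    ('accept', 'a'): 'accept',
    ('accept', 'b'): 'accept',
}

def dfa_match(text, start='start', finals={'accept'}):
    state = start
    for ch in text:
        state = dfa.get((state, ch))
        if state is None:        # no transition: reject immediately
            return False
    return state in finals

print(dfa_match('aabba'))  # True
print(dfa_match('baa'))    # False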
What makes you think suffix tree isn't a suitable algorithm for this problem? From http://en.wikipedia.org/wiki/Suffix_tree:
Once [the suffix tree is] constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc.
I think a modified Boyer–Moore string search algorithm would also be possible.

Building a Regex Composer

I was reading the Java project idea described here:
The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.
This sounds like a really interesting idea to me. Does anyone have an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.
Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Dragon Book (Aho et al.) uses a regex example to build a DFA. That's the place to start. This could be a really cool compiler project.
If that is too much, you can approach it as an optimization in several passes to refine it further and further, but it will be all predefined algo's at first:
First pass: Want to match Cat, Catches cans
Result: /Cat|Catches|Cans/
Second Pass: Look for similar starting conditions:
Result: /Ca(t|tches|ans)/
Second Pass: Look for similar ending conditions:
Result: /Ca(t|tch|an)s*/
Third Pass: Look for more refinements like repetitions and negative conditions
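A crude sketch of the prefix-factoring pass, with a hypothetical helper name, could look like this in Python:

import os

def factor_prefix(words):
    # Factor out the longest common prefix shared by all words.
    # os.path.commonprefix works character-wise on any strings, not just paths.
    prefix = os.path.commonprefix(words)
    rest = [w[len(prefix):] for w in words]
    return prefix + '(' + '|'.join(rest) + ')' if prefix else '|'.join(words)

print(factor_prefix(['Cat', 'Catches', 'Cans']))  # Ca(t|tches|ns)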
There exist algorithms that do exactly this for positive examples.
Regular expressions are equivalent to DFAs (deterministic finite automata).
The strategy is almost always the same; it is outlined further down.
Look at Alergia (for the theory) and the MDI algorithm (for real usage) if generating a deterministic automaton is enough.
The Alergia algorithm and MDI are both described here:
http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf
If you want to generate smaller models you can use another algorithm. The article describing it is here:
http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz
The author's homepage is here:
http://www.grappa.univ-lille3.fr/~lemay
If you want to use negative examples, I suggest using a simple rule (coloring) that prevents two states of the DFA from being merged.
If you ask these people, I am sure they will share their source code.
I built the same kind of algorithm during my Ph.D., for probabilistic automata. That means you can associate a probability with each string, and I made a C++ program that "learns" weighted automata.
Mainly, these algorithms work like this:
From the positive examples {abb, aba, abbb},
create the simplest DFA that accepts exactly these examples:
-> x -- a --> y -- b --> z -- b --> (t) -- b --> (v)
                          \
                           a --> (w)
x goes to state y by reading the letter 'a', for example.
The states are x, y, z, t, v, and w; parentheses, as in (t), mark a final (accepting) state.
then "merge" states of the DFA: (here for example the result after merging states y and t.
                +--- a,b ---+
                v           |     (<- this is a loop :-) )
 -> x -- a --> (y,t) -------+
The new state (y,t) is a terminal state obtained by merging y and t, and you can read the letters a and b from it.
Now the DFA accepts a(a|b)*, and it is easy to construct the regular expression from the DFA.
Which states to merge is a choice that makes the main difference between algorithms.
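As a concrete starting point, here is a minimal Python sketch (not Alergia or MDI themselves) that builds the prefix-tree acceptor, the initial "simplest DFA" that state-merging algorithms start from:

def build_pta(examples):
    # Prefix-tree acceptor: one path per example, sharing common prefixes.
    # State 0 is the start state; states reached by full examples are final.
    transitions = {}          # (state, symbol) -> state
    finals = set()
    next_state = 1
    for word in sorted(examples):
        state = 0
        for ch in word:
            if (state, ch) not in transitions:
                transitions[(state, ch)] = next_state
                next_state += 1
            state = transitions[(state, ch)]
        finals.add(state)
    return transitions, finals

transitions, finals = build_pta(['abb', 'aba', 'abbb'])
print(transitions)  # shared prefix 'ab', then branches for 'a' and 'b'
print(finals)       # states where an example ends

Merging algorithms such as Alergia then walk over pairs of states of this acceptor and merge the ones their compatibility test allows.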
The program tries to deduce a regex that fits the examples
I don't think it's a useful question to ask. You have to know semantically what you need to represent in order to deduce something. When you write a regex, you have a purpose: accepting URLs, accepting emails, extracting tokens from code, etc. I would redefine the question as: given a knowledge base and a semantics for regular expressions, compute the smallest regex. This goes a step further, because you have natural language trying to explain a general expression, and we all know how ambiguous that gets! You have to have some semantic explanation. Without that, the best thing you can do for examples is to compute a regex that covers all strings from the OK list.
Algorithm for coverage:
Populate the OK list
Populate the not-OK list
Check for repetitions
Check for contradictions (the same string cannot be in both lists)
Create a deterministic finite automaton (DFA) from the OK list, where the states reached by strings from the list are final states
Simplify the DFA by eliminating equivalent states ([1] 4.4)
Convert the DFA to a regular expression ([1] 3.2.2); a sketch of this step follows below
Test against the OK list and the not-OK list
[1] Introduction to Automata Theory, Languages, and Computation. J. Hopcroft, R. Motwani, J.D. Ullman, 2nd edition, Pearson Education.
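For the DFA-to-regex step, here is a rough Python sketch of the classical state-elimination construction ([1] 3.2.2); the function name and edge encoding are my own, and the output is correct but not minimized:

def dfa_to_regex(states, start, finals, edges):
    # edges: {(p, q): regex string labelling the transition p -> q}
    edges = dict(edges)
    S, A = '_S', '_A'                    # fresh start / accept states
    edges[(S, start)] = ''               # epsilon edges
    for f in finals:
        edges[(f, A)] = ''

    def union(a, b):
        if a is None: return b
        if b is None: return a
        return '(?:%s|%s)' % (a, b)

    for r in list(states):               # eliminate one state at a time
        loop = edges.get((r, r))
        star = '(?:%s)*' % loop if loop else ''
        ins = [(p, w) for (p, q), w in edges.items() if q == r and p != r]
        outs = [(q, w) for (p, q), w in edges.items() if p == r and q != r]
        for p, win in ins:
            for q, wout in outs:         # route every path around r
                edges[(p, q)] = union(edges.get((p, q)), win + star + wout)
        edges = {k: v for k, v in edges.items() if r not in k}
    return edges.get((S, A), '')

# DFA for a(a|b)*: states x (start) and y (final).
print(dfa_to_regex(['x', 'y'], 'x', ['y'],
                   {('x', 'y'): 'a', ('y', 'y'): '(?:a|b)'}))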
P.S.
I had some experience with genetic programming, and I think it's really overkill for your problem. Since the objective function is really lightweight, it's better to evaluate on a single processor, and this can still take a lot of time. To get shorter expressions you just need to minimize the DFA. But GP may possibly produce interesting results.
Maybe I'm a bit late, but there is a way to solve this problem by means of genetic programming.
Genetic programming (GP) is an evolutionary machine learning technique in which a candidate solution for a given problem is represented as an abstract syntax tree.
Several studies have been published on how to use GP in order to find a regular expression that matches a given set of examples.
You can find the articles and the details here.
A webapp that does this is hosted at regex.inginf.units.it.
The source code behind the application has been publicly released on GitHub.
You may try to use a basic inference algorithm that has been used in other applications. I have implemented a very basic one based on building a state machine. However, it only accounts for positive samples. The source code is at http://github.com/mvaled/inferdtd
You could be interested in AutomataInferrer.py, which is very simple.
RegexBuilder seems to have many of the features you're looking for.

Detect if a regexp is exponential

This article shows that there are some regexps that are O(2^n) when backtracking.
The example is (x+x+)+y.
When attempting to match a string like xxxx...p, the engine is going to backtrack for a while before figuring out that it can't match.
Is there a way to detect such a regexp?
Thanks.
If your regexp engine exhibits exponential runtime behavior for (x+x+)+y, then it is broken, because a DFA or NFA can recognize this pattern in linear time:
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" | egrep "(x+x+)+y"
echo "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxy" | egrep "(x+x+)+y"
both answer immediately.
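You can watch the blowup for yourself with a backtracking engine; Python's re module is one (a small timing demo, numbers will vary by machine):

import re
import time

# Python's re is a backtracking engine, so (x+x+)+y degrades exponentially
# on strings of x's with no final y: each extra x roughly doubles the time.
for n in (16, 20, 24):
    text = 'x' * n
    t0 = time.perf_counter()
    re.match(r'(x+x+)+y', text)          # fails, after much backtracking
    print(n, round(time.perf_counter() - t0, 3), 'seconds')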
In fact, there are only a few cases (like backreferences) where backtracking is really needed (mainly because a regexp with a backreference is no longer a regular expression in the language-theoretic sense). A capable implementation should switch to backtracking only in these corner cases.
In fairness, DFAs have a dark side too, because some regexps have exponential size requirements, but a size constraint is easier to enforce than a time constraint, and the huge DFA runs in linear time on the input, so it's a better bargain than a small backtracker choking on a couple of x's.
You should really read Russ Cox's excellent article series about the implementation of regexps (and the pathological behavior of backtracking): http://swtch.com/~rsc/regexp/
To answer your question about decidability: you can't. Because there is no single backtracking implementation for regexps. Every implementation has its own strategies to deal with exponential growth in its algorithm for certain cases, and does not cover others. One rule might fit here and be catastrophic there.
UPDATE:
For example, one implementation could contain an optimizer which uses algebraic transformations to simplify regexps before executing them: (x+x+)+y is the same as xxx*y, which shouldn't be a problem for any backtracker. But the same optimizer wouldn't recognize the next expression, and the problem is there again. Here someone describes how to craft a regexp which fools Perl's optimizer:
http://perlgeek.de/blog-en/perl-tips/in-search-of-an-exponetial-regexp.html
No I don't think so, but you can use these guidelines:
If it contains two quantifiers that are open-ended at the high end, and they are nested, then it might be O(2^n).
If it does not contain two such quantifiers then I think it cannot be O(2^n).
Quantifiers that can cause this are: *, + and {k,}.
Also note that the worst case complexity of evaluating a regular expression might be very different from the complexity on typical strings and that the complexity depends on the specific regular expression engine.
Any regex without backreferences can be matched in linear time, though many regex engines out there in the real world don't do it that way: at least, many regex engines that are plugged into programming-language runtime environments support backreferences, and they don't switch to a more efficient execution model when no backreferences are present.
There's no easy way to find out how much time a regex with backreferences is going to consume.
You could detect and reject nested repetitions using a regex parser, which amounts to enforcing a star height of at most 1. I've just written a module to compute star heights and reject patterns where it exceeds 1, using a regex parser from npm.
$ node safe.js '(x+x+)+y'
false
$ node safe.js '(beep|boop)*'
true
$ node safe.js '(a+){10}'
false
$ node safe.js '\blocation\s*:[^:\n]+\b(Oakland|San Francisco)\b'
true
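If you want the same kind of check without npm, here is a rough Python equivalent (my own sketch, not the module above), leaning on the parse tree produced by the standard library's sre_parse:

import sre_parse   # note: exposed as re._parser on very recent Pythons

def star_height(parsed):
    # Depth of nested repetitions; anything with *, +, or a {k,}/{k,m}
    # upper bound above 1 counts as one repetition level.
    height = 0
    for op, arg in parsed:
        if op in (sre_parse.MAX_REPEAT, sre_parse.MIN_REPEAT):
            lo, hi, body = arg
            level = 1 if (hi == sre_parse.MAXREPEAT or hi > 1) else 0
            height = max(height, star_height(body) + level)
        elif op is sre_parse.SUBPATTERN:
            height = max(height, star_height(arg[-1]))   # group body
        elif op is sre_parse.BRANCH:
            for alternative in arg[1]:
                height = max(height, star_height(alternative))
    return height

def is_safe(pattern):
    return star_height(sre_parse.parse(pattern)) <= 1

print(is_safe('(x+x+)+y'))      # False
print(is_safe('(beep|boop)*'))  # True
print(is_safe('(a+){10}'))      # False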

Create a program that inputs a regular expression and outputs strings that satisfy that regular expression

I think that the title accurately summarizes my question, but just to elaborate a bit.
Instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.
Note: The function doesn't need to generate every string that satisfies the regular expression (because that would be an infinite number of strings for a lot of regexes). Just a sampling of the many valid strings is sufficient.
How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.NET) that do this.
Well, a regex is convertible to a DFA, which can be thought of as a graph. To generate a string given this DFA-graph, you'd just find a path from a start state to an end state. You'd just have to think about how you want to handle cycles (maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.
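The path-finding idea is only a few lines once the DFA is an adjacency structure; a toy sketch (hand-built DFA for a(a|b)*, hypothetical names) that samples strings by random walks:

import random

# Hand-built DFA for a(a|b)*: adjacency lists of (symbol, next_state).
edges = {'x': [('a', 'y')], 'y': [('a', 'y'), ('b', 'y')]}
finals = {'y'}

def sample(start='x', max_steps=8):
    # Random walk from the start state; a step budget keeps cycles finite,
    # and we sometimes stop early once we sit in an accepting state.
    state, out = start, []
    for _ in range(random.randint(1, max_steps)):
        if not edges.get(state):
            break
        sym, state = random.choice(edges[state])
        out.append(sym)
        if state in finals and random.random() < 0.3:
            break
    return ''.join(out) if state in finals else sample(start, max_steps)

print([sample() for _ in range(5)])   # e.g. ['a', 'ab', 'abba', ...]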
This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:
[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
#|TH[12]
#(#|TH[12])?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
This can be done by traversing the DFA (includes pseudocode) or else by walking the regex's abstract-syntax tree directly or converting to NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)
These all work on regular expressions without back-references -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features it'd be easiest to add on a post-filter.
Added: code for this in Python, by Peter Norvig and me.
Since it is trivially possible to write a regular expression that matches no possible strings, and I believe it is also possible to write a regular expression for which calculating a matching string requires an exhaustive search of possible strings of all lengths, you'll probably need an upper bound on requesting an answer.
The easiest approach to implement, but definitely the most CPU-intensive, would be to simply brute-force it.
Set up a character table with the characters that your string should contain, and then just sequentially generate strings and do a Regex.IsMatch on them.
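In Python rather than .NET (with re.fullmatch standing in for an anchored Regex.IsMatch), the brute-force idea looks like this:

import itertools
import re

def brute_force(pattern, alphabet='ab', max_len=4):
    # Enumerate every string over `alphabet` in length order and keep
    # the ones the pattern matches in full. Exponential in max_len!
    rx = re.compile(pattern)
    for n in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=n):
            s = ''.join(chars)
            if rx.fullmatch(s):
                yield s

print(list(brute_force('a(a|b)*', max_len=3)))
# ['a', 'aa', 'ab', 'aaa', 'aab', 'aba', 'abb']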
I, personally, believe that this is the holy grail of reg-ex. If you could implement this -- even only 3/4 working -- I have no doubt that you'd be rich in about 5 minutes.
All joking aside, I'm not sure that what you are truly going after is feasible. Reg-ex is a very open, flexible language, and giving the computer enough sample input to truly and accurately find what you need is probably not feasible.
If I'm proven wrong, I wish kudos to that developer.
To look at this from a different perspective, this is almost (not quite) like giving a computer its output and having it, based on that, write a program for you. This is a little overboard, but it kind of illustrates my point.