Building a Regex Composer

I was reading the Java project idea described here:
The user gives examples of what he wants and does not want to match. The program tries to deduce a regex that fits the examples. Then it generates examples that would fit and not fit. The user corrects the examples it got wrong, and it composes a new regex. Iteratively, you get a regex that is close enough to what you need.
This sounds like a really interesting idea to me. Does anyone have an idea as to how to do this? My first idea was something like a genetic algorithm, but I would love some input from you guys.

Actually, this starts to look more and more like a compiler application. In fact, if I remember correctly, the Aho Dragon compiler book uses a regex example to build a DFA compiler. That's the place to start. This could be a really cool compiler project.
If that is too much, you can approach it as an optimization in several passes, refining the result further and further, though at first it will all be predefined algorithms:
First pass: we want to match Cat, Catches, Cans
Result: /Cat|Catches|Cans/
Second pass: look for similar starting conditions:
Result: /Ca(t|tches|ns)/
Third pass: look for similar ending conditions:
Result: /Ca(t|tche|n)s?/
Fourth pass: look for more refinements such as repetitions and negative conditions
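As a rough illustration of the second pass, here is a small Python sketch that factors a common prefix out of a plain alternation of the example strings (just the prefix step, not the full multi-pass refiner):

import os
import re

def factor_prefix(examples):
    # pull the longest common prefix out in front of the alternation
    prefix = os.path.commonprefix(examples)
    tails = [re.escape(e[len(prefix):]) for e in examples]
    return re.escape(prefix) + "(" + "|".join(tails) + ")"

print(factor_prefix(["Cat", "Catches", "Cans"]))   # Ca(t|tches|ns)

A similar pass working from the end of the strings would pull the trailing "s" out as an optional suffix.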

There exist algorithms that do exactly this for positive examples.
Regular expressions are equivalent to DFAs (Deterministic Finite Automata).
The strategy is almost always the same:
Look at Alergia (for the theory) and the MDI algorithm (for practical use) if generating a deterministic automaton is enough.
The Alergia algorithm and MDI are both described here:
http://www.info.ucl.ac.be/~pdupont/pdupont/pdf/icml2k.pdf
If you want to generate smaller models you can use another algorithm. The article describing it is here:
http://www.grappa.univ-lille3.fr/~lemay/publi/TCS02.ps.gz
His homepage is here:
http://www.grappa.univ-lille3.fr/~lemay
If you want to use negative examples, I suggest using a simple rule (coloring) that prevents two states of the DFA from being merged.
If you ask these people, I am sure they will share their source code.
I made the same kind of algorithm during my Ph.D. for probabilistic automata. That means you can associate a probability with each string, and I have written a C++ program that "learns" weighted automata.
Mainly, these algorithms work like this:
from positive examples: {abb, aba, abbb}
create the simplest DFA that accepts exactly these examples (a prefix tree):

-> x --a--> y --b--> z --a--> (u)
                     |
                     b
                     v
                    (t) --b--> (v)

x goes to state y by reading the letter 'a', for example. The states are x, y, z, t, u and v; a state written in parentheses, such as (t), is a final (accepting) state.
then "merge" states of the DFA. Here, for example, is the result after merging all the states reached after the first 'a' (y, z, t, u and v) into a single state Q:

-> x --a--> (Q) <---+
             |      |
             +-a,b--+      ( <- this is a loop :-) )

the new state (Q) is a final state, obtained by merging those five states, and you can read both letters a and b from it, looping back to itself.
Now the DFA can accept a(a|b)*, and it is easy to construct the regular expression from the DFA.
Which states to merge is a choice that makes the main difference between algorithms.
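A minimal Python sketch of the idea (assumed names; build the prefix tree acceptor, then merge states, folding whatever nondeterminism a merge creates). The state-picking below is hard-coded to reproduce the a(a|b)* example and is not Alergia or MDI, where choosing which states to merge is the whole point:

def build_pta(examples):
    # prefix tree acceptor: states are ints, 0 is the start state
    delta, final, count = {}, set(), 1
    for word in examples:
        state = 0
        for ch in word:
            if (state, ch) not in delta:
                delta[(state, ch)] = count
                count += 1
            state = delta[(state, ch)]
        final.add(state)
    return delta, final

def merge(delta, final, keep, drop):
    # merge state `drop` into `keep`, folding recursively to stay deterministic
    if drop in final:
        final.discard(drop)
        final.add(keep)
    for key, dst in list(delta.items()):        # redirect edges coming into drop
        if dst == drop:
            delta[key] = keep
    for (src, ch), dst in list(delta.items()):  # move edges going out of drop
        if src == drop:
            del delta[(src, ch)]
            if (keep, ch) in delta and delta[(keep, ch)] != dst:
                merge(delta, final, delta[(keep, ch)], dst)   # fold the conflict
            else:
                delta[(keep, ch)] = dst

delta, final = build_pta(["abb", "aba", "abbb"])
first = delta[(0, "a")]                         # the state reached after the first 'a'
for q in sorted(set(delta.values()) - {0, first}):
    merge(delta, final, first, q)
print(delta, final)   # a single looping final state: the DFA now accepts a(a|b)*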

"The program tries to deduce a regex that fits the examples"
I don't think that is a useful question on its own. You have to know semantically what you need to represent in order to deduce something. When you write a regex, you have a purpose: accepting URLs, accepting emails, extracting tokens from code, etc. I would redefine the question like this: given a knowledge base and a semantics for regular expressions, compute the smallest regex. This goes a step further, because now you have natural language trying to explain a general expression, and we all know how ambiguous that gets! You have to have some semantic explanation. Without one, the best you can do with examples is to compute a regex that covers all the strings from the Ok list.
Algorithm for coverage:
Populate Ok List
Populate Not ok List
Check for repetitions
Check for contradictions (the same string cannot be in both lists)
Create a Deterministic Finite Automaton (DFA) from the Ok list, where the strings from the list end in final states
Simplify the DFA by eliminating redundant states ([1] 4.4)
Convert the DFA to a regular expression ([1] 3.2.2)
Test against the Ok list and the Not-ok list
[1] J. Hopcroft, R. Motwani, J. D. Ullman, Introduction to Automata Theory, Languages, and Computation, 2nd Edition, Pearson Education.
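A much-simplified sketch of the coverage idea, where the DFA construction, simplification and DFA-to-regex conversion steps are collapsed into a plain alternation that still covers exactly the Ok list:

import re

def coverage_regex(ok, not_ok):
    ok, not_ok = set(ok), set(not_ok)                     # drop repetitions
    if ok & not_ok:
        raise ValueError("contradiction: %r is in both lists" % (ok & not_ok))
    pattern = "^(?:" + "|".join(sorted(map(re.escape, ok))) + ")$"
    assert all(re.match(pattern, s) for s in ok)          # test the Ok list
    assert not any(re.match(pattern, s) for s in not_ok)  # test the Not-ok list
    return pattern

print(coverage_regex(["cat", "cats"], ["dog"]))   # ^(?:cat|cats)$

Minimizing a real DFA instead of emitting the plain alternation is what gives the shorter expressions the P.S. below refers to.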
P.S.
I had some experience with genetic programming and I think it is really overkill for your problem. Since the objective function is really light, it is better evaluated on a single processor, and even then a run can take a lot of time. To get shorter expressions you just need to minimize the DFA. But a GA may still produce interesting results.

Maybe I'm a bit late, but there is a way to solve this problem by means of Genetic Programming.
Genetic Programming (GP) is an evolutionary machine learning technique in which a candidate solution for a given problem is represented as an abstract syntax tree.
Several studies have been published on how to use GP in order to find a regular expression that matches a given set of examples.
You can find the articles and the details here
A webapp that does this is hosted at regex.inginf.units.it.
The source code behind the application has been publicly released on GitHub.
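This is not the code behind that web app, just a sketch of the kind of fitness function such a GP system typically evolves candidate regexes against: reward matching the wanted examples, penalise matching the unwanted ones, and prefer shorter candidates.

import re

def fitness(candidate, wanted, unwanted):
    try:
        rx = re.compile(candidate)
    except re.error:
        return float("-inf")       # syntactically invalid individual
    hits = sum(bool(rx.fullmatch(s)) for s in wanted)
    false_hits = sum(bool(rx.fullmatch(s)) for s in unwanted)
    return hits - false_hits - 0.01 * len(candidate)   # small length penalty

print(fitness(r"Ca\w+", ["Cat", "Cans"], ["Dog"]))   # 1.95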

You may try to use a basic inference algorithm that has been used in other applications. I have implemented a very basic one based on building a state machine. However, it only accounts for positive samples. The source code is at http://github.com/mvaled/inferdtd
You could be interested in AutomataInferrer.py, which is very simple.

RegexBuilder seems to have many of the features you're looking for.

Related

How can I shortcut a slow conversion to CNF?

I'm using sympy to generate expressions like this:
for crowd in itertools.combinations(symbs, max_true + 1):
    exprs.append(functools.reduce(operator.and_, crowd))
unaltered = ~functools.reduce(operator.or_, exprs)
Later, I convert them to CNF:
altered = sympy.logic.boolalg.to_cnf(unaltered, simplify=True, force=True)
It takes a lot of computer time. I made a gist with more details:
https://gist.github.com/MatrixManAtYrService/501ea099826a5aeeacc9368710b059ec
Given that I'm generating the expressions with for loops, they're in a predictable format. Sympy (understandably) does the exhaustive thing and solves them "by hand", because it doesn't know how well behaved they are. A human who is looking at the unaltered/altered expressions can easily ascertain the pattern and just generate the CNF directly with a for loop.
That's easy enough in this case, but I expect to have more constraints.
I want to know if I'm in uncharted territory, or just failing to ask for help correctly.
Does Sympy have anything to help with this kind of thing? Is there another library I should explore? Is there a name for the "look at it and extrapolate based on a pattern" strategy that I'm proposing? Is there a list of algorithms for the task somewhere?
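For reference, for this particular at-most-k constraint the pattern is just De Morgan: ~Or(And(crowd)) over all the (k+1)-subsets equals And(Or(~s for s in crowd)) over the same subsets, which is already in CNF. A sketch of the direct construction (with placeholder symbs and max_true standing in for my real inputs):

import itertools
import sympy

symbs = sympy.symbols("a b c d")   # placeholder symbols
max_true = 1                       # at most one may be true

clauses = [
    sympy.Or(*[~s for s in crowd])
    for crowd in itertools.combinations(symbs, max_true + 1)
]
altered = sympy.And(*clauses)      # already CNF, no to_cnf call needed
print(altered)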

Best string-comparison algorithm for regex

Given a regex, I want to compare it with a list of other regex, and output a similarity score.
There are several edit-distance algorithms out there (e.g. Levenshtein distance), but they fail to compare regexes, e.g.:
R1: [a-z0-9]+
R2: [0-9]{1}[a-z0-9]+
Distance: 9
In the example above, both regexes are quite similar, yet they have quite a high edit distance. I suppose an approach using character n-grams would be more suitable for such cases.
What algorithm/approach would you consider for this problem?
It seems you're unlikely to improve upon the regular expression parsing algorithm present in an engine itself, because you're ultimately going to be making inferences about combinations of rules.
There are a number of open source regular expression engines, many listed on wikipedia, possibly including the one you're using.
Without having looked at the internals myself (not an insignificant caveat), my recommendation is to see whether it's possible to modify a regex engine (or leverage some pre-existing debugging or testing code) to output pertinent rule-processing metadata, sub-scores if you will, from which you can then calculate an aggregate. The engines ultimately do their work deterministically, so this is theoretically possible.
If it works, this will, amongst other things, enable you to classify constructs you define as similar with similar weights, and possibly to ignore others entirely.
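As a baseline for the character n-gram idea mentioned in the question (quite separate from instrumenting an engine), a tiny Jaccard-over-trigrams sketch on the pattern text:

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def similarity(r1, r2, n=3):
    a, b = ngrams(r1, n), ngrams(r2, n)
    return len(a & b) / len(a | b) if a | b else 1.0

print(similarity("[a-z0-9]+", "[0-9]{1}[a-z0-9]+"))   # roughly 0.54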

creating a regular expression for a list of strings

I have extracted a series of tables from the scientific literature which consist of columns each of which is a distinct type. Here is an example
I'd like to be able to automatically generate regular expressions for each column. Obviously there are trivial solutions such as .* so I would add the constraints that they use only:
[A-Z] [a-z] [0-9]
explicit punctuation (e.g. ',', ''')
"simple" quantifiers (e.g. {3,4})
A "best" answer for the table above would be:
[A-Z]{3}
[A-Za-z\s\.]+
\d{4}\sm
\d{2}°\d{2}'\d{2}"N,\d{2}°\d{2}'\d{2}"E
(speciosissima|intermediate|troglodytes)
(hf|sr)
\d{4}
Of course the 4th regex would break if we move outside the geographical area, but the software doesn't know that. The aim would be to collect many regexes for, say, "Coordinates" and generalize them, probably partially manually. The enums would only be created if there were a small number of distinct strings.
I'd be grateful for examples of (especially F/OSS) software that can do this, especially in Java. (It's similar to Google Refine.) I am aware of this question from 4 years ago, but that didn't really answer the question, and of the text2re site, which appears to be interactive.
NOTE: I note a vote to close as "too localised". This is a very general problem (the table given is only an example) as shown by Google/Freebase developing Refine to tackle the problem. It potentially refers to a very wide variety of tables (e.g. financial, journalism, etc.). Here's one with floating point values:
It would be useful to determine automatically that some authorities report ages in real numbers (e.g. not months, days) and use 2 digits of precision.
Your particular issue is a special case of "programming by demonstration". That is, given a bunch of input/output examples, you want to generate a program. For you, the inputs are strings and the output is whether each string belongs to the given column. In the end, you want to generate a program in the language of limited regular expressions that you proposed.
This particular instance of programming by demonstration seems closely related to Flash Fill, a recent project from MSR. There, instead of generating regular expressions to match data, they automatically generated programs to transform string data based on input/output examples.
I only skimmed through one of their papers, but I'll try to lay out what I understand here.
There are basically two important insights in this paper. The first was to design a small programming language to represent string transformations. Even using full-on regular expressions created too many possibilities to search through quickly. They designed their own abstract language for manipulating strings; however, your constraints (e.g. only using simple quantifiers) would probably play the same role as their custom language. This is largely possible because your particular problem has a somewhat smaller scope than theirs.
The second insight was on how to actually find programs in this abstract language that match with given input/output pairs. My understanding is that the key idea here is to use a technique called version space algebra. The rough idea about version space algebra is that you maintain a representation of the space of possible programs and repeatedly prune it by introducing additional constraints. The exact details of this process fall well outside my main interests, so you're better off reading something like this introduction to version space algebra, which includes some sample code as well.
They also have some clever approaches to rank different candidate programs and even guess which inputs might be problematic for an already-generated program. I saw a demo where they generated a program without giving it enough input/output pairs, and the program could actually highlight new inputs that were likely to be incorrect. This sort of ranking is very interesting, but requires some more sophisticated machine learning techniques and is probably not immediately applicable to your use case. Might still be interesting though. (Also, this might have been detailed in a different paper than the one I linked.)
So yeah, long story short, you can generate your expressions by feeding input/output examples into a system based on version space algebra. I hope that helps.
I'm currently researching the same (or something similar) (here). In general, this is called Grammar induction, or in case of regular expressions, it is induction of regular languages. There is the StaMinA competition about this field. Common algorithms are RPNI and Blue-Fringe.
Here is another related question. And here another one. And here another one.
My own approach (which I have partially prototyped) is heuristic and based on the premise that a given column will often have entries which are the same or similar character lengths and have similar punctuation. I would welcome comments (and resulting code will be Open Source).
flatten [A-Z] to 'A'
flatten [a-z] to 'a'
flatten [0-9] to '0'
flatten any other special codepoint sets (e.g. greek characters) to a single character (e.g. alpha)
The columns then become:
"AAA"
"Aaaaaaaaaa", "Aaaaaaaaaaaaa", "Aaa aaa Aaaaaa", etc.
"0000 a"
"00\u00b000'00"N,00\u00b000'00"E
...
...
"0000"
I shall then replace these by regular expressions such as
"([A-Z])([A-Z])([A-Z])"
...
"(\d)(\d)(\d)(\d)\s([0-9])"
and capture the individual characters into sets. This will show that (say) in 3. the final char is always "m" , so \d\d\d\d\s[m] and for 7. the value is [2][0][0][458].
For the columns that don't fit this model we search using "(.*)" and see if we can create useful sets (cols 5. and 6.) with a heuristic such as "at least 2 multiple strings and no more than 50% unique strings".
By using dynamic programming (cf. Kruskal) I hope to be able to align similar regexes, which will be useful for me, at least!
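Here is a rough Python prototype of the flattening idea (the run-collapsing is deliberately naive, and the class names are just the ones listed above):

import re

def flatten(s):
    s = re.sub(r"[A-Z]", "A", s)
    s = re.sub(r"[a-z]", "a", s)
    s = re.sub(r"[0-9]", "0", s)
    return s

def generalize(flat):
    # collapse each run of the same placeholder into a quantified class
    out = []
    for m in re.finditer(r"(.)\1*", flat):
        ch, run = m.group(1), m.group(0)
        cls = {"A": "[A-Z]", "a": "[a-z]", "0": r"\d", " ": r"\s"}.get(ch, re.escape(ch))
        out.append(cls + ("{%d}" % len(run) if len(run) > 1 else ""))
    return "".join(out)

print(flatten("1234 m"))                 # 0000 a
print(generalize(flatten("1234 m")))     # \d{4}\s[a-z]

Tightening [a-z] to [m] once the captured character sets are known is the refinement step described above.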

What's the Time Complexity of Average Regex algorithms?

I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.
I'm not so good at algorithmic analysis though and don't understand how a regex compares to say, a basic linear search. I'm asking because on the surface it seems like a linear array search. (If the regex is simple.)
Where could I go to learn more about implementing a regex engine?
This is one of the most popular outlines: Regular Expression Matching Can Be Simple And Fast. Running a DFA-compiled regular expression against a string is indeed O(n), but it can require up to O(2^m) construction time/space (where m = regular expression size).
Are you familiar with the terms Deterministic/Non-Deterministic Finite Automaton?
Real regular expressions (by "real" I'm referring to those regexes that recognize regular languages, not the regexes with backreferences etc. that almost every programming language includes) can be converted into a DFA/NFA, and both can be implemented in a mechanical way in a programming language (an NFA can be converted into a DFA).
What you have to do is:
Find a way to convert a regex into an automaton
Implement the recognition of the automaton in the programming language of your preference
That way, given a regex, you can convert it to a DFA and run it to see whether or not it matches a given text.
This can be implemented in O(n), because a DFA never moves backward on its input (unlike a Turing machine), so it either matches the string or it doesn't. That is assuming you don't count overlapping matches; otherwise you have to go back and start matching again...
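To make the O(n) claim concrete, here is a toy sketch of step 2: once the regex has been compiled to a DFA (the table below is hand-written for ab*, not produced by a converter), matching is one table lookup per input character.

dfa = {
    ("start", "a"): "rest",
    ("rest", "b"): "rest",
}
accepting = {"rest"}

def matches(text):
    state = "start"
    for ch in text:                   # one lookup per character: O(n)
        state = dfa.get((state, ch))
        if state is None:             # no transition: reject immediately
            return False
    return state in accepting

print(matches("abbb"), matches("ba"))   # True False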
The classic regular expression matcher can be implemented in a way which is fast in practice but has really bad worst-case behaviour (the standard backtracking implementation) or in a way which has guaranteed reasonable worst-case behaviour (keeping it as an NFA/DFA simulation). The backtracking implementation is the one that can be extended to support lots of extra matching constructs and flags, which make use of the fact that it is basically a backtracking search.
Examples of the standard approach are everywhere (e.g. built into Perl). There is an example that claims good worst case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.
If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.

Create a program that inputs a regular expression and outputs strings that satisfy that regular expression

I think that the title accurately summarizes my question, but just to elaborate a bit.
Instead of using a regular expression to verify properties of existing strings, I'd like to use the regular expression as a way to generate strings that have certain properties.
Note: The function doesn't need to generate every string that satisfies the regular expression (because that would be an infinite number of strings for a lot of regexes). Just a sampling of the many valid strings is sufficient.
How feasible is something like this? If the solution is too complicated/large, I'm happy with a general discussion/outline. Additionally, I'm interested in any existing programs or libraries (.NET) that do this.
Well, a regex is convertible to a DFA, which can be thought of as a graph. To generate a string given this DFA graph, you'd just find a path from a start state to an end state. You'd just have to think about how you want to handle cycles (maybe traverse every cycle at least once to get a sampling? n times?), but I don't see why it wouldn't work.
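A toy sketch of that path-finding idea, using a hand-written DFA for [AB]C? (not a general regex-to-DFA converter): random-walk the graph and stop with some probability whenever an accepting state is reached, so that cycles don't run forever.

import random

dfa = {"s0": {"A": "s1", "B": "s1"}, "s1": {"C": "s2"}, "s2": {}}
accepting = {"s1", "s2"}

def sample(max_len=10, stop_prob=0.3):
    state, out = "s0", []
    while len(out) < max_len:
        if state in accepting and random.random() < stop_prob:
            break                                   # accept here and stop
        moves = dfa.get(state)
        if not moves:
            break                                   # dead end
        ch = random.choice(sorted(moves))
        out.append(ch)
        state = moves[ch]
    return "".join(out) if state in accepting else None

print([sample() for _ in range(5)])   # e.g. ['A', 'BC', 'B', 'AC', 'A']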
This utility on UtilityMill will invert some simple regexen. It is based on this example from the pyparsing wiki. The test cases for this program are:
[A-EA]
[A-D]*
[A-D]{3}
X[A-C]{3}Y
X[A-C]{3}\(
X\d
foobar\d\d
foobar{2}
foobar{2,9}
fooba[rz]{2}
(foobar){2}
([01]\d)|(2[0-5])
([01]\d\d)|(2[0-4]\d)|(25[0-5])
[A-C]{1,2}
[A-C]{0,3}
[A-C]\s[A-C]\s[A-C]
[A-C]\s?[A-C][A-C]
[A-C]\s([A-C][A-C])
[A-C]\s([A-C][A-C])?
[A-C]{2}\d{2}
#|TH[12]
#(#|TH[12])?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9]))?
#(#|TH[12]|AL[12]|SP[123]|TB(1[0-9]?|20?|[3-9])|OH(1[0-9]?|2[0-9]?|30?|[4-9]))?
(([ECMP]|HA|AK)[SD]|HS)T
[A-CV]{2}
A[cglmrstu]|B[aehikr]?|C[adeflmorsu]?|D[bsy]|E[rsu]|F[emr]?|G[ade]|H[efgos]?|I[nr]?|Kr?|L[airu]|M[dgnot]|N[abdeiop]?|Os?|P[abdmortu]?|R[abefghnu]|S[bcegimnr]?|T[abcehilm]|Uu[bhopqst]|U|V|W|Xe|Yb?|Z[nr]
(a|b)|(x|y)
(a|b) (x|y)
This can be done by traversing the DFA (includes pseudocode) or else by walking the regex's abstract-syntax tree directly or converting to NFA first, as explained by Doug McIlroy: paper and Haskell code. (He finds the NFA approach to go faster, but he didn't compare it to the DFA.)
These all work on regular expressions without back-references -- that is, 'real' regular expressions rather than Perl regular expressions. To handle the extra Perl features it'd be easiest to add on a post-filter.
Added: code for this in Python, by Peter Norvig and me.
Since it is trivially possible to write a regular expression that matches no possible strings, and I believe it is also possible to write a regular expression for which calculating a matching string requires an exhaustive search of possible strings of all lengths, you'll probably need to put an upper bound on the search when requesting an answer.
The easiest way to implement but definitely most CPU time intensive approach would be to simply brute force it.
Set up a character table with the characters that your string should contain and then just sequentially generate strings and do a Regex.IsMatch on them.
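Sketched in Python rather than .NET (with re.fullmatch standing in for Regex.IsMatch), the brute-force idea looks something like this:

import itertools
import re

def brute_force(pattern, alphabet="ABC", max_len=3):
    rx = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            s = "".join(chars)
            if rx.fullmatch(s):            # keep only full matches
                yield s

print(list(brute_force(r"[A-C]{1,2}", max_len=2)))   # ['A', 'B', 'C', 'AA', ...]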
I, personally, believe that this is the holy grail of reg-ex. If you could implement this -- even only 3/4 working -- I have no doubt that you'd be rich in about 5 minutes.
All joking aside, I'm not sure that what you are truly going after is feasible. Regex is a very open, flexible language, and giving the computer enough sample input to truly and accurately find what you need is probably not feasible.
If I'm proven wrong, kudos to that developer.
To look at this from a different perspective, this is almost (not quite) like giving a computer its output and having it, based on that, write a program for you. That's a little overboard, but it kind of illustrates my point.