Intersection of two regular expressions in Golang using Ragel

The template of the function is as follows:
func GetIntersection(firstRegex string, secondRegex string) string {
    ...
}
I'm trying to use Ragel to get the intersection of two regular expressions, though I'm not sure Ragel is the right tool. My last resort is implementing the conversions from regex to DFA and from DFA to regex, as well as the intersection of two DFAs, myself, but I would rather avoid that. I would highly appreciate pointers to any reliable libraries that solve the problem.

Ragel has an intersection operator (&). You can produce the intersection of any two regular expressions, so long as they are expressed in the Ragel syntax. There is no freely available automatic ragel-to-regex translation. This is the kind of thing I've been producing for clients privately. Anyhow, see the manual for more information on intersection.

Related

using OCaml to simplify arithmetic expression

I am using OCaml to deal with some arithmetic expressions. Every arithmetic expression is a string like "1+2*(2-5)". I want to know how to use OCaml to eliminate useless parentheses.
For example, if we get a string like "(2*(1-8))", we should output "2*(1-8)".
Thanks.

OCaml is just a programming language, not a symbolic algebra system. So you would solve this in OCaml just as in any general purpose language.
A full blown solution would be to parse the expression into a tree, then walk the tree to produce the output. For this you need to analyze your string lexically (for which you can probably use the Str module), and then parse the tokens. You can code your own parser pretty easily, or you could really go full force and use a parser generator like ocamlyacc.
For the relatively simple problem of reparenthesizing an arithmetic expression you can use a variant of the "shunting yard" algorithm, which in essence calculates a canonical, unparenthesized (RPN) form of an expression.
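As a rough illustration of the "parse into a tree, then walk the tree" approach, here is a minimal sketch. It is written in Python rather than OCaml, and it uses a small precedence-climbing parser (a close relative of shunting yard) instead of ocamlyacc. The names tokenize, parse, emit and simplify are made up for this example. It assumes only non-negative integers, the four binary operators, and well-formed input, and it treats a+(b-c) and a*(b/c) as safe to unparenthesize, which holds for exact arithmetic but not for integer division.

```python
import re

PREC = {'+': 1, '-': 1, '*': 2, '/': 2}

def tokenize(text):
    # Numbers, operators and parentheses; whitespace is skipped.
    return re.findall(r'\d+|[-+*/()]', text)

def parse(tokens):
    """Build a tree: a number string, or a tuple (op, left, right)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def atom():
        tok = advance()
        if tok == '(':
            node = expr(1)
            advance()  # consume ')' (input is assumed well formed)
            return node
        return tok

    def expr(min_prec):
        left = atom()
        while peek() in PREC and PREC[peek()] >= min_prec:
            op = advance()
            right = expr(PREC[op] + 1)  # all operators are left-associative
            left = (op, left, right)
        return left

    return expr(1)

def emit(node, parent_prec=0, is_right=False, parent_op=None):
    """Print the tree, adding parentheses only where they change the meaning."""
    if isinstance(node, str):
        return node
    op, left, right = node
    p = PREC[op]
    text = emit(left, p, False, op) + op + emit(right, p, True, op)
    needs = p < parent_prec or (
        p == parent_prec and is_right and parent_op in '-/')
    return '(' + text + ')' if needs else text

def simplify(text):
    return emit(parse(tokenize(text)))

print(simplify("(2*(1-8))"))
```

The parentheses never appear in the tree itself; they are regenerated from operator precedence while printing, which is why the redundant ones disappear.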

Counterpart of regular expressions for parsing nested structures

Regular expressions are a standard tool used for parsing strings across many languages. However, their scope of applicability is limited. Regular expressions can only match a list. There is no way to describe arbitrarily deep nested structures using regular expressions. Question: what technology/framework, as widely used and as standard as regular expressions are, can match tree structures (produce an AST)?

Regular expressions describe a finite-state automaton.
Since the late 1960's, the "bread and butter" of parsing (though not necessarily the "state of the art") has been push-down automata generated by parser generators according to "LR" algorithms like LALR(1).
The connection to regular expressions is this: the parsing machine does in fact use rules very similar to regular expressions in order to recognize viable prefixes. The "shift" state transitions among the "core LR(0) items" constitute a finite automaton, and can be described by a regular expression. The recursion is handled thanks to the semantic action of pushing symbols onto a stack when doing the "shifts", and removing them ("reducing"). Reductions rewrite a portion of the stack, and perform a "goto" to another state. This type of goto, together with the stack, is absent in the regular expression automaton.
Parsing Expression Grammars are also related to regular expressions. Regular expressions themselves can be endowed with recursion. Firstly, we can take pieces of regular expressions and give them names, and then construct bigger regular expressions by writing expressions which invoke these names. (Such a feature is found in the lex tool, where you can define a named expression like letters [A-Za-z]+ and refer to it later as {letters}.) Now suppose you allow circular references, like letters [A-Za-z]{letters}?. You now have recursion; the only problem is to adjust the model in order to implement it.
Implementations of so-called "regular expressions" in various modern languages and environments in fact support recursion. Perl-compatible regular expressions (PCRE) support it, for instance.
Expressions that feature recursion or backreferencing are not handled by the classic NFA compilation route (possibly converted to a DFA); they cannot be.
How the above letters recursion can be handled is with actual recursion. The ? operator can be implemented as a function which tries to match its respective argument object. If it succeeds, then it consumes whatever it has matched, otherwise it consumes nothing. That is to say, the regular expression can be converted to a syntax tree, and interpreted "as is" rather than compiled to a state machine (or trivially compiled to functions corresponding to the nodes of the tree), and such interpretation can naturally handle recursion. The interpretation then constitutes, effectively, a syntax-directed recursive-descent parser. (Note how I avoided left recursion in defining letters to make that example compatible with this approach).
Example: parenthesis-matching regex:
par-match := ({par-match})|
This gets compiled to a tree:
           branch-op    <-- "par-match" name points at this node
           /       \
   catenate-op    <empty>
     /      \
   "("    catenate-op
           /       \
   {par-match}     ")"
This can then be converted to a recursive descent parser, or interpreted directly.
Pattern matching starts by invoking the top-level "branch-op". This operator simply tries all of the alternatives. Suppose the input is empty. Then the left alternative will fail: it demands an open parenthesis. So then the right alternative will succeed: empty matches empty. (The operators either "fail" or indicate "success" and consume input.)
But suppose your input is (()). The left catenate-op will in turn invoke its left subtree, which matches and consumes the left parenthesis, leaving ()). It will then invoke its right subtree, another catenate-op. This catenate-op matches its left subtree, which triggers recursion into the top level via the named par-match references. That recursion will match and consume (), leaving ). The catenate-op then invokes its right subtree which matches ). Control returns up to branch-op. (Though the left side of branch-op matched something, branch-op must still try the other alternative; more than one branch can match, and some can match longer than others.)
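The walk-through above can be sketched as a small recursive matcher. This is only an illustration in Python, not a general regex engine: the function names par_match and matches are invented here, and the tree for par-match := ({par-match})| is hard-coded rather than built from a parsed expression. Because branch-op has to try every alternative, the matcher returns the set of all positions where a match could end.

```python
def par_match(text, pos=0):
    """Return every index at which par-match can stop when started at pos."""
    ends = set()

    # Left alternative of branch-op: "(" {par-match} ")"
    if pos < len(text) and text[pos] == '(':
        # The recursive reference re-enters the top-level rule.
        for mid in par_match(text, pos + 1):
            if mid < len(text) and text[mid] == ')':
                ends.add(mid + 1)

    # Right alternative of branch-op: <empty>, which consumes nothing.
    ends.add(pos)
    return ends

def matches(text):
    # The whole input matches if some alternative consumed all of it.
    return len(text) in par_match(text)

for sample in ["", "(())", "(()", "()()"]:
    print(repr(sample), matches(sample))
```

Note that this pattern only describes nesting, so "()()" is rejected even though its parentheses are balanced.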
This is closely related to how Parsing Expression Grammars work.
Practically speaking, the recursive definition could be encoded into the regex syntax somehow. Say we invent some new operator like (?name:definition) which means "match definition, which is allowed to contain invocations of itself via name". The invocation syntax could be (*name), so that we can write the par-match example as (?par-match:\((*par-match)\)|). The combinations (? and (* are invalid under "classic" regex syntax and so we can use them for extension.
As a final note, regexes correspond to grammars. That is the fundamental connection between regexes and parsing. That is to say, regexes correspond to a particular subset of grammars that describe only regular languages. An example of a grammar which describes a regular language:
S -> A | B
B -> b
A -> A a | c
Although there is A -> A ... recursion, this is still regular, and corresponds to the regex ca*|b, which is just a more compact way to denote the same language. The grammar lets us notate languages that aren't regular and for which we can't write a regex, but as we have seen, we can extend the regex notation and semantics to express some of these things. Regular expressions aren't separate from grammars. The two aren't counterparts, but rather one is a special case or subset of the other.

Parser generators like Yacc, Bison, and derivatives are what you're after. They aren't as widespread as regular expressions because they generate actual C code. There are translations, like Jison for example, which implement the Yacc/Bison syntax using JavaScript. I know there are similar tools for other languages.
I get the impression Parsing expression grammar systems are up and coming though.

Regular Expression Vs. String Parsing

At the risk of opening a can of worms and getting negative votes, I find myself needing to ask:
When should I use Regular Expressions and when is it more appropriate to use String Parsing?
And I'm going to need examples and reasoning as to your stance. I'd like you to address things like readability, maintainability, scaling, and probably most of all performance in your answer.
I found another question Here that only had 1 answer that even bothered giving an example. I need more to understand this.
I'm currently playing around in C++, but regular expressions are in almost every higher-level language, and I'd also like to know how different languages use and handle regular expressions, but that's more of an afterthought.
Thanks for the help in understanding it!
Edit: I'm still looking for more examples and talk on this but the response so far has been great. :)

It depends on how complex the language you're dealing with is.
Splitting
This is great when it works, but only works when there are no escaping conventions.
It does not work for CSV for example because commas inside quoted strings are not proper split points.
foo,bar,baz
can be split, but
foo,"bar,baz"
cannot.
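A quick way to see the difference, sketched in Python (the variable names are only for illustration): a plain split cuts the quoted field in two, while the standard csv module, which knows the quoting convention, keeps it whole.

```python
import csv

plain = 'foo,bar,baz'
quoted = 'foo,"bar,baz"'

# Splitting on commas is fine when no field can contain a comma.
plain_fields = plain.split(',')

# With quoting, the comma inside the quotes is wrongly treated as a separator.
naive_fields = quoted.split(',')

# A CSV-aware reader honours the quotes and returns two fields.
csv_fields = next(csv.reader([quoted]))

print(plain_fields)
print(naive_fields)
print(csv_fields)
```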
Regular
Regular expressions are great for simple languages that have a "regular grammar". Perl 5 regular expressions are a little more powerful due to back-references but the general rule of thumb is this:
If you need to match brackets ((...), [...]) or other nesting like HTML tags, then regular expressions by themselves are not sufficient.
You can use regular expressions to break a string into a known number of chunks -- for example, pulling out the month/day/year from a date. They are the wrong job for parsing complicated arithmetic expressions though.
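For instance, a date can be broken into its three chunks with one small pattern. This is a sketch in Python that assumes a month/day/year layout with slashes and a four-digit year; the names DATE and parse_date are hypothetical, and no range checking (such as month <= 12) is attempted.

```python
import re

# Three capture groups: month, day, year (assumed format: M/D/YYYY).
DATE = re.compile(r'(\d{1,2})/(\d{1,2})/(\d{4})')

def parse_date(text):
    """Return (month, day, year) as integers, or None if the text does not fit."""
    m = DATE.fullmatch(text)
    if m is None:
        return None
    month, day, year = map(int, m.groups())
    return month, day, year

print(parse_date("7/4/1976"))
print(parse_date("July 4"))
```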
Obviously, if you write a regular expression, walk away for a cup of coffee, come back, and can't easily understand what you just wrote, then you should look for a clearer way to express what you're doing. Email addresses are probably at the limit of what one can correctly & readably handle using regular expressions.
Context free
Parser generators and hand-coded pushdown/PEG parsers are great for dealing with more complicated input where you need to handle nesting so you can build a tree or deal with operator precedence or associativity.
Context free parsers often use regular expressions to first break the input into chunks (spaces, identifiers, punctuation, quoted strings) and then use a grammar to turn that stream of chunks into a tree form.
The rule of thumb for CF grammars is
If regular expressions are insufficient but all words in the language have the same meaning regardless of prior declarations then CF works.
Non context free
If words in your language change meaning depending on context, then you need a more complicated solution. These are almost always hand-coded solutions.
For example, in C,
#ifdef X
typedef int foo
#endif
foo * bar
If foo is a type, then foo * bar is the declaration of a foo pointer named bar. Otherwise it is a multiplication of a variable named foo by a variable named bar.

It should be Regular Expression AND String Parsing..
You can use both of them to your advantage! Programmers often try to write a SINGLE regular expression for parsing a text and then find it very difficult to maintain. You should use each as and when required.
The regex engine is FAST. A simple match takes less than a microsecond. But it's not recommended for parsing HTML.

How to determine if a regex is orthogonal to another regex?

I guess my question is best explained with a (simplified) example.
Regex 1:
^\d+_[a-z]+$
Regex 2:
^\d*$
Regex 1 will never match a string where regex 2 matches.
So let's say that regex 1 is orthogonal to regex 2.
As many people asked what I meant by orthogonal I'll try to clarify it:
Let S1 be the (infinite) set of strings where regex 1 matches.
S2 is the set of strings where regex 2 matches.
Regex 2 is orthogonal to regex 1 iff the intersection of S1 and S2 is empty.
The regex ^\d_a$ would not be orthogonal to regex 1, as the string '2_a' is matched by both of them.
How can it be programmatically determined, if two regexes are orthogonal to each other?
Best case would be some library that implements a method like:
/**
* @return True if the regex is orthogonal (i.e. "intersection is empty"), False otherwise or Null if it can't be determined
*/
public Boolean isRegexOrthogonal(Pattern regex1, Pattern regex2);

By "Orthogonal" you mean "the intersection is the empty set" I take it?
I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
Then again, I'm a theorist...

I would construct the regular expression for the intersection, then convert to a regular grammar in normal form, and see if it's the empty language...
That seems like shooting sparrows with a cannon. Why not just construct the product automaton and check if an accept state is reachable from the initial state? That'll also give you a string in the intersection straight away without having to construct a regular expression first.
I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem.
I only know of a way to do it which involves creating a DFA from a regexp, which is exponential time (in the degenerate case). It's reducible to the halting problem, because everything is, but the halting problem is not reducible to it.
If the last, then you can use the fact that any RE can be translated into a finite state machine. Two finite state machines are equal if they have the same set of nodes, with the same arcs connecting those nodes.
So, given what I think you're using as a definition for orthogonal, if you translate your REs into FSMs and those FSMs are not equal, the REs are orthogonal.
That's not correct. You can have two DFAs (FSMs) that are non-isomorphic in the edge-labeled multigraph sense, but accept the same languages. Also, were that not the case, your test would check whether two regexps accepted non-identical, whereas OP wants non-overlapping languages (empty intersection).
Also, be aware that the \1, \2, ..., \9 construction is not regular: it can't be expressed in terms of concatenation, union and * (Kleene star). If you want to include back substitution, I don't know what the answer is. Also of interest is the fact that the corresponding problem for context-free languages is undecidable: there is no algorithm which takes two context-free grammars G1 and G2 and returns true iff L(G1) ∩ L(G2) ≠ Ø.

It's been two years since this question was posted, but I'm happy to say this can be determined now simply by calling the "genex" program here: https://github.com/audreyt/regex-genex
$ ./binaries/osx/genex '^\d+_[a-z]+$' '^\d*$'
$
The empty output means there are no strings that match both regexes. If they have any overlap, it will output the entire list of overlaps:
$ runghc Main.hs '\d' '[123abc]'
1.00000000 "2"
1.00000000 "3"
1.00000000 "1"
Hope this helps!

The fsmtools can do all kinds of operations on finite state machines, your only problem would be to convert the string representation of the regular expression into the format the fsmtools can work with. This is definitely possible for simple cases, but will be tricky in the presence of advanced features like look{ahead,behind}.
You might also have a look at OpenFst, although I've never used it. It supports intersection, though.

Excellent point on the \1, \2 bit... that's context free, and so not solvable. Minor point: Not EVERYTHING is reducible to Halt... Program Equivalence for example.. – Brian Postow
[I'm replying to a comment]
IIRC, a^n b^m a^n b^m is not context free, and so (a\*)(b\*)\1\2 isn't either since it's the same. ISTR { ww | w ∈ L } not being "nice" even if L is "nice", for nice being one of regular, context-free.
I modify my statement: everything in RE is reducible to the halting problem ;-)

I finally found exactly the library that I was looking for:
dk.brics.automaton
Usage:
/**
* @return true if the two regexes will never both match a given string
*/
public boolean isRegexOrthogonal( String regex1, String regex2 ) {
Automaton automaton1 = new RegExp(regex1).toAutomaton();
Automaton automaton2 = new RegExp(regex2).toAutomaton();
return automaton1.intersection(automaton2).isEmpty();
}
It should be noted that the implementation doesn't and can't support complex RegEx features like back references. See the blog post "A Faster Java Regex Package" which introduces dk.brics.automaton.

You can maybe use something like Regexp::Genex to generate test strings to match a specified regex and then use the test string on the 2nd regex to determine whether the 2 regexes are orthogonal.

Proving that one regular expression is orthogonal to another can be trivial in some cases, such as mutually exclusive character groups in the same locations. For any but the simplest regular expressions this is a nontrivial problem. For serious expressions, with groups and backreferences, I would go so far as to say that this may be impossible.

I believe kdgregory is correct: you're using Orthogonal to mean Complement.
Is this correct?

Let me start by saying that I have no idea how to construct such an algorithm, nor am I aware of any library that implements it. However, I would not be at all surprised to learn that no such thing exists for general regular expressions of arbitrary complexity.
Every regular expression defines a regular language of all the strings that can be generated by the expression, or if you prefer, of all the strings that are "matched by" the regular expression. Think of the language as a set of strings. In most cases, the set will be infinitely large. Your question asks whether the intersections of the two sets given by the regular expressions is empty or not.
At least to a first approximation, I can't imagine a way to answer that question without computing the sets, which for infinite sets will take longer than you have. I think there might be a way to compute a limited set and determine when a pattern is being elaborated beyond what is required by the other regex, but it would not be straightforward.
For example, just consider the simple expressions (ab)* and (aba)*b. What is the algorithm that will decide to generate abab from the first expression and then stop, without checking ababab, abababab, etc. because they will never work? You can't just generate strings and check until a match is found because that would never complete when the languages are disjoint. I can't imagine anything that would work in the general case, but then there are folks much better than me at this kind of thing.
All in all, this is a hard problem. I would be a bit surprised to learn that there is a polynomial-time solution, and I would not be at all surprised to learn that it is equivalent to the halting problem. Although, given that regular expressions are not Turing complete, it seems at least possible that a solution exists.

I would do the following:
convert each regex to a FSA, using something like the following structure:
struct FSANode
{
    bool accept;
    Map<char, FSANode> links;
}
List<FSANode> nodes;
FSANode start;
Note that this isn't trivial, but for simple regex shouldn't be that difficult.
Make a new Combined Node like:
class CombinedNode
{
    CombinedNode(FSANode left, FSANode right)
    {
        this.left = left;
        this.right = right;
    }
    Map<char, CombinedNode> links;
    bool valid { get { return !left.accept || !right.accept; } }
    public FSANode left;
    public FSANode right;
}
Build up links based on following the same char on the left and right sides, and you get two FSANodes which make a new CombinedNode.
Then start at CombinedNode(leftStart, rightStart), and find the spanning set, and if there are any non-valid CombinedNodes, the set isn't "orthogonal."

Convert each regular expression into a DFA. From the accept state of one DFA create an epsilon transition to the start state of the second DFA. You will in effect have created an NFA by adding the epsilon transition. Then convert the NFA into a DFA. If the start state is not the accept state, and the accept state is reachable, then the two regular expressions are not "orthogonal." (Since their intersection is non-empty.)
There are known procedures for converting a regular expression to a DFA, and for converting an NFA to a DFA. You could look at a book like "Introduction to the Theory of Computation" by Sipser for the procedures, or just search around the web. No doubt many undergrads and grads had to do this for one "theory" class or another.

I spoke too soon. What I said in my original post would not work out, but there is a procedure for what you are trying to do if you can convert your regular expressions into DFA form.
You can find the procedure in the book I mentioned in my first post: "Introduction to the Theory of Computation" 2nd edition by Sipser. It's on page 46, with details in the footnote.
The procedure would give you a new DFA that is the intersection of the two DFAs. If the new DFA had a reachable accept state then the intersection is non-empty.
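To make the product construction concrete, here is a minimal sketch in Python. It assumes the two regexes have already been converted to DFAs by hand; the tuple layout (start, accepting states, transition dict) and the name intersection_witness are choices made for this example only. The two machines below are hand-built DFAs for (ab)* and (aba)*b over the alphabet {a, b}, with missing transitions treated as a dead state.

```python
from collections import deque

def intersection_witness(dfa1, dfa2):
    """Breadth-first search over pairs of states (the product automaton).

    Returns a shortest string accepted by both DFAs, or None if the
    intersection is empty.
    """
    start1, accept1, delta1 = dfa1
    start2, accept2, delta2 = dfa2
    start = (start1, start2)
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (s1, s2), word = queue.popleft()
        if s1 in accept1 and s2 in accept2:
            return word  # a reachable pair where both sides accept
        for (state, symbol), t1 in delta1.items():
            if state != s1:
                continue
            t2 = delta2.get((s2, symbol))
            if t2 is None:
                continue  # the second DFA has no move on this symbol
            pair = (t1, t2)
            if pair not in seen:
                seen.add(pair)
                queue.append((pair, word + symbol))
    return None

# (ab)*   : state 0 is the start and the only accepting state.
AB_STAR = (0, {0}, {(0, 'a'): 1, (1, 'b'): 0})
# (aba)*b : state 3 is the only accepting state.
ABA_STAR_B = (0, {3}, {(0, 'a'): 1, (1, 'b'): 2, (2, 'a'): 0, (0, 'b'): 3})
# b+      : used here as an example that is disjoint from (ab)*.
B_PLUS = (0, {1}, {(0, 'b'): 1, (1, 'b'): 1})

print(intersection_witness(AB_STAR, ABA_STAR_B))
print(intersection_witness(AB_STAR, B_PLUS))
```

Since only finitely many state pairs exist, the search always terminates, which is what makes the emptiness question decidable for true regular expressions.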

Efficiently querying one string against multiple regexes

Let's say that I have 10,000 regexes and one string, and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster, more efficient way to do it?
EDIT:
I have tried substituting it with DFA's (lex)
The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.

This is the way lexers work.
The regular expressions are converted into a single nondeterministic automaton (NFA) and possibly transformed into a deterministic automaton (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here, they are called "lexer generator" and there are solutions that work with most of the languages.
You don't say which language you are using. For C programmers I would suggest having a look at the re2c tool. Of course the traditional (f)lex is always an option.

I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string they match. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithm. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.

We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.

Martin Sulzmann has done quite a bit of work in this field.
He has a HackageDB project, explained briefly here, which uses partial derivatives and seems to be tailor-made for this.
The language used is Haskell and thus will be very hard to translate to a non-functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them; instead it is based on symbolic manipulation of the regexes themselves.
Also, the code is very much experimental, and Martin is no longer a professor but is in 'gainful employment' (1), so he may be uninterested in or unable to supply any help or input.
(1) This is a joke: I like professors; the less the smart ones try to work, the more chance I have of getting paid!

10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful would be: if you had 1000 tests that would only return true if "ello" exists somewhere in the string and your test string is "goodbye;" you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't work, and because of this, you won't have to do them.
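The "ello" case can be sketched in a few lines of Python. The rule list and the function name matching_patterns are hypothetical; the sketch assumes that someone has already worked out, for each regex, a literal substring that every match must contain.

```python
import re

# Each rule pairs a literal that must appear with the regex it guards.
# These three rules are made-up illustrations.
RULES = [
    ("ello", r"[Hh]ello"),
    ("ello", r".{0,20}ello"),
    ("cat", r"\bcats?\b"),
]

def matching_patterns(text, rules=RULES):
    """Run a regex only if its required literal occurs in the text."""
    hits = []
    for literal, pattern in rules:
        if literal not in text:
            continue  # cheap substring test rules this regex out
        if re.search(pattern, text):
            hits.append(pattern)
    return hits

print(matching_patterns("goodbye"))
print(matching_patterns("hello"))
```

In a real system the shared literal would be tested once per group rather than once per rule, which is where the saving comes from.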

If you're thinking in terms of "10,000 regexes" you need to shift your thought processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like something's gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.

Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main caveat: the patterns to match were all literal language patterns, not regex patterns, e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick
A = ahocorasick.Automaton()
patterns = [
    [['cat','dog'],'mammals'],
    [['bass','tuna','trout'],'fish'],
    [['toad','crocodile'],'amphibians'],
]
for row in patterns:
    vals = row[0]
    for val in vals:
        A.add_word(val, (row[1], val))
A.make_automaton()
_string = 'tom loves lions tigers cats and bass'
def test():
    vals = []
    for item in A.iter(_string):
        vals.append(item)
    return vals
Running %timeit test() on my 2000 categories with about 2-3 traces per category and a _string length of about 100,000 got me 2.09 ms vs 631 ms doing sequential re.search(), 315x faster!

You'd need to have some way of determining if a given regex was "additive" compared to another one, creating a regex "hierarchy" of sorts that allows you to determine that all regexes of a certain branch did not match.

You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what was captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have its matched text. If not, it would be undefined/None/null/...
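Here is a small sketch of that trick using Python's re module, with three patterns instead of twenty. The pattern list and the name which_match are invented for the example, and it assumes none of the individual patterns contains capture groups of its own. Each lookahead is tried at the start of the string, so this reports which patterns match at position 0.

```python
import re

PATTERNS = [r'[Hh]ello', r'.{0,20}ello', r'\d+']

# One optional capture group per pattern, each wrapped in a lookahead so
# that none of them consumes input: (?=(p1)?)(?=(p2)?)(?=(p3)?)
COMBINED = re.compile(''.join('(?=(%s)?)' % p for p in PATTERNS))

def which_match(text):
    """Return the patterns whose capture group was filled in."""
    m = COMBINED.match(text)  # always succeeds: every group is optional
    return [p for p, g in zip(PATTERNS, m.groups()) if g is not None]

print(which_match('hello'))
print(which_match('12'))
```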

If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.

I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use Ragel, although it produces a single DFA, which can result in many states and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.

I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
disclaimer: the only one i know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.

I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.

You could compile the regexes into a hybrid DFA/Büchi automaton where each time the BA enters an accept state you flag which regex rule "hit".
Büchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.

I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
        /.+ello/ % ello |
        any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Their regular expressions are quite limited and the machine language is preferred instead, the braces from your example only work with the more general language.

Try combining them into one big regex?

I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.

The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
    // Test every regex in turn and collect the ones that match.
    List<Regex> matches = new List<Regex>();
    foreach (Regex r in regexes)
    {
        if (r.IsMatch(s))
        {
            matches.Add(r);
        }
    }
    return matches;
}
Oh, you meant the fastest code? i don't know then....
