Regular Expression for detecting repeated substrings is SLOW

I am trying to come up with a GNU extended regular expression that detects repeated substrings in a string of ASCII-encoded bits. I have an expression that works -- sort of. The problem is that it executes really slowly when given a string that could have many solutions.
The expression
([01]+)(\1)+
compiles quickly, but takes about a minute to execute against the string
1010101010101010101010101010101010101010101010101010101010
I am using the regex implementation from glibc 2.5-49 (which comes with CentOS 5.5).
FWIW, the PCRE library executes this quickly, as in gregexp or Perl directly. So the obvious, but wrong, answer is "use libpcre". I cannot easily introduce a new dependency into my project; I need to work within the standard C library that comes with CentOS/RHEL.
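For reference, here is a minimal sketch of the setup (simplified, with illustrative error handling), using the POSIX regcomp()/regexec() interface; the regexec() call is the part that takes about a minute:

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        const char *pattern = "([01]+)(\\1)+";
        const char *input   = "1010101010101010101010101010101010101010101010101010101010";
        regex_t re;
        regmatch_t m[2];

        if (regcomp(&re, pattern, REG_EXTENDED) != 0) {   /* compiles quickly */
            fprintf(stderr, "regcomp failed\n");
            return 1;
        }
        if (regexec(&re, input, 2, m, 0) == 0)            /* this is the slow call */
            printf("repeated unit: \"%.*s\"\n",
                   (int)(m[1].rm_eo - m[1].rm_so), input + m[1].rm_so);
        regfree(&re);
        return 0;
    }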

If the input string can be of any considerable length, or if performance is at all a concern, then one of the better ways to solve this problem is not with regex, but with a more sophisticated string data structure that facilitates these kinds of queries much more efficiently.
Such a data structure is a suffix tree. Given a string S, its suffix tree is essentially the Patricia trie of all of its suffixes. Despite its seeming complexity, it can be built in linear time.
Suffix tree for "BANANA" (courtesy of Wikipedia)
You can do many kinds of queries really efficiently with a suffix tree, e.g. finding all occurrences of a substring, the longest substring that occurs at least twice, etc. The kind of substring you're after is called a tandem repeat. To support this query you'd preprocess the suffix tree in linear time so that lowest common ancestor queries can be answered in constant time.
This problem is very common in computational biology, where DNA can be viewed as a VERY long string over the letters ACGT. There, performance and efficiency are of utmost importance, which is why these sophisticated algorithms and techniques were devised.
You should look into either implementing these techniques from scratch for your binary sequence, or perhaps it's easier to map your binary sequence to a "fake" DNA string and then use one of the many tools available for gene research.
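That said, if your inputs stay short and you just want to avoid the backtracking regex without pulling in new machinery, a direct scan already helps a lot. This is not the suffix-tree method, just a simple O(n^2) sketch: for each candidate period p, look for p consecutive positions where s[j] == s[j+p], which is exactly a tandem repeat of length 2p.

    #include <stdio.h>
    #include <string.h>

    /* Returns the start index of some tandem repeat (a substring of the form XX)
     * and stores its period in *period; returns -1 if the string has none. */
    static int find_tandem_repeat(const char *s, int *period)
    {
        int n = (int)strlen(s);
        for (int p = 1; 2 * p <= n; ++p) {
            int run = 0;                       /* consecutive j with s[j] == s[j+p] */
            for (int j = 0; j + p < n; ++j) {
                run = (s[j] == s[j + p]) ? run + 1 : 0;
                if (run >= p) {                /* p matches in a row => repeat of length 2p */
                    *period = p;
                    return j - p + 1;
                }
            }
        }
        return -1;
    }

    int main(void)
    {
        const char *s = "1010101010101010101010101010101010101010101010101010101010";
        int p, i = find_tandem_repeat(s, &p);
        if (i >= 0)
            printf("repeat of period %d at index %d: \"%.*s\"\n", p, i, 2 * p, s + i);
        return 0;
    }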
See also
Wikipedia/Tandem repeats

Related

How to efficiently implement longest match in a lexer generator?

I'm interested in learning how to write a lexer generator like flex. I've been reading "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic idea of how flex works.
My initial approach is this: the user will supply a hash map of regexes, mapping each regex to a token enum. The program will just loop through the regexes one by one in the order given and see whether each matches the start of the string (I could add a ^ to the beginning of each regex to implement this). If one does, I can add the token for that regex to the program's list of tokens.
My first question is, is this the most efficient way to do it? Currently I have to loop through each regex, but in theory I could construct a DFA from all of the regexes combined and step through that more efficiently. However, there will be some overhead from creating this DFA.
My second question is, how would I implement the longest-matching-string tie breaker, like flex does? I.e., I want to match ifa as an identifier, and not as the keyword if followed by the letter a. I don't see any efficient way to do this with regexes. I think I'll have to loop through all of the regexes, try to match them, and if I have more than one match, take the longest result. However, if I converted the regexes to a DFA (that is, my own DFA data structure), then I could keep stepping through characters until there are no more possible transition edges in the DFA. At that point, I can take the last time I passed through an accepting state as the actual match of a token, since that should be the longest match.
Both of my questions point to writing my own translator from regex to a DFA. Is this required, or can I still do this efficiently with plain regex (implemented by a standard library) and still get the longest match?
EDIT: I kept the regex engine I'm using out of this because I wanted a general answer, but I'm using Rust's regex library: http://static.rust-lang.org/doc/master/regex/index.html
Timewise, it's much more efficient to compile all the regexes down into a single automaton that matches all patterns in parallel. It might blow up the space usage significantly, though (DFAs can have exponentially many states relative to the pattern sizes), so it's worth investigating whether this will hurt.
Typically, the way you'd implement maximal-munch (matching the longest string you can) is to run the matching automaton as normal. Keep track of the index of the last match that you find. When the automaton enters a dead state and stops, you can then output the substring from the beginning of the token up through the match point, then jump back in the input sequence to the point right after the match finished. This can be done pretty efficiently and without much slowdown at all.
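Here is a rough C sketch of that loop, with a tiny hand-built DFA standing in for the combined automaton (the token kinds, the two-token toy grammar of "if" versus identifiers, and the transition function are made up for illustration; a generator would build the table from your regexes):

    #include <stdio.h>
    #include <ctype.h>

    enum { TK_NONE, TK_IF, TK_IDENT };
    enum { DEAD = -1 };

    /* Toy DFA: state 0 = start, 1 = saw "i", 2 = saw "if", 3 = other identifier. */
    static int step(int state, int c)
    {
        if (!islower(c)) return DEAD;                 /* only lowercase letters move the DFA */
        switch (state) {
        case 0: return c == 'i' ? 1 : 3;
        case 1: return c == 'f' ? 2 : 3;
        case 2: case 3: return 3;
        default: return DEAD;
        }
    }

    static const int accepts[] = { TK_NONE, TK_IDENT, TK_IF, TK_IDENT };

    static void scan(const char *src)
    {
        int pos = 0;
        while (src[pos]) {
            int state = 0, i = pos;
            int last_tok = TK_NONE, last_end = pos;   /* most recent accepting point */
            while (src[i] && (state = step(state, (unsigned char)src[i])) != DEAD) {
                if (accepts[state] != TK_NONE) {      /* remember the longest match so far */
                    last_tok = accepts[state];
                    last_end = i + 1;
                }
                ++i;
            }
            if (last_tok == TK_NONE) {                /* no token starts here: skip one char */
                ++pos;
                continue;
            }
            printf("%s: \"%.*s\"\n",
                   last_tok == TK_IF ? "IF" : "IDENT",
                   last_end - pos, src + pos);
            pos = last_end;                           /* resume right after the match */
        }
    }

    int main(void)
    {
        scan("ifa if x");   /* IDENT "ifa", IF "if", IDENT "x" */
        return 0;
    }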
In case it helps, here are some lecture slides from a compilers course I taught that explores scanning techniques.
Hope this helps!

Is there any way to optimize a generic regular expression?

I code in Eclipse, and when I do a CTRL-F to find some string, I see that apart from the standardized options of whole word, case sensitive, there is an option for regular expression search also (it is there in Notepad++ too).
I have tried it once or twice, and generally the results are almost instantaneous. But then, the code files are not humongous: the biggest ones are no more than 500 lines long, and most lines are less than half full. Is there any way to optimize so that any user-supplied regex will run much faster on a large piece of text, say 10-15 MB in size?
I can't think of any method, because standard search algorithms like Rabin-Karp or suffix trees don't seem to apply here!
I have no idea how regular expressions are implemented in Eclipse or why that implementation is slow. Here are just some thoughts:
First of all, there are a few concepts you should know: nondeterministic finite automata (NFA) and deterministic finite automata (DFA). In theory, regular expressions, NFAs, and DFAs are equivalent, which means they have exactly the same power to describe languages (sets of strings). This implies that any one of them can be converted to another (see this site).
A regular expression can be implemented by converting it to a DFA, and using a DFA to match text takes only linear time (many string-matching algorithms, e.g. KMP, are in effect special-purpose DFAs). The trouble is, however, that most modern regular expression implementations have introduced features like backreferences, which make a pure DFA implementation impossible.
So, if discarding those complex features is acceptable, implementing a fast regular expression engine (one that does the matching in linear time) is feasible. You may find more in this article.
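To make the linear-time claim concrete, here is a toy table-driven DFA sketch; the automaton is hand-built and accepts strings over {0,1} that end in "00", whereas a real engine would generate such a table from the pattern. Each input character costs one table lookup, so matching is O(length of the text) with no backtracking:

    #include <stdio.h>

    static const int delta[3][2] = {
        /*            '0'  '1' */
        /* state 0 */ { 1,   0 },   /* haven't just seen a 0                  */
        /* state 1 */ { 2,   0 },   /* last character was a single 0          */
        /* state 2 */ { 2,   0 },   /* last two characters were "00": accept  */
    };

    static int dfa_match(const char *s)
    {
        int state = 0;
        for (; *s; ++s)
            state = delta[state][*s - '0'];   /* assumes input is only '0'/'1' */
        return state == 2;
    }

    int main(void)
    {
        printf("%d\n", dfa_match("100"));   /* 1 */
        printf("%d\n", dfa_match("0100"));  /* 1 */
        printf("%d\n", dfa_match("010"));   /* 0 */
        return 0;
    }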
What makes you think suffix tree isn't a suitable algorithm for this problem? From http://en.wikipedia.org/wiki/Suffix_tree:
Once [the suffix tree is] constructed, several operations can be performed quickly, for instance locating a substring in S, locating a substring if a certain number of mistakes are allowed, locating matches for a regular expression pattern etc.
I think a modified Boyer–Moore string search algorithm would also be possible.

creating a regular expression for a list of strings

I have extracted a series of tables from the scientific literature which consist of columns each of which is a distinct type. Here is an example
I'd like to be able to automatically generate regular expressions for each column. Obviously there are trivial solutions such as .* so I would add the constraints that they use only:
[A-Z] [a-z] [0-9]
explicit punctuation (e.g. ',', ''')
"simple" quantifiers (e.g. {3,4})
A "best" answer for the table above would be:
[A-Z]{3}
[A-Za-z\s\.]+
\d{4}\sm
\d{2}\u00b0\d{2}'\d{2}"N,\d{2}\u00b0\d{2}'\d{2}"E
(speciosissima|intermediate|troglodytes)
(hf|sr)
\d{4}
Of course the 4th regex would break if we moved outside the geographical area, but the software doesn't know that. The aim would be to collect many regexes for, say, "Coordinates" and generalize them, probably partially manually. The enums would only be created if there were a small number of distinct strings.
I'd be grateful for examples of (especially F/OSS) software that can do this, especially in Java. (It's similar to Google's Refine.) I am aware of this question from 4 years ago, but it didn't really answer the question, and of the text2re site, which appears to be interactive.
NOTE: I note a vote to close as "too localised". This is a very general problem (the table given is only an example), as shown by Google/Freebase developing Refine to tackle it. It potentially applies to a very wide variety of tables (e.g. financial, journalism, etc.). Here's one with floating-point values:
It would be useful to determine automatically that some authorities report ages as real numbers (e.g. not in months or days) and use 2 digits of precision.
Your particular issue is a special case of "programming by demonstration". That is, given a bunch of input/output examples, you want to generate a program. For you, the inputs are strings and the output is whether each string belongs to the given column. In the end, you want to generate a program in the language of limited regular expressions that you proposed.
This particular instance of programming by demonstration seems closely related to Flash Fill, a recent project from MSR. There, instead of generating regular expressions to match data, they automatically generated programs to transform string data based on input/output examples.
I only skimmed through one of their papers, but I'll try to lay out what I understand here.
There are basically two important insights in this paper. The first was to design a small programming language to represent string transformations. Even using full-on regular expressions created too many possibilities to search through quickly. They designed their own abstract language for manipulating strings; however, your constraints (e.g. only using simple quantifiers) would probably play the same role as their custom language. This is largely possible because your particular problem has a somewhat smaller scope than theirs.
The second insight was on how to actually find programs in this abstract language that match with given input/output pairs. My understanding is that the key idea here is to use a technique called version space algebra. The rough idea about version space algebra is that you maintain a representation of the space of possible programs and repeatedly prune it by introducing additional constraints. The exact details of this process fall well outside my main interests, so you're better off reading something like this introduction to version space algebra, which includes some sample code as well.
They also have some clever approaches to rank different candidate programs and even guess which inputs might be problematic for an already-generated program. I saw a demo where they generated a program without giving it enough input/output pairs, and the program could actually highlight new inputs that were likely to be incorrect. This sort of ranking is very interesting, but requires some more sophisticated machine learning techniques and is probably not immediately applicable to your use case. Might still be interesting though. (Also, this might have been detailed in a different paper than the one I linked.)
So yeah, long story short, you can generate your expressions by feeding input/output examples into a system based on version space algebra. I hope that helps.
I'm currently researching the same (or something similar) (here). In general, this is called grammar induction; in the case of regular expressions, it is induction of regular languages. There is the StaMinA competition in this field. Common algorithms are RPNI and Blue-Fringe.
Here is another related question. And here another one. And here another one.
My own approach (which I have partially prototyped) is heuristic and based on the premise that a given column will often have entries of the same or similar character length and similar punctuation. I would welcome comments (and the resulting code will be Open Source). The steps, with a sketch of the flattening step after the list below, are:
flatten [A-Z] to 'A'
flatten [a-z] to 'a'
flatten [0-9] to '0'
flatten any other special codepoint sets (e.g. greek characters) to a single character (e.g. alpha)
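A minimal sketch of that flattening step (the helper name is mine; other code-point classes such as Greek letters are left out here, since handling them depends on the input encoding):

    #include <ctype.h>
    #include <stdio.h>

    /* Flatten a cell: uppercase -> 'A', lowercase -> 'a', digits -> '0',
     * everything else (punctuation, spaces) is kept unchanged. */
    static void flatten(const char *in, char *out)
    {
        for (; *in; ++in, ++out) {
            unsigned char c = (unsigned char)*in;
            if (isupper(c))      *out = 'A';
            else if (islower(c)) *out = 'a';
            else if (isdigit(c)) *out = '0';
            else                 *out = (char)c;
        }
        *out = '\0';
    }

    int main(void)
    {
        char buf[64];
        flatten("4500 m", buf);    /* a hypothetical cell value */
        printf("%s\n", buf);       /* prints: 0000 a */
        return 0;
    }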
The columns then become:
"AAA"
"Aaaaaaaaaa", "Aaaaaaaaaaaaa", "Aaa aaa Aaaaaa", etc.
"0000 a"
"00\u00b000'00"N,00\u00b000'00"E
...
...
"0000"
I shall then replace these by regular expressions such as
"([A-Z])([A-Z])([A-Z])"
...
"(\d)(\d)(\d)(\d)\s([a-z])"
and capture the individual characters into sets. This will show that (say) in 3. the final char is always "m", so \d\d\d\d\s[m], and for 7. the value is [2][0][0][458].
For the columns that don't fit this model we search using "(.*)" and see if we can create useful sets (cols 5. and 6.) with a heuristic such as "at least 2 multiple strings and no more than 50% unique strings".
By using dynamic programming (cf. Kruskal) I hope to be able to align similar regexes, which will be useful for me, at least!
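A sketch of that "useful set" heuristic with those thresholds (the function name, the fixed-size buffers, and the sample column are illustrative only, and values are assumed not to contain regex metacharacters):

    #include <stdio.h>
    #include <string.h>

    #define MAX_DISTINCT 64

    static void column_to_enum(const char **values, int n)
    {
        const char *distinct[MAX_DISTINCT];
        int count[MAX_DISTINCT], ndistinct = 0;

        for (int i = 0; i < n; ++i) {                 /* tally distinct values */
            int j = 0;
            while (j < ndistinct && strcmp(distinct[j], values[i]) != 0) ++j;
            if (j == ndistinct) {
                if (ndistinct == MAX_DISTINCT) { printf("too many distinct values\n"); return; }
                distinct[ndistinct] = values[i];
                count[ndistinct++] = 0;
            }
            count[j]++;
        }

        int repeated = 0;          /* "at least 2 multiple strings" read as:      */
        for (int j = 0; j < ndistinct; ++j)   /* >= 2 distinct values occurring twice or more */
            if (count[j] > 1) ++repeated;

        if (repeated >= 2 && 2 * ndistinct <= n) {    /* <= 50% unique strings */
            printf("(");
            for (int j = 0; j < ndistinct; ++j)
                printf("%s%s", j ? "|" : "", distinct[j]);
            printf(")\n");
        } else {
            printf("no useful set for this column\n");
        }
    }

    int main(void)
    {
        const char *col6[] = { "hf", "sr", "hf", "hf", "sr", "sr" };  /* hypothetical column */
        column_to_enum(col6, 6);                      /* prints: (hf|sr) */
        return 0;
    }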

Using tr1::regex_search to match a big list of strings

I need to match any of a list of strings, and I'm wondering if I can just use a regular expression that is something like "item1|item2|item3|..." instead of just doing a separate strstr() for each string. But the list can be fairly large - up to 10000 items. Would a regex work well with that? Would it be faster than searching for each string separately?
The regex will work and will certainly be faster than searching for each string separately, though I'm not sure how much memory or setup time the initial compilation will take given the 10000 input patterns.
However, this is a well-known problem, and there are a number of specialized algorithms for it, for example:
Rabin-Karp string search algorithm.
Aho-Corasick string search algorithm.
and several others. They all have different trade-offs, so pick your poison.
In our project we needed a multiple-replace solution, so we chose the Aho-Corasick algorithm and built the replacement function on top of it.
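For what it's worth, here is a compact Aho-Corasick sketch in C (not the code we used; the fixed node pool and lowercase-only alphabet are simplifications for illustration, and only one pattern id is reported per ending position):

    #include <stdio.h>

    #define ALPHA 26
    #define MAXNODES 100000                 /* node pool sized for illustration */

    static int next_state[MAXNODES][ALPHA]; /* trie edges, later goto shortcuts       */
    static int fail[MAXNODES];              /* failure links                           */
    static int out[MAXNODES];               /* pattern id ending at this node, 0 = none */
    static int node_count = 1;              /* node 0 is the root                      */

    static void add_pattern(const char *p, int id)
    {
        int s = 0;
        for (; *p; ++p) {
            int c = *p - 'a';
            if (!next_state[s][c])
                next_state[s][c] = node_count++;
            s = next_state[s][c];
        }
        out[s] = id;
    }

    static void build_links(void)
    {
        static int queue[MAXNODES];
        int head = 0, tail = 0;

        for (int c = 0; c < ALPHA; ++c)          /* depth-1 nodes fail to the root */
            if (next_state[0][c])
                queue[tail++] = next_state[0][c];

        while (head < tail) {                    /* BFS over the trie */
            int s = queue[head++];
            for (int c = 0; c < ALPHA; ++c) {
                int t = next_state[s][c];
                if (!t) {
                    next_state[s][c] = next_state[fail[s]][c];  /* goto shortcut */
                } else {
                    fail[t] = next_state[fail[s]][c];
                    if (!out[t])                 /* inherit a match from the suffix */
                        out[t] = out[fail[t]];
                    queue[tail++] = t;
                }
            }
        }
    }

    static void search(const char *text)
    {
        int s = 0;
        for (int i = 0; text[i]; ++i) {
            int c = text[i] - 'a';
            if (c < 0 || c >= ALPHA) { s = 0; continue; }  /* ignore other characters */
            s = next_state[s][c];
            if (out[s])
                printf("pattern %d ends at index %d\n", out[s], i);
        }
    }

    int main(void)
    {
        add_pattern("item", 1);      /* illustrative patterns; the real list */
        add_pattern("items", 2);     /* would be the 10000 strings           */
        add_pattern("tem", 3);
        build_links();
        search("some items here");
        return 0;
    }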

What's the Time Complexity of Average Regex algorithms?

I'm not new to using regular expressions, and I understand the basic theory they're based on--finite state machines.
I'm not so good at algorithmic analysis, though, and I don't understand how a regex compares to, say, a basic linear search. I'm asking because on the surface it seems like a linear array search (if the regex is simple).
Where could I go to learn more about implementing a regex engine?
This is one of the most popular outlines: Regular Expression Matching Can Be Simple And Fast. Running a DFA-compiled regular expression against a string is indeed O(n), but it can require up to O(2^m) construction time/space (where m = regular expression size).
Are you familiar with the term Deterministic/Non-Deterministic Finite Automata?
Real regular expressions (when I say real I'm referring to regexes that recognize regular languages, not the regexes with backreferences etc. that almost every programming language includes) can be converted into a DFA/NFA, and both can be implemented in a mechanical way in a programming language (an NFA can be converted into a DFA).
What you have to do is:
Find a way to convert a regex into an automaton
Implement the recognition of the automaton in the programming language of your preference
That way, given a regex, you can convert it to a DFA and run it to see whether or not it matches a specified text.
This can be implemented in O(n), because a DFA doesn't go backward (unlike a Turing machine), so it either matches the string or it doesn't. That is assuming you don't count overlapping matches; otherwise you will have to go back and start matching again...
Classic regular expressions can be implemented in a way that is fast in practice but has really bad worst-case behaviour (the usual backtracking search) or in a way that has guaranteed reasonable worst-case behaviour (keeping it as an NFA, or compiling it to a DFA). The backtracking approach can be extended to support lots of extra matching features and flags, which exploit the fact that it is basically a backtracking search.
Examples of the standard backtracking approach are everywhere (e.g. built into Perl). There is an implementation that claims good worst-case behaviour at http://code.google.com/p/re2/ - in fact it is even better than I expected in the worst case, so they may have found an extra trick or two.
If you are at all interested in this, or care about writing programs that can be made to lock up solid given pathological inputs, read http://swtch.com/~rsc/regexp/regexp1.html.
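To make that concrete, here is a toy sketch of the guaranteed-worst-case approach described in that article: a hand-built NFA for (a|b)*abb, simulated by advancing the whole set of active states on every character, so matching costs O(text length x number of states) instead of potentially exponential backtracking. The pattern and state encoding are made up for the example:

    #include <stdio.h>

    #define NSTATES 4                          /* state 3 is the only accepting state */

    /* Hand-built NFA for (a|b)*abb: bitmask of successor states per input symbol. */
    static const unsigned on_a[NSTATES] = { 0x3, 0x0, 0x0, 0x0 };  /* 0 -> {0,1}            */
    static const unsigned on_b[NSTATES] = { 0x1, 0x4, 0x8, 0x0 };  /* 0->{0}, 1->{2}, 2->{3} */

    static int nfa_match(const char *text)     /* input is assumed to contain only 'a'/'b' */
    {
        unsigned active = 0x1;                 /* start with only state 0 active */
        for (; *text; ++text) {
            const unsigned *table = (*text == 'a') ? on_a : on_b;
            unsigned next = 0;
            for (int s = 0; s < NSTATES; ++s)  /* advance every active state at once */
                if (active & (1u << s))
                    next |= table[s];
            active = next;
        }
        return (active & 0x8) != 0;            /* did we end in the accepting state? */
    }

    int main(void)
    {
        printf("%d\n", nfa_match("abb"));      /* 1 */
        printf("%d\n", nfa_match("ababb"));    /* 1 */
        printf("%d\n", nfa_match("abab"));     /* 0 */
        return 0;
    }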