I have some large files (hundreds of MB) that I need to search for several thousand ~20-character unique strings.
I've found that using the pipe alternation metacharacter for matching regular expressions like (string1|string2|string3) speeds up the search process a lot (versus searching for one string at a time).
What's the limit to how well this will scale? How many expressions can I chain together like this? Will it cause some kind of overflow at some point? Is there a better way to do this?
EDIT
In an effort to keep my question brief, I didn't emphasize the fact that I've already implemented code using this alternation approach and I found it to be helpful: On a test case with a typical data set, running time was reduced from 87 minutes down to 18 seconds--a 290x speedup, apparently with O(n) instead of O(n*m).
My question relates to how this approach can be expected to work when other users run this code in the future using much larger data sets with bigger files and more search terms. The original O(n*m) code was existing code that's been in use for 13 years, and its slowness was pointed out recently as the genome-related data sets it operates on have recently gotten much bigger.
If you have a simple regular expression like (word1|word2|...|wordn), the regex engine will construct a state machine that can just pass over the input once to find whether the string matches.
Side-note: in theoretical computer science, "regular expressions" are defined in a way such that a single pass is always sufficient. However, practical regex implementation add features that allow construction of regex patterns which can't be always implemented as a single pass (see this example).
Again, for your pattern of regular expressions, the engine will almost certainly use a single pass. That is likely going to be faster than reading the data from memory multiple times ... and almost definitely a lot faster than reading the data multiple times from disk.
If you are just going to have regular expression of the form (word1|word2|....|wordn), why not just create an associated array of booleans. That should be very fast.
EDIT
# before the loop, set up the hash
%words = (
cat => 1,
dog => 1,
apple => 1,
.... etc
);
# A the loop to check a sentence
foreach $aword (split(/ /, $sentence))
if ($words{$aword}) print "Found $aword\n";
There is no theoretical limit to the extent of a regular expression, but practically it must fit within the limits of a specific platform and installation. You must find out empirically whether your plan will work, and I for one would be delighted to see your results.
One thing I would say is that you should compile the expression separately before you go on to use it. Either that or apply the /o option to compile just once (i.e. promise that the contents of the expression won't change). Something like this
my $re = join '|', #strings;
foreach my $file (#files) {
my $fh = IO::File->new($file, '<') or die "Can't open $file: $!";
while (<$fh>) {
next unless /\b(?:$re)\b/io;
chomp;
print "$_ found in $file\n";
last;
}
}
Related
In a scenario where a large number of strings must be parsed with regular expressions, considering the same RegEx needle will be used for all tests, which would be faster:
To test each string in an array, individually, or;
To concatenate everything into a single big string and test just once?
I believed number 2 would be best instead of having to fire up the RegEx engine multiple times for processing an array of strings. However, after some testing in PHP (PCRE), it seemed untrue.
Benchmark
I've made a simple benchmark in PHP 5.3 (source code) and got the following results:
122185 interactions in 5 seconds testing multiple smaller strings inside an array
26853 interactions in 5 seconds doing the single big-string test
Therefore, I must conclude the first method is up to 5 times faster. However, I'd like to ask for an authoritative answer confirming this; I could be assuming things mistakenly due to some PHP optimization I'm unaware of.
Is it always a more optimized solution to fragment large strings before testing them with regular expressions, not specifically in PCRE?
preg_grep()
I don't think this function should be considered here. It's a benchmark test, not an optimization issue. Not to mention that function is a PHP-specific method. Also, preg_match_all returns all matched substrings whereas preg_grep just indicates which array elements matched.
Your benchmark is flawed. Look at this piece of your code:
while(time() - $TimeStart < 5)
for($i = 0; $i < $Length; $i++, $Iterations++)
{
preg_match_all($RegEx, $Input[$i], $m);
}
}
The $Iterations should only be increased in the while, not inside the for. Dividing the former results results in:
24437 iterations using array
26853 iterations using big string
You shouldn't use time() for time measurements, microtime() would be more suitable to gain accuracy.
Lastly, this benchmark isn't complete, because to obtain the same results for both tests, the array method needs to perform array_merge() after every iteration. Also, somewhere a big string needs be transformed into an array and that takes time too.
You definitely should not merge all the target strings into one. For one thing, it will break a lot of regex that work okay on the shorter strings. Anchors like ^ and $, \A and \z, would suddenly find themselves with nothing to match. Also, regexes that rely heavily on things like .* or.*?`, which work on shorter strings despite their inherent inefficiency, would tend to become catastrophically slow when used on the Frankenstring.
But even if the concatenated version turned out to be faster, would it matter? Have you tried the array version and found it to be too slow? This is a pretty drastic solution (if solution it is); I'd hold off on implementing it until I had a problem it could solve, if I were you.
AFAIK no one have implemented an algorithm that takes a set of strings and substrings and gives back one or more regular expressions that would match the given substrings inside the strings. So, for instance, if I'd give my algorithm this two samples:
string1 = "fwef 1234 asdfd"
substring1 = "1234"
string2 = "asdf456fsdf"
substring2 = "456"
The algorithm would give me the regular expression "[0-9]*" back. I know it could give more than one regex or even no possible regex back and you might find 1000 reasons why such algorithm would be close to impossible to implement to perfection. But what's the closest thing?
I don't really care about regex itself also. Basically what I want is an algorithm that takes samples as the ones above and then finds a pattern in them that can be used to easily find the "kind" of text I want to find in a string without having to write any regex or code manually.
I don't have proof but I suspect no such discrete algorithm with a finite output could exist since you are asking for the creation of a regular language which could be "large" in respect to the input size.
With that, I suggest you peek at txt2re which can break down sample texts one-by-one and help you build regexes.
FlashFill a new feature of MS Excel 2013 would do exactly the task you want, but it does not give you the regular expression. It's a NP-complete problem and an open question for practical purposes. If you're interested in how to synthesise string manipulation from multiple examples, Go Flash Fill official website and read a few papers. They have pseudo-code and demo. movies as well.
There are many such algorithm in fact. This is a research area called "Grammatical inference".
I know RPNI, for example. (you could also look on the probabilistic branch, alergia, MDI, DEES). These algorithms generate DSA (Deterministic State Automata). In fact you absolutely don't need to enter the strings in your example. Only substrings.
There are also some algorithms to generate directly Non deterministic automata.
Of course, get the regular expression from an Non Deterministic Automata is easy.
The main ideas are simple:
Generate a PTSA (Prefix Tree State Automata) from your sample.
Then, you have to try to "merge" some states. From these merge, will emerge loops (i.e. * in the regular expression). All the difficulty being to choose the right rule to merge.
Here you go:
re = '|'.join(substrings)
If you want anything more general, your algorithm is going to have to make educated guesses about what type of strings are acceptable as matches, and it's trivial to demonstrate that no procedure can account for all possible sets of possible inputs without simply enumerating them all. For instance, consider some of these scenarios:
Match all prime numbers
Match hexadecimal strings, but no strings containing 'f' are in the sample set
Match the same string repeated twice
Match any even-length string
The root problem is that your question is incompletely specified. If you have a more specific requirement, that might be solvable, depending on what it is.
This may seem as somewhat odd question, but anyhow to the point;
I have a string that I need to search for many many possible character occurrences in several combinations (so character classes are out of question), so what would be the most efficent way to do this?
I was thinking either stack it into one regex:
if ($txt =~ /^(?:really |really |long | regex here)$/){}
or using several 'smaller' comparisons, but I'd assume this won't be very efficent:
if ($txt =~ /^regex1$/ || $txt =~ /^regex2$/ || $txt =~ /^regex3$/) {}
or perhaps nest several if comparisons.
I will appreciate any extra suggestions and other input on this issue.
Thanks
Ever since way back in v5.9.2, Perl compiles a set of N alternatives like:
/string1|string2|string3|string4|string5|.../
into a trie data structure, and if that is the first thing in the pattern, even uses Aho–Corasick matching to find the start point very quickly.
That means that your match of N alternatives will now run in O(1) time instead of in the O(N) time that this:
if (/string1/ || /string2/ || /string3/ || /string4/ || /string5/ || ...)
will run in.
So you can have O(1) or O(N) performance: your choice.
If you use re "debug" or -Mre-debug, Perl will show these trie structures in your patterns.
This will not replace some time testing. If possible though, I would suggest using the o flag if possible so that Perl doesn't recompile your (large) regex on every evaulation. Of course this is only possible if those combinations of characters do not change for each evaluation.
I think it depends on how long regex you have. Sometimes better to devide very long expressions.
I was asked this question in an interview for an internship, and the first solution I suggested was to try and use a regular expression (I usually am a little stumped in interviews). Something like this
(?P<str>[a-zA-Z]+)(?P<n>[0-9]+)
I thought it would match the strings and store them in the variable "str" and the numbers in the variable "n". How, I was not sure of.
So it matches strings of type "a1b2c3", but a problem here is that it also matches strings of type "a1b". Could anyone suggest a solution to deal with this problem?
Also, is there any other regular expression that could solve this problem?
Do you know why "regular expressions" are called "regular"? :-)
That would be too long to explain, I'll just outline the way. To match a pattern (i.e. decide whether a given string is "valid" or "invalid"), a theoretical informatician would use a finite state automaton. That's an abstract machine that has a finite number of states; each tick it reads a char from the input and jumps to another state. The pattern of where to jump from particular state when a particular character is read is fixed. Some states are marked as "OK", some--as "FAIL", so that by examining state of a machine you can check whether your text is "valid" (i.e. a valid e-mail).
For example, this machine only accepts "nice" as its "valid" word (a pic from Wikipedia):
A set of "valid" words such a machine theoretically can distinguish from invalid is called "regular language". Not every set is a regular language: for example, finite state automata are incapable of checking whether parentheses in string are balanced.
But constructing state machines was a complex task, compared to the complexity of defining what "valid" is. So the mathematicians (mainly S. Kleene) noted that every regular language could be described with a "regular expression". They had *s and |s and were the prototypes of what we know as regexps now.
What does it have to do with the problem? The problem in subject is essentially non-regular. It can't be expressed with anything that works like a finite automaton.
The essence is that it should contain a memory cell that is capable to hold an arbitrary number (repetition count in your case). Finite automata and classical regular expressions can not do this.
However, modern regexps are more expressive and are said to be able to check balanced parentheses! But this may serve as a good example that you shouldn't use regexps for tasks they don't suit. Let alone that it contains code snippets; this makes the expression far from being "regular".
Answering the initial question, you can't solve your problem with using anything "regular" only. However, regexps could be aid you in solving this problem, as in tster's answer
Perhaps, I should look closer to tster's answer (do a "+1" there, please!) and show why it's not the "regular expression" solution. One may think that it is, it just contains print statement (not essential) and a loop--and loop concept is compatible with finite state automaton expressive power. But there is one more elusive thing:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1
x # <--- this one
$2;
}
The task of reading a string and a number and printing repeatedly that string given number of times, where the number is an arbitrary integer, is undoable on a finite state machine without additional memory. You use a memory cell to keep that number and decrease it, and check for it to be greater than zero. But this number may be arbitrarily big, and it contradicts with a finite memory available to the finite state machine.
However, there's nothing wrong with classical pattern /([abc]*){5}/ that matches something "regular" repeated fixed number of times. We essentially have states that correspond to "matched pattern once", "matched pattern twice" ... "matched pattern 5 times". There's finite number of them, and that's the gist of the difference.
how about:
while ($line =~ s/^([a-z]+)(\d+)//i)
{
print $1 x $2;
}
Answering your question directly:
No, regular expressions match text and don't print anything, so there is no way to do it solely using regular expressions.
The regular expression you gave will match one string/number pair; you can then print that repeatedly using an appropriate mechanism. The Perl solution from #tster is about as compact as it gets. (It doesn't use the names that you applied in your regex; I'm pretty sure that doesn't matter.)
The remaining details depend on your implementation language.
Nope, this is your basic 'trick question' - no matter how you answer it that answer is wrong unless you have exactly the answer the interviewer was trained to parrot. See the workup of the issue given by Pavel Shved - note that all invocations have 'not' as a common condition, the tool just keeps sliding: Even when it changes state there is no counter in that state
I have a rather advanced book by Kenneth C Louden who is a college prof on the matter, in which it is stated that the issue at hand is codified as "Regex's can't count." The obvious answer to the question seems to me at the moment to be using the lookahead feature of Regex's ...
Probably depends on what build of what brand of regex the interviewer is using, which probably depends of flight-dynamics of Golf Balls.
Nice answers so far. Regular expressions alone are generally thought of as a way to match patterns, not generate output in the manner you mentioned.
Having said that, there is a way to use regex as part of the solution. #Jonathan Leffler made a good point in his comment to tster's reply: "... maybe you need a better regex library in your language."
Depending on your language of choice and the library available, it is possible to pull this off. Using C# and .NET, for example, this could be achieved via the Regex.Replace method. However, the solution is not 100% regex since it still relies on other classes and methods (StringBuilder, String.Join, and Enumerable.Repeat) as shown below:
string input = "aa67bc54c9";
string pattern = #"([a-z]+)(\d+)";
string result = Regex.Replace(input, pattern, m =>
// can be achieved using StringBuilder or String.Join/Enumerable.Repeat
// don't use both
//new StringBuilder().Insert(0, m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToString()
String.Join("", Enumerable.Repeat(m.Groups[1].Value, Int32.Parse(m.Groups[2].Value)).ToArray())
+ Environment.NewLine // comment out to prevent line breaks
);
Console.WriteLine(result);
A clearer solution would be to identify the matches, loop over them and insert them using the StringBuilder rather than rely on Regex.Replace. Other languages may have compact idioms to handle the string multiplication that doesn't rely on other library classes.
To answer the interview question, I would reply with, "it's possible, however the solution would not be a stand-alone 100% regex approach and would rely on other language features and/or libraries to handle the generation aspect of the question since the regex alone is helpful in matching patterns, not generating them."
And based on the other responses here you could beef up that answer further if needed.
Lets say that I have 10,000 regexes and one string and I want to find out if the string matches any of them and get all the matches.
The trivial way to do it would be to just query the string one by one against all regexes. Is there a faster,more efficient way to do it?
EDIT:
I have tried substituting it with DFA's (lex)
The problem here is that it would only give you one single pattern. If I have a string "hello" and patterns "[H|h]ello" and ".{0,20}ello", DFA will only match one of them, but I want both of them to hit.
This is the way lexers work.
The regular expressions are converted into a single non deterministic automata (NFA) and possibily transformed in a deterministic automata (DFA).
The resulting automaton will try to match all the regular expressions at once and will succeed on one of them.
There are many tools that can help you here, they are called "lexer generator" and there are solutions that work with most of the languages.
You don't say which language are you using. For C programmers I would suggest to have a look at the re2c tool. Of course the traditional (f)lex is always an option.
I've come across a similar problem in the past. I used a solution similar to the one suggested by akdom.
I was lucky in that my regular expressions usually had some substring that must appear in every string it matches. I was able to extract these substrings using a simple parser and index them in an FSA using the Aho-Corasick algorithms. The index was then used to quickly eliminate all the regular expressions that trivially don't match a given string, leaving only a few regular expressions to check.
I released the code under the LGPL as a Python/C module. See esmre on Google code hosting.
We had to do this on a product I worked on once. The answer was to compile all your regexes together into a Deterministic Finite State Machine (also known as a deterministic finite automaton or DFA). The DFA could then be walked character by character over your string and would fire a "match" event whenever one of the expressions matched.
Advantages are it runs fast (each character is compared only once) and does not get any slower if you add more expressions.
Disadvantages are that it requires a huge data table for the automaton, and there are many types of regular expressions that are not supported (for instance, back-references).
The one we used was hand-coded by a C++ template nut in our company at the time, so unfortunately I don't have any FOSS solutions to point you toward. But if you google regex or regular expression with "DFA" you'll find stuff that will point you in the right direction.
Martin Sulzmann Has done quite a bit of work in this field.
He has a HackageDB project explained breifly here which use partial derivatives seems to be tailor made for this.
The language used is Haskell and thus will be very hard to translate to a non functional language if that is the desire (I would think translation to many other FP languages would still be quite hard).
The code is not based on converting to a series of automata and then combining them, instead it is based on symbolic manipulation of the regexes themselves.
Also the code is very much experimental and Martin is no longer a professor but is in 'gainful employment'(1) so may be uninterested/unable to supply any help or input.
this is a joke - I like professors, the less the smart ones try to work the more chance I have of getting paid!
10,000 regexen eh? Eric Wendelin's suggestion of a hierarchy seems to be a good idea. Have you thought of reducing the enormity of these regexen to something like a tree structure?
As a simple example: All regexen requiring a number could branch off of one regex checking for such, all regexen not requiring one down another branch. In this fashion you could reduce the number of actual comparisons down to a path along the tree instead of doing every single comparison in 10,000.
This would require decomposing the regexen provided into genres, each genre having a shared test which would rule them out if it fails. In this way you could theoretically reduce the number of actual comparisons dramatically.
If you had to do this at run time you could parse through your given regular expressions and "file" them into either predefined genres (easiest to do) or comparative genres generated at that moment (not as easy to do).
Your example of comparing "hello" to "[H|h]ello" and ".{0,20}ello" won't really be helped by this solution. A simple case where this could be useful would be: if you had 1000 tests that would only return true if "ello" exists somewhere in the string and your test string is "goodbye;" you would only have to do the one test on "ello" and know that the 1000 tests requiring it won't work, and because of this, you won't have to do them.
If you're thinking in terms of "10,000 regexes" you need to shift your though processes. If nothing else, think in terms of "10,000 target strings to match". Then look for non-regex methods built to deal with "boatloads of target strings" situations, like Aho-Corasick machines. Frankly, though, it seems like somethings gone off the rails much earlier in the process than which machine to use, since 10,000 target strings sounds a lot more like a database lookup than a string match.
Aho-Corasick was the answer for me.
I had 2000 categories of things that each had lists of patterns to match against. String length averaged about 100,000 characters.
Main Caveat: The patters to match were all language patters not regex patterns e.g. 'cat' vs r'\w+'.
I was using python and so used https://pypi.python.org/pypi/pyahocorasick/.
import ahocorasick
A = ahocorasick.Automaton()
patterns = [
[['cat','dog'],'mammals'],
[['bass','tuna','trout'],'fish'],
[['toad','crocodile'],'amphibians'],
]
for row in patterns:
vals = row[0]
for val in vals:
A.add_word(val, (row[1], val))
A.make_automaton()
_string = 'tom loves lions tigers cats and bass'
def test():
vals = []
for item in A.iter(_string):
vals.append(item)
return vals
Running %timeit test() on my 2000 categories with about 2-3 traces per category and a _string length of about 100,000 got me 2.09 ms vs 631 ms doing sequential re.search() 315x faster!.
You'd need to have some way of determining if a given regex was "additive" compared to another one. Creating a regex "hierarchy" of sorts allowing you to determine that all regexs of a certain branch did not match
You could combine them in groups of maybe 20.
(?=(regex1)?)(?=(regex2)?)(?=(regex3)?)...(?=(regex20)?)
As long as each regex has zero (or at least the same number of) capture groups, you can look at what what captured to see which pattern(s) matched.
If regex1 matched, capture group 1 would have it's matched text. If not, it would be undefined/None/null/...
If you're using real regular expressions (the ones that correspond to regular languages from formal language theory, and not some Perl-like non-regular thing), then you're in luck, because regular languages are closed under union. In most regex languages, pipe (|) is union. So you should be able to construct a string (representing the regular expression you want) as follows:
(r1)|(r2)|(r3)|...|(r10000)
where parentheses are for grouping, not matching. Anything that matches this regular expression matches at least one of your original regular expressions.
I would recommend using Intel's Hyperscan if all you need is to know which regular expressions match. It is built for this purpose. If the actions you need to take are more sophisticated, you can also use ragel. Although it produces a single DFA and can result in many states, and consequently a very large executable program. Hyperscan takes a hybrid NFA/DFA/custom approach to matching that handles large numbers of expressions well.
I'd say that it's a job for a real parser. A midpoint might be a Parsing Expression Grammar (PEG). It's a higher-level abstraction of pattern matching, one feature is that you can define a whole grammar instead of a single pattern. There are some high-performance implementations that work by compiling your grammar into a bytecode and running it in a specialized VM.
disclaimer: the only one i know is LPEG, a library for Lua, and it wasn't easy (for me) to grasp the base concepts.
I'd almost suggest writing an "inside-out" regex engine - one where the 'target' was the regex, and the 'term' was the string.
However, it seems that your solution of trying each one iteratively is going to be far easier.
You could compile the regex into a hybrid DFA/Bucchi automata where each time the BA enters an accept state you flag which regex rule "hit".
Bucchi is a bit of overkill for this, but modifying the way your DFA works could do the trick.
I use Ragel with a leaving action:
action hello {...}
action ello {...}
action ello2 {...}
main := /[Hh]ello/ % hello |
/.+ello/ % ello |
any{0,20} "ello" % ello2 ;
The string "hello" would call the code in the action hello block, then in the action ello block and lastly in the action ello2 block.
Their regular expressions are quite limited and the machine language is preferred instead, the braces from your example only work with the more general language.
Try combining them into one big regex?
I think that the short answer is that yes, there is a way to do this, and that it is well known to computer science, and that I can't remember what it is.
The short answer is that you might find that your regex interpreter already deals with all of these efficiently when |'d together, or you might find one that does. If not, it's time for you to google string-matching and searching algorithms.
The fastest way to do it seems to be something like this (code is C#):
public static List<Regex> FindAllMatches(string s, List<Regex> regexes)
{
List<Regex> matches = new List<Regex>();
foreach (Regex r in regexes)
{
if (r.IsMatch(string))
{
matches.Add(r);
}
}
return matches;
}
Oh, you meant the fastest code? i don't know then....