Time complexity of regex and Allowing jitter in pattern finding - regex

To find patterns in a string, I have the following code. In it, find.string finds the substring of maximum length subject to (1) the substring must be repeated consecutively at least th times and (2) the substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for (k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th - 1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length") - 1) else ""
}
An example for the above code: for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab".
NOW I am trying to find patterns allowing a jitter of 1 character. Example: for the string "a0cc0vaaaabaaadbaaabbaa00bvw" the pattern should come out to be "aaajb", where "j" can be anything. Can anyone suggest a modification of the above code, or any new code for pattern finding, that could allow such jitter?
Also, can anyone shed some light on the TIME COMPLEXITY and the INTERNAL ALGORITHM used by the regexpr function?
Thanks! :)

Not very efficient but tada:
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  found <- FALSE
  for (sublen in len:1) {
    for (inlen in 0:sublen) {
      pat <- paste0("((.{", sublen - inlen, "})(.)(.{", inlen, "}))", reps("(\\2.\\4)", th - 1))
      r <- regexpr(pat, string, perl = TRUE)
      if (attr(r, "capture.length")[1] > 0) {
        found <- TRUE
        break
      }
    }
    if (found) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")[1] - 1) else ""
}

find.string("a0cc0vaaaabaaadbaaabbaa00bvw") # returns "aaaab"
Without any fuzzy matching tool available, I manually check each possibility. I use an inner loop to try different prefix and suffix lengths on either side of the "jitter" character. The prefix is grouped as \2 and the suffix as \4 (the jitter is \3 but I don't use it). Then the repeated part tries to match \2.\4 - that is, the prefix, any new jitter character, and the suffix.
I say not efficient because it's evaluating O(len^2) different patterns, versus O(len) patterns in your code. For large len this might become a problem.
Note that I have multiple groups, and only look at the [1] position. The full r variable has more useful information, for example [1] will be the first part, [5] will be the 2nd part, [6] will be the 3rd part, etc. Also [3] will be the "jitter" character in the 1st part.
Regarding the complexity of the actual regex: it varies a lot. However, often the construction (setup) of a particular regex is vastly more intensive than the actual matching, which is why a single pattern used repeatedly can produce better results than multiple patterns. In truth, this varies a lot based on the pattern and the engine you're using - see the links at the end for more info about complexity.
Regarding how regex works: just a note, this is going to be a very theoretical overview; it's not meant to indicate how any particular regex engine works.
For a more practical overview, there are plenty of sites that cover just enough to know how to use a regex, but not how to build your own engine - for example http://www.regular-expressions.info/engine.html
Regex is what's known as a state machine, specifically a (non-deterministic) finite state automaton (NFA). A very simple, real-world state machine is a lightbulb: it's either on or off, and different inputs can change the state it's in. A regex is much more complex, but (generally) each symbol in the pattern forms a state, and different input can send it to different states. So if you have \d\d\d, there are 3 virtual states that each accept any digit, and any other input goes to a 4th "failure" state. The end result is the end state after all input is 'consumed'.
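As a purely illustrative sketch (not how any production engine is implemented), the \d\d\d machine above can be written as an explicit little state machine in Python:

# Illustrative sketch only: the \d\d\d pattern as an explicit state machine.
# States 0..2 each consume one digit; reaching state 3 after all input is
# consumed means "accept"; any non-digit input means "failure".
def match_three_digits(text):
    state = 0
    for ch in text:
        if state == 3:
            return False          # extra input after the third digit: fail
        if ch.isdigit():
            state += 1            # advance to the next digit state
        else:
            return False          # any non-digit sends us to the failure state
    return state == 3             # accept only if exactly three digits were consumed

print(match_three_digits("123"))   # True
print(match_three_digits("12a"))   # False
print(match_three_digits("1234"))  # False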
Perhaps you can imagine: this gets vastly more complicated, with many, many states, when you use any ambiguity, such as wildcards or alternation. So our \d\d\d regex will basically be linear. But a more complicated one will not be. Part of the optimization in a regex engine is converting an NFA to a DFA - a deterministic finite state automaton. Here, the ambiguity is removed, generating many more states, and this is the very computationally complex process referenced above (the construction stage).
This is really just a very theoretical overview of an ideal NFA. In practice, modern regex grammars can do a lot more than this; for example, backreferences are not technically possible in a "proper" regular expression.
This might be a bit too high-level, but that's the basic idea. If you're curious, there are plenty of good articles about regex, different flavors, and their complexity. For example: http://swtch.com/~rsc/regexp/regexp1.html

There are basically two regex algorithm types: Perl-style (with a lot of complex backtracking) and Thompson NFA.
http://swtch.com/~rsc/regexp/regexp1.html
To determine which engine R uses, R's svn repo is here:
* root repo: http://svn.r-project.org/R/
* http://svn.r-project.org/R/branches/R-exp-uncmin/src/regex
I poked around in there a bit and found a file called "engine.c". At first glance it doesn't look like a Thompson NFA, but I didn't take long to read it.
At any rate, the first link goes in depth into the complexity question in general and should give you a great idea as to how regex parsing works under the hood to boot.

Related

Alternation in regexes seems to be terribly slow in big files

I am trying to use this regex:
my @vulnerabilities = ($g ~~ m:g/\s+("Low"||"Medium"||"High")\s+/);
On chunks of files such as this one - chunks that go from one "sorted" to the next. Each chunk must be a few hundred kilobytes, and all of them together take from 1 to 3 seconds (divided by 32 per iteration).
How can this be sped up?
Inspection of the example file reveals that the strings only occur as a whole line, starting with a tab and a space. From your responses I further gathered that you're really only interested in counts. If that is the case, then I would suggest something like this solution:
my %targets = "\t Low", "Low", "\t Medium", "Medium", "\t High", "High";

my %vulnerabilities is Bag = $g.lines.map: {
    %targets{$_} // Empty
}

dd %vulnerabilities;  # ("Low"=>2877,"Medium"=>54).Bag
This runs in about .25 seconds on my machine.
It always pays to look at the problem domain thoroughly!
This can be simplified a little bit. You use \s+ before and after, but is this necessary? I think you just need to assert a word boundary or a single whitespace character, so you can use
\s("Low"||"Medium"||"High")\s
or you can use \b instead of \s.
The second step is not to use a capturing group; use a non-capturing group instead, because the regex engine wastes time and memory "remembering" groups. So you could try:
\s(?:"Low"||"Medium"||"High")\s
TL;DR I've compared solutions on a recent rakudo, using your sample data. The ugly brute-force solution I present here is about twice as fast as the delightfully elegant solution Liz has presented. You could probably improve times another order of magnitude or more by breaking your data up and parallel processing it. I also discuss other options if that's not enough.
Alternation seems like a red herring
When I eliminated the alternation (leaving just "Low") and ran the code on a recent rakudo, the time taken was about the same. So I think that's a red herring and have not studied that aspect further.
Parallel processing looks promising
It's clear from your data that you could break it up, splitting at some arbitrary line, and then pattern match each piece in parallel, and then combine results.
That could net you a substantial win, depending on various factors related to your system and the data you process.
But I haven't explored this option.
The fastest results I've seen
The fastest results I've seen are with this code:
my %counts;
$g ~~ m:g / "\t " [ 'Low' || 'Medium' || 'High' ] \n { %counts{$/}++ } /;
say %counts.map: { .key.trim, .value }
This displays:
((Low 2877) (Medium 54))
This approach incorporates similar changes to those Michał Turczyn discussed, but pushed harder:
* I've thrown away all capturing, not only not bothering to capture the 'Low' or whatever, but also throwing away all results of the match.
* I've replaced the \s+ patterns with concrete characters rather than character classes. I've done so on the basis that my casual tests with a recent rakudo suggested it's a bit faster.
Going beyond raku's regexes
Raku is designed for full Unicode generality. And its regex engine is extremely powerful. But it looks like your data is just ASCII and your pattern is a typical very simple regex. So you're using a sledgehammer to crack a nut. This shouldn't really matter -- the sledgehammer is supposed to be just fine as a nutcracker too -- but raku's regex engine remains very poorly optimized thus far.
Perhaps this nut is just a simple example and you're just curious about pushing raku's built in regex capabilities to their maximum current performance.
But if not, and you need yet more speed, and the speedups from this or other better solutions in raku, coupled with parallel processing, aren't enough to get you where you need to go, it's worth considering either not using raku or using it with another tool.
One idiomatic way to use raku with another tool is to use an Inline, with the obvious one in this case being Inline::Perl5. Using that you can try perl's fast default built in regex engine or even use its regex plugin capability to plug in a really fast regex engine.
And, given the simplicity of the pattern you're matching, you could even eschew regexes altogether by writing a quick bit of glue to some low-level raw text searching tool (perhaps saving character offsets and then generating corresponding raku match objects from the results).

I need to rework this code so it includes everything that could read "inpatient" - including when misspelt?

I either need to find a way of recoding this to include misspelt versions of "inpatient", or of 'flagging' those that haven't been affected by this.
df1$Admission_Type <- as.character(df1$Admission_Type)
df1$Admission_Type[df1$Admission_Type == "Inpatient"]<-"ip"
df1$Admission_Type[df1$Admission_Type == "inpatient"]<-"ip"
df1$Admission_Type[df1$Admission_Type == "INPATIENT"]<-"ip"
It repeats like this.
To deal with case issues, convert everything to lower case:
df1 <- data.frame(Admission_Type = c("Inpatient", "inpatient", "INPATIENT", "inp", "impatient"), stringsAsFactors = FALSE)
df1$Admission_Type <- tolower(df1$Admission_Type)
Then you can use regular expressions to deal with misspellings. While it's impossible to catch them all, you can use intuition to get close. In my example, I made the (intentional) misspelling of "impatient". You can set up a regular expression to detect this possibly common mistake like so:
grep("^i[nm]pat[ie][ei]nt", df1$Admission_Type, ignore.case = TRUE)
where I allowed the second position to be either an 'n' or 'm', or the 'ie' to be switched at positions 6-7. This returns
[1] 1 2 3 5
You can add other likely misspelled letters at each position. If you search around, there are plenty of tips on how to make this regex more elaborate to allow for missing or extra letters.
Note you can use gsub to do the replacement automatically.
df1$Admission_Type[grepl("inpatient", df1$Admission_Type, ignore.case=TRUE)] = "ip" will cover the cases you listed. @JohnSG's answer shows how to include potential misspellings into the regular expression as well. (You'll probably want to create a new column to store your recodings (at least while you're testing out different options) rather than overwriting the original column of data.)
As @alistaire mentioned, you can use agrep for approximate matching. For example:
x = c("inpatient","Inpatient","Impatient","inpateint")
agrep("inpatient", x, max.dist=2, ignore.case=TRUE)
So, in your case, you could do:
df1$Admission_Type[agrep("inpatient", df1$Admission_Type, max.dist=2, ignore.case=TRUE)] = "ip"
agrep returns the indices of the matching values. max.dist controls how different the actual values can be from the target value and still be considered a match. You'll probably need to test and tweak this to capture misspellings while avoiding incorrect matches.
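agrep's max.dist bounds an edit distance (insertions, deletions, substitutions) between the pattern and the matched text. As a rough illustration of what that distance measures - this is just a sketch in Python, not what agrep runs internally:

def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance: the number of
    # single-character insertions, deletions, or substitutions needed
    # to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(edit_distance("inpatient", "impatient"))  # 1
print(edit_distance("inpatient", "inpateint"))  # 2 (a transposition counts as 2)

Both misspellings are within a distance of 2, which is why max.dist=2 catches them in the agrep example above.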
grepl covers the cases you listed in your question, but for future reference, if you ever do need to match on a number of separate values, you can reduce the amount of code needed by using the %in% function. In your case, that would be:
df1$Admission_Type[df1$Admission_Type %in% c("Inpatient","inpatient","INPATIENT")]<-"ip"

Checking if a string contains an English sentence

As of right now, I decided to take a dictionary and iterate through the entire thing. Every time I see a newline, I make a string containing everything from that newline to the next newline, then I do string.find() to see if that English word is somewhere in there. This takes a VERY long time; each word takes about a quarter to half a second to verify.
It is working perfectly, but I need to check thousands of words a second. I can run several windows, which doesn't affect the speed (multithreading), but it still only checks about 10 a second. (I need thousands.)
I'm currently writing code to pre-compile a large array containing every word in the English language, which should speed it up a lot, but still not get the speed I want. There has to be a better way to do this.
The strings I'm checking will look like this:
"hithisisastringthatmustbechecked"
but most of them contain complete garbage, just random letters.
I can't check for impossible combinations of letters, because that string would be thrown out because of the 'tm' in between 'thatmust'.
You can speed up the search by employing the Knuth–Morris–Pratt (KMP) algorithm.
Go through every dictionary word and build a search table for it. You need to do this only once. Your searches for individual words will then proceed at a faster pace, because the "false starts" will be eliminated.
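A minimal sketch of the KMP idea in Python - the table is the per-word precomputation mentioned above, and the search then scans the text in a single pass (illustrative only):

def kmp_table(word):
    # table[i] = length of the longest proper prefix of word[:i+1] that is
    # also a suffix; this is the precomputation done once per dictionary word.
    table = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = table[k - 1]
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table

def kmp_find(text, word, table):
    # Scan the text once; on a mismatch, fall back using the table instead of
    # re-examining characters, so "false starts" are eliminated.
    k = 0
    for i, ch in enumerate(text):
        while k > 0 and ch != word[k]:
            k = table[k - 1]
        if ch == word[k]:
            k += 1
        if k == len(word):
            return i - k + 1      # index where the word starts
    return -1

word = "check"
print(kmp_find("hithisisastringthatmustbechecked", word, kmp_table(word)))  # 25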
There are a lot of strategies for doing this quickly.
Idea 1
Take the string you are searching and make a copy of each possible substring beginning at some column and continuing through the whole string. Then store each one in an array indexed by the letter it begins with. (If a letter is used twice, store the longer substring.)
So the array looks like this:
a - substr[0] = "astringthatmustbechecked"
b - substr[1] = "bechecked"
c - substr[2] = "checked"
d - substr[3] = "d"
e - substr[4] = "echecked"
f - substr[5] = null // since there is no 'f' in it
... and so forth
Then, for each word in the dictionary, search in the array element indicated by its first letter. This limits the amount of stuff that has to be searched. Plus you can't ever find a word beginning with, say 'r', anywhere before the first 'r' in the string. And some words won't even do a search if the letter isn't in there at all.
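A minimal sketch of Idea 1 in Python (illustrative only; the function names are made up for this example):

def build_index(text):
    # For each letter, keep the suffix starting at its FIRST occurrence;
    # the first occurrence gives the longest stored substring for that letter.
    index = {}
    for pos, ch in enumerate(text):
        if ch not in index:
            index[ch] = text[pos:]
    return index

def contains_word(index, word):
    # Only search inside the suffix for the word's first letter; words whose
    # first letter never appears in the text are rejected without any search.
    suffix = index.get(word[0])
    return suffix is not None and word in suffix

text = "hithisisastringthatmustbechecked"
index = build_index(text)
print(contains_word(index, "string"))   # True
print(contains_word(index, "fox"))      # False ('f' never occurs)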
Idea 2
Expand upon that idea by noting the length of the longest word in the dictionary and getting rid of the letters in those stored strings that are further away than that distance.
So you have this in the array:
a - substr[0] = "astringthatmustbechecked"
But if the longest word in the list is 5 letters, there is no need to keep any more than:
a - substr[0] = "astri"
If the letter is present several times you have to keep more letters. So this one has to keep the whole string because the "e" keeps showing up less than 5 letters apart.
e - substr[4] = "echecked"
You can expand upon this by using the longest words starting with any particular letter when condensing the strings.
Idea 3
This has nothing to do with ideas 1 and 2. It's an idea that you could use instead.
You can turn the dictionary into a sort of regular expression stored in a linked data structure. It is possible to write the regular expression too and then apply it.
Assume these are the words in the dictionary:
arun
bob
bill
billy
body
jose
Build this sort of linked structure. (It's a binary tree, really, represented in such a way that I can explain how to use it.)
a -> r -> u -> n -> *
|
b -> i -> l -> l -> *
|    |              |
|    o -> b -> *    y -> *
|    |
|    d -> y -> *
|
j -> o -> s -> e -> *
The arrows denote a letter that has to follow another letter. So "r" has to be after an "a" or it can't match.
The lines going down denote an option. You have the "a or b or j" possible letters and then the "i or o" possible letters after the "b".
The regular expression looks sort of like: /(arun)|(b(ill(y?))|(o(b|dy)))|(jose)/ (though I might have slipped a paren). This gives the gist of creating it as a regex.
Once you build this structure, you apply it to your string starting at the first column. Try to run the match by checking the alternatives, and if one matches, move forward tentatively and try the letter after the arrow and its alternatives. If you reach the star/asterisk, it matches. If you run out of alternatives, including backtracking, you move to the next column.
This is a lot of work but can, sometimes, be handy.
Side note: I built one of these some time back by writing a program that generated code to run the algorithm directly, instead of having code that walks the binary tree data structure.
Think of each set of vertical-bar options as a switch statement against a particular character column, and each arrow as a nesting. If there is only one option, you don't need a full switch statement, just an if.
That was some fast character matching and really handy for some reason that eludes me today.
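A minimal sketch of that linked structure as a nested-dict trie in Python, with a "*" key playing the role of the asterisk in the diagram (illustrative only):

def build_trie(words):
    # Nested dicts: each level records which letters can follow at this point;
    # the "*" key marks that a complete word ends at this node.
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node["*"] = True
    return trie

def find_words(text, trie):
    # Try the trie at every starting column; follow the arrows while letters
    # match and record a hit whenever a "*" marker is reached.
    hits = []
    for start in range(len(text)):
        node = trie
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break
            if "*" in node:
                hits.append(text[start:pos + 1])
    return hits

trie = build_trie(["arun", "bob", "bill", "billy", "body", "jose"])
print(find_words("xxbillyzzbodyq", trie))  # ['bill', 'billy', 'body']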
How about a Bloom Filter?
A Bloom filter, conceived by Burton Howard Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not; i.e. a query returns either "inside set (may be wrong)" or "definitely not in set". Elements can be added to the set, but not removed (though this can be addressed with a "counting" filter). The more elements that are added to the set, the larger the probability of false positives.
The approach could work as follows: you create the set of words that you want to check against (this is done only once), and then you can quickly run the "in/not-in" check for every sub-string. If the outcome is "not-in", you are safe to continue (Bloom filters do not give false negatives). If the outcome is "in", you then run your more sophisticated check to confirm (Bloom filters can give false positives).
It is my understanding that some spell-checkers rely on Bloom filters to quickly test whether your latest word belongs to the dictionary of known words.
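A toy sketch of the approach in Python (illustrative only - in practice you would use a tuned library implementation and size m and k from the expected number of words and the false-positive rate you can tolerate):

import hashlib

class BloomFilter:
    # Toy Bloom filter: k hash functions derived from sha256 with a salt,
    # setting k bits per added element in a bit array of size m.
    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # False => definitely not in the set; True => probably in the set.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

bf = BloomFilter()
for w in ["this", "is", "a", "string", "that", "must", "be", "checked"]:
    bf.add(w)
print("string" in bf)   # True (probably in the set)
print("xyzzy" in bf)    # False (definitely not in the set)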
This code was modified from How to split text without spaces into list of words?:
from math import log

words = open("english125k.txt").read().split()
wordcost = dict((k, log((i+1)*log(len(words)))) for i,k in enumerate(words))
maxword = max(len(x) for x in words)

def infer_spaces(s):
    """Uses dynamic programming to infer the location of spaces in a string
    without spaces."""

    # Find the best match for the i first characters, assuming cost has
    # been built for the i-1 first characters.
    # Returns a pair (match_cost, match_length).
    def best_match(i):
        candidates = enumerate(reversed(cost[max(0, i-maxword):i]))
        return min((c + wordcost.get(s[i-k-1:i], 9e999), k+1) for k,c in candidates)

    # Build the cost array.
    cost = [0]
    for i in range(1,len(s)+1):
        c,k = best_match(i)
        cost.append(c)

    # Backtrack to recover the minimal-cost string.
    costsum = 0
    i = len(s)
    while i>0:
        c,k = best_match(i)
        assert c == cost[i]
        costsum += c
        i -= k

    return costsum
Using the same dictionary as that answer and testing your string outputs:
>>> infer_spaces("hithisisastringthatmustbechecked")
294.99768817854056
The trick here is finding out what threshold you can use, keeping in mind that using smaller words makes the cost higher (if the algorithm can't find any usable word, it returns inf, since it would split everything into single-letter words).
In theory, I think you should be able to train a Markov model and use that to decide if a string is probably a sentence or probably garbage. There's another question about doing this to recognize words, not sentences: How do I determine if a random string sounds like English?
The only difference for training on sentences is that your probability tables will be a bit larger. In my experience, though, a modern desktop computer has more than enough RAM to handle Markov matrices unless you are training on the entire Library of Congress (which is unnecessary; even 5 or so books by different authors should be enough for very accurate classification).
Since your sentences are mashed together without clear word boundaries, it's a bit tricky, but the good news is that the Markov model doesn't care about words, just about what follows what. So, you can make it ignore spaces, by first stripping all spaces from your training data. If you were going to use Alice in Wonderland as your training text, the first paragraph would, perhaps, look like so:
alicewasbeginningtogetverytiredofsittingbyhersisteronthebankandofhavingnothingtodoonceortwiceshehadpeepedintothebookhersisterwasreadingbutithadnopicturesorconversationsinitandwhatistheuseofabookthoughtalicewithoutpicturesorconversation
It looks weird, but as far as a Markov model is concerned, it's a trivial difference from the classical implementation.
I see that you are concerned about time: Training may take a few minutes (assuming you have already compiled gold standard "sentences" and "random scrambled strings" texts). You only need to train once, you can easily save the "trained" model to disk and reuse it for subsequent runs by loading from disk, which may take a few seconds. Making a call on a string would take a trivially small number of floating point multiplications to get a probability, so after you finish training it, it should be very fast.
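A minimal sketch of the character-bigram version of this idea in Python (illustrative only; the training text and any decision threshold are placeholders you would choose from your own gold-standard data):

import math
from collections import defaultdict

def train_bigrams(text):
    # Count character bigrams in space-stripped training text and turn them
    # into log-probabilities (unseen pairs get a small floor value later).
    text = text.lower().replace(" ", "")
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    model = {}
    for a, following in counts.items():
        total = sum(following.values())
        model[a] = {b: math.log(n / total) for b, n in following.items()}
    return model

def avg_log_prob(s, model, floor=math.log(1e-6)):
    # Average per-bigram log-probability; higher means "more English-like".
    s = s.lower().replace(" ", "")
    scores = [model.get(a, {}).get(b, floor) for a, b in zip(s, s[1:])]
    return sum(scores) / max(len(scores), 1)

training = "alice was beginning to get very tired of sitting by her sister on the bank"
model = train_bigrams(training)
print(avg_log_prob("hithisisastringthatmustbechecked", model))
print(avg_log_prob("qzxvkqjwpqzmvbnxw", model))  # much lower average score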

Regex vs. string:find() for simple word boundary

Say I only need to find out whether a line read from a file contains a word from a finite set of words.
One way of doing this is to use a regex like this:
.*\y(good|better|best)\y.*
Another way of accomplishing this is using a pseudo code like this:
if ( (readLine.find("good") != string::npos) ||
     (readLine.find("better") != string::npos) ||
     (readLine.find("best") != string::npos) )
{
    // line contains a word from a finite set of words.
}
Which way will have better performance? (i.e. speed and CPU utilization)
The regexp will perform better, but get rid of those '.*' parts. They complicate the code and don't serve any purpose. A regexp like this:
\y(good|better|best)\y
will search through the string in a single pass. The algorithm it builds from this regexp will look first for \y, then character 1 (g|b), then character 2 (g => go or b => be), character 3 (go => goo or be => bes|bet), character 4 (go => good or bes => best or bet => bett), etc. Without building your own state machine, this is as fast as it gets.
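Note that \y is the word-boundary escape in some flavors (Tcl and PostgreSQL, for example); in Python's re the equivalent is \b. A compile-once, reuse-per-line version of the check might look like this sketch:

import re

# Compile once, reuse for every line; \b is Python's word-boundary escape.
word_re = re.compile(r"\b(good|better|best)\b")

for line in ["this is the best option", "no match here"]:
    m = word_re.search(line)
    print(bool(m), m.group(1) if m else None)
# True best
# False None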
You won't know which is faster until you've measured, but the issues at stake are:
* The regex implementation, esp. whether it needs to precompile (like Google RE2, POSIX regexes).
* The implementation of string::find.
* The length of the string you're searching in.
* How many strings you're searching in.
My bets are on the regex, but again: you've got to measure to be sure.
Obviously not the second one (using find), since you're running three separate searches (the string may be traversed up to 3 times) instead of one hopefully smart one. If the regex engine works at all like it should (and I suppose it does), then it will probably be at least three times faster.

How to find a formatted number in a string?

If I have a string, and I want to find if it contains a number of the form XXX-XX-XXX, and return its position in the string, is there an easy way to do that?
XXX-XX-XXX can be any number, such as 259-40-092.
This is usually a job for a regular expression. Have a look at the Boost.Regex library for example.
I did this before....
Regular Expression is your superhero, become his friend....
// Javascript
var numRegExp = /^[+]?([0-9- ]+){10,}$/;
if (numRegExp.test("259-40-092")) {
    alert("True - Number found....");
} else {
    alert("False - Not a Number");
}
To give you a position in the string, that will be your homework. :-)
The regular expression in C++ will be...
const char* regExp = "[+]?([0-9- ]+){10,}";
Use Boost.Regex for this instance.
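For illustration, here is the same pattern-and-position idea as a sketch in Python's re (the analogous pattern works with Boost.Regex; the text here is made up):

import re

# \d{3}-\d{2}-\d{3} matches the XXX-XX-XXX shape; search() also reports the position.
text = "customer id 259-40-092 on file"
m = re.search(r"\b\d{3}-\d{2}-\d{3}\b", text)
if m:
    print(m.start(), m.group())   # 12 259-40-092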
If you don't want regexes, here's an algorithm:
Find the first -
LOOP {
Find the next -
If not found, break.
Check if the distance is 2
Check if the 8 characters surrounding the two minuses are digits
If so, found the number.
}
Not optimal, but the scan speed will already be dominated by the cache/memory speed. It can be optimized by considering where the match failed, and how. For instance, if you've got "123-4X-X............", when you find the X you know that you can skip ahead quickly: the '-' preceding the X cannot be the first '-' of a proper number. Similarly, in "123--" you know that the second '-' can't be the first '-' of a number either.
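A literal sketch of that scan in Python (illustrative only; the skip-ahead optimizations described above are left out):

def find_formatted_number(s):
    # Scan for pairs of '-' with two characters between them, then verify the
    # 8 surrounding positions are digits: XXX before, XX between, XXX after.
    i = s.find("-")
    while i != -1:
        j = s.find("-", i + 1)
        if j == -1:
            break
        if j - i == 3:                       # exactly two characters between the minuses
            start, end = i - 3, j + 4        # candidate span "XXX-XX-XXX"
            if start >= 0 and end <= len(s):
                candidate = s[start:end]
                digits = candidate[:3] + candidate[4:6] + candidate[7:]
                if digits.isdigit():
                    return start             # position of the number in the string
        i = j
    return -1

print(find_formatted_number("call 259-40-092 now"))  # 5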