Optimizing a regex filled with '?'

On the stenographic keyboard, there are the keys STKPWHRAO*EUFRPBLGTSDZ. The user presses several keys, then the keys are registered all at once when lifted. It's similar to playing chords on a piano. Example strokes are KAT, TPHOEUGT.
I have a regex which tests for valid steno chords. It can be any number of these keys but they must be in that order. My solution is qr/S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?/ but since this regex gets called hundreds of times, the variable length might be a speed bottleneck. Each step forward in the regex is a bigger and bigger set of possibilities due to all the ?
Is there a faster regex approach to this? I need the regex to fail if keys are out of order.

To check if a string is a valid chord, you'd actually need
/^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s
A simple optimization would be to make sure a match is possible.
/^(?=[STKPWHRAO*EUFBLGDZ])S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s
The next step is to eliminate backtracking. That's where time is being lost.
/
^
(?=[STKPWHRAO*EUFBLGDZ])
S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+ D?+ Z?+
\z
/x
Fortunately, even though S, T, P and R appear twice, backtracking could be completely eliminated without trouble. This should reduce the matching time to virtually nothing.
If even that isn't fast enough, the next step is writing a specialized C function. Starting the regex matching engine is expensive, and completely avoidable with a simple function.
Note that the above optimizations only help when the pattern doesn't match. They should be neutral when the pattern matches. The C function, on the other hand, would help even when the pattern matches.
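For illustration, here's roughly what such a function could look like in plain Perl (this is only a sketch of the logic, with the key order taken from the question; the real win would come from the same loop written in C or XS):
use strict;
use warnings;

# Hypothetical pure-Perl version of the "specialized function" idea: walk the
# chord and the key order together.  The speed win would come from doing this
# same loop in C or XS, but the logic is identical.
my @KEY_ORDER = split //, 'STKPWHRAO*EUFRPBLGTSDZ';

sub is_valid_chord {
    my ($chord) = @_;
    return 0 unless length $chord;        # mirror the (?=.) above: empty is not a chord
    my $pos = 0;                          # position in @KEY_ORDER
    for my $ch (split //, $chord) {
        # skip forward through the key order until this character is found
        $pos++ while $pos < @KEY_ORDER && $KEY_ORDER[$pos] ne $ch;
        return 0 if $pos >= @KEY_ORDER;   # out of order (or not a steno key)
        $pos++;                           # each position may be used only once
    }
    return 1;
}

print is_valid_chord('STAODZ')   ? "valid\n" : "invalid\n";   # valid
print is_valid_chord('STPRSTPR') ? "valid\n" : "invalid\n";   # invalid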
Benchmarks:
use strict;
use warnings;
use feature qw( say );
use Benchmark qw( cmpthese );
my %tests = (
    orig => q{ $s =~ /^(?=.)S?T?K?P?W?H?R?A?O?\*?E?U?F?R?P?B?L?G?T?S?D?Z?\z/s },
    new  => q{ $s =~
        /
            ^
            (?=[STKPWHRAO*EUFBLGDZ])
            S?+ T?+ K?+ P?+ W?+ H?+ R?+ A?+ O?+ \*?+ E?+
            U?+ F?+ R?+ P?+ B?+ L?+ G?+ T?+ S?+ D?+ Z?+
            \z
        /x
    },
);

$_ = 'use strict; use warnings; our $s; ' . $_
    for values %tests;

{ say "Matching:";     local our $s = "STAODZ";   cmpthese(-3, \%tests); }
{ say "Not matching:"; local our $s = "STPRSTPR"; cmpthese(-3, \%tests); }
Output:
Matching:
          Rate    new   orig
new   509020/s     --   -29%
orig  712274/s    40%     --

Not matching:
          Rate   orig    new
orig  158758/s     --   -73%
new   579851/s   265%     --
Which means
matching slowed from 1.40μs to 1.96μs (in this case), and
non-matching sped up from 6.30μs to 1.72μs (in this case).
To check if a string is a sequence of valid chords, you'd simply need
/^[STKPWHRAO*EUFBLGDZ]+\z/
If you want to extract all the chords in a string, I'd start by extracting the sequences matched by the following, then finding the chords within the extracted sequences:
/([STKPWHRAO*EUFBLGDZ]+)/
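For what it's worth, here's a rough sketch of that two-step extraction, assuming a greedy split (a new chord starts as soon as the next key would be out of order) is what you want:
use strict;
use warnings;

# Sketch only: extract runs of steno keys, then peel chords greedily off the
# front of each run using the possessive pattern from above.
my $keys  = qr/[STKPWHRAO*EUFBLGDZ]+/;
my $chord = qr/S?+T?+K?+P?+W?+H?+R?+A?+O?+\*?+E?+U?+F?+R?+P?+B?+L?+G?+T?+S?+D?+Z?+/;

my $input = 'foo KAT bar TPHOEUGT';
my @chords;
while ($input =~ /($keys)/g) {
    my $seq = $1;
    # every character in $seq is a steno key, so each pass consumes at least one
    while ($seq =~ /\G(?=.)($chord)/g) {
        push @chords, $1;
    }
}
print "@chords\n";   # KAT TPHOEUGT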

the variable length might be a speed bottleneck
You shouldn't work like that.
First, write and debug your program;
then, if it isn't fast enough for its purpose, profile your program to find where the bottlenecks are;
then optimise the bottlenecks.
For goodness' sake don't spend ages trying to guess where the bottlenecks are and optimising them before your code is complete, as you will more than likely find that you have guessed wrongly and wasted a lot of time.
In any case, the regex engine is written in C and is pretty damn fast. I doubt very much whether the short pattern that you have written will take a significant amount of time to test
Each step forward in the regex is a bigger and bigger set of possibilities due to all the ?
That isn't true either. At each point in the regex there is only one character to test. The next character in the string either matches it or it doesn't. Either is fine, and the regex engine just goes on to the next step in the pattern. The matching process will be pretty much constant regardless of the string to be matched.

Related

Alternation in regexes seems to be terribly slow in big files

I am trying to use this regex:
my @vulnerabilities = ($g ~~ m:g/\s+("Low"||"Medium"||"High")\s+/);
On chunks of a file such as this one, where each chunk runs from one "sorted" to the next. Each chunk is a few hundred kilobytes, and matching all of them takes from 1 to 3 seconds in total (divided by 32 per iteration).
How can this be sped up?
Inspection of the example file reveals that the strings only occur as a whole line, starting with a tab and a space. From your responses I further gathered that you're really only interested in counts. If that is the case, then I would suggest something like this solution:
my %targets = "\t Low", "Low", "\t Medium", "Medium", "\t High", "High";
my %vulnerabilities is Bag = $g.lines.map: {
    %targets{$_} // Empty
}
dd %vulnerabilities; # ("Low"=>2877,"Medium"=>54).Bag
This runs in about .25 seconds on my machine.
It always pays to look at the problem domain thoroughly!
This can be simplified a little bit. You use \s+ before and after, but is this necessary? I think you just need to assert a word boundary or a single whitespace character, so you can use
\s("Low"||"Medium"||"High")\s
or you can use \b instead of \s.
The second step is not to use a capturing group; use a non-capturing group instead, because the regex engine wastes time and memory "remembering" groups. So you could try:
\s(?:"Low"||"Medium"||"High")\s
TL;DR I've compared solutions on a recent rakudo, using your sample data. The ugly brute-force solution I present here is about twice as fast as the delightfully elegant solution Liz has presented. You could probably improve times another order of magnitude or more by breaking your data up and parallel processing it. I also discuss other options if that's not enough.
Alternation seems like a red herring
When I eliminated the alternation (leaving just "Low") and ran the code on a recent rakudo, the time taken was about the same. So I think that's a red herring and have not studied that aspect further.
Parallel processing looks promising
It's clear from your data that you could break it up, splitting at some arbitrary line, and then pattern match each piece in parallel, and then combine results.
That could net you a substantial win, depending on various factors related to your system and the data you process.
But I haven't explored this option.
The fastest results I've seen
The fastest results I've seen are with this code:
my %counts;
$g ~~ m:g / "\t " [ 'Low' || 'Medium' || 'High' ] \n { %counts{$/}++ } /;
say %counts.map: { .key.trim, .value }
This displays:
((Low 2877) (Medium 54))
This approach incorporates similar changes to those Michał Turczyn discussed, but pushed harder:
I've thrown away all capturing, not only not bothering to capture the 'Low' or whatever, but also throwing away all results of the match.
I've replaced the \s+ patterns with concrete characters rather than character classes. I've done so on the basis that my casual tests with a recent rakudo suggested that's a bit faster.
Going beyond raku's regexes
Raku is designed for full Unicode generality. And its regex engine is extremely powerful. But it looks like your data is just ASCII and your pattern is a typical very simple regex. So you're using a sledgehammer to crack a nut. This shouldn't really matter -- the sledgehammer is supposed to be just fine as a nutcracker too -- but raku's regex engine remains very poorly optimized thus far.
Perhaps this nut is just a simple example and you're just curious about pushing raku's built in regex capabilities to their maximum current performance.
But if not, and you need yet more speed, and the speedups from this or other better solutions in raku, coupled with parallel processing, aren't enough to get you where you need to go, it's worth considering either not using raku or using it with another tool.
One idiomatic way to use raku with another tool is to use an Inline, with the obvious one in this case being Inline::Perl5. Using that you can try perl's fast default built in regex engine or even use its regex plugin capability to plug in a really fast regex engine.
And, given the simplicity of the pattern you're matching, you could even eschew regexes altogether by writing a quick bit of glue to some low-level raw text searching tool (perhaps saving character offsets and then generating corresponding raku match objects from the results).
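As a rough illustration of that last idea (in Perl 5 rather than raku, and with the file name as a placeholder), you could scan for the literals with index and keep character offsets, without involving a regex engine at all:
use strict;
use warnings;

# Sketch only: find every occurrence of each literal with index() and record
# the character offsets; those offsets could later be turned into match objects.
my $text = do {
    open my $fh, '<', 'report.txt' or die "report.txt: $!";
    local $/;               # slurp the whole file
    <$fh>;
};

my %offsets;
for my $needle ("\t Low", "\t Medium", "\t High") {
    my $from = 0;
    while ((my $at = index($text, $needle, $from)) != -1) {
        push @{ $offsets{$needle} }, $at;
        $from = $at + length $needle;
    }
}

for my $needle (sort keys %offsets) {
    printf "%s => %d hits\n", $needle =~ s/^\s+//r, scalar @{ $offsets{$needle} };
}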

Matching the IP using regular expression

set ip 10.10.
if {[regexp {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])\.?){4}$} $ip match]} {
    puts $match
}
The above pattern matches 10.10. Can anyone tell me how this is happening?
First, using a regular expression to check ip addresses is extremely fragile and unnecessarily complex, and you still have to do the heavy lifting yourself. Instead, use the Tcllib ip package.
package require ip
If you want to know if a given string is an IPv4 address, just check with
::ip::is 4 $str ;# 1 if valid ipv4, 0 otherwise
or
::ip::version $str ;# returns 4 or 6 for ipv4 or ipv6, -1 otherwise
The commands in the package also handle address strings that aren't dotted decimal.
The package isn't included in all distributions, but can be installed using teacup install or by downloading the files and sourcing them into the script.
To answer the question: the original asker has one error and one problem. The error is that the regular expression used to match the ip address also matches strings that aren't ip addresses. This is one of the most common problems when using regular expressions. The reason and the fix are addressed in other answers to the question. To recap: Captain noted that since the original regular expression makes the dot optional, the string 10.10. can be matched as "1", "0.", "1", "0.". There are several possible solutions: {^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.|$)){4}$} as suggested by the same Captain seems valid, but may turn out to have more problems if tested.
The main problem is that a non-trivial regular expression is used to match the address. For all but the most trivial regular expressions, rigorous testing must be performed to ensure that they don't produce false positives. This testing is usually impractical to make exhaustive, which means that you can't know for sure if it works until an angry customer tells you it doesn't. When a case of false positive match is found, the solution is either to drop the regular expression and try another method, or alternatively to make the regular expression more complex in order to make the match more strict. At this point, the test suite may also have to grow.
A better way is to step back and look for other solutions. If there is a standard library function for it, that should be used. If we imagine there is none in this case, simply reflecting on the most basic formulation of an ipv4 decimal-dot address ("four groups of integers from 0 to 255, joined by dots") suggests some simple and safe functions:
proc isOctet n {
    expr {[string is integer -strict $n] && 0 <= $n && $n <= 255}
}

proc splitIpv4dd1 str {
    split $str .
}

proc splitIpv4dd2 str {
    scan $str %d.%d.%d.%d
}

proc splitIpv4dd3 str {
    lrange [regexp -inline {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $str] 1 end
}

# plug any of the preceding splitIpv4ddN functions into this command
proc putsIpv4dd str {
    set count 0
    foreach n [splitIpv4dd1 $str] {
        if {[isOctet $n]} {
            incr count
        }
    }
    if {$count == 4} {puts $str}
}
It is much easier to verify that each of these functions does its job correctly without false negatives or positives, and if they do, the command to print ip addresses can be assumed to work correctly. The third splitting function uses a regular expression, but in this case it's a trivial one without alternatives and optional atoms.
One important goal when writing robust and maintainable code is to keep functions cohesive and clear-cut without loopholes or irregularities. Matching with non-trivial regular expressions runs counter to this.
I certainly understand and actually applaud the wish to understand what went wrong, but the correct conclusion to draw from this is that regular expression matching isn't a good method to use in this case.
You can try to use this regex:
^(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])$
To answer "how this is happening" - ´.´ optional, it finds 1, 0., 1, 0.
And the answer to the unasked question
The expression below only allows a dot when it is followed by another digit (or requires the end of the string instead), so a trailing dot no longer matches:
^(([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])(\.(?=[0-9])|$)){4}$
Please remember that the original question was asking "how is this happening" - i.e. understanding the regular expression behaviour... NOTHING about how to change the regex or how this should be done...

Time complexity of regex and Allowing jitter in pattern finding

To find patterns in a string, I have the following code. In it, find.string finds the substring of maximum length subject to (1) the substring must be repeated consecutively at least th times and (2) the substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for (k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th - 1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length") - 1) else ""
}
An example for the above mentioned code: for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab".
NOW I am trying to get patterns allowing jitter of 1 character. Example: for the string "a0cc0vaaaabaaadbaaabbaa00bvw" the pattern should come out to be "aaajb" where "j" can be anything. Can anyone suggest a modification of the above mentioned code or any new code for pattern finding, that could allow such jitters?
Also can anyone throw some light on the TIME COMPLEXITY and INTERNAL ALGORITHM used for the regexpr function ?
Thanks! :)
Not very efficient but tada:
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times

find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  found <- FALSE
  for (sublen in len:1) {
    for (inlen in 0:sublen) {
      pat <- paste0("((.{", sublen - inlen, "})(.)(.{", inlen, "}))", reps("(\\2.\\4)", th - 1))
      r <- regexpr(pat, string, perl = TRUE)
      if (attr(r, "capture.length")[1] > 0) {
        found <- TRUE
        break
      }
    }
    if (found) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")[1] - 1) else ""
}
find.string("a0cc0vaaaabaaadbaaabbaa00bvw"); # returns "aaaab"
Without any fuzzy matching tool available, I manually check each possibility. I use an inner loop to try different prefix and suffix lengths on either side of the "jitter" character. The prefix is grouped as \2 and the suffix as \4 (the jitter is \3 but I don't use it). Then, the repeated part tries to match \2.\4 - so the prefix, any new jitter character, and the suffix.
I say not efficient because it's evaluating O(len^2) different patterns, versus O(len) patterns in your code. For large len this might become a problem.
Note that I have multiple groups, and only look at the [1] position. The full r variable has more useful information, for example [1] will be the first part, [5] will be the 2nd part, [6] will be the 3rd part, etc. Also [3] will be the "jitter" character in the 1st part.
Regarding the complexity of the actual regex: it varies a lot. However, often the construction (setup) of a particular regex is vastly more intensive than the actual matching, which is why a single pattern used repeatedly can produce better results than multiple patterns. In truth, this varies a lot based on the pattern and the engine you're using - see links at the end for more info about complexity.
Regarding how regex works: just a note, this is going to be a very theoretical overview; it's not meant to indicate how any particular regex engine works.
For a more practical overview, there are plenty of sites that cover just enough to know how to use a regex, but not how to build your own engine, for example http://www.regular-expressions.info/engine.html
Regex is what's known as a state machine, specifically a (non-deterministic) finite state automaton (NFA). A very simple, real-world state machine is a lightbulb: it's either on or off, and different inputs can change the state it's in. A regex is much more complex: (generally) each symbol in the pattern forms a state, and different input can send it to different states. So if you have \d\d\d, 3 virtual states each accept any digit, and any other input goes to a 4th "failure" state. The end result is the end state after all input is 'consumed'.
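As a purely illustrative toy (no real engine is built this way), the \d\d\d automaton just described could be written out by hand like this in Perl:
use strict;
use warnings;

# Toy version of the \d\d\d automaton: states 0..2 each consume one digit,
# reaching state 3 means "accepted", anything else means "failed".
sub match_three_digits {
    my ($input) = @_;
    my $state = 0;
    for my $ch (split //, $input) {
        if ($state < 3 && $ch ge '0' && $ch le '9') {
            $state++;        # consume one digit, move to the next state
        } else {
            return 0;        # any other input sends us to the failure state
        }
    }
    return $state == 3;      # accept only if exactly three digits were consumed
}

print match_three_digits('123') ? "accepted\n" : "rejected\n";   # accepted
print match_three_digits('12a') ? "accepted\n" : "rejected\n";   # rejected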
Perhaps you can imagine: this gets vastly more complicated, with many many states, when you use any ambiguity, such as wildcards or alternation. So our \d\d\d regex will basically be linear. But more complicated ones will not be. Part of the optimization in a regex engine is converting an NFA to a DFA - a deterministic finite state automaton. Here, the ambiguity is removed, generating many more states, and this is the very computationally complex process referenced above (the construction stage).
This is really just a very theoretical overview of an ideal NFA. In practice, modern regex grammars can do a lot more than this; for example, back-references are not technically possible in a "proper" regex.
This might be a bit too high-level, but that's the basic idea. If you're curious, there are plenty of good articles about regex, different flavors, and their complexity. For example: http://swtch.com/~rsc/regexp/regexp1.html
There are basically two regex algorithm types, Perl-style (with a lot of complex backtracking) and Thompson-NFA.
http://swtch.com/~rsc/regexp/regexp1.html
To determine which engine R uses, R's svn repo is here:
root repo: http://svn.r-project.org/R/
regex sources: http://svn.r-project.org/R/branches/R-exp-uncmin/src/regex
I poked around in there a bit and found a file called "engine.c". On first glance it doesn't look like a Thompson-NFA, but I didn't take long to read it.
At any rate, the first link goes in depth into the complexity question in general and should give you a great idea as to how regex parsing works under the hood to boot.

Hunspell/Aspell data conversion to human-readable inflection list

Is there an easy way to generate a human-readable inflection list from Hunspell/Aspell dictionary data files?
For example, I'd like to generate the following outputs (for different languages):
...
book, books
book, books, booked, booking
...
go, goes, went, gone, going
...
I looked at the Hunspell/Aspell docs, but couldn't find an API call that would do this.
There is a method the command-line tool uses, but it doesn't output quite the format you're looking for. You could also do this manually if you wanted, just by some simple scripting with regexes.
The format for each set of affixes is
TYPE TAG REMOVE REPLACE MATCH
Where TAG matches what follows the / in a given word in the .dic file, you can do the following (presuming you've already stripped the word of the /...):
$word =~ s/$remove$/$replace/ if $word =~ /$match$/;
Notice the $ there matching the end-of-line/word. Adjust with ^ if it's a prefix. (A small Perl sketch putting this together appears after the caveats below.)
There are three caveats:
The $match taken directly from the .aff file is in almost all cases equivalent to standard regex. There are minor variations, such that if the match is something like [abc-gh], you'd do better to change it to (a|b|c|-|g|h) or [abcgh-] (hunspell doesn't use the hyphen as a metacharacter), otherwise it'll be interpreted as [abcdefgh] (standard regex). For a negated character class, your options are to manually move the - to the end of the expression (e.g. [^a-df] to [^adf-]) or to use negative lookbehinds.
If $replace is 0, then you should change it to an empty string.
If your result ends with /..., you need to reprocess it again because it has a double affix.
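Putting the procedure and the caveats together, a minimal sketch might look like the following in Perl. The sample rules are invented and only resemble real .aff entries; prefix rules, flag types and continuation classes are all glossed over.
use strict;
use warnings;

# Minimal sketch of applying suffix rules to a dictionary entry.
my @aff_rules = (
    # [ TAG, REMOVE, REPLACE, MATCH ]  -- suffix ("SFX") rules only in this toy
    [ 'P', 'y', 'ies', '[^aeiou]y' ],
    [ 'P', '0', 's',   '[aeiou]y'  ],
);

sub expand {
    my ($word, $flags) = @_;                    # e.g. ('pony', 'P') from "pony/P"
    my @forms = ($word);
    for my $rule (@aff_rules) {
        my ($tag, $remove, $replace, $match) = @$rule;
        next unless index($flags, $tag) >= 0;   # the rule applies only if its TAG is on the word
        $remove = '' if $remove eq '0';         # caveat 2: "0" means remove nothing
        if ($word =~ /$match$/) {               # caveat 1: MATCH is (almost) a regex, anchored at the end
            (my $form = $word) =~ s/$remove$/$replace/;
            push @forms, $form;
        }
    }
    return @forms;
}

print join(', ', expand('pony', 'P')), "\n";    # pony, ponies
print join(', ', expand('toy',  'P')), "\n";    # toy, toys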
Be careful. By my rough calculations, the dictionary I'm working on could have more than 50 million words being formed (and I wouldn't be surprised if it hits beyond 100 million).

Fast algorithm to extract thousands of simple patterns out of large amounts of text

I want to be able to efficiently match thousands of regexps against GBs of text, knowing that most of these regexps will be fairly simple, like:
\bBarack\s(Hussein\s)?Obama\b
\b(John|J\.)\sBoehner\b
etc.
My current idea is to try to extract out of each regexp some kind of longest substring, then use Aho-Corasick to match these substrings, eliminate most of the regexps, and then match all the remaining regexps combined. Can anyone think of something better?
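Here's a rough Perl 5 sketch of that plan. The "longest substring" of each regexp is hard-coded rather than extracted automatically, and a plain alternation of literals stands in for Aho-Corasick (recent perls compile an alternation of plain literals into a trie internally, which is in the same spirit):
use strict;
use warnings;

my @rules = (
    { literal => 'Obama',   re => qr/\bBarack\s(?:Hussein\s)?Obama\b/ },
    { literal => 'Boehner', re => qr/\b(?:John|J\.)\sBoehner\b/       },
);

# phase 1: one cheap scan for all the literals at once
my $prefilter = join '|', map { quotemeta $_->{literal} } @rules;
$prefilter = qr/($prefilter)/;

while (my $line = <STDIN>) {
    my %seen;
    $seen{$1} = 1 while $line =~ /$prefilter/g;
    next unless %seen;

    # phase 2: run only the full regexps whose literal actually occurred
    for my $rule (@rules) {
        next unless $seen{ $rule->{literal} };
        print "$rule->{literal}: $line" if $line =~ $rule->{re};
    }
}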
You can use (f)lex to generate a DFA, which recognises all the literals in parallel. This might get tricky if there are too many wildcards present, but it works for up to about 100 literals (for a 4-letter alphabet; probably more for natural text). You may want to suppress the default action (ECHO), and only print the line+column numbers of the matches.
[ I assume grep -F does about the same ]
%{
/* C code to be copied verbatim */
#include <stdio.h>
%}
%%
"TTGATTCACCAGCGCGTATTGTC" { printf("#%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again" ); }
"AGGTATCTGCTTCAATCAGCG" { printf("#%d: %d:%s\n", yylineno, yycolumn, "WTF?!" ); }
...
more lines
...
[bd-fh-su-z]+ {;}
[ \t\r\n]+ {;}
. {;}
%%
int main(void)
{
    /* Call the lexer, then quit. */
    yylex();
    return 0;
}
A script like the one above can be generated from text input with awk or any other scripting language.
A slightly smarter implementation than running every regex on every file:
For each regex:
    load the regex into a regex engine
    add the engine to a list of regex engines
For each byte in the file:
    feed the byte to every regex engine
    print results if there are matches
But I don't know of any programs that do this already - you'd have to code it yourself. This also implies you have the RAM to keep the regex state around, and that you don't have any evil regexes.
I'm not sure if you'd blow some regex size limit, but you could just OR them all up together into one giant regex:
((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc))
If you hit some limit, you could do this with chunks of 100 at a time or however many you can manage
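For example, a rough sketch of that chunking approach (assuming @patterns holds the individual regexp strings):
use strict;
use warnings;

# Build one compiled alternation per 100 patterns instead of a single enormous regex.
my @patterns = ( '\bBarack\s(Hussein\s)?Obama\b', '\b(John|J\.)\sBoehner\b' );  # ...thousands more

my @chunks;
while (@patterns) {
    my @batch = splice @patterns, 0, 100;
    my $alt   = join '|', map { "(?:$_)" } @batch;
    push @chunks, qr/$alt/;
}

while (my $line = <STDIN>) {
    for my $chunk (@chunks) {
        if ($line =~ $chunk) {
            print $line;
            last;          # one hit per line is enough to report it
        }
    }
}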
If you need a really fast implementation for some specific case, you can implement the Aho–Corasick automaton yourself. But in most cases the union of all your regexes into a single regex, as recommended earlier, won't be bad either.