Lengthy Perl regex

This may seem like a somewhat odd question, but to the point:
I have a string that I need to search for many, many possible character occurrences in several combinations (so character classes are out of the question). What would be the most efficient way to do this?
I was thinking either stack it into one regex:
if ($txt =~ /^(?:really |really |long | regex here)$/){}
or using several 'smaller' comparisons, but I'd assume this won't be very efficient:
if ($txt =~ /^regex1$/ || $txt =~ /^regex2$/ || $txt =~ /^regex3$/) {}
or perhaps nest several if comparisons.
I will appreciate any extra suggestions and other input on this issue.
Thanks

Ever since way back in v5.9.2, Perl compiles a set of N alternatives like:
/string1|string2|string3|string4|string5|.../
into a trie data structure, and if that is the first thing in the pattern, even uses Aho–Corasick matching to find the start point very quickly.
That means that your match of N alternatives will now run in O(1) time instead of in the O(N) time that this:
if (/string1/ || /string2/ || /string3/ || /string4/ || /string5/ || ...)
will run in.
So you can have O(1) or O(N) performance: your choice.
If you use re "debug" or -Mre=debug, Perl will show these trie structures in your patterns.
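As a minimal sketch of the single-alternation approach, assuming the alternatives are literal strings held in a hypothetical @words list (the names here are illustrative, not from the question), the pattern can be built once and reused:

```perl
use strict;
use warnings;

# Hypothetical list of literal alternatives; replace with your own data.
my @words = qw(string1 string2 string3 string4 string5);

# quotemeta escapes regex metacharacters so each entry matches literally.
my $alternation = join '|', map { quotemeta } @words;

# Compile once; anchored to match the whole string, as in the question.
my $re = qr/^(?:$alternation)$/;

for my $txt ('string3', 'string9') {
    printf "%-8s %s\n", $txt, ($txt =~ $re ? 'matches' : 'does not match');
}
```

Running the script with perl -Mre=debug should dump the compiled pattern, which is one way to check whether a trie was actually built on your Perl version.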

This is no substitute for actually timing it. If possible, though, I would suggest using the /o flag so that Perl doesn't recompile your (large) regex on every evaluation. Of course, this only works if those combinations of characters do not change between evaluations.

I think it depends on how long your regex is. Sometimes it is better to divide very long expressions.

Related

How many regular expressions can I chain together using alternation?

I have some large files (hundreds of MB) that I need to search for several thousand ~20-character unique strings.
I've found that using the pipe alternation metacharacter for matching regular expressions like (string1|string2|string3) speeds up the search process a lot (versus searching for one string at a time).
What's the limit to how well this will scale? How many expressions can I chain together like this? Will it cause some kind of overflow at some point? Is there a better way to do this?
EDIT
In an effort to keep my question brief, I didn't emphasize the fact that I've already implemented code using this alternation approach and I found it to be helpful: On a test case with a typical data set, running time was reduced from 87 minutes down to 18 seconds--a 290x speedup, apparently with O(n) instead of O(n*m).
My question relates to how this approach can be expected to work when other users run this code in the future using much larger data sets with bigger files and more search terms. The original O(n*m) code was existing code that's been in use for 13 years, and its slowness was pointed out recently as the genome-related data sets it operates on have recently gotten much bigger.
If you have a simple regular expression like (word1|word2|...|wordn), the regex engine will construct a state machine that can just pass over the input once to find whether the string matches.
Side-note: in theoretical computer science, "regular expressions" are defined in such a way that a single pass is always sufficient. However, practical regex implementations add features that allow the construction of patterns which can't always be implemented as a single pass (see this example).
Again, for your pattern of regular expressions, the engine will almost certainly use a single pass. That is likely going to be faster than reading the data from memory multiple times ... and almost definitely a lot faster than reading the data multiple times from disk.
If you are just going to have a regular expression of the form (word1|word2|....|wordn), why not just create an associative array (a hash) of booleans? That should be very fast.
EDIT
# before the loop, set up the hash
my %words = (
    cat   => 1,
    dog   => 1,
    apple => 1,
    # .... etc
);
# the loop to check a sentence, one whitespace-separated word at a time
foreach my $aword (split(/ /, $sentence)) {
    print "Found $aword\n" if $words{$aword};
}
There is no theoretical limit to the extent of a regular expression, but practically it must fit within the limits of a specific platform and installation. You must find out empirically whether your plan will work, and I for one would be delighted to see your results.
One thing I would say is that you should compile the expression separately before you go on to use it. Either that or apply the /o option to compile just once (i.e. promise that the contents of the expression won't change). Something like this
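use IO::File;
# quotemeta keeps metacharacters in the search strings literal
my $re = join '|', map { quotemeta } @strings;
foreach my $file (@files) {
    my $fh = IO::File->new($file, '<') or die "Can't open $file: $!";
    while (<$fh>) {
        next unless /\b(?:$re)\b/io;
        chomp;
        print "$_ found in $file\n";
        last;    # stop at the first matching line in this file
    }
}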

Finding a string *and* its substrings in a haystack

Suppose you have a string (e.g. needle). Its 19 distinct contiguous substrings are:
needle
needl eedle
need eedl edle
nee eed edl dle
ne ee ed dl le
n e d l
If I were to build a regex to match, in a haystack, any of the substrings I could simply do:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle|ne|ee|ed|dl|le|n|e|d|l)/
but it doesn't look very elegant. Is there a better way to create a regex that will greedily match any one of the substrings of a given string?
Additionally, what if I posed another constraint, wanted to match only substrings longer than a threshold, e.g. for substrings of at least 3 characters:
/(needle|needl|eedle|need|eedl|edle|nee|eed|edl|dle)/
note: I deliberately did not mention any particular regex dialect. Please state which one you're using in your answer.
As Qtax suggested, the expression
n(e(e(d(l(e)?)?)?)?)?|e(e(d(l(e)?)?)?)?|e(d(l(e)?)?)?|d(l(e)?)?|l(e)?|e
would be the way to go if you wanted to write an explicit regular expression (egrep syntax, optionally replace (...) by (?:...)). The reason why this is better than the initial solution is that the condensed version requires only O(n^2) space compared to O(n^3) space in the original version, where n is the length of the input. Try this with extraordinarily as input to see the difference. I guess the condensed version is also faster with many regexp engines out there.
The expression
nee(d(l(e)?)?)?|eed(l(e)?)?|edl(e)?|dle
will look for substrings of length 3 or longer.
As pointed out by vhallac, the generated regular expressions are a bit redundant and can be optimized. Apart from the proposed Emacs tool, there is a Perl package Regexp::Optimizer that I hoped would help here, but a quick check failed for the first regular expression.
Note that many regexp engines perform non-overlapping search by default. Check this with the requirements of your problem.
I have found an elegant almost-solution, depending on how badly you need a single regexp. For example, here is a regexp (Perl) that finds a common substring of length 7:
"$needle\0$haystack" =~ /(.{7}).*?\0.*\1/s
The matching string is captured in group 1. The strings must not contain the null character, which is used as the separator.
You should write a loop that starts at the length of the needle, counts down to the threshold, and tries to match the regexp at each length.
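The countdown loop described here might be sketched as follows. The helper name longest_common_substring and the sample strings are hypothetical, and the sketch assumes neither string contains a null character:

```perl
use strict;
use warnings;

# Hypothetical helper: longest substring of $needle (at least $min chars)
# that also occurs in $haystack. Neither string may contain "\0".
sub longest_common_substring {
    my ($needle, $haystack, $min) = @_;
    for (my $len = length $needle; $len >= $min; $len--) {
        # $len is interpolated into the quantifier; the captured text
        # must reappear somewhere after the "\0" separator.
        if ("$needle\0$haystack" =~ /(.{$len}).*?\0.*\1/s) {
            return $1;
        }
    }
    return;    # nothing of the required length was found
}

my $found = longest_common_substring('needle', 'a reed in the wind', 3);
print defined $found ? "found: $found\n" : "no common substring\n";
```

Each length costs one regex match with heavy backtracking, so this is a convenience rather than a fast algorithm for long inputs.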
Is there a better way to create a regex that will match any one of the
substrings of a given string?
No. But you can generate such expression easily.
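One way to generate such an expression, as a rough sketch: the helper name substring_regex is hypothetical, and it emits non-capturing groups in the nested style of the condensed expressions quoted earlier in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex source string matching any substring
# of $word that is at least $min characters long.
sub substring_regex {
    my ($word, $min) = @_;
    my @alts;
    for my $start (0 .. length($word) - $min) {
        my $suffix = substr $word, $start;
        my $head   = quotemeta substr($suffix, 0, $min);    # mandatory part
        my @tail   = map { quotemeta } split //, substr($suffix, $min);
        # Nest the remaining characters as optional groups: a(?:b(?:c)?)?
        my $opt = '';
        $opt = "(?:$_$opt)?" for reverse @tail;
        push @alts, $head . $opt;
    }
    return join '|', @alts;
}

print substring_regex('needle', 3), "\n";
print substring_regex('needle', 1), "\n";
```

Because the alternatives are ordered by start position and each one is greedy, a Perl-style engine will take the longest substring available at the first position where any alternative can match.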
Perhaps you're just looking for
.*(.{1,6}).*

Is there a simple regex to compare numbers to x?

I want a regex that will match if a number is greater than or equal to an arbitrary number. This seems monstrously complex for such a simple task... it seems like you need to reinvent 'counting' in an explicit regex hand-crafted for the x.
For example, intuitively to do this for numbers greater than 25, I get
(\d{3,}|[3-9]\d|2[6-9]\d)
What if the number was 512345? Is there a simpler way?
It seems that there is no simpler way. Regex is not the right tool for comparing numbers.
You may try this one:
[1-9]\d{6,}|
[6-9]\d{5}|
5[2-9]\d{4}|
51[3-9]\d{3}|
512[4-9]\d{2}|
5123[5-9]\d|
51234[6-9]
(newlines for clarity)
What if the number was 512345? Is there a simpler way?
No, a regex to match a number in a certain range will be a horrible-looking thing (especially for large number ranges).
Regex is simply not meant for such tasks. The better solution would be to "freely" match the digits, like \d+, and then compare them with the language's relational operators (<, >, ...).
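A minimal sketch of that split of responsibilities, with a hypothetical helper named exceeds: the regex only validates that the input is a run of digits, and the numeric comparison is left to Perl's > operator.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $input is all digits and greater than $x.
sub exceeds {
    my ($input, $x) = @_;
    return $input =~ /\A(\d+)\z/ && $1 > $x;
}

for my $input ('7', '25', '26', '512345', 'abc') {
    printf "%-7s %s\n", $input, (exceeds($input, 25) ? 'qualifies' : 'does not qualify');
}
```

Changing the threshold to 512345 is then a matter of passing a different second argument, with no pattern to rewrite.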
In Perl you can use the conditional regexp construct (?(condition)yes-pattern) where the (condition) is (?{CODE}) to run arbitrary Perl code. If you make the yes-pattern be (*FAIL) then you have a regexp fragment which succeeds only when CODE returns false. Thus:
use feature 'say';    # say needs Perl 5.10 or later
foreach (0 .. 50) {
    if (/\A(\d+)(?(?{$1 <= 25})(*FAIL))\z/) {
        say "$_ matches";
    }
    else {
        say "$_ does not match";
    }
}
The code-evaluation feature used to be marked as experimental but the latest 'perlre' manual page (http://perldoc.perl.org/perlre.html) seems to now imply it is a core language feature.
Technically, what you have is no longer a 'regular expression' of course, but some hybrid of regexp and external code.
I've never heard of a regex flavor that can do that. Writing a Perl module to generate the appropriate regex (as you mentioned in your comment) sounds like a good idea to me. In fact, I'd be surprised if it hasn't been done already. Check CPAN first.
By the way, your regex contains a few more errors besides the excess pipes Yuriy pointed out.
First, the "three or more digits" portion will match invalid numbers like 024 and 00000007. You can solve that by requiring the first digit to be greater than zero. If you want to allow for leading zeroes, you can match them separately.
The third part, 2[6-9]\d, only matches numbers >= 260. Perhaps you meant to make the third digit optional (i.e. 2[6-9]\d?), but that would be redundant.
You should anchor the regex somehow to make sure you aren't matching part of a longer number or a "word" with digits in it. I don't know the best way to do that in your particular situation, but word boundaries (i.e. \b) will probably be all you need.
End result:
\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b
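A quick way to sanity-check an expression like this is to run it over a handful of boundary values. The sample inputs below are my own choice, not from the question:

```perl
use strict;
use warnings;

# The end-result pattern from above: numbers greater than 25,
# with optional leading zeroes.
my $re = qr/\b0*([1-9]\d{2,}|[3-9]\d|2[6-9])\b/;

for my $n (qw(7 25 26 99 100 024 0026 260)) {
    printf "%-5s %s\n", $n, ($n =~ $re ? 'matches' : 'does not match');
}
```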

efficient way to search the same regular expression on multiple text

I have multiple text fields, each a paragraph of text, and I want to search those fields for a specific pattern using a regular expression, for example:
my $text1 =~/(my pattern)/ig;
my $text2 =~/(my pattern)/ig;
...
my $textn=~/(my pattern)/ig;
I wonder if there is an efficient way to search multiple texts with the same regular expression in Perl, or should I use the above format?
Use a topicaliser.
for ($text1, $text2, $textn) {
    /(my pattern)/ig && do { ... };
}
When you have numbered variables, it's a red flag that you should consider a compound data structure instead. With a simple array it looks nearly the same:
for (@texts) {
    /(my pattern)/ig && do { ... };
}
my $pattern = qr/((?i)my pattern)/;
my @matches;
push @matches, $text1 =~ /$pattern/g;
push @matches, $text2 =~ /$pattern/g;
push @matches, $textn =~ /$pattern/g;
That's about as efficient as I can think of - theoretically pre-compiles the regex once, though I'm not sure if interpolating it back into // to get the 'g' modifier undoes any of that compilation. Of course, I also have to wonder if this is really a bottleneck, and if you're just looking at some premature optimisation.
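Combining the pre-compiled pattern with an array of texts, a sketch might look like this. The @texts contents are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical input; in the question these are $text1 .. $textn.
my @texts = ('My Pattern here', 'nothing', 'my pattern and MY PATTERN');

my $pattern = qr/(my pattern)/i;    # compiled once, case-insensitive

my @matches;
for my $text (@texts) {
    # In list context with /g, every capture from every match is returned.
    push @matches, $text =~ /$pattern/g;
}

print scalar(@matches), " matches: @matches\n";
```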
The answer to this question depends on whether your pattern contains any variables. If it does not, perl is already smart enough to only build the RE once, as long as it's identical everywhere.
Now, if you do use variables, then @Tanktalus's answer is close, but adds unnecessary complexity by compiling the RE an additional time.
Use this:
my @matches;
push @matches, $text1 =~ /((?i)my pattern with a $variable)/o;
push @matches, $text2 =~ /((?i)my pattern with a $variable)/o;
push @matches, $textn =~ /((?i)my pattern with a $variable)/o;
Why?
By using a variable in the RE pattern, perl is forced to re-compile for every instance, even when that variable is a pre-compiled RE as in @Tanktalus's answer. The /o ensures that it is only compiled once, the first time it's encountered, but it still must be compiled once for every occurrence in the code. This is because Perl has no way of knowing if $pattern changed between the different uses.
Other considerations
In practice, as @Tanktalus also said, I suspect this is a big fat case of premature optimization. /o only matters when your pattern contains variables (otherwise perl is smart enough to only compile once anyway!)
The far more useful reason to use a pre-compiled RE, as @Tanktalus has suggested, is to improve code readability. If you have a big hairy RE, then using $pattern everywhere will greatly improve readability, with only a minor cost in performance (one you're not likely to ever notice).
Conclusion
Just use /o for your REs if they contain variables (unless you actually need the variables to change the RE on every run), and don't worry about it otherwise.

Any suggestions for improving (optimizing) existing string substitution in Perl code?

Perl 5.8
I'm looking for improvements to some fairly straightforward string substitutions in an existing Perl script.
The intent of the code is clear, and the code is working.
For a given string, replace every occurrence of a TAB, LF or CR character with a single space, and replace every occurrence of a double quote with two double quotes. Here's a snippet from the existing code:
# replace all tab, newline and return characters with single space
$val01 =~s/[\t\n\r]/ /g;
$val02 =~s/[\t\n\r]/ /g;
$val03 =~s/[\t\n\r]/ /g;
# escape all double quote characters by replacing with two double quotes
$val01 =~s/"/""/g;
$val02 =~s/"/""/g;
$val03 =~s/"/""/g;
Question: Is there a better way to perform these string manipulations?
By "better way", I mean performing them more efficiently, avoiding use of regular expressions (possibly using tr/// to replace the tab, newline and CR characters), or possibly using qr// to avoid recompilation.
NOTE: I've considered moving the string manipulation operations to a subroutine, to reduce the repetition of the regular expressions.
NOTE: This code works, it isn't really broken. I just want to know if there is a more appropriate coding convention.
NOTE: These operations are performed in a loop, a large number (>10000) of iterations.
NOTE: This script currently executes under perl v5.8.8. (The script has a require 5.6.0, but this can be changed to require 5.8.8. Installing a later version of Perl is not currently an option on the production server.)
> perl -v
This is perl, v5.8.8 built for sun4-solaris-thread-multi
(with 33 registered patches, see perl -V for more detail)
Your existing solution looks fine to me.
As for avoiding recompilation, you don't need to worry about that. Perl's regular expressions are compiled only once as it is, unless they contain interpolated expressions, which yours don't.
For the sake of completeness, I should mention that even if interpolated expressions are present, you can tell Perl to compile the regex once only by supplying the /o flag.
$var =~ s/foo/bar/; # compiles once
$var =~ s/$foo/bar/; # compiles each time
$var =~ s/$foo/bar/o; # compiles once, using the value $foo has
# the first time the expression is evaluated
TMTOWTDI
You could use the tr or the index or the substr or the split functions as alternatives. But you must make measurements to identify the best method for your particular system.
You might be prematurely optimizing. Have you tried using a profiler, such as Devel::NYTProf, to see where your program spends the most of its time?
My guess would be that tr/// would be (slightly) quicker than s/// in your first regex. How much faster would, of course, be determined by factors that I don't know about your program and your environment. Profiling and benchmarking will answer that question.
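To put a number on that guess, a small benchmark along these lines could be run on the target machine. This is only a sketch: the $sample string is invented, and the result will depend on your real data and Perl build. It uses the core Benchmark module.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample value; real data will behave differently.
my $sample = "a\tb\nc\rd " x 100;

# Sanity check: both forms should produce the same result.
# With tr///, a replacement list shorter than the search list
# repeats its last character, so all three map to a space.
(my $via_s  = $sample) =~ s/[\t\n\r]/ /g;
(my $via_tr = $sample) =~ tr/\t\n\r/ /;
die "results differ" unless $via_s eq $via_tr;

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, {
    's///'  => sub { (my $v = $sample) =~ s/[\t\n\r]/ /g },
    'tr///' => sub { (my $v = $sample) =~ tr/\t\n\r/ / },
});
```

Note that tr/// only covers the first substitution; doubling the quote characters changes the string length, so that one still needs s///.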
But if you're interested in any kind of improvement to your code, can I suggest a maintainability fix? You run the same substitution (or set of substitutions) on three variables. This means that when you change that substitution, you need to change it three times - and doing the same thing three times is always dangerous :)
You might consider refactoring the code to look something like this:
foreach ($val01, $val02, $val03) {
    s/[\t\n\r]/ /g;
    s/"/""/g;
}
Also, it would probably be a good idea to have those values in an array rather than three such similarly named variables.
foreach (@vals) {
    s/[\t\n\r]/ /g;
    s/"/""/g;
}