Unanchored substring searching: index vs regex? - regex

I am writing some Perl scripts where I need to do a lot of string matching.
For example:
my $str1 = "this is a test string";
my $str2 = "test";
To see if $str1 contains $str2 - I found that there are 2 approaches:
Approach 1:
use Index function:
if ( index($str1, $str2) != -1 ) { .... }
Approach 2:
use regular expression:
if( $str1 =~ /$str2/ ) { .... }
Which is better? and when should we use each of these over the other?

Here is the result of Benchmark:
use Benchmark qw(:all) ;
my $count = -1;
my $str1 = "this is a test string";
my $str2 = "test";
my $str3 = qr/test/;
cmpthese($count, {
'type1' => sub { if ( index($str1, $str2) != -1 ) { 1 } },
'type2' => sub { if( $str1 =~ $str3 ) { 1 } },
});
Result (when a match happens):
Rate type2 type1
type2 1747627/s -- -70%
type1 5770465/s 230% --
To be able to draw a conclusion, test not to match:
my $str2 = "text";
my $str3 = qr/text/;
Result (when a match does not happen):
Rate type2 type1
type2 1857295/s -- -67%
type1 5560630/s 199% --
Conclusion:
The index function is much faster than the regexp match.

When I see code that uses index, I usually see an index within an index within an index, etc. There's also more branching too: "if found, look for this; otherwise since not found, look for that." Almost always a single regex would have worked. So, for me, I almost always use a regex unless there's some specific reason I want to use an index.
Unfortunately, most programmers I run into don't read regex well and so for maintainability, the index method should be used more than I do.

If you need a substring match, use index. If you need a regexp match (with special meaning for regexp metacharacters), use =~. A substring match is usually faster, but regexps in Perl are quite well optimized, and simple regexp matches can be surprisingly fast. Benchmark it for yourself.
Since Perl 5.6, Perl is smart enough to recompile the regexp in $str =~ /$str2/ iff $str2 has changed since the last compilation. To fully control when your regexp is compiled, use qr/$str2/. See Does the 'o' modifier for Perl regular expressions still provide any benefit? for q/.../o (obsolete) and qr/.../ (not needed most of the time, but can be useful).

Related

Passing a parameter to a regular expression to match the first letter in a word in perl

So here is what I'm doing. This is for homework, and I know I can't come on here and get you guys to do my homework for me but I'm stuck. We have to use perl (First time ever using it so forgive my stupidity) to make a function $starts_with that takes a parameter $str0 and $prefix. if $str0 starts with $prefix. then the function returns true. if it doesn't then it isn't pretty simple. We have to use regular expressions because that is the whole point of the exercise so here is my code
sub starts_with
{
$str0 = $_[0];
$prefix = $_[1];
if($prefix =~ /^($str0)/)
{
print $str0."\n";
print m/^(prefix)/."\n";
$startsWith = "Y"
}
if ($startsWith eq "Y")
{
print $str0." starts with ".$prefix."\n";
}
else
{
print $str0." does not start with ".$prefix."\n";
}
}
I'm almost ashamed to put this up here because I have no Idea what I'm doing yet. But I am trying to learn. I don't know how to do true false in perl thats why I have the $startsWith variable. you can fix that if you want. the part I need to fix is the line
if(str0 =~ /^($prefix)/)
I also need to find out how to refer to the first letter in str0...I think
A couple points without giving away the answer:
1) Arguments to functions are passed in a special variable called #_, which is what you are accessing when you say $_[0] and $_[1], but can be written much more concisely by assigned the argument list (#_) to your variables in list context
sub starts_with {
my ($str0, $prefix) = #_;
...
}
2) This statement: if($prefix =~ /^($str0)/) tests the exact opposite condition you are trying to prove. It says does the prefix start with the value of the variable $str0. What you really want to test is if $str0 starts with $prefix.
It might also be using to prefix your pattern with m flag, m/PATTERN which means match this pattern.
3) You don't have a return statement in your function, (As #M42 points out) the result of the last expression is returned; that expression being print will return true. You probably want to return true or false explicity.
See if you can use this to get started.
What I would do :
use Modern::Perl; # or use strict; use warnings; use feature qw/say/;
sub starts_with {
# better use #_, the default array instead of just elements of them
# ...like $_[0]
my ($str, $pref) = #_;
# very short expression, the pattern matching return a boolean.
# \Q\E is there to treat the prefix as-is (no metacharacters)
return $str =~ /^\Q$pref\E/;
}
# using our function
if (starts_with("foobar", "f")) {
say "TRUE";
}
else {
say "FALSE";
}
Golfing it a bit...
sub starts_with { $_[0] =~ /^\Q$_[1]/ }
Don't hand that version in though :-)

Regex performance: validating alphanumeric characters

When trying to validate that a string is made up of alphabetic characters only, two possible regex solutions come to my mind.
The first one checks that every character in the string is alphanumeric:
/^[a-z]+$/
The second one tries to find a character somewhere in the string that is not alphanumeric:
/[^a-z]/
(Yes, I could use character classes here.)
Is there any significant performance difference for long strings?
(If anything, I'd guess the second variant is faster.)
Just by looking at it, I'd say the second method is faster.
However, I made a quick non-scientific test, and the results seem to be inconclusive:
Regex Match vs. Negation.
P.S. I removed the group capture from the first method. It's superfluous, and would only slow it down.
Wrote this quick Perl code:
#testStrings = qw(asdfasdf asdf as aa asdf as8up98;n;kjh8y puh89uasdf ;lkjoij44lj 'aks;nasf na ;aoij08u4 43[40tj340ij3 ;salkjaf; a;lkjaf0d8fua ;alsf;alkj
a a;lkf;alkfa as;ldnfa;ofn08h[ijo ok;ln n ;lasdfa9j34otj3;oijt 04j3ojr3;o4j ;oijr;o3n4f;o23n a;jfo;ie;o ;oaijfoia ;aosijf;oaij ;oijf;oiwj;
qoeij;qwj;ofqjf08jf0 ;jfqo;j;3oj4;oijt3ojtq;o4ijq;onnq;ou4f ;ojfoqn;aonfaoneo ;oef;oiaj;j a;oefij iiiii iiiiiiiii iiiiiiiiiii);
print "test 1: \n";
foreach my $i (1..1000000) {
foreach (#testStrings) {
if ($_ =~ /^([a-z])+$/) {
#print "match"
} else {
#print "not"
}
}
}
print `date` . "\n";
print "test 2: \n";
foreach my $j (1..1000000) {
foreach (#testStrings) {
if ($_ =~ /[^a-z]/) {
#print "match"
} else {
#print "not"
}
}
}
then ran it with:
date; <perl_file>; date
it isn't 100% scientific, but it gives us a good idea. The first Regex took 10 or 11 seconds to execute, the second Regex took 8 seconds.

Why does my regex fail when the number ends in 0?

This is a really basic regex question but since I can't seem to figure out why the match is failing in certain circumstances I figured I'd post it to see if anyone else can point out what I'm missing.
I'm trying to pull out the 2 sets of digits from strings of the form:
12309123098_102938120938120938
1321312_103810312032123
123123123_10983094854905490
38293827_1293120938129308
I'm using the following code to process each string:
if($string && $string =~ /^(\d)+_(\d)+$/) {
if(IsInteger($1) && IsInteger($2)) { print "success ('$1','$2')"; }
else { print "fail"; }
}
Where the IsInterger() function is as follows:
sub IsInteger {
my $integer = shift;
if($integer && $integer =~ /^\d+$/) { return 1; }
return;
}
This function seems to work most of the time but fails on the following for some reason:
1287123437_1268098784380
1287123437_1267589971660
Any ideas on why these fail while others succeed? Thanks in advance for your help!
This is an add-on to the answers from unicornaddict and ZyX: what are you trying to match?
If you're trying to match the sequences left and right of '_', unicorn addict is correct and your regex needs to be ^(\d+)_(\d+)$. Also, you can get rid of the first qualifier and the 'IsIntrger()` function altogether - you already know it's an integer - it matched (\d+)
if ($string =~ /^(\d+)_(\d+)$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
If you're trying to match the last digit in each and wondering why it's failing, it's the first check in IsInteger() ( if($intger && ). It's redundant anyway (you know it's an integer) and fails on 0 because, as ZyX notes - it evaluates to false.
Same thing applies though:
if ($string =~ /^(\d)+_(\d)+$/) {
print "success ('$1','$2')";
} else {
print "fail\n";
}
This will output success ('8','8') given the input 12309123098_102938120938120938
Because you have 0 at the end of the second string, (\d)+ puts only the last match in the $N variable, string "0" is equivalent to false.
When in doubt, check what your regex is actually capturing.
use strict;
use warnings;
my #data = (
'1321312_103810312032123',
'123123123_10983094854905490',
);
for my $s (#data){
print "\$1=$1 \$2=$2\n" if $s =~ /^(\d)+_(\d)+$/;
# Output:
# $1=2 $2=3
# $1=3 $2=0
}
You probably intended the second of these two approaches.
(\d)+ # Repeat a regex group 1+ times,
# capturing only the last instance.
(\d+) # Capture 1+ digits.
In addition, both in your main loop and in IsInteger (which seems unnecessary, given the initial regex in the main loop), you are testing for truth rather than something more specific, such as defined or length. Zero, for example, is a valid integer but false.
Shouldn't + be included in the grouping:
^(\d+)_(\d+)$ instead of ^(\d)+_(\d)+$
Many people have commented on your regex, but the problem you had in your IsInteger (which you really don't need for your example). You checked for "truth" when you really want to check for defined:
sub IsInteger {
my $integer = shift;
if( defined $integer && $integer =~ /^\d+$/) { return 1; }
return;
}
You don't need most of the infrastructure in that subroutine though:
sub IsInteger {
defined $_[0] && $_[0] =~ /^\d+$/
}

Powershell, how many replacements did you make?

I need to know how many replacements are made by Powershell when using either the -replace operator or Replace() method. Or, if that's not possible, if it made any replacements at all.
For example, in Perl, because the substitution operation returns the number of replacements made, and zero is false while non-zero is true in a boolean context, one can write:
$greeting = "Hello, Earthlings";
if ($greeting ~= s/Earthlings/Martians/) { print "Mars greeting ready." }
However with Powershell the operator and method return the new string. It appears that the operator provides some additional information, if one knows how to ask for it (e.g., captured groups are stored in a new variable it creates in the current scope), but I can't find out how to get a count or success value.
I could just compare the before and after values, but that seems entirely inefficient.
You're right, I don't think you can squeeze anything extra out of -replace. However, you can find the number of matches using Regex.Matches(). For example
> $greeting = "Hello, Earthlings"
> $needle = "l"
> $([regex]::matches($greeting, $needle)).Length # cast explicitly to an array
3
You can then use the -replace operator which uses the same matching engine.
After looking a little deeper, there's an overload of Replace which takes a MatchEvaluator delegate which is called each time a match is made. So, if we use that as an accumulator, it can count the number of replacements in one go.
> $count = 0
> $matchEvaluator = [System.Text.RegularExpressions.MatchEvaluator]{$count ++}
> [regex]::Replace("Hello, Earthlings","l",$matchEvaluator)
> $count
Heo, Earthings
3
Here a complete functional example which preserves the replacement behavior and count the number of matches
$Script:Count = 0
$Result = [regex]::Replace($InputText, $Regex, [System.Text.RegularExpressions.MatchEvaluator] {
param($Match)
$Script:Count++
return $Match.Result($Replacement)
})
None of the above answers are actually do replacement and working in recent PS versions:
James Kolpack - show how to count a removed regex (not replaced);
Kino101 - incomplete answer, variables not defined;
Annarfych - outdated answer, in recent PS version the evaluator count variable need to be global
Here is how you can do a replace and count it:
$String = "Hello World"
$Regex = "l|o" #search for 'l' or 'o'
$ReplaceWith = "?"
$Count = 0
$Result = [regex]::Replace($String, $Regex, { param($found); $Global:Count++; return $found.Result($ReplaceWith) })
$Result
$Count
Result in Powershell 5.1:
He??? W?r?d
5
Version of the script that actually does replace things and not null them:
$greeting = "Hello, earthlings. Mars greeting ready"
$counter = 0
$search = '\s'
$replace = ''
$evaluator = [System.Text.RegularExpressions.MatchEvaluator] {
param($found)
$counter++
Write-Output ([regex]::Replace($found, [regex] $search, $replace))
}
[regex]::Replace($greeting, [regex] $search, $evaluator);
$counter
->
> Hello,earthlings.Marsgreetingready
> 4

Regex default value if not found

I would like to supply my regular expression with a 'default' value, so if the thing I was looking for is not found, it will return the default value as if it had found it.
Is this possible to do using regex?
It sounds like you want some sort of regex syntax that says "if the regexp does not match any part of the given string pretend that it matched the following substring: 'foobar'". Such a feature does not exist in any regexp syntax I've seen.
You'll probably need to something like this:
matched_string = string.find_regex_match(regex);
if(matched_string == null) {
string = "default";
}
(This will of course need to be adjusted to the language you're using)
It's hard to answer this without a specific language, but in Perl at least, something like this works:
$string='hello';
$default = 1234;
($match) = ($string =~ m/(\d+)/ or $default);
print "$match\n";
1234
Not strictly part of the regex, but avoids the extra conditional block.
As far as I know, you can't do that with RegExp`s, at least with Perl Compatible Regular Expressions.
You can see by your self here.
Here's what I did in javascript...
function match(regx, str, dflt, index = 0) {
if (!str) return dflt
let x = str.match(regx)
return x ? x[index] || dflt : dflt
}