Character match count between strings in Perl - regex

I have a string (say string 1) that needs to be matched to another string (string2). Both the strings will have the same length and are case in-sensitive.
I want to print the number of character matches between both the strings.
E.g.: String 1: stranger
String 2: strangem
Match count = 7
I tried this:
$string1 = "stranger";
$string2 = "strangem";
my $count = $string1 =~ m/string2/ig;
print "$count\n";
How can I fix this?

Exclusive or, then count the null characters (where the strings were the same):
my $string1 = "stranger";
my $string2 = "strangem";
my $count = ( lc $string1 ^ lc $string2 ) =~ tr/\0//;
print "$count\n";
I missed the "case in-sensitive" bit.

You can use substr for that:
#!/usr/bin/perl
use warnings;
use strict;
my $string1=lc('stranger');
my $string2=lc('strangem');
my $count=0;
for (0..length($string1)-1) {
$count++ if substr($string1,$_,1) eq substr($string2,$_,1);
}
print $count; #prints 7
Or you can use split to get all characters as an array, and loop:
#!/usr/bin/perl
use warnings;
use strict;
my $string1=lc('stranger');
my $string2=lc('strangem');
my $count=0;
my #chars1=split//,$string1;
my #chars2=split//,$string2;
for (0..$#chars1) {
$count++ if $chars1[$_] eq $chars2[$_];
}
print $count; #prints 7
(fc gives more accurate results than lc, but I went for backwards compatibility.)

Not tested
sub cm
{
my #a = shift;
my #b = shift;
# First match prefix of string:
my $n = 0;
while ($n < $#a && $n < $#b && $a[$n] eq $b[$n]) {
++$n;
}
# Then skip one char on either side, and recurse.
if ($n < $#a && $n < $#b) {
# Match rest by skipping one place:
my $n2best = 0;
my $n2a = cm(splice(#a, $n), splice(#b, $n + 1));
$n2best = $n2a;
my $n2b = cm(splice(#a, $n + 1), splice(#b, $n));
$n2best = $n2b if $n2b > $n2best;
my $n2c = cm(splice(#a, $n + 1), splice(#b, $n + 1));
$n2best = $n2c if $n2c > $n2best;
$n += $n2best;
}
return $n;
}
sub count_matches
{
my $a = shift;
my $b = shift;
my #a_chars = split //, $a;
my #b_chars = split //, $b;
return cm(#a_chars, #b_chars);
}
print count_matches('stranger', 'strangem')

Related

matched count for characters in two variables

Iam trying to match two variables and get the count of characters that are matched.
For ex:
$name1 = 'cat';
$name2 = 'dcat';
result needs to be 3.
Tried this but always prints 1
$count = 0;
$count++ while ($name1 =~ /$name2/g);
print "$count\n";
$count = 0;
$matchstr = "";
foreach (split(//,$name1)){
$matchstr .= $_;
$count++ if $name2 =~ /$matchstr/;
}
print "No of matched characters are: $count \n ";

Find n occurrences from group of characters

Given a string, I am suppose to print "two" if i find exactly two characters from the group xyz.
Given jxyl print two
Given jxyzl print nothing
Given jxxl print two
I am very new to perl so this is my approach.
my $word = "jxyl";
#char = split //, $word;
my $size = $#char;
for ( $i = 0; $i < $size - 1; $i++ ) {
if ( $char[i] eq "x" || $char[i] eq "y" || $char eq "z" ) {
print "two";
}
}
Can anyone tell me why this is isn't working correctly?
From the FAQ:
perldoc -q count
How can I count the number of occurrences of a substring within a string?
use warnings;
use strict;
while (<DATA>) {
chomp;
my $count = () = $_ =~ /[xyz]/g;
print "$_ two\n" if $count == 2;
}
__DATA__
jxyl
jxyzl
jxxl
Outputs:
jxyl two
jxxl two
You basically want to count the number of specific characters in a string.
You can use tr:
#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>) {
chomp;
my $count = $_ =~ tr/xyz//;
print "$_ - $count\n";
}
__DATA__
jxyl
jxyzl
jxxl
Outputs:
jxyl - 2
jxyzl - 3
jxxl - 2
Determining if there are exactly 2 can be done after the counting.
Definitely not the best way to do it, but here is a regex for fun and to show there is more than one way to do things.
perl -e'$word = "jxyl"; print "two" if $word =~ /^[^xyz]*[xyz][^xyz]*[xyz][^xyz]*$/'

How to find the largest repeating string with overlap in a line

I have a series of lines such as
my $string = "home test results results-apr-25 results-apr-251.csv";
#str = $string =~ /(\w+)\1+/i;
print "#str";
How do I find the largest repeating string with overlap which are separated by whitespace?
In this case I'm looking for the output :
results-apr-25
It looks like you need the String::LCSS_XS which calculates Longest Common SubStrings. Don't try it's Perl-only twin brother String::LCSS because there are bugs in that one.
use strict;
use warnings;
use String::LCSS_XS;
*lcss = \&String::LCSS_XS::lcss; # Manual import of `lcss`
my $var = 'home test results results-apr-25 results-apr-251.csv';
my #words = split ' ', $var;
my $longest;
my ($first, $second);
for my $i (0 .. $#words) {
for my $j ($i + 1 .. $#words) {
my $lcss = lcss(#words[$i,$j]);
unless ($longest and length $lcss <= length $longest) {
$longest = $lcss;
($first, $second) = #words[$i,$j];
}
}
}
printf qq{Longest common substring is "%s" between "%s" and "%s"\n}, $longest, $first, $second;
output
Longest common substring is "results-apr-25" between "results-apr-25" and "results-apr-251.csv"
my $var = "home test results results-apr-25 results-apr-251.csv";
my #str = split " ", $var;
my %h;
my $last = pop #str;
while (my $curr = pop #str ) {
if(($curr =~/^$last/) || $last=~/^$curr/) {
$h{length($curr)}= $curr ;
}
$last = $curr;
}
my $max_key = max(keys %h);
print $h{$max_key},"\n";
If you want to make it without a loop, you will need the /g regex modifier.
This will get you all the repeating string:
my #str = $string =~ /(\S+)(?=\s\1)/ig;
I have replaced \w with \S (in your example, \w doesn't match -), and used a look-ahead: (?=\s\1) means match something that is before \s\1, without matching \s\1 itself—this is required to make sure that the next match attempt starts after the first string, not after the second.
Then, it is simply a matter of extracting the longest string from #str:
my $longest = (sort { length $b <=> length $a } #str)[0];
(Do note that this is a legible but far from being the most efficient way of finding the longest value, but this is the subject of a different question.)
How about:
my $var = "home test results results-apr-25 results-apr-251.csv";
my $l = length $var;
for (my $i=int($l/2); $i; $i--) {
if ($var =~ /(\S{$i}).*\1/) {
say "found: $1";
last;
}
}
output:
found: results-apr-25

How to match sequence group?

say, the given string is abcwhateverdefwhatever34567whatever012 How to match those group which are in sequence like match abc, def, 34567,012?
the regex i have now is (.)\1{2,} but it matches the same characters but not in sequence
If you're still looking for PHP code.
function getSequence($str) {
$prev = 0; $next = 0; $length = strlen($str);
$temp = "";
for($i = 0; $i < $length; $i++) {
$next = ord($str[$i]);
if ($next == $prev + 1) {
$temp .= $str[$i];
} else {
if (strlen($temp) > 1) $result[] = $temp;
$temp = $str[$i];
}
$prev = $next;
}
if (strlen($temp) > 1) $result[] = $temp;
return $result;
}
$str = "abcwhateverdefwhatever34567whatever012";
print_r(getSequence($str));
Here's a solution that solves the problem with regex. It's not very efficient though and I wouldn't recommend it.
from re import findall, X
text = "abcwhateverdefwhatever34567whatever012"
reg = r"""
(?:
(?:0(?=1))|
(?:(?<=0)1)|(?:1(?=2))|
(?:(?<=1)2)|(?:2(?=3))|
(?:(?<=2)3)|(?:3(?=4))|
(?:(?<=3)4)|(?:4(?=5))|
(?:(?<=4)5)|(?:5(?=6))|
(?:(?<=5)6)|(?:6(?=7))|
(?:(?<=6)7)|(?:7(?=8))|
(?:(?<=7)8)|(?:8(?=9))|
(?:(?<=8)9)|
(?:a(?=b))|
(?:(?<=a)b)|(?:b(?=c))|
(?:(?<=b)c)|(?:c(?=d))|
(?:(?<=c)d)|(?:d(?=e))|
(?:(?<=d)e)|(?:e(?=f))|
(?:(?<=e)f)
){1,}
"""
print findall(reg, text, X)
The result is:
['abc', 'def', '34567', '012']
As you can see I only added the numbers and the first 6 letters in the alphabet. It's should be fairly obvious how to continue.

Replace only up to N matches on a line

In Perl, how to write a regular expression that replaces only up to N matches per string?
I.e., I'm looking for a middle ground between s/aa/bb/; and s/aa/bb/g;. I want to allow multiple substitutions, but only up to N times.
I can think of three reliable ways. The first is to replace everything after the Nth match with itself.
my $max = 5;
$s =~ s/(aa)/ $max-- > 0 ? 'bb' : $1 /eg;
That's not very efficient if there are far more than N matches. For that, we need to move the loop out of the regex engine. The next two methods are ways of doing that.
my $max = 5;
my $out = '';
$out .= $1 . 'bb' while $max-- && $in =~ /\G(.*?)aa/gcs;
$out .= $1 if $in =~ /\G(.*)/gcs;
And this time, in-place:
my $max = 5;
my $replace = 'bb';
while ($max-- && $s =~ s/\G.*?\Kaa/$replace/s) {
pos($s) = $-[0] + length($replace);
}
You might be tempted to do something like
my $max = 5;
$s =~ s/aa/bb/ for 1..$max;
but that approach will fail for other patterns and/or replacement expressions.
my $max = 5;
$s =~ s/aa/ba/ for 1..$max; # XXX Turns 'aaaaaaaa'
# into 'bbbbbaaa'
# instead of 'babababa'
And of course, starting from the beginning of the string every time could be expensive.
What you want is not posible in regular expressions. But you can put the replacement in a for-loop:
my $i;
my $aa = 'aaaaaaaaaaaaaaaaaaaa';
for ($i=0;$i<4;$i++) {
$aa =~ s/aa/bb/;
}
print "$aa\n";
result:
bbbbbbbbaaaaaaaaaaaa
You can use the /e flag which evaluates the right side as an expression:
my $n = 3;
$string =~ s/(aa)/$n-- > 0 ? "bb" : $1/ge;
Here's a solution using the /e modifier, with which you can use
perl code to generate the replacement string:
my $count = 0;
$string =~ s{ $pattern }
{
$count++;
if ($count < $limit ) {
$replace;
} else {
$&; # faking a no-op, replacing with the original match.
}
}xeg;
With perl 5.10 or later you can drop the $& (which has weird
performance complications) and use ${^MATCH} via the /p modifier
$string =~ s{ $pattern }
{
$count++;
if ($count < $limit ) {
$replace;
} else {
${^MATCH};
}
}xegp;
It's too bad you can't just do this, but you can't:
last if $count >= $limit;