Counting occurrences of a word in a string in Perl - regex

I am trying to find out the number of occurrences of "The/the". Below is the code I tried"
print ("Enter the String.\n");
$inputline = <STDIN>;
chop($inputline);
$regex="\[Tt\]he";
if($inputline ne "")
{
#splitarr= split(/$regex/,$inputline);
}
$scalar=#splitarr;
print $scalar;
The string is :
Hello the how are you the wanna work on the project but i the u the
The
The output that it gives is 7. However with the string :
Hello the how are you the wanna work on the project but i the u the
the output is 5. I suspect my regex. Can anyone help in pointing out what's wrong.

I get the correct number - 6 - for the first string
However your method is wrong, because if you count the number of pieces you get by splitting on the regex pattern it will give you different values depending on whether the word appears at the beginning of the string. You should also put word boundaries \b into your regular expression to prevent the regex from matching something like theory
Also, it is unnecessary to escape the square brackets, and you can use the /i modifier to do a case-independent match
Try something like this instead
use strict;
use warnings;
print 'Enter the String: ';
my $inputline = <>;
chomp $inputline;
my $regex = 'the';
if ( $inputline ne '' ) {
my #matches = $inputline =~ /\b$regex\b/gi;
print scalar #matches, " occurrences\n";
}

With split, you're counting the substrings between the the's. Use match instead:
#!/usr/bin/perl
use warnings;
use strict;
my $regex = qr/[Tt]he/;
for my $string ('Hello the how are you the wanna work on the project but i the u the The',
'Hello the how are you the wanna work on the project but i the u the',
'the theological cathedral'
) {
my $count = () = $string =~ /$regex/g;
print $count, "\n";
my #between = split /$regex/, $string;
print 0 + #between, "\n";
print join '|', #between;
print "\n";
}
Note that both methods return the same number for the two inputs you mentioned (and the first one returns 6, not 7).

The following snippet uses a code side-effect to increment a counter, followed by an always-failing match to keep searching. It produces the correct answer for matches that overlap (e.g. "aaaa" contains "aa" 3 times, not 2). The split-based answers don't get that right.
my $i;
my $string;
$i = 0;
$string = "aaaa";
$string =~ /aa(?{$i++})(?!)/;
print "'$string' contains /aa/ x $i (should be 3)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the The";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 6)\n";
$i = 0;
$string = "Hello the how are you the wanna work on the project but i the u the";
$string =~ /[tT]he(?{$i++})(?!)/;
print "'$string' contains /[tT]he/ x $i (should be 5)\n";

What you need is 'countof' operator to count the number of matches:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/[Tt]he/g;
print $count;
If you want to select only the word the or The, add word boundary:
my $string = "Hello the how are you the wanna work on the project but i the u the The";
my $count = () = $string =~/\b[Tt]he\b/g;
print $count;

Related

Perl regex to replace part of one string with a portion of another

I have a need in Perl to replace a section of one string with most of another. :-) This needs be done for multiple pairs of strings.
For example, I need to replace
"/root_vdm_2/fs_clsnymigration"
within
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
with
rfsn_clsnymigration
so that I end up with
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
(without the leading "/root_vdm_2" part) ... but I am sufficiently sleep-deprived to have lost sight of how to accomplish this.
Help ?
Try this regex:
^\/root_vdm_2\/fs_clsnymigration
Substitute with:
\/rfsn_clsnymigration
example:
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1";
$string=~s/^\/root_vdm_2\/fs_clsnymigration/\/rfsn_clsnymigration/;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1
EDIT 1
$string = "/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02";
$string=~s/^\/root_vdm_.\/fs_[^\/]*/\/rfsn_clsnymigration/gm;
print $string;
Output:
/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/rfsn_clsnymigration/users/Marketing,rfsw_users
/rfsn_clsnymigration/sandi_users,rfsw_sandi
/rfsn_clsnymigration/Analytics,rfsw_pci
/rfsn_clsnymigration/camnt01/AV,rfsw_camnt01
/rfsn_clsnymigration/sfa,rfss_stcloud01
/rfsn_clsnymigration/data4,rfss_stcloud03
/rfsn_clsnymigration/depart1,rfss_stcloud02
use strict;
use warnings;
while (<DATA>) {
chomp;
my ($lhs, $rhs) = split(/,/, $_, 2);
my #parts = split(/\//, $lhs);
splice(#parts, 1, 2, $rhs);
print join('/', #parts) . "\n";
}
__DATA__
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
My challenge was to replace part of $string1 with all of $string2, split on the commas.
/root_vdm_2/fs_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU,rfsn_clsnymigration
/root_vdm_2/fs_users/users/Marketing,rfsw_users
/root_vdm_3/fs_sandi/sandi_users,rfsw_sandi
/root_vdm_3/fs_pci/Analytics,rfsw_pci
/root_vdm_4/fs_camnt01/camnt01/AV,rfsw_camnt01
/root_vdm_1/fs_stcloud01/sfa,rfss_stcloud01
/root_vdm_3/fs_stcloud03/data4,rfss_stcloud03
/root_vdm_2/fs_stcloud02/depart1,rfss_stcloud02
The difficulty I saw initially was how to replace /root_vdm_2/fs_clsnymigration with rfsn_clsnymigration, and I allowed myself to think that a regexp was the best approach.
Although far less eloquent, this gets the job done:
foreach $line (#lines) {
chop $line;
($orig,$replica) = split /\,/, $line;
chop substr $orig, 0, 1;
#pathparts = split /\//, $orig;
$rootvdm = shift #pathparts;
#pathparts[0] = $replica;
$newpath = "/" . join ('/', #pathparts);
print " here's \$newpath:$newpath\n";
}
... which yields something like this:
here's $newpath:/rfsn_clsnymigration/CLSNYMIGRATION/NY_HQ_S1/LISU
here's $newpath:/rfsw_users/users/Marketing
here's $newpath:/rfsw_sandi/sandi_users
here's $newpath:/rfsw_pci/Analytics
here's $newpath:/rfsw_camnt01/camnt01/AV
here's $newpath:/rfss_stcloud01/sfa
here's $newpath:/rfss_stcloud03/data4
here's $newpath:/rfss_stcloud02/depart1

How to grep the word between two symbols in Perl?

I need to grep the word between the symbols as shown below in an array.
my $string = "?hi how r u?what is your name?what is your age?";
It's to be converted to array where array should be like:
my $array[0]="hi how r u";
my $array[1]="what is your name";
my $array[2]="what is your age";
To ignore empty results you can match the input with regex and store matched results in an array:
use strict;
use warnings;
my $string = "?hi how r u?what is your name?what is your age?";
my #matches = ( $string =~ /(?<=\?)[^?]+/g );
foreach my $i (#matches) {
print $i . "\n";
}
Output:
hi how r u
what is your name
what is your age
You can use the split function, however you have to escape the ? character, so that it won't get special treatment as part of a regular expression control character.
my #array = split '\\?', $string;

How to find the largest repeating string with overlap in a line

I have a series of lines such as
my $string = "home test results results-apr-25 results-apr-251.csv";
#str = $string =~ /(\w+)\1+/i;
print "#str";
How do I find the largest repeating string with overlap which are separated by whitespace?
In this case I'm looking for the output :
results-apr-25
It looks like you need the String::LCSS_XS which calculates Longest Common SubStrings. Don't try it's Perl-only twin brother String::LCSS because there are bugs in that one.
use strict;
use warnings;
use String::LCSS_XS;
*lcss = \&String::LCSS_XS::lcss; # Manual import of `lcss`
my $var = 'home test results results-apr-25 results-apr-251.csv';
my #words = split ' ', $var;
my $longest;
my ($first, $second);
for my $i (0 .. $#words) {
for my $j ($i + 1 .. $#words) {
my $lcss = lcss(#words[$i,$j]);
unless ($longest and length $lcss <= length $longest) {
$longest = $lcss;
($first, $second) = #words[$i,$j];
}
}
}
printf qq{Longest common substring is "%s" between "%s" and "%s"\n}, $longest, $first, $second;
output
Longest common substring is "results-apr-25" between "results-apr-25" and "results-apr-251.csv"
my $var = "home test results results-apr-25 results-apr-251.csv";
my #str = split " ", $var;
my %h;
my $last = pop #str;
while (my $curr = pop #str ) {
if(($curr =~/^$last/) || $last=~/^$curr/) {
$h{length($curr)}= $curr ;
}
$last = $curr;
}
my $max_key = max(keys %h);
print $h{$max_key},"\n";
If you want to make it without a loop, you will need the /g regex modifier.
This will get you all the repeating string:
my #str = $string =~ /(\S+)(?=\s\1)/ig;
I have replaced \w with \S (in your example, \w doesn't match -), and used a look-ahead: (?=\s\1) means match something that is before \s\1, without matching \s\1 itself—this is required to make sure that the next match attempt starts after the first string, not after the second.
Then, it is simply a matter of extracting the longest string from #str:
my $longest = (sort { length $b <=> length $a } #str)[0];
(Do note that this is a legible but far from being the most efficient way of finding the longest value, but this is the subject of a different question.)
How about:
my $var = "home test results results-apr-25 results-apr-251.csv";
my $l = length $var;
for (my $i=int($l/2); $i; $i--) {
if ($var =~ /(\S{$i}).*\1/) {
say "found: $1";
last;
}
}
output:
found: results-apr-25

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedly. It may find five times, 10 times, etc. I need this number. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use #LAST_MATCH_END and #LAST_MATCH_START
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

Perl regex replace count

Is it possible to specify the maximum number of matches to replace. For instance if matching 'l' in "Hello World", would it be possible to replace the first 2 'l' characters, but not the third without looping?
$str = "Hello world!";
$str =~ s/l/r/ for (1,2);
print $str;
I don't see what's so bad about looping.
Actually, here's a way:
$str="Hello world!";
$str =~ s/l/$i++ >= 2 ? "l": "r"/eg;
print $str;
It's a loop, of sorts, since s///g works in a loopy way when you do this. But not a traditional loop.
Here is one way. This requires an external counter to be updated within the RE using a (?{code}) block inside of a (?(condition)true-sub-expression|false-sub-expression) construct. See perldoc perlre for an explanation.
use Modern::Perl;
use re qw/eval/; # Considered experimental.
my $string = 'Hello world!';
my $count = 2;
my $re = qr/
(l)
(?(?{$count--})|(*FAIL))
/x;
say "Looking for $count instances of 'l' in $string.";
my ( #found ) = $string =~ m/$re/g;
say "Found ", scalar #found, " instances of 'l': #found";
The output is:
Looking for 2 instances of 'l' in Hello world!
Found 2 instances of 'l': l l
Here's another test of the same regexp, but this time we're keeping track of the position of the matches just to prove it's matching the first two occurrences.
use Modern::Perl;
use strict;
use warnings;
use re qw/eval/; # Considered experimental.
my $string = 'Hello world!';
my $count = 2;
my $position = 0;
my $re = qr/
(l)(?{$position=pos})
(?(?{$count--})|(*FAIL))
/x;
while( $string =~ m/$re/g ) {
say "Found $1 at ", $position;
}
And this time the output is:
Found l at 3
Found l at 4
I don't think I would recommend any of this. If I were considering constraining matches to only one portion of a string, I would match against a substr() of the string. But if you like to live on the edge, go ahead and have fun with this snippet.
Here it is in a substitution:
use Modern::Perl;
use strict;
use warnings;
use re qw/eval/; # Considered experimental.
my $string = 'Hello world!';
say "Before substitution $string";
my $count = 2;
my $re = qr/
(l)
(?(?{$count--})|(*FAIL))
/x;
$string =~ s/$re/L/g;
say "After substitution $string";
And the output:
Before substitution Hello world!
After substitution HeLLo world!
Short answer: no. You will need to perform the substitutions in a loop of some kind.