Specify number of matching regex groups using Perl - regex

Let's say I have the following string
my $val = "3.4 -22.352 4.0"
The goal is to extract each decimal number by itself. There can be any number of spaces on each side or in between. It is also important to make sure that there is exactly 3 numbers present, and no other junk. I have something like this, but it doesn't work:
my #parts = ($val =~ /((\s*[-+]?\d{1,3}\.\d{1,3}\s*)){3}/)
if (scalar(#parts) == 3) {
print "Validated!\n";
for my $i (#parts) {
print "$i\n";
}
}
For some reason I get the last one twice.

Each capturing group gets you only one value, even if you apply a quantifier on it. If you want 3 values you have to repeat the capturing group 3 times. For example:
my $num = qr/[-+]?\d{1,3}\.\d{1,3}/;
my #nums = $val =~ /^\s*($num)\s+($num)\s+($num)\s*$/;
if(#nums){
print "Valid, and no need to check the number of elements.\n";
}

Instead of fighting regular expressions, use split and looks_like_number:
use warnings;
use strict;
use Scalar::Util qw(looks_like_number);
my $val = "3.4 -22.352 4.0";
my #parts = split /\s+/, $val;
if (scalar(#parts) == 3) {
my $ok = 0;
for (#parts) {
$ok++ if looks_like_number($_);
}
if ($ok == 3) {
print "Validated!\n";
for my $i (#parts) {
print "$i\n";
}
}
}

There are several issues here:
1) If you want three and only three numbers, you should anchor the start (^) and end ($) of the line in the regex.
2) Why are there two sets of parentheses? As written the second pair are redundant.
3) When you have a regex, the number of values returned are usually counted by the left parentheses (unless you use ?: or some other modifier). In this example, you have two, so it only returns two values. Because of the redundant parentheses, you get the same values twice each.

You have two sets of parens, so two values are returned. Both sets surround the same part of the regex, so both values will be the same.
Validating and extracting at not necessarily possible to do at the same time.
Doing it in two steps, extracting first, is quite simple:
my #nums = split ' ', $val;
die "Invalid\n" if #parts != 3;
for (#nums) {
die "Invalid\n" if !/^[-+]?[0-9]{1,3}\.[0-9]{1,3}\z/;
}
You can do it in one step, but there's some redundancy involved:
my $num_pat = qr/[-+]?[0-9]{1,3}\.[0-9]{1,3}/;
my #nums = $val =~ /^($num_pat)\s+($num_pat)\s+($num_pat)\z/
or die "Invalid\n";

my $val = "3.4 -22.352 4.0";
my $length = $val =~ s/((^|\s)\S)/$1/g;
#determines the number of tokens
if ($length == 3)
{
while($val=~/([-+]?[0-9]{1,3}\.[0-9]{1,3})/g)
{
print "$1\n";
}
}
The /g allows you to loop through the string and extract values conforming to your restrictions (one at a time). It will do this until all of the "tokens" matching your pattern are iterated through. I like this solution because it's concise and doesn't require you to create an auxiliary array. It's also a more general answer than using three extractions in one's regex.

With Regex Only
This will require 3 chunks of numbers delimited by space each number will be popluated into it's respective group.
(?:(?:^)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=$))
Example
PHP Code Example:
<?php
$sourcestring="3.4 -22.352 4.0";
preg_match_all('/(?:(?:^)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=\s))(?:(?:\s)([-]?[0-9]*?[.]?[0-9]*?)(?=$))/i',$sourcestring,$matches);
echo "<pre>".print_r($matches,true);
?>
$matches Array:
(
[0] => Array
(
[0] => 3.4 -22.352 4.0
)
[1] => Array
(
[0] => 3.4
)
[2] => Array
(
[0] => -22.352
)
[3] => Array
(
[0] => 4.0
)
)

Related

Regex to find and display count of every word in a string [duplicate]

i need to implement a program to count the occurrence of a substring in a string in perl. i have implemented it as follows
sub countnmstr
{
$count =0;
$count++ while $_[0] =~ /$_[1]/g;
return $count;
}
$count = countnmstr("aaa","aa");
print "$count\n";
now this is what i would normally do. however, in the implementation above i want to count occurrence of 'aa' in 'aaa'. here i get answer as 1 which seems reasonable but i need to consider the overlapping cases as well. hence the above case should give an answer as 2 since there are two 'aa's if we consider overlap.
can anyone suggest how to implement such a function??
Everyone is getting pretty complicated in their answers (d'oh! daotoad should have made his comment an answer!), perhaps because they are afraid of the goatse operator. I didn't name it, that's just what people call it. It uses the trick that the result of a list assignment is the number of elements in the righthand list.
The Perl idiom for counting matches is then:
my $count = () = $_[0] =~ /($pattern)/g;
The goatse part is the = () =, which is an empty list in the middle of two assignments. The lefthand part of the goatse gets the count from the righthand side of the goatse. Note the you need a capture in the pattern because that's the list the match operator will return in list context.
Now, the next trick in your case is that you really want a positive lookbehind (or lookahead maybe). The lookarounds don't consume characters, so you don't need to keep track of the position:
my $count = () = 'aaa' =~ /((?<=a)a)/g;
Your aaa is just an example. If you have a variable-width pattern, you have to use a lookahead. Lookbehinds in Perl have to be fixed width.
See ysth's answer ... I failed to realize that the pattern could consist solely of a zero width assertion and still work for this purpose.
You can use positive lookahead as suggested by others, and write the function as:
sub countnmstr {
my ($haystack, $needle) = #_;
my ($first, $rest) = $needle =~ /^(.)(.*)$/;
return scalar (() = $haystack =~ /(\Q$first\E(?=\Q$rest\E))/g);
}
You can also use pos to adjust where the next search picks up from:
#!/usr/bin/perl
use strict; use warnings;
sub countnmstr {
my ($haystack, $needle) = #_;
my $adj = length($needle) - 1;
die "Search string cannot be empty!" if $adj < 0;
my $count = 0;
while ( $haystack =~ /\Q$needle/g ) {
pos $haystack -= $adj;
$count += 1;
}
return $count;
}
print countnmstr("aaa","aa"), "\n";
Output:
C:\Temp> t
2
sub countnmstr
{
my ($string, $substr) = #_;
return scalar( () = $string =~ /(?=\Q$substr\E)/g );
}
$count = countnmstr("aaa","aa");
print "$count\n";
A few points:
//g in list context matches as many times as possible.
\Q...\E is used to auto-escape any meta characters, so that you are doing a substring count, not a subpattern count.
Using a lookahead (?= ... ) causes each match to not "consume" any of the string, allowing the following match to be attempted at the very next character.
This uses the same feature where a list assignment (in this case, to an empty list) in scalar context returns the count of elements on the right of the list assignment as the goatse/flying-lentil/spread-eagle/whatever operator, but uses scalar() instead of a scalar assignment to provide the scalar context.
$_[0] is not used directly, but instead copied to a lexical; a naive use of $_[0] in place of $string would cause the //g to start partway through the string instead of at the beginning if the passed string had a stored pos().
Update: s///g is faster, though not as fast as using index:
sub countnmstr
{
my ($string, $substr) = #_;
return scalar( $string =~ s/(?=\Q$substr\E)//g );
}
You could use a lookahead assertion in the regular expression:
sub countnmstr {
my #matches = $_[0] =~ /(?=($_[1]))/g;
return scalar #matches;
}
I suspect Sinan's suggestion will be quicker though.
you can try this, no more regex than needed.
$haystack="aaaaabbbcc";
$needle = "aa";
while ( 1 ){
$ind = index($haystack,$needle);
if ( $ind == -1 ) {last};
$haystack = substr($haystack,$ind+1);
$count++;
}
print "Total count: $count\n";
output
$ ./perl.pl
Total count: 4
If speed is an issue, the index approach suggested by ghostdog74 (with cjm's improvement) is likely to be considerably faster than the regex solutions.
use strict;
use warnings;
sub countnmstr_regex {
my ($haystack, $needle) = #_;
return scalar( () = $haystack =~ /(?=\Q$needle\E)/g );
}
sub countnmstr_index {
my ($haystack, $needle) = #_;
my $i = 0;
my $tally = 0;
while (1){
$i = index($haystack, $needle, $i);
last if $i == -1;
$tally ++;
$i ++;
}
return $tally;
}
use Benchmark qw(cmpthese);
my $size = 1;
my $h = 'aaa aaaaaa' x $size;
my $n = 'aa';
cmpthese( -2, {
countnmstr_regex => sub { countnmstr_regex($h, $n) },
countnmstr_index => sub { countnmstr_index($h, $n) },
} );
__END__
# Benchmarks run on Windows.
# Result using a small haystack ($size = 1).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 93701/s -- -66%
countnmstr_index 271893/s 190% --
# Result using a large haystack ($size = 100).
Rate countnmstr_regex countnmstr_index
countnmstr_regex 929/s -- -81%
countnmstr_index 4960/s 434% --

Regular expressions, matching operator using a string variable in Perl

I am using a regex but am getting some odd, unexpected "matches". "Names" are sent to a subroutine to be compared to an array called #ASlist, which contains multiple rows. The first element of each row is also a name, followed by 0 to several synonyms. The goal is to match the incoming "name" to any row in #ASlist that has a matching cell.
Sample input, from which $names is derived for the comparison against #ASlist:
13 1 13 chr7 7 70606019 74345818 Otud7a Klf13 E030018B13Rik Trpm1 Mir211 Mtmr10 Fan1 Mphosph10 Mcee Apba2 Fam189a1 Ndnl2 Tjp1 Tarsl2 Tm2d3 1810008I18Rik Pcsk6 Snrpa1 H47 Chsy1 Lrrk1 Aldh1a3 Asb7 Lins Lass3 Adamts17
Sample lines from #ASlist:
HSPA5 BIP FLJ26106 GRP78 MIF2
NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13
ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK
The code:
my ($name) = #_; ## this comes in from another loop elsewhere in code I did not include
chomp $name;
my #collectmatches = (); ## container to collect matches
foreach my $ASline ( #ASlist ){
my #synonyms = split("\t", $ASline );
for ( my $i = 0; $i < scalar #synonyms; $i++ ){
chomp $synonyms[ $i ];
#print "COMPARE $name TO $synonyms[ $i ]\n";
if ( $name =~m/$synonyms[$i]/ ){
print "\tname $name from block matches\n\t$synonyms[0]\n\tvia $synonyms[$i] from AS list\n";
push ( #collectmatches, $synonyms[0], $synonyms[$i] );
}
else {
# print "$name does not match $synonyms[$i]\n";
}
}
}
The script is working but also reports weird matches. Such as, when $name is "E030018B13Rik" it matches "NDUFA5" when it occurs in #ASlist. These two should not be matched up.
If I change the regex from ~m/$synonyms[$i]/ to ~m/^$synonyms[$i]$/, the "weird" matches go away, BUT the script misses the vast majority of matches.
The NDUFA5 record contains B13 as a pattern, which will match E030018<B13>Rik.
If you want to be more literal, then add boundary conditions to your regular expression /\b...\b/. Also should probably escape regular expression special characters using quotemeta.
if ( $name =~ m/\b\Q$synonyms[$i]\E\b/ ) {
Or if you want to test straight equality, then just use eq
if ( $name eq $synonyms[$i] ) {
Another, more Perlish way to test for string equality is to use a hash.
You don't show any real test data, but this short Perl program builds a hash from your array #ASlist of lines of match strings. After that, most of the work is done.
The subsequent for loop tests just E030018B13Rik to see if it is one of the keys of the new %ASlist and prints an appropriate message
use strict;
use warnings;
my #ASlist = (
'HSPA5 BIP FLJ26106 GRP78 MIF2',
'NDUFA5 B13 CI-13KD-B DKFZp781K1356 FLJ12147 NUFM UQOR13',
'ACAN AGC1 AGCAN CSPG1 CSPGCP MSK16 SEDK',
);
my %ASlist = map { $_ => 1 } map /\S+/g, #ASlist;
for (qw/ E030018B13Rik /) {
printf "%s %s\n", $_, $ASlist{$_} ? 'matches' : 'doesn\'t match';
}
output
E030018B13Rik doesn't match
Since you only need to compare two strings, you can simply use eq:
if ( $name eq $synonyms[$i] ){
You are using B13 as the regular expression. As none of the characters has a special meaning, any string containing the substring B13 matches the expression.
E030018B13Rik
^^^
If you want the expression to match the whole string, use anchors:
if ($name =~m/^$synonyms[$i]$/) {
Or, use index or eq to detect substrings (or identical strings, respectively), as your input doesn't seem to use any features of regular expressions.

How to write a perl if statement that tests how many times a string is found using regex?

I am trying to write a perl script that contains an if statement, and I want this if statement to check if a string is found via regex a certain number of times in a saved string. I would like to do this in a single line if possible, imagined like so:
$saved_string = "This abc is my abc test abc";
if( #something_to_denote_counting ($saved_string =~ /abc/) == 3)
{
print "abc was found in the saved string exactly 3 times";
}
else
{
print "abc wasn't found exactly 3 times";
}
...But I don't know what I need to do in that if statement to check for the number of times the regex matches. Can someone please tell me if this is possible? Thanks!
if ( 3 == ( () = $saved_string =~ /abc/g ) ) {
print "abc was found in the saved string exactly 3 times";
}
To get the count, you need to use /g in list context. So you could do:
#matches = $saved_string =~ /abc/g;
if ( #matches == 3 ) {
but perl provides a little help to make it easier; a list assignment, placed in scalar context (such as is provided by ==), returns the count of elements on the right side of the assignment. This enables code like:
while ( my ($key, $value) = each %hash ) {
So you could do:
if ( 3 == ( #matches = $saved_string =~ /abc/g ) ) {
but using an array isn't even necessary; assigning into an empty list is sufficient (and has become an idiom wherever you need to execute code in list context but only get a count of results).
Save matches to anon array reference, dereference it using #{} and compare to number,
if( #{[ $saved_string =~ /abc/g ]} == 3) {
print "abc was found in the saved string exactly 3 times";
}

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times greedly. It may find five times, 10 times, etc. I need this number. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use #LAST_MATCH_END and #LAST_MATCH_START
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

When there is a similar pattern in an expression, how to extract the occurence of the last instance in perl?

The value of $s is dynamic. I need to extract the values that occur after the last | in between each [].
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = split(/\]/, $s);
foreach my $part (#parts)
{
# Need to extract the values that occur after the last '|'
# (for example: !, .1iit, 10:48AM, Calculator, Coffee)
# and store each of the values separately in a hash
}
Could someone help me out in this?
Thanks,
Best to transform the string into a more useful data structure, then take the needed elements. Why is this best? Because right now you need the last element, but perhaps next time you will need some other part. Since its not harder to do it right, why not?
#!/usr/bin/perl
use strict;
use warnings;
# Only needed for Dumper
use Data::Dumper;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
# Extract each group between []
# Then transform each group into an array reference by splitting on |
my #groups = map { [ split /\|/ ] } ($s =~ /\[([^\]]*)\]/g);
# Inspect the data structure
print Dumper \#groups;
# Print only the last element of each sub-array
print "$_\n" for map {$_->[-1]} #groups;
If needed the third elements of the sub-arrays could be transformed into hashrefs quite easily too. ,however since that wasn't needed, I leave that as an exercise for the reader (I always love saying that when I get the chance!).
Edit: since I found it interesting I ended up creating these hashrefs, here is the code that would replace the my #groups line:
my #groups = map { [ map { /\{([^\}]*)\}/ ? { split /(?:=|,)/, $1 } : $_ } (split /\|/) ] } ($s =~ /\[([^\]]*)\]/g);
or more properly commented (map commands are read from the back, so the comments start at the bottom and follow by number, comments like #/N pair with those like #N)
my #groups = map { #/1
[ #/2
map { #/3
/\{([^\}]*)\}/ #4 ... and if any element (separated by pipes in #3)
# is surrounded by curly braces
? { #5 ... then return a hash ref
split /(?:=|,)/, $1 #6 ... whose elements are given
# pairwise between '=' or ',' signs
} #/5
: $_ #7 ... otherwise (from 'if' in #4 ) return the element as is
} (split /\|/) #3 ... where each element is separated by pipes (i.e. |)
] #2 ... return an array ref
} ($s =~ /\[([^\]]*)\]/g); #1 For each element between sqr braces (i.e. [])
The generic way:
#subparts = split /\|/, $part;
$tail = $subparts[$#subparts];
If you only ever need the last part separately:
$part =~ /([^\|]*)$/ and $tail = $1;
my ($value) = $part =~ m/[^|]\|(.+)$/;
print "$part => $value\n";
and another way:
my $s =
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = $s =~ m/\|([^|]+)]/g;
print join( "\n", #parts );
Since you insist on a regex:
#matches = $s =~ /\|([^|]+?)]/g
Using /g will dump all matches into the array #matches
You really don't need a regex... just use split(). The results are stored in %results
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
foreach my $part (split(/\]/, $s))
{
#pieces = split(/\|/, $part);
$results{$pieces[-1]} = $pieces[-1];
}
With regexes, when you think “I want the last of,” you should immediately think of the pattern .* because regex greed does just what you want.
For example, matching /^(.*)a(.*)$/ chops up "abababab" into
ababab in $1
a matched by the literal in the pattern
b in $2
Let's think through the process of the match. Imagine .* as Augustus Gloop.
Augustus: Ausgezeichnet! The ^ anchor means I get to start at the beginning. From there, I shall eat all the candies!
Willie Wonka: But, my dear Augustus, you must share with the other children.
Augustus: Fine, I get "abababa" and they get "b". Happy?
Willie Wonka: But the next child in line doesn't like b candies.
Augustus: Then I shall keep "ababab" for myself and leave "ab" for the others.
At this point, Augustus has his big pile, humble little Charlie Bucket gets his single a, and Veruca Salt—although scowling about the meager quantity—gets at least something now.
In other words, $2 contains everything after the last a. To be persnickety, the ^ and $ anchors are redundant, but I like keeping them for added emphasis.
Putting this into action, you could write
#! /usr/bin/env perl
use strict;
use warnings;
sub last_fields {
local($_) = #_;
my #last;
push #last, $1 =~ /^.*\|(.+)$/ ? $1 : undef
while /\[(.*?)\]/g;
wantarray ? #last : \#last;
}
The outer while breaks up the string into [...] chunks and assumes that right square-bracket cannot occur inside a chunk. Within each chunk, we use /^.*\|(.+)$/ to capture in $1 everything after the last pipe.
Testing it with your example looks like
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!]" .
"[0|0|{A=167,B=2,C=67,D=17}|.1iit]" .
"[196|0|{A=244,B=6,C=67,D=12}|10:48AM]" .
"[204|0|{A=9,B=201,C=61,D=11}|Calculator]" .
"[66|0|{A=145,B=450,C=49,D=14}|Coffee]";
use Test::More tests => 6;
my #lasts = last_fields $s;
# yes, is_deeply could do this in a single call,
# but it's laid out explicitly here for expository benefit
is $lasts[0], "!";
is $lasts[1], ".1iit";
is $lasts[2], "10:48AM";
is $lasts[3], "Calculator";
is $lasts[4], "Coffee";
is scalar #lasts, 5;
All the tests pass:
$ ./match-last-of
1..6
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
The output of prove is nicer. Run it yourself to see the color coding.
$ prove ./match-last-of
./match-last-of .. ok
All tests successful.
Files=1, Tests=6, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.05 CPU)
Result: PASS