How to obtain the captured groups in perl - regex

I'm currently working on a simple script to detect the relations between people. I define the $np to be the regex of a Proper Noun.
In my first task, the goal is to find all the proper nouns that appear left of another proper noun in which the number of words between them is less than a fixed value. I have written the following code:
$/ = '';
my $PM = qr{\b[A-Z][\w-]*\w};
my $de = qr{d[aoe]s?};
my $s = qr{[\n ]};
my $np = qr{$PM (?: $s $PM | $s $de $s $PM )*}x;
while(<>){
while(/($np)/g){
print("$1 : ");
my @x = m/(?=(?: $s+ (?: [\w-]+ | ($np)) ){1,7})/gx;
my $y = join(", ", @x);
print("$y\n");
}
}
I expected it to print all the proper nouns in the file and, for each one of them, the proper nouns that are in the window. However, this is not happening.
How can I make it work like I want?
PS: I'm a perl newbie
EDIT: Some people recommended to add samples of input and output expected.
If I had a file with the following text
John asked Mary to meet Anna.
then, I would like my script to print
John : Mary, Anna
Mary : Anna
Anna :
However, in its current state, I get commas printed in an infinite loop.

You get an infinite loop because the inner match runs against the same $_ and changes its match position (pos), so the outer /g match starts over each time. You can assign the line to a named variable before the match to avoid this:
while( <> ) {
my $line = $_;
while( $line =~/($np)/g ) {
...
}
}
See Using regular expressions in Perl and pos() for details.
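Putting this together with the sample from the question, here is a minimal sketch of one way to get the expected output. It is not the asker's lookahead regex: the window is taken by splitting the text to the right of each match into words, and it only handles single-word names ($PM), not the multi-word $np. The names $window, $text and @lines are introduced here for illustration. The outer /g match is the only match that runs on $text, so nothing else disturbs pos($text).

```perl
use strict;
use warnings;

my $PM     = qr{\b[A-Z][\w-]*\w};   # single-word proper noun, as in the question
my $window = 7;                      # assumed window size (words to the right)
my $text   = "John asked Mary to meet Anna.";

my @lines;
while ($text =~ /($PM)/g) {
    my $name  = $1;
    # Work on a copy of the text to the right; pos($text) is left untouched.
    my @words = split ' ', substr($text, pos($text));
    splice(@words, $window) if @words > $window;
    # Keep only the words that start with a proper noun.
    my @near = map { /^($PM)/ ? $1 : () } @words;
    push @lines, "$name : " . join(", ", @near);
}
print "$_\n" for @lines;
# prints:
# John : Mary, Anna
# Mary : Anna
# Anna :        (with a trailing space)
```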

Is it a requirement to use regex? Doesn't seem to be the best way to me.
I'd just split the string into words and go through that. Something like the code below.
I'm not sure I understand your specification correctly. What my code does is find proper names that appear within a certain distance to the left of another proper name.
my $window = 3;
my %result;
while(<DATA>){
my @words = map {$_ =~ s/[[:punct:]]$//; $_} split;
for (my $index = $#words; $index > 0; $index--) {
my $word = $words[$index];
next unless is_name($word);
my $start_index = $index - $window; # look at most $window words to the left
$start_index = 0 if $start_index < 0;
my $end_index = $index - 1;
$end_index = 0 if $end_index < 0;
my @neigbours = grep {is_name($_)} @words[$start_index .. $end_index];
$result{$word} = [@neigbours] if @neigbours;
}
}
sub is_name {
shift =~ /^[A-Z][\w-]*\w$/;
}
__DATA__
John asked Mary to meet Anna.

Related

How can I properly stop and start metacharacter interpolation in regexp in Perl

Editing to be more concise, pardon.
I need to be able to grep from an array using a string that may contain one of the following characters: '.', '+', '/', '-'. The string will be captured from the user. The array contains each line of the file I'm searching through. (I'm reading the file into the array to avoid keeping it open while the user is interacting with the program, because the file is also used by a cron job and I do not want to have it open when the cron runs.) Each line has a unique identifier within it, which is the basis for the search string used in the regexp. The code below shows the grep statement I am using. I use our and my in my programs so that the variables I want access to in all namespaces are available, and the ones I use only in subroutines are not. If you want to try to replicate the issue, this is the code:
#!/usr/bin/perl -w
use strict;
use Switch;
use Data::Dumper;
our $pgm_path = "/tmp/";
our $device_info = "";
our @new_filetype1 = ();
our @new_filetype2 = ();
our @dev_info = ();
our @pgm_files = ();
our %arch_rtgs = ();
our $file = "/path/file.csv";
open my $fh, '<', $file or die "Couldn't open $file!\n";
chomp(our @source_file = <$fh>);
close $fh;
print "Please enter the device name:\n";
chomp(our $dev = <STDIN>);
while ($device_info eq "") {
# Grep the device info from the sms file
my @sms_device = grep(/\Q$dev\E/, @source_file);
if (scalar(@sms_device) > 1) {
my $which_dup = find_the_duplicate(\@sms_device);
if ($which_dup eq "program") {
print "\n-> $dev <- must be a program name instead of a device name." .
"\nChoose the device from the list you are working on, specifically.\n";
foreach my $fix(@sms_device) {
my @fix_array = split(',', $fix);
print "$fix_array[1]\n";
undef @fix_array;
}
chomp($dev = <STDIN>);
} else { $device_info = $which_dup; }
} elsif (scalar(@sms_device) == 1) {
($device_info) = @sms_device;
@sms_device = ();
}
}
When I try the code with an anchor:
my @sms_device = grep(/\Q$dev\E^/, @source_file);
No more activity from the program is noticed. It just sits there like it's waiting on some more input from the user. This is not what I expected to happen. The reason I would like to anchor the search pattern is because there are many, many examples of similarly named devices that have the same character order as the search pattern, but also include additional characters that are ignored in the regexp evaluation. I don't want them to be ignored, in the sense that they are included in matches. I want to force an exact match of the string in the variable.
Thanks in advance for wading through my terribly inexperienced code and communication attempts at detailing my problem.
The device id followed by the start of the string? /\Q$dev\E^/ makes no sense. You want the device id to be preceded by the start of the string and followed by the end of the string.
grep { /^\Q$dev\E\z/ }
Better yet, let's avoid spinning up the regex engine for nothing.
grep { $_ eq $dev }
For example,
$ perl -e'my $dev = "ccc"; CORE::say for grep { /^\Q$dev\E\z/ } qw( accc ccc ccce );'
ccc
$ perl -e'my $dev = "ccc"; CORE::say for grep { $_ eq $dev } qw( accc ccc ccce );'
ccc
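The question says each line holds the identifier as one field among others, so comparing whole lines may be too strict for the real data. Here is a hedged sketch, assuming comma-separated lines with the device name in the second field (the question's code prints $fix_array[1], but the real file layout is not shown). The sample lines and the names @by_field and @by_regex are made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample lines; the real CSV layout is an assumption.
my @source_file = (
    "1,dev-1.0,alpha",
    "2,dev-1.0+x,beta",
    "3,adev-1.0,gamma",
    "4,dev-1x0,delta",
);
my $dev = "dev-1.0";

# Exact comparison of one field, no regex involved.
my @by_field = grep { (split /,/, $_)[1] eq $dev } @source_file;

# Regex alternative: the quoted name must be a complete field.
my @by_regex = grep { /(?:^|,)\Q$dev\E(?:,|\z)/ } @source_file;

print "$_\n" for @by_field;   # prints: 1,dev-1.0,alpha
```

Without \Q...\E the dot in the name would match any character, so line 4 would also be selected by the regex variant.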
I would use quotemeta. Here is an example of how it compares:
my $regexp = '\t';
my $metaxp = quotemeta ($regexp);
while (<DATA>) {
print "match \$regexp - $_" if /$regexp/;
print "match \$metaxp - $_" if /$metaxp/;
}
__DATA__
This \t is not a tab
This is a tab
(there is literally a tab in the second line)
The meta version will match line 1, because quotemeta turned "\t" into essentially "\\t" (a literal backslash followed by t), and the non-meta (original) version will match line 2, because there \t is interpreted as a tab.
match $metaxp - This \t is not a tab
match $regexp - This is a tab
Hopefully you get my meaning.
I think adding $regexp = quotemeta ($regexp) (or doing it when you capture the standard input) should meet your need.

Identifying pseudo-duplicates with Perl

I have a list that contains names. There are multiples of the same name. I want to catch the first instance of these pseudo-dupes and anchor them.
Example input
Josh Smith
Josh Smith0928340938
Josh Smith and friends
hello
hello1223
hello and goodbye.
What I want to do is identify the first occurrence of Josh Smith or hello and put an anchor such as a pipe | in front of it to validate. These are also wildcards as the list is large, so I cannot specifically look for the first match of Josh Smith and so on.
My desired output would be this:
|Josh Smith
Josh Smith0928340938
Josh Smith and friends
|hello
hello1223
hello and goodbye.
I did not provide any code. I am a little in the dark on how to go about this and was hoping maybe someone had been in a similar situation using regex or Perl.
I think based on what I understand of your requirements you are looking for something like this:
$prefix = '';
$buffered = '';
$count = 0;
while ($line = <>) {
$linePrefix = substr($line,0,length($prefix));
if ($buffered ne '' && $linePrefix eq $prefix) {
$buffered .= $line;
$count++;
} else {
if ($buffered ne '') {
print "|" if ($count > 1);
print $buffered;
}
$buffered = $line;
$prefix = $line;
chomp $prefix;
$count = 1;
}
}
if ($buffered ne '') {
if ($count > 1) {
print "|";
}
print $buffered;
}
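The same grouping idea can be sketched as a function over an array, which is easier to test than a loop over <>. This is an illustration under the assumption that a "root" is any non-empty line that the lines immediately following it start with. mark_roots and @marked are names introduced here.

```perl
use strict;
use warnings;

sub mark_roots {
    my @lines = @_;                      # lines without trailing newlines
    my @out;
    my $i = 0;
    while ($i < @lines) {
        my $root = $lines[$i];
        my $j = $i + 1;
        # Extend the group while the following lines start with the root.
        $j++ while $j < @lines && index($lines[$j], $root) == 0;
        # Mark the root only if it has at least one follower.
        push @out, ($j - $i > 1 ? "|$root" : $root), @lines[$i + 1 .. $j - 1];
        $i = $j;
    }
    return @out;
}

my @marked = mark_roots(
    "Josh Smith", "Josh Smith0928340938", "Josh Smith and friends",
    "hello", "hello1223", "hello and goodbye.",
);
print "$_\n" for @marked;
```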
Actually, IMO this is a rather interesting question, because you can be creative. As you do not know how to identify the root name, I have to ask if you have to? I have a feeling that you do not need a perfect solution. Therefore, I would go for something simple:
#!/usr/bin/perl -wn
$N = 4;
if (@prev) {
$same_start = length $_ >= $N &&
substr($prev[0], 0, $N) eq substr($_, 0, $N);
unless ($same_start) {
print "|", shift @prev if $#prev;
@prev = grep { print;0 } @prev;
}
}
push @prev, $_;
}{ print for @prev
edit: fixed bug: <print "|", shift @prev;> to <print "|", shift @prev if $#prev;>
Sample output:
$ perl josh.pl <josh-input.txt
|Josh Smith
Josh Smith0928340938
Josh Smith and friends
|hello
hello1223
hello and goodbye.

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will match 'a' at least three times, greedily. It may match five times, 10 times, etc. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use @LAST_MATCH_END (@+) and @LAST_MATCH_START (@-):
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
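The same offsets also give the count when the repeated unit is longer than one character, as long as its width is fixed: divide the match length by the unit length. This is a small sketch with a made-up string and unit. It does not help for variable-width patterns.

```perl
use strict;
use warnings;

my $str  = 'xxababababyy';
my $unit = 'ab';                      # fixed-width literal, chosen for illustration
my $reps;
if ($str =~ /(?:\Q$unit\E){3,}/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    $reps = ($+[0] - $-[0]) / length($unit);
    print "$reps\n";                  # prints 4
}
```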
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on perl v5.14 and v5.16.)
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

Expanding [optionals], groupings, and the | or operator in text

I am trying to expand sentences that incorporate [ ] to indicate optionals, ( ) to indicate grouping, and | to indicate the or operator and enumerate all possibilities. So for example:
"Hey [there] you [hood]." should return four sentences:
Hey there you hood.
Hey there you.
Hey you hood.
Hey you.
The end goal would look like:
Input: "(His|Her) dog was [very|extremely] confused."
Output: His dog was very confused.
His dog was extremely confused.
His dog was confused.
Her dog was very confused.
Her dog was extremely confused.
Her dog was confused.
I am doing it using regex matching and recursion. I have searched both CPAN and SO under the phrases:
Expanding text
expanding sentences
expanding conditionals
expanding optionals
expanding groupings
with no luck.
Thanks.
I have edited this question largely to better reflect its evolution and removed large portions which were made obsolete as the question evolved. The question above is the question that most of the answers below are attempting to address.
My current state is the following:
After wrestling with the problem above for a day I have two solutions very close to what I want. One is my own and the second is PLT's below. However, I have decided to try a fundamentally different approach.
Using regular expressions and manually parsing these sentences seems like a very ugly way of doing things. So I have decided to instead write a grammar for my "language" and use a parser-generator to parse it for me.
This gives me an additional layer of abstraction and avoids the following scenario described by Damian Conway in Perl Best Practices: [about regexps]
cut-and-paste-and-modify-slightly-and-oh-now-it-doesn't-work-at-all-so-let's-modify-it-some-more-and-see-if-that-helps-no-it-didn't-but-we're-commited-now-so-maybe-if-we-change-that-bit-instead-hmmmm-that's-closer-but-still-not-quite-right-maybe-if-I-made-that-third-repetition-non-greedy-instead-oops-now-it's-back-to-not-matching-at-all-perhaps-I-should-just-post-it-to-PerlMonks.org-and-see-if-they-know-what's-wrong
It also makes it much easier if the grammar of these expressions were to change and I needed to support other constructs later on.
Last update:
I solved my problem using an open source toolkit. This will transcribe a JSGF version of my input and generate a finite-state transducer. From there you can walk through the FST to generate all possible outcomes.
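For comparison, the grammar idea can also be sketched by hand as a small recursive-descent parser, without a parser generator or an FST toolkit. This is only an illustration, not the toolkit-based solution mentioned above. It assumes balanced brackets and that (, ), [, ] and | never appear as literal text. The names parse_seq, parse_alt and expand are invented here.

```perl
use strict;
use warnings;

# seq := (literal | '(' alt ')' | '[' alt ']')*
sub parse_seq {
    my ($s, $pos) = @_;                 # references to the input and the cursor
    my @results = ('');
    while ($$pos < length($$s)) {
        my $ch = substr($$s, $$pos, 1);
        last if $ch eq '|' || $ch eq ')' || $ch eq ']';
        my @choices;
        if ($ch eq '(' || $ch eq '[') {
            $$pos++;                            # skip the opener
            @choices = parse_alt($s, $pos);
            push @choices, '' if $ch eq '[';    # [..] may also expand to nothing
            $$pos++;                            # skip the closer
        } else {
            my ($literal) = substr($$s, $$pos) =~ /^([^()\[\]|]+)/;
            @choices = ($literal);
            $$pos += length($literal);
        }
        # Cross product: every prefix so far with every choice.
        @results = map { my $head = $_; map { $head . $_ } @choices } @results;
    }
    return @results;
}

# alt := seq ('|' seq)*
sub parse_alt {
    my ($s, $pos) = @_;
    my @all = parse_seq($s, $pos);
    while ($$pos < length($$s) && substr($$s, $$pos, 1) eq '|') {
        $$pos++;                                # skip the '|'
        push @all, parse_seq($s, $pos);
    }
    return @all;
}

sub expand {
    my ($text) = @_;
    my $pos = 0;
    my @out = parse_alt(\$text, \$pos);
    for (@out) {                        # tidy spacing left by empty options
        s/\s+/ /g;
        s/ ([.?!,])/$1/g;
        s/^ //;
        s/ $//;
    }
    return @out;
}

print "$_\n" for expand("(His|Her) dog was [very|extremely] confused.");
```

Nesting such as (you|my [friendly ]friend) falls out of the recursion, since parse_alt calls parse_seq, which calls parse_alt again for each bracket.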
Ok, another complete revision of the answer. This will work as intended. :) It now also expands nested parens. Newline is still the delimiter, but I added a way to quickly change it to something more complicated if the need arises.
Basically, I started with replacing brackets with parens + pipe, since [word ] and (|word ) are equivalent.
I then extracted all the encapsulating parens, e.g. both (you |my friend) and (you |my (|friendly ) friend ). I then expanded the nested parens into regular parens, e.g. (you |my (|friendly ) friend ) was replaced with (you |my friendly friend |my friend ).
With that done, the words could be processed with the original subroutine.
Remains to be tested on more complicated expansions, but it works fine during my testing.
Here's the revised code:
use strict;
use warnings;
sub addwords {
my ($aref, @words) = @_;
my @total;
for my $start (@$aref) {
for my $add (@words) {
push @total, $start . $add;
}
}
return @total;
}
sub expand_words {
my $str = shift;
my @sentences = ('');
for my $word (word_split($str)) {
if ($word =~ /^([(])([^)]+)[)]$/) {
my @options = split /\|/, $2;
push @options, '' if ($1 eq '[');
@sentences = addwords(\@sentences, @options);
} else {
@sentences = addwords(\@sentences, $word);
}
}
return @sentences;
}
sub fix_parens {
my $str = shift;
$str =~ s/\[/(|/g;
$str =~ s/\]/)/g;
return $str;
}
sub fix_nested {
my @array = @_;
my @return;
for (my $i=0; $i <= $#array; ) {
my $inc = 1;
my ($co, $cc);
do {
$co = () = $array[$i] =~ /\(/g;
$cc = () = $array[$i] =~ /\)/g;
if ( $co > $cc ) {
$array[$i] .= $array[$i + $inc++];
}
} while ( $co > $cc );
push @return, expand_nest($array[$i]);
$i += $inc;
}
return @return;
}
sub expand_nest {
my $str = shift;
my $co = () = $str =~ /\(/g;
return $str unless ($co > 1);
while ($str =~ /([^|(]+\([^)]+\)[^|)]+)/) {
my $match = $1;
my @match = expand_words($match);
my $line = join '|', @match;
$match =~ s/([()|])/"\\" . $1/ge;
$str =~ s/$match/$line/ or die $!;
}
return $str;
}
sub word_split {
my $str = shift;
my $delimeter = "\n";
$str = fix_parens($str);
$str =~ s/([[(])/$delimeter$1/g;
$str =~ s/([])])/$1$delimeter/g;
my @tot = split /$delimeter/, $str;
@tot = fix_nested(@tot);
return @tot;
}
my $str = "Hey [there ](you|my [friendly ]friend) where's my [red|blue]berry?";
my @sentences = expand_words($str);
print "$_\n" for (@sentences);
print scalar @sentences . " sentences\n";
Will produce the output:
Hey you where's my berry?
Hey you where's my redberry?
Hey you where's my blueberry?
Hey my friend where's my berry?
Hey my friend where's my redberry?
Hey my friend where's my blueberry?
Hey my friendly friend where's my berry?
Hey my friendly friend where's my redberry?
Hey my friendly friend where's my blueberry?
Hey there you where's my berry?
Hey there you where's my redberry?
Hey there you where's my blueberry?
Hey there my friend where's my berry?
Hey there my friend where's my redberry?
Hey there my friend where's my blueberry?
Hey there my friendly friend where's my berry?
Hey there my friendly friend where's my redberry?
Hey there my friendly friend where's my blueberry?
18 sentences
Data::Generate. I found this while searching for combination which is the mathematical term of what you're doing with your sets of words there.
Here is a rather simple solution, if you get past some of the ugly regexps, due to collisions between your syntax and the regexp syntax. It allows for both the [] and the () syntax, which in fact are very similar, [foo] is the same as (foo| ).
The basis is to replace each alternation by a marker #0, #1, #2... while storing them in an array, then replace the last marker, generating several phrases, then replace the next-to-last marker in each of those phrases, and so on until all markers have been replaced. Attentive readers of Higher-Order Perl will no doubt find a more elegant way to do this.
#!/usr/bin/perl
use strict;
use warnings;
while( my $phrase=<DATA>)
{ my $original= $phrase;
$phrase=~s{\[([^\]]*)\]}{($1| )}g; # replace [c|d] by (c|d| )
my $alts=[]; my $i=0;
while( $phrase=~ s{\(([^)]*)\)}{#$i}) # replace (a|b) ... (c|d| ) by #0 ... #1
{ push @$alts, [ split /\|/, $1 ]; $i++; # store [ ['a', 'b'], [ 'c', 'd', ' '] ]
}
my $expanded=[$phrase]; # seed the expanded list with the phrase
while( @$alts) { expand( $alts, $expanded); } # expand each alternation, until none left
print "$original - ", join( " - ", @$expanded), "\n\n";
}
exit;
# expand the last #i of the phrase in all the phrases in $expanded
sub expand
{ my( $alts, $expanded)=@_;
my @these_alts= @{pop(@$alts)}; # the last alternations
my $i= @$alts; # the corresponding index in the phrases
@$expanded= map { my $ph= $_;
map { my $ph_e= $ph;
$ph_e=~ s{#$i}{$_}; # replace the marker #i by one option
$ph_e=~ s{ +}{ }; # fix double spaces
$ph_e;
} @these_alts # for all options
} @$expanded # for all phrases stored so far
}
__DATA__
(His|Her) dog was [very|extremely

How can I find the first occurrence of a pattern in a string from some starting position?

I have a string of arbitrary length, and starting at position p0, I need to find the first occurrence of one of three 3-letter patterns.
Assume the string contains only letters. I need to find the count of triplets starting at position p0 and jumping forward in triplets until the first occurrence of either 'aaa' or 'bbb' or 'ccc'.
Is this even possible using just a regex?
Moritz says this might be faster than a regex. Even if it's a little slower, it's easier to understand at 5 am. :)
#0123456789.123456789.123456789.
my $string = "alsdhfaaasccclaaaagalkfgblkgbklfs";
my $pos = 9;
my $length = 3;
my $regex = qr/^(aaa|bbb|ccc)/;
while( $pos < length $string )
{
print "Checking $pos\n";
if( substr( $string, $pos, $length ) =~ /$regex/ )
{
print "Found $1 at $pos\n";
last;
}
$pos += $length;
}
$string=~/^ # from the start of the string
(?:.{$p0}) # skip (don't capture) "$p0" occurrences of any character
(?:...)*? # skip 3 characters at a time,
# as few times as possible (non-greedy)
(aaa|bbb|ccc) # capture aaa or bbb or ccc as $1
/x;
(Assuming p0 is 0-based).
Of course, it's probably more efficient to use substr on the string to skip forward:
substr($string, $p0)=~/^(?:...)*?(aaa|bbb|ccc)/;
You can't really count with regexes, but you can do something like this:
pos $string = $start_from;
$string =~ m/\G # anchor to previous pos()
((?:...)*?) # capture everything up to the match
(aaa|bbb|ccc)
/xs or die "No match";
my $result = length($1) / 3;
But I think it's a bit faster to use substr() and unpack() to split into triple and walk the triples in a for-loop.
(edit: it's length(), not lenght() ;-)
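The substr() and unpack() variant mentioned above could look roughly like this. It is a sketch, not a benchmarked implementation. The string and start position are borrowed from the other answer's example, and %wanted, @triples and $count are names chosen here.

```perl
use strict;
use warnings;

my $string = "alsdhfaaasccclaaaagalkfgblkgbklfs";
my $p0     = 9;                                   # 0-based start position
my %wanted = map { $_ => 1 } qw(aaa bbb ccc);

# Cut everything from $p0 onward into 3-character chunks.
my @triples = unpack '(a3)*', substr($string, $p0);

my $count;                                        # triplets skipped before the hit
for my $i (0 .. $#triples) {
    if ($wanted{ $triples[$i] }) {
        $count = $i;
        last;
    }
}
if (defined $count) {
    printf "found %s after %d triplets (offset %d)\n",
        $triples[$count], $count, $p0 + 3 * $count;
} else {
    print "no match\n";
}
# prints: found aaa after 2 triplets (offset 15)
```

The hash lookup replaces the regex alternation, so each chunk costs one string comparison.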
The main part of this is split /(...)/. But at the end of this, you'll have your positions and occurrence data.
my @expected_triplets = qw<aaa bbb ccc>;
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
my $place = 0;
my @triplets = grep { length } split /(...)/, $data_string;
my %occurrence_for = map { $_, [] } @expected_triplets;
foreach my $i ( 0..$#triplets ) { # $#triplets is the last valid index
my $triplet = $triplets[$i];
push( @{$occurrence_for{$triplet}}, $i ) if exists $occurrence_for{$triplet};
}
Or for simple counting by regex (it uses the experimental (?{ }) code block; the postponed-pattern form (??{ }) would try to match the block's return value as a regex):
my ( $count, %count );
my $data_string
= 'fjeidoaaaivtrxxcccfznaaauitbbbfzjasdjfncccftjtjqznnjgjaaajeitjgbbblafjan'
;
1 while $data_string =~ m/(aaa|bbb|ccc)(?{ $count++; $count{$^N}++ })/g;
If speed is a serious concern, you can, depending on what the 3 strings are, get really fancy by creating a tree (e.g. Aho-Corasick algorithm or similar).
A map for every possible state is possible, e.g. state[0]['a'] = 0 if no strings begin with 'a'.