Why does my Perl regex cause an infinite loop? - regex

I have some code that grabs the "between" of some text;
specifically, between a foo $someword and the next foo $someword.
However, what happens is it gets stuck at the first "between" and somehow the internal string position doesn't get incremented.
The input data is a text file with newlines here and there: they are rather irrelevant, but make printing easier.
my $component = qr'foo (\w+?)\s*?{';
while($text =~ /$component/sg)
{
push #baz, $1; #grab the $someword
}
my $list = join( "|", #baz);
my $re = qr/$list/; #create a list of $somewords
#Try to grab everything between the foo $somewords;
# or if there's no $foo someword, grab what's left.
while($text=~/($re)(.+?)foo ($re|\z|\Z)/ms)
#if I take out s, it doesn't repeat, but nothing gets grabbed.
{
# print pos($text), "\n"; #this is undef...that's a clue I'm certain.
print $1, ":", $2; #prints the someword and what was grabbed.
print "\n", '-' x 20, "\n";
}

Update: One more update to deal with 'foo' occurring inside the text you want to extract:
use strict;
use warnings;
use File::Slurp;
my $text = read_file \*DATA;
my $marker = 'foo';
my $marker_re = qr/$marker\s+\w+\s*?{/;
while ( $text =~ /$marker_re(.+?)($marker_re|\Z)/gs ) {
print "---\n$1\n";
pos $text -= length $2;
}
__DATA__
foo one {
one1
one2
one3
foo two
{ two1 two2
two3 two4 }
that was the second one
foo three { 3
foo 3 foo 3
foo 3
foo foo
foo four{}
Output:
---
one1
one2
one3
---
two1 two2
two3 two4 }
that was the second one
---
3
foo 3 foo 3
foo 3
foo foo
---
}

Related

How to find the largest repeating string with overlap in a line

I have a series of lines such as
my $string = "home test results results-apr-25 results-apr-251.csv";
#str = $string =~ /(\w+)\1+/i;
print "#str";
How do I find the largest repeating string with overlap which are separated by whitespace?
In this case I'm looking for the output :
results-apr-25
It looks like you need the String::LCSS_XS which calculates Longest Common SubStrings. Don't try it's Perl-only twin brother String::LCSS because there are bugs in that one.
use strict;
use warnings;
use String::LCSS_XS;
*lcss = \&String::LCSS_XS::lcss; # Manual import of `lcss`
my $var = 'home test results results-apr-25 results-apr-251.csv';
my #words = split ' ', $var;
my $longest;
my ($first, $second);
for my $i (0 .. $#words) {
for my $j ($i + 1 .. $#words) {
my $lcss = lcss(#words[$i,$j]);
unless ($longest and length $lcss <= length $longest) {
$longest = $lcss;
($first, $second) = #words[$i,$j];
}
}
}
printf qq{Longest common substring is "%s" between "%s" and "%s"\n}, $longest, $first, $second;
output
Longest common substring is "results-apr-25" between "results-apr-25" and "results-apr-251.csv"
my $var = "home test results results-apr-25 results-apr-251.csv";
my #str = split " ", $var;
my %h;
my $last = pop #str;
while (my $curr = pop #str ) {
if(($curr =~/^$last/) || $last=~/^$curr/) {
$h{length($curr)}= $curr ;
}
$last = $curr;
}
my $max_key = max(keys %h);
print $h{$max_key},"\n";
If you want to make it without a loop, you will need the /g regex modifier.
This will get you all the repeating string:
my #str = $string =~ /(\S+)(?=\s\1)/ig;
I have replaced \w with \S (in your example, \w doesn't match -), and used a look-ahead: (?=\s\1) means match something that is before \s\1, without matching \s\1 itself—this is required to make sure that the next match attempt starts after the first string, not after the second.
Then, it is simply a matter of extracting the longest string from #str:
my $longest = (sort { length $b <=> length $a } #str)[0];
(Do note that this is a legible but far from being the most efficient way of finding the longest value, but this is the subject of a different question.)
How about:
my $var = "home test results results-apr-25 results-apr-251.csv";
my $l = length $var;
for (my $i=int($l/2); $i; $i--) {
if ($var =~ /(\S{$i}).*\1/) {
say "found: $1";
last;
}
}
output:
found: results-apr-25

Perl match multiple strings on same line (anything inside double and single quotes)

I thought this should be simple, matching strings in double/single quotes on same line
for example, following string all on same line
"hello" 'world' 'foo' "bar"
I have
print /(".*?")|('.*?')/g;
but I got following errors
Use of uninitialized value in print at ...
The following will return the warnings you mention:
use strict;
use warnings;
my $str = q{"hello" 'world' 'foo' "bar"};
print $str =~ /(".*?")|('.*?')/g;
That is because your regex will only match either one or the other of capture groups. The other one will not match and so will return undef.
The following will demonstrate:
while ($str =~ /(".*?")|('.*?')/g) {
print "one = " . (defined $1 ? $1 : 'undef') . "\n";
print "two = " . (defined $2 ? $2 : 'undef') . "\n";
print "\n";
}
Outputs:
one = "hello"
two = undef
one = undef
two = 'world'
one = undef
two = 'foo'
one = "bar"
two = undef
To get your desired behavior, just put the capture group around the entire expression.
print $str =~ /(".*?"|'.*?')/g;
You might want to check Text::ParseWords
use Text::ParseWords;
my $s = q{"hello" 'world' 'foo' "bar"};
my #words = quotewords('\s+', 0, $s);
use Data::Dumper; print Dumper \#words;
output
$VAR1 = [
'hello',
'world',
'foo',
'bar'
];
anoher option using backreference:
use strict;
use warnings;
my $str = q{"hello" 'world' 'foo' "bar"};
while ($str =~ /(["']).*?\1/g) {
print $& . "\n";
}

Perl regex to remove empty strings

I'm trying to write regex to remove empty strings on a line (and doesn't care about whitespace between list items), for example: baz foo, "","", bar, "" becomes baz foo, bar
So far I'm trying
$newLine =~ s/""\s*?,//g;
$newLine =~ s/,\s*?""//g;
but given baz "", foo, "" it is returning baz foo, "", but I want it to return baz foo.
Could anyone explain what's going wrong/how I can fix it?
Thanks
Your code works for me:
$string = 'baz "", foo, ""';
$string =~ s/""\s*?,//g;
$string =~ s/,\s*?""//g;
print "$string\n";
Returns
baz foo
for me.
Edit: As stated in the commentary below, it won't work for the string baz "", "". That's because the first regex consumes the , right before the second "", causing the second regex to not match.
An alternative for the regexes would be to use map.
$string = 'baz "", "", foo';
$string = join(" ", map { $_ =~ s/\s*""\s*//g; $_; } (split(/\s*,\s*/, $string)));
That will set $string to baz foo
It's easier to split the string, remove elements that don't contain anything apart from "" (and possibly surrounding spaces) and join those back.
The following might work for you:
#foo = grep { !/\s?""\s?/ } split /,/, $newLine;
$newLine = join(',', #foo);
Example:
$ cat mmm
$newLine = 'baz foo, "","", bar, ""';
#foo = grep { !/\s?""\s?/ } split /,/, $newLine;
$newLine = join(',', #foo);
print $newLine . "\n";
$ perl mmm
baz foo, bar

Matching simple keyword and keyword with spaces

I'm currently working on a function which takes a list of keywords and a string(a looong string) as arguments, and i want it to return a list of each matched keyword. Problem is that a keyword can be in 2 words.
For exemple - keyword1 : foobar, keyword2 : foo bar, keyword3 : barfoo)
string:
hi this is foobar, have you seen my foo bar, he is very fooBar ?
i want a list with (foobar, foo bar);
For the moment i got:
#matches = $string =~ m/\b(?:foobar|foo bar)\b/gi ;
This works fine for simple words, but not for composed words :/
any idea ?
Thank you for your help.
sub myfunc {
my ($str, #kw) = #_;
my ($re) = map qr/\b ($_) \b/x, join "|", #kw;
return $str =~ /$re/gi;
}
my #kwords = ("foobar", "foo bar", "barfoo");
my #arr = myfunc("hi this is foobar, have you seen my foo bar, he is very fooBar ?", #kwords);
This returns the correct results:
sub match {
my #keywords=#_;
my $s=pop #keywords;
return grep {$s=~/\b\Q$_\E\b/i} #keywords;
}
my #matches=match('foobar','foo bar','barfoo)','hi this is foobar, have you seen my foo bar, he is very fooBar?'); #this returns (foobar, foo bar)
BTW your code #matches = $string =~ m/\b(?:foobar|foo bar)\b/gi; is working great too, if you remove the /i modifier it returns (foobar, foo bar)

Perl regex: how to know number of matches

I'm looping through a series of regexes and matching it against lines in a file, like this:
for my $regex (#{$regexs_ref}) {
LINE: for (#rawfile) {
/#$regex/ && do {
# do something here
next LINE;
};
}
}
Is there a way for me to know how many matches I've got (so I can process it accordingly..)?
If not maybe this is the wrong approach..? Of course, instead of looping through every regex, I could just write one recipe for each regex. But I don't know what's the best practice?
If you do your matching in list context (i.e., basically assigning to a list), you get all of your matches and groupings in a list. Then you can just use that list in scalar context to get the number of matches.
Or am I misunderstanding the question?
Example:
my #list = /$my_regex/g;
if (#list)
{
# do stuff
print "Number of matches: " . scalar #list . "\n";
}
You will need to keep track of that yourself. Here is one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
my #regexes = (
qr/b/,
qr/a/,
qr/foo/,
qr/quux/,
);
my %matches = map { $_ => 0 } #regexes;
while (my $line = <DATA>) {
for my $regex (#regexes) {
next unless $line =~ /$regex/;
$matches{$regex}++;
}
}
for my $regex (#regexes) {
print "$regex matched $matches{$regex} times\n";
}
__DATA__
foo
bar
baz
In CA::Parser's processing associated with matches for /$CA::Regex::Parser{Kills}{all}/, you're using captures $1 all the way through $10, and most of the rest use fewer. If by the number of matches you mean the number of captures (the highest n for which $n has a value), you could use Perl's special #- array (emphasis added):
#LAST_MATCH_START
#-
$-[0] is the offset of the start of the last successful match. $-[n] is the offset of the start of the substring matched by n-th subpattern, or undef if the subpattern did not match.
Thus after a match against $_, $& coincides with substr $_, $-[0], $+[0] - $-[0]. Similarly, $n coincides with
substr $_, $-[n], $+[n] - $-[n]
if $-[n] is defined, and $+ coincides with
substr $_, $-[$#-], $+[$#-] - $-[$#-]
One can use $#- to find the last matched subgroup in the last successful match. Contrast with $#+, the number of subgroups in the regular expression. Compare with #+.
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope. $-[0] is the offset into the string of the beginning of the entire match. The n-th element of this array holds the offset of the nth submatch, so $-[1] is the offset where $1 begins, $-[2] the offset where $2 begins, and so on.
After a match against some variable $var:
$` is the same as substr($var, 0, $-[0])
$& is the same as substr($var, $-[0], $+[0] - $-[0])
$' is the same as substr($var, $+[0])
$1 is the same as substr($var, $-[1], $+[1] - $-[1])
$2 is the same as substr($var, $-[2], $+[2] - $-[2])
$3 is the same as substr($var, $-[3], $+[3] - $-[3])
Example usage:
#! /usr/bin/perl
use warnings;
use strict;
my #patterns = (
qr/(foo(bar(baz)))/,
qr/(quux)/,
);
chomp(my #rawfile = <DATA>);
foreach my $pattern (#patterns) {
LINE: for (#rawfile) {
/$pattern/ && do {
my $captures = $#-;
my $s = $captures == 1 ? "" : "s";
print "$_: got $captures capture$s\n";
};
}
}
__DATA__
quux quux quux
foobarbaz
Output:
foobarbaz: got 3 captures
quux quux quux: got 1 capture
How about below code:
my $string = "12345yx67hjui89";
my $count = () = $string =~ /\d/g;
print "$count\n";
It prints 9 here as expected.