Perl regex - How to make it less greedy? - regex

How do I count the number of empty 'fields' in the following string ?
Empty fields are indicated by -| or |-| or |-
The regex I have cooked up seems to be working except when I have consecutive empty fields ? How do I make it less greedy ?
my $string = 'P|CHNA|string-string|-|-|25.75|-|2562000|-0.06';
my $count = () = ($string=~/(?:^-\||\|-$|\|-\|)/g);
printf("$count\n");
The above code prints 2 instead of 3 which I want.

I'd avoid the regex route entirely for this and instead treat this like a list, because it is one:
my $count = grep { /^-$/ } split /\|/, $string;

The problem actually doesn't have anything to do with greediness/laziness (which only applies to repetition operators like * or +).
The problem is the two empty fields right next to each other: |-|-|. The first one is matched, but the second one then fails because its opening | has already been consumed by the first match. The ^-\| alternative can't rescue it either, because the beginning-of-line anchor ^ only matches at the start of the string.
I think a much easier approach would be to split your input on | and then look for any fields consisting of only a -:
my $count = 0;
foreach (split(/\|/,$string)) { if( /^-$/ ) { $count++; } }
There's really no way to robustly implement this with a regex since Perl doesn't support variable-length lookbehinds (at least not to my knowledge). One way to "cheat" would be to add a | to the beginning and end of the string; then you can successfully use lookbehind/lookahead assertions:
$string = "|$string|";
my $count = () = $string=~/(?<=\|)-(?=\|)/g;
(ikegama's answer below has an alternative solution that does use non-variable lookaround assertions without modifying the string, so I was wrong when I said there was "no way to implement this with a regex". Props to ikegama. I still think splitting on | is the best way to go for this problem, though.)
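To make the "already consumed" explanation concrete, here is a minimal sketch that runs the original pattern next to the split approach on the string from the question. The variable names @consuming and $by_split are just for illustration.

```perl
use strict;
use warnings;

my $string = 'P|CHNA|string-string|-|-|25.75|-|2562000|-0.06';

# Original pattern: each match consumes its surrounding "|" characters,
# so two adjacent empty fields cannot both match.
my @consuming = $string =~ /(?:^-\||\|-$|\|-\|)/g;

# Treating the string as a list avoids the overlap problem entirely.
# The -1 limit keeps trailing empty strings, which split would otherwise drop.
my $by_split = grep { $_ eq '-' } split /\|/, $string, -1;

printf "consuming regex: %d, split: %d\n", scalar @consuming, $by_split;
```

The regex count comes out one short because the middle | of |-|-| can only belong to one match.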

The trick is to use lookarounds. Someone's first attempt might be the following:
my $count = () = $string =~ /
(?<=\|) # Preceded by "|"
(-)
(?=\|) # Followed by "|"
/xg;
But that doesn't work. The problem with the above is that it doesn't detect if the first field or last field is empty. Two ways to fix that:
my $count = () = "|$string|" =~ /
(?<=\|) # Preceded by "|"
(-)
(?=\|) # Followed by "|"
/xg;
or
my $count = () = $string =~ /
(?<![^|]) # Not preceded by a char other than "|"
(-)
(?![^|]) # Not followed by a char other than "|"
/xg;
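As a quick check of the negative-lookaround version, the sketch below wraps it in a hypothetical helper, count_empty, and tries it on the question's string plus a made-up string whose first and last fields are empty.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the negative-lookaround pattern.
sub count_empty {
    my ($s) = @_;
    # A "-" that is neither preceded nor followed by a non-"|" character.
    my $count = () = $s =~ /(?<![^|])-(?![^|])/g;
    return $count;
}

my $middle = count_empty('P|CHNA|string-string|-|-|25.75|-|2562000|-0.06');
my $edges  = count_empty('-|a|-|-|b-c|-');    # made-up edge-case input

print "middle: $middle, edges: $edges\n";
```

Because lookarounds don't consume the | characters, adjacent empty fields and empty fields at either end of the string are all counted.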

Related

Perl: Method to convert regexp with greedy quantifiers to non-greedy

My user gives a regexp with quantifiers that default to being greedy. He can give any valid regexp. So the solution will have to deal with anything that the user can throw at me.
How do I convert the regexp so any greedy quantifier will be non-greedy?
Does Perl have a (?...:regexp) construct that forces the greedy default for quantifiers into a non-greedy one?
If not: Is there a different way I can force a regexp with greedy quantifiers into a non-greedy one?
E.g., a user may enter:
.*
[.*]
[.*]{4,10}
[.*{4,10}]{4,10}
While these four examples may look similar, they have completely different meanings.
If you simply add ? after every */} you will change the character sets in the last three examples.
Instead they should be changed to/behave like:
.*?
[.*]
[.*]{4,10}?
[.*{4,10}]{4,10}?
but where the matched string is the minimal match, and not first-match, that Perl will default to:
$a="aab";
$a=~/(a.*?b)$/;
# Matches aab, not ab
print $1;
But given the non-greedy regexp, the minimal match can probably be obtained by prepending .*:
$a="aab";
$a=~/.*(a.*?b)$/;
# Matches ab
print $1;
"Greediness" is not a property of the whole regular expression. It's a property of a quantifier.
It can be controlled for each quantifier separately. Just add a ? after a quantifier to make it non-greedy, e.g.
[a-z]*?
a{2,3}?
[0-9]??
\s+?
And no, there isn't any built-in way to turn the whole regex to some "default-non-greedy" mode. You need to parse the regex, detect all quantifiers and change them accordingly. Maybe there's a regex-parsing library somewhere on CPAN.
The closest I've found so far is the Regexp::Parser module. I didn't try it, but looks like it could parse the regex, walk the tree, make appropriate changes and then build a modified regex. Please take a look.
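To illustrate that greediness belongs to the individual quantifier, here is a small sketch with a made-up input string. The two patterns differ only in the ? after the *.

```perl
use strict;
use warnings;

my $text = '<b>one</b> and <b>two</b>';    # made-up sample input

my ($greedy) = $text =~ /<b>(.*)<\/b>/;     # .* grabs as much as it can
my ($lazy)   = $text =~ /<b>(.*?)<\/b>/;    # .*? stops at the first </b>

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```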
You can use a state machine:
#!/usr/bin/perl
use strict;
use warnings;

my @regexes = ( ".*", "[.*]", "[.*]{4,10}", "[.*{4,10}]{4,10}" );
for (@regexes) {
    print "give: $_\n";
    my $ungreedy = make_ungreedy($_, 0);
    print "got: $ungreedy\n";
    print "============================================\n";
}

sub make_ungreedy {
    my $regex        = shift;
    my $class_state  = 0;    # inside a [...] character class
    my $escape_state = 0;    # previous character was a backslash
    my $found        = 0;    # previous character ended a quantifier
    my $ungreedy     = "";
    for (split(//, $regex)) {
        if ($found) {
            $ungreedy .= "?" unless (/\?/);
            $found = 0;
        }
        $ungreedy .= $_;
        $escape_state = 0, next if ($escape_state);
        $escape_state = 1, next if (/\\/);
        $class_state  = 1, next if (/\[/);
        if ($class_state) {
            $class_state = 0 if (/\]/);
            next;
        }
        $found = 1 if (/[*}+]/);
    }
    $ungreedy .= '?' if $found;
    return $ungreedy;
}

Regular expression to match exactly and only n times

If I have the lines:
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
'asfdcacttaskdfjcacttklasdjf'
'cksjdfcacttlkasdjf'
I want to match them by the number of times a repeating subunit (cactt) occurs. In other words, if I ask for n repeats, I want matches that contain n and ONLY n instances of the pattern.
My initial attempt was implemented in perl and looks like this:
sub MATCHER {
    print "matches with $_ CACTT's\n";
    my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
    my @grep_matches = grep(/$pattern/, @matching);
    print "$_\n" for @grep_matches;
    my @copy = @grep_matches;
    my $squashed = @copy;
    print "number of rows total: $squashed\n";
}
for (2...6) {
    MATCHER($_);
}
Notes:
@matching contains the strings from 1, 2, and 3 in an array.
the for loop is set from integers 2-6 because I have a separate regex that works to forbid duplicate occurrences of the pattern.
This loop ALMOST works except that for n=2, matches containing 3 occurrences of the "cactt" pattern are returned. In fact, for any string containing n+1 matches (where n>=2), lines with n+1 occurrences are also returned by the match. I thought the negative lookahead could prevent this behavior in perl. If anyone could give me thoughts, I would be appreciative.
Also, I have thought of getting a count per line and separating them by count; I dislike the approach because it requires two steps when one should accomplish what I want.
I would be okay with a:
foreach (@matches) { $_ =~ /$pattern/; push(@selected_by_n, $1);}
The regex seems like it should be similar, but for whatever reason in practice the results differ dramatically.
Thanks in advance!
Your code is sort of strange. This regex
my $pattern = "^(.*?CACTT.+?){$_}(?!.*?CACTT).*\$";
..tries to match first the beginning of the string ^, then a minimal match of any characters .*?, followed by your sequence CACTT, followed by another minimal match .+? (slightly different from .*? in that it needs at least one character). And you want to match this group $_ times. You assume $_ will be correct when calling the sub (this is bad). Then you have a negative look-ahead assertion that wants to make sure that no minimal match of any characters .*? followed by your sequence comes next, and after that any characters of any length .* followed by end of line $. Note that .*? and .+? are free to match across a CACTT themselves, so backtracking can swallow extra occurrences.
First off, this is always redundant: ^.*. Beginning of line anchor followed by any character any number of times. This actually makes the anchor useless. Same goes for .*$. Why? Because any match that will occur, will occur anyway at the first possible time. And .*$ matches exactly the same thing that the empty string does: Anything.
For example: the regex /^.*?foo.*?$/ matches exactly the same thing as /foo/. (Excluding cases of multiline matching with strings that contain newlines).
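A minimal sketch to check that claim on a few made-up single-line strings (the @samples list is purely illustrative):

```perl
use strict;
use warnings;

my @samples = ('foo', 'a foo b', 'no match here', 'xfoofoo', '');

# Count the samples on which the two patterns disagree.
my $disagreements = grep {
    my $long  = /^.*?foo.*?$/ ? 1 : 0;
    my $short = /foo/         ? 1 : 0;
    $long != $short;
} @samples;

print "disagreements: $disagreements\n";
```

For strings without embedded newlines the two patterns should agree on every sample.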
In your case, if you want to count the occurrences of a string inside a string, you can just match them like this:
my $count = () = $str =~ /CACTT/gi;
This code:
my @copy = @grep_matches;
my $squashed = @copy;
Is completely redundant. You can just do my $squashed = @grep_matches. It makes little to no sense to first copy the array.
This code:
MATCHER($_);
Does the same as this: MATCHER("foo") or MATCHER(3.1415926536). You are not using the subroutine argument, you are ignoring it, and relying on the fact that $_ is global and visible inside the sub. What you want to do is
sub MATCHER {
my $number = shift; # shift argument from @_
Now you have encapsulated the code and all is well.
What you want to do in your case, I assume, is to count the occurrences of the substring inside your strings, then report them. I would do something like this
use strict;
use warnings;
use Data::Dumper;
my %data;
while (<DATA>) {
    chomp;
    my $count = () = /cactt/gi;     # count number of matches
    push @{ $data{$count} }, $_;    # store the line under its count
}
print Dumper \%data;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
This will print
$VAR1 = {
'2' => [
'asfdcacttaskdfjcacttklasdjf'
],
'3' => [
'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf'
],
'1' => [
'cksjdfcacttlkasdjf'
]
};
This is just to demonstrate how to create the data structure. You can now access the strings in the order of matches. For example:
for (@{ $data{3} }) { # print strings with 3 matches
    print;
}
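Putting the two points together (pass arguments explicitly, and count matches instead of writing one large regex), a sketch of an "exactly n" filter might look like this. The helper name lines_with_exactly and its argument order are my own assumptions, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines containing exactly $n occurrences
# of $pattern. Arguments are passed explicitly instead of relying on $_.
sub lines_with_exactly {
    my ($n, $pattern, @lines) = @_;
    return grep {
        my $count = () = /$pattern/g;
        $count == $n;
    } @lines;
}

my @lines = qw(
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
    asfdcacttaskdfjcacttklasdjf
    cksjdfcacttlkasdjf
);

my @two = lines_with_exactly(2, qr/cactt/i, @lines);
print "$_\n" for @two;
```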
Would you just do something like this:
use warnings;
use strict;
my $n=2;
my $match_line_cnt=0;
my $line_cnt=0;
while (<DATA>) {
my $m_cnt = () = /cactt/g;
if ($m_cnt>=$n){
print;
$match_line_cnt++;
}
$line_cnt++;
}
print "total lines: $line_cnt\n";
print "matched lines: $match_line_cnt\n";
print "squashed: ",$line_cnt-$match_line_cnt;
__DATA__
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
cksjdfcacttlkasdjf
prints:
aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
asfdcacttaskdfjcacttklasdjf
total lines: 3
matched lines: 2
squashed: 1
I think you're unintentionally asking two separate questions.
If you want to directly capture the number of times a pattern matches in a string, this one liner is all you need.
$string = 'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf';
$pattern = qr/cactt/;
print $count = () = $string =~ m/$pattern/g;
-> 3
That last line is as if you had written $count = @junk = $string =~ m/$pattern/g; but without needing an intermediate array variable. () = is the null list assignment, and it throws away whatever is assigned to it, just like the scalar assignment undef = throws away its right-hand side. But the null list assignment still returns the number of things thrown away when it is evaluated in scalar context. It returns an empty list in list context.
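A small sketch of the difference the empty list makes, using the string from the question (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $string = 'aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf';

# The list assignment is evaluated in scalar context, so it yields the
# number of elements produced by the match on its right-hand side.
my $count = () = $string =~ /cactt/g;

# Without "() =", the match itself is in scalar context and only
# reports whether it succeeded.
my $matched = $string =~ /cactt/g;

print "count: $count, matched: $matched\n";
```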
If you want to match strings that only contain some number of pattern matches, then you want to stop matching once too many are found. If the string is large (like a document) then you would waste a lot of time counting past n.
Try this.
sub matcher {
    my ($string, $pattern, $n) = @_;
    my $c = 0;
    while ($string =~ m/$pattern/g) {
        $c++;
        return if $c > $n;    # too many matches: stop early
    }
    return $c == $n ? 1 : ();
}
Now there is one more option but if you call it over and over again it gets inefficient. You can build a custom regex that matches only n times on the fly. If you only build this once however, it's just fine and speedy. I think this is what you originally had in mind.
$regex = qr/^(?:(?:(?!$pattern).)*$pattern){$n}(?:(?!$pattern).)*$/;
I'll leave the rest of that one to you. Check for n > 1 etc. The key is understanding how to use lookahead. You have to match all the NOT THINGS before you try to match THING.
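For what it's worth, here is one way that regex could be wrapped up and tried against the three lines from the question. The helper name exactly_n_regex and the intermediate $gap variable are assumptions for this sketch, and it only handles single-line input because . does not match a newline.

```perl
use strict;
use warnings;

# Hypothetical helper that builds the "exactly $n occurrences" regex.
sub exactly_n_regex {
    my ($pattern, $n) = @_;
    # A run of characters in which $pattern never starts to match.
    my $gap = qr/(?:(?!$pattern).)*/;
    return qr/^(?:$gap$pattern){$n}$gap$/;
}

my @lines = qw(
    aslkdfjcacttlaksdjcacttlaksjdfcacttlskjdf
    asfdcacttaskdfjcacttklasdjf
    cksjdfcacttlkasdjf
);

my $re   = exactly_n_regex(qr/cactt/, 2);
my @hits = grep { $_ =~ $re } @lines;
print "$_\n" for @hits;
```

Building the regex once and reusing it avoids recompiling it for every line.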
https://perldoc.perl.org/perlre

How do I check if a string has exactly one of a certain character

I'm trying to scan strings to see if they have exactly one of a certain character.
For example if I'm looking for a question mark
Hello? I'm here
Will match the regex however
Hello? Are you listening?
Will not
I've tried ?{1} and ?{1}[^?]+ but they both don't work. Can anyone point me in the right direction?
Why not do:
(\?)
and count the number of matches.
Or even more simply, count number of ? in string using tr///
my $c = $string1 =~ tr/?//;
You could do something like
my $cnt = () = $str =~ m/\Q$pat/g;
if ($cnt == 1) {
# matched
}
else {
# failed
}
$pat is the pattern (character in this case) you want to match, such as '?'.
If you're looking for a particular character only, you can use the transliteration operator, tr///:
my $count = $string =~ tr/?/?/;
if( $count == 1 ) {
...
}
With the transliteration operator, I can leave off the replacement side and any characters not lined up with a replacement character will use the previous replacement character. If there isn't a previous replacement character, it makes no replacement. I just leave out the second part of the tr///:
my $count = $string =~ tr/?//;
if( $count == 1 ) {
...
}
This won't work for patterns though. This is strictly for character-to-character replacements. For a pattern, you do the same thing with Lee Duhem's answer
You can use this regex:
^[^?]*\?[^?]*$
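A short sketch comparing the tr/// count with this regex on the two strings from the question, plus one made-up string with no question mark at all:

```perl
use strict;
use warnings;

my @tests = ("Hello? I'm here", 'Hello? Are you listening?', 'No question');

my (@by_tr, @by_regex);
for my $s (@tests) {
    # tr/?// with an empty replacement only counts; it does not modify $s.
    push @by_tr,    ($s =~ tr/?//) == 1    ? 1 : 0;
    push @by_regex, $s =~ /^[^?]*\?[^?]*$/ ? 1 : 0;
}

print "tr:    @by_tr\n";
print "regex: @by_regex\n";
```

Both approaches should accept only the first string.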

Matching numbers for substitution in Perl

I have this little script:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The expected output would be
5.txt
12.txt
1.txt
But instead, I get
R3_05.txt
T3_12.txt
1.txt
The last one is fine, but I cannot fathom why the regex gives me the start of the string along with $1 in this case.
Try this pattern
foreach (@list) {
    s/^.*?_?(?|0(\d)|(\d{2})).*\.txt$/$1.txt/;
    print $_ . "\n";
}
Explanations:
Here I use the branch reset feature (i.e. (?|...()...|...()...)), which lets capturing groups in different alternatives share the same number ($1 here). So you avoid a second substitution to trim a zero from the left of the capture.
To remove everything before the number, I use:
.*? # any characters, zero or more times
# (the ? makes the * quantifier lazy so it matches as little as possible)
_? # an optional underscore
Note that you can ensure that you have only 2 digits adding a lookahead to check if there is not a digit that follows:
s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
(?!\d) means not followed by a digit.
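Here is a minimal sketch of the branch-reset substitution applied to the three file names from the question. The results are collected in an array called @short, which is just a name picked for this example.

```perl
use strict;
use warnings;

my @list = ('R3_05_foo.txt', 'T3_12_foo_bar.txt', '01.txt');

my @short;
for my $name (@list) {
    # (?|...) numbers the groups in each alternative from the same start,
    # so whichever branch matches, the capture lands in $1.
    (my $new = $name) =~ s/^.*?_?(?|0(\d)|(\d{2}))(?!\d).*\.txt$/$1.txt/;
    push @short, $new;
}

print "$_\n" for @short;
```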
The problem here is that your substitution regex does not cover the whole string, so only part of the string is substituted. But you are using a rather complex solution for a simple problem.
It seems that what you want is to read two digits from the string, and then add .txt to the end of it. So why not just do that?
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
for (@list) {
    if (/(\d{2})/) {
        $_ = "$1.txt";
    }
}
To overcome the leading zero effect, you can force a conversion to a number by adding zero to it:
$_ = 0+$1 . ".txt";
I would modify your regular expression. Try using this code:
my @list = ('R3_05_foo.txt','T3_12_foo_bar.txt','01.txt');
foreach (@list) {
    s/.*(\d{2}).*\.txt$/$1.txt/;
    s/^0+//;
    print $_ . "\n";
}
The problem is that the pattern in your s/// matches what you think it does, but the substitution isn't replacing what you think it should. s/// only replaces the part of the string that the pattern matched. Thus, to get rid of something like T3_, the pattern has to match that too.
s/.*(\d{2}).*\.txt$/$1.txt/;
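To see the difference side by side, this sketch applies both substitutions to copies of one file name from the question ($partial and $whole are illustrative names):

```perl
use strict;
use warnings;

my $name = 'R3_05_foo.txt';

# The original pattern starts matching at "05", so "R3_" lies outside
# the match and is left untouched.
(my $partial = $name) =~ s/(\d{2}).*\.txt$/$1.txt/;

# With a leading .* the match starts at the beginning of the string,
# so the prefix is replaced as well.
(my $whole = $name) =~ s/.*(\d{2}).*\.txt$/$1.txt/;

print "partial: $partial\n";
print "whole:   $whole\n";
```

The leading zero is still there after the second form, which is why the follow-up s/^0+// is kept.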

Perl - Regex to extract only the comma-separated strings

I have a question I am hoping someone could help with...
I have a variable that contains the content from a webpage (scraped using WWW::Mechanize).
The variable contains data such as these:
$var = "ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig"
$var = "fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf"
$var = "dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew"
The only bits I am interested in from the above examples are:
@array = ("cat_dog","horse","rabbit","chicken-pig")
@array = ("elephant","MOUSE_RAT","spider","lion-tiger")
@array = ("ANTELOPE-GIRAFFE","frOG","fish","crab","kangaROO-KOALA")
The problem I am having:
I am trying to extract only the comma-separated strings from the variables and then store these in an array for use later on.
But what is the best way to make sure that I get the strings at the start (ie cat_dog) and end (ie chicken-pig) of the comma-separated list of animals as they are not prefixed/suffixed with a comma.
Also, as the variables will contain webpage content, it is inevitable that there may also be instances where a comma is immediately succeeded by a space and then another word, as that is the correct method of using commas in paragraphs and sentences...
For example:
Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.
                                                     ^        ^
                                                     |        |
note the spaces here and here
I am not interested in any cases where the comma is followed by a space (as shown above).
I am only interested in cases where the comma DOES NOT have a space after it (ie cat_dog,horse,rabbit,chicken-pig)
I have tried a number of ways of doing this but cannot work out the best way to go about constructing the regular expression.
How about
[^,\s]+(,[^,\s]+)+
which will match one or more characters that are not a space or comma [^,\s]+ followed by a comma and one or more characters that are not a space or comma, one or more times.
Further to comments
To match more than one sequence add the g modifier for global matching.
The following splits each match $& on a , and pushes the results to @matches.
my $str = "sdfds cat_dog,horse,rabbit,chicken-pig then some more pig,duck,goose";
my @matches;
while ($str =~ /[^,\s]+(,[^,\s]+)+/g) {
    push(@matches, split(/,/, $&));
}
print join("\n", @matches), "\n";
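As a variation, here is a sketch that captures the whole list in $1 instead of using $&, and runs on a made-up string that also contains ordinary "comma followed by space" prose, to show those commas are skipped.

```perl
use strict;
use warnings;

# Made-up input mixing prose commas with a comma-separated list.
my $str = 'Saturn was thought unique, however, see cat_dog,horse,rabbit,chicken-pig here';

my @matches;
while ($str =~ /([^,\s]+(?:,[^,\s]+)+)/g) {
    # $1 holds one complete comma-separated run; split it into its items.
    push @matches, split /,/, $1;
}

print join("\n", @matches), "\n";
```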
Though you can probably construct a single regex, a combination of regexes, splits, grep and map looks decent:
my @array = map { split /,/ } grep { !/^,/ && !/,$/ && /,/ } split;
Going from right to left:
Split the line on spaces (split)
Leave only elements having no comma at the either end but having one inside (grep)
Split each such element into parts (map and split)
That way you can easily change the parts e.g. to eliminate two consecutive commas add && !/,,/ inside grep.
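A runnable sketch of that pipeline, splitting an explicit variable rather than $_ (the $line value is a made-up mix of the question's data and some prose commas):

```perl
use strict;
use warnings;

my $line = 'fdsf iiukui elephant,MOUSE_RAT,spider,lion-tiger planet, however, sdf';

my @array = map  { split /,/ }                      # 3. split each list into items
            grep { !/^,/ && !/,$/ && /,/ }          # 2. keep tokens with inner commas only
            split ' ', $line;                       # 1. split the line on whitespace

print join('|', @array), "\n";
```

Tokens such as "planet," are dropped by the grep because they end with a comma.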
I hope this is clear and suits your needs:
#!/usr/bin/perl
use warnings;
use strict;
my @strs = ("ewrfs sdfdsf cat_dog,horse,rabbit,chicken-pig",
"fdsf iiukui aawwe dffg elephant,MOUSE_RAT,spider,lion-tiger hdsfds jdlkf sdf",
"dsadp poids pewqwe ANTELOPE-GIRAFFE,frOG,fish,crab,kangaROO-KOALA sdfdsf hkew",
"Saturn was long thought to be the only ringed planet, however, this is now known not to be the case.",
"Another sentence, although having commas, should not confuse the regex with this: a,b,c,d");
my $regex = qr/
\s #From your examples, it seems as if every
#comma separated list is preceded by a space.
(
(?:
[^,\s]+ #Now, not a comma or a space for the
#terms of the list
, #followed by a comma
)+
[^,\s]+ #followed by one last term of the list
)
/x;
my @matches = map {
    if ($_ =~ /$regex/) {    # test the match itself so a stale $1 is never used
        my $comma_sep_list = $1;
        [split ',', $comma_sep_list];
    }
    else {
        []
    }
} @strs;
$var =~ tr/ //s;
while ($var =~ /(?<!, )\b[^, ]+(?=,\S)|(?<=,)[^, ]+(?=,)|(?<=\S,)[^, ]+\b(?! ,)/g) {
push (@arr, $&);
}
the regular expression matches three cases :
(?<!, )\b[^, ]+(?=,\S) : matches cat_dog
(?<=,)[^, ]+(?=,) : matches horse & rabbit
(?<=\S,)[^, ]+\b(?! ,) : matches chicken-pig