Regex to match only innermost delimited sequence - regex

I have a string that contains sequences delimited by multiple characters: << and >>. I need a regular expression to only give me the innermost sequences. I have tried lookaheads but they don't seem to work in the way I expect them to.
Here is a test string:
'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>'
It should return:
but match this
this too
and <also> this
As you can see with the third result, I can't just use /<<[^>]+>>/ because the string may have one character of the delimiters, but not two in a row.
I'm fresh out of trial-and-error. Seems to me this shouldn't be this complicated.

@matches = $string =~ /(<<(?:(?!<<|>>).)*>>)/g;
(?:(?!PAT).)* is to patterns as [^CHAR]* is to characters.
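A minimal sketch of that idiom applied to the test string, assuming only the text between the delimiters is wanted. Moving the capture group inside the delimiters is my adjustment, not part of the answer above.

```perl
use strict;
use warnings;

my $string = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';

# Capture only what lies between << and >>. The lookahead refuses to step
# over any position where << or >> begins, so only innermost pairs match.
my @inner = $string =~ /<<((?:(?!<<|>>).)*)>>/g;

print "$_\n" for @inner;
```

A single `<` or `>` inside a sequence is still accepted, because the lookahead only rejects the two-character delimiters.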

$string = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';
@matches = $string =~ /(<<(?:[^<>]+|<(?!<)|>(?!>))*>>)/g;

Here's a way to use split for the job:
my $str = 'do not match this <<but match this>> not this <<BUT NOT THIS <<this too>> IT HAS CHILDREN>> <<and <also> this>>';
my @a = split /(?=<<)/, $str;
@a = map { split /(?<=>>)/, $_ } @a;
my @match = grep { /^<<.*?>>$/ } @a;
This keeps the delimiters in the results; if you want them removed, just do:
@match = map { s/^<<//; s/>>$//; $_ } @match;

Related

Matching a variable in a string in Perl from the end

I want to match a variable character in a given string, but from the end.
Ideas on how to do this action?
for example:
sub removeCharFromEnd {
my $string = shift;
my $char = shift;
if($string =~ m/$char/){ # I want to match the char, searching from the end; $ doesn't work
print "success";
}
}
Thank you for your assistance.
There is no regex modifier that would force Perl regex engine to parse the string from right to left. Thus, the most convenient way to achieve that is via a negative lookahead:
m/$char(?!.*$char)/
The (?!.*$char) negative lookahead will require the absence (=will fail the match if found) of a $char after any 0+ chars other than linebreak chars (use s modifier if you are running the regex against a multiline string input).
The regex engine works from left to right.
You can use the natural greediness of quantifiers to reach the end of the string and find the last char with the backtracking mechanism:
if($string =~ m/.*\K$char/s) { ...
\K marks the position of the match result beginning.
Other ways:
you can also reverse the string and use your previous pattern.
you can search all occurrences and take the last item in the list
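A rough sketch of locating the last occurrence without a right-to-left engine. The sample string and variable names are made up for illustration, and `rindex` is a swapped-in alternative for literal characters, not something the answer above proposes.

```perl
use strict;
use warnings;

my $string = 'abcabc';   # made-up sample input
my $char   = 'b';

# Literal search from the right: rindex returns the 0-based offset of the
# last occurrence, or -1 if there is none.
my $last_pos = rindex $string, $char;

# Regex route: record the start offset of every match, then take the last.
my @starts;
push @starts, $-[0] while $string =~ /\Q$char\E/g;

print "rindex: $last_pos, last regex match: $starts[-1]\n";
```

The `\Q...\E` quoting matters when `$char` may be a regex metacharacter such as `$` or `.`.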
I'm having trouble understanding what you want. Your subroutine is called removeCharFromEnd, so perhaps you want to remove $char from $string if it appears at the end of the string
You can do that like this
sub removeCharFromEnd {
my ( $string, $char ) = @_;
if ( $string =~ s/$char\z// ) {
print "success";
}
$string;
}
Or perhaps you want to remove the last occurrence of $char wherever it is. You can do that with
s/.*\K$char//
The subroutine I have written returns the modified string, so you would have to assign the result to a variable to save it. You can write
my $s = 'abc';
$s = removeCharFromEnd($s, 'c');
say $s;
output
ab
If you just want to modify the string in place then you should write
$_[0] =~ s/$char\z//
using whichever substitution you choose. Then you can do this
my $s = 'abc';
removeCharFromEnd($s, 'c');
say $s;
This produces the same output
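For reference, a small sketch of the in-place variant, relying on the fact that the elements of `@_` are aliases for the caller's arguments. The sub name `remove_char_in_place` is hypothetical, and the `\Q...\E` quoting is my addition so that metacharacters in `$char` are taken literally.

```perl
use strict;
use warnings;

# Hypothetical in-place variant: $_[0] is an alias for the caller's first
# argument, so substituting on it changes the caller's variable directly.
sub remove_char_in_place {
    my $char = $_[1];
    return $_[0] =~ s/\Q$char\E\z//;   # true if something was removed
}

my $s = 'abc';
remove_char_in_place($s, 'c');
print "$s\n";   # prints "ab"
```

Copying the argument first (`my $string = shift;`) would break the alias, which is why the substitution has to act on `$_[0]` itself.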
To get Perl to search from the end of a string, reverse the string.
sub removeCharFromEnd {
my $string = reverse shift @_;
my $char = quotemeta reverse shift @_;
$string =~ s/$char//;
$string = reverse $string;
return $string;
}
print removeCharFromEnd(qw( abcabc b )), "\n";
print removeCharFromEnd(qw( abcdefabcdef c )), "\n";
print removeCharFromEnd(qw( !"/$%?&*!"/$%?&* $ )), "\n";

Group of Regular expression matches, return nth match and assign to array

$content = "<p><strong>haha</p> .. .. .. <p><strong> hihi </p>";
my @eachTopic = ($content =~ /(<p><strong>)(.*?)(<\/p>)/g);
#I only want to capture the $2 and add it to array
print $_."\n" foreach(@eachTopic);
The resulting array should have 2 values $eachTopic[0] = "haha" and $eachTopic[1] = "hihi".
I know i could do this using for loop and using a if statement, but just wondering, is there anything in regular expression that i can do something like
@arr = ($content =~ /(x)(.*?)(x)/g$2/)
Instead of trying to do this with a match group, use zero-width assertions to match what comes before and after the part you want to match. That way you're matching only that part, and a global match will return an array of all occurrences.
my @eachTopic = ($content =~ /(?<=<p><strong>).*?(?=<\/p>)/g);
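A short sketch comparing the lookaround version with a single-capture alternative. The second form is my addition, not from the answer above: when a pattern has exactly one capture group, a global match in list context returns just that group for each match.

```perl
use strict;
use warnings;

my $content = "<p><strong>haha</p> .. .. .. <p><strong> hihi </p>";

# Zero-width assertions: only the text between the tags is part of the match.
my @by_lookaround = $content =~ /(?<=<p><strong>).*?(?=<\/p>)/g;

# Alternative: one capture group, so the list holds only the captured text.
my @by_capture = $content =~ /<p><strong>(.*?)<\/p>/g;

print "[$_]\n" for @by_lookaround;
```

Both lists keep the spaces around " hihi "; trim them separately if they are unwanted.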
Have you tried split?
like so:
my @topics = map{ $_ =~ s/<\/p>$//; $_ }split( /<p><strong>/, $txt );
This of course fails if your string contains something other than <p><strong>...</p> at some point...

Perl idiom for quickly searching file with elements in array

what is the Perl idiom to search a string or a whole file for array elements occurrences? E.g.:
my @array = qw(word, test, ...);
my $string = ".......";
I want to search for word or test (can also be words, tester, etc.) inside $string and return whatever is found (i.e. group match).
I searched the docs, seems like map + grep is what I need but I just can’t come up with the code for it. Perl is such fun that I am totally clueless sometimes. :)
Using one example from map:
my @squares = map { $_ * $_ } grep { $_ > 5 } @numbers;
I suppose I can split the string into array and grep. Am I right?
grep { @array } @string; # something like grep {/(word|test)/} @string but I want to use array
my @word_roots = qw( word test );
my $pat = join '|', map quotemeta, @word_roots;
my $re = qr/\b(?:$pat)\w*\b/;
my @matches = $string =~ /($re)/g;
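A runnable sketch of this approach with a made-up sample string. It assumes the intent is to match each root plus any trailing word characters, so it uses `\w*` so that the bare roots match too.

```perl
use strict;
use warnings;

my @word_roots = qw( word test );
my $string     = 'the word is tester, not coward or words';   # made-up sample text

# Escape each root, join them into one alternation, and allow optional
# trailing word characters so "words" and "tester" are picked up as well.
my $pat = join '|', map quotemeta, @word_roots;
my $re  = qr/\b(?:$pat)\w*\b/;

my @matches = $string =~ /($re)/g;
print join(', ', @matches), "\n";
```

Building one alternation means the string is scanned once, rather than once per array element.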
How about something like this from a re.pl session:
$ my @array = qw(word test)
$VAR1 = 'word';
$VAR2 = 'test';
$ my $string = ' the word is test, I said'
the word is test, I said
$ my @match_array = map { $string =~ /\b($_)\b/ } @array
$VAR1 = 'word';
$VAR2 = 'test';
The parentheses around $_ capture the match in the regex inside of map.
The \b ensures that we only match if the word is found on its own (like "test" or "word") and not inside words that merely contain the characters "test" or "word", like "coward" or "brightest". See http://www.regular-expressions.info/wordboundaries.html for more details on \b.

In regular expression matching of Perl, is it possible to know number of matches in a{n,}?

What I mean is:
For example, a{3,} will greedily match 'a' at least three times. It may match five times, 10 times, etc. I need this number for the rest of the code.
I can do the rest less efficiently without knowing it, but I thought maybe Perl has some built-in variable to give this number or is there some trick to get it?
Just capture it and use length.
if (/(a{3,})/) {
print length($1), "\n";
}
Use @LAST_MATCH_END and @LAST_MATCH_START (@+ and @-)
my $str = 'jlkjmkaaaaaamlmk';
$str =~ /a{3,}/;
say $+[0]-$-[0];
Output:
6
NB: This will work only with a one-character pattern.
Here's an idea (maybe this is what you already had?) assuming the pattern you're interested in counting has multiple characters and variable length:
capture the substring which matches the pattern{3,} subpattern
then match the captured substring globally against pattern (note the absence of the quantifier), and force a list context on =~ to get the number of matches.
Here's a sample code to illustrate this (where $patt is the subpattern you're interested in counting)
my $str = "some catbratmatrattatblat thing";
my $patt = qr/b?.at/;
if ($str =~ /some ((?:$patt){3,}) thing/) {
my $count = () = $1 =~ /$patt/g;
print $count;
...
}
Another (admittedly somewhat trivial) example with 2 subpatterns
my $str = "some catbratmatrattatblat thing 11,33,446,70900,";
my $patt1 = qr/b?.at/;
my $patt2 = qr/\d+,/;
if ($str =~ /some ((?:$patt1){3,}) thing ((?:$patt2){2,})/) {
my ($substr1, $substr2) = ($1, $2);
my $count1 = () = $substr1 =~ /$patt1/g;
my $count2 = () = $substr2 =~ /$patt2/g;
say "count1: " . $count1;
say "count2: " . $count2;
}
Limitation(s) of this approach:
Fails miserably with lookarounds. See amon's example.
If you have a pattern of type /AB{n,}/ where A and B are complex patterns, we can split the regex into multiple pieces:
my $string = "ABABBBB";
my $n = 3;
my $count = 0;
TRY:
while ($string =~ /A/gc) {
my $pos = pos $string; # remember position for manual backtracking
$count++ while $string =~ /\GB/g;
if ($count < $n) {
$count = 0;
pos($string) = $pos; # restore previous position
} else {
last TRY;
}
}
say $count;
Output: 4
However, embedding code into the regex to do the counting may be more desirable, as it is more general:
my $string = "ABABBBB";
my $count;
$string =~ /A(?{ $count = 0 })(?:B(?{ $count++ })){3,}/ and say $count;
Output: 4.
The downside is that this code won't run on older perls. (Code was tested on v14 & v16).
Edit: The first solution will fail if the B pattern backtracks, e.g. $B = qr/BB?/. That pattern should match the ABABBBB string three times, but the strategy will only let it match two times. The solution using embedded code allows proper backtracking.

Perl regex digit

Consider my regex in this code section:
use strict;
my @list = ("1", "2", "123");
&chk(@list);
sub chk {
my @num = split (" ", "@_");
foreach my $chk (@num) {
chomp $chk;
if ($chk =~ m/\d{1,2}?/) {
print "$chk\n";
}
}
}
The \d{4} will print nothing. The \d{3} will print only 123. But if I change to \d{1,2}? it will print all. I thought, according to all the sources I read so far, that {1,2} mean: one digit but no more than two. So it should have printed only 1 and 2, correct?
What do I need to extract items that contains only one to two digits?
\d{1,2} succeeds if it finds 1 or 2 digits anywhere in the string provided. Additional string content does not cause the match to fail. If you want to match only when the string consists of exactly 1 or 2 digits, do this: ^\d{1,2}$
You should anchor your regular expression for the desired effect. The built-in function grep suits better here since it is a selection from an array that is to be done:
#!/usr/bin/env perl
use strict;
use warnings;
my @list = ( 1, 2, 123 );
print join "\n", grep /^\d{1,2}$/, @list;
It appears to be working perfectly!
Here's a hint: Use the Perl variables $`, $&, and $'. These variables are special regular expression variables that show the part of the string before the match, what was matched, and the post matched string.
Here's a sample program:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use Scalar::Util;
my @list = ("1", "2", "123");
foreach my $string (@list) {
if ($string =~ /\d{1,2}?/) {
say qq(We have a match for "$string"!);
say qq("$`" "$&" "$'");
}
else {
say "No match makes David Sad";
}
}
The output will be:
We have a match for "1"!
"" "1" ""
We have a match for "2"!
"" "2" ""
We have a match for "123"!
"" "1" "23"
What this does is divide up the string into three sections: The section of the string before the regular expression match, the section of the string that matches the regular expression, and the section of the string after the regular expression match.
In each case, there was no pre-match because the regular expression matches from the start of the string. We also see that \d{1,2}? matches a single digit in each case even though 123 could have matched two digits. Why? Because the question mark on the end of the match specifier tells the regular expression not to be greedy. In this case, we tell the regular expression to match either one or two characters. Fine, it matches on one. Remove the question mark, and the last line would have looked like this:
We have a match for "123"!
"" "12" "3"
If you want to match on one or two digits, but not three or more digits, you'll have to specify the part of your string before and after the one or two digits. Something like this:
/\D\d{1,2}\D/
This would match your string foo12bar, but not foo123bar. But what if the string is 12? In that case, we want to say that either we have the beginning of the string, or a non-digit before our one or two character match, and we either have a non-digit or the end of the string at the end of our one or two character match:
/(\D|^)\d{1,2}(\D|$)/
A quick explanation:
(\D|^): A non-digit or the beginning of the string (The ^ anchor)
\d{1,2}: One or two digits
(\D|$): A non-digit or the end of the string (The $ anchor)
Now, this will match 12, but not 123, and it will match foo12 and foo12bar, but not foo123 or foo123bar.
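A quick way to check those claims is to run the pattern over the example strings. This is only a sketch: the `%verdict` hash is scaffolding for illustration, and the lookaround form `(?<!\d)\d{1,2}(?!\d)` is an alternative I am adding, which tests the neighbouring characters without consuming them.

```perl
use strict;
use warnings;

my $consuming  = qr/(\D|^)\d{1,2}(\D|$)/;      # pattern from the text above
my $lookaround = qr/(?<!\d)\d{1,2}(?!\d)/;     # alternative: neighbours not consumed

my %verdict;
for my $s (qw( 12 123 foo12 foo12bar foo123 foo123bar )) {
    $verdict{$s} = [ ($s =~ $consuming  ? 'match' : 'no match'),
                     ($s =~ $lookaround ? 'match' : 'no match') ];
    printf "%-10s %-9s %s\n", $s, @{ $verdict{$s} };
}
```

With the lookaround form, `$&` holds only the digits, which is convenient when the matched number itself is needed.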
Just looking for a one or two digit number, we can simply specify the anchors:
/^\d{1,2}$/;
Now, that will match 1, 12, but not foo12 or 123.
The main thing is to use the $`, $&, and $' variables in order to help see exactly what your regular expression is matching on and what's before and after your match.
No, because while the regex only matches two digits, $chk still contains 123. If you want to only print the part that is matched, use
if ($chk =~ m/(\d{1,2})/) {
print "$1\n";
}
Note the parentheses and the $1. This causes it to print only that which is in the parentheses.
Also, this code doesn't make much sense:
sub chk {
my @num = split (" ", "@_");
Because @_ already is an array it makes no sense to make it into a string and then split it. Simply do:
sub chk {
foreach my $chk (@_) {
You also do not need to use chomp for data that is not coming from user input, as it is intended to remove the trailing newline. There is no newline in any of this data.
#!/usr/bin/perl
use strict;
my @list = ("1", "2", "123");
&chk(\@list);
sub chk {
foreach my $chk (@{$_[0]}) {
print "$chk\n" if $chk =~ m/^\d{1,2}$/ ;
}
}
#!/usr/bin/perl
use strict;
use warnings;
my @list = ("1", "2", "123");
&chk(@list);
sub chk {
my @num = split (" ", "@_");
foreach my $chk (@num) {
chomp $chk;
if ($chk =~ m/\d{1,2}/ && length($chk) <= 2) {
print "$chk\n";
}
}
}