Perl: regex for conditional replace? - regex

in this string
ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<
I want to replace each substring between XY and < by either ONE or TWO depending on characters between previous brackets:
if XY after (CN) replace substring by ONE
if XY after (CI) replace substring by TWO
So the result should be:
ab<(CN)cdONE<(CI)efgTWO<(CN)zONE<(CI)efgTWO<
XY and following characters should be replaced but not angle bracket <.
This is for modifying HTML and arbitrary characters can occur between XY and <.
I guess I need two regex for (CN) and (CI).
# This one replaces just all XY:
my $s = 'ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
$s =~ s/(XY(.*?))</ONE/g;
# But how to add the conditions to the regex?

You don't need two regexes. Capture the C[NI] and retrieve the corresponding replacement value from a hash:
#!/usr/bin/perl
use warnings;
use strict;
my $s = 'ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
my %replace = (CN => 'ONE', CI => 'TWO');
$s =~ s/(\((C[NI])\).*?)XY.*?</$1$replace{$2}</g;
my $exp = 'ab<(CN)cdONE<(CI)efgTWO<(CN)zONE<(CI)efgTWO<';
use Test::More tests => 1;
is $s, $exp;

My guess is that this expression or maybe a modified version of that might work, not sure though:
([a-z]{2}<\([A-Z]{2}\)[a-z]{2})([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)(<\([A-Z]{2}\)[a-z])([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)<
Test
use strict;
use warnings;
my $str = 'ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
my $regex = qr/([a-z]{2}<\([A-Z]{2}\)[a-z]{2})([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)(<\([A-Z]{2}\)[a-z])([^<]+)(<\([A-Z]{2}\)[a-z]{3})([^<]+)</mp;
my $subst = '"$1ONE$3TWO$5ONE$7TWO<"';
my $result = $str =~ s/$regex/$subst/rgee;
print $result;
The expression is explained on the top right panel of this demo, if you wish to explore/simplify/modify it, and in this link, you can watch how it would match against some sample inputs step by step, if you like.

This can be done in one line regex using /e and ternary operator ? in the /replace/.
/r option returns the resulting string, in effect this would keep the original string $s unmodified.
use strict;
use warnings;
my $s ='ab<(CN)cdXYlm<(CI)efgXYop<(CN)zXYklmn<(CI)efgXYuvw<';
print (($s=~s/\(([^)]+)\)([^(]+)XY[^(]+</"($1)$2".(($1 eq CN)?ONE:TWO)."<"/gre)."\n");
Output:
ab<(CN)cdONE<(CI)efgTWO<(CN)zONE<(CI)efgTWO<

Related

Extract only pattern matched text

I have written a basic program using regular expression.
However the entire line is being returned instead of the matched part.
I want to extract the number only.
use strict;
use warnings;
my $line = "ABMA 1234";
$line =~ /(\s)(\d){4}/;
print $line; #prints *ABMA 1234*
Is my regular expression incorrect?
If you want to print 1234, you need to change your regex and print the 2nd match:
use strict;
use warnings;
my $line = "ABMA 1234";
$line =~ /(\s)(\d{4})/;
print $2;
You can replace the exact value with the corresponding values. And your are not removing the text \w;
use strict;
use warnings;
my $line = "ABMA 1234";
$line=~s/([A-z]*)\s+(\d+)/$2/;
print $line; #prints only 1234
If you want to store the value in the new string then
(my $newstring = $line)=~s/([A-z]*)\s+(\d+)/$2/;
print $newstring; #prints only 1234
Just try this:
I don't know how you output the match in perl but you can use below regex for output the full match in your regex, you might getting space appended with your result in your current regex.
\b[\d]{4}
DEMO

command-line-argument deduction as a special variable

My simple script in Perl
#!/usr/bin/perl
use v5.20.1;
use strict;
use warnings;
$| = 1;
my $string = $ARGV[ 0 ];
my $match = $ARGV[ 1 ];
my $substitute = $ARGV[ 2 ];
$string =~ m/match/g; # just for capturing the pattern
$string =~ s/$match/only $substitute/g;
say "\$1[itself] == $1";
say "\$substitute == $substitute";
say "\$string == $string";
say "prefix == $`";
say "match == $&";
say "suffix == $'";
Input and Output:
$ perl temp 'I have 9 dollars' '(\d+)' '$1'
$1[itself] == 9
$substitute == $1
$string == I have only $1 dollars
prefix == I have
match == 9
suffix == dollars
A thing that I am trying to do is a simple substitution by initializing the match and substitution variable from #ARGV. After that a simple match for initializing the special character such as $1. But when I want to pass the special character $1 from command-line to the script, it is deducted as a regular string and not the special variable $1 that I need for substitution
If I change the code from:
$string =~ s/$match/only $substitute/g;
to:
$string =~ s/$match/only $1/g;
now it works!
1. Why $1 of the command line is different from $1 in the script?
2. Is there any way to solve it?
A screenshot of my console:
Edit
As ikegami suggested I installed and used the String::Substitution and could by using gsub_copy($line, $match, $substitute); deduced the substitution.
But still I need to use this substitution itself for printing it on the screen and colorize it for illustration that what happens.
In fact this purpose, is part of a rename script that read all file and by call rename function, changes the name of the files.
As you can see on the screenshot:
instead of for_$1 it should be for_Level on the screen. So does this module can return what it has deduced?
I am not talking about the sub match of the match group parentheses!
Also the author of the module has said:
This module does not save or interpolate $& to avoid the "considerable
performance penalty"
May this screenshot clarify the subject:
Why $1 of the command line is different from $1 in the script?
You're actually asking why
s/.../...$substitute.../
in your script is different than
s/.../...$1.../
in your script. For the same reason that
print($substitute);
is different than
print($1);
and that is that the value of $1 (the captured text) is different than the value of $substitute ($1).
This is what you need:
use String::Substitution qw( gsub_modify );
gsub_modify($string, $match, $substitue);
That assumes you don't actually need $1 outside of the replacement expression, and that you don't actually need $& and friends at all. If you do need them, then you can use the following:
use String::Substitution qw( interpolate_match_vars last_match_vars );
$string =~ s/$match/ interpolate_match_vars($substitute, last_match_vars()) /eg;
Obligatory discouragement: don't do this. Basically this is just running eval on a user-supplied string, which is really a bad idea.
Actual answer: use the ee modifer (see perlop) to s/// to eval the string in the right hand part of the expression:
$string =~ s/$match/"\"only $substitute\""/gee
By the way, you don't need to capture the pattern separately, just put the $match in a capturing group in the s/// expression:
$string =~ s/($match)/"\"only $substitute\""/gee

Perl split by regexp issue

I'm writing some parser on Perl and here is a problem with split. Here is my code:
my $str = 'a,b,"c,d",e';
my #arr = split(/,(?=([^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
# try to split the string by comma delimiter, but only if comma is followed by the even or zero number of quotes
foreach my $val (#arr) {
print "$val\n"
}
I'm expecting the following:
a
b
"c,d"
e
But this is what am I really received:
a
b,"c,d"
b
"c,d"
"c,d"
e
I see my string parts are in array, their indices are 0, 2, 4, 6. But how to avoid these odd b,"c,d" and other rest string parts in the resulting array? Is there any error in my regexp delimiter or is there some special split options?
You need to use a non-capturing group:
my #arr = split(/,(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)/, $str);
^^
See IDEONE demo
Otherwise, the captured texts are output as part of the resulting array.
See perldoc reference:
If the regex has groupings, then the list produced contains the matched substrings from the groupings as well
What's tripping you up is a feature in split in that if you're using a group, and it's set to capture - it returns the captured 'bit' as well.
But rather than using split I would suggest the Text::CSV module, that already handles quoting for you:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new();
my $fields = $csv->getline( \*DATA );
print join "\n", #$fields;
__DATA__
a,b,"c,d",e
Prints:
a
b
c,d
e
My reasoning is fairly simple - you're doing quote matching and may have things like quoted/escaped quotes, etc. mean you're trying to do a recursive parse, which is something regex simply isn't well suited to doing.
You can use parse_line() of Text::ParseWords, if you are not really bounded for regex:
use Text::ParseWords;
my $str = 'a,b,"c,d",e';
my #arr = parse_line(',', 1, $str);
foreach (#arr)
{
print "$_\n";
}
Output:
a
b
"c,d"
e
Do matching instead of splitting.
use strict; use warnings;
my $str = 'a,b,"c,d",e';
my #matches = $str =~ /"[^"]*"|[^,]+/g;
foreach my $val (#matches) {
print "$val\n"
}

Match the nth longest possible string in Perl

The pattern matching quantifiers of a Perl regular expression are "greedy" (they match the longest possible string). To force the match to be "ungreedy", a ? can be appended to the pattern quantifier (*, +).
Here is an example:
#!/usr/bin/perl
$string="111s11111s";
#-- greedy match
$string =~ /^(.*)s/;
print "$1\n"; # prints 111s11111
#-- ungreedy match
$string =~ /^(.*?)s/;
print "$1\n"; # prints 111
But how one can find the second, third and .. possible string match in Perl? Make a simple example of yours --if need a better one.
Utilize a conditional expression, a code expression, and backtracking control verbs.
my $skips = 1;
$string =~ /^(.*)s(?(?{$skips-- > 0})(*FAIL))/;
The above will use greedy matching, but will cause the largest match to intentionally fail. If you wanted the 3rd largest, you could just set the number of skips to 2.
Demonstrated below:
#!/usr/bin/perl
use strict;
use warnings;
my $string = "111s11111s11111s";
$string =~ /^(.*)s/;
print "Greedy match - $1\n";
$string =~ /^(.*?)s/;
print "Ungreedy match - $1\n";
my $skips = 1;
$string =~ /^(.*)s(?(?{$skips-- > 0})(*FAIL))/;
print "2nd Greedy match - $1\n";
Outputs:
Greedy match - 111s11111s11111
Ungreedy match - 111
2nd Greedy match - 111s11111
When using such advanced features, it is important to have a full understanding of regular expressions to predict the results. This particular case works because the regex is fixed on one end with ^. That means that we know that each subsequent match is also one shorter than the previous. However, if both ends could shift, we could not necessarily predict order.
If that were the case, then you find them all, and then you sort them:
use strict;
use warnings;
my $string = "111s11111s";
my #seqs;
$string =~ /^(.*)s(?{push #seqs, $1})(*FAIL)/;
my #sorted = sort {length $b <=> length $a} #seqs;
use Data::Dump;
dd #sorted;
Outputs:
("111s11111s11111", "111s11111", 111)
Note for Perl versions prior to v5.18
Perl v5.18 introduced a change, /(?{})/ and /(??{})/ have been heavily reworked, that enabled the scope of lexical variables to work properly in code expressions as utilized above. Before then, the above code would result in the following errors, as demonstrated in this subroutine version run under v5.16.2:
Variable "$skips" will not stay shared at (re_eval 1) line 1.
Variable "#seqs" will not stay shared at (re_eval 2) line 1.
The fix for older implementations of RE code expressions is to declare the variables with our, and for further good coding practices, to localize them when initialized. This is demonstrated in this modified subroutine version run under v5.16.2, or as put below:
local our #seqs;
$string =~ /^(.*)s(?{push #seqs, $1})(*FAIL)/;
Start by getting all possible matches.
my $string = "111s1111s11111s";
local our #matches;
$string =~ /^(.*)s(?{ push #matches, $1 })(?!)/;
This finds
111s1111s11111
111s1111
111
Then, it's just a matter of finding out which one is the second longuest and filtering out the others.
use List::MoreUtils qw( uniq );
my $target_length = ( sort { $b <=> $a } uniq map length, #matches )[1];
#matches = uniq grep { length($_) == $target_length } #matches
if $target_length;

Perl regex return matches from substitution

I am trying to simultaneously remove and store (into an array) all matches of some regex in a string.
To return matches from a string into an array, you could use
my #matches = $string=~/$pattern/g;
I would like to use a similar pattern for a substitution regex. Of course, one option is:
my #matches = $string=~/$pattern/g;
$string =~ s/$pattern//g;
But is there really no way to do this without running the regex engine over the full string twice? Something like
my #matches = $string=~s/$pattern//g
Except that this will only return the number of subs, regardless of list context. I would also take, as a consolation prize, a method to use qr// where I could simply modify the quoted regex to to a sub regex, but I don't know if that's possible either (and that wouldn't preclude searching the same string twice).
Perhaps the following will be helpful:
use warnings;
use strict;
my $string = 'I thistle thing am thinking this Thistle a changed thirsty string.';
my $pattern = '\b[Tt]hi\S+\b';
my #matches;
$string =~ s/($pattern)/push #matches, $1; ''/ge;
print "New string: $string; Removed: #matches\n";
Output:
New string: I am a changed string.; Removed: thistle thing thinking this Thistle thirsty
Here is another way to do it without executing Perl code inside the substitution. The trick is that the s///g will return one capture at a time and undef if it does not match, thus quitting the while loop.
use strict;
use warnings;
use Data::Dump;
my $string = "The example Kenosis came up with was way better than mine.";
my #matches;
push #matches, $1 while $string =~ s/(\b\w{4}\b)\s//;
dd #matches, $string;
__END__
(
"came",
"with",
"than",
"The example Kenosis up was way better mine.",
)