Pattern binding operator on assignment - regex

I am working into uncommented perl code. I came across a passage, that looks too perl-ish to me as a perl beginner. This is a simplified adaption:
my $foo;
my $bar = "x|y|z|";
$bar =~ s{\|$}{};
($foo = $bar) =~ s{ }{}gs;
I understand that $bar =~ s{\|$}{} applies the regular expression on the right to the string inside $bar.
But what does the expression ($foo = $bar) =~ s{ }{}gs; mean? I am not asking about the regular expression but on the expression it is apllied to.

Just follow the precedence that the parentheses dictate and solve each statement one at the time:
($a = $b) =~ s{ }{}gs;
#^^^^^^^^--- executed first
($a = $b) # set $a to the value contained in $b
$a =~ s{ }{}gs; # perform the regex on $a
The /g global modifier causes the regex to match as many times as possible, the /s modifier makes the wildcard . match newline as well (so it now really matches everything). The /s modifier is redundant for this regex, since there are no wildcards . in it.
Note that $a and $b are predeclared variables which are used by sort, and you should avoid using them.
When in doubt, you can always print the variables and see how they change. For example:
use Data::Dumper;
my $x = 'foo bar';
(my $y = $x) =~ s{ }{}gs;
print Dumper $x, $y;
Output:
$VAR1 = 'foo bar';
$VAR2 = 'foobar';

A scalar assignment in scalar context returns its left-hand-side operand (as shown here). That means
$a = $b
assigns the value of $b to $a and returns $a. That means
($a = $b) =~ s{ }{}gs;
is short for
$a = $b; $a =~ s{ }{}gs;
and long for
$a = $b =~ s{ }{}gsr; # Requires 5.14+

But what does the expression ($a = $b) =~ s{ }{}gs; mean?
It is same as
$a = $b;
$a =~ s{ }{}gs;
s{ }{}gs is substitution s/ //gs regex with {} as delimiters

Related

Perl - string matching issue

I have a problem I cannot understand. I have this string:
gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
And I want to change the gene_id. So, I have the following code:
if ($line =~ /;transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}
The aim of my code is to change siRNA_Z27kG1_20543 for siRNA_Z27kG1_20543_X_1. However, my code does not produce that output. Why? I can't understand that.
My regex needs to be as it is because I match other strings (this time with success).
#!/usr/bin/perl
use strict;
use warnings;
my $string = q{gene_id "siRNA_Z27kG1_20543"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"};
if($string =~ m|transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)"|){
my $replace_with = qq{gene_id "$1$2"};
$string =~ s/gene_id (\"\w+\")/$replace_with/g;
}
print "$string";
Output: gene_id "siRNA_Z27kG1_20543_X_1"transcript_id "siRNA_Z27kG1_20543_X_1";tss_id "TSS124620"
Demo
Remove the semicolon at the start of the pattern as it is not present in the string :-
if ($line =~ /transcript_id "([A-Za-z0-9:\-._]*)(_[oxOX][_.][0-9]*)";/) {
$num = $2;
$line =~ s/gene_id "([A-Za-z0-9:\-._]*)";/gene_id "$1$num";/g;
print $new $line."\n";
}

How to remove the last two occurrences of "-" character of a string?

I have an array of strings that are formatted as such:
Ado-trastuzumab emtansine(HER2)02-22-2013
I would like to remove the last two "-" symbols only (from the date part of the original string) so that the name of the drug (Ado-trastuzumab emtansine) is not altered. Right now my regex removes all "-" symbols:
foreach my $string (#array) {
$string =~ tr/-//d;
}
I would like the output to instead be the following:
Ado-trastuzumab emtansine(HER2)02222013
Thanks for the help!
You can use substr as an lvalue to only apply the transliteration to a particular part of your string:
substr($string, -10, 10) =~ tr/-//d;
In this case, on the last 10 letters of the string.
foreach my $string (#array) {
$string =~ s/(\d{2})-(\d{2})-(\d{4})$/$1$2$3/;
}
To do what you say literally - remove the last two hyphens from a string - you could write this
$string =~ s/-([^-]*)-([^-]*)\z/$1$2/;
But in this case you could simply remove all hyphens that follow a digit:
$string =~ s/\d\K-//g;
If nothing should be done when there's only one -:
$s =~ s/-([^-]*)-([^-]*)\z/$1$2/;
$s = reverse($s);
$s =~ s/^([^-]*)-([^-]*)-/$1$2/;
$s = reverse($s);
$s = reverse( reverse($s) =~ s/^([^-]*)-([^-]*)-/$1$2/r ); # 5.14+
All these work even if there is only one -:
$s =~ s/-([^-]*)(?:-([^-]*))?\z/$1$2/;
$s =~ s/-([^-]*)\z/$1/ for 1..2;
$s =~ s/^.*\K-//s for 1..2;
$s = reverse($s);
$s =~ s/-// for 1..2;
$s = reverse($s);
$s = reverse($s);
$s =~ s/^([^-]*)-(?:([^-]*)-)?/$1$2/;
$s = reverse($s);
$s = reverse( reverse($s) =~ s/^([^-]*)-(?:([^-]*)-)?/$1$2/r ); # 5.14+
For long strings, the reverse solutions should be much faster. For the short strings, go for readability.

How to replace a variable with another variable in PERL?

I am trying to replace all words from a text except some that I have in an array. Here's my code:
my $text = "This is a text!And that's some-more text,text!";
while ($text =~ m/([\w']+)/g) {
next if $1 ~~ #ignore_words;
my $search = $1;
my $replace = uc $search;
$text =~ s/$search/$replace/e;
}
However, the program doesn't work. Basically I am trying to make all words uppercase but skip the ones in #ignore_words. I know it's a problem with the variables being used in the regular expression, but I can't figure the problem out.
#!/usr/bin/perl
my $text = "This is a text!And that's some-more text,text!";
my #ignorearr=qw(is some);
my %h1=map{$_ => 1}#ignorearr;
$text=~s/([\w']+)/($h1{$1})?$1:uc($1)/ge;
print $text;
On running this,
THIS is A TEXT!AND THAT'S some-MORE TEXT,TEXT!
You can figure the problem out of your code if instead of applying an expression to the same control variable of a while loop, just let s/../../eg do it globally for you:
my $text = "This is a text!And that's some-more text,text!";
my #ignore_words = qw{ is more };
$text =~ s/([\w']+)/$1 ~~ #ignore_words ? $1 : uc($1)/eg;
print $text;
And on running:
THIS is A TEXT!AND THAT'S SOME-more TEXT,TEXT!

Replace only up to N matches on a line

In Perl, how to write a regular expression that replaces only up to N matches per string?
I.e., I'm looking for a middle ground between s/aa/bb/; and s/aa/bb/g;. I want to allow multiple substitutions, but only up to N times.
I can think of three reliable ways. The first is to replace everything after the Nth match with itself.
my $max = 5;
$s =~ s/(aa)/ $max-- > 0 ? 'bb' : $1 /eg;
That's not very efficient if there are far more than N matches. For that, we need to move the loop out of the regex engine. The next two methods are ways of doing that.
my $max = 5;
my $out = '';
$out .= $1 . 'bb' while $max-- && $in =~ /\G(.*?)aa/gcs;
$out .= $1 if $in =~ /\G(.*)/gcs;
And this time, in-place:
my $max = 5;
my $replace = 'bb';
while ($max-- && $s =~ s/\G.*?\Kaa/$replace/s) {
pos($s) = $-[0] + length($replace);
}
You might be tempted to do something like
my $max = 5;
$s =~ s/aa/bb/ for 1..$max;
but that approach will fail for other patterns and/or replacement expressions.
my $max = 5;
$s =~ s/aa/ba/ for 1..$max; # XXX Turns 'aaaaaaaa'
# into 'bbbbbaaa'
# instead of 'babababa'
And of course, starting from the beginning of the string every time could be expensive.
What you want is not posible in regular expressions. But you can put the replacement in a for-loop:
my $i;
my $aa = 'aaaaaaaaaaaaaaaaaaaa';
for ($i=0;$i<4;$i++) {
$aa =~ s/aa/bb/;
}
print "$aa\n";
result:
bbbbbbbbaaaaaaaaaaaa
You can use the /e flag which evaluates the right side as an expression:
my $n = 3;
$string =~ s/(aa)/$n-- > 0 ? "bb" : $1/ge;
Here's a solution using the /e modifier, with which you can use
perl code to generate the replacement string:
my $count = 0;
$string =~ s{ $pattern }
{
$count++;
if ($count < $limit ) {
$replace;
} else {
$&; # faking a no-op, replacing with the original match.
}
}xeg;
With perl 5.10 or later you can drop the $& (which has weird
performance complications) and use ${^MATCH} via the /p modifier
$string =~ s{ $pattern }
{
$count++;
if ($count < $limit ) {
$replace;
} else {
${^MATCH};
}
}xegp;
It's too bad you can't just do this, but you can't:
last if $count >= $limit;

What is the scope of $1 through $9 in Perl?

What is the scope of $1 through $9 in Perl? For instance, in this code:
sub bla {
my $x = shift;
$x =~ s/(\d*)/$1 $1/;
return $x;
}
my $y;
# some code that manipulates $y
$y =~ /(\w*)\s+(\w*)/;
my $z = &bla($2);
my $w = $1;
print "$1 $2\n";
What will $1 be? Will it be the first \w* from $x or the first \d* from the second \w* in $x?
from perldoc perlre
The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+ , $& , $` , $' , and $^N ) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See ""Compound Statements"" in perlsyn.)
This means that the first time you run a regex or substitution in a scope a new localized copy is created. The original value is restored (à la local) when the scope ends. So, $1 will be 10 up until the regex is run, 20 after the regex, and 10 again when the subroutine is finished.
But I don't use regex variables outside of substitutions. I find much clearer to say things like
#!/usr/bin/perl
use strict;
use warnings;
sub bla {
my $x = shift;
$x =~ s/(\d*)/$1 $1/;
return $x;
}
my $y = "10 20";
my ($first, $second) = $y =~ /(\w*)\s+(\w*)/;
my $z = &bla($second);
my $w = $first;
print "$first $second\n";
where $first and $second have better names that describe their contents.
By making a couple of small alterations to your example code:
sub bla {
my $x = shift;
print "$1\n";
$x =~ s/(\d+)/$1 $1/;
return $x;
}
my $y = "hello world9";
# some code that manipulates $y
$y =~ /(\w*)\s+(\w*)/;
my $z = &bla($2);
my $w = $1;
print "$1 $2\n$z\n";
we get the following output:
hello
hello world9
world9 9
showing that the $1 is limited to the dynamic scope (ie the $1 assigned within bla ceases to exist at the end of that function (but the $1 assigned from the $y regex is accessible within bla until it is overwritten))
The variables will be valid until the next time they are written to in the flow of execution.
But really, you should be using something like:
my ($match1, match2) = $var =~ /(\d+)\D(\d+)/;
Then use $match1 and $match2 instead of $1 and $2, it's much less ambiguous.