What is the scope of $1 through $9 in Perl? - regex

What is the scope of $1 through $9 in Perl? For instance, in this code:
sub bla {
my $x = shift;
$x =~ s/(\d*)/$1 $1/;
return $x;
}
my $y;
# some code that manipulates $y
$y =~ /(\w*)\s+(\w*)/;
my $z = &bla($2);
my $w = $1;
print "$1 $2\n";
What will $1 be? Will it be the first \w* from $x or the first \d* from the second \w* in $x?

from perldoc perlre
The numbered match variables ($1, $2, $3, etc.) and the related punctuation set ($+ , $& , $` , $' , and $^N ) are all dynamically scoped until the end of the enclosing block or until the next successful match, whichever comes first. (See ""Compound Statements"" in perlsyn.)
This means that the first time you run a regex or substitution in a scope a new localized copy is created. The original value is restored (à la local) when the scope ends. So, $1 will be 10 up until the regex is run, 20 after the regex, and 10 again when the subroutine is finished.
But I don't use regex variables outside of substitutions. I find much clearer to say things like
#!/usr/bin/perl
use strict;
use warnings;
sub bla {
my $x = shift;
$x =~ s/(\d*)/$1 $1/;
return $x;
}
my $y = "10 20";
my ($first, $second) = $y =~ /(\w*)\s+(\w*)/;
my $z = &bla($second);
my $w = $first;
print "$first $second\n";
where $first and $second have better names that describe their contents.

By making a couple of small alterations to your example code:
sub bla {
my $x = shift;
print "$1\n";
$x =~ s/(\d+)/$1 $1/;
return $x;
}
my $y = "hello world9";
# some code that manipulates $y
$y =~ /(\w*)\s+(\w*)/;
my $z = &bla($2);
my $w = $1;
print "$1 $2\n$z\n";
we get the following output:
hello
hello world9
world9 9
showing that the $1 is limited to the dynamic scope (ie the $1 assigned within bla ceases to exist at the end of that function (but the $1 assigned from the $y regex is accessible within bla until it is overwritten))

The variables will be valid until the next time they are written to in the flow of execution.
But really, you should be using something like:
my ($match1, match2) = $var =~ /(\d+)\D(\d+)/;
Then use $match1 and $match2 instead of $1 and $2, it's much less ambiguous.

Related

Perl string replace not working with $1 and $2

Search and replace is not working when I use $1 and $2 defined earlier.
It works when I store it in a new variable.
Does not work as intended.
perl -e'
my $name = "start middle end";
my $rep = "";
my $orig = "";
if ($name =~ /sta(.*?)\s\w+\s(.*)/) {
$orig = $1;
$rep = $2;
$name =~ s/$1/$2/;
print "$name\n";
}
'
sta middle end
Is it because $1 and $2 are getting replaced in the new $name =~ I am doing?
Works as intended.
perl -e'
my $name = "start middle end";
my $rep = "";
my $orig = "";
if ($name =~ /sta(.*?)\s\w+\s(.*)/) {
$orig = $1;
$rep = $2;
$name =~ s/$orig/${rep}/;
print "$name\n";
}
'
staend middle end
Is there a better one liner to do this? I do not want to define new variables.
The capture variables are reset by running the match in the first part of the s/// operator, for the replacement to use. The m// operator in list context will return the captured values so you can easily assign them there. Also you may want to use \Q (quotemeta) if your search string is not a regex.
perl -e'
my $name = "start middle end";
if (my ($orig, $rep) = $name =~ /sta(.*?)\s\w+\s(.*)/) {
$name =~ s/\Q$orig/$rep/;
print "$name\n";
}
'
sta middle end
Yes, the new successful regex match replaces $1 and $2.
You could avoid the global vars entirely as follows:
perl -e'
my $name = "start middle end";
if ( my ($orig, $rep) = $name =~ /sta(.*?)\s\w+\s(.*)/ ) {
$name =~ s/\Q$orig/$rep/;
CORE::say $name;
}
'
Better yet, you could avoid doing two matches as follows:
perl -e'
my $name = "start middle end";
if ( $name =~ s/sta\K.*?(?=\s\w+\s(.*))/$1/ ) {
CORE::say $name;
}
'
However, I'd use the following:
perl -e'
my $name = "start middle end";
if ( (my ($prefix, $suffix, $foo) = $name =~ /^(.*?sta).*?(\s\w+\s(.*))/ ) {
CORE::say "$prefix$foo$suffix";
}
'
Note that your code suffered from a code injection bug which I fixed using quotemeta (as \Q).
Here, just in case, we'd have had unexpected extra spaces, we could also try this expression:
(sta)([a-z]*)\s+(\w+)\s+(.+)
It's just another option.
TEST
perl -e'
my $name = "start middle end";
$name =~ s/(sta)([a-z]*)\s+(\w+)\s+(.+)/$1$4 $3 $4/;
print "$name\n";
'
OUTPUT
staend middle end
Please see the demo here
$2 in the replacement part refers to the capture group from the pattern part of the same substitution. Therefore, you only need one variable to remember $2.
perl -lwe '$_ = "start middle end" ; if (/sta(.*?)\s\w+\s(.*)/) {my $rep = $2; s/$1/$rep/; print}'
staend middle end
You can avoid other variables by using the last match start and end global arrays #- and #+ and just doing a substring replace:
my $name = "start middle end";
if ($name =~ /sta(.*?)\s\w+\s(.*)/) {
substr($name, $-[1], $+[1]-$-[1], $2);
print "$name\n";
}
See the entry for #- in perldoc perlvar
The regex capture variables exhibit strange behavior depending on code
flow, function calls and other stuff.
To fully explain and wrap a head around this requires a few pages
of explanation.
As for now, avoid the whole mess and just use a single regex
perl -e'
my $name = "start middle end";
$name =~ s/^(sta)(.*?)(\s\w+\s)(.*)/$1$4$3$4/;
print "$name\n";
'

Counting number of pattern matches in Perl

I am VERY new to perl, and to programming in general.
I have been searching for the past couple of days on how to count the number of pattern matches; I have had a hard time understanding others solutions and applying them to the code I have already written.
Basically, I have a sequence and I need to find all the patterns that match [TC]C[CT]GGAAGC
I believe I have that part down. but I am stuck on counting the number of occurrences of each pattern match. Does anyone know how to edit the code I already have to do this? Any advice is welcomed. Thanks!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# open fasta file for reading
unless( open( FASTA, "<", '/scratch/Drosophila/dmel-all-chromosome- r6.02.fasta' )) {
die "Can't open dmel-all-chromosome-r6.02.fasta for reading:", $!;
}
#split the fasta record
local $/ = ">";
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
}
}
}
Update, I have added in
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
In the code below. However, it seems to be outputting 0 in front of each pattern match, instead of outputting the total sum of all pattern matches.
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g;
print scalar #matches;
}
}
}
Edit: I need the output to list ever pattern match found. I also need it to find the total number of matches found. For example:
CCTGGAAGC
TCTGGAAGC
TCCGGAAGC
3 matches found
counting the number of occurrences of each pattern match
my #matches = $string =~ /pattern/g
#matches array will contain all the matched parts. You can then do below to get the count.
print scalar #matches
Or you could directly write
my $matches = () = $string =~ /pattern/
I would suggest you to use the former as you might need to check "what was matched" in future (perhaps for debugging?).
Example 1:
use strict;
use warnings;
my $string = 'John Doe John Done';
my $matches = () = $string =~ /John/g;
print $matches; #prints 2
Example 2:
use strict;
use warnings;
my $string = 'John Doe John Done';
my #matches = $string =~ /John/g;
print "#matches"; #prints John John
print scalar #matches; #prints 2
Edit:
while ( my #matches = $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
print "Count of matches:". scalar #matches;
}
As you have written the code, you have to count the matches yourself:
local $/ = ">";
my $count = 0;
#scan through fasta file
while (<FASTA>) {
chomp;
if ( $_ =~ /^(.*?)$(.*)$/ms) {
my $header = $1;
my $seq = $2;
$seq =~ s/\R//g; # \R removes line breaks
while ( $seq =~ /([TC]C[CT]GGAAGC)/g) {
print $1, "\n";
$count = $count +1;
}
}
}
print "Fount $count matches\n";
should do the job.
HTH Georg
my #count = ($seq =~ /([TC]C[CT]GGAAGC)/g);
print scalar #count ;

Use of uninitialised value in concatenation (.) or string at test.pl line 29

I wrote a Perl script that basically does this: It opens 2 text files that are very similar.
Actually one is the older version of the other. The older version has footnotes and marginal notes that I want to transfer to the newer version of the text.
To do so I make a regex-search for the footnote or marginal note and I try to get the word before the note to use it as a search word, so that I can find the correct position for the notes the new version of the text. Finding a searchword, that I can use is not in every case possible.
So in case my search word search fails, I still need to transfer the note to the newer file (adding an errortoken = &xQUADER;).
The following is more or less only the problem part of the script. It fails with the warning message in the title of this question at line 29 (marked: FEHLERzeile).
LABEL:
while ($oldscovers =~ /<(f|m)n id="[bkvl0-9]+"\/>/) {
$oldscovers =~ s/( [a-zA-Z0-9äöüÄÖÜßï<>()"\-]+[.:,;!?]["]|
[a-zA-Z0-9äöüÄÖÜßï<>()"\-]+ [".:,;!?]| [a-zA-Z0-9äöüÄÖÜßï<>()"\-]+|
[a-zA-Z0-9äöüÄÖÜßï<>(),"\-]+)(<(f|m)n( id="[bkvl0-9]+"\/>))/$1/;
my $searchword = $1;
my $transfn = $2;
#
unless (defined $searchword) {
$oldscovers =~ s/(<(f|m)n( id="[bkvl0-9]+"\/>))//;
my $transfnfailsearch = $&;
#
$newelbchap =~ s/(<verse num="${oldscoversnr}">[^ÿ]+<\/verse>)/$1/;
my $newelbvers = $1;
$newelbvers =~ s/<verse num="${oldscoversnr}">\n//;
my $cuttag = $&;
#
$newelbvers =~ s/<s>/<s>\&xQUADER\;${transfnfailsearch}/;
#
$newelbvers =~ s/(.+)/${cuttag}$1/; # hier wird $cuttag wieder eingefügt
$newelbchap =~ s/<verse num="${oldscoversnr}">[^ÿ]+<\/verse>/${newelbvers}/;
#
next LABEL;
}
#
#
$newelbchap =~ s/(<verse num="${oldscoversnr}">[^ÿ]+<\/verse>)/$1/;
my $newelbvers = $1;
$newelbvers =~ s/<verse num="${oldscoversnr}">\n//;
my $cuttag = $&;
if ($newelbvers =~ /${searchword}([ .?!:;,])/) {
$newelbvers =~ s/(${searchword})([ .?!:;,])/${searchword}${transfn}$2/; ## FEHLERzeile
}
elsif ($newelbvers =~ /${searchword}/) {
$newelbvers =~ s/${searchword}/${searchword}${transfn}/;
}
else {
$newelbvers =~ s/<s>/<s>\&QUADER\;${transfn}/;
}
$newelbvers =~ s/(.+)/${cuttag}$1/; # hier wird $cuttag wieder eingefügt
$newelbchap =~ s/<verse num="${oldscoversnr}">[^ÿ]+<\/verse>/${newelbvers}/;
}
I think I get this message because the script is not restarting at LABEL, called at the end of the unless block.
Based on the latest comments, I understand what went wrong. This line:
while ($oldscovers =~ /<(f|m)n id="[bkvl0-9]+"\/>/) {
executes a regular expression with a single set of capturing parentheses. When it succeeds, $1 is set to the f or m that was matched, $2, $3, $4, etc. are set to undef, and the loop body is entered.
The first thing in the loop body is the big s/// which contains 2 sets of capturing parentheses. When it succeeds, it sets $1 and $2 to the captured strings and sets $3, $4, etc. to undef.
When the big s/// fails to match, it leaves all of those capture variables as they were before. The f or m is still in $1 and the undef is still in $2. Since you didn't test for success or failure of the big s/// the f or m from $1 goes into $searchword and the undef from $2 goes into $transfn.
What you should do to fix this is think of how you want the script to behave when the big s/// doesn't find a match, and write the code to do that if the s/// yields a false value.
For example you might choose one of these:
$oldscovers =~ s/.../.../ or next;
$oldscovers =~ s/.../.../ or last;
$oldscovers =~ s/.../.../ or die "something bad happened with this string: $oldscovers";
or even this:
my $searchword;
my $transfn;
if($oldscovers =~ s/.../.../) {
$searchword = $1;
$transfn = $2;
}
which would make sure that $searchword and $transfn are undef if there was no match. And here's another way of doing that:
my ($searchword, $transfn) = $oldscovers =~ s/.../.../ ? ($1,$2) : ();

Pattern binding operator on assignment

I am working into uncommented perl code. I came across a passage, that looks too perl-ish to me as a perl beginner. This is a simplified adaption:
my $foo;
my $bar = "x|y|z|";
$bar =~ s{\|$}{};
($foo = $bar) =~ s{ }{}gs;
I understand that $bar =~ s{\|$}{} applies the regular expression on the right to the string inside $bar.
But what does the expression ($foo = $bar) =~ s{ }{}gs; mean? I am not asking about the regular expression but on the expression it is apllied to.
Just follow the precedence that the parentheses dictate and solve each statement one at the time:
($a = $b) =~ s{ }{}gs;
#^^^^^^^^--- executed first
($a = $b) # set $a to the value contained in $b
$a =~ s{ }{}gs; # perform the regex on $a
The /g global modifier causes the regex to match as many times as possible, the /s modifier makes the wildcard . match newline as well (so it now really matches everything). The /s modifier is redundant for this regex, since there are no wildcards . in it.
Note that $a and $b are predeclared variables which are used by sort, and you should avoid using them.
When in doubt, you can always print the variables and see how they change. For example:
use Data::Dumper;
my $x = 'foo bar';
(my $y = $x) =~ s{ }{}gs;
print Dumper $x, $y;
Output:
$VAR1 = 'foo bar';
$VAR2 = 'foobar';
A scalar assignment in scalar context returns its left-hand-side operand (as shown here). That means
$a = $b
assigns the value of $b to $a and returns $a. That means
($a = $b) =~ s{ }{}gs;
is short for
$a = $b; $a =~ s{ }{}gs;
and long for
$a = $b =~ s{ }{}gsr; # Requires 5.14+
But what does the expression ($a = $b) =~ s{ }{}gs; mean?
It is same as
$a = $b;
$a =~ s{ }{}gs;
s{ }{}gs is substitution s/ //gs regex with {} as delimiters

How to replace a variable with another variable in PERL?

I am trying to replace all words from a text except some that I have in an array. Here's my code:
my $text = "This is a text!And that's some-more text,text!";
while ($text =~ m/([\w']+)/g) {
next if $1 ~~ #ignore_words;
my $search = $1;
my $replace = uc $search;
$text =~ s/$search/$replace/e;
}
However, the program doesn't work. Basically I am trying to make all words uppercase but skip the ones in #ignore_words. I know it's a problem with the variables being used in the regular expression, but I can't figure the problem out.
#!/usr/bin/perl
my $text = "This is a text!And that's some-more text,text!";
my #ignorearr=qw(is some);
my %h1=map{$_ => 1}#ignorearr;
$text=~s/([\w']+)/($h1{$1})?$1:uc($1)/ge;
print $text;
On running this,
THIS is A TEXT!AND THAT'S some-MORE TEXT,TEXT!
You can figure the problem out of your code if instead of applying an expression to the same control variable of a while loop, just let s/../../eg do it globally for you:
my $text = "This is a text!And that's some-more text,text!";
my #ignore_words = qw{ is more };
$text =~ s/([\w']+)/$1 ~~ #ignore_words ? $1 : uc($1)/eg;
print $text;
And on running:
THIS is A TEXT!AND THAT'S SOME-more TEXT,TEXT!