Perl regular expression with "[]" - regex

My data looks like:
NC_004415 NC_010199 ([T(trnH ,trnS1 trnL1 ,)])
NC_006131 NC_010199 ([T(trnH ,trnS1 trnL1 ,)])
NC_006355 NC_007231 ([T(trnM ,trnQ ,)])
I want to capture everything between []:
while( my $line = <crex> )
{ $t=$line=~m/(\[.*\])/;
print $t;
}
The output of $t is 1. Why is it not working?

Since you're using a capturing group, you can just use $1 after the match succeeds:
if($line =~ m/(\[.*\])/) {
print $1;
}

$line =~ m/(\[.*\])/ returns a list of the matches in a list context, but you are using it in a scalar context. In a scalar context, the match operator returns a Boolean that indicates whether the match was successful or not. Therefore you get 1. You can use
my ($t) = $line =~ m/(\[.*\])/;
to create a list context, or you can use $1 instead of using $t.
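A minimal sketch of the difference, using one line of the sample data from the question (the variable names $ok and $t are only illustrative):

```perl
use strict;
use warnings;

my $line = 'NC_006355 NC_007231 ([T(trnM ,trnQ ,)])';

# Scalar context: the match operator only reports success or failure.
my $ok = $line =~ m/(\[.*\])/;

# List context: the parentheses around $t make the match return its captures.
my ($t) = $line =~ m/(\[.*\])/;

print "scalar context: $ok\n";   # scalar context: 1
print "list context:   $t\n";    # list context:   [T(trnM ,trnQ ,)]
```

Note that .* is greedy, so the capture runs up to the last ] on the line.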

Use parentheses around $t:
($t) = $line =~m/(\[.*\])/;
Refer to perldoc perlretut (Extracting matches).
I believe you are using the match operator (m//) in a scalar context and storing the result in $t. Since the match is successful, m// returns 1. Refer to perldoc perlop.

Related

extract string between two dots

I have a string of the following format:
word1.word2.word3
What are the ways to extract word2 from that string in perl?
I tried the following expression but it assigns 1 to sub:
#perleval $vars{sub} = $vars{string} =~ /.(.*)./; 0#
EDIT:
I have tried several suggestions, but still get the value of 1. I suspect that the entire expression above has a problem in addition to parsing. However, when I do simple assignment, I get the correct result:
#perleval $vars{sub} = $vars{string} ; 0#
assigns word1.word2.word3 to variable sub
. has a special meaning in regular expressions, so it needs to be escaped.
.* could match more than intended. [^.]* is safer.
The match operator (//) simply returns true/false in scalar context.
You can use any of the following:
$vars{sub} = $vars{string} =~ /\.([^.]*)\./ ? $1 : undef;
$vars{sub} = ( $vars{string} =~ /\.([^.]*)\./ )[0];
( $vars{sub} ) = $vars{string} =~ /\.([^.]*)\./;
The first one allows you to provide a default if there's no match.
Try:
/\.([^\.]+)\./
. has a special meaning and needs to be escaped. To capture the value between the dots, use a negated character class such as ([^\.]+), meaning at least one non-dot character. If you use (.*) instead, then
word1.stuff1.stuff2.stuff3.word2 results in:
stuff1.stuff2.stuff3
But maybe you want that?
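To see the two behaviours side by side, here is a small sketch run against the same sample string (the variable names are made up for illustration):

```perl
use strict;
use warnings;

my $string = 'word1.stuff1.stuff2.stuff3.word2';

# Greedy .* runs from the first dot to the last dot.
my ($greedy) = $string =~ /\.(.*)\./;

# The negated class stops at the very next dot.
my ($one_word) = $string =~ /\.([^.]+)\./;

print "$greedy\n";     # stuff1.stuff2.stuff3
print "$one_word\n";   # stuff1
```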
Here is a small example. I find Perl one-liners a little harder to read at times, so I have broken it out:
use strict;
use warnings;
if ("stuff1.stuff2.stuff3" =~ m/\.([^.]+)\./) {
my $value = $1;
print $value;
}
else {
print "no match";
}
result
stuff2
. has a special meaning: any character (see the expression between your parentheses)
Therefore you have to escape it (\.) if you search a literal dot:
/\.(.*)\./
You've got to make sure you're asking for a list when you do the search.
my $x= $string =~ /look for (pattern)/ ;
sets $x to 1
my ($x)= $string =~ /look for (pattern)/ ;
sets $x to pattern.

How does =~ behave in matching?

I am confused by the =~ operator. It seems to return a true/false value indicating whether there was a match, but when used with the /g modifier it returns the actual matches.
Example:
$ perl -e '
my $var = "03824531449411615213441829503544272752010217443235";
my @zips = $var =~ /\d{5}/g;
print join "--", @zips;
'
03824--53144--94116--15213--44182--95035--44272--75201--02174--43235
$ perl -e '
my $var = "03824531449411615213441829503544272752010217443235";
my @zips = $var =~ /\d{5}/;
print join "--", @zips;
'
1
$ perl -e '
my $var = "03824531449411615213441829503544272752010217443235";
my $zips = $var =~ /\d{5}/;
print join "--", $zips;
'
1
So how does this work? Why does it return true/false in non-g mode? Or is it something else?
perlop already gives a pretty clear explanation of this, so I will just copy and paste the relevant parts:
For =~ operator:
Binary "=~" binds a scalar expression to a pattern match. ... When used in scalar context, the return value generally indicates the success of the operation. ... Behavior in list context depends on the particular operator. See Regexp Quote-Like Operators for details and perlretut for examples using these operators.
For m// operator:
Searches a string for a pattern match, and in scalar context returns true if it succeeds, false if it fails.
For m// without /g modifier in list context:
If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, that is, ($1, $2, $3 ...). When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.
For m// with /g modifier in list context:
The /g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
In scalar context, each execution of m//g finds the next match, returning true if it matches, and false if there is no further match.
Context of expressions in OP:
@zips = $var =~ /\d{5}/g;
m//g in list context;
@zips = $var =~ /\d{5}/;
m// in list context;
$zips = $var =~ /\d{5}/;
m// in scalar context.
$var =~ /(\d{5})/; also returns the match in list context, because of the capturing parentheses. The only difference is that /g without any () behaves as if there were parentheses around the whole pattern.
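The cases quoted from perlop can be put next to each other in a short sketch. It uses a shortened version of the string from the question; the array names are arbitrary:

```perl
use strict;
use warnings;

my $var = '038245314494116';

my @all   = $var =~ /\d{5}/g;    # list context with /g: every match
my @first = $var =~ /(\d{5})/;   # list context, no /g: captures of one match
my @flag  = $var =~ /\d{5}/;     # list context, no /g, no captures: (1)

# Scalar context with /g: each evaluation resumes after the previous match.
my @seen;
while ($var =~ /(\d{5})/g) {
    push @seen, $1;
}

print join('--', @all), "\n";    # 03824--53144--94116
print "@first\n";                # 03824
print "@flag\n";                 # 1
print join('--', @seen), "\n";   # 03824--53144--94116
```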

Perl, Assign regex match to scalar

There's an example snippet in Mail::POP3Client containing a piece of syntax that I don't understand; I can't see why or how it works:
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
The regex bit in particular. $_ remains the same after that line but only the match is printed.
An additional question; How could I assign the match of that regex to a scalar of my own so I can use that instead of just print it?
This is actually pretty tricky. It makes use of Perl's short-circuiting to form a conditional statement. It is the same as writing:
if (/^(From|Subject):\s+/i) {
print $_;
}
It works because Perl stops evaluating an "and" expression as soon as its left operand is false, so print only runs when the regex matches. Unless otherwise specified, a regex written as /regex/ rather than $somevar =~ /regex/ is applied to the default variable, $_.
You can store it like this:
my $var;
if (/^(From|Subject):\s+/i) {
$var = $_;
}
or you could use a capture group
/^((?:From|Subject):\s+)/i
which will store the matched prefix (for example "From: ") in $1
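If the goal is to keep the header value in your own variables, a list assignment inside the condition is one option. The sketch below uses made-up header lines in place of $pop->Head( $i ), and the %wanted hash is purely illustrative:

```perl
use strict;
use warnings;

# Hypothetical header lines standing in for $pop->Head( $i ).
my @headers = (
    'From: alice@example.com',
    'Subject: Hello',
    'Date: Mon, 1 Jan 2024 00:00:00 +0000',
);

my %wanted;
for (@headers) {
    # A list assignment used as a condition is true only when the match
    # returned something, i.e. when the regex matched $_.
    if (my ($name, $value) = /^(From|Subject):\s+(.*)/i) {
        $wanted{lc $name} = $value;
    }
}

print "$wanted{from}\n";      # alice@example.com
print "$wanted{subject}\n";   # Hello
```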

Regular expression in index function

I am looking for occurrence of "CCGTCAATTC(A|C)TTT(A|G)AGT" in a text file.
$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line=~/$text/){
chomp($line);
$pos=index($line,$text);
}
Searching is working, but I am not able to get the position of "text" in line.
It seems index does not accept a regular expression as the substring.
How can I make this work?
Thanks
The @- array holds the offsets of the starting positions of the last successful match. The first element is the offset of the whole matching pattern, and subsequent elements are offsets of parenthesized subpatterns. So, if you know there was a match, you can get its offset as $-[0].
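As a rough sketch of the offset arrays in use: @- gives the start of the match and @+ the offset just past its end. The sample $line is invented for the example, and the pattern is written with character classes rather than the (A|C) groups from the question:

```perl
use strict;
use warnings;

# Made-up input: four filler bases, the target sequence, four more bases.
my $line = 'GGGGCCGTCAATTCATTTGAGTAAAA';
my $re   = qr/CCGTCAATTC[AC]TTT[AG]AGT/;

my ($start, $match);
if ($line =~ $re) {
    $start = $-[0];                                   # where the match begins
    $match = substr($line, $start, $+[0] - $start);   # $+[0] is just past the end
    print "found '$match' at offset $start\n";        # found 'CCGTCAATTCATTTGAGT' at offset 4
}
```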
You don't need to use index at all, just a regex. The portion of $line that comes before your regex match will be stored in $` (or $PREMATCH if you've chosen to use English;). You can get the index of the match by checking the length of $`, and you can get the match itself from the $& (or $MATCH) variable:
use English;   # provides $PREMATCH, $MATCH and $POSTMATCH
$text = 'CCGTCAATTC(A|C)TTT(A|G)AGT';
if ($line =~ /$text/) {
$pos = length($PREMATCH);
}
Assuming you want to get $pos to continue matching on the remaining part of $line, you can use the $' (or $POSTMATCH) variable to get the portion of $line that comes after the match.
See http://perldoc.perl.org/perlvar.html for detailed information on these special variables.
Based on your comments, it seems like what you are after is matching the 50 characters directly following the match. So, a simple solution would be:
my ($match) = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/;
As you see, [AG] is equivalent to A|G. If you wish to match multiple times, you can use an array @matches, and the /g global option on the regex. E.g.
my @matches = $line =~ /CCGTCAATTC[AC]TTT[AG]AGT(.{50})/g;
You can do this to keep the matching pattern:
my ($pattern, $match) = $line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g;
Or in a loop:
while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g) {
my ($pattern, $match) = ($1, $2);
}
while ($line =~ /(CCGTCAATTC[AC]TTT[AG]AGT)(.{50})/g;) {
I like it, but there must be no ; inside the while condition.
I had a hard time tracking down the cause of the errors. T_T

Why are the parentheses so important when assigning this regex match?

I have a piece of code:
$s = "<sekar kapoor>";
($name) = $s =~ /<([\S\s]*)>/;
print "$name\n"; # Output is 'sekar kapoor'
If the parentheses around $name are removed in the second line, like this, $name ends up as 1:
$name = $s =~ /<([\S\s]*)>/; # $name is now '1'
I don't understand why it behaves like this. Can anyone please explain why it is so?
In your first example you have a list context on the left-hand side (you used parentheses); in the second you have a scalar context - you just have a scalar variable.
See the Perl docs for quote-like ops, Matching in list context:
If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3 ...). (Note that here $1 etc. are also set, and that this differs from Perl 4's behavior.) When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.
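The three outcomes described in that passage can be checked with a short sketch. It reuses the pattern from the question; the strings 'no brackets' and '<x>' are invented for the example:

```perl
use strict;
use warnings;

my ($hit)  = '<sekar kapoor>' =~ /<([\S\s]*)>/;   # one capture returned
my ($miss) = 'no brackets'    =~ /<([\S\s]*)>/;   # empty list, so $miss is undef
my @flag   = '<x>'            =~ /<[\S\s]*>/;     # no capture groups: (1)

print defined $hit  ? "hit: $hit\n"   : "hit: undef\n";    # hit: sekar kapoor
print defined $miss ? "miss: $miss\n" : "miss: undef\n";   # miss: undef
print "flag: @flag\n";                                     # flag: 1
```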