extract first word from a sentence and store it - regex

I am extracting the first word from a line using a regex in Perl:
for my $source_line (@lines) {
$source_line =~ /^(.*?)\s/;
}
But I want to store the first word in a variable.
When I print the match directly, as below, I get the correct output:
print($source_line =~ /^(.*?)\s/);
But when I store it in $i and print it, I get 1 as output:
my $i = ($source_line =~ /^(.*?)\s/);
print $i;
How do I store the first word in a temporary variable?

You need to evaluate the match in list context.
my ($i) = $source_line =~ /^(.*?)\s/;
my ($i) is the same as (my $i), which "looks like a list", so it causes = to be the list assignment operator, and the list assignment operator evaluates its RHS in list context.
By the way, the following version works even if there's only one word, and when there's leading whitespace:
my ($i) = $source_line =~ /(\S+)/;
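To see the difference side by side, here is a minimal sketch. The sample line and the variable names $flag, $first and $word are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample line, with leading whitespace on purpose.
my $source_line = "  hello world";

# Scalar assignment: the match only reports success or failure.
my $flag = $source_line =~ /^(.*?)\s/;

# List assignment: the parentheses around $first force list context,
# so the captured text itself is returned.
my ($first) = $source_line =~ /^(.*?)\s/;

# The \S+ variant skips the leading whitespace.
my ($word) = $source_line =~ /(\S+)/;

print "flag=[$flag] first=[$first] word=[$word]\n";
# prints: flag=[1] first=[] word=[hello]
```

With leading whitespace, the original pattern captures an empty string, because the lazy (.*?) stops at the very first whitespace character.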

It all comes down to context. This expression:
$source_line =~ /^(.*?)\s/
returns a list of the captured substrings when it is evaluated in list context.
When you evaluate the match in scalar context, you get only a true/false result (1 on success) rather than the captures, which is what is happening here.
So changing the left-hand side of the assignment to be in list context:
my ($i) = $source_line =~ /^(.*?)\s/;
captures the word correctly.
There were recently a few articles on Perl Weekly related to context, here is one of them that was particularly good: http://perlhacks.com/2013/12/misunderstanding-context/

Related

extract string between two dots

I have a string of the following format:
word1.word2.word3
What are the ways to extract word2 from that string in Perl?
I tried the following expression but it assigns 1 to sub:
#perleval $vars{sub} = $vars{string} =~ /.(.*)./; 0#
EDIT:
I have tried several suggestions, but still get the value of 1. I suspect that the entire expression above has a problem in addition to parsing. However, when I do simple assignment, I get the correct result:
#perleval $vars{sub} = $vars{string} ; 0#
assigns word1.word2.word3 to variable sub
. has a special meaning in regular expressions, so it needs to be escaped.
.* could match more than intended. [^.]* is safer.
The match operator (//) simply returns true/false in scalar context.
You can use any of the following:
$vars{sub} = $vars{string} =~ /\.([^.]*)\./ ? $1 : undef;
$vars{sub} = ( $vars{string} =~ /\.([^.]*)\./ )[0];
( $vars{sub} ) = $vars{string} =~ /\.([^.]*)\./;
The first one allows you to provide a default if there's no match.
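As a rough illustration of the no-match case, the sketch below compares the ternary form with the list-assignment form. The input string, the 'n/a' fallback and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical input that contains no dots at all.
my %vars = ( string => 'no dots here' );

# Ternary form: the fallback value is chosen explicitly.
my $with_default = $vars{string} =~ /\.([^.]*)\./ ? $1 : 'n/a';

# List-assignment form: a failed match returns an empty list,
# so the variable is left undefined.
my ($plain) = $vars{string} =~ /\.([^.]*)\./;

print "with_default=$with_default\n";                          # with_default=n/a
print "plain is ", ( defined $plain ? "defined" : "undef" ), "\n";   # plain is undef
```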
Try:
/\.([^\.]+)\./
. has a special meaning and needs to be escaped. To capture the value between the dots, use a negated character class like ([^\.]+), meaning at least one non-dot character. If you use (.*) instead, then
word1.stuff1.stuff2.stuff3.word2 will result in:
stuff1.stuff2.stuff3
But maybe you want that?
Here is my little example. I find Perl one-liners a little harder to read at times, so I break it out:
use strict;
use warnings;
if ("stuff1.stuff2.stuff3" =~ m/\.([^.]+)\./) {
my $value = $1;
print $value;
}
else {
print "no match";
}
result
stuff2
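For comparison, this sketch runs the greedy and the negated-class patterns against the longer string from the answer above. The variable names $greedy and $strict are only illustrative.

```perl
use strict;
use warnings;

my $string = 'word1.stuff1.stuff2.stuff3.word2';

# Greedy .* runs on to the last dot in the string.
my ($greedy) = $string =~ /\.(.*)\./;

# [^.]+ stops at the next dot. Inside a character class the dot
# is already literal, so no backslash is needed there.
my ($strict) = $string =~ /\.([^.]+)\./;

print "$greedy\n";   # stuff1.stuff2.stuff3
print "$strict\n";   # stuff1
```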
. has a special meaning: any character (see the expression between your parentheses)
Therefore you have to escape it (\.) if you are searching for a literal dot:
/\.(.*)\./
You've got to make sure you're asking for a list when you do the search.
my $x= $string =~ /look for (pattern)/ ;
sets $x to 1
my ($x)= $string =~ /look for (pattern)/ ;
sets $x to pattern.

Perl regular expression with "[]"

My data looks like:
NC_004415 NC_010199 ([T(trnH ,trnS1 trnL1 ,)])
NC_006131 NC_010199 ([T(trnH ,trnS1 trnL1 ,)])
NC_006355 NC_007231 ([T(trnM ,trnQ ,)])
I want to capture everything between []:
while( my $line = <crex> )
{ $t=$line=~m/(\[.*\])/;
print $t;
}
The output of $t is 1. Why is it not working?
Since you're using a capturing group, you can just use $1 after the match succeeds:
if($line =~ m/(\[.*\])/) {
print $1;
}
$line =~ m/(\[.*\])/ returns a list of the matches in a list context, but you are using it in a scalar context. In a scalar context, the match operator returns a Boolean that indicates whether the match was successful or not. Therefore you get 1. You can use
my ($t) = $line =~ m/(\[.*\])/;
to create a list context, or you can use $1 instead of using $t.
Use parentheses around $t:
($t) = $line =~m/(\[.*\])/;
Refer to perldoc perlretut (Extracting matches).
I believe you are using the match operator (m//) in a scalar context and storing the result in $t. Since the match is successful, m// returns 1. Refer to perldoc perlop.

Perl, Assign regex match to scalar

There's an example snippet in Mail::POP3Client with a piece of syntax that I don't understand; I can't see why or how it works:
foreach ( $pop->Head( $i ) ) {
/^(From|Subject):\s+/i and print $_, "\n";
}
The regex bit in particular. $_ remains the same after that line but only the match is printed.
An additional question: how could I assign the match of that regex to a scalar of my own, so I can use it instead of just printing it?
This is actually pretty tricky. It makes use of Perl's short-circuiting behavior to build a conditional statement. It is the same as saying this:
if (/^(From|Subject):\s+/i) {
print $_, "\n";
}
It works because Perl stops evaluating an "and" expression as soon as its left operand is false. And unless otherwise specified, a regex in the form /regex/ instead of $somevar =~ /regex/ is applied to the default variable, $_.
You can store it like this:
my $var;
if (/^(From|Subject):\s+/i) {
$var = $_;
}
Or you could use a capture group:
/^((?:From|Subject):\s+)/i
which will store the matched prefix (the header name, the colon, and the whitespace after it) in $1.
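If the goal is to keep the header values rather than print them, one possible sketch is to add a second capture group and assign in list context. The @head array here is a hypothetical stand-in for the lines returned by $pop->Head($i), and the %wanted hash is an assumption of this example, not part of Mail::POP3Client.

```perl
use strict;
use warnings;

# Hypothetical header lines standing in for $pop->Head($i).
my @head = (
    'From: alice@example.com',
    'To: bob@example.com',
    'Subject: Hello',
);

my %wanted;
for (@head) {
    # Two captures: the header name and the rest of the line.
    # A failed match returns an empty list, so the condition is false.
    if ( my ( $name, $value ) = /^(From|Subject):\s+(.*)/i ) {
        $wanted{ lc $name } = $value;
    }
}

print "$wanted{from}\n";      # alice@example.com
print "$wanted{subject}\n";   # Hello
```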

Why are the parentheses so important when assigning this regex match?

I have a piece of code:
$s = "<sekar kapoor>";
($name) = $s =~ /<([\S\s]*)>/;
print "$name\n"; # Output is 'sekar kapoor'
If the parentheses around $name are removed in the second line of code, like this:
$name = $s =~ /<([\S\s]*)>/; # $name is now '1'
I don't understand why it behaves like this. Can anyone please explain why it is so?
In your first example you have a list context on the left-hand side (you used parentheses); in the second you have a scalar context - you just have a scalar variable.
See the Perl docs for quote-like ops, Matching in list context:
If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3 ...). (Note that here $1 etc. are also set, and that this differs from Perl 4's behavior.) When there are no parentheses in the pattern, the return value is the list (1) for success. With or without parentheses, an empty list is returned upon failure.
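The three cases in that passage can be checked with a short sketch. The array names are made up for the example, and the third pattern is simply one that does not match the sample string.

```perl
use strict;
use warnings;

my $s = "<sekar kapoor>";

my @with_group    = $s =~ /<([\S\s]*)>/;   # one capture group
my @without_group = $s =~ /<[\S\s]*>/;     # no capture group, match succeeds
my @failed        = $s =~ /\[(.*)\]/;      # pattern does not match

print "with group:    @with_group\n";                     # sekar kapoor
print "without group: @without_group\n";                  # 1
print "failed:        ", scalar(@failed), " elements\n";  # 0 elements
```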

How can I store regex captures in an array in Perl?

Is it possible to store all matches for a regular expression into an array?
I know I can use ($1,...,$n) = m/expr/g;, but it seems as though that can only be used if you know the number of matches you are looking for. I have tried my @array = m/expr/g;, but that doesn't seem to work.
If you're doing a global match (/g) then the regex in list context will return all of the captured matches. Simply do:
my @matches = ( $str =~ /pa(tt)ern/g );
This command for example:
perl -le '@m = ( "foo12gfd2bgbg654" =~ /(\d+)/g ); print for @m'
Gives the output:
12
2
654
Sometimes you need to get all matches globally, like PHP's preg_match_all does. If it's your case, then you can write something like:
# a dummy example
my $subject = 'Philip Fry Bender Rodriguez Turanga Leela';
my @matches;
push @matches, [$1, $2] while $subject =~ /(\w+) (\w+)/g;
use Data::Dumper;
print Dumper(\@matches);
It prints
$VAR1 = [
[
'Philip',
'Fry'
],
[
'Bender',
'Rodriguez'
],
[
'Turanga',
'Leela'
]
];
See the manual entry for perldoc perlop under "Matching in List Context":
If the /g option is not used, m// in list context returns a list consisting of the subexpressions matched by the parentheses in the pattern, i.e., ($1, $2, $3 ...)
The /g modifier specifies global pattern matching--that is, matching as many times as possible within the string. How it behaves depends on the context. In list context, it returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
You can simply grab all the matches by assigning to an array, or otherwise performing the evaluation in list context:
my @matches = ($string =~ m/word/g);
I think this is a self-explanatory example. Note the /g modifier in the first regex:
use Data::Dumper;
$string = "one two three four";
@res = $string =~ m/(\w+)/g;
print Dumper(@res); # @res = ("one", "two", "three", "four")
@res = $string =~ m/(\w+) (\w+)/;
print Dumper(@res); # @res = ("one", "two")
Remember, you need to make sure the lvalue is in list context, which means you have to surround scalar variables with parentheses:
($one, $two) = $string =~ m/(\w+) (\w+)/;
Is it possible to store all matches for a regular expression into an array?
Yes. In Perl 5.25.7, the variable @{^CAPTURE} was added, which holds "the contents of the capture buffers, if any, of the last successful pattern match". This means it contains ($1, $2, ...) even if the number of capture groups is unknown.
Before Perl 5.25.7 (since 5.6.0) you could build the same array using @- and @+ as suggested by @Jaques in his answer. You would have to do something like this:
my @capture = ();
for (my $i = 1; $i < @+; $i++) {
push @capture, substr $subject, $-[$i], $+[$i] - $-[$i];
}
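As a quick sanity check of the offset-based approach, the sketch below rebuilds the captures from @- and @+ and compares them with what the match returns in list context. The date string and pattern are hypothetical, and the loop assumes every group took part in the match (an unmatched optional group would have undefined offsets).

```perl
use strict;
use warnings;

my $subject = '2024-01-15';   # hypothetical input

# Captures as returned directly by the match in list context.
my @direct = $subject =~ /(\d+)-(\d+)-(\d+)/;

# Captures rebuilt from the start (@-) and end (@+) offsets.
my @capture;
if (@direct) {
    # $#+ is the index of the last group of the last successful match.
    for my $i ( 1 .. $#+ ) {
        push @capture, substr( $subject, $-[$i], $+[$i] - $-[$i] );
    }
}

print "@capture\n";   # 2024 01 15
```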
I am surprised this is not already mentioned here, but the Perl documentation provides the standard variables @- and @+. To quote from the documentation for @-:
This array holds the offsets of the beginnings of the last successful submatches in the currently active dynamic scope.
So, to get the value caught in first capture, one would write:
print substr( $str, $-[1], $+[1] - $-[1] ), "\n"; # equivalent to $1
As a side note, there is also the standard variable %- which is very nifty, because it not only contains named captures, but also allows for duplicate names to be stored in an array.
Using the example provided in the documentation:
/(?<A>1)(?<B>2)(?<A>3)(?<B>4)/
would yield a hash with entries such as:
$-{A}[0] : '1'
$-{A}[1] : '3'
$-{B}[0] : '2'
$-{B}[1] : '4'
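A small runnable version of that documentation example might look like the following. The input string '1234' and the %seen hash are assumptions for the sketch; the values are copied out because %- only reflects the last successful match.

```perl
use strict;
use warnings;

my %seen;
if ( '1234' =~ /(?<A>1)(?<B>2)(?<A>3)(?<B>4)/ ) {
    # Each value in %- is an array reference holding every capture
    # made under that group name, in order.
    for my $name ( sort keys %- ) {
        $seen{$name} = [ @{ $-{$name} } ];
    }
}

print "A: @{ $seen{A} }\n";   # A: 1 3
print "B: @{ $seen{B} }\n";   # B: 2 4
```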
Note that if you know the number of capturing groups you need per match, you can use this simple approach, which I present as an example (of 2 capturing groups.)
Suppose you have some 'data' like
my $mess = <<'IS_YOURS';
Richard Rich
April May
Harmony Ha\rm
Winter Win
Faith Hope
William Will
Aurora Dawn
Joy
IS_YOURS
With the following regex
my $oven = qr'^(\w+)\h+(\w+)$'ma; # skip the /a modifier if using perl < 5.14
I can capture all 12 (6 pairs, not 8... Harmony escaped and Joy is missing) in the @box below.
my @box = $mess =~ m[$oven]g;
If I want to "hash out" the details of the box I could just do:
my %hash = @box;
Or I could have just skipped the box entirely:
my %hash = $mess =~ m[$oven]g;
Note that %hash contains the following. Order is lost and dupe keys (if any had existed) are squashed:
(
'April' => 'May',
'Richard' => 'Rich',
'Winter' => 'Win',
'William' => 'Will',
'Faith' => 'Hope',
'Aurora' => 'Dawn'
);