Is it possible to store all matches for a regular expression into an array?
I know I can use ($1,...,$n) = m/expr/g;, but it seems as though that can only be used if you know the number of matches you are looking for. I have tried my @array = m/expr/g;, but that doesn't seem to work.
If you're doing a global match (/g) then the regex in list context will return all of the captured matches. Simply do:
my @matches = ( $str =~ /pa(tt)ern/g );
This command for example:
perl -le '@m = ( "foo12gfd2bgbg654" =~ /(\d+)/g ); print for @m'
Gives the output:
12
2
654
Sometimes you need to get all matches globally, like PHP's preg_match_all does. If that's your case, then you can write something like:
# a dummy example
my $subject = 'Philip Fry Bender Rodriguez Turanga Leela';
my @matches;
push @matches, [$1, $2] while $subject =~ /(\w+) (\w+)/g;
use Data::Dumper;
print Dumper(\@matches);
It prints
$VAR1 = [
[
'Philip',
'Fry'
],
[
'Bender',
'Rodriguez'
],
[
'Turanga',
'Leela'
]
];
See the manual entry for perldoc perlop under "Matching in List Context":
If the /g option is not used, m// in list context returns a list consisting of the
subexpressions matched by the parentheses in the pattern, i.e., ($1 , $2 , $3 ...)
The /g modifier specifies global pattern matching--that is, matching as many times as
possible within the string. How it behaves depends on the context. In list context, it
returns a list of the substrings matched by any capturing parentheses in the regular expression. If there are no parentheses, it returns a list of all the matched strings, as if there were parentheses around the whole pattern.
You can simply grab all the matches by assigning to an array, or otherwise performing the evaluation in list context:
my @matches = ($string =~ m/word/g);
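The three cases described in the perlop excerpt can be sketched side by side. This is only an illustration; the sample string and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $str = 'a1 b2 c3';   # hypothetical sample input

# No /g: a single match, returning that match's capture groups.
my @first = $str =~ /([a-z])(\d)/;

# /g with capture groups: the captures of every match, flattened into one list.
my @all = $str =~ /([a-z])(\d)/g;

# /g without capture groups: each whole match, as if the pattern were parenthesised.
my @whole = $str =~ /[a-z]\d/g;

print "first: @first\n";   # first: a 1
print "all:   @all\n";     # all:   a 1 b 2 c 3
print "whole: @whole\n";   # whole: a1 b2 c3
```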
I think this is a self-explanatory example. Note the /g modifier in the first regex:
$string = "one two three four";
@res = $string =~ m/(\w+)/g;
print Dumper(@res); # @res = ("one", "two", "three", "four")
@res = $string =~ m/(\w+) (\w+)/;
print Dumper(@res); # @res = ("one", "two")
Remember, you need to make sure the lvalue is in list context, which means you have to surround scalar variables with parentheses:
($one, $two) = $string =~ m/(\w+) (\w+)/;
Is it possible to store all matches for a regular expression into an array?
Yes, in Perl 5.25.7 (first stable release: 5.26), the variable @{^CAPTURE} was added, which holds "the contents of the capture buffers, if any, of the last successful pattern match". This means it contains ($1, $2, ...) even if the number of capture groups is unknown.
Before Perl 5.25.7 (since 5.6.0) you could build the same array using @- and @+ as suggested by @Jaques in his answer. You would have to do something like this:
my @capture = ();
for (my $i = 1; $i < @+; $i++) {
push @capture, substr $subject, $-[$i], $+[$i] - $-[$i];
}
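For reference, a self-contained version of that loop might look like the sketch below. The date string and the pattern are assumptions made up for the example, and it assumes every group takes part in the match (an unmatched optional group would leave undefined offsets).

```perl
use strict;
use warnings;

my $subject = '2024-05-17';   # hypothetical input
my @capture;

if ($subject =~ /(\d+)-(\d+)-(\d+)/) {
    # $#+ is the index of the last group in the pattern; index 0 is the whole match.
    for my $i (1 .. $#+) {
        push @capture, substr($subject, $-[$i], $+[$i] - $-[$i]);
    }
}

print "@capture\n";   # 2024 05 17
```

On Perl 5.26 or later, copying @{^CAPTURE} right after the match should give the same list without the loop.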
I am surprised this is not already mentioned here, but the Perl documentation provides the standard variable @+ (together with its counterpart @-, which holds the start offsets). To quote from the documentation:
This array holds the offsets of the ends of the last successful submatches in the currently active dynamic scope.
So, to get the value caught in first capture, one would write:
print substr( $str, $-[1], $+[1] - $-[1] ), "\n"; # equivalent to $1
As a side note, there is also the standard variable %- which is very nifty, because it not only contains named captures, but also allows for duplicate names to be stored in an array.
Using the example provided in the documentation:
/(?<A>1)(?<B>2)(?<A>3)(?<B>4)/
would yield a hash with entries such as:
$-{A}[0] : '1'
$-{A}[1] : '3'
$-{B}[0] : '2'
$-{B}[1] : '4'
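A runnable sketch of the same pattern, assuming Perl 5.10 or later (where named captures, %+ and %- exist). The input string '1234' and the variable names are made up for the example.

```perl
use strict;
use warnings;

my %all;       # every capture per name, copied out of %-
my $first_a;   # %+ only exposes the leftmost defined capture for a name

if ('1234' =~ /(?<A>1)(?<B>2)(?<A>3)(?<B>4)/) {
    # Each value in %- is an array reference; copy it before the next match.
    $all{$_} = [ @{ $-{$_} } ] for keys %-;
    $first_a = $+{A};
}

print "A: @{ $all{A} }\n";     # A: 1 3
print "B: @{ $all{B} }\n";     # B: 2 4
print "first A: $first_a\n";   # first A: 1
```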
Note that if you know the number of capturing groups you need per match, you can use this simple approach, which I present as an example (of 2 capturing groups.)
Suppose you have some 'data' like
my $mess = <<'IS_YOURS';
Richard Rich
April May
Harmony Ha\rm
Winter Win
Faith Hope
William Will
Aurora Dawn
Joy
IS_YOURS
With the following regex
my $oven = qr'^(\w+)\h+(\w+)$'ma; # skip the /a modifier if using perl < 5.14
I can capture all 12 (6 pairs, not 8...Harmony escaped and Joy is missing) in the @box below.
my @box = $mess =~ m[$oven]g;
If I want to "hash out" the details of the box I could just do:
my %hash = @box;
Or I could have just skipped the box entirely,
my %hash = $mess =~ m[$oven]g;
Note that %hash contains the following. Order is lost and dupe keys (if any had existed) are squashed:
(
'April' => 'May',
'Richard' => 'Rich',
'Winter' => 'Win',
'William' => 'Will',
'Faith' => 'Hope',
'Aurora' => 'Dawn'
);
I have a string of the following format:
word1.word2.word3
What are the ways to extract word2 from that string in perl?
I tried the following expression but it assigns 1 to sub:
@perleval $vars{sub} = $vars{string} =~ /.(.*)./; 0@
EDIT:
I have tried several suggestions, but still get the value of 1. I suspect that the entire expression above has a problem in addition to parsing. However, when I do simple assignment, I get the correct result:
@perleval $vars{sub} = $vars{string} ; 0@
assigns word1.word2.word3 to variable sub
. has a special meaning in regular expressions, so it needs to be escaped.
.* could match more than intended. [^.]* is safer.
The match operator (//) simply returns true/false in scalar context.
You can use any of the following:
$vars{sub} = $vars{string} =~ /\.([^.]*)\./ ? $1 : undef;
$vars{sub} = ( $vars{string} =~ /\.([^.]*)\./ )[0];
( $vars{sub} ) = $vars{string} =~ /\.([^.]*)\./;
The first one allows you to provide a default if there's no match.
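To see the difference between the contexts, here is a small sketch using the string from the question. The variable names $flag, $sub and $or_default, and the 'n/a' fallback, are made up for the illustration.

```perl
use strict;
use warnings;

my %vars = ( string => 'word1.word2.word3' );

# Scalar context: the match returns true/false, not the capture.
my $flag = $vars{string} =~ /\.([^.]*)\./;

# List context (note the parentheses on the left): the capture is returned.
my ($sub) = $vars{string} =~ /\.([^.]*)\./;

# Ternary form with an explicit fallback when nothing matches.
my $or_default = 'nodots' =~ /\.([^.]*)\./ ? $1 : 'n/a';

print "$flag $sub $or_default\n";   # 1 word2 n/a
```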
Try:
/\.([^\.]+)\./
. has a special meaning and would need to be escaped. Then you would want to capture the values between the dots, so use a negated character class like ([^\.]+), meaning at least one non-dot. If you use (.*) instead, you will get:
word1.stuff1.stuff2.stuff3.word2 to result in:
stuff1.stuff2.stuff3
But maybe you want that?
Here is my little example, I do find the perl one liners a little harder to read at times so I break it out:
use strict;
use warnings;
if ("stuff1.stuff2.stuff3" =~ m/\.([^.]+)\./) {
my $value = $1;
print $value;
}
else {
print "no match";
}
result
stuff2
. has a special meaning: any character (see the expression between your parentheses)
Therefore you have to escape it (\.) if you search a literal dot:
/\.(.*)\./
You've got to make sure you're asking for a list when you do the search.
my $x= $string =~ /look for (pattern)/ ;
sets $x to 1
my ($x)= $string =~ /look for (pattern)/ ;
sets $x to pattern.
How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my @Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push @Qmail, $msg->{Body};
}
}
for(@Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.
One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";
Use the following Perl variables to meet your requirements:
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = The string matched by the last pattern match.
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $MATCH variable (the readable name provided by use English; the built-in short form is $&). This is covered in the manual page perlvar.
So say you wanted to store your last for loop's matches in an array called @matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
use English; # makes $MATCH available as an alias for $&
my @matches = ();
foreach (@Qmail) {
next unless /$regex|^\s*description/i;
push @matches, $MATCH;
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regex to the string ^\s*owner #.
Assign $sentence the result of running a match against $regex with the regular expression /^s*owner $/ (which won't match; if it did, $sentence would be 1, but since it didn't, it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.
I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge.
You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see $hello contains "hello".
If you have an older perl on your system like me, perl 5.18 or earlier, and you use $`, $& and $' like codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.
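A minimal sketch of the /p approach, assuming Perl 5.10 or later (where /p and these variables are available). The sample string mirrors the abc:def:ghi example above; the variable names are made up.

```perl
use strict;
use warnings;

my $text = 'abcdefghi';
my ($pre, $match, $post);

if ($text =~ /def/p) {
    # With /p these are populated without touching $`, $& and $'.
    ($pre, $match, $post) = (${^PREMATCH}, ${^MATCH}, ${^POSTMATCH});
}

print "$pre:$match:$post\n";   # abc:def:ghi
```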
I have a string like below
atom:link[@me="samiron" and @test1="t1" and @test2="t2"]
and I need a regular expression which will generate the following back references
#I would prefer to have
$1 = @test1
$2 = t1
$3 = @test2
$4 = t2
#Or at least. I will break these up in parts later on.
$1 = @test1="t1"
$2 = @test2="t2"
I've tried something like ( and [@\w]+=["\w]+)*\] which returns only last match and @test2="t2". Completely out of ideas. Any help?
Edit:
actually the number of @test1="t1" pattern is not fixed. And the regex must fit the situation. Thnx @Pietzcker.
This will give you a hash which maps "@test1" => "t1" and so on:
my %matches = ($str =~ /and (\@\w+)="(\w+)"/g);
Explanation: the /g global match will give you a list of matches like
"@test1", "t1", "@test2", "t2", ...
When this list is assigned to the hash %matches, Perl automatically treats it as key-value pairs.
As a result, %matches will contain what you are looking for in a convenient hash format.
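Put together as a runnable sketch with the string from the question (single-quoted so the @ signs are not interpolated; the sorted printing is just for a stable display and is not part of the technique):

```perl
use strict;
use warnings;

my $str = 'atom:link[@me="samiron" and @test1="t1" and @test2="t2"]';

# Each match contributes a (name, value) pair; the flat list becomes the hash.
my %matches = $str =~ /and (\@\w+)="(\w+)"/g;

print "$_ => $matches{$_}\n" for sort keys %matches;
# @test1 => t1
# @test2 => t2
```

Note that @me is skipped here only because it is not preceded by "and " in the input.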
You can do it like this:
use Data::Dumper;
my $text = 'atom:link[@me="samiron" and @test1="t1" and @test2="t2"]';
my @results;
while ($text =~ m/and (@\w+)="(\w+)"/g) {
push @results, $1, $2;
}
print Dumper \@results;
Result:
$VAR1 = [
'@test1',
't1',
'@test2',
't2'
];
When you use a repeating capturing group, each new match will overwrite any previous match.
So you can only do a "find all" with a regex like
@result = $subject =~ m/(?<= and )([@\w]+)=(["\w]+)(?= and |\])/g;
to get an array of all matches.
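The overwriting behaviour is easy to see on a smaller input. This is just an illustrative sketch; the string 'a1b2c3' and the variable names are made up.

```perl
use strict;
use warnings;

my $s = 'a1b2c3';   # hypothetical input

# One match: the group repeats three times but keeps only its last iteration.
my ($last) = $s =~ /^([a-z]\d)+$/;

# Global match in list context: one capture per match.
my @each = $s =~ /([a-z]\d)/g;

print "repeated group: $last\n";   # repeated group: c3
print "global match:   @each\n";   # global match:   a1 b2 c3
```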
This works for me:
@result = $s =~ /(@(?!me).*?)="(.*?)"/g;
foreach (@result){
print "$_\n";
}
The output is:
@test1
t1
@test2
t2
What does $1 mean in Perl? Further, what does $2 mean?
How many $number variables are there?
The $number variables contain the parts of the string that matched the capture groups ( ... ) in the pattern for your last regex match if the match was successful.
For example, take the following string:
$text = "the quick brown fox jumps over the lazy dog.";
After the statement
$text =~ m/ (b.+?) /;
$1 equals the text "brown".
The number variables are the matches from the last successful match or substitution operator you applied:
my $string = 'abcdefghi';
if ($string =~ /(abc)def(ghi)/) {
print "I found $1 and $2\n";
}
Always test that the match or substitution was successful before using $1 and so on. Otherwise, you might pick up the leftovers from another operation.
Perl regular expressions are documented in perlre.
$1, $2, etc will contain the value of captures from the last successful match - it's important to check whether the match succeeded before accessing them, i.e.
if ( $var =~ m/( )/ ) {
# use $1 etc...
}
An example of the problem - $1 contains 'Quick' in both print statements below:
#!/usr/bin/perl
'Quick brown fox' =~ m{ ( quick ) }ix;
print "Found: $1\n";
'Lazy dog' =~ m{ ( quick ) }ix;
print "Found: $1\n";
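For comparison, a guarded version of the same two matches might look like this sketch. The @report array and the message wording are made up for the example.

```perl
use strict;
use warnings;

my @report;

for my $input ('Quick brown fox', 'Lazy dog') {
    # Only read $1 inside the branch where this particular match succeeded.
    if ($input =~ m{ ( quick ) }ix) {
        push @report, "Found: $1";
    }
    else {
        push @report, "No match in '$input'";
    }
}

print "$_\n" for @report;
# Found: Quick
# No match in 'Lazy dog'
```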
As others have pointed out, the $x are capture variables for regular expressions, allowing you to reference sections of a matched pattern.
Perl also supports named captures which might be easier for humans to remember in some cases.
Given input: 111 222
/(\d+)\s+(\d+)/
$1 is 111
$2 is 222
One could also say:
/(?<myvara>\d+)\s+(?<myvarb>\d+)/
$+{myvara} is 111
$+{myvarb} is 222
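As a runnable sketch (assuming Perl 5.10 or later, where named captures were introduced), using the same input. The %field hash is a made-up name for the copy.

```perl
use strict;
use warnings;

my %field;

if ('111 222' =~ /(?<myvara>\d+)\s+(?<myvarb>\d+)/) {
    # Copy right away: %+ reflects only the most recent successful match.
    %field = %+;
}

print "$field{myvara} $field{myvarb}\n";   # 111 222
```

The numbered variables are still set as well, so $1 and $2 hold the same two values after this match.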
These are called "match variables". As previously mentioned they contain the text from your last regular expression match.
More information is in Essential Perl. (Ctrl + F for 'Match Variables' to find the corresponding section.)
Since you asked about the capture groups, you might want to know about $+ too...
Pretty useful...
use Data::Dumper;
$text = "hiabc ihabc ads byexx eybxx";
while ($text =~ /(hi|ih)abc|(bye|eyb)xx/igs)
{
print Dumper $+;
}
OUTPUT:
$VAR1 = 'hi';
$VAR1 = 'ih';
$VAR1 = 'bye';
$VAR1 = 'eyb';
The variables $1 .. $9 are also read-only variables, so you can't assign a value to them directly:
$1 = 'foo'; print $1;
That will return an error: Modification of a read-only value attempted at script line 1.
You also can't use numbers for the beginning of variable names:
$1foo = 'foo'; print $1foo;
The above will also return an error.
I would suspect that there can be as many as 2**32 -1 numbered match variables, on a 32-bit compiled Perl binary.
I thought this would have done it...
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
$rowfetch = /[-]/gi;
printline($rowfetch);
But it seems that I'm missing a small yet critical piece of the regex syntax.
$rowfetch is always something along the lines of:
------S
-M-W---
--T-TF-
etc... to represent the days of the week a meeting happens
$rowfetch =~ s/-//gi
That's what you need for your second line there. You're just finding stuff, not actually changing it without the "s" prefix.
You also need to use the regex operator "=~" for this.
Here is what your code presently does:
# Assign to $rowfetch the value returned by
# the method 'GetCharValue', called on
# a value in a hash identified by the key "Row" in
# either a hash-ref or a blessed hash-ref,
# where 'GetCharValue' is given the parameter "meetdays"
$rowfetch = $DBS->{Row}->GetCharValue("meetdays");
# Overwrite $rowfetch with a true/false value: whether
# the default variable ( $_ ) matched the expression /[-]/
$rowfetch = /[-]/gi;
# Print that true/false value.
printline($rowfetch);
Which is equivalent to having written the following code:
$rowfetch = ( $_ =~ /[-]/ );
printline( $rowfetch );
The magic you are looking for is the
=~
Token instead of
=
The former is a Regex operator, and the latter is an assignment operator.
There are many different regex operators too:
if( $subject =~ m/expression/ ){
}
Will make the given codeblock execute only if $subject matches the given expression, and
$subject =~ s/foo/bar/gi
Replaces (s///) all instances of "foo" with "bar", case-insensitively (/i), repeating the replacement for every occurrence (/g), in the variable $subject.
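Applied to one of the sample values from the question, the binding operator can be sketched like this. The copy-then-substitute idiom and the names $days, $copy and $removed are choices made for the illustration.

```perl
use strict;
use warnings;

my $rowfetch = '-M-W---';   # sample value from the question

# Copy first so the original survives, then bind s/// to the copy with =~.
(my $days = $rowfetch) =~ s/-//g;

# s/// itself returns the number of substitutions it made.
my $copy    = $rowfetch;
my $removed = ($copy =~ s/-//g);

print "$days $removed\n";   # MW 5
```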
Using the tr operator is faster than using a s/// regex substitution.
$rowfetch =~ tr/-//d;
Benchmark:
use Benchmark qw(cmpthese);
my $s = 'foo-bar-baz-blee-goo-glab-blech';
cmpthese(-5, {
trd => sub { (my $a = $s) =~ tr/-//d },
sub => sub { (my $a = $s) =~ s/-//g },
});
Results on my system:
         Rate  sub  trd
sub  300754/s   -- -79%
trd 1429005/s 375%   --
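For the meetdays strings in the question, the tr form could be used as in the sketch below. The loop and the @days array are made up for the example; only the tr/-//d call comes from the answer.

```perl
use strict;
use warnings;

my @days;

for my $row ('------S', '-M-W---', '--T-TF-') {
    # tr/-//d deletes every hyphen; the copy keeps $row itself untouched.
    (my $letters = $row) =~ tr/-//d;
    push @days, $letters;
}

print "@days\n";   # S MW TTF
```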
Off-topic, but without the hyphens, how will you know whether a "T" is Tuesday or Thursday?