Regex Word Boundary in Perl not yield expected results - regex

So I'm having an issue with pulling data from a string between 2 keywords. I understand that in regex I'm suppose to use the \b boundary tags and I've written the following for a test example, however it seems to only match the whole string instead of just the portion I want.
For example, the string: "here are more string words START OF INFORMATION SECTION some other stuff"
I am gathering text between "START" and "SECTION".
So I'm expecting "START OF INFORMATION SECTION", I believe.
This is the following snippet I have written in Perl specifically, but it doesn't yield the results I expected.
#!/usr/bin/perl
# This is perl 5, version 22, subversion 1 (v5.22.1) built for cygwin-thread-multi
use POSIX;
my $text = "here are more string words START OF INFORMATION SECTION some other stuff";
print "Original String: $text\n";
# this should provide me with the specific text between my two boundary words
$text =~ /\bSTART\b(.*?)\bSECTION\b/;
print "New String: $text\n";

Your code is simply testing whether the regex pattern matches the string, returning a true or false value to indicate whether there was a match. You discard that indicator
If there was a match then the strings captured using parentheses in the regex pattern will be assigned to the capture variables $1, $2 etc.
It's unclear what you need to do, but this program prints everything between START and SECTION: in this case OF INFORMATION
There's no need for use POSIX, but use strict and use warnings 'all' are essential
#!/usr/bin/perl
use strict;
use warnings 'all';
my $text = "here are more string words START OF INFORMATION SECTION some other stuff";
print "Original String: $text\n";
if ( $text =~ /\bSTART\b(.*?)\bSECTION\b/ ) {
my $section = $1;
print "New String: $section\n";
}
output
Original String: here are more string words START OF INFORMATION SECTION some other stuff
New String: OF INFORMATION

You should use this
$text =~ /\b(START\b(.*?)\bSECTION)\b/;
print "New String: $1\n";
IDEONE DEMO
$1 is the first captured group.
As suggested by borodin
if ( $text =~ /\b(START\b(.*?)\bSECTION)\b/ ) {
my $tmp = $1;
print "New String: $tmp\n";
}

The match operator doesn't change the string it matches.
You can use either of the following to inspect the captured string:
if ( $text =~ /\bSTART\b(.*?)\bSECTION\b/ ) {
my $section = $1;
print "New String: $section\n";
}
or
if ( my ($section) = $text =~ /\bSTART\b(.*?)\bSECTION\b/ ) {
print "New String: $section\n";
}

Related

Extract string after a symbol in Perl

How can I extract string after a symbol in Perl?
I tried doing some searches but even the code I found didn't work.
I'm trying to extract the string after a colon. So I want to show everything after the colon.
Example:
string = day1: string over here
substring = string over here
So far I have tried:
$substring = $string=~ /(\:.*)\s*$/;
But it only outputs the number 1 over and over.
That's because pattern matches in a scalar context are boolean tests. If you want to capture bracket content (capture groups), you need a list context. It's ok if the list is only one element though:
try this:
my ( $substring ) = $string=~ /(\:.*)\s*$/;
Difference maybe a bit subtle, but basically - we are assigning 'all the hits' from the pattern match to a list... that comprises one element.
Note - that's so you can do:
my #matches = $string =~ m/(.)/g;
And get multiple 'hits' returned. If you do as above, you will only get the first match - which is irrelevant given your pattern, but you can do:
my ( $key, $value ) = $string =~ m/(\w+)=(\w+)/;
for example.
I usually use parentheses to extract a part from text and then refer to the result stored in $1 variable.
look at example:
my $text = "day1: string over here";
print $1 if ($text =~ /:\s*(.+)$/);
but similar result may be recieved with this code too:
my $text = "day1: string over here";
my ($a) = $text =~ /:\s*(.+)$/;
print $a;
You can achieve desire substring by using split function also:
#!/usr/bin/perl
use warnings;
use strict;
my $string = "day1: string over here";
my (undef, $substring) = split(':\s*', $string);
print $substring, "\n";
Output:
string over here
Or you can get this by using capturing group () in regex:
my $string = "day1: string over here";
$string =~ m/(.*)\:\s+(.*)$/;
my $substring = $2;
print $substring, "\n";

Perl how do you assign a varanble to a regex match result

How do you create a $scalar from the result of a regex match?
Is there any way that once the script has matched the regex that it can be assigned to a variable so it can be used later on, outside of the block.
IE. If $regex_result = blah blah then do something.
I understand that I should make the regex as non-greedy as possible.
#!/usr/bin/perl
use strict;
use warnings;
# use diagnostics;
use Win32::OLE;
use Win32::OLE::Const 'Microsoft Outlook';
my #Qmail;
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\\s\*owner \#/";
my $outlook = Win32::OLE->new('Outlook.Application')
or warn "Failed Opening Outlook.";
my $namespace = $outlook->GetNamespace("MAPI");
my $folder = $namespace->Folders("test")->Folders("Inbox");
my $items = $folder->Items;
foreach my $msg ( $items->in ) {
if ( $msg->{Subject} =~ m/^(.*test alert) / ) {
my $name = $1;
print " processing Email for $name \n";
push #Qmail, $msg->{Body};
}
}
for(#Qmail) {
next unless /$regex|^\s*description/i;
print; # prints what i want ie lines that start with owner and description
}
print $sentence; # prints ^\\s\*offense \ # not lines that start with owner.
One way is to verify a match occurred.
use strict;
use warnings;
my $str = "hello what world";
my $match = 'no match found';
my $what = 'no what found';
if ( $str =~ /hello (what) world/ )
{
$match = $&;
$what = $1;
}
print '$match = ', $match, "\n";
print '$what = ', $what, "\n";
Use Below Perl variables to meet your requirements -
$` = The string preceding whatever was matched by the last pattern match, not counting patterns matched in nested blocks that have been exited already.
$& = Contains the string matched by the last pattern match
$' = The string following whatever was matched by the last pattern match, not counting patterns matched in nested blockes that have been exited already. For example:
$_ = 'abcdefghi';
/def/;
print "$`:$&:$'\n"; # prints abc:def:ghi
The match of a regex is stored in special variables (as well as some more readable variables if you specify the regex to do so and use the /p flag).
For the whole last match you're looking at the $MATCH (or $& for short) variable. This is covered in the manual page perlvar.
So say you wanted to store your last for loop's matches in an array called #matches, you could write the loop (and for some reason I think you meant it to be a foreach loop) as:
my #matches = ();
foreach (#Qmail) {
next unless /$regex|^\s*description/i;
push #matches_in_qmail $MATCH
print;
}
I think you have a problem in your code. I'm not sure of the original intention but looking at these lines:
my $regex = "^\\s\*owner \#";
my $sentence = $regex =~ "/^\s*owner #/";
I'll step through that as:
Assign $regexto the string ^\s*owner #.
Assign $sentence to value of running a match within $regex with the regular expression /^s*owner $/ (which won't match, if it did $sentence will be 1 but since it didn't it's false).
I think. I'm actually not exactly certain what that line will do or was meant to do.
I'm not quite sure what part of the match you want: the captures, or something else. I've written Regexp::Result which you can use to grab all the captures etc. on a successful match, and Regexp::Flow to grab multiple results (including success statuses). If you just want numbered captures, you can also use Data::Munge
You can do the following:
my $str ="hello world";
my ($hello, $world) = $str =~ /(hello)|(what)/;
say "[$_]" for($hello,$world);
As you see $hello contains "hello".
If you have older perl on your system like me, perl 5.18 or earlier, and you use $ $& $' like codequestor's answer above, it will slow down your program.
Instead, you can use your regex pattern with the modifier /p, and then check these 3 variables: ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} for your matching results.

How to use an user input as a regex?

I have a simple program where the user can enter a string.
After this the user can enter a regex. I need the string to be compared against this regex.
The following code do not work - the regex always fails.
And I know that its maybe because I am comparing a string with a string and not a string with a regex.
But how would you do this?
while(1){
print "Enter a string: ";
$input = <>;
print "\nEnter a regex and see if it matches the string: ";
$regex = <>;
if($input =~ $regex){
print "\nThe regex $regex matched the string $input\n\n";
}
}
Use lexical variables instead of global ones.
You should remember that strings read by <> usually contain newlines, so it might be necessary to remove the newlines with chomp, like this:
chomp(my $input = <STDIN>);
chomp(my $regex = <STDIN>);
You might want to interpret regex special characters taken from the user literally, so that ^ will match a literal circumflex, not the beginning of the string, for example. If so, use the \Q escape sequence:
if ($input =~ /\Q$regex\E/) { ... }
Don't forget to read the Perl FAQ in your journey through Perl. It might have all the answers before you even begin to specify the question: How do I match a regular expression that's in a variable?
You need to use a //, m//, or s/// — but you can specify a variable as the pattern.
if ($input =~ /$regex/) {
print "match found\n";
}
I think you need to chomp input and regex variables. and correct the expression to match regex
chomp( $input );
chomp( $regex );
if($input =~ /$regex/){
print "\nThe regex $regex matched the string $input\n\n";
}

How to user Perl to find all instances of strings that match a regexp?

How to user Perl to find and print all strings that match a regexp?
The following only finds the first match.
$text="?Adsfsadfgaasdf.
?Bafadfdsaadsfadsf.
xcxvfdgfdg";
if($text =~ m/\\?([^\.]+\.)/) {
print "$1\n";
}
EDIT1: /g doesn't work
#!/usr/bin/env perl
$text="?Adsfsadfgaasdf.
?Bafadfdsaadsfadsf.
xcxvfdgfdg";
if($text =~ m/\\?([^\.]+\.)/g) {
print "$1\n";
}
$ ./test.pl
?Adsfsadfgaasdf.
The problem is that the /g modifier does not use capture groups for multiple matches. You need to either iterate over the matches in scalar context, or catch the returned list in list context. For example:
use v5.10; # required for say()
$text="?Adsfsadfgaasdf.
?Bafadfdsaadsfadsf.
xcxvfdgfdg";
while ($text =~ /\?([^.]+\.)/g) { # scalar context
say $1;
}
for ($text =~ /\?[^.]+\./g) { # list context
say; # match is held in $_
}
Note in the second case, I skipped the parens, because in list context the whole match is returned if there are no parens. You may add parens to select part of the string.
Your version, using if, uses scalar context, which saves the position of the most recent match, but does not continue. A way to see what happens is:
if($text =~ m/\?([^\.]+\.)/g) {
print "$1\n";
}
say "Rest of string: ", substr $text, pos;
pos gives the position of the most recent match.
In previous answer #TLP correctly wrote that matching should be in list context.
use Data::Dumper;
$text="?Adsfsadfgaasdf.
?Bafadfdsaa.
dsfadsf.
xcxvfdgfdg";
#arr = ($text =~ /\?([^\.]+\.)/g);
print Dumper(#arr);
Expected result:
$VAR1 = 'Adsfsadfgaasdf.';
$VAR2 = 'Bafadfdsaa.';
You seem to be missing the /g flag, which tells perl to repeat the match as many times as possible.

What does $1 mean in Perl?

What does $1 mean in Perl? Further, what does $2 mean?
How many $number variables are there?
The $number variables contain the parts of the string that matched the capture groups ( ... ) in the pattern for your last regex match if the match was successful.
For example, take the following string:
$text = "the quick brown fox jumps over the lazy dog.";
After the statement
$text =~ m/ (b.+?) /;
$1 equals the text "brown".
The number variables are the matches from the last successful match or substitution operator you applied:
my $string = 'abcdefghi';
if ($string =~ /(abc)def(ghi)/) {
print "I found $1 and $2\n";
}
Always test that the match or substitution was successful before using $1 and so on. Otherwise, you might pick up the leftovers from another operation.
Perl regular expressions are documented in perlre.
$1, $2, etc will contain the value of captures from the last successful match - it's important to check whether the match succeeded before accessing them, i.e.
if ( $var =~ m/( )/ ) { # use $1 etc... }
An example of the problem - $1 contains 'Quick' in both print statements below:
#!/usr/bin/perl
'Quick brown fox' =~ m{ ( quick ) }ix;
print "Found: $1\n";
'Lazy dog' =~ m{ ( quick ) }ix;
print "Found: $1\n";
As others have pointed out, the $x are capture variables for regular expressions, allowing you to reference sections of a matched pattern.
Perl also supports named captures which might be easier for humans to remember in some cases.
Given input: 111 222
/(\d+)\s+(\d+)/
$1 is 111
$2 is 222
One could also say:
/(?<myvara>\d+)\s+(?<myvarb>\d+)/
$+{myvara} is 111
$+{myvarb} is 222
These are called "match variables". As previously mentioned they contain the text from your last regular expression match.
More information is in Essential Perl. (Ctrl + F for 'Match Variables' to find the corresponding section.)
Since you asked about the capture groups, you might want to know about $+ too...
Pretty useful...
use Data::Dumper;
$text = "hiabc ihabc ads byexx eybxx";
while ($text =~ /(hi|ih)abc|(bye|eyb)xx/igs)
{
print Dumper $+;
}
OUTPUT:
$VAR1 = 'hi';
$VAR1 = 'ih';
$VAR1 = 'bye';
$VAR1 = 'eyb';
The variables $1 .. $9 are also read only variables so you can't implicitly assign a value to them:
$1 = 'foo'; print $1;
That will return an error: Modification of a read-only value attempted at script line 1.
You also can't use numbers for the beginning of variable names:
$1foo = 'foo'; print $1foo;
The above will also return an error.
I would suspect that there can be as many as 2**32 -1 numbered match variables, on a 32-bit compiled Perl binary.