How to match several regular expression patterns sequentially in Perl

I want to do matching in the following way for a large multiline text:
I have a few matching patterns:
$text =~ m#finance(.*?)end#s;
$text =~ m#<class>(.*?)</class>#s;
$text =~ m#/data(.*?)<end>#s;
If any one of them matches, print the result (print $1), then continue with the rest of the text and try the three patterns again.
How can I get the printed results in the order they appear in the whole text?
Many thanks for your help!

while ($text =~ m#(?: finance (.*?) end
                    | <class> (.*?) </class>
                    | /data (.*?) <end>
                  )
                 #sgx) {
    print $+;
}
ought to do it.
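$+ holds the text of the highest-numbered capturing group that took part in the match. Because only one alternative matches on each iteration, that is the group you want.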
The /g modifier is intended specifically for this kind of usage; it turns the regex into an iterator that, when resumed, continues the match where it left off instead of restarting at the beginning of $text.
(And /x lets you use arbitrary whitespace, meaning you can make your regexes readable. Or as readable as they get, at least.)
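As a rough illustration of the /g iteration described above, the sketch below wraps the loop in a helper and runs it on a made-up sample string. The sub name captures_in_order and the sample text are assumptions for the example; the third alternative follows the /data ... <end> delimiters from the question.

```perl
use strict;
use warnings;

# Hypothetical sample text containing one instance of each pattern.
my $text = "x /data one <end> y <class>two</class> z finance three end";

# Collect the captured text of each match, in the order the matches occur.
sub captures_in_order {
    my ($input) = @_;
    my @found;
    while ($input =~ m{ (?: finance (.*?) end
                          | <class> (.*?) </class>
                          | /data   (.*?) <end>
                        )
                      }sgx) {
        push @found, $+;    # the one capture group that took part in this match
    }
    return @found;
}

print "[$_]\n" for captures_in_order($text);
```

Because /g resumes where the previous match ended, the results come back in the order they appear in the text, regardless of the order of the alternatives in the pattern.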
If you need to deal with multiple captures, it becomes a bit harder as you can't use $+. You can, however, test for capturing groups being defined:
while ($text =~ m#(?: a (.*?) b (.*?) c
                    | d (.*?) e (.*?) f
                    | data (.*?) </end>
                  )
                 #sgx) {
    if (defined $1) {
        # first set matched (don't need to check $2)
    }
    elsif (defined $3) {
        # second set matched
    }
    else {
        # final one matched
    }
}

Related

Extract delimited substrings from a string

I have a string that looks like this
my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3" ;
I want to use capture groups to extract only the PayRate values contained within the first [[...]] block.
$1 = "sDate=05Jul2017,Rate=0.05,eDate=06Sep2017"
I tried this but it returns the entire string.
my $match =~ m/PayRate=\[\[(.*)\]\],/ ;
It is clear that I have to put specific patterns for the series of {(.*)=(.*)} blocks inside. Need expert advice.
You are using a greedy match .*, which consumes as much input as possible while still matching, so you're matching from the first [[ to the last ]].
Instead, use a reluctant match .*?, which matches as little as possible while still matching:
my ( $match) = $source =~ /PayRate=\[\[(.*?)\]\]/;
Use the /x modifier on match (and substitute) so you can use white space for easier reading and comments to tell what is going on. Limit the pattern with a negated character class that matches everything except the closing delimiter: [^\]]*? is better than .*?.
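my ( $match ) = $source =~ m{
    PayRate \= \[\[   # select which part (the name is case-sensitive)
    ( [^\]]*? )       # capture anything between square brackets
    \]\]              # closing brackets, so the lazy capture has something to stop at
}x;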
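To make the greedy versus reluctant difference concrete, here is a small sketch run against the $source string from the question. The helper names pay_rate_block and parse_pairs are made up for this example, and splitting the block into key/value pairs is an assumption about what the asker ultimately wants.

```perl
use strict;
use warnings;

my $source = "PayRate=[[sDate=05Jul2017,Rate=0.05,eDate=06Sep2017]],item1,item2,"
           . "ReceiveRate=[[sDate=05Sep2017,Rate=0.06]],item3";

# Return the text inside the first [[...]] after PayRate=, or undef if absent.
sub pay_rate_block {
    my ($input) = @_;
    my ($block) = $input =~ m{ PayRate = \[\[ ( [^\]]* ) \]\] }x;
    return $block;
}

# Split "key=value,key=value" into a flat key/value list (hypothetical helper).
sub parse_pairs {
    my ($block) = @_;
    return map { split /=/, $_, 2 } split /,/, $block;
}

my ($greedy) = $source =~ /PayRate=\[\[(.*)\]\]/;     # runs on to the last ]]
my ($lazy)   = $source =~ /PayRate=\[\[(.*?)\]\]/;    # stops at the first ]]
print "greedy: $greedy\n";
print "lazy:   $lazy\n";

my %pay = parse_pairs(pay_rate_block($source));
print "$_ => $pay{$_}\n" for sort keys %pay;
```

The negated character class gives the same result as the reluctant match here, but it cannot run past a closing bracket even when the rest of the pattern fails.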

Perl $1 variable not defined after regex match

This is probably a very basic error on my part, but I've been stuck on this problem for ages and it's driving me up the wall!
I am looping through a file of Python code using Perl and identifying its variables. I am using a Perl regex to pick out substrings of alphanumeric characters in between spaces. The regex works fine and identifies the lines that the matches belong to, but when I try to return the actual substring that matches the regex, the capture variable $1 is undefined.
Here is my regex:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
    print $line;
    print $1;
}
And here is the error:
x = 1
Use of uninitialized value $1 in print at ./vars.pl line 7, <> line 2.
As I understand it, $1 is supposed to return x. Where is my code going wrong?
You're not capturing the result:
if ($line =~ /.*\s+([a-zA-Z0-9]+)\s+.*/) {
If you want to match a line like x = 1 and get both parts of it, you need to match and capture both with parentheses. A crude approach:
if ( $line =~ /^\s* ( \w+ ) \s* = \s* ( \w+ ) \s* $/msx ) {
    my $var = $1;
    my $val = $2;
}
The correct answer has been given by Leeft: You need to capture the string by using parentheses. I wanted to mention some other things. In your code:
if ($line =~ /.*\s+[a-zA-Z0-9]+\s+.*/) {
    print $line;
    print $1;
}
You are surrounding your match with .*\s+. This is unlikely to be doing what you think. You never need to use .* with m//, unless you are capturing a string (or capturing the whole match using $&). The match is not anchored by default, and will match anywhere in the string. To anchor the match you must use ^ or $. E.g.:
if ('abcdef' =~ /c/) # returns true
if ('abcdef' =~ /^c/) # returns false, match anchored to beginning
if ('abcdef' =~ /c$/) # returns false, match anchored to end
if ('abcdef' =~ /c.*$/) # returns true
As you see in the last example, using .* is quite redundant, and to get the match you need only remove the anchor. Or if you wanted to capture the whole string:
if ('abcdef' =~ /(c.*)$/) # returns true, captures 'cdef'
You can also use $&, which contains the entire match, regardless of parentheses.
You are probably using \s+ to ensure you do not match partial words. You should be aware that there is an escape sequence for a word boundary, \b. This is a zero-length assertion that matches at a position with a word character on one side and a non-word character (or the start or end of the string) on the other.
'abc cde fgh' =~ /\bde\b/ # no match
'abc cde fgh' =~ /\bcde\b/ # match
'abc cde fgh' =~ /\babc/ # match
'abc cde fgh' =~ /\s+abc/ # no match! there is no whitespace before 'a'
As you see in the last example, using \s+ fails at the start or end of the string. Do note that \b also matches next to non-word characters that you might consider part of a word, such as a hyphen:
'aaa-xxx' =~ /\bxxx/ # match
You must decide if you want this behaviour or not. If you do not, an alternative to using \s is to use the double negated case: (?!\S). This is a zero-length negative look-ahead assertion, looking for non-whitespace. It will be true for whitespace, and for end of string. Use a look-behind to check the other side.
Lastly, you are using [a-zA-Z0-9]. This can be replaced with \w, although \w also includes underscore _ (and other word characters).
So your regex becomes:
/\b(\w+)\b/
Or
/(?<!\S)(\w+)(?!\S)/
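The difference between the two final patterns can be seen with a short sketch. The helper name tokens and the sample lines are assumptions made for this example only.

```perl
use strict;
use warnings;

# Return every token captured by the given compiled pattern
# (the pattern is expected to contain exactly one capture group).
sub tokens {
    my ($re, $input) = @_;
    my @found = $input =~ /$re/g;
    return @found;
}

my $word_boundary   = qr/\b(\w+)\b/;
my $space_delimited = qr/(?<!\S)(\w+)(?!\S)/;

for my $line ('x = 1', 'aaa-xxx yyy') {
    print "line: '$line'\n";
    print "  \\b version:         @{[ tokens($word_boundary, $line) ]}\n";
    print "  lookaround version: @{[ tokens($space_delimited, $line) ]}\n";
}
```

On 'aaa-xxx yyy' the \b version picks up aaa, xxx and yyy, while the lookaround version only accepts yyy, because aaa and xxx touch a hyphen rather than whitespace or the edge of the string.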
Documentation:
perldoc perlvar - Perl built-in variables
perldoc perlop - Perl operators
perldoc perlre - Perl regular expressions

Perl regular expression matching repeating words

I need a regular expression that matches any line of input that has the same word repeated two or more times consecutively. Assume there is one space between consecutive words.
if ($line !~ m/(\b(\w+)\b\s){2,}/) {
    print "No match\n";
}
else {
    print "$`";    # print out first part of string
    print "<$&>";  # highlight the matching part
    print "$'";    # print out the rest
}
This is the best I have so far, but there is something wrong.
Correct me if I am wrong:
\b start with a word boundary
(\w+) followed by one or more words
\b end with a word boundary
\s then a space
{2,} check if this thing repeats 2 or more times
What's wrong with my expression?
This should be what you're looking for: (?:\b(\w+)\b) (?:\1(?: |$))+
Also, don't use \s when you're just looking for spaces as it's possible you'll match a newline or some other whitespace character. Simple spaces aren't delimiters or special characters in regex, so it's fine to just type the space. You can use [ ] if you want it to be more visually apparent.
I tried CAustin's answer in regexr.com and the results were not what I would expect. Also, no need for all the non-capturing groups.
My regex:
(\b(\w+))( \2)+
Word-boundary, followed by (1 or more word characters)[group 2], followed by one or more of: space, group 2.
This next one replaces the space with \s+, generalizing the separation between the words to be 1 or more of any kind of white-space:
(\b(\w+))(\s+\2)+
You aren't actually checking to see if it's the SAME word that's repeating. To do that, you need to use a captured backreference:
if ($line =~ m/\b(\w+)(?:\s\1){2,}\b/) {
print "matched '$1'\n";
}
Also, anytime you're testing a regular expression, it's helpful to create a list of examples to work with. The following demonstrates one way of doing that using the __DATA__ block:
use strict;
use warnings;
while (my $line = <DATA>) {
    if ($line =~ m/\b(\w+)(?:\s\1){2,}/) {
        print "matched '$1'\n";
    } else {
        print "no match\n";
    }
}
__DATA__
foo foo
foo bar foo
foo foo foo
Outputs
no match
no match
matched 'foo'
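Note that {2,} after the first word requires the word to appear at least three times in total, which is why foo foo reports no match above. If "repeated two or more times" is read as "appears at least twice in a row", a variant with + instead of {2,} may be closer to the intent. The following is only a sketch under that assumption; the sub name doubled_word is made up for the example.

```perl
use strict;
use warnings;

# Return the repeated word if the line contains the same word at least twice
# in a row, separated by single spaces; otherwise return undef.
sub doubled_word {
    my ($line) = @_;
    return $line =~ /\b(\w+)(?: \1)+\b/ ? $1 : undef;
}

for my $line ('foo foo', 'foo bar foo', 'foo foo foo', 'foo food') {
    my $word = doubled_word($line);
    print defined $word ? "'$line': matched '$word'\n" : "'$line': no match\n";
}
```

The trailing \b keeps foo food from counting as a repeat, since the second foo there is only the start of a longer word.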

Perl regex issue with brackets where content are multiline

I have a string in a file, which is to be read by Perl, and can either be:
previous content ending with a linebreak
keyword: content
next content
or
previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content
In either case, I successfully loaded the contents, from the beginning of previous content, till the end of next, in a string, call it $str.
Now, I want to extract the stuff between the linebreak that ends previous content, and the linebreak before next content.
So I used a regex on $str like this:
if($str =~
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?!\{+) # NO { follows
\s+(?!\{+) # NO { with a heading whitespace
\s* # white space between keyword: and content
(?!\{+) # no { immediately before content
# question : should the last one be a negative lookbehind AFTER the check for content itself?
([^\s]+) # the content, should be in $1;
(?!\{+) # no trailing { immediately after content
\s+ # delimited by a whitespace, ignore what comes afterwards
| # or
/.*\nkeyword: # keyword: is always constant, immediately after a newline
(?=\s*{*\s*)*) # any mix of whitespace and {
(?=\{+) # at least one {
(?=\s*{*\s*)*) # again any mix of whitespace and {
([^\{\}]+) # no { or }
(?=\s*}*\s*)*) # any mix of whitespace and }
(?=\}+) # at least one }
(?=\s*}*\s*)*) # again any mix of whitespace and }
) { #do something with $1}
I realize that this one is not really addressing multiline information with nested parenthesis; however, it should capture objects in form keyword: {{ content} }
However, while I am able to capture the content in $1 in case of
keyword: content
form, I am unable to capture
keyword: {multiline with nested
{parenthesis} }
I finally did implement it using a simple counter based parser, instead of regex. I would love to know how can I do this in regex, to capture objects of the second form, with an explanation of the regex command, please.
Also, where did my formulation go wrong that it does not even capture single line content with multiple (but matched) heading and trailing parenthesis?
You can use this:
#!/usr/bin/perl
use strict;
use warnings;
my $str = "previous content ending with a linebreak
keyword: content
next content
previous content, also ending with a line end
keyword: { content that contains {
nested parenthesis } and may span
multiple lines, closed by matching parenthesis}
next content";
while ($str =~ /\nkeyword:
    (?| # branch reset: i.e. the two capture groups have the same number
        \s*
        (\{ (?> [^{}]++ | (?1) )*+ \}) # recursive pattern
      | # OR
        \h*
        (.*+) # capture all until the end of line
    ) # close the branch reset group
    /xg ) {
    print "$1\n";
}
This pattern first tries to match content with nested curly brackets. If curly brackets are not found or are not balanced, the second alternative is tried and matches only the rest of the line (since the dot can't match newlines).
The branch reset feature (?|..|..) is useful to give the same number to the capturing group of each part of the alternation.
recursive pattern details:
(              # open the capturing group 1
  \{           # literal opening curly bracket
  (?>          # atomic group: possible content between brackets
      [^{}]++  # all that is not a curly bracket
    |          # OR
      (?1)     # recurse to the capturing group 1 (!here is the recursion!)
  )*+          # repeat the atomic group zero or more times
  \}           # literal closing curly bracket
)              # close the capturing group 1
In this subpattern I use an atomic group (?>...) and possessive quantifiers ++ and *+ to avoid backtracking as much as possible.
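For experimenting with the recursive part on its own, the sketch below wraps it in a helper that returns the values as a list. It is a simplified variant rather than the exact pattern above: it uses two ordinary capture groups instead of a branch reset, anchors on ^ with /m instead of a leading \n, and escapes the literal braces. The sub name keyword_values and the sample string are assumptions for the example.

```perl
use strict;
use warnings;

# Return the value after each "keyword:" at the start of a line: a balanced
# {...} block if there is one (it may span lines), else the rest of the line.
# keyword_values is a hypothetical name used only for this sketch.
sub keyword_values {
    my ($input) = @_;
    my @values;
    while ($input =~ m/^keyword:
            (?:
                \s* ( \{ (?: [^{}]++ | (?1) )*+ \} )
              |
                \h* ( .*+ )
            )
        /xmg) {
        # Group 1 is set only when the balanced-brace alternative matched.
        push @values, defined $1 ? $1 : $2;
    }
    return @values;
}

my $str = "previous content\nkeyword: content\nnext content\n"
        . "keyword: { outer { nested } and\nmore }\nnext content";
print "[$_]\n" for keyword_values($str);
```

If the braces never balance, the first alternative fails as a whole and the second one falls back to capturing the remainder of the line.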
How about something like this?
if ($str =~ /keyword:\s*\{(.*)\}/s) {
    my $key = $1;
    if ($key =~ /([^{}]*)/) {
        print "$1\n";
    }
    else {
        print "$key\n";
    }
}
elsif ($str =~ /keyword:\s*(.*)/) {
    print "$1\n";
}
[^{}]* matches a run of characters that contains no braces. Note that it matches the first such run in $key, which is the text before any nested brace.
The /s modifier lets . match newlines, so .* can span multiple lines. However, you don't want to look at multiple lines for keywords without braces, so that part is in the elsif statement.
Do you need to have the same number of matching braces? For example, should keyword: {foo{bar{hello}}} output {{{hello}}}? If so, I feel like it would be better to stick with counters.
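For comparison, a counter-based approach could look roughly like the following. This is a minimal sketch, not the asker's actual parser; the sub name balanced_block is hypothetical, and it assumes braces are never quoted or escaped in the input.

```perl
use strict;
use warnings;

# Starting at the first "{" in $input, return the balanced {...} block,
# or undef if there is no "{" or the braces never balance.
sub balanced_block {
    my ($input) = @_;
    my $start = index($input, '{');
    return undef if $start < 0;

    my $depth = 0;
    for my $pos ($start .. length($input) - 1) {
        my $char = substr($input, $pos, 1);
        $depth++ if $char eq '{';
        $depth-- if $char eq '}';
        # Depth returns to zero exactly at the brace that closes the first one.
        return substr($input, $start, $pos - $start + 1) if $depth == 0;
    }
    return undef;    # ran out of input with braces still open
}

print balanced_block("keyword: {foo{bar{hello}}} trailing"), "\n";
```

The counter makes the nesting depth explicit, which is easy to extend (for example, to report where an unbalanced brace was opened) at the cost of more code than a single pattern.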
Edit:
For the input
keyword: {multiline
with nested {parenthesis} }
if you want the output
{multiline with nested {parenthesis} }
I believe that would be
if ($str =~ /keyword:\s*(\{.*\})/s) {
    my $match = $1;
    $match =~ s/\n//g;
    print "$match\n";
}
elsif ($str =~ /keyword:\s*(.*)/) {
    print "$1\n";
}

Perl regex - why does the regex /[0-9\.]+(\,)/ match comma

The following seems to match ,
Can someone explain why?
I would like to match more than one Number or point, ended by comma.
123.456.768,
123,
.,
1.2,
But doing the following unexpectedly prints , too
my $text = "241.000,00";
foreach my $match ($text =~ /[0-9\.]+(\,)/g){
print "$match \n";
}
print $text;
# prints 241.000,
# ,
Update:
The comma matched because:
In list context, //g returns a list of matched groupings, or if there are no groupings, a list of matches to the whole regex
As described in perldoc perlretut.
Use a zero-width positive look-ahead assertion to exclude the comma from the match itself:
$text =~ /[0-9\.]+(?=,)/g
Your match in the foreach loop is in list context. In list context, a match returns what it captured. Parens mark a capture, not the whole regex. You have parens around the comma. You want it the other way around: put the parens around the bit you want.
my $text = "241.000,00";
# my($foo) puts the right hand side in list context.
my($integer_part) = $text =~ /([0-9\.]+),/;
print "$integer_part\n"; # 241.000
If you don't want to match the comma, use a lookahead assertion:
/[0-9\.]+(?=,)/g
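The list-context behaviour described above can be checked side by side with a short sketch. The array names are made up for the example; the input is the $text value from the question.

```perl
use strict;
use warnings;

my $text = "241.000,00";

my @group_is_comma  = $text =~ /[0-9.]+(,)/g;     # one group: returns what the group captured
my @no_groups       = $text =~ /[0-9.]+,/g;       # no groups: returns each whole match
my @group_is_number = $text =~ /([0-9.]+),/g;     # group around the number instead
my @lookahead       = $text =~ /[0-9.]+(?=,)/g;   # comma asserted but not consumed

print "group around comma:  @group_is_comma\n";
print "no groups:           @no_groups\n";
print "group around number: @group_is_number\n";
print "lookahead:           @lookahead\n";
```

Only the first form returns a bare comma, because the comma is the only thing inside a capture group there.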
You're capturing the wrong thing! Move the parens from around the comma to around the number.
$text =~ /([0-9\.]+),/g
You can replace the comma with a lookahead, or leave it out of the capture altogether since it isn't part of what you want; it won't make a difference in this case. However, the pattern as written puts the comma instead of the number into capture group 1, and in list context the match returns that group rather than the entire match.
This is how a capture group is retrieved:
my $mystring = "The start text always precedes the end of the end text.";
if ($mystring =~ m/start(.*)end/) {
    print $1;
}