Perl Non-greedy Matching -- Is the "?" character used correctly? - regex

I am trying to match the parameter name of a parameter declaration line such as below:
parameter BWIDTH = 32;
The Perl regular expression used is:
$line =~ /(\w+)\s*=/
where the parameter name, BWIDTH, is captured into $1. Most parameters I encountered are declared in such a way that the name precedes the equal sign, "=", which is the reason the regular expression is designed with the "=" in it (/(\w+)\s*=/).
However there are special cases where the parameter is declared:
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
In this case, the parameter name that I am trying to capture is PORT_WIDTH. Revising the regular expression to match this instance does not capture PORT_WIDTH successfully, although it does capture BWIDTH fine.
$line =~ /(\w+)(\s*\[.*?\])*\s*=/
where (\s*\[.*?\])* matches reg [31:0] PORT_WIDTH [BWIDTH-1:0] which is greedy matching.
I am baffled as to why the metacharacter ? does not halt the greedy matching? How should I revise the regular expression?

Replace the .*? with [^][]* to match 0+ chars other than ] and [:
/(\w+)(\s*\[[^][]*])*\s*=/
            ^^^^^^
You may also turn the second capturing group into a non-capturing one if you are not using that value.
Pattern details:
(\w+) - Group 1: one or more word chars
(\s*\[[^][]*])* - a capturing group (add ?: after ( to make it non-capturing) zero or more occurrences of:
\s* - 0+ whitespaces
\[ - a literal [
[^][]* - a negated character class matching zero or more chars other than ] and [
] - a literal ]
\s* - zero or more whitespaces
= - an equal sign.
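As a quick check of the difference, here is a minimal sketch in Python (whose re module treats these constructs the same way as Perl). The variable names are only illustrative, and the brackets inside the negated class are escaped to keep the pattern unambiguous in Python.

```python
import re

# The asker's pattern: a lazy .*? inside the repeated bracket group.
lazy = re.compile(r'(\w+)(\s*\[.*?\])*\s*=')
# The suggested fix: a negated class that cannot run past a bracket.
negated = re.compile(r'(\w+)(?:\s*\[[^\]\[]*\])*\s*=')

simple = 'parameter BWIDTH = 32;'
special = 'parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;'

lazy_simple = lazy.search(simple).group(1)
lazy_special = lazy.search(special).group(1)
negated_simple = negated.search(simple).group(1)
negated_special = negated.search(special).group(1)

print(lazy_simple, lazy_special)        # BWIDTH reg
print(negated_simple, negated_special)  # BWIDTH PORT_WIDTH
```

With the lazy version the match starts at reg, and .*? is then forced to stretch across "] PORT_WIDTH [" so that the rest of the pattern can succeed.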

Greediness vs. non-greediness affects where a match ends, but it still starts as early as possible. Basically, a greedy match is the leftmost-longest possible match, while non-greedy is leftmost-shortest. But non-greedy is still leftmost, not rightmost.
To get what you want, I would use a more explicit description of what I want matched: /(\w+)(\s*\[[^]]*\])?\s*=/ In English, that's a word (\w+), optionally followed by some text in square brackets ((\s*\[[^]]*\])?), and then optional whitespace and an equals sign. Note that I used a negated character class ([^]]) instead of a non-greedy match for what's inside the brackets - IMO, negated character classes are generally a better option than non-greedy matching.
Results with this regex:
$ perl -E '$x = q(parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
PORT_WIDTH
$ perl -E '$x = q(parameter BWIDTH = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
BWIDTH
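The "leftmost, then shortest" rule can be seen on a smaller input. This is a hedged Python sketch (the sample text and names are made up for illustration); the same behaviour applies in Perl.

```python
import re

text = 'x [a] y [b] z'

# Lazy: the match still starts at the FIRST '[', then .*? grows until ' z' fits.
lazy_match = re.search(r'\[.*?\] z', text).group(0)

# Negated class: cannot cross a ']', so the first '[' fails and the second one wins.
negated_match = re.search(r'\[[^\]]*\] z', text).group(0)

print(lazy_match)     # [a] y [b] z
print(negated_match)  # [b] z
```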

You have information available to you which you are choosing not to use. You know the basic structure of each statement you are trying to parse. The statements have mandatory and optional parts. So, put the information you have in to the match. For example:
#!/usr/bin/env perl
use strict;
use warnings;
my $stuff_in_square_brackets = qr{ \[ [^\]]+ \] }x;
my $re = qr{
    ^
    parameter \s+
    (?: reg \s+)?
    (?: $stuff_in_square_brackets \s+)?
    (\w+) \s+
    (?: $stuff_in_square_brackets \s+)?
    = \s+
    (\w+) ;
    $
}x;
while (my $line = <DATA>) {
    if (my($p, $v) = ($line =~ $re)) {
        print "'$p' = '$v'\n";
    }
}
__DATA__
parameter BWIDTH = 32;
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
Output:
'BWIDTH' = '32'
'PORT_WIDTH' = '32'

Related

Bash regex matching "0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."

In a Bash script I'm writing, I need to capture the /path/to/my/file.c and 93 in this line:
0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).
0xffffffc0006e0584 is in another_function(char *arg1, int arg2) (/path/to/my/other_file.c:94).
With the help of regex101.com, I've managed to create this Perl regex:
^(?:\S+\s){1,5}\((\S+):(\d+)\)
but I hear that Bash doesn't understand \d or ?:, so I came up with this:
^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)
But when I try it out:
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([:alpha:]+[:space:]){1,5}\(([:alpha:]+):([0-9]+)\)"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[0]}
I don't get any match. What am I doing wrong? How can I write a Bash-compatible regex to do this?
You are right, Bash uses POSIX ERE and does not support \d shorthand character class, nor does it support non-capturing groups. See more regex features unsupported in POSIX ERE/BRE in this post.
Use
.*\((.+):([0-9]+)\)
Or even (if you need to grab the first (...) substring in a string):
\(([^()]+):([0-9]+)\)
Details
.* - any 0+ chars, as many as possible (may be omitted, only necessary if there are other (...) substrings and you only need to grab the last one)
\( - a ( char
(.+) - Group 1 (${BASH_REMATCH[1]}): any 1+ chars as many as possible
: - a colon
([0-9]+) - Group 2 (${BASH_REMATCH[2]}): 1+ digits
\) - a ) char.
See the Bash demo (or this one):
test='0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).'
reg='.*\((.+):([0-9]+)\)'
# reg='\(([^()]+):([0-9]+)\)' # This also works for the current scenario
if [[ $test =~ $reg ]]; then
    echo "${BASH_REMATCH[1]}"
    echo "${BASH_REMATCH[2]}"
fi
Output:
/path/to/my/file.c
93
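Both patterns use only syntax that POSIX ERE and Python's re share, so their behaviour on the two sample lines can be sketched in Python. This only illustrates the matching logic, it does not replace a test in Bash itself, and the variable names are assumptions.

```python
import re

lines = [
    '0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).',
    '0xffffffc0006e0584 is in another_function(char *arg1, int arg2) '
    '(/path/to/my/other_file.c:94).',
]

# Greedy .* pushes the match to the last "(...:digits)" substring.
last_parens = re.compile(r'.*\((.+):([0-9]+)\)')
# [^()]+ cannot cross a parenthesis, so "(char *arg1, int arg2)" is skipped.
no_nested = re.compile(r'\(([^()]+):([0-9]+)\)')

last_results = [last_parens.search(line).groups() for line in lines]
nested_results = [no_nested.search(line).groups() for line in lines]

for path, lineno in last_results:
    print(path, lineno)
```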
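In the first pattern you use \S+, which matches one or more non-whitespace chars. That is a broad match and will also match, for example, /, which the second pattern does not take into account.
The second pattern starts with [:alpha:], but the first char of the line is a 0. Note also that a POSIX class is only recognized inside a bracket expression: [:alpha:] on its own is a bracket expression matching one of the chars :, a, l, p, h, so it has to be written as [[:alpha:]]. You could use [[:alnum:]] instead, and since the repetition should also match _, add that to the bracket expression as well.
Note that when a quantifier is applied to a capturing group, the group keeps only the value of its last iteration. So with {1,5} the group serves only to repeat the subpattern; after the match, its value would be "some_function " (the last word plus the whitespace after it).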
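The "last iteration wins" behaviour of a quantified capturing group is the same in most engines. Here is a small Python sketch of it, using \w and \s as stand-ins for the POSIX classes (an assumption made for brevity):

```python
import re

line = '0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93).'

# The group is repeated up to five times, but keeps only its final iteration.
m = re.match(r'(\w+\s){1,5}', line)

whole = m.group(0)  # everything the repetition consumed
last = m.group(1)   # only the last word plus its trailing space

print(repr(whole))  # '0xffffffc0006e0584 is in some_function '
print(repr(last))   # 'some_function '
```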
You might use:
^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Your code could look like
line1="0xffffffc0006e0584 is in some_function (/path/to/my/file.c:93)."
regex="^([[:alnum:]_]+[[:space:]]){1,5}\(((/[[:alpha:]]+)+\.[[:alpha:]]):([[:digit:]]+)\)\.$"
[[ $line1 =~ $regex ]]
echo ${BASH_REMATCH[2]}
echo ${BASH_REMATCH[4]}
Result
/path/to/my/file.c
93
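Or a slightly shorter version using \S, where the values are in groups 2 and 3 (\S is a GNU extension rather than part of POSIX ERE, so it may not work on every platform):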
^([[:alnum:]_]+[[:space:]]){1,5}\((\S+\.[[:alpha:]]):([[:digit:]]+)\)\.$
Explanation
^ Start of string
([[:alnum:]_]+[[:space:]]){1,5} Repeat 1-5 times what is captured in group 1
\( match (
(\S+\.[[:alpha:]]) Capture group 2 Match 1+ non whitespace chars, . and an alphabetic character
: Match :
([[:digit:]]+) Capture group 3 Match 1+ digits
\)\. Match ).
$ End of string
See this page about bracket expressions

Regular expression to match requiring a module?

I'm using regular expressions in a custom text editor to in effect whitelist certain modules (assert and crypto). I'm close to what I need but not quite there. Here it is:
/require\s*\(\s*'(?!(\bassert\b|\bcrypto\b)).*'\s*\)/
I want the regular expression to match any line with require('foo'); where foo is anything except 'assert' or 'crypto'. The case I'm failing on is require('assert ');, which my regex does not match, whereas require(' assert'); is correctly matched.
https://regexr.com/4i6ot
If you don't want to match assert or crypto between ', you could change the lookahead to assert exactly that. You can omit the word boundaries matching the words right after the '.
If what follows should match until the first occurrence of ', you could use a negated character class [^'\r\n]* to match any char except ' or a newline.
require\s*\(\s*'(?!(assert|crypto)')[^'\r\n]*'\s*\)
^
You can use: require\s*\(\s*'(?!(\bassert'|\bcrypto')).*'\s*\)
The difference is that I replaced the word boundary \b at the end of each module name with '. With \b, a module name such as 'assert ' still satisfied the lookahead pattern, because \b matches at the boundary between t and the following space, so the negative lookahead rejected it. In the new version, the lookahead requires ' directly after the module name.
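To compare the two lookaheads side by side, here is a hedged Python sketch. The sample strings are invented for illustration, and lookaheads and \b behave the same way in JavaScript for these inputs.

```python
import re

# Original attempt: word boundaries inside the negative lookahead.
with_boundary = re.compile(r"require\s*\(\s*'(?!(\bassert\b|\bcrypto\b)).*'\s*\)")
# Fixed version: the lookahead demands the closing quote right after the name.
with_quote = re.compile(r"require\s*\(\s*'(?!(?:assert|crypto)')[^'\r\n]*'\s*\)")

samples = [
    "require('foo');",
    "require('assert');",
    "require('crypto');",
    "require('assert ');",
    "require(' assert');",
    "require('assertion');",
]

boundary_hits = [s for s in samples if with_boundary.search(s)]
quote_hits = [s for s in samples if with_quote.search(s)]

print(boundary_hits)  # misses "require('assert ');"
print(quote_hits)     # flags everything except the exact names assert and crypto
```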
EDIT
As Cary Swoveland advised, leading \b are not required:
require\s*\(\s*'(?!(assert'|crypto')).*'\s*\)
I assume from the flawed regex that if there is a match the string between "('" and "')" is to be captured. One way to do that follows.
r = /
  require             # match word
  \ *                 # match zero or more spaces (note escaped space)
  \(                  # match a left paren
  (?!                 # begin a negative lookahead
    '                 # match a single quote
    (?:assert|crypto) # match either word
    '                 # match a single quote
    (?=\))            # match a right paren in a forward lookahead
  )                   # end negative lookahead
  '                   # match a single quote
  (.*?)               # match any number of characters lazily in capture group 1
  '                   # match a single quote
  \)                  # match a right paren
/x                    # free-spacing regex definition mode
As the capture group is followed by a single quote, matching characters in the capture group lazily ensures that a single quote is not matched in the capture group. I could have instead written ([^']*). In conventional form this regex is written as follows:
r = /require *\((?!'(?:assert|crypto)'(?=\)))'(.*?)'\)/
Note that in free-spacing regex definition mode spaces will be removed unless they are escaped, put in a character class ([ ]), replaced with \p{Space} and so on.
"require ('victory')" =~ r #=> 0
$1 #=> "victory"
"require (' assert')" =~ r #=> 0
$1 #=> " assert"
"require ('assert ')" =~ r #=> 0
$1 #=> "assert "
"require ('crypto')" =~ r #=> nil
"require ('assert')" =~ r #=> nil
"require\n('victory')" =~ r #=> nil
Notice that had I replaced the space character in the regex with \s, the last example would instead have given:
"require\n('victory')" =~ r #=> 0
$1 #=> "victory"
I don't think you need anything remotely that complicated; this simple pattern will work just fine:
require\((?!'crypto'|'assert')'.*'\);

Perl greedy regex is not acting greedy

Given the following code:
use strict;
use warnings;
my $text = "asdf(blablabla)";
$text =~ s/(.*?)\((.*)\)/$2/;
print "\nfirst match: $1";
print "\nsecond match: $2";
I expected that $2 would catch my last bracket, yet my output is:
first match: asdf
second match: blablabla
If .* is greedy by default, why did it stop at the bracket?
The .* is a greedy subpattern, but it does not account for grouping. Grouping is defined with a pair of unescaped parentheses (see Use Parentheses for Grouping and Capturing).
See where your group boundaries are:
s/(.*?)\((.*)\)/$2/
  | G1|  |G2|
So, the \( and \) that match ( and ) are outside the groups, and will be part of neither $1 nor $2.
If you need the ) to be part of $2, use
s/(.*?)\((.*\))/$2/
^
A regex engine is processing both the string and the pattern from left to right. The first (.*?) is handled first, and it matches up to the first literal ( symbol as it is lazy (matches as few chars as possible before it can return a valid match), and the whole part before the ( is placed into Group 1 stack. Then, the ( is matched, but not captured, then (.*) matches any 0+ characters other than a newline up to the last ) symbol, and places the capture into Group 2. Then, the ) is just matched. The point is that .* grabs the whole string up to the end, but then backtracking happens since the engine tries to accommodate for the final ) in the pattern. The ) must be matched, but not captured in your pattern, thus, it is not part of Group 2 due to the group boundary placement. You can see the regex debugger at this regex demo page to see how the pattern matches your string.
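The effect of the group boundaries can be checked with a short sketch. It is written in Python for illustration; the grouping and backtracking rules are the same as in Perl, and the second input string is an invented example.

```python
import re

text = 'asdf(blablabla)'

# \) sits outside group 2, so the closing parenthesis is matched but not captured.
outside = re.search(r'(.*?)\((.*)\)', text).groups()
# Moving \) inside the group makes it part of the capture.
inside = re.search(r'(.*?)\((.*\))', text).groups()
# .* really is greedy: it runs to the LAST ')' that still lets the pattern match.
greedy = re.search(r'(.*?)\((.*)\)', 'f(a)(b)').groups()

print(outside)  # ('asdf', 'blablabla')
print(inside)   # ('asdf', 'blablabla)')
print(greedy)   # ('f', 'a)(b')
```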

Perl regex to extract digits from string with parenthesis

I have the following string:
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) into another variable.
The following statement does not work:
my ($NAdapter) = $string =~ /\((\d+)\)/;
What is the correct syntax?
\d+(?=[^(]*\))
You can use this. See the demo. Yours will not work because inside the () there is more data besides \d+.
https://regex101.com/r/fM9lY3/57
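You could try something like
my ($NAdapter) = $string =~ /\(.*?(\d+).*?\)/;
After that, $NAdapter should contain the number that you want. The first .* is made lazy (.*?) so that (\d+) receives the whole number; a greedy .* would leave it only the last digit.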
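One thing to watch when a .* comes before (\d+): greediness decides how many digits the group gets. Below is a minimal Python sketch of that effect, using a multi-digit variant of the string as an illustrative assumption.

```python
import re

s = 'Ethernet FlexNIC (NIC 1234) LOM1:1-a FC:15:B4:13:6A:A8'

# Greedy .* swallows as much as it can and gives back only one digit.
greedy_digits = re.search(r'\(.*(\d+).*\)', s).group(1)
# Lazy .*? stops as soon as a digit is available, so \d+ takes the whole number.
lazy_digits = re.search(r'\(.*?(\d+).*?\)', s).group(1)

print(greedy_digits)  # 4
print(lazy_digits)    # 1234
```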
my $string = "Ethernet FlexNIC (NIC 1) LOM1:1-a FC:15:B4:13:6A:A8";
I want to extract the number that is in brackets (1) in another
variable
Your regex (with some spaces for clarity):
/ \( (\d+) \) /x;
says to match:
A literal opening parenthesis, immediately followed by...
A digit, one or more times (captured in group 1), immediately followed by...
A literal closing parenthesis.
Yet, the substring you want to match:
(NIC 1)
is of the form:
A literal opening parenthesis, immediately followed by...
Some capital letters
STOP EVERYTHING! NO MATCH!
As an alternative, your substring:
(NIC 1)
could be described as:
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (NIC 1234) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
(\d+) #Match any digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
#Parentheses have a special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
1234
Another description of your substring:
(NIC 1)
could be:
A literal opening parenthesis, immediately followed by...
Some non-digits, immediately followed by...
Some digits, immediately followed by...
A literal closing parenthesis.
Here's the regex:
use strict;
use warnings;
use 5.020;
my $string = "Ethernet FlexNIC (ABC NIC789) LOM1:1-a FC:15:B4:13:6A:A8";
my ($match) = $string =~ /
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
say $match;
--output:--
789
If there might be spaces on some lines and not others, such as:
spaces
||
VV
(NIC 1 )
(NIC 2)
You can insert a \s* (any whitespace, zero or more times) in the appropriate place in the regex, for instance:
my ($match) = $string =~ /
#Parentheses have special meaning in a regex--they create a capture
#group--so if you want to match a parenthesis in your string, you
#have to escape the parenthesis in your regex with a backslash.
\( #Match a literal opening parenthesis, followed by...
\D+ #a non-digit, one or more times, followed by...
(\d+) #a digit, one or more times, captured in group 1, followed by...
\s* #any whitespace, zero or more times, followed by...
\) #a literal closing parenthesis.
/xms; #Standard flags that some people apply to every regex.
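The same pattern can be tried against several inputs at once. This is a sketch in Python, where re.VERBOSE plays the role of Perl's /x flag; the sample strings are assumptions based on the cases discussed above.

```python
import re

pattern = re.compile(r'''
    \(      # a literal opening parenthesis
    \D+     # one or more non-digits
    (\d+)   # one or more digits, captured in group 1
    \s*     # optional whitespace
    \)      # a literal closing parenthesis
''', re.VERBOSE)

samples = ['(NIC 1  )', '(NIC 2)', '(ABC NIC789)']
found = [pattern.search(s).group(1) for s in samples]

print(found)  # ['1', '2', '789']
```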

What does this Perl regex mean: m/(.*?):(.*?)$/g?

I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?)      # Match any character (except newlines)
                        # zero or more times, not greedily, and
                        # stick the results in $1.
             :          # Match a colon.
             (.*?)      # Match any character (except newlines)
                        # zero or more times, not greedily, and
                        # stick the results in $2.
             $          # Match the end of the line.
           /gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (which here means a string that does not contain a colon), then the code within the braces is not executed. But if it does match, then the code within the braces can use the special $1 and $2 variables (at least until the next successful regular expression match, if there is one within the braces).
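The sample strings above can be checked with a short sketch. It uses Python's re, where (.*?), : and $ behave as in Perl for these inputs; split_line is a hypothetical helper name.

```python
import re

pattern = re.compile(r'(.*?):(.*?)$')

def split_line(line):
    """Return (before, after) around the first colon, or None if there is no match."""
    m = pattern.search(line)
    return m.groups() if m else None

print(split_line(':'))            # ('', '')
print(split_line('abc:'))         # ('abc', '')
print(split_line('abc:def:ghi'))  # ('abc', 'def:ghi')
print(split_line('no colon'))     # None
```

The second group has to stretch all the way to the end of the line because $ follows it, which is why it ends up holding "def:ghi" even though it is lazy.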
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  :                        ':'
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    .*?                      any character except \n (0 or more times
                             (matching the least amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  $                        before an optional \n, and the end of the
                           string
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
This could have been written as
if ($lines =~ /:/) {
    ... # use $` ($PREMATCH) instead of $1
    ... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
    ... # use $var1, $var2 instead of $1,$2
}
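For comparison, the "split on the first colon" idea looks like this outside Perl. This is a Python sketch with a hypothetical helper name; stripping the trailing newline is an assumption made to mirror what $2 would contain in the regex version.

```python
def split_on_first_colon(line):
    """Split at the first ':' only; return None when the line has no colon."""
    head, sep, tail = line.rstrip('\n').partition(':')
    return (head, tail) if sep else None

print(split_on_first_colon('abc:def:ghi\n'))  # ('abc', 'def:ghi')
print(split_on_first_colon('abc:'))           # ('abc', '')
print(split_on_first_colon('abc'))            # None
```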
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression
to a pattern match. Certain operations
search or modify the string $_ by
default. This operator makes that kind
of operation work on some other
string. The right argument is a search
pattern, substitution, or
transliteration. The left argument is
what is supposed to be searched,
substituted, or transliterated instead
of the default $_. When used in scalar
context, the return value generally
indicates the success of the
operation.
The regex itself is:
m/       #Perform a "match" operation
(.*?)    #Match zero or more repetitions of any character (except newline), but as few as possible (ungreedy)
:        #Match a literal colon character
(.*?)    #Match zero or more repetitions of any character (except newline), but as few as possible (ungreedy)
$        #Match the end of the string (or just before a trailing newline)
/g       #Match globally; in the scalar context of an if, each evaluation finds the next match in $lines
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.