I am trying to match the parameter name of a parameter declaration line such as below:
parameter BWIDTH = 32;
The Perl regular expression used is:
$line =~ /(\w+)\s*=/
where the parameter name, BWIDTH, is captured into $1. Most parameters I encountered are declared in such a way that the name precedes the equal sign, "=", which is the reason the regular expression is designed with the "=" in it (/(\w+)\s*=/).
However there are special cases where the parameter is declared:
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
In this case, the parameter name that I am trying to capture is PORT_WIDTH. Revising the regular expression to match this instance does not capture PORT_WIDTH successfully, although it does capture BWIDTH fine.
$line =~ /(\w+)(\s*\[.*?\])*\s*=/
where (\s*\[.*?\])* matches reg [31:0] PORT_WIDTH [BWIDTH-1:0] which is greedy matching.
I am baffled as to why the metacharacter ? does not halt the greedy matching? How should I revise the regular expression?
Replace the .*? with [^][]* to match 0+ chars other than ] and [:
/(\w+)(\s*\[[^][]*])*\s*=/
            ^^^^^^
You may also turn the second capturing group into a non-capturing one if you are not using that value.
Pattern details:
(\w+) - Group 1: one or more word chars
(\s*\[[^][]*])* - a capturing group (add ?: after ( to make it non-capturing) zero or more occurrences of:
\s* - 0+ whitespaces
\[ - a literal [
[^][]* - a negated character class matching zero or more chars other than ] and [
] - a literal ]
\s* - zero or more whitespaces
= - an equal sign.
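As a quick check of the pattern described above, here is a minimal sketch that runs it against both declaration styles from the question. The helper name param_name is made up for illustration, and the second group is written as non-capturing, as suggested.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the word left of "=", skipping any
# bracketed ranges such as [BWIDTH-1:0] that sit between the name and "=".
sub param_name {
    my ($line) = @_;
    return $line =~ /(\w+)(?:\s*\[[^][]*\])*\s*=/ ? $1 : undef;
}

print param_name('parameter BWIDTH = 32;'), "\n";
print param_name('parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;'), "\n";
```

Because [^][]* cannot cross a bracket, the attempt that starts at reg fails and the engine moves on until it reaches PORT_WIDTH.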
Greediness vs. non-greediness affects where a match ends, but it still starts as early as possible. Basically, a greedy match is the leftmost-longest possible match, while non-greedy is leftmost-shortest. But non-greedy is still leftmost, not rightmost.
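The leftmost rule can be seen in a small experiment. This is only a sketch with a made-up input string; it compares a non-greedy group with a negated character class.

```perl
use strict;
use warnings;

my $text = 'x [1] y [2] =';

# Non-greedy: the match still starts at the first "[", so .*? is forced to
# grow across "] y [" until a "]" followed by "=" can be found.
my ($lazy) = $text =~ /\[(.*?)\]\s*=/;

# Negated class: it cannot cross a bracket, so the attempt at the first "["
# fails and the engine moves on to the second "[".
my ($negated) = $text =~ /\[([^][]*)\]\s*=/;

print "lazy:    '$lazy'\n";     # lazy:    '1] y [2'
print "negated: '$negated'\n";  # negated: '2'
```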
To get what you want, I would use a more explicit description of what I want matched: /(\w+)(\s*\[[^]]*\])?\s*=/ In English, that's a word (\w+), optionally followed by some text in square brackets ((\s*\[[^]]*\])?), and then optional whitespace and an equals sign. Note that I used a negated character class ([^]]) instead of a non-greedy match for what's inside the brackets - IMO, negated character classes are generally a better option than non-greedy matching.
Results with this regex:
$ perl -E '$x = q(parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
PORT_WIDTH
$ perl -E '$x = q(parameter BWIDTH = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
BWIDTH
You have information available to you which you are choosing not to use. You know the basic structure of each statement you are trying to parse. The statements have mandatory and optional parts. So, put the information you have into the match. For example:
#!/usr/bin/env perl
use strict;
use warnings;
my $stuff_in_square_brackets = qr{ \[ [^\]]+ \] }x;
my $re = qr{
^
parameter \s+
(?: reg \s+)?
(?: $stuff_in_square_brackets \s+)?
(\w+) \s+
(?: $stuff_in_square_brackets \s+)?
= \s+
(\w+) ;
$
}x;
while (my $line = <DATA>) {
if (my($p, $v) = ($line =~ $re)) {
print "'$p' = '$v'\n";
}
}
__DATA__
parameter BWIDTH = 32;
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
Output:
'BWIDTH' = '32'
'PORT_WIDTH' = '32'
I need to parse a file that includes function calls. For example:
function(otherFunction1(parameters1), otherFunction2(parameters2))
I need the output to be:
otherFunction1(parameters1), otherFunction2(parameters2)
My attempt is this:
open(my $DATA, '<', 'txt') or die "...";
while(my $line = <$DATA>){
$line =~ /\((\w+)\)/;
my $parameters = $1;
print "$parameters\n";
}
I am just getting
parameters1
Is there a way to use regexp to perhaps find the first and last occurrence of the specified character?
Thanks!
You'll need a recursive regex to do it properly. Like this one (with the x flag):
(?(DEFINE)
(?<fn> # a function is:
\w+ \s* # a name
\( (?&paramList) \) # and a parameter list
)
(?<paramList>
(?:
\s* (?&param)
(?: , \s* (?&param) )* \s*
)*
)
(?<param> # a parameter is:
(?&fn) # a function call
| \w+ # or a simple value
)
)
\w+ \s* \( (?<extractedParameters>(?&paramList)) \)
This is required to match the opening and closing parentheses. Just expand the syntax as needed.
The pattern at the bottom is equivalent to (?&fn) except it encloses the parameter list in a capture group.
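For reference, here is a runnable sketch of the recursive approach, assuming Perl 5.10 or later (needed for named groups and (?&name) recursion). The helper name extract_parameters is made up, and the parameter list is made optional with ? rather than repeated with *, so treat it as an illustration and not a full parser.

```perl
use strict;
use warnings;

my $call_re = qr{
    (?(DEFINE)
        (?<fn>        \w+ \s* \( (?&paramList) \) )    # name plus parenthesised list
        (?<paramList> (?: \s* (?&param) (?: \s* , \s* (?&param) )* \s* )? )
        (?<param>     (?&fn) | \w+ )                   # nested call or simple value
    )
    \w+ \s* \( (?<extractedParameters> (?&paramList) ) \)
}x;

# Hypothetical helper: returns the text between the outermost parentheses.
sub extract_parameters {
    my ($line) = @_;
    return $line =~ $call_re ? $+{extractedParameters} : undef;
}

print extract_parameters('function(otherFunction1(parameters1), otherFunction2(parameters2))'), "\n";
```

The named capture is read from %+ after a successful match; the groups inside the DEFINE block only serve as reusable sub-patterns.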
You almost have it. You do want everything between the first and the last parentheses on each line, right? Unless the lines to parse are more complex than your example, this small change in code might be all you need.
$line =~ /\((.*)\)/;
my $parameters = $1;
Your \w+ will stop matching at the first non-word character in the string. In your example, that is the first right-hand parenthesis.
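A short side-by-side sketch of the two patterns on the sample line from the question shows the difference:

```perl
use strict;
use warnings;

my $line = 'function(otherFunction1(parameters1), otherFunction2(parameters2))';

# Only word characters are allowed between "(" and ")".
my ($word_only) = $line =~ /\((\w+)\)/;

# Greedy .* runs from the first "(" to the last ")" on the line.
my ($greedy) = $line =~ /\((.*)\)/;

print "$word_only\n";  # parameters1
print "$greedy\n";     # otherFunction1(parameters1), otherFunction2(parameters2)
```

Note that the greedy version relies on each line holding a single outer call; it does not check that the parentheses balance.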
I'm trying to match:
JOB: fruit 342 apples to get
The code matches:
$line =~ /^JOB: fruit (\d+) apples to get/
But, when I add the /x switch in:
$line =~ /^JOB: fruit (\d+) apples to get/x
It does not match.
I looked into the /x switch, and it says it just lets you do comments. I don't know why adding /x stops my regex from matching.
The /x modifier tells Perl to ignore most whitespace that isn't escaped in the regex.
For example, let's just focus on apples to get. You could match it with:
$line =~ /apples to get/
But if you try:
$line =~ /apples to get/x
then Perl will ignore the spaces. So it would be like trying to match applestoget.
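The effect is easy to verify. The following sketch uses the line from the question and records which variants match; the squashed string is made up to show what the /x pattern really looks for.

```perl
use strict;
use warnings;

my $line = 'JOB: fruit 342 apples to get';

# Without /x, spaces in the pattern are literal.
my $plain = $line =~ /^JOB: fruit (\d+) apples to get/ ? 1 : 0;

# With /x, unescaped spaces in the pattern are dropped, so this fails.
my $x_naive = $line =~ /^JOB: fruit (\d+) apples to get/x ? 1 : 0;

# With /x, a space inside a bracketed class is still literal.
my $x_fixed = $line =~ /^JOB:[ ]fruit[ ](\d+)[ ]apples[ ]to[ ]get/x ? 1 : 0;

# The naive /x pattern does match the same text with the spaces removed.
my $squashed = 'JOB:fruit342applestoget' =~ /^JOB: fruit (\d+) apples to get/x ? 1 : 0;

print "$plain $x_naive $x_fixed $squashed\n";  # 1 0 1 1
```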
You can read more about it in perlre. They have this nice example of how you can use the modifier to make the code more readable.
# Delete (most) C comments.
$program =~ s {
/\* # Match the opening delimiter.
.*? # Match a minimal number of characters.
\*/ # Match the closing delimiter.
} []gsx;
They also mention how to match whitespace or # again while using the /x modifier.
Use of /x means that if you want real whitespace or # characters in
the pattern (outside a bracketed character class, which is unaffected
by /x), then you'll either have to escape them (using backslashes or
\Q...\E ) or encode them using octal, hex, or \N{} escapes.
Part of allowing comments is also ignoring literal white space. Use \s or [ ] for spaces you wish to match.
For example
$line =~ /^ #beginning of string
JOB:[ ]fruit[ ] #some literal text
(\d+) #capture digits to $1
[ ]apples[ ]to[ ]get #more literal text
/x
Notice all those spaces before the beginning of the comments. It would stink if they counted....
I have an SQL Select dump with many lines each looks like this:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
I want to do 2 things to each line:
Replace all dates with Oracle's sysdate function. Dates can also come without hour (like 07/11/2011).
Replace all null values with null string
Here's my attempt:
$_ =~ s/,(,|\n)/,null$1/g; # Replace no data by "null"
$_ =~ s/\d{2}\/\d{2}\/d{4}.*?,/sysdate,/g; # Replace dates by "sysdate"
But this would transform the string to:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,,null,'text',null,,0,0,null
while I expect it to be
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
I don't understand why dates do not match and why some ,, are not replaced by null.
Any insights welcome, thanks in advance.
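\d{2}\/\d{2}\/d{4}.*?, didn't work because the last d wasn't escaped, so it matches four literal d characters instead of four digits. Some ,, pairs are missed because each match of ,(,|\n) consumes the second comma, which then cannot start the next match.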
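Both problems can be reproduced in isolation. This is a small sketch on made-up input: the first substitution consumes the following comma as in the question, the second only looks ahead at it.

```perl
use strict;
use warnings;

my $row = "a,,,b\n";

# Consuming version: the second comma of each pair is used up by the match.
(my $consuming = $row) =~ s/,(,|\n)/,null$1/g;

# Lookahead version: the following comma is inspected but not consumed.
(my $lookahead = $row) =~ s/,(?=,|\n)/,null/g;

print $consuming;  # a,null,,b
print $lookahead;  # a,null,null,b

# The unescaped d asks for four literal "d" characters.
my $date    = '07/11/2011';
my $bad_re  = $date =~ /\d{2}\/\d{2}\/d{4}/  ? 1 : 0;
my $good_re = $date =~ /\d{2}\/\d{2}\/\d{4}/ ? 1 : 0;
print "$bad_re $good_re\n";  # 0 1
```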
If a , can be on either side, or begin/end of string, you could do it in 2 steps:
step 1
s/(?:^|(?<=,))(?=,|\n)/null/g
expanded:
s/
(?: ^ # Beginning of line, ie: nothing behind us
| (?<=,) # Or, a comma behind us
)
# we are HERE!, this is the place between characters
(?= , # A comma in front of us
| \n # Or, a newline in front of us
)
/null/xg
# The above regex does not consume, it just inserts 'null', leaving the
# same search position (after the insertion, but before the comma).
# If you want to consume a comma, it would be done this way:
s/(?:^|(?<=,))(,|\n)/null$1/xg
# Now the search position is after the 'null,'
step 2
s/(?:^|(?<=,))\d{2}\/\d{2}\/\d{4}.*?(?=,|\n)/sysdate/g
Or, you could combine them into a single regex, using the eval modifier:
$row =~ s/(?:^|(?<=,))(\d{2}\/\d{2}\/\d{4}.*?|)(?=,|\n)/ length $1 ? 'sysdate' : 'null'/eg;
Broken down it looks like this
s{
(?: ^ | (?<=,) ) # begin of line or comma behind us
( # Capt group $1
\d{2}/\d{2}/\d{4}.*? # date format and optional non-newline chars
| # Or, nothing at all
) # End Capt group 1
(?= , | \n ) # comma or newline in front of us
}{
length $1 ? 'sysdate' : 'null'
}egx
If there is a chance of non-newline whitespace padding, it could be written as:
$row =~ s/(?:^|(?<=,))(?:([^\S\n]*\d{2}\/\d{2}\/\d{4}.*?)|[^\S\n]*)(?=,|\n)/ defined $1 ? 'sysdate' : 'null'/eg;
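To check the single-pass version against the sample row from the question, here is a minimal sketch. The wrapper name normalize_row is made up, and the pattern is the unpadded variant from above written with brace delimiters so the slashes need no escaping.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the combined substitution.
sub normalize_row {
    my ($row) = @_;
    # An empty field becomes null; a field starting with dd/dd/dddd becomes sysdate.
    $row =~ s{(?:^|(?<=,))(\d{2}/\d{2}/\d{4}.*?|)(?=,|\n)}
             { length($1) ? 'sysdate' : 'null' }eg;
    return $row;
}

my $sample = "07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,"
           . "'text','text','text','text',,,,'text',,,0,0,\n";

print normalize_row($sample);
```

This reproduces the output the question asks for, including the trailing null before the newline.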
You could do this:
$ cat perlregex.pl
use warnings;
use strict;
my $row = "07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,\n";
print( "$row\n" );
while ( $row =~ /,([,\n])/ ) { $row =~ s/,([,\n])/,null$1/; }
print( "$row\n" );
$row =~ s/\d{2}\/\d{2}\/\d{4}.*?,/sysdate,/g;
print( "$row\n" );
Which results in this:
$ ./perlregex.pl
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
This could certainly be optimized, but it gets the point across.
You want to insert text at a position without consuming the characters around it. Lookarounds are usually a better option for this:
$subject =~ s/(?<=,)(?=,|$)/null/g;
Explanation :
"
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
, # Match the character “,” literally
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
, # Match the character “,” literally
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
"
Secondly, you wish to replace the dates:
$subject =~ s!\d{2}/\d{2}/\d{4}.*?(?=,)!sysdate!g;
That's almost the same as your original regex. Just replace the final , with a lookahead. (If you don't want to replace it, don't match it.)
# \d{2}/\d{2}/\d{4}.*?(?=,)
#
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{4}»
# Exactly 4 times «{4}»
# Match any single character that is not a line break character «.*?»
# Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=,)»
# Match the character “,” literally «,»
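The problem is not greediness (.*? is already lazy) but the unescaped d before {4}. A negated character class also states the intent more directly; use * rather than + so that dates without a time still match:
$_ =~ s/\d{2}\/\d{2}\/\d{4}[^,]*,/sysdate,/g;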
I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $1.
: # Match a colon.
(.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $2.
$ # Match the end of the line.
/gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (it looks like this would be if the string does not contain a colon), then it won't process the code that's within the braces. But if it does match, then the code within the braces can use and process the special $1 and $2 variables (at least, until the next successful regular expression match, if there is one within the braces).
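The cases described above can be checked with a short sketch. The helper name split_on_first_colon is made up, and the g modifier is left out because a single match is all that is needed here.

```perl
use strict;
use warnings;

# Hypothetical helper: returns ($1, $2) on a match, or an empty list.
sub split_on_first_colon {
    my ($line) = @_;
    return $line =~ m/(.*?):(.*?)$/ ? ($1, $2) : ();
}

for my $input (':', 'abc:', 'abc:def:ghi', 'no colon') {
    my @parts = split_on_first_colon($input);
    if (@parts) {
        print "'$input' -> '$parts[0]' and '$parts[1]'\n";
    }
    else {
        print "'$input' -> no match\n";
    }
}
```

The first (.*?) stops at the first colon because it is lazy; the second one is also lazy but is forced to run to the end of the line by the $ anchor.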
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
This could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression
to a pattern match. Certain operations
search or modify the string $_ by
default. This operator makes that kind
of operation work on some other
string. The right argument is a search
pattern, substitution, or
transliteration. The left argument is
what is supposed to be searched,
substituted, or transliterated instead
of the default $_. When used in scalar
context, the return value generally
indicates the success of the
operation.
The regex itself is:
m/ #Perform a "match" operation
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
: #Match a literal colon character
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
$ #Match the end of string
/g #Perform the regex globally (in this scalar context: find the next occurrence in $lines, starting where the previous match ended)
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.
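One side effect of the g modifier in a scalar-context test like this is that Perl remembers the position of the last match in the string, so repeating the same test on the same variable can alternate between success and failure. A minimal sketch with a made-up input value:

```perl
use strict;
use warnings;

my $lines = 'key:value';
my @results;

for my $attempt (1 .. 3) {
    # With /g in scalar context, each test resumes at pos($lines).
    # A failed attempt resets the position to the start again.
    push @results, ($lines =~ m/(.*?):(.*?)$/g ? 'match' : 'no match');
}

print join(', ', @results), "\n";  # match, no match, match
```

Dropping the g modifier makes every attempt start from the beginning of the string, which is usually what an if test like this wants.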