Regex parameter parsing

I need to parse a file that includes function calls. For example:
function(otherFunction1(parameters1), otherFunction2(parameters2))
I need the output to be:
otherFunction1(parameters1), otherFunction2(parameters2)
My attempt is this:
open(my $DATA, '<', 'txt') or die "...";
while(my $line = <$DATA>){
$line =~ /\((\w+)\)/;
my $parameters = $1;
print "$parameters\n";
}
I am just getting
parameters1
Is there a way to use regexp to perhaps find the first and last occurrence of the specified character?
Thanks!

You'll need a recursive regex to do it properly. Like this one (with the x flag):
(?(DEFINE)
(?<fn> # a function is:
\w+ \s* # a name
\( (?&paramList) \) # and a parameter list
)
(?<paramList>
(?:
\s* (?&param)
(?: , \s* (?&param) )* \s*
)*
)
(?<param> # a parameter is:
(?&fn) # a function call
| \w+ # or a simple value
)
)
\w+ \s* \( (?<extractedParameters>(?&paramList)) \)
This is required to match the opening and closing parentheses. Just expand the syntax as needed.
The pattern at the bottom is equivalent to (?&fn) except it encloses the parameter list in a capture group.
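As a rough sketch of how the pattern above could be used from Perl (5.10 or later, which is needed for named groups and (?&name) recursion): the helper name extract_parameters is hypothetical, and the pattern is the one from this answer with only the layout condensed.

```perl
use strict;
use warnings;
use 5.010;

# The recursive pattern from the answer, compiled once with the /x flag.
my $call_re = qr{
    (?(DEFINE)
        (?<fn>        \w+ \s* \( (?&paramList) \) )
        (?<paramList> (?: \s* (?&param) (?: , \s* (?&param) )* \s* )* )
        (?<param>     (?&fn) | \w+ )
    )
    \w+ \s* \( (?<extractedParameters> (?&paramList) ) \)
}x;

# Hypothetical helper: returns the outermost parameter list, or undef
# when the line contains no function call the pattern can match.
sub extract_parameters {
    my ($line) = @_;
    return $line =~ $call_re ? $+{extractedParameters} : undef;
}

my $parameters = extract_parameters(
    'function(otherFunction1(parameters1), otherFunction2(parameters2))');
print "$parameters\n";
```

The named capture is read through %+ because the groups inside the DEFINE block are never matched directly; only extractedParameters is set by a successful match.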

You almost have it. You do want everything between the first and the last parentheses on each line, right? Unless the lines to parse are more complex than your example, this small change in code might be all you need.
$line =~ /\((.*)\)/;
my $parameters = $1;
Your \w+ will stop matching at the first non-word character in the string. In your example, that is the first right-hand parenthesis.
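A small comparison of the two patterns on the sample line, as a sketch. The second input, 'f(a) + g(b)', is an invented example of the "more complex" case where the greedy version grabs too much.

```perl
use strict;
use warnings;

my $line = 'function(otherFunction1(parameters1), otherFunction2(parameters2))';

# \w+ cannot cross a parenthesis, so only an innermost argument can match.
my ($word_only) = $line =~ /\((\w+)\)/;

# .* is greedy: it runs from the first "(" to the last ")" on the line.
my ($greedy) = $line =~ /\((.*)\)/;

print "$word_only\n";   # parameters1
print "$greedy\n";      # otherFunction1(parameters1), otherFunction2(parameters2)

# Caveat (made-up input): two separate calls on one line are merged.
my ($merged) = 'f(a) + g(b)' =~ /\((.*)\)/;
print "$merged\n";      # a) + g(b
```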

Related

How to get this perl extended regex to work?

I have the following code
my @txt = ("Line 1. [foo] bar",
"Line 2. foo bar",
"Line 3. foo [bar]"
);
my $regex = qr/^
Line # Bare word
(\d+)\. # line number
\[ # Open brace
(\w+) # Text in braces
] # close brace
.* # slurp
$
/x;
my $nregex = qr/^\s*Line\s*(\d+)\.\s*\[\s*(\w+)\s*].*$/;
foreach (@txt) {
if ($_ =~ $regex) {
print "Lnum $1 => $2\n";
}
if ($_ =~ $nregex) {
print "N Lnum $1 => $2\n";
}
}
Output
N Lnum 1 => foo
I am expecting both regexes to be equivalent and to capture only the first line of the array. However, only $nregex works!
How can $regex be fixed so that it also works identically (with the x option)?
Edit
Based on the response, updated the regex and it works.
my $regex = qr/^ \s*
Line \s* # Bare word
(\d+)\. \s* # line number
\[ \s* # Open brace
(\w+) \s* # Text in braces
] \s* # close brace
.* # slurp
$
/x;
Your two expressions are NOT the same. You need to have the \s* bits in the first one. The /x allows you to write neatly formatted expressions - with comments as you've noticed. As such, the spaces in the /x version are not considered significant, and will not contribute to any matching activity.
In other words, your /x version is the equivalent of
qr/^Line(\d+)\.\[(\w+)].*$/x
By the way, a single literal space (in a pattern without /x) instead of \s* or \s+ would also fail in many cases; your sample data contains TWO spaces next to each other in a few places, and a single space in the pattern will not match those.
Final tip: when you MUST have at least one space in a certain position, you should use \s+ to enforce at least one space. You can surely figure out where that might be useful in your patterns once you know it is possible.
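To make the whitespace rule concrete, here is a minimal sketch. The test string, with two spaces after the period, is an assumption about what the original data looked like.

```perl
use strict;
use warnings;

my $text = 'Line 1.  [foo] bar';   # assumed sample: two spaces after "1."

# Under /x the blanks in the pattern are dropped, so this is /^Line(\d+)\.\[(\w+)]/.
my $ignored = $text =~ /^ Line (\d+) \. \[ (\w+) ] /x;

# \s+ demands at least one whitespace character and accepts several.
my $flexible = $text =~ /^ Line \s+ (\d+) \. \s+ \[ (\w+) ] /x;

# A space inside a bracketed class stays significant under /x.
my $literal = $text =~ /^ Line [ ] (\d+) \. [ ]+ \[ (\w+) ] /x;

print "ignored:  ", ($ignored  ? "match" : "no match"), "\n";
print "flexible: ", ($flexible ? "match" : "no match"), "\n";
print "literal:  ", ($literal  ? "match" : "no match"), "\n";
```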

Globally matching name value pairs without a separator between them

I have the following formatted sample string:
== header == information about things ==headeragain== info can have characters like.*?{=
etc on just one line.
I want to parse this into a hash such that the keys are the "==.+?==" and the values are the info after the keys. I've tried a couple of regular expressions to globally match these pairs:
%hash = $string =~ /(==.+?==)(.+)/g
and
%hash = $string =~ /(==.+?==)(.+?)/g
The first matches the first key and then everything else as its value; the second matches just the keys.
%hash = $string =~ /(==.+?==)(.+(?===.+?==))/g
is supposed to look ahead for the next key, but not "eat it up" as I understand it. However, it will only match the first pair and go no further.
I think this problem has come from a misunderstanding of how the global modifier acts. Do I need to tweak something in one of my expressions? Or do I need to be doing something completely different?
Even though you're using the non-greedy modifier, nothing after the 2nd subgroup in your 2nd example tells it where to stop, so it matches as little as it can.
Add a positive look-ahead, (?=$|==), after the value. Here (?= opens the look-ahead block, and $ (end of string) or == is what you're looking for.
I.e. the solution is: /(==.+?==)(.+?)(?=$|==)/g
while ($line =~ /
== \s*
( .+? )
\s* == \s*
( .*? )
(?= \s* (?: == | \z ) )
/xg) {
my $key = $1;
my $val = $2;
...
}
But I dislike using the "?" quantifier modifier. It doesn't actually prevent the wrong thing from being matched when given wrong or unexpected input. So I'd use:
while ($line =~ /
== \s*
( \S (?: (?! \s* == ). )* )
\s* == \s*
( (?: (?! \s* == ). )* )
/xg) {
my $key = $1;
my $val = $2;
...
}
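Filling in the "..." of the first loop above to build the hash, as a sketch run against the sample string from the question. Storing into %hash is my assumption about what the loop body would do.

```perl
use strict;
use warnings;

my $line = '== header == information about things ==headeragain== info can have characters like.*?{=';

my %hash;
while ($line =~ /
        == \s* ( .+? ) \s* == \s*        # key, surrounding blanks trimmed
        ( .*? )                          # value, as short as possible
        (?= \s* (?: == | \z ) )          # stop before the next key or at the end
    /xg) {
    $hash{$1} = $2;
}

print "$_ => $hash{$_}\n" for sort keys %hash;
```

Because the look-ahead does not consume the next "==", the following iteration of the /g loop can start matching at that key.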

Matching the rightmost and the leftmost symbols in perl with regular expression

I'm trying to match a string such that the leftmost symbol and the rightmost symbol are the same. How do I do that?
It’s impossible to know exactly what you mean without clarification of what you consider a “symbol”, but here is one possible solution:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Grapheme_Base} ) \X ) .* \1 \s* \z /sx;
and here is another:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?= \p{Symbol} ) \X ) .* \1 \s* \z /sx;
and here is one more:
use Unicode::Normalize;
NFD($string) =~ / \A \s* ( (?: (?= \p{Symbol} ) \X )+ ) .* \1 \s* \z /sx;
And it is even possible that in some very limited circumstances you might be able to get away with:
$string =~ / ^ (\pS) .* \1 $ /xs;
But if you do, it’s also likely that someday you’re going to wish you had been more careful.
$string =~ m/^(.).*\1$/
should work. This fails to match strings of length 1, though.
Why do you want to do this with a regex? Is it homework? I avoid regexes for trivial patterns like this.
use Unicode::Normalize qw(NFC);
$s = NFC( $s );
substr( $s, 0, 1 ) eq substr( $s, -1, 1 );
Because Tom will complain about characters versus graphemes, you can handle that too:
use v5.10.1;
use Unicode::GCString;
use Unicode::Normalize qw(NFC);
my $gcs = Unicode::GCString->new( NFC( $s ) );
$gcs->substr( 0, 1 ) eq $gcs->substr( -1, 1 )
These regexes match strings of length 1 and greater. In the expressions, (.) represents a capture group where the dot should be substituted with your class of symbols, I guess (see the Unicode gurus' answers, although that does not seem to be the intent of the question).
The context of this regex is single line (/s modifier). It allows the dot to match newlines as well as anything else (like [\s\S]), so newlines can be embedded as well as being the outermost delimiter.
\z behaves like $, except that it corrects a scenario where $ could match before a final newline (matching at the very end of the string is the more common case). If the character in question is a newline and you use an un-greedy quantifier (like .*?) and the target string is "\nasdf\n\n", $ could falsely match before the final newline. That is a moot issue here since the match is all greedy. Still, leave it in for grins.
/^(?=(.)).*\1\z/s
expanded
/
^ # Beginning of string
(?=(.)) # Lookahead - capture grp1, first (any) character (but don't consume it)
.* # Optionally consume all the characters up until before the last character
\1 # Backreference to capture grp1, this must exist
\z # End of string
/s # s modifier
Example stipulating just word class characters
/^(?=(\w)).*\1\z/s
Again, just substitute your acceptable symbols
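A quick side-by-side of the look-ahead pattern and the simpler /^(.).*\1$/ from the earlier answer, as a sketch. The function names same_ends and same_ends_simple are made up for the example.

```perl
use strict;
use warnings;

# Look-ahead version: the first character is captured but not consumed,
# so a one-character string can satisfy both the capture and \1.
sub same_ends {
    my ($s) = @_;
    return $s =~ /^(?=(.)).*\1\z/s ? 1 : 0;
}

# Simpler version: the capture consumes the first character,
# so \1 needs a second character to match against.
sub same_ends_simple {
    my ($s) = @_;
    return $s =~ /^(.).*\1$/ ? 1 : 0;
}

for my $s ('abca', 'a', 'ab') {
    printf "%-5s lookahead=%d simple=%d\n", $s, same_ends($s), same_ends_simple($s);
}
```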

Regex not matching data and dates

I have an SQL Select dump with many lines each looks like this:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
I want to do 2 things to each line:
Replace all dates with Oracle's sysdate function. Dates can also come without hour (like 07/11/2011).
Replace all null values with null string
Here's my attempt:
$_ =~ s/,(,|\n)/,null$1/g; # Replace no data by "null"
$_ =~ s/\d{2}\/\d{2}\/d{4}.*?,/sysdate,/g; # Replace dates by "sysdate"
But this would transform the string to:
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,,null,'text',null,,0,0,null
while I expect it to be
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
I don't understand why dates do not match and why some ,, are not replaced by null.
Any insights welcome, thanks in advance.
\d{2}\/\d{2}\/d{4}.*?, didn't work because the last d wasn't escaped.
If a , can be on either side, or begin/end of string, you could do it in 2 steps:
step 1
s/(?:^|(?<=,))(?=,|\n)/null/g
expanded:
s/
(?: ^ # Beginning of line, ie: nothing behind us
| (?<=,) # Or, a comma behind us
)
# we are HERE!, this is the place between characters
(?= , # A comma in front of us
| \n # Or, a newline in front of us
)
/null/xg
# The above regex does not consume, it just inserts 'null', leaving the
# same search position (after the insertion, but before the comma).
# If you want to consume a comma, it would be done this way:
s/(?:^|(?<=,))(,|\n)/null$1/xg
# Now the search position is after the 'null,'
step 2
s/(?:^|(?<=,))\d{2}\/\d{2}\/\d{4}.*?(?=,|\n)/sysdate/g
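To see why the original attempt skips every second empty field and how the two steps above avoid that, here is a sketch run on the sample row from the question. The variable names are mine.

```perl
use strict;
use warnings;

my $row = "07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,"
        . "'text','text','text','text',,,,'text',,,0,0,\n";

# Original attempt: each match consumes the comma that follows the empty
# field, so that comma cannot start the next match and alternate fields are missed.
(my $consuming = $row) =~ s/,(,|\n)/,null$1/g;

# Step 1: zero-width assertions on both sides, nothing is consumed.
(my $fixed = $row) =~ s/(?:^|(?<=,))(?=,|\n)/null/g;

# Step 2: dates, with the \d that the original pattern left unescaped.
$fixed =~ s/(?:^|(?<=,))\d{2}\/\d{2}\/\d{4}.*?(?=,|\n)/sysdate/g;

print $consuming;
print $fixed;
```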
Or, you could combine them into a single regex, using the eval modifier:
$row =~ s/(?:^|(?<=,))(\d{2}\/\d{2}\/\d{4}.*?|)(?=,|\n)/ length $1 ? 'sysdate' : 'null'/eg;
Broken down it looks like this
s{
(?: ^ | (?<=,) ) # begin of line or comma behind us
( # Capt group $1
\d{2}/\d{2}/\d{4}.*? # date format and optional non-newline chars
| # Or, nothing at all
) # End Capt group 1
(?= , | \n ) # comma or newline in front of us
}{
length $1 ? 'sysdate' : 'null'
}eg
If there is a chance of non-newline whitespace padding, it could be written as:
$row =~ s/(?:^|(?<=,))(?:([^\S\n]*\d{2}\/\d{2}\/\d{4}.*?)|[^\S\n]*)(?=,|\n)/ defined $1 ? 'sysdate' : 'null'/eg;
You could do this:
$ cat perlregex.pl
use warnings;
use strict;
my $row = "07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,\n";
print( "$row\n" );
while ( $row =~ /,([,\n])/ ) { $row =~ s/,([,\n])/,null$1/; }
print( "$row\n" );
$row =~ s/\d{2}\/\d{2}\/\d{4}.*?,/sysdate,/g;
print( "$row\n" );
Which results in this:
$ ./perlregex.pl
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',,,,'text',,,0,0,
07/11/2011 16:48:08,07/11/2011 16:48:08,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
sysdate,sysdate,'YD','MANUAL',0,1,'text','text','text','text',null,null,null,'text',null,null,0,0,null
This could certainly be optimized, but it gets the point across.
You want to replace something. Usually lookarounds (lookbehind and lookahead) are a better option for this:
$subject =~ s/(?<=,)(?=,|$)/null/g;
Explanation:
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
, # Match the character “,” literally
)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
, # Match the character “,” literally
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
Secondly, you wish to replace the dates:
$subject =~ s!\d{2}/\d{2}/\d{4}.*?(?=,)!sysdate!g;
That's almost the same as your original regex (with the last \d properly escaped). Just replace the last , with a lookahead. (If you don't want to replace it, don't match it.)
# \d{2}/\d{2}/\d{4}.*?(?=,)
#
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{2}»
# Exactly 2 times «{2}»
# Match the character “/” literally «/»
# Match a single digit 0..9 «\d{4}»
# Exactly 4 times «{4}»
# Match any single character that is not a line break character «.*?»
# Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=,)»
# Match the character “,” literally «,»
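.*? is lazy, not greedy, so it is not what breaks the match (the unescaped d in \/d{4} is). Still, a negated character class makes the stopping point explicit; try:
$_ =~ s/\d{2}\/\d{2}\/\d{4}[^,]*,/sysdate,/g;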

What does this Perl regex mean: m/(.*?):(.*?)$/g?

I am editing a Perl file, but I don't understand this regexp comparison. Can someone please explain it to me?
if ($lines =~ m/(.*?):(.*?)$/g) { } ..
What happens here? $lines is a line from a text file.
Break it up into parts:
$lines =~ m/ (.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $1.
: # Match a colon.
(.*?) # Match any character (except newlines)
# zero or more times, not greedily, and
# stick the results in $2.
$ # Match the end of the line.
/gx;
So, this will match strings like ":" (it matches zero characters, then a colon, then zero characters before the end of the line, $1 and $2 are empty strings), or "abc:" ($1 = "abc", $2 is an empty string), or "abc:def:ghi" ($1 = "abc" and $2 = "def:ghi").
And if you pass in a line that doesn't match (it looks like this would be if the string does not contain a colon), then it won't process the code that's within the brackets. But if it does match, then the code within the brackets can use and process the special $1 and $2 variables (at least, until the next regular expression shows up, if there is one within the brackets).
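The cases listed above can be checked with a short script, as a sketch. The helper name colon_parts is hypothetical, and the /g flag is left out because a single match per line is all this test needs.

```perl
use strict;
use warnings;

# Hypothetical helper: in list context a successful match returns its
# captures ($1, $2); a failed match returns the empty list.
sub colon_parts {
    my ($line) = @_;
    my @captures = $line =~ m/(.*?):(.*?)$/;
    return @captures;
}

for my $line (':', 'abc:', 'abc:def:ghi', 'no colon here') {
    my @parts = colon_parts($line);
    if (@parts) {
        print "'$line' -> \$1='$parts[0]' \$2='$parts[1]'\n";
    }
    else {
        print "'$line' -> no match\n";
    }
}
```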
There is a tool to help understand regexes: YAPE::Regex::Explain.
Ignoring the g modifier, which is not needed here:
use strict;
use warnings;
use YAPE::Regex::Explain;
my $re = qr/(.*?):(.*?)$/;
print YAPE::Regex::Explain->new($re)->explain();
__END__
The regular expression:
(?-imsx:(.*?):(.*?)$)
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
( group and capture to \1:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \1
----------------------------------------------------------------------
: ':'
----------------------------------------------------------------------
( group and capture to \2:
----------------------------------------------------------------------
.*? any character except \n (0 or more times
(matching the least amount possible))
----------------------------------------------------------------------
) end of \2
----------------------------------------------------------------------
$ before an optional \n, and the end of the
string
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
See also perldoc perlre.
It was written by someone who either knows too much about regular expressions or not enough about the $' and $` variables.
This could have been written as
if ($lines =~ /:/) {
... # use $` ($PREMATCH) instead of $1
... # use $' ($POSTMATCH) instead of $2
}
or
if ( ($var1,$var2) = split /:/, $lines, 2 and defined($var2) ) {
... # use $var1, $var2 instead of $1,$2
}
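A runnable sketch of the split variant. The input strings are made up; the last case shows one difference from the regex that is easy to overlook: split keeps a trailing newline in the second field, whereas (.*?)$ stops before it.

```perl
use strict;
use warnings;

# LIMIT of 2: split at the first colon only, keep the rest intact.
my ($var1, $var2) = split /:/, 'abc:def:ghi', 2;
print "$var1 | $var2\n";    # abc | def:ghi

# No colon: only one field comes back, so the second variable is undef.
my ($only, $missing) = split /:/, 'no colon here', 2;
print defined $missing ? "two fields\n" : "one field\n";

# Unchomped input: the newline stays in the second field.
my ($key, $value) = split /:/, "abc:def\n", 2;
print length($value), "\n"; # 4, i.e. "def" plus the newline
```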
(.*?) captures any characters, but as few of them as possible.
So it looks for patterns like <something>:<somethingelse><end of line>, and if there are multiple : in the string, the first one will be used as the divider between <something> and <somethingelse>.
That line says to perform a regular expression match on $lines with the regex m/(.*?):(.*?)$/g. It will effectively return true if a match can be found in $lines and false if one cannot be found.
An explanation of the =~ operator:
Binary "=~" binds a scalar expression
to a pattern match. Certain operations
search or modify the string $_ by
default. This operator makes that kind
of operation work on some other
string. The right argument is a search
pattern, substitution, or
transliteration. The left argument is
what is supposed to be searched,
substituted, or transliterated instead
of the default $_. When used in scalar
context, the return value generally
indicates the success of the
operation.
The regex itself is:
m/ #Perform a "match" operation
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
: #Match a literal colon character
(.*?) #Match zero or more repetitions of any characters, but match as few as possible (ungreedy)
$ #Match the end of string
/g #Global flag: in list context finds all occurrences in $lines; in a scalar test like this if, finds the next one
So if $lines matches against that regex, it will go into the conditional portion, otherwise it will be false and will skip it.
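One side effect of /g in a scalar test is worth seeing once: the match position is remembered on the variable, so repeating the same test on the same variable resumes after the previous match. A minimal sketch, with an invented input string:

```perl
use strict;
use warnings;

my $lines = 'abc:def';
my @results;

for my $attempt (1 .. 3) {
    # Scalar-context m//g resumes at pos($lines). The first attempt matches
    # and leaves pos at the end of the string; the second finds nothing
    # further and resets pos; the third starts over and matches again.
    push @results, ($lines =~ m/(.*?):(.*?)$/g) ? 1 : 0;
}

print "@results\n";   # 1 0 1
```

Without /g every attempt would start from the beginning of the string and succeed, which is presumably what the original code wanted.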