Globally matching name-value pairs without a separator between them - regex

I have the following formatted sample string:
== header == information about things ==headeragain== info can have characters like.*?{=
etc on just one line.
I want to parse this into a hash such that the keys are the "==.+?==" parts and the values are the info after each key. I've tried a couple of regular expressions to globally match these pairs:
%hash = $string =~ /(==.+?==)(.+)/g
and
%hash = $string =~ /(==.+?==)(.+?)/g
These match the first key with everything else as its value, and just the keys, respectively.
%hash = $string =~ /(==.+?==)(.+(?===.+?==))/g
is supposed to look ahead for the next key, but not "eat it up" as I understand it. However, it will only match the first pair and go no further.
I think this problem has come from a misunderstanding of how the global modifier acts. Do I need to tweak something in one of my expressions? Or do I need to be doing something completely different?

Even though you're using the non-greedy modifier, nothing in your second example limits where the second subgroup ends.
Add a positive look-ahead, (?=$|==), after the value. Here (?= opens the look-ahead block, and $ (end of string) or == is the text that must come next; the look-ahead checks for it without consuming it.
That is, the solution is: /(==.+?==)(.+?)(?=$|==)/g
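As a minimal sketch of how this behaves on the sample string from the question (assuming the string is a single line with no trailing newline), the list-context /g match can be assigned straight to a hash:

```perl
use strict;
use warnings;

# Sample string from the question (one line, no trailing newline assumed).
my $string = '== header == information about things ==headeragain== info can have characters like.*?{=';

# In list context, /g returns every captured (key, value) pair in order;
# assigning that list to a hash turns the pairs into entries.
my %hash = $string =~ /(==.+?==)(.+?)(?=$|==)/g;

for my $key (sort keys %hash) {
    print "[$key] => [$hash{$key}]\n";
}
```

Note that the keys keep their == delimiters and the values keep their surrounding whitespace, because nothing in the pattern trims them.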

while ($line =~ /
== \s*
( .+? )
\s* == \s*
( .*? )
(?= \s* (?: == | \z ) )
/xg) {
my $key = $1;
my $val = $2;
...
}
But I dislike using the "?" quantifier modifier. It doesn't actually prevent the wrong thing from being matched when given wrong or unexpected input. So I'd use:
while ($line =~ /
== \s*
( \S (?: (?! \s* == ). )* )
\s* == \s*
( (?: (?! \s* == ). )* )
/xg) {
my $key = $1;
my $val = $2;
...
}
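To make the second loop concrete, here is a small sketch that runs it on the question's sample string and stores the pairs in a hash. The hash name %pairs is an illustrative choice, not part of the original answer.

```perl
use strict;
use warnings;

my $line = '== header == information about things ==headeragain== info can have characters like.*?{=';

my %pairs;    # hypothetical container for the parsed result
while ($line =~ /
    == \s*
    ( \S (?: (?! \s* == ). )* )    # key: stops before optional blanks and "=="
    \s* == \s*
    ( (?: (?! \s* == ). )* )       # value: stops before the next key or at the end
/xg) {
    $pairs{$1} = $2;
}

print "$_ => $pairs{$_}\n" for sort keys %pairs;
```

Because each character is consumed only if it does not start "optional whitespace plus ==", the captured keys and values come out without the delimiters and without surrounding blanks.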

Perl Non-greedy Matching -- Is the "?" character used correctly?

I am trying to match the parameter name of a parameter declaration line such as below:
parameter BWIDTH = 32;
The Perl regular expression used is:
$line =~ /(\w+)\s*=/
where the parameter name, BWIDTH, is captured into $1. Most parameters I encountered are declared in such a way that the name precedes the equal sign, "=", which is the reason the regular expression is designed with the "=" in it (/(\w+)\s*=/).
However there are special cases where the parameter is declared:
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
In this case, the parameter name that I am trying to capture is PORT_WIDTH. Revising the regular expression to match this instance does not capture PORT_WIDTH successfully, although it does capture BWIDTH fine.
$line =~ /(\w+)(\s*\[.*?\])*\s*=/
where (\s*\[.*?\])* matches reg [31:0] PORT_WIDTH [BWIDTH-1:0] which is greedy matching.
I am baffled as to why the metacharacter ? does not halt the greedy matching? How should I revise the regular expression?
Replace the .*? with [^][]* to match 0+ chars other than ] and [:
/(\w+)(\s*\[[^][]*])*\s*=/
^^^^^^
You may also turn the second capturing group into a non-capturing one if you are not using that value.
Pattern details:
(\w+) - Group 1: one or more word chars
(\s*\[[^][]*])* - a capturing group (add ?: after ( to make it non-capturing) zero or more occurrences of:
\s* - 0+ whitespaces
\[ - a literal [
[^][]* - a negated character class matching zero or more chars other than ] and [
] - a literal ]
\s* - zero or more whitespaces
= - an equal sign.
Greediness vs. non-greediness affects where a match ends, but it still starts as early as possible. Basically, a greedy match is the leftmost-longest possible match, while non-greedy is leftmost-shortest. But non-greedy is still leftmost, not rightmost.
To get what you want, I would use a more explicit description of what I want matched: /(\w+)(\s*\[[^]]*\])?\s*=/ In English, that's a word (\w+), optionally followed by some text in square brackets ((\s*\[[^]]*\])?), and then optional whitespace and an equals sign. Note that I used a negated character class ([^]]) instead of a non-greedy match for what's inside the brackets - IMO, negated character classes are generally a better option than non-greedy matching.
Results with this regex:
$ perl -E '$x = q(parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
PORT_WIDTH
$ perl -E '$x = q(parameter BWIDTH = 32;); $x =~ /(\w+)(?:\s*\[[^]]*\])?\s*=/; say $1;'
BWIDTH
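The "leftmost" point can be illustrated with a short sketch. The test string below is made up for the illustration; the two patterns differ only in using .*? versus a negated character class inside the brackets.

```perl
use strict;
use warnings;

my $s = 'x [a] y [b] =';    # hypothetical input, not from the question

# Non-greedy: the match still begins at the FIRST "[" and then grows
# until a "]" followed by optional whitespace and "=" is found.
my ($lazy) = $s =~ /(\[.*?\]\s*=)/;

# Negated class: the bracket contents cannot cross another bracket, so the
# attempt at the first "[" fails and the match starts at the second one.
my ($bounded) = $s =~ /(\[[^][]*\]\s*=)/;

print "lazy:    $lazy\n";       # lazy:    [a] y [b] =
print "bounded: $bounded\n";    # bounded: [b] =
```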
You have information available to you which you are choosing not to use. You know the basic structure of each statement you are trying to parse. The statements have mandatory and optional parts. So, put the information you have in to the match. For example:
#!/usr/bin/env perl
use strict;
use warnings;
my $stuff_in_square_brackets = qr{ \[ [^\]]+ \] }x;
my $re = qr{
^
parameter \s+
(?: reg \s+)?
(?: $stuff_in_square_brackets \s+)?
(\w+) \s+
(?: $stuff_in_square_brackets \s+)?
= \s+
(\w+) ;
$
}x;
while (my $line = <DATA>) {
if (my($p, $v) = ($line =~ $re)) {
print "'$p' = '$v'\n";
}
}
__DATA__
parameter BWIDTH = 32;
parameter reg [31:0] PORT_WIDTH [BWIDTH-1:0] = 32;
Output:
'BWIDTH' = '32'
'PORT_WIDTH' = '32'

Matching first letter of word

I want to match the first letter of a word in one string against the same letter in another string. In this example the letter is H:
25HB matches to HC
I am using the match operator shown below:
my ($match) = ( $value =~ m/^d(\w)/ );
The intent is to skip the digits and capture the first word character after them. How could I correct this?
That regex doesn't do what you think it does:
m/^d(\w)/
Matches 'start of line' - letter d then a single word character.
You may want:
m/^\d+(\w)/
Which will then match one or more digits from the start of line, and grab the first word character after that.
E.g.:
my $string = '25HC';
my ( $match ) =( $string =~ m/^\d+(\w)/ );
print $match,"\n";
Prints H
You are not clear about what you want. If you want to match the first letter in a string to the same letter later in the string:
m{
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\1 # match the captured letter
}msx; # /m lets ^ and $ match at line boundaries, /s lets . match newlines, /x ignores whitespace in the pattern
See perldoc perlre for more details.
Addendum:
If by word, you mean any alphanumeric sequence, this may be closer to what you want:
m{
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
( # start a capture
[[:alpha:]] # match a single letter
) # end of capture
.*? # skip minimum number of any character
\b # match a word boundary (start or end of a word)
\d* # greedy match any digits
\1 # match the captured letter
}msx; # /m lets ^ and $ match at line boundaries, /s lets . match newlines, /x ignores whitespace in the pattern
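A quick way to try the addendum pattern is sketched below. It assumes the two strings from the question are joined with a space into one string; that joining step and the second test string are assumptions made for the illustration.

```perl
use strict;
use warnings;

# The addendum pattern, laid out with shorter comments.
my $re = qr{
    \b \d*            # word boundary, optional leading digits
    ( [[:alpha:]] )   # capture the first letter
    .*?               # skip as little as possible
    \b \d*            # another word boundary, optional leading digits
    \1                # the same letter again
}msx;

for my $text ('25HB HC', '25HB 7XC') {
    if ($text =~ $re) {
        print "'$text': shared first letter $1\n";
    }
    else {
        print "'$text': no shared first letter\n";
    }
}
```

With these inputs the first string matches with H captured and the second does not match. Since \b also matches at the end of a word, other inputs may match in places you do not intend, so treat this as a starting point.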
You could try ^.*?([A-Za-z]).
The following code returns:
ITEM: 22hb
MATCH: h
ITEM: 33HB
MATCH: H
ITEM: 3333
MATCH:
ITEM: 43 H
MATCH: H
ITEM: HB33
MATCH: H
Script.
#!/usr/bin/perl
my @array = ('22hb','33HB','3333','43 H','HB33');
for my $item (@array) {
    # List-context match: $match is undef when the string has no letter
    my ($match) = $item =~ /^.*?([A-Za-z])/;
    $match //= '';
    print "ITEM: $item \nMATCH: $match\n\n";
}
I believe this is what you are looking for:
(If you can provide a clearer example of what you are looking for, we may be able to help you better.)
The following code takes two strings and finds the first non-digit character common in both the strings:
my $string1 = '25HB';
my $string2 = 'HC';
#strip all digits
$string1 =~ s/\d//g;
foreach my $alpha (split //, $string1) {
# for each non-digit check if we find a match
if ($string2 =~ /$alpha/) {
print "First matching non-numeric character: $alpha\n";
exit;
}
}

Regex parameter parsing

I need to parse a file that includes function calls. For example:
function(otherFunction1(parameters1), otherFunction2(parameters2))
I need the output to be:
otherFunction1(parameters1), otherFunction2(parameters2)
My attempt is this:
open(my $DATA, '<', 'txt') or die "...";
while(my $line = <$DATA>){
$line =~ /\((\w+)\)/;
my $parameters = $1;
print "$parameters\n";
}
I am just getting
parameters1
Is there a way to use regexp to perhaps find the first and last occurrence of the specified character?
Thanks!
You'll need a recursive regex to do it properly. Like this one (with the x flag):
(?(DEFINE)
(?<fn> # a function is:
\w+ \s* # a name
\( (?&paramList) \) # and a parameter list
)
(?<paramList>
(?:
\s* (?&param)
(?: , \s* (?&param) )* \s*
)*
)
(?<param> # a parameter is:
(?&fn) # a function call
| \w+ # or a simple value
)
)
\w+ \s* \( (?<extractedParameters>(?&paramList)) \)
This is required to match the opening and closing parentheses. Just expand the syntax as needed.
The pattern at the bottom is equivalent to (?&fn) except it encloses the parameter list in a capture group.
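For reference, a sketch of how the recursive pattern could be used from Perl on the example line. It relies on named groups and (?(DEFINE)...), which need Perl 5.10 or later; the variable names are illustrative.

```perl
use strict;
use warnings;

my $re = qr{
    (?(DEFINE)
        (?<fn>        \w+ \s* \( (?&paramList) \) )
        (?<paramList> (?: \s* (?&param) (?: , \s* (?&param) )* \s* )* )
        (?<param>     (?&fn) | \w+ )
    )
    \w+ \s* \( (?<extractedParameters> (?&paramList) ) \)
}x;

my $line = 'function(otherFunction1(parameters1), otherFunction2(parameters2))';

# %+ holds the named captures of the last successful match.
my $parameters = $line =~ $re ? $+{extractedParameters} : undef;
print "$parameters\n" if defined $parameters;
```

On the example line this captures the full argument list of the outer call. Parameters that are not word characters or nested calls (strings, operators, and so on) would need the param rule extended.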
You almost have it. You do want everything between the first and the last parentheses on each line, right? Unless the lines to parse are more complex than your example, this small change in code might be all you need.
$line =~ /\((.*)\)/;
my $parameters = $1;
Your \w+ will stop matching at the first non-word character in the string. In your example, that is the first right-hand parenthesis.

How to get this perl extended regex to work?

I have the following code
my @txt = ("Line 1. [foo] bar",
"Line 2. foo bar",
"Line 3. foo [bar]"
);
my $regex = qr/^
Line # Bare word
(\d+)\. # line number
\[ # Open brace
(\w+) # Text in braces
] # close brace
.* # slurp
$
/x;
my $nregex = qr/^\s*Line\s*(\d+)\.\s*\[\s*(\w+)\s*].*$/;
foreach (@txt) {
if ($_ =~ $regex) {
print "Lnum $1 => $2\n";
}
if ($_ =~ $nregex) {
print "N Lnum $1 => $2\n";
}
}
Output
N Lnum 1 => foo
I am expecting both the regexs to be equivalent and capture only the first line of the array. However only $nregex works!
How can $regex be fixed so that it also works identically (with the x option)?
Edit
Based on the response, updated the regex and it works.
my $regex = qr/^ \s*
Line \s* # Bare word
(\d+)\. \s* # line number
\[ \s* # Open brace
(\w+) \s* # Text in braces
] \s* # close brace
.* # slurp
$
/x;
Your two expressions are NOT the same. You need to have the \s* bits in the first one. The /x allows you to write neatly formatted expressions - with comments as you've noticed. As such, the spaces in the /x version are not considered significant, and will not contribute to any matching activity.
In other words, your /x version is the equivalent of
qr/^Line(\d+)\.\[(\w+)].*$/x
By the way, just having a plain space instead of \s* or \s+ would also fail many times; your sample data contains TWO spaces next to each other in a few places. These two places will not match a single space.
Final tip: when you MUST have at least one space in a certain position, you should use \s+ to enforce at least one space. You can surely figure out where that might be useful in your patterns once you know it is possible.
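The effect of /x on blanks can be checked with a small sketch like the one below, which tries four ways of writing "a space after Line" against the first sample line. The variable names are illustrative.

```perl
use strict;
use warnings;

my $text = 'Line 1. [foo] bar';

# Under /x the blanks in the pattern are ignored, so this looks for "Line1".
my $ignored = $text =~ /^ Line (\d+) /x ? 1 : 0;

# A backslash-escaped space, a space inside [ ], or \s stays significant.
my $escaped   = $text =~ /^ Line \  (\d+) /x  ? 1 : 0;
my $bracketed = $text =~ /^ Line [ ] (\d+) /x ? 1 : 0;
my $class     = $text =~ /^ Line \s+ (\d+) /x ? 1 : 0;

print "ignored=$ignored escaped=$escaped bracketed=$bracketed class=$class\n";
# prints: ignored=0 escaped=1 bracketed=1 class=1
```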

single regex not working for a three different patterns

I need an optimal regexp to match all three of these types of text in a text file.
[TRUE,FALSE]
[4,5,6,7]
[2-15]
I am trying the following regex match, which is not working:
m/([0-9A-Fa-fx,]+)\s*[-~,]\s*([0-9A-Fa-fx,]+)/)
/
(?(DEFINE)
(?<WORD> [a-zA-Z]+ )
(?<NUM> [0-9]+ )
)
\[ \s*
(?: (?&WORD) (?: \s* , \s* (?&WORD) )+
| (?&NUM) (?: \s* , \s* (?&NUM) )+
| (?&NUM) \s* - \s* (?&NUM)
)
\s* \]
/x
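A small harness for the pattern above, as a sketch: the three inputs from the question plus two extra strings ([TRUE] and [4,FALSE]) that were added here as negative cases and are not from the original post.

```perl
use strict;
use warnings;

my $re = qr{
    (?(DEFINE)
        (?<WORD> [a-zA-Z]+ )
        (?<NUM>  [0-9]+ )
    )
    \[ \s*
    (?: (?&WORD) (?: \s* , \s* (?&WORD) )+
      | (?&NUM)  (?: \s* , \s* (?&NUM)  )+
      | (?&NUM) \s* - \s* (?&NUM)
    )
    \s* \]
}x;

my %result;    # hypothetical: input string => 'match' or 'no match'
for my $text ('[TRUE,FALSE]', '[4,5,6,7]', '[2-15]', '[TRUE]', '[4,FALSE]') {
    $result{$text} = $text =~ $re ? 'match' : 'no match';
    print "$text: $result{$text}\n";
}
```

The first three inputs match. [TRUE] does not, because the list alternatives use + and so require at least two items, and [4,FALSE] does not, because words and numbers cannot be mixed.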
4-7 is a subset of 2-15. This regex should capture them:
/TRUE|FALSE|[2-9]|1[0-5]/
A quick'n'dirty test program:
#!/usr/bin/env perl
use strict;
use warnings;
for my $line (<DATA>) {
chomp $line;
print "$line: ";
if ($line =~ /
^ # beginning of the string
\[ # a literal opening sq. bracket
( # alternatives:
(TRUE|FALSE) (,(TRUE|FALSE))* # one or more truth words
| (\d+) (,\d+)* # one or more numbers
| (\d+) - (\d+) # a range of numbers
) # end of alternatives
\] # a literal closing sq. bracket
$ # end of the string
/x) {
print "match\n";
}
else {
print "no match\n";
}
}
__DATA__
[TRUE]
foo
[FALSE,TRUE,FALSE]
[FALSE,TRUE,]
[42,FALSE]
[17,42,666]
bar
[17-42]
[17,42-666]
Output:
[TRUE]: match
foo: no match
[FALSE,TRUE,FALSE]: match
[FALSE,TRUE,]: no match
[42,FALSE]: no match
[17,42,666]: match
bar: no match
[17-42]: match
[17,42-666]: no match