How to find the next unbalanced brace? - regex

The regex below captures everything up to the last balanced }.
Now, what regex would be able to capture everything up to the next unbalanced }? In other words, how can I can get ... {three {four}} five} from $str instead of just ... {three {four}}?
my $str = "one two {three {four}} five} six";
if ( $str =~ /
(
.*?
{
(?> [^{}] | (?-1) )+
}
)
/sx
)
{
print "$1\n";
}

So you want to match
[noncurlies [block noncurlies [...]]] "}"
where a block is
"{" [noncurlies [block noncurlies [...]]] "}"
As a grammar:
start : text "}"
text : noncurly* ( block noncurly* )*
block : "{" text "}"
noncurly : /[^{}]/
As a regex (5.10+):
/
^
(
(
[^{}]*
(?:
\{ (?-1) \}
[^{}]*
)*
)
\}
)
/x
As a regex (5.10+):
/
^ ( (?&TEXT) \} )
(?(DEFINE)
(?<TEXT> [^{}]* (?: (?&BLOCK) [^{}]* )* )
(?<BLOCK> \{ (?&TEXT) \} )
)
/x

Related

need return value for captured group from last captured string in perl

I have XML files from which i want to capture init value( tag) for each parameter.I am copying some part of xml for reference.
I have port name and parameter name( tag(MNO) available with me.
eg . port name is XYZ & parameter name is MNO
port name is PQR & parameter name is ABC and GHI
There can be multiple tag under one container.
<R-PORT-PROTOTYPE UUID="Oac11eff016c6bb667f357a89xOac11f0ad174240e817fa858f00">
<SHORT-NAME>XYZ</SHORT-NAME>
<REQUIRED-COM-SPECS>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Init_Val</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>0.071</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/MNO</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
</REQUIRED-COM-SPECS>
</R-PORT-PROTOTYPE>
<R-PORT-PROTOTYPE UUID="Oac11eff016c6bb667f357a89xOac11f0ad174240e817f8f55900">
<SHORT-NAME>PQR</SHORT-NAME>
<REQUIRED-COM-SPECS>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Init_0</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>80</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/ABC</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
<PARAMETER-REQUIRE-COM-SPEC>
<INIT-VALUE>
<APPLICATION-VALUE-SPECIFICATION>
<SHORT-LABEL>Int_ghi</SHORT-LABEL>
<CATEGORY>VALUE</CATEGORY>
<SW-VALUE-CONT>
<SW-VALUES-PHYS>
<V>-80</V>
</SW-VALUES-PHYS>
</SW-VALUE-CONT>
</APPLICATION-VALUE-SPECIFICATION>
</INIT-VALUE>
<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">/SoftwareTypes/Interfaces/GHI</PARAMETER-REF>
</PARAMETER-REQUIRE-COM-SPEC>
</REQUIRED-COM-SPECS>
</R-PORT-PROTOTYPE>
regex :
if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>Port_name<\/short-name>.*?<V>(.*?)<\/.*?<PARAMETER-REF DEST="PARAMETER-DATA-PROTOTYPE">.*?Parameter_name<\/PARAMETER-REF>/gis) {
print $2;
}
I need output 80 if parameter is ABC and -80 if parameter is GHI
I suggest using XML::LibXML.
Here I've combined two Xpath queries to find V nodes:
SHORT-NAME is XYZ and PARAMETER-REF (with DEST == PARAMETER-DATA-PROTOTYPE) contains MNO.
SHORT-NAME is PQR and PARAMETER-REF (with DEST == PARAMETER-DATA-PROTOTYPE) contains ABC or GHI.
Example:
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;
my $dom = XML::LibXML->load_xml(location => 'doc.xml');
my $query = q{
//R-PORT-PROTOTYPE/SHORT-NAME[text()="XYZ"]/..
//PARAMETER-REF[#DEST="PARAMETER-DATA-PROTOTYPE"][
contains(text(),'MNO')
]/..//V
|
//R-PORT-PROTOTYPE/SHORT-NAME[text()="PQR"]/..
//PARAMETER-REF[#DEST="PARAMETER-DATA-PROTOTYPE"][
contains(text(),'ABC') or contains(text(),'GHI')
]/..//V
};
foreach my $vnode ($dom->findnodes($query)) {
print $vnode->to_literal() . "\n";
}
Output:
0.071
80
-80
The two ways to get either or both is
1 - Linear https://regex101.com/r/NYbvI8/1
# https://regex101.com/r/NYbvI8/1
# if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>PQR<\/short-name>(?:.*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO1<\/PARAMETER-REF>)?(?:.*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO2<\/PARAMETER-REF>)?(?(1)|(?(2)|(?!)))/gis)
<R-PORT-PROTOTYPE .*? <short-name>PQR</short-name>
(?:
.*?
<V>
( .*? ) # (1)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO1</PARAMETER-REF>
)?
(?:
.*?
<V>
( .*? ) # (2)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO2</PARAMETER-REF>
)?
(?(1)
| (?(2)
| (?!)
)
)
2 - Out of order https://regex101.com/r/gQJ3cO/1
# https://regex101.com/r/t4M9UB/1
# if($test_string=~ /<R-PORT-PROTOTYPE.*?<short-name>PQR<\/short-name>(?:(?:(?(1)(?!)).*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO1<\/PARAMETER-REF>)|(?:(?(2)(?!)).*?<V>(.*?)<\/V>.*?<PARAMETER-REF[ ]DEST="PARAMETER-DATA-PROTOTYPE">.*?MNO2<\/PARAMETER-REF>)){1,2}/gis)
<R-PORT-PROTOTYPE .*? <short-name>PQR</short-name>
(?:
(?:
(?(1) (?!) )
.*?
<V>
( .*? ) # (1)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO1</PARAMETER-REF>
)
|
(?:
(?(2) (?!) )
.*?
<V>
( .*? ) # (2)
</V>
.*?
<PARAMETER-REF [ ] DEST="PARAMETER-DATA-PROTOTYPE"> .*? MNO2</PARAMETER-REF>
)
){1,2}

Rewriting a recursive regex for older Perl version

The following piece of code works just fine with Perl (v5.16.2). However, when I run it using Perl v5.8.9, it complains about the following regex. How can I rewrite this regex in a way that works with Perl v5.8.9. (I can't update the version).
REGEX:
use strict;
use warnings;
our %formula_per_k;
INIT {
# List all functions that you want to allow in formulas. All other words will be interpretted as variables.
my #FORMULA_FUNCS = qw(sqrt exp log);
# Load the data via a file.
my $data = do {local $/; <DATA>};
# Parse K blocks
while ($data =~ m{
^K \s+ (\w+) \s* \{
( (?: [^{}]+ | \{(?2)\} )* ) # Matched braces only.
\}
}mgx) {
my ($name, $params) = ($1, $2);
# Parse LOL block
next if $params !~ m{
LOL \s* \{
( (?: [^{}]+ | \{(?1)\} )*? ) # Matched braces only.
\}
}mx;
my $lol = $1;
# Start building anonymous subroutine
my $conditions = '';
# Parse Conditions and Formulas
while ($lol =~ m{
COND \s* \{ (.*?) \} \s*
FORMULA \s* \{ (.*?) \}
}gx) {
my ($cond, $formula) = ($1, $2);
# Remove Excess spacing and translate variable into perl scalar.
for ($cond, $formula) {
s/^\s+|\s+$//g;
s{([a-zA-Z]+)}{
my $var = $1;
$var = "\$hashref->{$var}" if ! grep {$var eq $_} #FORMULA_FUNCS;
$var
}eg;
}
$conditions .= "return $formula if $cond; ";
}
my $code = "sub {my \$hashref = shift; ${conditions} return; }";
my $sub = eval $code;
if ($#) {
die "Invalid formulas in $name: $#";
}
$formula_per_k{$name} = $sub;
}
}
sub formula_per_k {
my ($k, $vars) = #_;
die "Unrecognized K value '$k'" if ! exists $formula_per_k{$k};
return $formula_per_k{$k}($vars);
}
print "'K1', {d => .1} = " . formula_per_k('K1', {d => .1}) . "\n";
print "'K1', {d => .05} = " . formula_per_k('K1', {d => .05}) . "\n";
print "'K3', {d => .02} = " . formula_per_k('K3', {d => .02}) . "\n";
print "'K3', {d => .021} = " . formula_per_k('K3', {d => .021}) . "\n";
__DATA__
... #OTHER STUFFS
K K1 {
LOL {
COND { d < 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d) }
COND { d >= 0.01 }
FORMULA { -0.2 + 3.3*sqrt(d+0.4) }
}
}
... #OTHER STUFFS
K K2 {
LOL {
COND { d < 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d) }
COND { d >= 0.03 }
FORMULA { -2.2 + 1.3*sqrt(d+0.8) }
}
}
... #OTHER STUFFS
K K3 {
LOL {
COND { d < 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d) }
COND { d >= 0.02 }
FORMULA { -4.3 + 0.3*sqrt(d+0.3) }
}
}
... #OTHER STUFF
Outputs:
'K1', {d => .1} = 2.13345237791561
'K1', {d => .05} = 2.01370729772479
'K3', {d => .02} = -4.13029437251523
'K3', {d => .021} = -4.13002941430942
ERROR:
Sequence (?1...) not recognized in regex; marked by <-- HERE in m/
^K \s+ M3 \s* {
( (?: [^{}]+ | {(?2 <-- HERE )} )* ) # Matched braces only.
}
/ at ./code.pl line 215, <RFILE> line 12.
UPDATE:
Code is updated.
This was originally suggested by https://stackoverflow.com/users/1733163/miller
Before the introduction of (?PARNO), we had to use (??{ code }) to create recursive regular expressions. An example can be found in perlre - Extended Patterns.
The following is tested on v5.16.2, v5.20.0, and locally on a v5.8.9 perlbrew:
our $braces_re;
$braces_re = qr{
\{
(?:
(?> [^{}]+ )
|
(??{ $braces_re })
)*
\}
}sx;
# parse FOO block
while (
$data =~ m{
^FOO \s+ (\w+) \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )* ) # Matched braces only.
\}
}mgx
)
{
my $params = $1;
# parse BAR block
next if $params !~ m{
BAR \s* \{
( (?: [^{}]+ | (??{ $braces_re }) )*? ) # Matched braces only.
\}
}mx;
# SOME CODE
}
Note, I intentionally separated out the declaration of the _re variable and its initialization. There are some versions of perl that will let you declare a recursive regular expression in the same statement as the initialization, but v5.8.9 is not one of them.
Also, if you're comfortable altering your original regex more than just dropping in a replacement for (?PARNO) notation, then the above can be reduced to the following. Also confirmed on v5.16.2:
my $braces_re;
$braces_re = qr{
(?:
(?> [^{}]+ )
| # The following is a "postponed" regular subexpression.
\{ (??{ $braces_re }) \} # Deferred execution enables recursive regex
)*
}sx;
# parse FOO block
while ( $data =~ m{^FOO \s+ (\w+) \s* \{ ( $braces_re ) \} }mgx ) {
my $params = $1;
# parse BAR block
next if $params !~ m{BAR \s* \{ ( $braces_re ) \}}mx;
# SOME CODE
}

perl stream file for regex token including scanned tokens

I am trying to stream a file in perl and tokenize the lines and include the tokens.
I have:
while( $line =~ /([\/][\d]*[%].*?[%][\d]*[\/]|[^\s]+|[\s]+)/g ) {
my $word = $1;
#...
}
But it doesn't work when there's no spaces in the token.
For example, if my line is:
$line = '/15%one (1)(2)%15/ is a /%good (1)%/ +/%number(2)%/.'
I would like to split that line into:
$output =
[
'/15%one (1)(2)%15/',
' ',
'is',
' ',
'a',
'/%good (1)%/',
' ',
'+',
'/%number(2)%/',
'.'
]
What is the best way to do this?
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
)
}sxg;
but that doesn't validate. If you want to validate, you could use
my #tokens;
push #tokens, $1
while $line =~ m{
\G
( \s+
| ([\/])([0-9]*)%
(?: (?! %\3\2 ). )*
%\3\2
| (?: (?! [\/][0-9]*% )\S )+
| \z (*COMMIT) (*FAIL)
| (?{ die "Syntax error" })
)
}sxg;
The following also validates, but it's a bit more readable and makes it easy to differentiate the token types.:
my #tokens;
for ($line) {
m{\G ( \s+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( ([\/])([0-9]*)% (?: (?! %\3\2 ). )* %\3\2 ) }sxgc
&& do { push #tokens, $1; redo };
m{\G ( (?: (?! [\/][0-9]*% )\S )+ ) }sxgc
&& do { push #tokens, $1; redo };
m{\G \z }sxgc
&& last;
die "Syntax error";
}
pos will get you information about where the error occurred.

Regex with recursive expression to match nested braces?

I'm trying to match text like sp { ...{...}... }, where the curly braces are allowed to nest. This is what I have so far:
my $regex = qr/
( #save $1
sp\s+ #start Soar production
( #save $2
\{ #opening brace
[^{}]* #anything but braces
\} #closing brace
| (?1) #or nested braces
)+ #0 or more
)
/x;
I just cannot get it to match the following text: sp { { word } }. Can anyone see what is wrong with my regex?
There are numerous problems. The recursive bit should be:
(
(?: \{ (?-1) \}
| [^{}]+
)*
)
All together:
my $regex = qr/
sp\s+
\{
(
(?: \{ (?-1) \}
| [^{}]++
)*
)
\}
/x;
print "$1\n" if 'sp { { word } }' =~ /($regex)/;
This is case for the underused Text::Balanced, a very handy core module for this kind of thing. It does rely on the pos of the start of the delimited sequence being found/set first, so I typically invoke it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';
sub get_bracketed {
my $str = shift;
# seek to beginning of bracket
return undef unless $str =~ /(sp\s+)(?={)/gc;
# store the prefix
my $prefix = $1;
# get everything from the start brace to the matching end brace
my ($bracketed) = extract_bracketed( $str, '{}');
# no closing brace found
return undef unless $bracketed;
# return the whole match
return $prefix . $bracketed;
}
my $str = 'sp { { word } }';
print get_bracketed $str;
The regex with the gc modifier tells the string to remember where the end point of the match is, and extract_bracketed uses that information to know where to start.

Parse IF condition using Regular Expression

I need to create the RE meets the following IF condition
string InputValue=" If (X.Value==” X”) then X.Value = “X”;
Elseif (X.Value==” X”) then X.Value = “X”;
Elseif (X.Value==” Y ") then X.Value = “Y”;
Elseif (X.Value== ” Z ") then X.Value = “Z”;
Else X.Value = “M”;";
as you know its only 1 if and 0 or many ElseIF and 0 or 1 Else and also i want to consider space and Enter
I try to use the following RE but its failed
string pattern="If\([a-z]*\.Value==""[a-z]*""\) Then [a-z]*\.Value=""[a-z]*""\;
(ElseIf\([a-z]*\.Value==""[a-z]*""\) Then [a-z]*\.Value=""[a-z]*""\;)*
(Else [a-z]*\.Value=""[a-z]*""\;)?";
bool result = Regex.IsMatch(InputValue, pattern, RegexOptions.IgnoreCase);
All ideas are welcomed
As in http://ideone.com/onS2e:
string condition = #"[_a-z]\w* \.VALUE \s* == \s* "" [^""]* """;
string assignment = #"[_a-z]\w* \.VALUE \s* = \s* "" [^""]* "" \s* ;";
string pattern = string.Format(
#"\b IF \s* \( \s* {0} \s* \) \s* THEN \s+ {1} \s*
( \b ELSEIF \s* \( \s* {0} \s* \) \s* THEN \s+ {1} \s* )* # repeat ELSEIF any number of times
( \b ELSE \s+ {1} )? # at most one ELSE",
condition, assignment);
Regex myRegex = new Regex( pattern, RegexOptions.IgnorePatternWhitespace |
RegexOptions.IgnoreCase | RegexOptions.Singleline );
Updated:
As in http://ideone.com/1coOp:
string pattern = string.Format(
#"^ \s* IF \s* \( \s* {0} \s* \) \s* THEN \s+ {1} \s*
( \b ELSEIF \s* \( \s* {0} \s* \) \s* THEN \s+ {1} \s* )* # repeat ELSEIF any number of times
( \b ELSE \s+ {1} )? # at most one ELSE
\s* $",
condition, assignment);