extract contents of each level of parentheses - regex

I am converting a SMAPI grammar to JSGF. They are pretty similar grammars used in different speech recognition systems. SMAPI uses a question mark the way the rest of the world does, to mean 0 or 1 of the previous thing. JSGF uses square brackets for this. So, I need to convert a string like stuff? to [stuff], and parenthesized strings like ((((stuff)? that)? I)? like)? to [[[[stuff] that] I] like]. I have to leave alone strings like ((((stuff) that) I) hate). As Qtax pointed out, a more complicated example would be (foo ((bar)? (baz))?) being replaced by (foo [[bar] (baz)]).
Because of this, I have to extract every level of a parenthesized expression, see if it ends in a question mark, and replace the parens and question mark with square braces if it does.
I think Eric Strom's answer to this question is almost what I need. The problem is that when I use it, it returns the largest matched grouping, whereas I need to operate on each individual grouping.
This is what I have so far: s/( \( (?: [^()?]* | (?0) )* \) ) \?/[$1]/xg. When matched with ((((stuff)? that)? I)? like)?, however, it produces only [((((stuff)? that)? I)? like)]. Any ideas on how to do this?

You'll also want to look at ysth's solution to that question, and use a tool that is already available to solve this problem:
use Text::Balanced qw(extract_bracketed);
$text = '((((stuff)? that)? I)? like)?';
for ($i=0; $i<length($text); $i++) {
($match,$remainder) = extract_bracketed( substr($text,$i), '()' );
if ($match && $remainder =~ /^\?/) {
substr($text,$i) =
'[' . substr($match,1,-1) . ']' . substr($remainder,1);
$i=-1; # fixed
}
}

In older Perl versions (pre 5.10), one could have used code assertions and dynamic regex for this:
...
my $s = '((((stuff)? that)? I)? like)?';
# recursive dynamic regex, we need
# to pre-declare lexical variables
my $rg;
# use a dynamically generated regex (??{..})
# and a code assertion (?{..})
$rg = qr{
(?: # start expression
(?> [^)(]+) # (a) we don't see any (..) => atomic!
| # OR
( # (b) start capturing group for level
\( (??{$rg}) \) \? # oops, we found parentheses \(,\) w/sth
) # in between and the \? at the end
(?{ print "[ $^N ]\n" }) # if we got here, print the captured text $^N
)* # done, repeat expression if possible
}xs;
$s =~ /$rg/;
...
during the match, the code assertion prints all matches, which are:
[ (stuff)? ]
[ ((stuff)? that)? ]
[ (((stuff)? that)? I)? ]
[ ((((stuff)? that)? I)? like)? ]
To use this according to your requirements, you could change the code assertion slightly, put the capturing parentheses at the right place, and save the matches in an array:
...
my @result;
my $rg;
$rg = qr{
(?:
(?> [^)(]+)
|
\( ( (??{$rg}) ) \) \? (?{ push @result, $^N })
)*
}xs;
$s =~ /$rg/ && print map "[$_]\n", @result;
...
which says:
[stuff]
[(stuff)? that]
[((stuff)? that)? I]
[(((stuff)? that)? I)? like]
Regards
rbo

You could solve it in a couple of ways, simplest being just executing your expression till there are no more replacements made. E.g:
1 while s/( \( (?: [^()?]* | (?0) )* \) ) \?/[$1]/xg;
But that is highly inefficient (for deeply nested strings).
You could do it in one pass like this instead:
s{
(?(DEFINE)
(?<r> \( (?: [^()]++ | (?&r) )*+ \) )
)
( \( )
(?= (?: [^()]++ | (?&r) )*+ \) \? )
|
\) \?
}{
$2? '[': ']'
}gex;
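To try this out, the one-pass substitution above can be wrapped in a small sub and run against the strings from the question. This is only a sketch. The sub name smapi_to_jsgf is made up for illustration, and it needs Perl 5.10 or later for (?(DEFINE)...) and the possessive quantifiers. It covers the parenthesized cases only; a bare stuff? is left untouched.

```perl
use strict;
use warnings;
use v5.10;

# Hypothetical helper name; the substitution itself is the one shown above.
sub smapi_to_jsgf {
    my ($s) = @_;
    $s =~ s{
        (?(DEFINE)
            (?<r> \( (?: [^()]++ | (?&r) )*+ \) )     # group 1: balanced parens
        )
        ( \( )                                        # group 2: an opening paren ...
        (?= (?: [^()]++ | (?&r) )*+ \) \? )           # ... whose closing paren is followed by ?
      |
        \) \?                                         # a closing paren followed by ?
    }{ $2 ? '[' : ']' }gex;
    return $s;
}

say smapi_to_jsgf('((((stuff)? that)? I)? like)?');
say smapi_to_jsgf('((((stuff) that) I) hate)');
say smapi_to_jsgf('(foo ((bar)? (baz))?)');
```

Because every opening paren is checked with a lookahead and nothing is consumed except the paren itself, the whole string is handled in a single left-to-right pass.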

Related

regular expression with recursion in Perl

I'm trying to use this but can't make it work. I want to check the syntax of expressions like this: (1+2)*(3+4)
I have integers, +, * and brackets. That's it, but it can be nested to any depth.
In BNF syntax the expr can be described like this:
expr    ::= <sum>
sum     ::= <product>{+<product>}
product ::= <atom>{*<atom>}
atom    ::= <number>|(<expr>)
number  ::= <digit>{<digit>}
I tried to translate this to Perl like this:
$number = '\d+';
$atom = "($number|\\((?R)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^$expr$/)
{
print "OK";
}
But it doesn't match! What am I doing wrong?
When you recurse, the ^ at the start of the pattern will fail to match.
Use (?(DEFINE)...) to define the rules instead of using (?R).
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&sum) )
(?<sum> (?&product) (?: \+ (?&product) )*+ )
(?<product> (?&atom) (?: \* (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
which simplifies to
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&binary_op) )
(?<binary_op> (?&atom) (?: [+*] (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
That's assuming you're only trying to check for validity rather than trying to parse the string. If you need to parse the string, you can build a parser using Parse::RecDescent or Marpa::R2.
ikegami's workaround above with the DEFINE stuff is beautiful, but it doesn't answer the question how to do it my way. A minimal change of my code to make it work? ikegami is right, the cause of no match is the ^ in /^$expr$/ . When the parser reenters the regex recursively it again checks for beginning of string, which fails. So I cannot have ^ and $ in the regex it seems. Without them my string matches. But then some invalid strings match too, like A(1+2)*(3+4)B . In the absence of ^ and $ it doesn't necessarily match the whole string. Problem.
ikegami suggested a solution to this in a comment above. I'll just write it out. I have tested it and it works:
$number = '\d+';
$atom = "($number|\\((?1)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^($expr)$/)
{
print "OK";
}
Notice that I now have (?1) instead of (?R) and that I have enclosed $expr in brackets. (?1) refers to the first capture group, which is ($expr). So the recursion reenters this subex instead of the whole regex. ^ is not met again. That solves it.
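As a quick check of the (?1) version, here is a minimal sketch that wraps it in a sub and tries it on valid and invalid inputs, including the A(1+2)*(3+4)B case mentioned above. The sub name is_valid_expr and the extra test strings are made up for illustration; the pattern is built exactly as in the code above.

```perl
use strict;
use warnings;

my $number  = '\d+';
my $atom    = "($number|\\((?1)\\))";    # (?1) recurses into capture group 1
my $product = "$atom(\\*$atom)*";
my $sum     = "$product(\\+$product)*";
my $expr_re = qr/^($sum)$/;              # the outer parentheses are group 1

# Hypothetical helper: returns 1 if the whole string is a valid expression.
sub is_valid_expr {
    my ($str) = @_;
    return $str =~ $expr_re ? 1 : 0;
}

for my $candidate ('(1+2)*(3+4)', '((1))', 'A(1+2)*(3+4)B', '(1+2', '1+*2') {
    printf "%-15s %s\n", $candidate, is_valid_expr($candidate) ? 'OK' : 'invalid';
}
```

The anchors stay outside group 1, so the recursion never meets ^ again, while the top-level match must still cover the whole string.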

Perl regex - read java file and match entire text of a function in file

I am trying to read a .java file into a perl variable, and I want to match a function, say for instance:
public String example(){
return "hello";
}
What would the regex pattern for this look like?
Current Attempt:
use strict;
use warnings;
open ( FILE, "example.java" ) || die "can't open file!";
my @lines = <FILE>;
close (FILE);
my $line;
foreach $line (@lines) {
if($line =~ /String example(.*)}/s){
print $line;
}
}
Adapted from this answer
Regex:
^\s*([\w\s]+\(.*\)\s*(\{((?>"(?:[^"\\]*+|\\.)*"|'(?:[^'\\]*+|\\.)*'|//.*$|/\*[\s\S]*?\*/(\w+)["']?[^;]+\4;$|[^{}<'"/]++|[^{}]++|(?2))*)}))
Breakdown:
^ \s*
( # (1 start)
[\w\s]+ \( .* \) \s* # How it matches a function definition
( # (2 start)
\{ # Opening curly bracket
( # (3 start)
(?> # Atomic grouping (for its non-capturing purpose only)
"(?: [^"\\]*+ | \\ . )*" # Double quoted strings
| '(?: [^'\\]*+ | \\ . )*' # Single quoted strings
| // .* $ # A comment block starting with //
| /\* [\s\S]*? \*/ # A multi-line comment block /*...*/
( \w+ ) # (4) ^
["']? [^;]+ \4 ; $ # ^
| [^{}<'"/]++ # Force engine to backtrack if it encounters special characters (possessive)
| [^{}]++ # Default matching behavior (possessive)
| (?2) # Recurs 2nd capturing group
)* # Zero to many times of atomic group
) # (3 end)
} # Closing curly bracket
) # (2 end)
) # (1 end)
Revo's regex is the Right Way To Do it (as much as a regex ever can be!).
But sometimes you just need something quick, to manipulate a file you have control over. I find, when using regexes, that it's often important to define "Good enough".
So, it may be "good enough" to assume the indentation is correct. In that case, you can just detect the start of the fn, then read until you find the next closing curly with the same indentation:
( # Capture \1.
^([\t ]+) # Match and capture all leading whitespace to \2.
(?:\w+\s+)* # Modifiers and return type, if any.
\w+\s*\( # Name and opening round brace: is a function.
.*? # Need Dot-matches-newline, to match fn body.
\n\2} # Curly brace is as indented as start of fn.
) # End capture of \1.
Should work on clean code that you wrote yourself, code you can pass through an auto-formatter first, etc.
Will work with K&R, Horstmann and Allman indent styles.
Will fail with one-line and in-line functions, and indent styles like GNU, Whitesmiths, Pico and Ratliff - things which Revo's answer handles with no problems at all.
Also fails on lambdas, nested functions, and functions which use generics, but even Revo's doesn't recognize those, and they're not that common.
And neither of our regexes capture the comments preceding a function, which is pretty sinful.
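Here is a minimal sketch of how the indentation-based pattern might be used from Perl. It makes several assumptions: the Java source is a made-up sample, the whole file is held in one string rather than read line by line, the leading whitespace is captured as a whole with ([\t ]+), and any number of modifier or return-type words is allowed before the name. The /m flag lets ^ match at each line start and /s lets the dot cross newlines.

```perl
use strict;
use warnings;

# Made-up, cleanly indented sample source.
my $java = <<'JAVA';
public class Example {
    public String example() {
        if (true) {
            return "hello";
        }
        return "bye";
    }

    int other() {
        return 1;
    }
}
JAVA

my $fn_re = qr/
    (                   # Capture 1: the whole function.
      ^([\t ]+)         # Capture 2: the leading whitespace.
      (?:\w+\s+)*       # Modifiers and return type, if any.
      \w+\s*\(          # Name and opening round brace.
      .*?               # Function body, as short as possible.
      \n\2\}            # Closing curly at the same indentation.
    )
/msx;

my @functions;
while ($java =~ /$fn_re/g) {
    push @functions, $1;
}

print scalar(@functions), " functions found\n";
print "$_\n---\n" for @functions;
```

The inner closing brace of the if block is indented further than the method, so it is skipped; only a brace at the method's own indentation ends the match.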

extract a part of string using regex

I have a text file with pattern as below.
"s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}"
I am trying to extract the values cap1, cap2, CAP3.
I am trying with regex "([a-z]|[|])cap1(\{(.*?)\})([a-z]|[|]|[0-9])" but with no luck any help is appreciated.
As I understand it, you want to extract the values of cap1, cap2 and CAP3 one by one. That takes three regexes.
For cap1
cap1\{([^\}]*)\}
Explanation
cap1\{ match text cap1{,
([^\}]*) capture any characters except } to group $1,
\} match text }.
For cap2
cap2\[([^\]]*)\]
Explanation
cap2\[ match text cap2[,
([^\]]*) capture any characters except ] to group $1,
\] match text ].
For CAP3
CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}
Explanation
CAP3\{ match text CAP3{,
\[([^\]]*)\]\|\[([^\]]*)\] capture any characters except ] to groups $1, $2 respectively,
\} match text }.
Additional: Thanks to @Borodin for the comment. You don't need lookarounds for this task, but if you want to do a search and replace, they may be necessary.
For cap1: (?<=cap1\{)([^\}]*)(?=\})
For cap2: (?<=cap2\[)([^\]]*)(?=\])
For CAP3: (?<=CAP3\{)\[([^\]]*)\]\|\[([^\]]*)\](?=\})
Using a pattern such as this should work:
[{\[]+([^}{\]\[]+)[\]}]+
Code:
$searchText =~ m/[{\[]+([^}{\]\[]+)[\]}]+/
Example:
https://regex101.com/r/qI3fI6/1
Update
I apologise -- I initially mistook your question for something more trivial
Essentially you want to perform a split on pipe | characters, excluding those found inside pairs of brackets or braces [ ... ] or { ... }. As long as you don't need to take account of nesting inside brackets of the same type (i.e. braces will only ever contain brackets, and brackets will only ever contain braces) then it is simply done like this
my @matches = $s =~ m{ \w+ ( \{ [^{}]* \} | \[ [^\[\]]* \] ) }gx;
print "$_\n" for @matches;
output
{capture|these|values}
[capture|these|values]
{[capture|these|values]|[capture|these|values]}
The data you show has no instances of braces containing braces, or brackets containing brackets, but I suspect that there is no theoretical limit to the nesting of your data, in which case some recursion is necessary
The regex pattern in the program below defines the text that can appear inside a pair of matching brackets as a pipe-delimited sequence of
another pair of matching brackets and their content [ ... ]
another pair of matching braces and their content { ... }
a sequence of word characters like capture and values
A pattern matching that is inside the second pair of capturing parentheses. It is a recursive pattern that calls itself using relative numbering (?-1). That could also be absolute numbering (?2) but it would have to be changed if the number of preceding captures was changed
The complete pattern looks for and captures a series of word characters immediately before the recursive pattern to account for the cap1, cap2 etc. This allows the result of a global search to be assigned directly to a hash with the result shown below
use strict;
use warnings;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
my %captures = $s =~ m{
( (?> \w+ ) )
(
\{ (?-1) (?> \| (?-1) )* \} |
\[ (?-1) (?> \| (?-1) )* \] |
\w+
)
}gx;
use Data::Dump;
dd \%captures;
output
{
cap1 => "{capture|these|values}",
cap2 => "[capture|these|values]",
CAP3 => "{[capture|these|values]|[capture|these|values]}",
}
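To illustrate the point about unlimited nesting, here is a small sketch that runs the same recursive pattern on a made-up string where braces contain braces and brackets contain brackets. The input and the key names outer and list are invented for this example, and / is used as the pattern delimiter only to keep the braces inside the pattern unambiguous.

```perl
use strict;
use warnings;

# Made-up input with same-type nesting.
my $nested = 'a|b|outer{x|{y|z}|[w]}|c|d|list[[p|q]|r]';

my %captures = $nested =~ m/
    ( (?> \w+ ) )                          # group 1: the identifier
    (                                      # group 2: recursive body
        \{ (?-1) (?> \| (?-1) )* \} |      # braces around a pipe-delimited list
        \[ (?-1) (?> \| (?-1) )* \] |      # brackets around a pipe-delimited list
        \w+                                # or a plain word
    )
/gx;

print "$_ => $captures{$_}\n" for sort keys %captures;
```

Inside group 2, (?-1) refers to group 2 itself, so each list item may in turn be a braced list, a bracketed list or a word, to any depth.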
Original answer
It looks like you want all identifiers that are preceded by a pipe | character and followed by either a square or curly opening bracket [ or {
This program will do that for you
use strict;
use warnings;
use v5.10;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
for ( $s ) {
my @captures = /\|(\w+)[\[\{]/g;
say for @captures;
}
output
cap1
cap2
CAP3

Perl regex: Extract token_key value from string

I just started learning Perl and trying to do regex to break down a token key.
The token itself has multiple "columns" and I only need the KEY section.
token:
{
"token_key":"C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8",
"key_type":"xyz",
"expires":1200
}
but I only need this part
C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8
everything else can be ignored when I output it.
Any suggestions or advice are welcome!
Thank You!
This looks like JSON -- perhaps a proper parser would be better?
use JSON::PP;
my $json = JSON::PP->new->utf8->allow_barekey;
my $token = $json->decode('{' . $str . '}')->{'token'};
print $token->{'token_key'};
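If the string holds only the JSON object itself (without the leading token: label), a self-contained sketch could look like this. It assumes that input shape, and the token value is a shortened, made-up stand-in for the real key. decode_json is exported by JSON::PP, which ships with Perl 5.14 and later.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Shortened, made-up token value used only for illustration.
my $json_text = <<'JSON';
{
  "token_key":"C9B3A703ADFEE7A5",
  "key_type":"xyz",
  "expires":1200
}
JSON

my $data      = decode_json($json_text);    # hash reference
my $token_key = $data->{token_key};

print "$token_key\n";
```

A parser also copes with reordered fields, extra whitespace and escaped characters.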
In any case, you can extract it (a bit more hackishly) with a regex like so:
$str =~ /['"]token_key['"]:\s*['"]([a-f0-9]+)['"]/i;
print $1;
A slightly less hackish regex would be:
# (["'])\s*token_key\s*\1\s*:\s*(["'])((?:(?!\2)[\S\s])*)\2
( ["'] ) # (1)
\s* token_key \s*
\1
\s* : \s*
( ["'] ) # (2)
( # (3 start)
(?:
(?! \2 )
[\S\s]
)*
) # (3 end)
\2
Assuming the following:
there is no whitespace between "token_key" and its value,
double quotes, not single quotes, are used around the strings,
the file contents are in the $_ variable.
In that case
if (/"token_key":"([^"]+)/)
{ print "$1\n" }

Perl regex for multichar nested bracket

I use recursive Perl regular expressions to scan for nested single-char brackets:
$RE = qr'(?:[\(]((?:(?>[^\(\)]+)|(??{$RE}))*)[\)])';
This lets me scan C function calls, something like:
"func (a(b()))" =~ /$RE/
matching "(a(b()))" . Now I'd like to parse Pascal style nested [if,if-end] brackets, i.e.:
if (a) then
if (b) then
blaif := 1;
else
blaend := 2;
end if;
end if;
I tried to rewrite $RE from above to:
$RE_if = qr'(?:(?:if)((?:(?>(?!(?:\bif\b|\bend\s+if))+)|(??{$RE_if}))*)\
(?:\bend\s+if))';
But it kind of doesn't work. Does somebody have a regex that handles multichar brackets
like ["if","end if"] ?
-- Greetings Konrad
Let's look at the original pattern: (Extraneous escapes removed. Needless surrounding (?:) removed.)
[(] # Prefix.
(
(?: (?> [^()] +) # Some characters containing neither prefix nor suffix.
| (??{ $RE }) # Recursion
)*
)
[)] # Suffix.
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so:
\bif\b
(
(?: (?> (?:(?! \b(?:end\s+)?if\b ).)+ )
| (??{ $RE })
)*
)
\bend\s+if\b
By the way, (?>PAT+) can be written PAT++.
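Put together, the if / end if pattern above might be used as in the following sketch. It makes some assumptions: the regex is stored in $RE_if so that the (??{ ... }) block recurses into the same pattern, /s is added so the dot can cross line breaks, and a reasonably recent Perl is used so the lexical variable is visible inside the code block.

```perl
use strict;
use warnings;

my $RE_if;    # declared first so the pattern can refer to itself
$RE_if = qr/
    \bif\b                                              # prefix
    (
        (?: (?> (?: (?! \b(?:end\s+)?if\b ) . )+ )      # text with neither prefix nor suffix
        |   (??{ $RE_if })                              # a nested if block
        )*
    )
    \bend\s+if\b                                        # suffix
/xs;

my $text = <<'PASCAL';
if (a) then
  if (b) then
    blaif := 1;
  else
    blaend := 2;
  end if;
end if;
PASCAL

my ($whole, $body);
if ($text =~ $RE_if) {
    $whole = substr($text, $-[0], $+[0] - $-[0]);    # the full matched block
    $body  = $1;                                     # text between the outer if and end if
    print "matched:\n$whole\n";
}
```

The \b anchors keep identifiers such as blaif and blaend from being mistaken for the keywords.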