Regular Expression Inception (a match within a match) - regex

I am trying to create a regular expression that captures a named group and then looks within that named group to check if it contains certain qualities.
For example, I have a regular expression that matches a code block, and I can use it to match and capture code blocks:
test.pl:
use strict;
use warnings;
my $text = <<'END_TEXT';
block {
// random stuff
}
block {
dog
}
END_TEXT
my $code_block_rx = qr{(?(DEFINE)
(?<code_block>
block\h\{ (?: [^{}]++ | (?&code_block) )*+ \}
)
)}xms;
while ($text =~ m/(?<match>(?&code_block))$code_block_rx/g) {
print $+{match}."\n";
}
This code will print both code blocks. But what if I only want to capture code blocks that contain the word "dog"?
Is there a way (in a single regular expression) to capture a code block, and then if that is found, look within the code block for the word "dog"?
I've tried modifying the regex to use a look ahead assertion, but it just causes the whole thing to fail: /(?<match>(?=dog)(?&code_block))$code_block_rx/g
What am I missing?

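Your lookahead (?=dog) is tested at the position where the match starts, so it only succeeds if the text there begins with dog. That position is where block begins, so it always fails.
Instead, you can check whether dog is in the block you matched.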
while ($text =~ /(
\b block \h*+ ( (?&code_block) )
(?(DEFINE)
(?<code_block> \{ (?&code_block_body) \} )
(?<code_block_body> (?: [^{}]++ | (?&code_block) )*+ )
)
)/xg) {
my $block_stmt = $1;
my $block_stmt_block = $2;
if ($block_stmt_block =~ /\b dog \b/x) {
say $block_stmt;
}
}
It can be done in a single pattern by using (?(?{!( assertion() )})(*FAIL)) to match against something you already captured.
while ($text =~ m{(
\b block \h*+
# A code_block that contains the word 'dog'.
( (?&code_block) ) (?(?{!( "$^N" =~ /\b dog \b/x )})(*FAIL))
(?(DEFINE)
(?<code_block> \{ (?&code_block_body) \} )
(?<code_block_body> (?: [^{}]++ | (?&code_block) )*+ )
)
)}xg) {
say $1;
}
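For anyone who wants to run the two-step approach end to end, here is a minimal sketch that wraps it in a helper. The name blocks_containing and its arguments are made up for illustration and are not part of the answer above; the grammar is the same one shown there.

```perl
use strict;
use warnings;

# Hypothetical helper: return every "block { ... }" statement whose
# braces contain $word as a whole word.
sub blocks_containing {
    my ($text, $word) = @_;
    my @found;
    while ($text =~ /
        ( \b block \h*+ ( (?&code_block) ) )
        (?(DEFINE)
            (?<code_block>      \{ (?&code_block_body) \} )
            (?<code_block_body> (?: [^{}]++ | (?&code_block) )*+ )
        )
    /xg) {
        # $1 is the whole statement, $2 is the braces and their contents.
        my ($stmt, $body) = ($1, $2);
        push @found, $stmt if $body =~ /\b\Q$word\E\b/;
    }
    return @found;
}

my $text = "block {\n// random stuff\n}\nblock {\ndog\n}\n";
print "$_\n" for blocks_containing($text, 'dog');
```

Filtering in a second step keeps the grammar readable, at the cost of scanning each matched block twice.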

Related

regular expression with recursion in Perl

I'm trying to use this but can't make it work. I want to check the syntax of expressions like this: (1+2)*(3+4)
I have integers, +, * and brackets. That's it, but it can be nested to any depth.
In BNF syntax the expr can be described like this:
expr    ::= <sum>
sum     ::= <product>{+<product>}
product ::= <atom>{*<atom>}
atom    ::= <number>|(<expr>)
number  ::= <digit>{<digit>}
I tried to translate this to Perl like this:
$number = '\d+';
$atom = "($number|\\((?R)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^$expr$/)
{
print "OK";
}
But it doesn't match! What am I doing wrong?
When you recurse, the ^ at the start of the pattern will fail to match.
Use (?(DEFINE)...) to define the rules instead of using (?R).
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&sum) )
(?<sum> (?&product) (?: \+ (?&product) )*+ )
(?<product> (?&atom) (?: \* (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
which simplifies to
'(1+2)*(3+4)' =~ /
^ (?&expr) \z
(?(DEFINE)
# Rules.
(?<expr> (?&binary_op) )
(?<binary_op> (?&atom) (?: [+*] (?&atom) )*+ )
(?<atom> (?&NUMBER) | \( (?&expr) \) )
# Tokens.
(?<NUMBER> \d++ )
)
/x
or die("Doesn't match.\n");
That's assuming you're only trying to check for validity rather than trying to parse the string. If you need to parse the string, you can build a parser using Parse::RecDescent or Marpa::R2.
ikegami's workaround above with the DEFINE stuff is beautiful, but it doesn't answer the question how to do it my way. A minimal change of my code to make it work? ikegami is right, the cause of no match is the ^ in /^$expr$/ . When the parser reenters the regex recursively it again checks for beginning of string, which fails. So I cannot have ^ and $ in the regex it seems. Without them my string matches. But then some invalid strings match too, like A(1+2)*(3+4)B . In the absence of ^ and $ it doesn't necessarily match the whole string. Problem.
ikegami suggested a solution to this in a comment above. I'll just write it out. I have tested it and it works:
$number = '\d+';
$atom = "($number|\\((?1)\\))";
$product = "$atom(\\*$atom)*";
$sum = "$product(\\+$product)*";
$expr = $sum;
if ('(1+2)*(3+4)' =~ /^($expr)$/)
{
print "OK";
}
Notice that I now have (?1) instead of (?R) and that I have enclosed $expr in brackets. (?1) refers to the first capture group, which is ($expr). So the recursion reenters this subex instead of the whole regex. ^ is not met again. That solves it.
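Here is a minimal sketch of the same (?1) idea wrapped in a helper, so it can be tried against valid and invalid input. The name is_valid_expr is made up for illustration. It also assumes two small changes from the code above: the inner groups are non-capturing, so the outer group is the only numbered one, and \z is used instead of $ so that a trailing newline is not accepted.

```perl
use strict;
use warnings;

my $number  = '\d+';
my $atom    = "(?:$number|\\((?1)\\))";   # (?1) recurses into group 1 only
my $product = "$atom(?:\\*$atom)*";
my $sum     = "$product(?:\\+$product)*";

# Group 1 wraps the whole expression; the anchors sit outside it,
# so they are not re-tested when the pattern recurses.
my $expr_rx = qr/^($sum)\z/;

# Hypothetical helper: true if $s is a well-formed expression.
sub is_valid_expr {
    my ($s) = @_;
    return $s =~ $expr_rx ? 1 : 0;
}

for my $s ('(1+2)*(3+4)', 'A(1+2)*(3+4)B', '((1)', '1+') {
    printf "%-15s %s\n", $s, is_valid_expr($s) ? 'OK' : 'invalid';
}
```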

extract a part of string using regex

I have a text file with pattern as below.
"s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}"
I am trying to extract the values cap1, cap2, CAP3.
I am trying with regex "([a-z]|[|])cap1(\{(.*?)\})([a-z]|[|]|[0-9])" but with no luck any help is appreciated.
As I understand it, you want to extract the values of cap1, cap2 and CAP3 one by one, so there are three regexes.
For cap1
cap1\{([^\}]*)\}
Explanation
cap1\{ match text cap1{,
([^\}]*) capture any characters except } to group $1,
\} match text }.
For cap2
cap2\[([^\]]*)\]
Explanation
cap2\[ match text cap2[,
([^\]]*) capture any characters except ] to group $1,
\] match text ].
For CAP3
CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}
Explanation
CAP3\{ match text CAP3{,
\[([^\]]*)\]\|\[([^\]]*)\] capture any characters except ] to groups $1, $2 respectively,
\} match text }.
Additional: Thanks to @Borodin for the comment. You don't need lookaround for this task, but if you want to do a search and replace, lookaround may be necessary.
For cap1: (?<=cap1\{)([^\}]*)(?=\})
For cap2: (?<=cap2\[)([^\]]*)(?=\])
For CAP3: (?<=CAP3\{)\[([^\]]*)\]\|\[([^\]]*)\](?=\})
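As a rough sketch, the three capturing patterns can be applied to the sample string like this. The helper name extract_caps and the returned hash layout are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the three capturing patterns from this
# answer and return whatever matched, keyed by field name.
sub extract_caps {
    my ($s) = @_;
    my %out;
    $out{cap1} = [$1]     if $s =~ /cap1\{([^\}]*)\}/;
    $out{cap2} = [$1]     if $s =~ /cap2\[([^\]]*)\]/;
    $out{CAP3} = [$1, $2] if $s =~ /CAP3\{\[([^\]]*)\]\|\[([^\]]*)\]\}/;
    return \%out;
}

my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
my $caps = extract_caps($s);
for my $name (sort keys %$caps) {
    print "$name: $_\n" for @{ $caps->{$name} };
}
```

Each field is tested separately, so a missing field simply leaves its key out of the result.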
Using a pattern such as this should work:
[{\[]+([^}{\]\[]+)[\]}]+
Code:
$searchText =~ m/[{\[]+([^}{\]\[]+)[\]}]+/
Example:
https://regex101.com/r/qI3fI6/1
Update
I apologise -- I initially mistook your question for something more trivial.
Essentially you want to perform a split on pipe | characters, excluding those found inside pairs of brackets or braces [ ... ] or { ... }. As long as you don't need to take account of nesting inside brackets of the same type (i.e. braces will only ever contain brackets, and brackets will only ever contain braces) then it is simply done like this
my @matches = $s =~ m{ \w+ ( \{ [^{}]* \} | \[ [^\[\]]* \] ) }gx;
print "$_\n" for @matches;
output
{capture|these|values}
[capture|these|values]
{[capture|these|values]|[capture|these|values]}
The data you show has no instances of braces containing braces, or brackets containing brackets, but I suspect that there is no theoretical limit to the nesting of your data, in which case some recursion is necessary
The regex pattern in the program below defines the text that can appear inside a pair of matching brackets as a pipe-delimited sequence of
another pair of matching brackets and their content [ ... ]
another pair of matching braces and their content { ... }
a sequence of word characters like capture and values
The pattern that matches this is inside the second pair of capturing parentheses. It is a recursive pattern that calls itself using relative numbering (?-1). That could also be absolute numbering (?2), but it would have to be changed if the number of preceding captures changed
The complete pattern looks for and captures a series of word characters immediately before the recursive pattern to account for the cap1, cap2 etc. This allows the result of a global search to be assigned directly to a hash, with the result shown below
use strict;
use warnings;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
my %captures = $s =~ m{
( (?> \w+ ) )
(
\{ (?-1) (?> \| (?-1) )* \} |
\[ (?-1) (?> \| (?-1) )* \] |
\w+
)
}gx;
use Data::Dump;
dd \%captures;
output
{
cap1 => "{capture|these|values}",
cap2 => "[capture|these|values]",
CAP3 => "{[capture|these|values]|[capture|these|values]}",
}
Original answer
It looks like you want all identifiers that are preceded by a pipe | character and followed by either a square or curly opening bracket [ or {
This program will do that for you
use strict;
use warnings;
use v5.10;
my $s = "s|o|m|j|n|k|v|a|l|u|e|s|cap1{capture|these|values}|s|o|m|j|n|k|v|a|l|u|e|s|cap2[capture|these|values]|s|o|m|j|n|k|v|a|l|u|e|s|CAP3{[capture|these|values]|[capture|these|values]}";
for ( $s ) {
my @captures = /\|(\w+)[\[\{]/g;
say for @captures;
}
output
cap1
cap2
CAP3

Perl regex: Extract token_key value from string

I just started learning Perl and trying to do regex to break down a token key.
The token itself has multiple "columns" and I only need the KEY section.
token:
{
"token_key":"C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8",
"key_type":"xyz",
"expires":1200
}
but I only need this part
C9B3A703ADFEE7A579561799DC019685C75F16E6D4F80E3AA01798CA2B1BD4C396E91C62D73A9604EE90C72BED760AC24D70B072517B06C3D2E1E3102046103E813E2AA59741D2B6543475DEED4EF4A9625BFFF15DAC5417209AEED968016E0671BE1878C8
everything else can be ignored when I output it.
Any suggestions or advice are welcome!
Thank You!
This looks like JSON -- perhaps a proper parser would be better?
use JSON::PP;
my $json = JSON::PP->new->utf8->allow_barekey;
my $token = $json->decode('{' . $str . '}')->{'token'};
print $token->{'token_key'};
In any case, you can extract it (a bit more hackishly) with a regex like so:
$str =~ /['"]token_key['"]:\s*['"]([a-f0-9]+)['"]/i;
print $1;
A slightly less hackish regex would be:
# (["'])\s*token_key\s*\1\s*:\s*(["'])((?:(?!\2)[\S\s])*)\2
( ["'] ) # (1)
\s* token_key \s*
\1
\s* : \s*
( ["'] ) # (2)
( # (3 start)
(?:
(?! \2 )
[\S\s]
)*
) # (3 end)
\2
Assuming the following:
there is no whitespace between "token_key" and its value,
the strings being parsed use double quotes, not single quotes,
the contents of the file are in the $_ variable.
In that case
if (/"token_key":"([^"]+)/)
{ print "$1\n" }
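If the input really is the JSON object shown in the question, a parser avoids the quoting assumptions listed above. This is only a sketch: the helper name token_key_from_json is made up, and it assumes the text passed in is the bare { ... } object with no token: prefix in front of it. JSON::PP ships with Perl and exports decode_json.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: decode a JSON object and return its token_key.
# decode_json dies on malformed input, so wrap the call in eval
# if the data is not trusted.
sub token_key_from_json {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return $data->{token_key};
}

my $json_text = <<'END_JSON';
{
"token_key":"C9B3A703ADFEE7A5",
"key_type":"xyz",
"expires":1200
}
END_JSON

print token_key_from_json($json_text), "\n";
```

The key in the usage example is shortened for readability; the real value is handled the same way.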

Perl regex for multichar nested bracket

I use recursive Perl regular expressions to scan for nested single-character brackets:
$RE = qr'(?:[\(]((?:(?>[^\(\)]+)|(??{$RE}))*)[\)])';
This lets me scan C function calls, something like:
"func (a(b()))" =~ /$RE/
matching "(a(b()))" . Now I'd like to parse Pascal style nested [if,if-end] brackets, i.e.:
if (a) then
if (b) then
blaif := 1;
else
blaend := 2;
end if;
end if;
I tried to rewrite $RE from above to:
$RE_if = qr'(?:(?:if)((?:(?>(?!(?:\bif\b|\bend\s+if))+)|(??{$RE_if}))*)\
(?:\bend\s+if))';
But it doesn't quite work. Does somebody have a regex that handles multi-character brackets
like ["if","end if"] ?
-- Greetings Konrad
Let's look at the original pattern: (Extraneous escapes removed. Needless surrounding (?:) removed.)
[(] # Prefix.
(
(?: (?> [^()] +) # Some characters containing neither prefix nor suffix.
| (??{ $RE }) # Recursion
)*
)
[)] # Suffix.
(?:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR, so:
\bif\b
(
(?: (?> (?:(?! \b(?:end\s+)?if\b ).)+ )
| (??{ $RE_if })
)*
)
\bend\s+if\b
By the way, (?>PAT+) can be written PAT++.
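Here is a runnable sketch of the same idea. It assumes a named group with (?&if_block) recursion (Perl 5.10 or later) in place of (??{ }), and it uses the PAT++ spelling mentioned above. The names if_block and outer_if_block are made up for illustration.

```perl
use strict;
use warnings;

# "if" ... "end if" treated as a multi-character bracket pair.
my $if_block_rx = qr{
    (?<if_block>
        \b if \b
        (?:
            # Any run of characters that does not start "if" or "end if".
            (?: (?! \b (?: end \s+ )? if \b ) . )++
          |
            (?&if_block)
        )*
        \b end \s+ if \b
    )
}xs;

# Hypothetical helper: return the first complete if ... end if block.
sub outer_if_block {
    my ($text) = @_;
    return $text =~ $if_block_rx ? $& : undef;
}

my $code = <<'END_CODE';
if (a) then
  if (b) then
    blaif := 1;
  else
    blaend := 2;
  end if;
end if;
END_CODE

print outer_if_block($code), "\n";
```

The \b anchors are what keep identifiers such as blaif and blaend from being mistaken for keywords.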

extract contents of each level of parentheses

I am converting a SMAPI grammar to JSGF. They are pretty similar grammars used in different speech recognition systems. SMAPI uses a question mark the way the rest of the world does, to mean 0 or 1 of the previous thing. JSGF uses square brackets for this. So, I need to convert a string like stuff? to [stuff], and parenthesized strings like ((((stuff)? that)? I)? like)? to [[[[stuff] that] I] like]. I have to leave alone strings like ((((stuff) that) I) hate). As Qtax pointed out, a more complicated example would be (foo ((bar)? (baz))?) being replaced by (foo [[bar] (baz)]).
Because of this, I have to extract every level of a parenthesized expression, see if it ends in a question mark, and replace the parens and question mark with square braces if it does.
I think Eric Strom's answer to this question is almost what I need. The problem is that when I use it, it returns the largest matched grouping, whereas I need to operate on each individual grouping.
This is what I have so far: s/( \( (?: [^()?]* | (?0) )* \) ) \?/[$1]/xg. When matched with ((((stuff)? that)? I)? like)?, however, it produces only [((((stuff)? that)? I)? like)]. Any ideas on how to do this?
You'll also want to look at ysth's solution to that question, and use a tool that is already available to solve this problem:
use Text::Balanced qw(extract_bracketed);
$text = '((((stuff)? that)? I)? like)?';
for ($i=0; $i<length($text); $i++) {
($match,$remainder) = extract_bracketed( substr($text,$i), '()' );
if ($match && $remainder =~ /^\?/) {
substr($text,$i) =
'[' . substr($match,1,-1) . ']' . substr($remainder,1);
$i=-1; # fixed
}
}
In older Perl versions (pre 5.10), one could have used code assertions and dynamic regex for this:
...
my $s = '((((stuff)? that)? I)? like)?';
# recursive dynamic regex, we need
# to pre-declare lexical variables
my $rg;
# use a dynamically generated regex (??{..})
# and a code assertion (?{..})
$rg = qr{
(?: # start expression
(?> [^)(]+) # (a) we don't see any (..) => atomic!
| # OR
( # (b) start capturing group for level
\( (??{$rg}) \) \? # oops, we found parentheses \(,\) w/sth
) # in between and the \? at the end
(?{ print "[ $^N ]\n" }) # if we got here, print the captured text $^N
)* # done, repeat expression if possible
}xs;
$s =~ /$rg/;
...
during the match, the code assertion prints all matches, which are:
[ (stuff)? ]
[ ((stuff)? that)? ]
[ (((stuff)? that)? I)? ]
[ ((((stuff)? that)? I)? like)? ]
To use this according to your requirements, you could change the code assertion slightly, put the capturing parentheses at the right place, and save the matches in an array:
...
my @result;
my $rg;
$rg = qr{
(?:
(?> [^)(]+)
|
\( ( (??{$rg}) ) \) \? (?{ push @result, $^N })
)*
}xs;
$s =~ /$rg/ && print map "[$_]\n", @result;
...
which says:
[stuff]
[(stuff)? that]
[((stuff)? that)? I]
[(((stuff)? that)? I)? like]
Regards
rbo
You could solve it in a couple of ways, simplest being just executing your expression till there are no more replacements made. E.g:
1 while s/( \( (?: [^()?]* | (?0) )* \) ) \?/[$1]/xg;
But that is highly inefficient (for deeply nested strings).
You could do it in one pass like this instead:
s{
(?(DEFINE)
(?<r> \( (?: [^()]++ | (?&r) )*+ \) )
)
( \( )
(?= (?: [^()]++ | (?&r) )*+ \) \? )
|
\) \?
}{
$2? '[': ']'
}gex;
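To try the one-pass substitution on the examples from the question, here is a minimal sketch that wraps it in a helper. The name optional_parens_to_brackets is made up for illustration. The only change to the pattern is that the replacement tests defined $2 rather than the truth of $2. It handles parenthesized groups only; a bare stuff? without parentheses is left as it is.

```perl
use strict;
use warnings;

# Hypothetical helper: turn every "( ... )?" into "[ ... ]" in one pass,
# leaving parentheses that are not followed by "?" untouched.
sub optional_parens_to_brackets {
    my ($s) = @_;
    $s =~ s{
        (?(DEFINE)
            (?<r> \( (?: [^()]++ | (?&r) )*+ \) )
        )
        ( \( )
        (?= (?: [^()]++ | (?&r) )*+ \) \? )
      |
        \) \?
    }{
        defined $2 ? '[' : ']'
    }gex;
    return $s;
}

for my $in (
    '((((stuff)? that)? I)? like)?',
    '((((stuff) that) I) hate)',
    '(foo ((bar)? (baz))?)',
) {
    print optional_parens_to_brackets($in), "\n";
}
```

The first alternative matches an opening parenthesis only when the lookahead finds its closing parenthesis followed by ?, and the second alternative matches that closing )? itself, so each pair is rewritten without rescanning the string.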