capturing multiple instances of a pattern

capturing multiple instances of a pattern - regex

I have a string:
{value1}+{value2}-{value3}*{value...n}
using a regular expression, I want to capture each of the bracketed values as well as the operators in between them and I do not know how many brackets there will be.
I tried:
/(\{.*\}).*([\+|\-|\*|\/])*/mgU
but that is just getting me the values and not the operators. Where did I go wrong?

You can validate the string first with
/\A ({ [^{}]* }) (?: [\/+*-] (?1))* \z/x
Details:
\A - start of string
({[^{}]*}) - Group 1: a {, any zero or more chars other than { and } and then a } char
(?:[\/+*-](?1))* - zero or more occurrences of a /, +, * or - char and then the Group 1 pattern
\z - end of string.
Then, you may collect individual matches with
/ { [^{}]* } | [\/+*-] /gx
This regex matches all occurrences of any substrings between { and } (with {[^{}]*}) or /, +, * or - chars (with [\/+*-]).
See a complete demo script:
#!/usr/bin/perl
use strict;
use warnings;
my $text = "{value1}+{value2}-{value3}*{value...n}";
if ($text =~ /\A ({ [^{}]* }) (?: [\/+*-] (?1))* \z/x) {
while($text =~ / { [^{}]* } | [\/+*-] /gx) {
print "$&\n";
}
}
Output:
{value1}
+
{value2}
-
{value3}
*
{value...n}

Another idea might be using the \G anchor and 2 capture groups, where the curly values are in group 1 and the operator in group 2:
\G(?=.*{[^{}]*}\z)({[^{}]*})([+*\/-])?
The pattern matches
\G Assert the position at the end of the previous match, or at the start of the string (in this case)
(?=.*{[^{}]*}\z) Positive lookahead, assert that the string ends with a curly part
({[^{}]*}) Capture the curly braces in group 1
([+*\/-])? Optionally capture an operator in group 2
Regex demo | Perl demo
Example
my $str = "{value1}+{value2}-{value3}*{value...n}";
while ($str =~ /\G(?=.*\{[^{}]*}\z)({[^{}]*})([+*\/-])?/g) {
print "Curly value: $1 Operator: $2\n";
}
Output
Curly value: {value1} Operator: +
Curly value: {value2} Operator: -
Curly value: {value3} Operator: *
Curly value: {value...n} Operator:

The tokenizer approach:
my #tokens;
for ($str) {
while (1) {
/\G \s+ /xgc;
/\G \{ ( [^{}]* ) \} /xgc
and do { push #tokens, [ VALUE => $1 ]; next; };
/\G ( [+-*\/] ) /xgc
and do { push #tokens, [ OP => $1 ]; next; };
/\G \Z /xgc
and last;
die( "Unexpected character at pos ".( pos )."\n" );
}
}
It might be overkill, but it's easier to extend.

If you only have non-nested blocks, separated by a known list of operators, you can use split to very easily separate a statement into values and operators.
use strict;
use warnings;
use Data::Dumper;
my #val = split m#([-+/*])#, <DATA>; # parens will prevent operators from being consumed
print Dumper \#val;
__DATA__
{value1}+{value2}-{value3}*{valuen}/{value4}+{value5}-{value6}*{valuen}+{value7}+{value8}-{value9}
This will print:
$VAR1 = [
'{value1}',
'+',
'{value2}',
'-',
'{value3}',
'*',
'{valuen}',
'/',
'{value4}',
'+',
'{value5}',
'-',
'{value6}',
'*',
'{valuen}',
'+',
'{value7}',
'+',
'{value8}',
'-',
'{value9}
'
];
From there, it should be a simple task to validate and clean up the values, as well as identify the operators.

Related

perl regex to get comma not in parenthesis or nested parenthesis

I have a comma separated string and I want to match every comma that is not in parenthesis (parenthesis are guaranteed to be balanced).
a , (b) , (d$_,c) , ((,),d,(,))
The commas between a and (b), (b) and (d$,c), (d$,c) and ((,),d,(,)) should match but not inside (d$_,c) or ((,),d,(,)).
Note: Eventually I want to split the string by these commas.
It tried this regex:
(?!<(?:\(|\[)[^)\]]+),(?![^(\[]+(?:\)|\])) from here but it only works for non-nested parenthesis.

You may use
(\((?:[^()]++|(?1))*\))(*SKIP)(*F)|,
See the regex demo
Details
(\((?:[^()]++|(?1))*\)) - Capturing group 1: matches a substring between balanced parentheses:
\( - a ( char
(?:[^()]++|(?1))* - zero or more occurrences of 1+ chars other than ( and ) or the whole Group 1 pattern (due to the regex subroutine (?1) that is necessary here since only a part of the whole regex pattern is recursed)
\) - a ) char.
(*SKIP)(*F) - omits the found match and starts the next search from the end of the match
| - or
, - matches a comma outside nested parentheses.

A single regex for this is massively overcomplicated and difficult to maintain or extend. Here is an iterative parser approach:
use strict;
use warnings;
my $str = 'a , (b) , (d$_,c) , ((,),d,(,))';
my $nesting = 0;
my $buffer = '';
my #vals;
while ($str =~ m/\G([,()]|[^,()]+)/g) {
my $token = $1;
if ($token eq ',' and !$nesting) {
push #vals, $buffer;
$buffer = '';
} else {
$buffer .= $token;
if ($token eq '(') {
$nesting++;
} elsif ($token eq ')') {
$nesting--;
}
}
}
push #vals, $buffer if length $buffer;
print "$_\n" for #vals;
You can use Parser::MGC to construct this sort of parser more abstractly.

How to match string that contain exact 3 time occurrence of special character in perl

I have try few method to match a word that contain exact 3 times slash but cannot work. Below are the example
#array = qw( abc/ab1/abc/abc a2/b1/c3/d4/ee w/5/a s/t )
foreach my $string (#array){
if ( $string =~ /^\/{3}/ ){
print " yes, word with 3 / found !\n";
print "$string\n";
}
else {
print " no word contain 3 / found\n";
}
Few macthing i try but none of them work
$string =~ /^\/{3}/;
$string =~ /^(\w+\/\w+\/\w+\/\w+)/;
$string =~ /^(.*\/.*\/.*\/.*)/;
Any other way i can match this type of string and print the string?

Match a / globally and compare the number of matches with 3
if ( ( () = m{/}g ) == 3 ) { say "Matched 3 times" }
where the =()= operator is a play on context, forcing list context on its right side but returning the number of elements of that list when scalar context is provided on its left side.
If you are uncomfortable with such a syntax stretch then assign to an array
if ( ( my #m = m{/}g ) == 3 ) { say "Matched 3 times" }
where the subsequent comparison evaluates it in the scalar context.
You are trying to match three consecutive / and your string doesn't have that.

The pattern you need (with whitespace added) is
^ [^/]* / [^/]* / [^/]* / [^/]* \z
or
^ [^/]* (?: / [^/]* ){3} \z
Your second attempt was close, but using ^ without \z made it so you checked for string starting with your pattern.
Solutions:
say for grep { m{^ [^/]* (?: / [^/]* ){3} \z}x } #array;
or
say for grep { ( () = m{/}g ) == 3 } #array;
or
say for grep { tr{/}{} == 3 } #array;

You need to match
a slash
surrounded by some non-slashes (^(?:[^\/]*)
repeating the match exactly three times
and enclosing the whole triple in start of line and and of line anchors:
$string =~ /^(?:[^\/]*\/[^\/]*){3}$/;

if ( $string =~ /\/.*\/.*\// and $string !~ /\/.*\/.*\/.*\// )

How can I find sentences nested deeper than one bracket '()' set?

I want to print sentences from text file placed in () brackets deeper than one pair of brackets.
For example for this text file :
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
the output should be :
print me
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
hhhhh
This is what I've done so far:
#!/usr/bin/perl -w
open(FILE, "<", $ARGV[0]) or die "file open error";
if ( #ARGV ) #if there are args
{
if ( -f $ARGV[0] ) #if its regular file
{
while(<FILE>)
{
my #array = split('\)',$_);
foreach(#array)
{
if ($_ =~ /.*\((.*)/)
{
print "$1\n";
}
}
}
close(FILE);
}
else{
print "Arg is not a file\n";}
}
else{
print "no args\n";}
My code can't separate the sentences placed in deeper brackets.

Assuming brackets are balanced:
use strict;
use warnings;
my #a;
while (<DATA>) {
while (/\(([^()]*(?:\(((?1))\)[^()]*(?{push #a, $2}))*+)\)/g){}
}
print join "\n", #a;
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bb(xxxx)b))aa)
blabla (blabla(hhhhh))
It returns:
print me
cccc
xxxx
bbbb(cccc)bb(xxxx)b
eeee(bbbb(cccc)bb(xxxx)b)
hhhhh
The idea is to store the capture group 2 content after each recursion, using the (?{...}) construct to execute code in the pattern.
Note that the order of results isn't ideal since the innermost content appears first. Unfortunately, I didn't find a way to change the order of results.
Pattern details:
\( # opening bracket level 1
( # open capture group 1
[^()]* # all that is not a bracket
(?:
\( # opening bracket for level 2 (or more when a recursion occurs)
( # capture group 2: to store the result
(?1) # recursion
)
\) # closing bracket for level 2 (or more ...)
[^()]* #
(?{push #a, $2}) # store the capture group 2 content in #a
)*+ # repeat when needed
)
\) # closing bracket level 1
EDIT: This pattern assumes that brackets are balanced, but if it isn't the case, this may cause problems of unwanted results for certain strings. The reason is that results are stored before the whole pattern succeeds.
Example with the string 1234 ( 5678 (abcd(efgh)ijkl) where a closing bracket is missing:
1234 ( 5678 (abcd(efgh)ijkl)
# ^ ^---- second attempt succeeds, "efgh" is stored
# '---- first attempt fails, but "efgh", "abcd(efgh)ijkl" are stored
To solve the problem, you can choose between two default behaviours:
the strict behaviour that only accepts balanced brackets. All you need is to store the results in a temporary array and to reset this array in the while loop or when a closing bracket is missing. In this case the result will only be "efgh":
my #a;
my #b;
while (<DATA>) {
while (/\(([^()]*(?:\(((?1))\)[^()]*(?{push #b, $2}))*+)(?:\)|(?{undef #b})(*F))/g) {
push #a, #b;
undef #b;
}
}
a more tolerant behaviour that doesn't make mandatory the closing bracket. To do that you must replace each \) with (?:\)|$). In this case, the first attempt succeeds and consumes characters until the end of the string (in other words, there isn't a second attempt). The results are "efgh" and "abcd(efgh)ijkl"

This is probably easiest, and the most maintainable with a two-pass solution.
The initial pass captures all first level parentheses. The second pass captures all enclosed parenthesis groups, only advancing a single character in order to match every level of embedded paren groups:
#!/usr/bin/env perl
use strict;
use warnings;
use v5.10;
my $data = do { local $/; <DATA> };
my $parens_content_re = qr{
\(
(
(?:
[^()]*+
|
\( (?1) \)
)*
)
\)
}x;
say for map {/(?=$parens_content_re)\(/g} map {/$parens_content_re/g} $data;
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
----(----(aaaa(123)bbbb(456)cccc)----)----
Outputs:
$ perl parens.pl
print me
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
hhhhh
aaaa(123)bbbb(456)cccc
123
456

This code works by capturing levels recursively, using a simple regex for ) and split-ing by ( for the opening paren. It first prepares by peeling off the two starting layers of nesting. It works for shown examples, and a few others. However, there are other ways to nest pairs, for which rules are not specified. Also, this is probably rough around the edges. There is no magic of any kind involved and adjusting code for new cases should be feasible.
use warnings;
use strict;
my ($lev, #el, #res, $rret);
while (my $str = <DATA>)
{
print "\nString: $str\n";
#res = ();
# Drop two layers to start: strip last two ), split by ( and drop 0,1
$str =~ s/ (.*) \) [^)]* \) [^)]* $/$1/x;
#el = split '\(', $str;
#el = #el[2..$#el];
# Edge case: may have one element and be done, but with extra )
if (#el > 1) { $lev = join '(', #el }
else { ($lev = $el[0]) =~ s|\)||g }
push #res, $lev;
# Get next level and join string back, recursively
while ( $rret = nest_one($lev) ) {
$lev = join '(', #$rret;
push #res, $lev;
last if #$rret == 1;
}
print "\t$_\n" for #res;
}
# Strip last ) and past it, split by ( and drop first element
sub nest_one {
(my $lev = $_[0]) =~ s/(.*) \) [^)]* $/$1/x;
my #el = split '\(', $lev;
shift #el;
return (#el) ? \#el : undef;
}
__DATA__
blabla(nothing(print me)) nanana (nanan)
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
blabla (blabla(hhhhh))
It prints
blabla(nothing(print me)) nanana (nanan)
print me
blablabla(aaaaaaa(eeee(bbbb(cccc)bbb))aa)
eeee(bbbb(cccc)bbb)
bbbb(cccc)bbb
cccc
blabla (blabla(hhhhh))
hhhhh

Regex with recursive expression to match nested braces?

I'm trying to match text like sp { ...{...}... }, where the curly braces are allowed to nest. This is what I have so far:
my $regex = qr/
( #save $1
sp\s+ #start Soar production
( #save $2
\{ #opening brace
[^{}]* #anything but braces
\} #closing brace
| (?1) #or nested braces
)+ #0 or more
)
/x;
I just cannot get it to match the following text: sp { { word } }. Can anyone see what is wrong with my regex?

There are numerous problems. The recursive bit should be:
(
(?: \{ (?-1) \}
| [^{}]+
)*
)
All together:
my $regex = qr/
sp\s+
\{
(
(?: \{ (?-1) \}
| [^{}]++
)*
)
\}
/x;
print "$1\n" if 'sp { { word } }' =~ /($regex)/;

This is case for the underused Text::Balanced, a very handy core module for this kind of thing. It does rely on the pos of the start of the delimited sequence being found/set first, so I typically invoke it like this:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::Balanced 'extract_bracketed';
sub get_bracketed {
my $str = shift;
# seek to beginning of bracket
return undef unless $str =~ /(sp\s+)(?={)/gc;
# store the prefix
my $prefix = $1;
# get everything from the start brace to the matching end brace
my ($bracketed) = extract_bracketed( $str, '{}');
# no closing brace found
return undef unless $bracketed;
# return the whole match
return $prefix . $bracketed;
}
my $str = 'sp { { word } }';
print get_bracketed $str;
The regex with the gc modifier tells the string to remember where the end point of the match is, and extract_bracketed uses that information to know where to start.

Matching balanced parenthesis in Perl regex

I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8 and could someone resolve this?

Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
#genes = $l =~ /($bp)/g;

There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;

I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my #out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push #out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)

To match balanced parenthesis or curly brackets, and if you want to take under account backslashed (escaped) ones, the proposed solutions would not work. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;

Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my #arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\#arr);

Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

capturing multiple instances of a pattern - regex

Related

perl regex to get comma not in parenthesis or nested parenthesis

How to match string that contain exact 3 time occurrence of special character in perl

How can I find sentences nested deeper than one bracket '()' set?

Regex with recursive expression to match nested braces?

Matching balanced parenthesis in Perl regex

Categories

Resources