perl regex to get comma not in parenthesis or nested parenthesis - regex

I have a comma separated string and I want to match every comma that is not in parenthesis (parenthesis are guaranteed to be balanced).
a , (b) , (d$_,c) , ((,),d,(,))
The commas between a and (b), (b) and (d$,c), (d$,c) and ((,),d,(,)) should match but not inside (d$_,c) or ((,),d,(,)).
Note: Eventually I want to split the string by these commas.
It tried this regex:
(?!<(?:\(|\[)[^)\]]+),(?![^(\[]+(?:\)|\])) from here but it only works for non-nested parenthesis.

You may use
(\((?:[^()]++|(?1))*\))(*SKIP)(*F)|,
See the regex demo
Details
(\((?:[^()]++|(?1))*\)) - Capturing group 1: matches a substring between balanced parentheses:
\( - a ( char
(?:[^()]++|(?1))* - zero or more occurrences of 1+ chars other than ( and ) or the whole Group 1 pattern (due to the regex subroutine (?1) that is necessary here since only a part of the whole regex pattern is recursed)
\) - a ) char.
(*SKIP)(*F) - omits the found match and starts the next search from the end of the match
| - or
, - matches a comma outside nested parentheses.

A single regex for this is massively overcomplicated and difficult to maintain or extend. Here is an iterative parser approach:
use strict;
use warnings;
my $str = 'a , (b) , (d$_,c) , ((,),d,(,))';
my $nesting = 0;
my $buffer = '';
my #vals;
while ($str =~ m/\G([,()]|[^,()]+)/g) {
my $token = $1;
if ($token eq ',' and !$nesting) {
push #vals, $buffer;
$buffer = '';
} else {
$buffer .= $token;
if ($token eq '(') {
$nesting++;
} elsif ($token eq ')') {
$nesting--;
}
}
}
push #vals, $buffer if length $buffer;
print "$_\n" for #vals;
You can use Parser::MGC to construct this sort of parser more abstractly.

Related

capturing multiple instances of a pattern

I have a string:
{value1}+{value2}-{value3}*{value...n}
using a regular expression, I want to capture each of the bracketed values as well as the operators in between them and I do not know how many brackets there will be.
I tried:
/(\{.*\}).*([\+|\-|\*|\/])*/mgU
but that is just getting me the values and not the operators. Where did I go wrong?
You can validate the string first with
/\A ({ [^{}]* }) (?: [\/+*-] (?1))* \z/x
Details:
\A - start of string
({[^{}]*}) - Group 1: a {, any zero or more chars other than { and } and then a } char
(?:[\/+*-](?1))* - zero or more occurrences of a /, +, * or - char and then the Group 1 pattern
\z - end of string.
Then, you may collect individual matches with
/ { [^{}]* } | [\/+*-] /gx
This regex matches all occurrences of any substrings between { and } (with {[^{}]*}) or /, +, * or - chars (with [\/+*-]).
See a complete demo script:
#!/usr/bin/perl
use strict;
use warnings;
my $text = "{value1}+{value2}-{value3}*{value...n}";
if ($text =~ /\A ({ [^{}]* }) (?: [\/+*-] (?1))* \z/x) {
while($text =~ / { [^{}]* } | [\/+*-] /gx) {
print "$&\n";
}
}
Output:
{value1}
+
{value2}
-
{value3}
*
{value...n}
Another idea might be using the \G anchor and 2 capture groups, where the curly values are in group 1 and the operator in group 2:
\G(?=.*{[^{}]*}\z)({[^{}]*})([+*\/-])?
The pattern matches
\G Assert the position at the end of the previous match, or at the start of the string (in this case)
(?=.*{[^{}]*}\z) Positive lookahead, assert that the string ends with a curly part
({[^{}]*}) Capture the curly braces in group 1
([+*\/-])? Optionally capture an operator in group 2
Regex demo | Perl demo
Example
my $str = "{value1}+{value2}-{value3}*{value...n}";
while ($str =~ /\G(?=.*\{[^{}]*}\z)({[^{}]*})([+*\/-])?/g) {
print "Curly value: $1 Operator: $2\n";
}
Output
Curly value: {value1} Operator: +
Curly value: {value2} Operator: -
Curly value: {value3} Operator: *
Curly value: {value...n} Operator:
The tokenizer approach:
my #tokens;
for ($str) {
while (1) {
/\G \s+ /xgc;
/\G \{ ( [^{}]* ) \} /xgc
and do { push #tokens, [ VALUE => $1 ]; next; };
/\G ( [+-*\/] ) /xgc
and do { push #tokens, [ OP => $1 ]; next; };
/\G \Z /xgc
and last;
die( "Unexpected character at pos ".( pos )."\n" );
}
}
It might be overkill, but it's easier to extend.
If you only have non-nested blocks, separated by a known list of operators, you can use split to very easily separate a statement into values and operators.
use strict;
use warnings;
use Data::Dumper;
my #val = split m#([-+/*])#, <DATA>; # parens will prevent operators from being consumed
print Dumper \#val;
__DATA__
{value1}+{value2}-{value3}*{valuen}/{value4}+{value5}-{value6}*{valuen}+{value7}+{value8}-{value9}
This will print:
$VAR1 = [
'{value1}',
'+',
'{value2}',
'-',
'{value3}',
'*',
'{valuen}',
'/',
'{value4}',
'+',
'{value5}',
'-',
'{value6}',
'*',
'{valuen}',
'+',
'{value7}',
'+',
'{value8}',
'-',
'{value9}
'
];
From there, it should be a simple task to validate and clean up the values, as well as identify the operators.

How to match string that contain exact 3 time occurrence of special character in perl

I have try few method to match a word that contain exact 3 times slash but cannot work. Below are the example
#array = qw( abc/ab1/abc/abc a2/b1/c3/d4/ee w/5/a s/t )
foreach my $string (#array){
if ( $string =~ /^\/{3}/ ){
print " yes, word with 3 / found !\n";
print "$string\n";
}
else {
print " no word contain 3 / found\n";
}
Few macthing i try but none of them work
$string =~ /^\/{3}/;
$string =~ /^(\w+\/\w+\/\w+\/\w+)/;
$string =~ /^(.*\/.*\/.*\/.*)/;
Any other way i can match this type of string and print the string?
Match a / globally and compare the number of matches with 3
if ( ( () = m{/}g ) == 3 ) { say "Matched 3 times" }
where the =()= operator is a play on context, forcing list context on its right side but returning the number of elements of that list when scalar context is provided on its left side.
If you are uncomfortable with such a syntax stretch then assign to an array
if ( ( my #m = m{/}g ) == 3 ) { say "Matched 3 times" }
where the subsequent comparison evaluates it in the scalar context.
You are trying to match three consecutive / and your string doesn't have that.
The pattern you need (with whitespace added) is
^ [^/]* / [^/]* / [^/]* / [^/]* \z
or
^ [^/]* (?: / [^/]* ){3} \z
Your second attempt was close, but using ^ without \z made it so you checked for string starting with your pattern.
Solutions:
say for grep { m{^ [^/]* (?: / [^/]* ){3} \z}x } #array;
or
say for grep { ( () = m{/}g ) == 3 } #array;
or
say for grep { tr{/}{} == 3 } #array;
You need to match
a slash
surrounded by some non-slashes (^(?:[^\/]*)
repeating the match exactly three times
and enclosing the whole triple in start of line and and of line anchors:
$string =~ /^(?:[^\/]*\/[^\/]*){3}$/;
if ( $string =~ /\/.*\/.*\// and $string !~ /\/.*\/.*\/.*\// )

Perl all matches of a regexp in a given string

Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.
local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;

Matching balanced parenthesis in Perl regex

I have an expression which I need to split and store in an array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
It should look like this once split and stored in the array:
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }
aaa="bbb{}" { aa="b}b" }
aaa="bbb,ccc"
I use Perl version 5.8 and could someone resolve this?
Use the perl module "Regexp::Common". It has a nice balanced parenthesis Regex that works well.
# ASN.1
use Regexp::Common;
$bp = $RE{balanced}{-parens=>'{}'};
#genes = $l =~ /($bp)/g;
There's an example in perlre, using the recursive regex features introduced in v5.10. Although you are limited to v5.8, other people coming to this question should get the right solution :)
$re = qr{
( # paren group 1 (full function)
foo
( # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> [^()]+ ) # Non-parens without backtracking
|
(?2) # Recurse to start of paren group 2
)*
)
\)
)
)
}x;
I agree with Scott Rippey, more or less, about writing your own parser. Here's a simple one:
my $in = 'aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } }, ' .
'aaa="bbb{}" { aa="b}b" }, ' .
'aaa="bbb,ccc"'
;
my #out = ('');
my $nesting = 0;
while($in !~ m/\G$/cg)
{
if($nesting == 0 && $in =~ m/\G,\s*/cg)
{
push #out, '';
next;
}
if($in =~ m/\G(\{+)/cg)
{ $nesting += length $1; }
elsif($in =~ m/\G(\}+)/cg)
{
$nesting -= length $1;
die if $nesting < 0;
}
elsif($in =~ m/\G((?:[^{}"]|"[^"]*")+)/cg)
{ }
else
{ die; }
$out[-1] .= $1;
}
(Tested in Perl 5.10; sorry, I don't have Perl 5.8 handy, but so far as I know there aren't any relevant differences.) Needless to say, you'll want to replace the dies with something application-specific. And you'll likely have to tweak the above to handle cases not included in your example. (For example, can quoted strings contain \"? Can ' be used instead of "? This code doesn't handle either of those possibilities.)
To match balanced parenthesis or curly brackets, and if you want to take under account backslashed (escaped) ones, the proposed solutions would not work. Instead, you would write something like this (building on the suggested solution in perlre):
$re = qr/
( # paren group 1 (full function)
foo
(?<paren_group> # paren group 2 (parens)
\(
( # paren group 3 (contents of parens)
(?:
(?> (?:\\[()]|(?![()]).)+ ) # escaped parens or no parens
|
(?&paren_group) # Recurse to named capture group
)*
)
\)
)
)
/x;
Try something like this:
use strict;
use warnings;
use Data::Dumper;
my $exp=<<END;
aaa="bbb{ccc}ddd" { aa="bb,cc" { a="b", c="d" } } , aaa="bbb{}" { aa="b}b" }, aaa="bbb,ccc"
END
chomp $exp;
my #arr = map { $_ =~ s/^\s*//; $_ =~ s/\s* $//; "$_}"} split('}\s*,',$exp);
print Dumper(\#arr);
Although Recursive Regular Expressions can usually be used to capture "balanced braces" {}, they won't work for you, because you ALSO have the requirement to match "balanced quotes" ".
This would be a very tricky task for a Perl Regular Expression, and I'm fairly certain it's not possible. (In contrast, it could probably be done with Microsoft's "balancing groups" Regex feature).
I would suggest creating your own parser. As you process each character, you count each " and {}, and only split on , if they are "balanced".

Finding results from and between groups of parentheses with regexp

Text format:
(Superships)
Eirik Raude - olajkutató fúrósziget
(Eirik Raude - Oil Patch Explorer)
I need regex to match text beetween first set of parentheses. Results: text1.
I need regex to match text beetween first set of parentheses and second set of parentheses. Results: text2.
I need regex to match text beetween second set of parentheses. Results: text3.
text1: Superships, represent english title,
text2: Eirik Raude - olajkutató fúrósziget, represent hungarian subtitle,
text3: Eirik Raude - Oil Patch Explorer, represent english subtitle.
I need regex for perl script to match this title and subtitle. Example script:
($anchor) = $tree->look_down(_tag=>"h1", class=>"blackbigtitle");
if ($anchor) {
$elem = $anchor;
my ($engtitle, $engsubtitle, $hunsubtitle #tmp);
while (($elem = $elem->right()) &&
((ref $elem) && ($elem->tag() ne "table"))) {
#tmp = get_all_text($elem);
push #lines, #tmp;
$line = join(' ', #tmp);
if (($engtitle) = $line =~ m/**regex need that return text1**/) {
push #{$prog->{q(title)}}, [$engtitle, 'en'];
t "english-title added: $engtitle";
}
elsif (($engsubtitle) = $line =~ m/**regex need that return text3**/) {
push #{$prog->{q(sub-title)}}, [$subtitle, 'en'];
t "english_subtitle added: $engsubtitle";
}
elsif (($hunsubtitle) = $line =~ m/**regex need that return text2**/) {
push #{$prog->{q(hun-subtitle)}}, [$hunsubtitle, 'hu'];
t "hungarinan_subtitle added: $hunsubtitle";
}
}
}
Considering your comment, you can do something like :
if (($english_title) = $line =~ m/^\(([^)]+)\)$/) {
$found_english_title = 1;
# do stuff
} elsif (($english-subtitle) = $line =~ m/^([^()]+)$/) {
# do stuff
} elsif ($found_english_title && ($hungarian-title) = $line =~ m/^\(([^)]+)\)$/) {
# do stuff
}
If you need to match them all in one expression:
\(([^)]+)\)([^(]+)\(([^)]+)\)
This matches (, then anything that's not ), then ), then anything that's not (, then, (, ... I think you get the picture.
First group will be text1, second group will be text2, third group will be text3.
You can also just make a more generix regex that matches something like "(text1)", "(text1)text2(text3)" or "text1(text2)" when applied several times:
(?:^|[()])([^()])(?:[()]|$)
This matches the beginning of the string or ( or ), then characters that are not ( or ), then ( or ) or the end of the string. :? is for non-capturing group, so the first group will have the string. Something more complex is necessary to match ( with ) every time, i.e., it can match "(text1(".