I'm looking for the accumulation of possibly overlapping matches of a regex (the final goal being to do further searches in the resulting substrings).
I want to skip the matches that have already been "accumulated", while avoiding to make copies with substr (I might be wrong about avoiding substr), but the condition that I wrote for it with pos($...) = ... and a next if $... =~ /.../ doesn't work:
#!/usr/bin/env perl
# user inputs
$regexp = "abc|cba|b";
$string = "_abcbabc_bacba";
$length = length($string);
$result = "0" x $length;
while ( pos($string) < $length and $string =~ /$regexp/go ) {
pos($string) = $-[0] + 1;
next unless ($len = $+[0] - $-[0]);
# The failing condition is here:
# pos($result) = $-[0];
# next if $result =~ /1{$len}/;
substr($result, $-[0], $len) = "1" x $len;
printf "%s\n", $string;
printf "%".$-[0]."s%s\n", "", "^" x $len;
}
printf "%s\n", $result;
By commenting those lines I can get the desired result which is 01111111010111:
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
01111111010111
But my expected output (with a working condition) would be:
_abcbabc_bacba
^^^
_abcbabc_bacba
^^^
_abcbabc_bacba
^^^
_abcbabc_bacba
^
_abcbabc_bacba
^^^
01111111010111
notes:
for each iteration I print the original string; the ^ just below show the characters that have been matched in the current iteration.
the 0 & 1 at the end represent the overall result. The characters that have been matched at least once during the process are set to 1.
My commented condition is meant to skip the current match when its corresponding characters are already set to 1 in the result.
I think you really want to find the longest overlapping sub match. If you can guarantee that the alternation will have the substrings in the order that you prefer them, that approach might work, but it also requires to know a lot about what is happening besides the match, and in future matches. That is, you don't know if you can output anything until you've the future matches that might overlap, and you can't tell how far into the future you need to look.
You can mess around with pos, but I think I'd just match each substring separately, remember the starting positions, then compare later. Decompose the problem into separate tasks for finding the matching positions and for deciding which ones you want.
Even if I wrote the same code you presented, it's unlikely that I'd remember everything that must happen just right to make it all work out if I had to see it again after a long absence (even if I did highlight #- and #+ in the first chapter of Mastering Perl ;)
use v5.10;
use strict;
my $target = "_abcbabc_bacba";
my #looking_for = qw( abc cba b );
my #found;
foreach my $want ( #looking_for ) {
my $pos = 0;
while( my $found_at = index $target, $want, $pos ) {
last if $found_at == -1;
push #found, $found_at;
$pos = $found_at + 1;
}
}
my #found = sort { $a->[1] <=> $b->[1] } #found;
use Data::Dumper;
say Dumper( \#found );
Now you have a data structure that you can massage any way that you like instead of thinking about all this stuff while in regex land. How you decide to do that is left as an exercise for the reader.
$VAR1 = [
[
'abc',
1
],
[
'b',
2
],
[
'cba',
3
],
[
'b',
4
],
[
'abc',
5
],
[
'b',
6
],
[
'b',
9
],
[
'cba',
11
],
[
'b',
12
]
];
Part of this may be inline. You can build up this data structure to the point where you know that everything you have so far can produce output (i.e. the thing you just matched does not overlap with the prior thing).
Related
I have a long string, containing alphabetic words and each delimited by one single character ";" . The whole string also starts and ends with a ";" .
How do I count the number of occurrences of a pattern (started with ";") if index of a success match is divisible by 5.
Example:
$String = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;"
$Pattern = ";the(?=;f)"
OUTPUT: 1
Since:
Note 1: In above case, the $Pattern ;the(?=;f) exists as the 1st and 10th words in the $String; however; the output result would be 1, since only the index of second match (10) is divisible by 5.
Note 2: Every word delimited by ";" counts toward the index set.
Index of the = 1 -> this does not match since 1 is not divisible by 5
Index of fox = 2
Index of jumped = 3
Index of over = 4
Index of the = 5 -> this does not match since the next word (dog) starts with "d" not "f"
Index of dog = 6
Index of the = 7 -> this does not match since 7 is not divisible by 5
Index of duck = 8
Index of and = 9
Index of the = 10 -> this does match since 10 is divisible by 5 and the next word (frog) starts with "f"
Index of frog = 11
If possible, I am wondering if there is a way to do this with a single pattern matching without using list or array as the $String is extremely long.
Use Backtracking control verbs to process the string 5 words at a time
One solution is to add a boundary condition that the pattern is preceded by 4 other words.
Then setup an alteration so that if your pattern is not matched, the 5th word is gobbled and then skipped using backtracking control verbs.
The following demonstrates:
#!/usr/bin/env perl
use strict;
use warnings;
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;";
my $pattern = qr{;the(?=;f)};
my #matches = $string =~ m{
(?: ;[^;]* ){4} # Preceded by 4 words
(
$pattern # Match Pattern
|
;(*SKIP)(*FAIL) # Or consume 5th word and skip to next part of string.
)
}xg;
print "Number of Matches = " . #matches . "\n";
Outputs:
Number of Matches = 1
Live Demo
Supplemental Example using Numbers 1 through 100 in words
For additional testing, the following constructs a string of all numbers in word format from 1 to 100 using Lingua::EN::Numbers.
For the pattern it looks for a number that's a single word with the next number that begins with the letter S.
use Lingua::EN::Numbers qw(num2en);
my $string = ';' . join( ';', map { num2en($_) } ( 1 .. 100 ) ) . ';';
my $pattern = qr{;\w+(?=;s)};
my #matches = $string =~ m{(?:;[^;]*){4}($pattern|;(*SKIP)(*FAIL))}g;
print "#matches\n";
Outputs:
;five ;fifteen ;sixty ;seventy
Reference for more techniques
The following question from last month is a very similar problem. However, I provided 5 different solutions in addition to the one demonstrated here:
In Perl, how to count the number of occurences of successful matches based on a condition on their absolute positions
You can count the number of semicolons in each substring up to the matching position. For a million-word string, it takes 150 seconds.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = join ';', q(),
map { qw( the fox jumped over the dog the duck and the frog)[int rand 11] }
1 .. 1000;
$string .= ';';
my $pattern = qr/;the(?=;f)/;
while ($string =~ /$pattern/g) {
my $count = substr($string, 0, pos $string) =~ tr/;//;
say $count if 0 == $count % 5;
}
Revised Answer
One relatively simple way to achieve what you want is by replacing the delimiters in the original text that occur on a 5-word-index boundary:
$text =~ s/;/state $idx++ % 5 ? ',' : ';'/eg;
Now you just need to trivially adjust your $pattern to look for ;the,f instead of ;the;f. You can use the =()= pseudo-operator to return the count:
my $count =()= $text =~ /;the(?=,f)/g;
Original answer after the break. (Thanks to #choroba for pointing out the correct interpretation of the question.)
Character-Based Answer
This uses the /g regex modifier in combination with pos() to look at matching words. For illustration, I print out all matches (not just those on 5-character boundaries), but I print (match) beside those on 5-char boundaries. The output is:
;the;fox;jumped;over;the;dog;the;duck;and;the;frog
^....^....^....^....^....^....^....^....^....^....
`the' #0 (match)
`the' #41
And the code is:
#!/usr/bin/env perl
use 5.010;
my $text = ';the;fox;jumped;over;the;dog;the;duck;and;the;frog';
say $text;
say '^....^....' x 5;
my $pat = qr/;(the)(?=;f)/;
#$pat = qr/;([^;]+)/;
while ($text =~ /$pat/g) {
my $pos = pos($text) - length($1) - 1;
say "`$1' \#$pos". ($pos % 5 ? '' : ' (match)');
}
First of, pos is also possible as a left hand side expression. You could make use of the \G assertion in combination with index (since speed is of concern for you). I expanded your example to showcase that it only "matches" for divisibles of 5 (your example also allowed for indices not divisible by 5 to be 1 a solution, too). Since you only wanted the number of matches, I only used a $count variable and incremented. If you want something more, use the normal if {} clause and do something in the block.
my $string = ";the;fox;jumped;over;the;dog;the;duck;and;the;frog;or;the;fish";
my $pattern = qr/;the(?=;f)/;
my ($index,$count, $position) = (0,0,0);
while(0 <= ($position = index $string, ';',$position)){
pos $string = $position++; #add one to $position, to terminate the loop
++$count if (!(++$index % 5) and $string =~/\G$pattern/);
}
say $count; # says 1, not 2
You could use the experimental features of regexes to solve you problem (especially the (?{}) blocks). Before you do, you really should read the corresponding section in the perldocs.
my ($index, $count) = (0,0);
while ($string =~ /; # the `;'
(?(?{not ++$index % 5}) # if with a code condition
the(?=;f) # almost your pattern, but we'll have to count
|(*FAIL)) # else fail
/gx) {
$count++;
}
Perl's regexp matching is left-greedy, so that the regexp
/\A (a+) (.+) \z/x
matching the string 'aaab', will set $1='aaa' and $2='b'.
(The \A and \z are just to force start and end of the string.)
You can also give non-greedy qualifiers, as
/\A (a+?) (.+?) \z/x
This will still match, but give $1='a' and $2='aab'.
But I would like to check all possible ways to generate the string, which are
$1='aaa' $2='b'
$1='aa' $2='ab'
$1='a' $2='aab'
The first way corresponds to the default left-greedy behaviour, and the third way corresponds to making the first match non-greedy, but there may be ways in between those extremes. Is there a regexp engine (whether Perl's, or some other such as PCRE or RE2) which can be made to try all possible ways that the regexp specified generates the given string?
Among other things, this would let you implement 'POSIX-compatible' regexp matching where the longest total match is picked. In my case I really would like to see every possibility.
(One way would be to munge the regexp itself, replacing the + modifier with {1,1} on the first attempt, then {1,2}, {1,3} and so on - for each combination of + and * modifiers in the regexp. That is very laborious and slow, and it's not obvious when to stop. I hope for something smarter.)
Background
To answer Jim G.'s question on what problem this might solve, consider a rule-based translation system between two languages, given by the rules
translate(any string of one or more 'a' . y) = 'M' . translate(y)
translate('ab') = 'U'
Then there is a possible result of translate('aaab'), namely 'MU'.
You might try to put these rules into Perl code based on regexps, as
our #m;
my #rules = (
[ qr/\A (a+) (.*) \z/x => sub { 'M' . translate($m[1]) } ],
[ qr/\A ab \z/x => sub { 'U' } ],
);
where translate runs over each of #rules and tries to apply them in turn:
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
$in =~ $lhs or next;
local #m = ($1, $2);
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
return 'fail';
}
However, calling translate('aaab') returns 'fail'. This is because
it tries to apply the first rule matching (a+)(.*) and the regexp
engine finds the match with the longest possible string of 'a'.
Using the answer suggested by ikegami, we can try all ways in which
the regular expression generates the string:
use re 'eval';
sub translate {
my $in = shift;
foreach (#rules) {
my ($lhs, $rhs) = #$_;
local our #matches;
$in =~ /$lhs (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
foreach (#matches) {
local #m = #$_;
my $r = &$rhs;
next if index($r, 'fail') != -1;
return $r;
}
}
return 'fail';
}
Now translate('aaab') returns 'MU'.
local our #matches;
'aaab' =~ /^ (a+) (.+) \z (?{ push #matches, [ $1, $2 ] }) (*FAIL)/x;
Is possible to force a Perl 5 regexp match longest possible string, if the regexp is, for example:
a|aa|aaa
I found is probably default in perl 6, but in perl 5, how i can get this behavior?
EXAMPLE pattern:
[0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9]
If I have string 2.10.2014, then first match will be 2, which is ok; but the next match will be 1, and this is not ok because it should be 10. Then 2014 will be 4 subsequently matches 2,0,1,4, but it should be 2014 using [0-9][0-9][0-9][0-9]. I know I could use [0-9]+, but I can't.
General solution: Put the longest one first.
my ($longest) = /(aaa|aa|a)/
Specific solution: Use
my ($longest) = /([0-9]{4}|[0-9]{1,2})/
If you can't edit the pattern, you'll have to find every possibility and find the longest of them.
my $longest;
while (/([0-9]|[0-9][0-9]|[0-9][0-9][0-9][0-9])/g) {
$longest = $1 if length($1) > length($longest);
}
The sanest solution I can see for unknown patterns is to match every possible pattern, look at the length of the matched substrings and select the longest substring:
my #patterns = (qr/a/, qr/a(a)/, qr/b/, qr/aaa/);
my $string = "aaa";
my #substrings = map {$string =~ /($_)/; $1 // ()} #patterns;
say "Matched these substrings:";
say for #substrings;
my $longest_token = (sort { length $b <=> length $a } #substrings)[0];
say "Longest token was: $longest_token";
Output:
Matched these substrings:
a
aa
aaa
Longest token was: aaa
For known patterns, one would sort them manually so that first-match is the same as longest-match:
"aaa" =~ /(aaa|aa|b|a)/;
say "I know that this was the longest substring: $1";
The alternation will use the first alternative that matches, so just write /aaa|aa|a/ instead.
For the example you have shown in your question, just put the longest alternative first like I said:
[0-9][0-9][0-9][0-9]|[0-9][0-9]|[0-9]
perl -Mstrict -Mre=/xp -MData::Dumper -wE'
{package Data::Dumper;our($Indent,$Sortkeys,$Terse,$Useqq)=(1)x4}
sub _dump { Dumper(shift) =~ s{(\[.*?\])}{$1=~s/\s+/ /gr}srge }
my ($count, %RS);
my $s= "aaaabbaaaaabbab";
$s =~ m{ \G a+b? (?{ $RS{ $+[0] - $-[0] } //= [ ${^MATCH}, $-[0] ]; $count++ }) (*FAIL) };
say sprintf "RS: %s", _dump(\%RS);
say sprintf "count: %s", $count;
'
RS: {
"1" => [ "a", 0 ],
"2" => [ "aa", 0 ],
"3" => [ "aaa", 0 ],
"4" => [ "aaaa", 0 ],
"5" => [ "aaaab", 0 ]
}
count: 5
The value of $s is dynamic. I need to extract the values that occur after the last | in between each [].
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = split(/\]/, $s);
foreach my $part (#parts)
{
# Need to extract the values that occur after the last '|'
# (for example: !, .1iit, 10:48AM, Calculator, Coffee)
# and store each of the values separately in a hash
}
Could someone help me out in this?
Thanks,
Best to transform the string into a more useful data structure, then take the needed elements. Why is this best? Because right now you need the last element, but perhaps next time you will need some other part. Since its not harder to do it right, why not?
#!/usr/bin/perl
use strict;
use warnings;
# Only needed for Dumper
use Data::Dumper;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
# Extract each group between []
# Then transform each group into an array reference by splitting on |
my #groups = map { [ split /\|/ ] } ($s =~ /\[([^\]]*)\]/g);
# Inspect the data structure
print Dumper \#groups;
# Print only the last element of each sub-array
print "$_\n" for map {$_->[-1]} #groups;
If needed the third elements of the sub-arrays could be transformed into hashrefs quite easily too. ,however since that wasn't needed, I leave that as an exercise for the reader (I always love saying that when I get the chance!).
Edit: since I found it interesting I ended up creating these hashrefs, here is the code that would replace the my #groups line:
my #groups = map { [ map { /\{([^\}]*)\}/ ? { split /(?:=|,)/, $1 } : $_ } (split /\|/) ] } ($s =~ /\[([^\]]*)\]/g);
or more properly commented (map commands are read from the back, so the comments start at the bottom and follow by number, comments like #/N pair with those like #N)
my #groups = map { #/1
[ #/2
map { #/3
/\{([^\}]*)\}/ #4 ... and if any element (separated by pipes in #3)
# is surrounded by curly braces
? { #5 ... then return a hash ref
split /(?:=|,)/, $1 #6 ... whose elements are given
# pairwise between '=' or ',' signs
} #/5
: $_ #7 ... otherwise (from 'if' in #4 ) return the element as is
} (split /\|/) #3 ... where each element is separated by pipes (i.e. |)
] #2 ... return an array ref
} ($s =~ /\[([^\]]*)\]/g); #1 For each element between sqr braces (i.e. [])
The generic way:
#subparts = split /\|/, $part;
$tail = $subparts[$#subparts];
If you only ever need the last part separately:
$part =~ /([^\|]*)$/ and $tail = $1;
my ($value) = $part =~ m/[^|]\|(.+)$/;
print "$part => $value\n";
and another way:
my $s =
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
my #parts = $s =~ m/\|([^|]+)]/g;
print join( "\n", #parts );
Since you insist on a regex:
#matches = $s =~ /\|([^|]+?)]/g
Using /g will dump all matches into the array #matches
You really don't need a regex... just use split(). The results are stored in %results
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|Coffee]";
foreach my $part (split(/\]/, $s))
{
#pieces = split(/\|/, $part);
$results{$pieces[-1]} = $pieces[-1];
}
With regexes, when you think “I want the last of,” you should immediately think of the pattern .* because regex greed does just what you want.
For example, matching /^(.*)a(.*)$/ chops up "abababab" into
ababab in $1
a matched by the literal in the pattern
b in $2
Let's think through the process of the match. Imagine .* as Augustus Gloop.
Augustus: Ausgezeichnet! The ^ anchor means I get to start at the beginning. From there, I shall eat all the candies!
Willie Wonka: But, my dear Augustus, you must share with the other children.
Augustus: Fine, I get "abababa" and they get "b". Happy?
Willie Wonka: But the next child in line doesn't like b candies.
Augustus: Then I shall keep "ababab" for myself and leave "ab" for the others.
At this point, Augustus has his big pile, humble little Charlie Bucket gets his single a, and Veruca Salt—although scowling about the meager quantity—gets at least something now.
In other words, $2 contains everything after the last a. To be persnickety, the ^ and $ anchors are redundant, but I like keeping them for added emphasis.
Putting this into action, you could write
#! /usr/bin/env perl
use strict;
use warnings;
sub last_fields {
local($_) = #_;
my #last;
push #last, $1 =~ /^.*\|(.+)$/ ? $1 : undef
while /\[(.*?)\]/g;
wantarray ? #last : \#last;
}
The outer while breaks up the string into [...] chunks and assumes that right square-bracket cannot occur inside a chunk. Within each chunk, we use /^.*\|(.+)$/ to capture in $1 everything after the last pipe.
Testing it with your example looks like
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!]" .
"[0|0|{A=167,B=2,C=67,D=17}|.1iit]" .
"[196|0|{A=244,B=6,C=67,D=12}|10:48AM]" .
"[204|0|{A=9,B=201,C=61,D=11}|Calculator]" .
"[66|0|{A=145,B=450,C=49,D=14}|Coffee]";
use Test::More tests => 6;
my #lasts = last_fields $s;
# yes, is_deeply could do this in a single call,
# but it's laid out explicitly here for expository benefit
is $lasts[0], "!";
is $lasts[1], ".1iit";
is $lasts[2], "10:48AM";
is $lasts[3], "Calculator";
is $lasts[4], "Coffee";
is scalar #lasts, 5;
All the tests pass:
$ ./match-last-of
1..6
ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
The output of prove is nicer. Run it yourself to see the color coding.
$ prove ./match-last-of
./match-last-of .. ok
All tests successful.
Files=1, Tests=6, 0 wallclock secs ( 0.02 usr 0.01 sys + 0.02 cusr 0.00 csys = 0.05 CPU)
Result: PASS
I have a text file filled with sentences with unique pattern. The unique pattern is:
NAME [ e_NAME ]
simple rule: the "NAME" must follow after "e_" if the "e_" appearers inside the brackets!
The problem comes out when the string is complicated. I'll show the end point situations that may be hard to analyse:
Lines that won't match the rule:
(1) NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1]
(2) NAME1[blabla] + NAME2[e_BAD2]
(3) NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3]
(4) NAME1[e_NAME1BAD1] -> means it has to be only NAME1
Lines that match the rule:
(1) FOO1[blabla + 1]
(2) [blalbla] + bla
(3) bla + blabla
(4) FOO1[ccc + ddd + FOO2[e_FOO2]] = 123
(5) FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3]
I already asked this question but I couldn't catch this end points...
Edited after requirements were clarified
Either Text::Balanced or Regexp::Common might be useful. I initially posted an answer using the former but didn't like it very much. The following example uses Regexp::Common and seems fairly straightforward.
use strict;
use warnings;
use Regexp::Common;
my $PRE = '[^[]*?';
my $VAR = '\w+';
my $BRACK = $RE{balanced}{-parens=>'[]'};
my $POST = '.*';
while (<DATA>){
my ($bad, $full);
# Brackets, if any, must balance
$bad = 1 unless s/\[/[/g == s/\]/]/g;
$full = $_;
until ($bad){
# Find some bracketed text and store all components.
my ($pre, $var, $brack, $post) =
$full =~ /^($PRE)($VAR)($BRACK)($POST)$/;
last unless defined $brack;
# Create a copy of the bracketed text, removing both the outer
# brackets and all instances of inner-bracketed text.
chop (my $clean = substr $brack, 1);
$clean =~ s/$BRACK/ /g;
# If e_FOO exists, FOO must equal $var.
$bad = 1 if $clean =~ /e_(\w+)/ and $1 ne $var;
# Remove the part of $full we've already checked.
substr($full, 0, length($pre) + length($var) + 1, '');
}
print if $bad;
}
# Your test data, with some trailing comments.
__DATA__
NAME1[blabla+NAME2[blabla+e_BAD2]+e_NAME1] NOT OK 1
NAME1[blabla] + NAME2[e_BAD2] NOT OK 2
NAME1[NAME2[blabla+e_BAD2]] + NAME3[e_BAD3] NOT OK 3
NAME1[e_NAME1BAD1] NOT OK 4
FOO1[blabla + 1] OK 1
[blalbla] + bla OK 2
bla + blabla OK 3
FOO1[ccc + ddd + FOO2[e_FOO2]] = 123 OK 4
FOO1[cc + FOO2[ dd ] ] + FOO3[e_FOO3] OK 5
Maybe you are looking for something like:
if ($string =~ /(\w+)\[e\\_(\w+)/ && $1 eq $2) {
print "Pattern '$1' contained in string '$string'\n";
}
Based on the accepted answer to your first question, I came up with this:
use strict;
use warnings;
while (<DATA>) {
my $l = $_;
while (s/(\w+)\[([^\[\]]*)\]//) {
my ($n, $chk) = ($1, $2);
unless ($chk =~ /\be_$n\b/) {
warn "Bad line: $l";
last;
}
}
}
The \b checks for a word boundary. This version still doesn't check for unbalanced brackets, but it does seem to catch all the examples you gave, and will also complain when the e_NAME1 is inside another nested block, like so:
NAME1[stuff + NAME2[e_NAME1 + e_NAME2] + morestuff]
use Text::Balanced;
CPAN is wonderful.