Capturing output with Perl, until a specific pattern is found - regex

I feel like I am missing something very simple here, but this is the first time I've needed to do this and am having trouble finding an example.
I have a giant foreach loop that walks through output logs and extracts various bits of information based on matching regular expressions. My main problem is that a few larger types of output have a header and footer, like *** Begin bangle tracking log*** followed by several lines of gibberish and then a ***End bangle tracking log***.
Is there a way, from within a foreach loop, to have an inner loop that stores all the lines until a footer is found?
foreach my $line( #parseme )
{
if( $line =~ m/***Begin bangle tracking log***/ )
{
#Help! Push all lines into an array until bangle tracking footer is found.
}
if( $line =~ m/Other stuff I am tracking/ )
{
#Do other things
}
}

You could use the range operator, which acts as a flip-flop in scalar context:
foreach ( #parseme ) {
if ( /Begin bangle tracking log/ .. /End bangle tracking log/ ) {
push #array, $_;
}
# other stuff...
}
I used $_ for the foreach loop because it allows for more concise syntax. You can use another variable if you like, but then you'll have to write the condition as something like:
if ( $line =~ /Begin .../ .. $line =~ /End .../ ) {
which might be more readable with some extra parentheses:
if ( ($line =~ /Begin .../) .. ($line =~ /End .../) ) {
One issue to note about the flip-flop operator is that it remembers its state even after the loop ends. This means that, if you intend to run the loop again, you really ought to make sure that the #parseme array ends with a line that matches the /End .../ regexp, so that the flip-flop will be in a known state when the loop starts the next time.
Edit: Per DVK's comment below, if you want to process the collected lines as soon as you reach the footer line, you can do that by checking the return value of the .. operator, which will end with E0 on the last line:
foreach ( #parseme ) {
my $in_block = /Begin bangle tracking log/ .. /End bangle tracking log/;
if ( $in_block ) {
push #array, $_;
}
if ( $in_block =~ /E0$/ ) { # last line
# process the lines in #array
#array = ();
}
# other stuff...
}

You can do that somewhat easily by implementing a primitive state machine:
my $inside_bangle = 0; # 0=outside block, 1=inside
my #buffer;
foreach my $line( #parseme ) {
if ($line =~ m/***Begin bangle tracking log***/ ) {
$inside_bangle = 1;
next;
}
if ($line =~ m/***End bangle tracking log***/ ) {
$inside_bangle = 0;
# PROCESS #buffer somehow
next;
}
if ($inside_bangle) {
push #buffer, $line;
next;
}
if ($line =~ m/other stuff i am tracking/ ) {
#Do other things
}
}
Another option is to use flip-flop (..)

You're probably looking for the .. operator, which has some magical properties when applied with regular expressions. The following example is stolen from PLEAC:
while (<>) {
if (/BEGIN PATTERN/ .. /END PATTERN/) {
# line falls between BEGIN and END in the
# text, inclusive.
}
}
Within the block, append to your array variable as you see fit.

And now you can store multiple instances of your log snippet, if it occurs more than once (DVK's original code):
my $inside_bangle = 0; # 0=outside block, 1=inside
my %buffer;
my $index = 0;
foreach my $line( #parseme ) {
if ($line =~ m/***Begin bangle tracking log***/ ) {
$inside_bangle = 1;
next;
}
if ($line =~ m/***End bangle tracking log***/ ) {
$inside_bangle = 0;
$index++;
# PROCESS #buffer somehow
next;
}
if ($inside_bangle) {
push #{ $buffer{$index} }, $line;
next;
}
if ($line =~ m/other stuff i am tracking/ ) {
#Do other things
}
}
Here's what I wrote initially but I thought DVK's code is more readable and neat:
open FILE, "<", 'testfile.log';
#parseme = <FILE>;
my $initialize = shift #parseme;
my $startLogging = $initialize =~ m/^Log Start$/ ? 1 : 0; # test if first line of array is start of log
my %storage = ();
my $index = 0;
foreach my $line (#parseme) {
$startLogging = 1 if $line =~ m/^Log Start$/;
if ($startLogging == 1) {
push( #{ $storage{$index} }, $line ) unless $line =~ m/(Log Start|Log End)$/;
if ($line =~ m/^Log End$/) {
$startLogging = 0;
$index++;
}
}
}

Related

In perl, Is there a more compact way to search for a number of patterns, and for each one, substitute with an expression

In perl, I am reading a line and trying to replace a set of strings with corresponding expressions using a sequence of if statements. For example:
my #shiftInstructions=("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)]
$line =~ s/\$ish/$r/;
}
}
I don't like this approach for a number of reasons. First, it is repetitive. I have to first check if the pattern exists, and then if it does, execute a function to generate a replacement value, then substitute. So it is both verbose, and slow (2 regex searches per pattern, perhaps eventually dozens of pattern strings).
I thought of a map where a number of codes are mapped to corresponding code to execute.
I can imagine mapping to a string and then using eval but then I can't check the code except at runtime. Is there any cleaner way of doing this?
I found the execute option in regex. What about writing a set of subroutines to process each regex, then creating a mapping:
my %regexMap = (
"\$fn", &foundFunc,
"\$hw", &hex8,
"\$hb", &hex2,
"\$sh", &rand6,
"\$ish", &shiftInst,
);
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish|)/$regexMap{$1}/e;
print $line;
}
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
is a poor way of writing
$line =~ s/\$sh/ int(rand(6)) /e;
So
my #shiftInstructions=("lsr", "lsl", "rol", "ror");
while (my $line = <>) {
if ($line =~ /\$sh/) {
my $r = int(rand(6));
$line =~ s/\$sh/$r/;
}
if ($line =~ /\$ish/) {
my $r = $shiftInstructions[rand(4)]
$line =~ s/\$ish/$r/;
}
print($line);
}
can be written as
my #shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$sh/ int(rand(6)) /e;
$line =~ s/\$ish/ $shiftInstructions[rand(#shiftInstructions)] /e;
print($line);
}
But that means you are scanning the string over and over again. Let's avoid that.
my #shiftInstructions = qw( lsr lsl rol ror );
while (my $line = <>) {
$line =~ s/\$(sh|ish)/
if ( $1 eq "sh" ) { int(rand(6)) }
elsif ( $1 eq "ish" ) { $shiftInstructions[rand(#shiftInstructions)] }
/eg;
print($line);
}
Unfortunately, that reintroduces repetition. We can solve that using a dispatch table.
my #shiftInstructions = qw( lsr lsl rol ror );
my %replacements = (
sh => sub { int(rand(6)) },
ish => sub { $shiftInstructions[rand(#shiftInstructions)] },
);
my $alt = join '|', map quotemeta, keys(%replacements);
my $re = qr/\$($alt)/;
while (my $line = <>) {
print $line =~ s/$re/ $replacements{$1}->() /reg;
}
Now we have an efficient solution that can be extended without slowing down the matching, all while avoiding repetition.
The solution you added to your question was close, but it had two bugs.
&foo calls foo. To get a reference to it, use \&foo.
my %regexMap = (
"\$fn", \&foundFunc,
"\$hw", \&hex8,
"\$hb", \&hex2,
"\$sh", \&rand6,
"\$ish", \&shiftInst,
);
$regexMap{$1} now returns the reference. You want to call the referenced sub, which can be done using $regexMap{$1}->().
while (my $line = <>) {
$line =~ s/(\$fn|\$hw|\$hb|\$sh|\$ish|)/ $regexMap{$1}->() /e;
print $line;
}
In these cases, I often make some sort of data structure that holds the patterns and their actions:
my #tuples = (
[ qr/.../, sub { ... } ]
[ ... ].
);
Now the meat of the process stays the same no matter how many patterns I want to try:
while( <> ) {
foreach $tuple ( #tuples ) {
$tuple->[1]() if /$tuple[0]/
}
}
Abstract this a little further with a subroutine that takes the data structure. Then you can pass it different tables depending on what you would like to do:
sub some_sub {
my #tuples = #_;
while( <> ) {
foreach $tuple ( #tuples ) {
$tuple->[1]() if /$tuple[0]/
}
}
}
I've written about this sort of thing in Mastering Perl and Effective Perl Programming, and it's the sort of thing that does into my obscure modules like Brick and Data::Constraint.
I've been thinking about this more, and I wonder if regexes are actually part of what you are trying to do. It looks like you are matching literal strings, but using the match operator to do it. You don't give details of the input, so I'm guessing hereā€”it looks like there's an operation (e.g. $fn, and you want to match exactly that operation. The problem is finding that string then mapping it onto code. That looks something like this (and ikegami's answer is another form of this idea). Instead of an alternation, I match anything that might look like the string:
while( <> ) {
# find the string. Need example input to guess better
if( m/(\$[a-z]+)/ ) {
$table{$1}->() if exists $table{$1};
}
}
But again, it's dependent on the input, how many actual substrings you might want to match (so, the number of branches in an alternation), how many lines you want to process, and so on. There was a wonderful talk about processing apache log files with Regex::Trie and the various experiments they tried to make things faster. I've forgotten all the details, but very small adjustments made noticeable differences over tens of millions of lines.
Interesting reading:
Maybe this talk? An exploration of trie regexp matching
http://taint.org/2006/07/07/184022a.html
Matching a long list of phrases
OP's code can be written in following form
use strict;
use warnings;
use feature 'say';
my %regexMap = (
'$fn' => \&foundFunc,
'$hw' => \&hex8,
'$hb' => \&hex2,
'$sh' => \&rand6,
'$ish' => \&shiftInst,
);
my #keys = map { "\\$_" } keys %regexMap;
my $re = join('|', #keys);
while (<DATA>) {
chomp;
next unless /($re)/;
$regexMap{$1}->();
}
sub foundFunc { say 'sub_foundFunc' }
sub hex8 { say 'sub_hex8' }
sub hex2 { say 'sub_hex2' }
sub rand6 { say 'sub_rand6' }
sub shiftInst { say 'sub_shiftInst' }
__DATA__
$fn
$hw
$ac
$hb
$sh
$fn
$mf
$hb
$ish
$hw
Output
sub_foundFunc
sub_hex8
sub_hex2
sub_rand6
sub_foundFunc
sub_hex2
sub_shiftInst
sub_hex8

Perl: Trying to speed up parsing a delimited file

I have a large flat text file with lines that hold name/value pairs ("varname=value"). These pairs are seperated by a multi-character delimiter. So a single line in this file might look like this:
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
Each line holds about 50 name/value pairs.
I need to iterate through the lines of this file (there are about 100,000 lines) and store the name/value pairs in a hash so that
$field{'var1'} = value1
$field{'var2'} = value2
etc...
What I did was this:
# $line holds a single line from the file
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
Doing this for each line of the entire file takes (on my PC) about 2 seconds. This doesn't seem like a long time, but I really want to speed this up by quite a bit.
Of this 2 seconds, the first split takes about 0.6 seconds, while the foreach loop takes about 1.4 seconds. So I thought I'd get rid of the foreach loop and put it all in a single split:
%hash = split( /\Q|^|\E|=/, $line );
Much to my surprise, parsing the entire file this way took a full second longer! My question isn't really why this takes longer (although it would be a nice bonus to understand why), but my question is if there are any other (faster) ways to get the job done.
Thanks in advance.
------ Edit below this line ------
I just found out that changing this:
%hash = split( /\Q|^|\E|=/, $line );
into this:
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
makes it three times faster! Parsing the entire file this way now takes just over a second...
------ Snippet below this line ------
use strict;
use Time::HiRes qw( time );
my $line = "a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
ResetTimer();
my %hash;
for( my $i = 1; $i <= 100000; $i++ ) {
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ($name, $value) = split( /=/, $field );
$hash{$name} = $value;
}
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i <= 100000; $i++ ) {
%hash = split( /\Q|^|\E|=/, $line );
}
print Elapsed() . "\n";
ResetTimer();
%hash = ();
for( my $i = 1; $i<=100000; $i++ ) {
$line =~ s/\Q|^|\E/=/g;
%hash = split( /=/, $line );
}
print Elapsed() . "\n";
################################################################################################################################
BEGIN {
my $startTime;
sub ResetTimer {
$startTime = time();
return $startTime;
}
sub Elapsed {
return time() - $startTime;
}
}
I can't easily answer your performance question, because I'd need a test case. But I'd guess that it's to do with how the regular expression is being processed.
You can see what that's doing with use re 'debug';, and that'll print the regular expression steps.
But for the broader question - I'd probably just tackle it with a global (assuming your data is as simple as the example):
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
while ( <DATA> ) {
my %row = m/(\w+)=(\w+)/g;
print Dumper \%row;
}
__DATA__
var1=value1|^|var2=value2|^|var3=value3|^|var4=value4
You can use lookahead/behind to match delimiters if you've got more complicated things in there, but because it's one regex per line, you're invoking the regex engine less often, and that'll probably be faster. (But I can't tell you for sure without a test case).
If your data is more complicated, then perhaps:
my %row = s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
This will 'force' splitting the input into a new line, and then match 'anything' = 'anything'. But that's probably overkill unless your values include whitespace/pipes/metachars.
With editing your test case to use Benchmark:
#!/usr/bin/env perl
use strict;
use warnings;
use Benchmark qw ( cmpthese );
my $line =
"a=1|^|b=2|^|c=3|^|d=4|^|e=5|^|f=6|^|g=7|^|h=8|^|i=9|^|j=10|^|k=11|^|l=12|^|m=13|^|n=14|^|o=15|^|p=16|^|q=17|^|r=18|^|s=19|^|t=20|^|u=21|^|v=22|^|w=23|^|x=24|^|y=25|^|z=26|^|aa=27|^|ab=28|^|ac=29|^|ad=30|^|ae=31|^|af=32|^|ag=33|^|ah=34|^|ai=35|^|aj=36|^|ak=37|^|al=38|^|am=39|^|an=40|^|ao=41|^|ap=42|^|aq=43|^|ar=44|^|as=45|^|at=46|^|au=47|^|av=48|^|aw=49|^|ax=50";
sub double_split {
my %hash;
my #fields = split( /\Q|^|\E/, $line );
foreach my $field (#fields) {
my ( $name, $value ) = split( /=/, $field );
$hash{$name} = $value;
}
}
sub single_split {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub re_replace_then_split {
$line =~ s/\Q|^|\E/=/g;
my %hash = split( /=/, $line );
}
sub single_regex {
my %hash = $line =~ m/(\w+)=(\w+)/g;
}
sub compound {
my %hash = $line =~ s/\Q|^|\E/\n/rg =~ m/(.*)=(.*)/g;
}
cmpthese(
1_000_000,
{ "Double Split" => \&double_split,
"single split with regex" => \&single_split,
"Replace then split" => \&re_replace_then_split,
"Single Regex" => \&single_regex,
"regex to linefeed them match" => \&compound
}
);
Looks like the results come out like:
Rate Double Split single split with regex Single Regex Replace then split regex to linefeed them match
Double Split 18325/s -- -4% -34% -56% -97%
single split with regex 19050/s 4% -- -31% -54% -97%
Single Regex 27607/s 51% 45% -- -34% -96%
Replace then split 41733/s 128% 119% 51% -- -93%
regex to linefeed them match 641026/s 3398% 3265% 2222% 1436% --
... I'm a bit suspicious of that last, because that's absurdly faster. There's probably caching of results happening there.
But looking at it, what's slowing you down is the alternation in the regex:
sub single_split_with_alt {
my %hash = split( /\Q|^|\E|=/, $line );
}
sub single_split {
my %hash = split( /[\|\^\=]+/, $line );
}
(I know that latter might not be quite what you want, but it's for illustrative purposes)
Gives:
Rate alternation single split
alternation 19135/s -- -37%
single split 30239/s 58% --
But there does come a point where this is moot, because your limiting factor is disk IO, not CPU.

perl count line in double looping, if match regular expression plus 1

I open a file by putting the line to an array. Inside this file based on the regular expression that contains a duplicate value. If the regular expression is a match I want to count it. The regular expression may look like this
$b =~ /\/([^\/]+)##/. I want to match $1 value.
my #array = do
{
open my $FH, '<', 'abc.txt' or die 'unable to open the file\n';
<$FH>;
};
Below is the way I do, it will get the same line in my file. Thank for help.
foreach my $b (#array)
{
$conflictTemp = 0;
$b =~ /\/([^\/]+)##/;
$b = $1;
#print "$b\n";
foreach my $c (#array)
{
$c =~ /\/([^\/]+)##/;
$c = $1;
if($b eq $c)
{
$conflictTemp ++;
#print "$b , $c \n"
#if($conflictTemp > 1)
#{
# $conflict ++;
#}
}
}
}
Below is the some sample data, two sentences are duplicates
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.**cdtbuild##**/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
It looks like you're trying to iterate each element of the array, select some data via pattern match, and then count dupes. Is that correct?
Would it not be easier to:
my %count_of;
while ( <$FH> ) {
my ( $val ) = /\/([^\/]+)##/;
$count_of{$val}++;
}
And then, for the variables that have more than one (e.g. there's a duplicate):
print join "\n", grep { $count_of{$_} > 1 } keys %count_of;
Alternatively, if you're just wanting to play 'spot the dupe':
#!/usr/bin/env perl
use strict;
use warnings;
my %seen;
my $match = qr/\/([^\/]+)##/;
while ( <DATA> ) {
my ( $value ) = m/$match/ or next;
print if $seen{$value}++;
}
__DATA__
/a/b/c/d/code/Debug/atlantis_digital/c/d/code/Debug/atlantis_digital.map##/main/place.09/2
/a/b/c/d/code/C5537_mem_map.cmd##/main/place.09/0
/a/b/c/d/code/.settings/org.eclipse.cdt.managedbuilder.core.prefs##/main/4
/a/b/c/d/code/.project_initial##/main/2
/a/b/c/d/code/.project##/main/CSS5/5
/a/b/c/d/code/.cproject##/main/CSS5/10
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtproject##/main/place.09/0
/a/b/c/d/code/.cdtbuild_initial##/main/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.cdtbuild##/main/CSS5/2
/a/b/c/d/code/.ccsproject##/main/CSS5/3
The problem has been solved by the previous answer - I just want to offer an alternate flavour that;
Spells out the regex
Uses the %seen hash to record the line the pattern first appears; to enable
slightly more detailed reporting
use v5.12;
use warnings;
my $regex = qr/
\/ # A literal slash followed by
( # Capture to $1 ...
[^\/]+ # ... anything that's not a slash
) # close capture to $1
## # Must be immdiately followed by literal ##
/x;
my %line_num ;
while (<>) {
next unless /$regex/ ;
my $pattern = $1 ;
if ( $line_num{ $pattern } ) {
say "'$pattern' appears on lines ", $line_num{ $pattern }, " and $." ;
next ;
}
$line_num{ $pattern } = $. ; # Record the line number
}
# Ran on data above will produce;
# '.cdtproject' appears on lines 7 and 8
# '.cdtbuild' appears on lines 10 and 11

Negate regular expression in Perl

I am splitting a text file into blocks in order to extract those blocks which do not contain a certain line by using a regular expression.
The text file looks like this:
[Term]
id: id1
name: name1
xref: type1:aab
xref: type2:cdc
[Term]
id: id2
name: name2
xref: type1:aba
xref: type3:fee
Someone helped me a few days ago by showing me how to extract those blocks which do contain a certain regular expression (for example "xref: type3"):
while (<MYFILE>) {
BEGIN { $/ = q|| }
my #lines = split /\n/;
for my $line ( #lines ) {
if ( $line =~ m/xref:\s*type3/ ) {
printf NEWFILE qq|%s|, $_;
last;
}
}
}
Now I want to write all blocks in a new file which do not contain "xref: type3". I tried to do this by simply negating the regex
if ( $line !~ m/xref:\s*type3/ )
or alternatively by negating the if statement by using
unless ( $line =~ m/xref:\s*type3/ )
Unfortunately it doesn't work - the output file is the same as the the original one. Any ideas what I'm doing wrong?
You have:
For every line, print this block if this line doesn't match the pattern.
But you want:
For every line, print this line if none of the other lines in the block match the pattern.
As such, you can't start printing the block before you examined every line in the block (or at all lines until you find a matching line).
local $/ = q||;
while (<MYFILE>) {
my #lines = split /\n/;
my $skip = 0;
for my $line ( #lines ) {
if ( $line =~ m/^xref:\s*type3/ ) {
$skip = 1;
last;
}
}
if (!$skip) {
for my $line ( #lines ) {
print NEWFILE $line;
}
}
}
But there's no need to split into lines. We can check and print the whole block at once.
local $/ = q||;
while (<MYFILE>) {
print NEWFILE $_ if !/^xref:\s*type3/m;
}
(Note the /m to make ^ match the start of any line.)
The problem is that you are using unless with !~ which is interpreted as if $line does not NOT match do this. ( a double negative )
When using the unless block with the normal pattern matching operator =~ you code worked perfectly, that is I see the first block as output because it does not contain type3.
LOOP:
while (<$MYFILE>) {
BEGIN { $/ = q|| }
my #lines = split /\n/;
for my $line ( #lines ) {
unless ( $line =~ m/xref:\s*type3/ ) {
printf qq|%s|, $_;
last LOOP;
}
}
}
# prints
# [Term]
# id: id1
# name: name1
# xref: type1:aab
# xref: type2:cdc
Do not process the records line by line. Use a paragraph mode:
{ local $/ = q();
while (<MYFILE>) {
if (! /xref:\s*type3/ ) {
printf NEWFILE qq|%s|, $_;
last;
}
}

Extracting a block of text where the closing expression depends on the opening one

I have a text string structured like this:
= Some Heading (1)
Some text
== Some Sub-Heading (2)
Some more text
=== Some Sub-sub-heading (3)
Some details here
= Some other Heading (4)
I want to extract the content of second heading, including any subsection. I do not know beforehand what is the depth of the second heading, so I need to match from there to the next heading that is of the same depth, or shallower, or the end of the string.
In the example above, this would yield:
== Some Sub-Heading (2)
Some more text
=== Some Sub-sub-heading (3)
Some details here
This is where I get stuck. How can I use the matched sub-expression opening the second heading as part of the sub-expression for closing the section.
I'd skip trying to use a complex regex. Instead write a simple parser and build up a tree.
Here's a rough and ready implementation. It's only optimized for lazy coding. You may want to use libraries from CPAN to build your parser and your tree nodes.
#!/usr/bin/perl
use strict;
use warnings;
my $document = Node->new();
my $current = $document;
while ( my $line = <DATA> ) {
if ( $line =~ /^=+\s/ ) {
my $current_depth = $current->depth;
my $line_depth = Node->Heading_Depth( $line );
if ( $line_depth > $current_depth ) {
# child node.
my $line_node = Node->new();
$line_node->heading( $line );
$line_node->parent( $current );
$current->add_children( $line_node );
$current = $line_node;
}
else {
my $line_node = Node->new();
while ( my $parent = $current->parent ) {
if ( $line_depth == $current_depth ) {
# sibling node.
$line_node->heading( $line );
$line_node->parent( $parent );
$current = $line_node;
$parent->add_children( $current );
last;
}
# step up one level.
$current = $parent;
}
}
}
else {
$current->add_children( $line );
}
}
use Data::Dumper;
print Dumper $document;
BEGIN {
package Node;
use Scalar::Util qw(weaken blessed );
sub new {
my $class = shift;
my $self = {
children => [],
parent => undef,
heading => undef,
};
bless $self, $class;
}
sub heading {
my $self = shift;
if ( #_ ) {
$self->{heading} = shift;
}
return $self->{heading};
}
sub depth {
my $self = shift;
return $self->Heading_Depth( $self->heading );
}
sub parent {
my $self = shift;
if ( #_ ) {
$self->{parent} = shift;
weaken $self->{parent};
}
return $self->{parent};
}
sub children {
my $self = shift;
return #{ $self->{children} || [] };
}
sub add_children {
my $self = shift;
push #{$self->{children}}, #_;
}
sub stringify {
my $self = shift;
my $text = $self->heading;
foreach my $child ( $self->children ) {
no warnings 'uninitialized';
$text .= blessed($child) ? $child->stringify : $child;
}
return $text;
}
sub Heading_Depth {
my $class = shift;
my $heading = shift || '';
$heading =~ /^(=*)/;
my $depth = length $1;
return $depth;
}
}
__DATA__
= Heading (1)
Some text
= Heading (2)
Some more text
== Subheading (3)
Some details here
== Subheading (3)
Some details here
= Heading (4)
#!/usr/bin/perl
my $all_lines = join "", <>;
# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\1 matches the 1st matched group)
if ( $all_lines =~ /(=+ Heading )\([2]\)(.*?)\1/s ) {
print "$2";
}
This splits the file in sections:
my #all = split /(?=^= )/m, join "", <$filehandle>;
shift #all;
daotoad and jrockway are absolutely right.
If you're trying to parse a tree-like data structure, bending regex to your will only results in a brittle inscrutable and still-not-general-enough intricate blob of code.
If you insist, though, here's a revised snippet that works. Matching up to same-depth separator OR end of string is one complication. Matching strings at depths less then or equal the current depth is more challenging and needed a two-step.
#!/usr/bin/perl
my $all_lines = join "", <>;
# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\2 matches the 2nd parenthesized group)
if ( $all_lines =~ m/((=+) [^\n]*\(2\)(.*?))(\n\2 |\z)/s ) {
# then trim it down to just the point before any heading at lesser depth
my $some_lines = $1;
my $depth = length($2);
if ($some_lines =~ m/(.*?)(\n={1,$depth} |\z)/s) {
print "$1\n";
}
}
But my advice is to avoid this route and parse it with something readable and maintainable!
Just for a giggle:
/^(?>(=+).*\(2\))(?>[\r\n]+(?=\1=|[^=]).*)*/m
The lookahead ensures that, if a line starts with an equals sign, there is at least one more equals sign than in the prefix of the original line. Notice that the second part of the lookahead matches any character other than an equals sign, including a linefeed or carriage return. That lets it match an empty line, but not the end of the string.