Find matching between 2 files (how to improve efficiency)

Find matching between 2 files (how to improve efficiency) - regex

#file1 contains only startpoint-endpoint pair, each indices represent each pair. file2 is a text file, for #file2 each indices represents each line. I am trying to search each pair from #file1 in #file2 line by line. When the exact match is found, I would then try to extract information1 from file2 and print it out. But for now, I am trying to search for the matched pair in file2. The format of the matching pattern is as below:
Match case
From $file1[0]
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
match if file2 contains:
Line with other stuff
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff
Unmatch Case:
From file1:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
From file2:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:
For text files2, I stored it in #file2. For files1, I have successfully extracted and stored every Startpoint and the next line Endpoint as the format above in #file1. (No problem in extracting and storing each pair, so I wont be showing the code for this, it took around 4mins here) Then I split each element of #address, which are the startpoint and endpoint. Checking line by line in files2, if startpoint match, then I will move on next line to check endpoint, it is only considered match if the next line after Startpoint match the Endpoint, else try to search again until the end line of files2. This script does the job but it took 3 and a half hours to complete(there are around 60k pairs from file1 and 800k lines to check in file2). Is there any other efficient way to do this?
I am new in Perl scripting, I apologize for any silly mistakes, both in my explanation and my coding.
Here's the codes:
#!usr/bin/perl
use warnings;
my $report = '/home/dir/file2';
open ( $DATA,$report ) || die "Error when opening";
chomp (#file2 = <$DATA>);
#No problem in extracting Start-Endpoint pair from file1 into #file1, so I wont include
#the code for this
$size = scalar#file1;
$size2 = scalar#file2;
for ( $total=0; $total<$size; $total++ ) {
my #file1_split = split('\n',$file1[$total]);
chomp #file1_split;
my $match_endpoint = 0;
my $split = 0;
LABEL2: for ( $count=0; $count<$size2; $count++ ) {
if ( $match_endpoint == 1) {
if ( grep { $_ eq "file1_split[$split]" } $file2[$count] )
print"Pair($total):Match Pair\n";
last LABEL2; #move on to check next start-endpoint
#pair
}
else {
$split = 0; #reset back to check the same startpoint
and continue searching until match found or end line of file2
$match_endpoint = 0;
}
}
elsif ( grep { $_ eq "$address_array[$split]"} $array[$count] )
{
$match_endpoint = 1;#enable search for endpoint in next line
$split = 1; #move on next line to match endpoint
next;
}
elsif ( $count==$size2-1 ) {
print"no matching found for Path($total)\n";
}
}
}

If I'm understanding what your code is trying to do,
it looks like it would be more efficient to do it this way:
my %split=#file1;
my %total;
#total{#file1}=(0..$#file1);
my $split;
for( #file2 ){
if( $split ){
if( $_ eq $split ){
print"Pair($total{$split}):Match Pair\n";
}else{
$split{$split}="";
}
}
$split=$split{$_};
delete $split{$_};
}
for( keys %split ){
print"no matching found for Path($total{$_})\n";
}

If I have understood your spec (show matches), I'm betting this will complete in less than 5 seconds, unless you're using an old Dell D333. To further minimize the response time, you would write some extra code to drive the while loop by the hash with the fewest keys (you implied file1). If you use references to hashes, then you can write a small if-else statement to swap the hash references without having to code duplicate while statements.
use strict;
use warnings;
sub makeHash($) {
my ($filename) = #_;
open(DATA, $filename) || die;
my %result;
my ($start, $line);
while (<DATA>) {
if ($_ =~ /^Startpoint: (.*)/) {
$start = $1; # captured group in regular expression
$line = $.; # current line number
} elsif ($_ =~ /^Endpoint: (.*)/) {
my $end = $1;
if (defined $line && $. == ($line + 1)) {
my $key = "$start::$end";
# can distinguish start and end lines if necessary
$result{$key} = {start=>$start, end=>$end, line=>$line};
}
}
}
close(DATA);
return %result;
}
my %file1 = makeHash("file1");
my %file2 = makeHash("file2");
my $fmt = "%10s %10s %s\n";
my $nmatches = 0;
printf $fmt, "File1", "File2", "Key";
while (my ($key, $f1h) = each %file1) {
my $f2h = $file2{$key};
if (defined $f2h) {
# You have access to hash members start and end if you need to distinguish further
printf $fmt, $f1h->{line}, $f2h->{line}, $key;
$nmatches++;
}
}
print "Found $nmatches matches\n";
Below, is my test data generator(thanks). I generated a worst-case scenario of 1,000,000 matches between two equal files. The matching code above finished on my MBP in under 20 seconds using the generated test data.
use strict;
use warnings;
sub rndStr { join'', #_[ map{ rand #_ } 1 .. shift ] }
open(F1, ">file1") || die;
open(F2, ">file2") || die;
for (1..1000000) {
my $start = rndStr(30, 'A'..'Z');
my $end = rndStr(30, 'A'..'Z');
print F1 "Startpoint: $start\n";
print F1 "Endpoint: $end\n";
print F2 "Startpoint: $start\n";
print F2 "Endpoint: $end\n";
}
close(F1);
close(F2);

Related

Matching a file line by line

$text_file = '/homedir/report';
open ( $DATA,$text_file ) || die "Error!"; #open the file
#ICC2_array = <$DATA>;
$total_line = scalar#ICC2_array; # total number of lines
#address_array = split('\n',$address[6608]); # The first content is what I want and it is correct, I have checked using print
LABEL2:
for ( $count=0; $count < $total_line; $count++ ) {
if ( grep { $_ eq "$address_array[0]" } $ICC2_array[$count] ) {
print "This address is found!\n";
last LABEL2;
}
elsif ( $count == $total_line - 1 ) { # if not found in all lines
print "No matching is found for this address\n";
last LABEL2;
}
}
I am trying to match the 6609th address in #ICC2_array line by line. I am certain that this address is in $text_file but it is exactly the same format.
Something like this:
$address[6608] contains
Startpoint: port/start/input_output (triggered by clock3)
Endpoint: port/end/input_output (rising edge-triggered)
$address_array[0] contains
Startpoint: port/start/input_output (triggered by clock3)
There's a line in $text_file that is
Startpoint: port/start/input_output (triggered by clock3)
However the output is "no matching found for this address", can anybody point out my mistakes?

All of the elements in #ICC2_array will have new-line characters at the end.
As $address_array[0] is created by splitting data on \n it is guaranteed not to contain a new-line character.
A string that ends in a new-line can never be equal to a string that doesn't contain a new-line.
I suggest replacing:
#ICC2_array = <$DATA>;
With:
chomp(#ICC2_array = <$DATA>);
Update: Another problem I've just spotted. You are incrementing $count twice on each iteration. You increment it in the loop control code ($count++) and you're also incrementing it in the else clause ($count += 1). So you're probably only checking every other element in #ICC2_array.

I think your code should look like this
The any operator from core module List::Util is like grep except that it stops searching as soon as it finds a match, so on average should be twice as fast. Early iterations of List::Util did not contain any, and you can simply use grep instead if that applies to you
I've removed the _array from your array identifier as the # indicates that it's an array and it's just unwanted noise
use List::Util 'any';
my $text_file = '/homedir/report';
my #ICC2 = do {
open my $fh,'<', $text_file or die qq{Unable to open "$text_file" for input: $!};
<$fh>;
};
chomp #ICC2;
my ( $address ) = split /\n/, $address[6608], 2;
if ( any { $_ eq $address } #ICC2 ) {
print "This address is found\n"
}
else {
print "No match is found for this address\n";
}

Using iterated variables with regex

The point of the overall script is to:
step 1) open a single column file and read off first entry.
step 2) open a second file containing lots of rows and columns, read off EACH line one at a time, and find anything in that line that matches the first entry from the first file.
step3) if a match is found, then "do something constructive", and if not, go to the first file and take the second entry and repeat step 2 and step 3, and so on...
here is the script:
#!/usr/bin/perl
use strict; #use warnings;
unless(#ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = $ARGV[0];
chomp( $list );
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
my( #list ) = (<LIST>);
close LIST;
open (CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my(#spreadsheet) = (<CHR1>);
close CHR1;
for (my $i = 0; $i < scalar #list; $i++ ) {
print "$i in list is $list[$i]\n";
for (my $j = 1; $j < scalar #spreadsheet; $j++ ) {
#print "$spreadsheet[$j]\n";
if ( $spreadsheet[$j] ) {
print "will $list[$i] match with $spreadsheet[$j]?\n";
}
else { print "no match\n" };
} #for
} #for
I plan to use a regex in the line if ( $spreadsheet[$j] ) { but am having a problem at this step as it is now. On the first interation, the line print "will $list[$i] match with $spreadsheet[$j]?\n"; prints $list[$i] OK but does not print $spreadsheet[$j]. This line will print both variables correctly on the second and following iterations. I do not see why?

At first glance nothing looks overtly incorrect. As mentioned in the comments the $j = 1 looks questionable but perhaps you are skipping the first row on purpose.
Here is a more perlish starting point that is tested. If it does not work then you have something going on with your input files.
Note the extended trailing whitespace removal. Sometimes if you open a WINDOWS file on a UNIX machine and use chomp, you can have embedded \r in your text that causes weird things to happen to printed output.
#!/usr/bin/perl
use strict; #use warnings;
unless(#ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = shift;
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
open(CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my #spreadsheet = map { s/\s+$//; $_ } <CHR1>;
close CHR1;
# s/\s+$//; is like chomp but trims all trailing whitespace even
# WINDOWS files opened on a UNIX system.
for my $item (<LIST>) {
$item =~ s/\s+$//; # trim all trailing whitespace
print "==> processing '$item'\n";
for my $row (#spreadsheet) {
if ($row =~ /\Q$item\E/) { # see perlre for \Q \E
print "match '$row'\n";
}
else {
print "no match '$row'\n";
}
}
}
close LIST;

Negate regular expression in Perl

I am splitting a text file into blocks in order to extract those blocks which do not contain a certain line by using a regular expression.
The text file looks like this:
[Term]
id: id1
name: name1
xref: type1:aab
xref: type2:cdc
[Term]
id: id2
name: name2
xref: type1:aba
xref: type3:fee
Someone helped me a few days ago by showing me how to extract those blocks which do contain a certain regular expression (for example "xref: type3"):
while (<MYFILE>) {
BEGIN { $/ = q|| }
my #lines = split /\n/;
for my $line ( #lines ) {
if ( $line =~ m/xref:\s*type3/ ) {
printf NEWFILE qq|%s|, $_;
last;
}
}
}
Now I want to write all blocks in a new file which do not contain "xref: type3". I tried to do this by simply negating the regex
if ( $line !~ m/xref:\s*type3/ )
or alternatively by negating the if statement by using
unless ( $line =~ m/xref:\s*type3/ )
Unfortunately it doesn't work - the output file is the same as the the original one. Any ideas what I'm doing wrong?

You have:
For every line, print this block if this line doesn't match the pattern.
But you want:
For every line, print this line if none of the other lines in the block match the pattern.
As such, you can't start printing the block before you examined every line in the block (or at all lines until you find a matching line).
local $/ = q||;
while (<MYFILE>) {
my #lines = split /\n/;
my $skip = 0;
for my $line ( #lines ) {
if ( $line =~ m/^xref:\s*type3/ ) {
$skip = 1;
last;
}
}
if (!$skip) {
for my $line ( #lines ) {
print NEWFILE $line;
}
}
}
But there's no need to split into lines. We can check and print the whole block at once.
local $/ = q||;
while (<MYFILE>) {
print NEWFILE $_ if !/^xref:\s*type3/m;
}
(Note the /m to make ^ match the start of any line.)

The problem is that you are using unless with !~ which is interpreted as if $line does not NOT match do this. ( a double negative )
When using the unless block with the normal pattern matching operator =~ you code worked perfectly, that is I see the first block as output because it does not contain type3.
LOOP:
while (<$MYFILE>) {
BEGIN { $/ = q|| }
my #lines = split /\n/;
for my $line ( #lines ) {
unless ( $line =~ m/xref:\s*type3/ ) {
printf qq|%s|, $_;
last LOOP;
}
}
}
# prints
# [Term]
# id: id1
# name: name1
# xref: type1:aab
# xref: type2:cdc

Do not process the records line by line. Use a paragraph mode:
{ local $/ = q();
while (<MYFILE>) {
if (! /xref:\s*type3/ ) {
printf NEWFILE qq|%s|, $_;
last;
}
}

How can I detect a blank line in Perl?

How do I check a line ($_ value) is a blank line in Perl? Or another
good method to check it instead of using $_?
I want to code like this
if ($_ eq '') # Check current line is a blank line (no any characters)
{
$x = 0;
}
I updated some code with a question solution below.
My test.txt for parsing:
constant fixup private GemAlarmFileName = <A "C:\\TMP\\ALARM.LOG">
vid = 0
name = ""
units = ""
constant fixup private GemConfigAlarms = <U1 0> /* my Comment */
vid = 1
name = "CONFIGALARMS"
units = ""
min = <U1 0>
max = <U1 2>
default = <U1 0>
My code is below.
That's why I need to initially set $x = 0. I am not sure if it is a normal
solution or not.
sub ConstantParseAndPrint
{
if (/^$/) // SOLUTION!
{
$x = 0;
}
if ($x == 0)
{
if (/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
{
$name1 = $1; # Constant
$name2 = $2; # Fixup
$name3 = $3; # Private
$name4 = $4;
$name5 = $5;
$name6 = $7;
$name7 = $8;
# start print
if (!$name7 eq '')
{
print DEST_XML_FILE "<!-- $name7-->\n";
}
print DEST_XML_FILE " <ECID";
print DEST_XML_FILE " logicalName=\"$name4\"";
print DEST_XML_FILE " valueType=\"$name5\"";
print DEST_XML_FILE " value=\"$name6\"";
$x = 1;
}
}
elsif ($x == 1)
{
if(/\s*vid\s*=\s*(.*?)(\s|\n|\r)/)
{
$nID = $1;
print DEST_XML_FILE " vid=\"$nID\"";
$x = 2;
}
}
elsif ($x == 2)
{
if(/\s*name\s*=\s*(.*?)(\s|\n|\r)/)
{
$nName = $1;
print DEST_XML_FILE " name=$nName";
$x = 3;
}
}
elsif ($x == 3)
{
if (/\s*units\s*=\s*(.*?)(\s|\n|\r)/)
{
$nUnits = $1;
print DEST_XML_FILE " units=$nUnits";
$x = 4;
}
}
elsif ($x == 4)
{
# \s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>\
if (/\s*min\s*=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>(\s|\n|\r)/)
{
#$nMinName1 = $1;
$nMinName2 = $2; # Find the nMin Value
#$nMinName3 = $3;
#$nMinName4 = $4;
print DEST_XML_FILE " min=\"$nMinName2\"";
$x = 5;
}
else
{
print DEST_XML_FILE "></ECID>\n";
$x = 0; # There is no line 4 and line 5
}
}
elsif ($x == 5)
{
if (/\s*max\s*=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>(\s|\n|\r)/)
{
#$nMaxName1 = $1;
$nMaxName2 = $2; # Find the nMax Value
#$nMaxName3 = $3;
#$nMaxName4 = $4;
print DEST_XML_FILE " max=\"$nMaxName2\"";
$x = 6;
}
}
elsif ($x == 6)
{
if (/\s*default\s*=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>(\s|\n|\r)/)
{
#$nDefault1 = $1;
$nDefault2 = $2; # Find the default Value
#$nDefault3 = $3;
#$nDefault4 = $4;
print DEST_XML_FILE " default=\"$nDefault2\">";
print DEST_XML_FILE "</ECID>\n";
$x = 0;
}
}
}

if ($_ =~ /^\s*$/) {
# blank
}
checks for 0 or more whitespaces (\s*) bound by beginning(^)/end($) of line. That's checking for a blank line (i.e. may have whitespace). If you want an empty line check, just remove the \s*.
The check against $_ can be implicit, so you can reduce the above to if (/^\s*$/) for conciseness.

Against my better judgment I will try to help you again.
The issue is not how to find a blank line. The issue is not which regex to use. The fundamental issue is understanding how to analyze a problem and turn that analysis into code.
In this case the problem is "How do I parse this format?"
I've written a parser for you. I have also taken the time to write a detailed description of the process I used to write it.
WARNING: The parser is not carefully tested for all cases. It does not have enough error handling built in. For those features, you can request a rate card or write them yourself.
Here's the data sample you provided (I'm not sure which of your several questions I pulled this from):
constant fixup GemEstabCommDelay = <U2 20>
vid = 6
name = "ESTABLISHCOMMUNICATIONSTIMEOUT"
units = "s"
min = <U2 0>
max = <U2 1800>
default = <U2 20>
constant fixup private GemConstantFileName = <A "C:\\TMP\\CONST.LOG">
vid = 4
name = "" units = ""
constant fixup private GemAlarmFileName = <A "C:\\TMP\\ALARM.LOG">
vid = 0
name = ""
units = ""
Before you can write a parser for a data file, you need to have a description the structure of the file. If you are using a standard format (say XML) you can read the existing specification. If you are using some home-grown format, you get to write it yourself.
So, based on the sample data, we can see that:
data is broken into blocks.
each block starts with the word constant in column 0.
each block ends with a blank line.
a block consists of a start line, and zero or more additional lines.
The start line consists of the keyword constant followed by one or more whitespace delimited words, an '=' sign and an <> quoted data value.
The last keyword appears to be the name of the constant. Call it constant_name
The <>-quoted data appears to be a combined type/value specifier.
earlier keywords appear to specify additional metadata about the constant. Let's call those options.
The additional lines specify additional key value pairs. Let's call them attributes. Attributes may have a single value or they may have a type/value specifier.
One or more attributes may appear in a single line.
Okay, so now we have a rough spec. What do we do with it?
How is the format structured? Consider the logical units of organization from largest to smallest. These will determine the structure and flow of our code.
A FILE is made of BLOCKS.
BLOCKS are made of LINES.
So our parser should decompose a file into blocks, and then handle the blocks.
Now we rough out a parser in comments:
# Parse a constant spec file.
# Until file is done:
# Read in a whole block
# Parse the block and return key/value pairs for a hash.
# Store a ref to the hash in a big hash of all blocks, keyed by constant_name.
# Return ref to big hash with all block data
Now we start to fill in some code:
# Parse a constant spec file.
sub parse_constant_spec {
my $fh = shift;
my %spec;
# Until file is done:
# Read in a whole block
while( my $block = read_block($fh) ) {
# Parse the and return key/value pairs for a hash.
my %constant = parse_block( $block );
# Store a ref to the hash in a big hash of all blocks, keyed by constant_name.
$spec{ $constant{name} } = \%constant;
}
# Return ref to big hash with all block data
return \%spec;
}
But it won't work. The parse_block and read_block subs haven't been written yet. At this stage that's OK. The point is to rough in features in small, understandable chunks. Every once in a while, to keep things readable you need to gloss over the details drop in a subroutine--otherwise you wind up with monstrous 1000 line subs that are impossible to debug.
Now we know we need to write a couple of subs to finish up, et viola:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $fh = \*DATA;
print Dumper parse_constant_spec( $fh );
# Parse a constant spec file.
# Pass in a handle to process.
# As long as it acts like a file handle, it will work.
sub parse_constant_spec {
my $fh = shift;
my %spec;
# Until file is done:
# Read in a whole block
while( my $block = read_block($fh) ) {
# Parse the and return key/value pairs for a hash.
my %constant = parse_block( $block );
# Store a ref to the hash in a big hash of all blocks, keyed by constant_name.
$spec{ $constant{const_name} } = \%constant;
}
# Return ref to big hash with all block data
return \%spec;
}
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines;
my $block_started = 0;
while( my $line = <$fh> ) {
$block_started++ if $line =~ /^constant/;
if( $block_started ) {
last if $line =~ /^\s*$/;
push #lines, $line;
}
}
return \#lines if #lines;
return;
}
sub parse_block {
my $block = shift;
my ($start_line, #attribs) = #$block;
my %constant;
# Break down first line:
# First separate assignment from option list.
my ($start_head, $start_tail) = split /=/, $start_line;
# work on option list
my #options = split /\s+/, $start_head;
# Recover constant_name from options:
$constant{const_name} = pop #options;
$constant{options} = \#options;
# Now we parse the value/type specifier
#constant{'type', 'value' } = parse_type_value_specifier( $start_tail );
# Parse attribute lines.
# since we've already got multiple per line, get them all at once.
chomp #attribs;
my $attribs = join ' ', #attribs;
# we have one long line of mixed key = "value" or key = <TYPE VALUE>
#attribs = $attribs =~ /\s*(\w+\s+=\s+".*?"|\w+\s+=\s+<.*?>)\s*/g;
for my $attrib ( #attribs ) {
warn "$attrib\n";
my ($name, $value) = split /\s*=\s*/, $attrib;
if( $value =~ /^"/ ) {
$value =~ s/^"|"\s*$//g;
}
elsif( $value =~ /^</ ) {
$value = [ parse_type_value_specifier( $start_tail ) ];
}
else {
warn "Bad line";
}
$constant{ $name } = $value;
}
return %constant;
}
sub parse_type_value_specifier {
my $tvs = shift;
my ($type, $value) = $tvs =~ /<(\w+)\s+(.*?)>/;
return $type, $value;
}
__DATA__
constant fixup GemEstabCommDelay = <U2 20>
vid = 6
name = "ESTABLISHCOMMUNICATIONSTIMEOUT"
units = "s"
min = <U2 0>
max = <U2 1800>
default = <U2 20>
constant fixup private GemConstantFileName = <A "C:\\TMP\\CONST.LOG">
vid = 4
name = "" units = ""
constant fixup private GemAlarmFileName = <A "C:\\TMP\\ALARM.LOG">
vid = 0
name = ""
units = ""
The above code is far from perfect. IMO, parse_block is too long and ought to be broken into smaller subs. Also, there isn't nearly enough validation and enforcement of well-formed input. Variable names and descriptions could be clearer, but I don't really understand the semantics of your data format. Better names would more closely match the semantics of the data format.
Despite these issues, it does parse your format and produce a big handy data structure that can be stuffed into whatever output format you want.
If you use this format in many places, I recommend putting the parsing code into a module. See perldoc perlmod for more info.
Now, please stop using global variables and ignoring good advice. Please start reading the perldoc, read Learning Perl and Perl Best Practices, use strict, use warnings. While I am throwing reading lists around go read Global Variables are Bad and then wander around the wiki to read and learn. I learned more about writing software by reading c2 than I did in school.
If you have questions about how this code works, why it is laid out as it is, what other choices could have been made, speak up and ask. I am willing to help a willing student.
Your English is good, but it is clear you are not a native speaker. I may have used too many complex sentences. If you need parts of this written in simple sentences, I can try to help. I understand that working in a foreign language is very difficult.

The answer depends on what you mean by a blank line (whether it contains no characters apart from a newline or whether it contains only whitespace). An idiomatic way to deal with this is to use a negative match against \S which matches in both of these cases:
if ( ! /\S/ ) {
...
}
If you are only looking for the former than your own answer is fine.
You often see this technique used as a filter:
while (<>) {
next unless /\S/; # Ignore blank lines.
...
}

You can use:
if ($_ =~ /^$/)
or even just
if (/^$/)
since Perl assumes checking against $_

If you just want to check if the current value of $_ or $var is a blank (or at least all-whitespace) line, then something like
if (/^\s*$/) { ... }
if ($var =~ /^\s*$/){ ... }
as several others have already mentioned.
However, I find that I most commonly want to ignore blank lines while processing input in a loop. I do that like this:
while (<>) {
next if /^\s*$/;
...
}
If I want to allow the traditional shell-style comments, I usually add
s/\s*#.*$//;
just before the check for a blank line.

while (<>){
chomp;
if ($_ eq ""){
print "blank at $.\n";
}
}

The way you showed - if ( $_ eq '' ) is perfectly sane. Perhaps you should describe what is your problem with it?

if(/^\s*$/)
{
$x = 0;
}

Extracting a block of text where the closing expression depends on the opening one

I have a text string structured like this:
= Some Heading (1)
Some text
== Some Sub-Heading (2)
Some more text
=== Some Sub-sub-heading (3)
Some details here
= Some other Heading (4)
I want to extract the content of second heading, including any subsection. I do not know beforehand what is the depth of the second heading, so I need to match from there to the next heading that is of the same depth, or shallower, or the end of the string.
In the example above, this would yield:
== Some Sub-Heading (2)
Some more text
=== Some Sub-sub-heading (3)
Some details here
This is where I get stuck. How can I use the matched sub-expression opening the second heading as part of the sub-expression for closing the section.

I'd skip trying to use a complex regex. Instead write a simple parser and build up a tree.
Here's a rough and ready implementation. It's only optimized for lazy coding. You may want to use libraries from CPAN to build your parser and your tree nodes.
#!/usr/bin/perl
use strict;
use warnings;
my $document = Node->new();
my $current = $document;
while ( my $line = <DATA> ) {
if ( $line =~ /^=+\s/ ) {
my $current_depth = $current->depth;
my $line_depth = Node->Heading_Depth( $line );
if ( $line_depth > $current_depth ) {
# child node.
my $line_node = Node->new();
$line_node->heading( $line );
$line_node->parent( $current );
$current->add_children( $line_node );
$current = $line_node;
}
else {
my $line_node = Node->new();
while ( my $parent = $current->parent ) {
if ( $line_depth == $current_depth ) {
# sibling node.
$line_node->heading( $line );
$line_node->parent( $parent );
$current = $line_node;
$parent->add_children( $current );
last;
}
# step up one level.
$current = $parent;
}
}
}
else {
$current->add_children( $line );
}
}
use Data::Dumper;
print Dumper $document;
BEGIN {
package Node;
use Scalar::Util qw(weaken blessed );
sub new {
my $class = shift;
my $self = {
children => [],
parent => undef,
heading => undef,
};
bless $self, $class;
}
sub heading {
my $self = shift;
if ( #_ ) {
$self->{heading} = shift;
}
return $self->{heading};
}
sub depth {
my $self = shift;
return $self->Heading_Depth( $self->heading );
}
sub parent {
my $self = shift;
if ( #_ ) {
$self->{parent} = shift;
weaken $self->{parent};
}
return $self->{parent};
}
sub children {
my $self = shift;
return #{ $self->{children} || [] };
}
sub add_children {
my $self = shift;
push #{$self->{children}}, #_;
}
sub stringify {
my $self = shift;
my $text = $self->heading;
foreach my $child ( $self->children ) {
no warnings 'uninitialized';
$text .= blessed($child) ? $child->stringify : $child;
}
return $text;
}
sub Heading_Depth {
my $class = shift;
my $heading = shift || '';
$heading =~ /^(=*)/;
my $depth = length $1;
return $depth;
}
}
__DATA__
= Heading (1)
Some text
= Heading (2)
Some more text
== Subheading (3)
Some details here
== Subheading (3)
Some details here
= Heading (4)

#!/usr/bin/perl
my $all_lines = join "", <>;
# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\1 matches the 1st matched group)
if ( $all_lines =~ /(=+ Heading )\([2]\)(.*?)\1/s ) {
print "$2";
}

This splits the file in sections:
my #all = split /(?=^= )/m, join "", <$filehandle>;
shift #all;

daotoad and jrockway are absolutely right.
If you're trying to parse a tree-like data structure, bending regex to your will only results in a brittle inscrutable and still-not-general-enough intricate blob of code.
If you insist, though, here's a revised snippet that works. Matching up to same-depth separator OR end of string is one complication. Matching strings at depths less then or equal the current depth is more challenging and needed a two-step.
#!/usr/bin/perl
my $all_lines = join "", <>;
# match a Heading that ends with (2) and read everything between that match
# and the next heading of the same depth (\2 matches the 2nd parenthesized group)
if ( $all_lines =~ m/((=+) [^\n]*\(2\)(.*?))(\n\2 |\z)/s ) {
# then trim it down to just the point before any heading at lesser depth
my $some_lines = $1;
my $depth = length($2);
if ($some_lines =~ m/(.*?)(\n={1,$depth} |\z)/s) {
print "$1\n";
}
}
But my advice is to avoid this route and parse it with something readable and maintainable!

Just for a giggle:
/^(?>(=+).*\(2\))(?>[\r\n]+(?=\1=|[^=]).*)*/m
The lookahead ensures that, if a line starts with an equals sign, there is at least one more equals sign than in the prefix of the original line. Notice that the second part of the lookahead matches any character other than an equals sign, including a linefeed or carriage return. That lets it match an empty line, but not the end of the string.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

Find matching between 2 files (how to improve efficiency) - regex

Related

Matching a file line by line

Using iterated variables with regex

Negate regular expression in Perl

How can I detect a blank line in Perl?

Extracting a block of text where the closing expression depends on the opening one

Categories

Resources