How can I detect a blank line in Perl? - regex

How do I check a line ($_ value) is a blank line in Perl? Or another
good method to check it instead of using $_?
I want to code like this
if ($_ eq '') # Check current line is a blank line (no any characters)
{
$x = 0;
}
I updated some code with a question solution below.
My test.txt for parsing:
constant fixup private GemAlarmFileName = <A "C:\\TMP\\ALARM.LOG">
vid = 0
name = ""
units = ""
constant fixup private GemConfigAlarms = <U1 0> /* my Comment */
vid = 1
name = "CONFIGALARMS"
units = ""
min = <U1 0>
max = <U1 2>
default = <U1 0>
My code is below.
That's why I need to initially set $x = 0. I am not sure if it is a normal
solution or not.
sub ConstantParseAndPrint
{
if (/^$/) // SOLUTION!
{
$x = 0;
}
if ($x == 0)
{
if (/^\s*(constant)\s*(fixup|\/\*fixup\*\/|)\s*(private|)\s*(\w+)\s+=\s+<([a-zA-Z0-9]+)\s+(["']?)([a-zA-Z0-9.:\\]+)\6>\s*(\/\*\s*(.*?)\s*\*\/|)(\r|\n|\s)/)
{
$name1 = $1; # Constant
$name2 = $2; # Fixup
$name3 = $3; # Private
$name4 = $4;
$name5 = $5;
$name6 = $7;
$name7 = $8;
# start print
if (!$name7 eq '')
{
print DEST_XML_FILE "<!-- $name7-->\n";
}
print DEST_XML_FILE " <ECID";
print DEST_XML_FILE " logicalName=\"$name4\"";
print DEST_XML_FILE " valueType=\"$name5\"";
print DEST_XML_FILE " value=\"$name6\"";
$x = 1;
}
}
elsif ($x == 1)
{
if(/\s*vid\s*=\s*(.*?)(\s|\n|\r)/)
{
$nID = $1;
print DEST_XML_FILE " vid=\"$nID\"";
$x = 2;
}
}
elsif ($x == 2)
{
if(/\s*name\s*=\s*(.*?)(\s|\n|\r)/)
{
$nName = $1;
print DEST_XML_FILE " name=$nName";
$x = 3;
}
}
elsif ($x == 3)
{
if (/\s*units\s*=\s*(.*?)(\s|\n|\r)/)
{
$nUnits = $1;
print DEST_XML_FILE " units=$nUnits";
$x = 4;
}
}
elsif ($x == 4)
{
# \s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>\
if (/\s*min\s*=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>(\s|\n|\r)/)
{
#$nMinName1 = $1;
$nMinName2 = $2; # Find the nMin Value
#$nMinName3 = $3;
#$nMinName4 = $4;
print DEST_XML_FILE " min=\"$nMinName2\"";
$x = 5;
}
else
{
print DEST_XML_FILE "></ECID>\n";
$x = 0; # There is no line 4 and line 5
}
}
elsif ($x == 5)
{
if (/\s*max\s*=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>(\s|\n|\r)/)
{
#$nMaxName1 = $1;
$nMaxName2 = $2; # Find the nMax Value
#$nMaxName3 = $3;
#$nMaxName4 = $4;
print DEST_XML_FILE " max=\"$nMaxName2\"";
$x = 6;
}
}
elsif ($x == 6)
{
if (/\s*default\s*=\s+<([a-zA-Z0-9]+)\s+([a-zA-Z0-9]+)>(\s|\n|\r)/)
{
#$nDefault1 = $1;
$nDefault2 = $2; # Find the default Value
#$nDefault3 = $3;
#$nDefault4 = $4;
print DEST_XML_FILE " default=\"$nDefault2\">";
print DEST_XML_FILE "</ECID>\n";
$x = 0;
}
}
}

if ($_ =~ /^\s*$/) {
# blank
}
checks for 0 or more whitespaces (\s*) bound by beginning(^)/end($) of line. That's checking for a blank line (i.e. may have whitespace). If you want an empty line check, just remove the \s*.
The check against $_ can be implicit, so you can reduce the above to if (/^\s*$/) for conciseness.

Against my better judgment I will try to help you again.
The issue is not how to find a blank line. The issue is not which regex to use. The fundamental issue is understanding how to analyze a problem and turn that analysis into code.
In this case the problem is "How do I parse this format?"
I've written a parser for you. I have also taken the time to write a detailed description of the process I used to write it.
WARNING: The parser is not carefully tested for all cases. It does not have enough error handling built in. For those features, you can request a rate card or write them yourself.
Here's the data sample you provided (I'm not sure which of your several questions I pulled this from):
constant fixup GemEstabCommDelay = <U2 20>
vid = 6
name = "ESTABLISHCOMMUNICATIONSTIMEOUT"
units = "s"
min = <U2 0>
max = <U2 1800>
default = <U2 20>
constant fixup private GemConstantFileName = <A "C:\\TMP\\CONST.LOG">
vid = 4
name = "" units = ""
constant fixup private GemAlarmFileName = <A "C:\\TMP\\ALARM.LOG">
vid = 0
name = ""
units = ""
Before you can write a parser for a data file, you need to have a description the structure of the file. If you are using a standard format (say XML) you can read the existing specification. If you are using some home-grown format, you get to write it yourself.
So, based on the sample data, we can see that:
data is broken into blocks.
each block starts with the word constant in column 0.
each block ends with a blank line.
a block consists of a start line, and zero or more additional lines.
The start line consists of the keyword constant followed by one or more whitespace delimited words, an '=' sign and an <> quoted data value.
The last keyword appears to be the name of the constant. Call it constant_name
The <>-quoted data appears to be a combined type/value specifier.
earlier keywords appear to specify additional metadata about the constant. Let's call those options.
The additional lines specify additional key value pairs. Let's call them attributes. Attributes may have a single value or they may have a type/value specifier.
One or more attributes may appear in a single line.
Okay, so now we have a rough spec. What do we do with it?
How is the format structured? Consider the logical units of organization from largest to smallest. These will determine the structure and flow of our code.
A FILE is made of BLOCKS.
BLOCKS are made of LINES.
So our parser should decompose a file into blocks, and then handle the blocks.
Now we rough out a parser in comments:
# Parse a constant spec file.
# Until file is done:
# Read in a whole block
# Parse the block and return key/value pairs for a hash.
# Store a ref to the hash in a big hash of all blocks, keyed by constant_name.
# Return ref to big hash with all block data
Now we start to fill in some code:
# Parse a constant spec file.
sub parse_constant_spec {
my $fh = shift;
my %spec;
# Until file is done:
# Read in a whole block
while( my $block = read_block($fh) ) {
# Parse the and return key/value pairs for a hash.
my %constant = parse_block( $block );
# Store a ref to the hash in a big hash of all blocks, keyed by constant_name.
$spec{ $constant{name} } = \%constant;
}
# Return ref to big hash with all block data
return \%spec;
}
But it won't work. The parse_block and read_block subs haven't been written yet. At this stage that's OK. The point is to rough in features in small, understandable chunks. Every once in a while, to keep things readable you need to gloss over the details drop in a subroutine--otherwise you wind up with monstrous 1000 line subs that are impossible to debug.
Now we know we need to write a couple of subs to finish up, et viola:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $fh = \*DATA;
print Dumper parse_constant_spec( $fh );
# Parse a constant spec file.
# Pass in a handle to process.
# As long as it acts like a file handle, it will work.
sub parse_constant_spec {
my $fh = shift;
my %spec;
# Until file is done:
# Read in a whole block
while( my $block = read_block($fh) ) {
# Parse the and return key/value pairs for a hash.
my %constant = parse_block( $block );
# Store a ref to the hash in a big hash of all blocks, keyed by constant_name.
$spec{ $constant{const_name} } = \%constant;
}
# Return ref to big hash with all block data
return \%spec;
}
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines;
my $block_started = 0;
while( my $line = <$fh> ) {
$block_started++ if $line =~ /^constant/;
if( $block_started ) {
last if $line =~ /^\s*$/;
push #lines, $line;
}
}
return \#lines if #lines;
return;
}
sub parse_block {
my $block = shift;
my ($start_line, #attribs) = #$block;
my %constant;
# Break down first line:
# First separate assignment from option list.
my ($start_head, $start_tail) = split /=/, $start_line;
# work on option list
my #options = split /\s+/, $start_head;
# Recover constant_name from options:
$constant{const_name} = pop #options;
$constant{options} = \#options;
# Now we parse the value/type specifier
#constant{'type', 'value' } = parse_type_value_specifier( $start_tail );
# Parse attribute lines.
# since we've already got multiple per line, get them all at once.
chomp #attribs;
my $attribs = join ' ', #attribs;
# we have one long line of mixed key = "value" or key = <TYPE VALUE>
#attribs = $attribs =~ /\s*(\w+\s+=\s+".*?"|\w+\s+=\s+<.*?>)\s*/g;
for my $attrib ( #attribs ) {
warn "$attrib\n";
my ($name, $value) = split /\s*=\s*/, $attrib;
if( $value =~ /^"/ ) {
$value =~ s/^"|"\s*$//g;
}
elsif( $value =~ /^</ ) {
$value = [ parse_type_value_specifier( $start_tail ) ];
}
else {
warn "Bad line";
}
$constant{ $name } = $value;
}
return %constant;
}
sub parse_type_value_specifier {
my $tvs = shift;
my ($type, $value) = $tvs =~ /<(\w+)\s+(.*?)>/;
return $type, $value;
}
__DATA__
constant fixup GemEstabCommDelay = <U2 20>
vid = 6
name = "ESTABLISHCOMMUNICATIONSTIMEOUT"
units = "s"
min = <U2 0>
max = <U2 1800>
default = <U2 20>
constant fixup private GemConstantFileName = <A "C:\\TMP\\CONST.LOG">
vid = 4
name = "" units = ""
constant fixup private GemAlarmFileName = <A "C:\\TMP\\ALARM.LOG">
vid = 0
name = ""
units = ""
The above code is far from perfect. IMO, parse_block is too long and ought to be broken into smaller subs. Also, there isn't nearly enough validation and enforcement of well-formed input. Variable names and descriptions could be clearer, but I don't really understand the semantics of your data format. Better names would more closely match the semantics of the data format.
Despite these issues, it does parse your format and produce a big handy data structure that can be stuffed into whatever output format you want.
If you use this format in many places, I recommend putting the parsing code into a module. See perldoc perlmod for more info.
Now, please stop using global variables and ignoring good advice. Please start reading the perldoc, read Learning Perl and Perl Best Practices, use strict, use warnings. While I am throwing reading lists around go read Global Variables are Bad and then wander around the wiki to read and learn. I learned more about writing software by reading c2 than I did in school.
If you have questions about how this code works, why it is laid out as it is, what other choices could have been made, speak up and ask. I am willing to help a willing student.
Your English is good, but it is clear you are not a native speaker. I may have used too many complex sentences. If you need parts of this written in simple sentences, I can try to help. I understand that working in a foreign language is very difficult.

The answer depends on what you mean by a blank line (whether it contains no characters apart from a newline or whether it contains only whitespace). An idiomatic way to deal with this is to use a negative match against \S which matches in both of these cases:
if ( ! /\S/ ) {
...
}
If you are only looking for the former than your own answer is fine.
You often see this technique used as a filter:
while (<>) {
next unless /\S/; # Ignore blank lines.
...
}

You can use:
if ($_ =~ /^$/)
or even just
if (/^$/)
since Perl assumes checking against $_

If you just want to check if the current value of $_ or $var is a blank (or at least all-whitespace) line, then something like
if (/^\s*$/) { ... }
if ($var =~ /^\s*$/){ ... }
as several others have already mentioned.
However, I find that I most commonly want to ignore blank lines while processing input in a loop. I do that like this:
while (<>) {
next if /^\s*$/;
...
}
If I want to allow the traditional shell-style comments, I usually add
s/\s*#.*$//;
just before the check for a blank line.

while (<>){
chomp;
if ($_ eq ""){
print "blank at $.\n";
}
}

The way you showed - if ( $_ eq '' ) is perfectly sane. Perhaps you should describe what is your problem with it?

if(/^\s*$/)
{
$x = 0;
}

Related

Find matching between 2 files (how to improve efficiency)

#file1 contains only startpoint-endpoint pair, each indices represent each pair. file2 is a text file, for #file2 each indices represents each line. I am trying to search each pair from #file1 in #file2 line by line. When the exact match is found, I would then try to extract information1 from file2 and print it out. But for now, I am trying to search for the matched pair in file2. The format of the matching pattern is as below:
Match case
From $file1[0]
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
match if file2 contains:
Line with other stuff
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff
Unmatch Case:
From file1:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
From file2:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:
For text files2, I stored it in #file2. For files1, I have successfully extracted and stored every Startpoint and the next line Endpoint as the format above in #file1. (No problem in extracting and storing each pair, so I wont be showing the code for this, it took around 4mins here) Then I split each element of #address, which are the startpoint and endpoint. Checking line by line in files2, if startpoint match, then I will move on next line to check endpoint, it is only considered match if the next line after Startpoint match the Endpoint, else try to search again until the end line of files2. This script does the job but it took 3 and a half hours to complete(there are around 60k pairs from file1 and 800k lines to check in file2). Is there any other efficient way to do this?
I am new in Perl scripting, I apologize for any silly mistakes, both in my explanation and my coding.
Here's the codes:
#!usr/bin/perl
use warnings;
my $report = '/home/dir/file2';
open ( $DATA,$report ) || die "Error when opening";
chomp (#file2 = <$DATA>);
#No problem in extracting Start-Endpoint pair from file1 into #file1, so I wont include
#the code for this
$size = scalar#file1;
$size2 = scalar#file2;
for ( $total=0; $total<$size; $total++ ) {
my #file1_split = split('\n',$file1[$total]);
chomp #file1_split;
my $match_endpoint = 0;
my $split = 0;
LABEL2: for ( $count=0; $count<$size2; $count++ ) {
if ( $match_endpoint == 1) {
if ( grep { $_ eq "file1_split[$split]" } $file2[$count] )
print"Pair($total):Match Pair\n";
last LABEL2; #move on to check next start-endpoint
#pair
}
else {
$split = 0; #reset back to check the same startpoint
and continue searching until match found or end line of file2
$match_endpoint = 0;
}
}
elsif ( grep { $_ eq "$address_array[$split]"} $array[$count] )
{
$match_endpoint = 1;#enable search for endpoint in next line
$split = 1; #move on next line to match endpoint
next;
}
elsif ( $count==$size2-1 ) {
print"no matching found for Path($total)\n";
}
}
}
If I'm understanding what your code is trying to do,
it looks like it would be more efficient to do it this way:
my %split=#file1;
my %total;
#total{#file1}=(0..$#file1);
my $split;
for( #file2 ){
if( $split ){
if( $_ eq $split ){
print"Pair($total{$split}):Match Pair\n";
}else{
$split{$split}="";
}
}
$split=$split{$_};
delete $split{$_};
}
for( keys %split ){
print"no matching found for Path($total{$_})\n";
}
If I have understood your spec (show matches), I'm betting this will complete in less than 5 seconds, unless you're using an old Dell D333. To further minimize the response time, you would write some extra code to drive the while loop by the hash with the fewest keys (you implied file1). If you use references to hashes, then you can write a small if-else statement to swap the hash references without having to code duplicate while statements.
use strict;
use warnings;
sub makeHash($) {
my ($filename) = #_;
open(DATA, $filename) || die;
my %result;
my ($start, $line);
while (<DATA>) {
if ($_ =~ /^Startpoint: (.*)/) {
$start = $1; # captured group in regular expression
$line = $.; # current line number
} elsif ($_ =~ /^Endpoint: (.*)/) {
my $end = $1;
if (defined $line && $. == ($line + 1)) {
my $key = "$start::$end";
# can distinguish start and end lines if necessary
$result{$key} = {start=>$start, end=>$end, line=>$line};
}
}
}
close(DATA);
return %result;
}
my %file1 = makeHash("file1");
my %file2 = makeHash("file2");
my $fmt = "%10s %10s %s\n";
my $nmatches = 0;
printf $fmt, "File1", "File2", "Key";
while (my ($key, $f1h) = each %file1) {
my $f2h = $file2{$key};
if (defined $f2h) {
# You have access to hash members start and end if you need to distinguish further
printf $fmt, $f1h->{line}, $f2h->{line}, $key;
$nmatches++;
}
}
print "Found $nmatches matches\n";
Below, is my test data generator(thanks). I generated a worst-case scenario of 1,000,000 matches between two equal files. The matching code above finished on my MBP in under 20 seconds using the generated test data.
use strict;
use warnings;
sub rndStr { join'', #_[ map{ rand #_ } 1 .. shift ] }
open(F1, ">file1") || die;
open(F2, ">file2") || die;
for (1..1000000) {
my $start = rndStr(30, 'A'..'Z');
my $end = rndStr(30, 'A'..'Z');
print F1 "Startpoint: $start\n";
print F1 "Endpoint: $end\n";
print F2 "Startpoint: $start\n";
print F2 "Endpoint: $end\n";
}
close(F1);
close(F2);

How to read from specific line to another line - Perl

I'm doing a Perl script and I have a log file where I need to extract data from. I want to know how to read from a specific line to another line(not end of file).
I tried it this way by putting a last if if it reaches the line that I want to stop at but it doesn't work. The line that I want to start reading from is <TEST_HEAD TH 1> and stops at </TEST_HEAD TH 1>. I'm doing this because my regular expression captures data that I do not need, so I tried to read from a specific line to another line.
This is what I've done so far:
while(<$log_fh>)
{
if($. =~ /\<TEST_HEAD TH 1\>/)
{
if ( /Computer Name:\s*(\S+)(-\d+)/i )
{
$details{tester_name} = $1 . $2;
$details{tester_type} = $1;
push #{$details{tester_arr}}, $1 . $2;
}
elsif ( /Operating System:\s*(.*\S)/i )
{
$details{op_sys} = $1;
}
elsif ( /IG-XL Version:\s*([^;]*)/i )
{
$details{igxl_vn} = $1;
}
elsif ( /^([\d]+)\.\d\s+(\S+)\s+([\d-]*)\s+([\d|\w]*)(?=\s)/ )
{
push #{$details{slot}}, $1;
push #{$details{board_name}}, $2;
push #{$details{part_no}}, $3;
push #{$details{serial_no}}, $4;
}
last if $. == /\<\/TEST_HEAD TH 1\>/;
}
}
Just a modified sample of the raw data file:
<TEST_HEAD TH 1> #Start reading here
(Lines containing data to be captured)
</TEST_HEAD TH 1> #end reading here
Without going much into the nested matching logic you may want to change
if($. =~ /\<TEST_HEAD TH 1\>/)
into
if (/<TEST_HEAD TH 1>/ .. /<\/TEST_HEAD TH 1>/)
What you ask is actually XY problem and it would be better to process xml like document with xml parser. Parsing complex XML in Perl
Without knowing specifically how your data looks, I'd also offer another approach.
Set $/ to a record separator, and then you grab a chunk of text in one go. You can then apply a bunch of different regexes to it all at once.
E.g.:
local $/ = 'TEST_HEAD';
while (<$log_fh>) {
next unless m/^\s*TH/;
my ( $tester_name, $tester_id ) = (m/Computer Name:\s*(\S+)(-\d+)/i);
my ($op_sys) = (m/Operating System:\s*(.*\S)/i);
my ( $slot, $board, $part, $serial ) =
(m/^([\d]+)\.\d\s+(\S+)\s+([\d-]*)\s+([\d|\w]*)(?=\s)/m);
# etc.
# then validate and update your array:
$details{$tester_name} = $tester_name;
## etc.
}

Sorting lines with regex in perl

I am trying to read a log file and write all the error logs to a new file. I must also keep track of how many errors there are and the number of messages in general. I must assume that the logs will be broken up onto multiple lines, so I have been using regex and series a variables to search for all possibilities and write to the appropriate file.
My file handles are: FILE, ERRORFILE, and SUCCESSFILE.
use strict;
use warnings;
my $totalcount = 0;
my $errorcount = 0;
my $log = "s"; # $log controls what what should be written where,
# incase it doesn't start with code.
# "s" = SuccessFile, "e" = ErrorFile
my $logStart = "y"; # used with m/^I/ or m/^E/ instead of full code
# incase the code is broken into pieces.
my $dash = 0;
while (<FILE>) {
$dash += () = $_ =~ m/-/g; # can't use tr/// because it counts at compile
if ( $dash lt 25 ) { next; } # this line skips "---Begin <Repository>---"
elsif ( m/[a-zA-Z <>]/ && $dash lt 25 ) { next; }
elsif ( $dash >= 26 ) { last; } #Ends loop at "---End <Repository>---"
if ( m/^I/ && $logStart eq "y" ) {
$log = "s";
$logStart = "n";
$totalcount++;
next;
} #Ignores nonerror logs
elsif ( m/^E/ && $logStart eq "y" ) {
chomp $_;
print ERRORFILE "$_";
$errorcount++;
$totalcount++;
$log = "e";
$logStart = "n";
}
elsif (m/ \.\n$/) { #End of log
if ( $log eq "s" ) { $logStart = "y"; next; }
print ERRORFILE "$_\n" if $log eq "e";
$logStart = "y";
}
else { #line doesn't start with code or end in " .\n"
chomp $_;
print ERRORFILE "$_" if $log eq "e";
next if $log eq "s";
}
}
print "\nThere are $errorcount error logs.\n";
print "There are $totalcount logs in the full log file.\n";
I know that the non-error logs start with I00020036 and the errors start with E03020039. Both end in " .\n"
---------- Begin <Load Repository> ---------------
I00020036: Loaded C:\Documents and Settings\dorja03\Desktop\DSMProduct\external\etpki\Linux_2.4_x86\redistrib\readme.txt into \DSM R11\external\etpki\Linux_2.4_x86\redistrib\readme.txt .
E03020039: Unable to load C:\Documents and Settings\dorja03\Desktop\DSMProduct\external\etpki\Linux_2.4_x86\redistrib\etpki_install_lib.sh into \DSM R11\external\etpki\Linux_2.4_x86\redistrib\etpki_install_lib.sh . Text file contains invalid characters .
---------- End <Load Repository> ---------------
I have been running a test sample with two lines. If the error comes up first, it will print it to the error file, along with the non-error log, and on the same line. If the non-error goes first, it doesn't recognize the error.
Is this because I'm using m// wrong or something else entirely?
Edit: Test input has been added. I also added the code to skip the header and footer.
Test output: If the non-error comes first, there are 0 errors and 1 log total.
If the non-error comes first, there is 1 error and 1 log total.
If this worked, it should have said there was 1 error and 2 logs. It also would have only printed the error to the ERRORFILE.
This won't answer why your code isn't working, but here's how I would approach the problem:
Since the logs can span over multiple lines, modify the default line-by-line behavior by tweaking $/.
Use appropriate data structures to filter the errors from non-errors. This will also allow you to defer printing till later.
The code would then look something like this:
use strict;
use warnings;
my %logs;
local $/ = " .\n";
while ( <> ) { # Now $_ is the full (multi-line) log
next if /--- Begin/; # Skip if /Begin/
last if /--- End/; # Stop processing if /End/
if ( m/^I/ ) {
push #{ $logs{nonerror} }, $_;
}
if ( m/^E/ ) {
push #{ $logs{error} }, $_;
}
}
printf "There are %d error logs\n.", scalar #{ $logs{error} // [] } ;
printf "There are %d logs in the full logfile.\n",
#{$logs{error} // []} + #{$logs{nonerror} // []};
Things I like about this approach:
Perl takes care of deciding when each log message ends (eliminates the $logStart variable altogether).
The logic is much easier to extend.
The while loop is dedicated to processing the log file (no need to ++ anything).
Use of sensibly-labeled data structures instead of temporary variables makes for easier code maintenance.
To make a formal answer, I scrapped this code and replaced it. I instead fed the file into var with a delimiter, then just split it into an array. It was much easier and cleaner. I don't however have the code anymore due to a lost flashdrive.

regular expression code

I need to find match between two tab delimited files files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
what I would like to do is to retrive the matching line between the two file. What I would like to match is everyting after the gene ID
So far I have written this code but unfortunately perl keeps giving me the error:
use of "Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where i am doing it wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach my $line (#sample1) {
my #tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (#sample2){
my #fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}
On its surface your code seems to be fine (although I didn't debug it). If you don't have an error I cannot spot, could be that the input data has RE special character, which will confuse the regular expression engine when you put it as is (e.g. if any of the variable has the '$' character). Could also be that instead of tab you have spaces some where, in which case you'll indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little bit more Perl Idiomatic. I like using the implicit $_ which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach (#sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (#sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point to split the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it make sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-seperated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my #sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, #fields1) = split /\t/, $s1;
my #regexens = map qr{\Q$_\E(.*)}, #fields1;
SAMPLE_2:
foreach my $s2 (#sample2) {
my (undef, #fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my #sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (#sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = #ARGV;
my #lines_a = File::Slurp::read_file($ina);
my #lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (#lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
if (exists $table_b->{$rest}) {
push #{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (#lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } #{ $table_b->{$rest} }) . "\n";
}
}

Parsing input to get specific values

I have input like this:
"[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone";
It appears as a continuous line, there are no line breaks. I need the
largest value out of the values between [ and the first occurrence of
|. In this case, for example, the largest value is 204. Once
that is obtained, I want to print the contents of that element
between []. In this case, it would be "204|0|{A=9,B=201,C=61,D=11}|Calculator".
I've tried something like this, but it is not going anywhere:
my #array1;
my $data = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=1
+7}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,
+D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C
+=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}
+|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,
+D=14}|phone";
my $high = 0;
my #values = split(/\[([^\]]+)\]/,$data) ;
print "Values is #values \n";
foreach (#values) {
# I want the value that preceeds the first occurence of | in each array
# element, i.e. 0,0,196,204, etc.
my ($conf,$rest)= split(/\|/,$_);
print "Conf is $conf \n";
print "Rest is $rest \n";
push(#array1, $conf);
push (#array2, $rest);
print "Array 1 is #array1 \n";
print "Array 2 is #array2 \n";
}
$conf = highest(#array1);
my $i=0;
# I want the index value of the element that contains the highest conf value,
# in this case 204.
for (#myarray1) { last if $conf eq $_; $i++; };
print "$conf=$i\n";
# I want to print the rest of the string that was split in the same index
# position.
$rest = #array2[$i];
print "Rest is $rest \n";
# To get the highest conf value
sub highest {
my #data = #_;
my $high = 0;
for(#data) {
$high = $_ if $_ > $high;
}
$high;
}
Maybe I should be using a different approach. Could someone help me, please?
One way of doing it:
#!/usr/bin/perl
use strict;
my $s = "[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]";
my #parts = split(/\]/, $s);
my $max = 0;
my $data = "";
foreach my $part (#parts) {
if ($part =~ /\[(\d+)/) {
if ($1 > $max) {
$max = $1;
$data = substr($part, 1);
}
}
}
print $data."\n";
A couple of notes:
you can split your original string by \], so you get parts like [0|0|{A=145,B=2,C=12,D=18}|!
then you parse each part to get the integer after the initial [
the rest it's easy: keep track of the biggest integer and of the corresponding part, and output it at the end.
In shell script:
#!/bin/bash
MAXVAL=$(cat /tmp/data | tr [ "\\n" | cut -d"|" -f1 | sort -n | tail -1)
cat /tmp/data | tr [] "\\n" | grep ^$MAXVAL
The first line cuts your big mass of data into lines, extracts just the first field, sorts it and takes the max. The second line cuts the data into lines again and greps for that max val.
If you have a LOT of data, this could be slow, so you could put the "lined" data into a temp file or something.
split() is the Right Tool when you know what you want to throw away. Capturing or m//g is the Right Tool when you know what you want to keep. (paraphrased from a Randal Schwartz quote).
You want to specify what to keep (between square brackets) rather than what to throw away (nothing!).
Luckily, your data is "hash shaped" (ie. alternating keys and values), so load it into a hash, sort the keys, and output the value for the highest key:
my %data = $data =~ /\[
(\d+) # digits are the keys
([^]]+) # rest are the values
\]/gx;
my($highest) = sort {$b <=> $a} keys %data; # inefficent if $data is big
print $highest, $data{$highest}, "\n";
Another way of doing this :
#!/usr/bin/perl
use strict;
my $str = '[0|0|{A=145,B=2,C=12,D=18}|!][0|0|{A=167,B=2,C=67,D=17}|.1iit][196|0|{A=244,B=6,C=67,D=12}|10:48AM][204|0|{A=9,B=201,C=61,D=11}|Calculator][66|0|{A=145,B=450,C=49,D=14}|phone]0|0|{A=145,B=2,C=12,D=18}|!0|0|{A=167,B=2,C=67,D=17}|.1iit196|0|{A=244,B=6,C=67,D=12}|10:48AM204|0|{A=9,B=201,C=61,D=11}|Calculator66|0|{A=145,B=450,C=49,D=14}|phone';
my $maxval = 0;
my $pattern;
while ( $str =~ /(\[(\d+)\|.+?\])/g)
{
if ( $maxval < $2 ) {
$maxval = $2;
$pattern = $1;
}
}
print "Maximum value = $maxval and the associate pattern = $pattern \n";
# In this example $maxvalue = 204
# and $pattern = [204|0|{A=9,B=201,C=61,D=11}|Calculator]