Matching a file line by line - regex

$text_file = '/homedir/report';
open ( $DATA,$text_file ) || die "Error!"; #open the file
#ICC2_array = <$DATA>;
$total_line = scalar#ICC2_array; # total number of lines
#address_array = split('\n',$address[6608]); # The first content is what I want and it is correct, I have checked using print
LABEL2:
for ( $count=0; $count < $total_line; $count++ ) {
if ( grep { $_ eq "$address_array[0]" } $ICC2_array[$count] ) {
print "This address is found!\n";
last LABEL2;
}
elsif ( $count == $total_line - 1 ) { # if not found in all lines
print "No matching is found for this address\n";
last LABEL2;
}
}
I am trying to match the 6609th address in #ICC2_array line by line. I am certain that this address is in $text_file but it is exactly the same format.
Something like this:
$address[6608] contains
Startpoint: port/start/input_output (triggered by clock3)
Endpoint: port/end/input_output (rising edge-triggered)
$address_array[0] contains
Startpoint: port/start/input_output (triggered by clock3)
There's a line in $text_file that is
Startpoint: port/start/input_output (triggered by clock3)
However the output is "no matching found for this address", can anybody point out my mistakes?

All of the elements in #ICC2_array will have new-line characters at the end.
As $address_array[0] is created by splitting data on \n it is guaranteed not to contain a new-line character.
A string that ends in a new-line can never be equal to a string that doesn't contain a new-line.
I suggest replacing:
#ICC2_array = <$DATA>;
With:
chomp(#ICC2_array = <$DATA>);
Update: Another problem I've just spotted. You are incrementing $count twice on each iteration. You increment it in the loop control code ($count++) and you're also incrementing it in the else clause ($count += 1). So you're probably only checking every other element in #ICC2_array.

I think your code should look like this
The any operator from core module List::Util is like grep except that it stops searching as soon as it finds a match, so on average should be twice as fast. Early iterations of List::Util did not contain any, and you can simply use grep instead if that applies to you
I've removed the _array from your array identifier as the # indicates that it's an array and it's just unwanted noise
use List::Util 'any';
my $text_file = '/homedir/report';
my #ICC2 = do {
open my $fh,'<', $text_file or die qq{Unable to open "$text_file" for input: $!};
<$fh>;
};
chomp #ICC2;
my ( $address ) = split /\n/, $address[6608], 2;
if ( any { $_ eq $address } #ICC2 ) {
print "This address is found\n"
}
else {
print "No match is found for this address\n";
}

Related

Find matching between 2 files (how to improve efficiency)

#file1 contains only startpoint-endpoint pair, each indices represent each pair. file2 is a text file, for #file2 each indices represents each line. I am trying to search each pair from #file1 in #file2 line by line. When the exact match is found, I would then try to extract information1 from file2 and print it out. But for now, I am trying to search for the matched pair in file2. The format of the matching pattern is as below:
Match case
From $file1[0]
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
match if file2 contains:
Line with other stuff
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
information1:
information2:
Lines with other stuff
Unmatch Case:
From file1:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /output/end/scan_all (positive-triggered)
From file2:
Startpoint: /source/in_out/map (positive-triggered)
Endpoint: /different endpoint pair/ (positive-triggered)
information1:
information2:
For text files2, I stored it in #file2. For files1, I have successfully extracted and stored every Startpoint and the next line Endpoint as the format above in #file1. (No problem in extracting and storing each pair, so I wont be showing the code for this, it took around 4mins here) Then I split each element of #address, which are the startpoint and endpoint. Checking line by line in files2, if startpoint match, then I will move on next line to check endpoint, it is only considered match if the next line after Startpoint match the Endpoint, else try to search again until the end line of files2. This script does the job but it took 3 and a half hours to complete(there are around 60k pairs from file1 and 800k lines to check in file2). Is there any other efficient way to do this?
I am new in Perl scripting, I apologize for any silly mistakes, both in my explanation and my coding.
Here's the codes:
#!usr/bin/perl
use warnings;
my $report = '/home/dir/file2';
open ( $DATA,$report ) || die "Error when opening";
chomp (#file2 = <$DATA>);
#No problem in extracting Start-Endpoint pair from file1 into #file1, so I wont include
#the code for this
$size = scalar#file1;
$size2 = scalar#file2;
for ( $total=0; $total<$size; $total++ ) {
my #file1_split = split('\n',$file1[$total]);
chomp #file1_split;
my $match_endpoint = 0;
my $split = 0;
LABEL2: for ( $count=0; $count<$size2; $count++ ) {
if ( $match_endpoint == 1) {
if ( grep { $_ eq "file1_split[$split]" } $file2[$count] )
print"Pair($total):Match Pair\n";
last LABEL2; #move on to check next start-endpoint
#pair
}
else {
$split = 0; #reset back to check the same startpoint
and continue searching until match found or end line of file2
$match_endpoint = 0;
}
}
elsif ( grep { $_ eq "$address_array[$split]"} $array[$count] )
{
$match_endpoint = 1;#enable search for endpoint in next line
$split = 1; #move on next line to match endpoint
next;
}
elsif ( $count==$size2-1 ) {
print"no matching found for Path($total)\n";
}
}
}
If I'm understanding what your code is trying to do,
it looks like it would be more efficient to do it this way:
my %split=#file1;
my %total;
#total{#file1}=(0..$#file1);
my $split;
for( #file2 ){
if( $split ){
if( $_ eq $split ){
print"Pair($total{$split}):Match Pair\n";
}else{
$split{$split}="";
}
}
$split=$split{$_};
delete $split{$_};
}
for( keys %split ){
print"no matching found for Path($total{$_})\n";
}
If I have understood your spec (show matches), I'm betting this will complete in less than 5 seconds, unless you're using an old Dell D333. To further minimize the response time, you would write some extra code to drive the while loop by the hash with the fewest keys (you implied file1). If you use references to hashes, then you can write a small if-else statement to swap the hash references without having to code duplicate while statements.
use strict;
use warnings;
sub makeHash($) {
my ($filename) = #_;
open(DATA, $filename) || die;
my %result;
my ($start, $line);
while (<DATA>) {
if ($_ =~ /^Startpoint: (.*)/) {
$start = $1; # captured group in regular expression
$line = $.; # current line number
} elsif ($_ =~ /^Endpoint: (.*)/) {
my $end = $1;
if (defined $line && $. == ($line + 1)) {
my $key = "$start::$end";
# can distinguish start and end lines if necessary
$result{$key} = {start=>$start, end=>$end, line=>$line};
}
}
}
close(DATA);
return %result;
}
my %file1 = makeHash("file1");
my %file2 = makeHash("file2");
my $fmt = "%10s %10s %s\n";
my $nmatches = 0;
printf $fmt, "File1", "File2", "Key";
while (my ($key, $f1h) = each %file1) {
my $f2h = $file2{$key};
if (defined $f2h) {
# You have access to hash members start and end if you need to distinguish further
printf $fmt, $f1h->{line}, $f2h->{line}, $key;
$nmatches++;
}
}
print "Found $nmatches matches\n";
Below, is my test data generator(thanks). I generated a worst-case scenario of 1,000,000 matches between two equal files. The matching code above finished on my MBP in under 20 seconds using the generated test data.
use strict;
use warnings;
sub rndStr { join'', #_[ map{ rand #_ } 1 .. shift ] }
open(F1, ">file1") || die;
open(F2, ">file2") || die;
for (1..1000000) {
my $start = rndStr(30, 'A'..'Z');
my $end = rndStr(30, 'A'..'Z');
print F1 "Startpoint: $start\n";
print F1 "Endpoint: $end\n";
print F2 "Startpoint: $start\n";
print F2 "Endpoint: $end\n";
}
close(F1);
close(F2);

Using iterated variables with regex

The point of the overall script is to:
step 1) open a single column file and read off first entry.
step 2) open a second file containing lots of rows and columns, read off EACH line one at a time, and find anything in that line that matches the first entry from the first file.
step3) if a match is found, then "do something constructive", and if not, go to the first file and take the second entry and repeat step 2 and step 3, and so on...
here is the script:
#!/usr/bin/perl
use strict; #use warnings;
unless(#ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = $ARGV[0];
chomp( $list );
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
my( #list ) = (<LIST>);
close LIST;
open (CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my(#spreadsheet) = (<CHR1>);
close CHR1;
for (my $i = 0; $i < scalar #list; $i++ ) {
print "$i in list is $list[$i]\n";
for (my $j = 1; $j < scalar #spreadsheet; $j++ ) {
#print "$spreadsheet[$j]\n";
if ( $spreadsheet[$j] ) {
print "will $list[$i] match with $spreadsheet[$j]?\n";
}
else { print "no match\n" };
} #for
} #for
I plan to use a regex in the line if ( $spreadsheet[$j] ) { but am having a problem at this step as it is now. On the first interation, the line print "will $list[$i] match with $spreadsheet[$j]?\n"; prints $list[$i] OK but does not print $spreadsheet[$j]. This line will print both variables correctly on the second and following iterations. I do not see why?
At first glance nothing looks overtly incorrect. As mentioned in the comments the $j = 1 looks questionable but perhaps you are skipping the first row on purpose.
Here is a more perlish starting point that is tested. If it does not work then you have something going on with your input files.
Note the extended trailing whitespace removal. Sometimes if you open a WINDOWS file on a UNIX machine and use chomp, you can have embedded \r in your text that causes weird things to happen to printed output.
#!/usr/bin/perl
use strict; #use warnings;
unless(#ARGV) {
print "\usage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = shift;
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
open(CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my #spreadsheet = map { s/\s+$//; $_ } <CHR1>;
close CHR1;
# s/\s+$//; is like chomp but trims all trailing whitespace even
# WINDOWS files opened on a UNIX system.
for my $item (<LIST>) {
$item =~ s/\s+$//; # trim all trailing whitespace
print "==> processing '$item'\n";
for my $row (#spreadsheet) {
if ($row =~ /\Q$item\E/) { # see perlre for \Q \E
print "match '$row'\n";
}
else {
print "no match '$row'\n";
}
}
}
close LIST;

regular expression code

I need to find match between two tab delimited files files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
what I would like to do is to retrive the matching line between the two file. What I would like to match is everyting after the gene ID
So far I have written this code but unfortunately perl keeps giving me the error:
use of "Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where i am doing it wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach my $line (#sample1) {
my #tab = split (/\t/, $line);
my $chr = $tab[1];
my $start = $tab[2];
my $end = $tab[3];
my $ref = $tab[4];
my $alt = $tab[5];
my $name = $tab[6];
foreach my $item (#sample2){
my #fields = split (/\t/,$item);
if ( $fields[1] =~ m/$chr(.*)/
&& $fields[2] =~ m/$start(.*)/
&& $fields[4] =~ m/$ref(.*)/
&& $fields[5] =~ m/$alt(.*)/
&& $fields[6] =~ m/$name(.*)/
) {
print $line, "\n", $item;
}
}
}
On its surface your code seems to be fine (although I didn't debug it). If you don't have an error I cannot spot, could be that the input data has RE special character, which will confuse the regular expression engine when you put it as is (e.g. if any of the variable has the '$' character). Could also be that instead of tab you have spaces some where, in which case you'll indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little bit more Perl Idiomatic. I like using the implicit $_ which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my #sample1 = <INA>;
my #sample2 = <INB>;
foreach (#sample1) {
(my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
#print "$rex\n";
foreach (#sample2) {
if( m/$rex/ ) {
print "$id - $_";
}
}
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point to split the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it make sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-seperated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my #sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
my (undef, #fields1) = split /\t/, $s1;
my #regexens = map qr{\Q$_\E(.*)}, #fields1;
SAMPLE_2:
foreach my $s2 (#sample2) {
my (undef, #fields2) = split /\t/, $s2;
for my $i (0 .. $#regexens) {
$fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
}
# only gets here if all regexes matched
print $s1, $s2;
}
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my #sample2 = <$file2>;
foreach my $s1 (<$file1>) {
foreach my $s2 (#sample2) {
print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
}
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
# map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
<$file1>;
# 2nd loop
foreach my $line (<$file2>) {
my ($id2, $key) = split /\t/, $line, 2;
if (defined (my $id1 = $seen{$key})) {
print "$id1\t$key";
print "$id2\t$key";
}
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;
my ($ina, $inb) = #ARGV;
my #lines_a = File::Slurp::read_file($ina);
my #lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;
# Store all lines in second file in a hash with every different value as a hash key
# If there are several identical ones we store them also, so the hash values are lists containing the id and line number
foreach (#lines_b) {
chomp; # strip newlines
$ln++; # count current line number
my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespaces, could be too many tabs or spaces instead
if (exists $table_b->{$rest}) {
push #{ $table_b->{$rest} }, [$id, $ln]; # push to existing list if we already found an entry that is the same
} else {
$table_b->{$rest} = [ [$id, $ln] ]; # create new entry if this is the first one
}
}
# Go thru first file and print out all matches we might have
$ln = 0;
foreach (#lines_a) {
chomp;
$ln++;
my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
if (exists $table_b->{$rest}) { # if we have this entry print where it is found
print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } #{ $table_b->{$rest} }) . "\n";
}
}

How to determine number of times a word appears in text?

How can I find the number of times a word is in a block of text in Perl?
For example my text file is this:
#! /usr/bin/perl -w
# The 'terrible' program - a poorly formatted 'oddeven'.
use constant HOWMANY => 4; $count = 0;
while ( $count < HOWMANY ) {
$count++;
if ( $count == 1 ) {
print "odd\n";
} elsif ( $count == 2 ) {
print "even\n";
} elsif ( $count == 3 ) {
print "odd\n";
} else { # at this point $count is four.
print "even\n";
}
}
I want to find the number of "count" word for that text file. File is named terrible.pl
Idealy it should use regex and with minimum number of line of code.
EDIT: This is what I have tried:
use IO::File;
my $fh = IO::File->new('terrible.pl', 'r') or die "$!\n";
my %words;
while (<$fh>) {
for my $word ($text =~ /count/g) {
print "x";
$words{$word}++;
}
}
print $words{$word};
Here's a complete solution. If this is homework, you learn more by explaining this to your teacher than by rolling your own:
perl -0777ne "print+(##=/count/g)+0" terrible.pl
If you are trying to count how many times appears the word "count", this will work:
my $count=0;
open(INPUT,"<terrible.pl");
while (<INPUT>) {
$count++ while ($_ =~ /count/g);
}
close(INPUT);
print "$count times\n";
I'm not actually sure what your example code is but you're almost there:
perl -e '$text = "lol wut foo wut bar wut"; $count = 0; $count++ while $text =~ /wut/g; print "$count\n";'
You can use the /g modifier to continue searching the string for matches. In the example above, it will return all instances of the word 'wut' in the $text var.
You can probably use something like so:
my $fh = IO::File->new('test.txt', 'r') or die "$!\n";
my %words;
while (<$fh>) {
for my $word (split / /) {
$words{$word}++;
}
}
That will give you an accurate count of every "word" (defined as a group of characters separated by a space), and store it in a hash which is keyed by the word with a value of the number of the word which was seen.
perdoc perlrequick has an answer. The term you want in that document is "scalar context".
Given that this appears to be a homework question, I'll point you at the documentation instead.
So, what are you trying to do? You want the number of times something appears in a block of text. You can use the Perl grep function. That will go through a block of text without needing to loop.
If you want an odd/even return value, you can use the modulo arithmetic function. You can do something like this:
if ($number % 2) {
print "$number is odd\n"; #Returns a "1" or true
}
else {
print "$number is even\n"; #Returns a "0" or false
}

Perl - Start reading from specific line, and only get first column of it line, until end

I have a text file that looks like the following:
Line 1
Line 2
Line 3
Line 4
Line 5
filename2.tif;Smpl/Pix & Bits/Smpl are missing.
There are 5 lines that are always the same, and on the 6th line is where I want to start reading data. Upon reading data, each line (starting from line 6) is delimited by semicolons. I need to just get the first entry of each line (starting on line 6).
For example:
Line 1
Line 2
Line 3
Line 4
Line 5
filename2.tif;Smpl/Pix & Bits/Smpl are missing.
filename4.tif;Smpl/Pix & Bits/Smpl are missing.
filename6.tif;Smpl/Pix & Bits/Smpl are missing.
filename8.tif;Smpl/Pix & Bits/Smpl are missing.
Output desired would be:
filename2.tif
filename4.tif
filename6.tif
filename8.tif
Is this possible, and if so, where do I begin?
This uses the Perl 'autosplit' (or 'awk') mode:
perl -n -F'/;/' -a -e 'next if $. <= 5; print "$F[0]\n";' < data.file
See 'perlrun' and 'perlvar'.
If you need to do this in a function which is given a file handle and a number of lines to skip, then you won't be using the Perl 'autosplit' mode.
sub skip_N_lines_read_column_1
{
my($fh, $N) = #_;
my $i = 0;
my #files = ();
while (my $line = <$fh>)
{
next if $i++ < $N;
my($file) = split /;/, $line;
push #files, $file;
}
return #files;
}
This initializes a loop, reads lines, skipping the first N of them, then splitting the line and capturing the first result only. That line with my($file) = split... is subtle; the parentheses mean that the split has a list context, so it generates a list of values (rather than a count of values) and assigns the first to the variable. If the parentheses were omitted, you would be providing a scalar context to a list operator, so you'd get the number of fields in the split output assigned to $file - not what you needed. The file name is appended to the end of the array, and the array is returned. Since the code did not open the file handle, it does not close it. An alternative interface would pass the file name (instead of an open file handle) into the function. You'd then open and close the file in the function, worrying about error handling.
And if you need the help with opening the file, etc, then:
use Carp;
sub open_skip_read
{
my($name) = #_;
open my $fh, '<', $name or croak "Failed to open file $name ($!)";
my #list = skip_N_lines_read_column_1($fh, 5);
close $fh or croak "Failed to close file $name ($!)";
return #list;
}
#!/usr/bin/env perl
#
# name_of_program - what the program does as brief one-liner
#
# Your Name <your_email#your_host.TLA>
# Date program written/released
#################################################################
use 5.10.0;
use utf8;
use strict;
use autodie;
use warnings FATAL => "all";
# ⚠ change to agree with your input: ↓
use open ":std" => IN => ":encoding(ISO-8859-1)",
OUT => ":utf8";
# ⚠ change for your output: ↑ — *maybe*, but leaving as UTF-8 is sometimes better
END {close STDOUT}
our $VERSION = 1.0;
$| = 1;
if (#ARGV == 0 && -t STDIN) {
warn "reading stdin from keyboard for want of file args or pipe";
}
while (<>) {
next if 1 .. 5;
my $initial_field = /^([^;]+)/ ? $1 : next;
# ╔═══════════════════════════╗
# ☞ your processing goes here ☜
# ╚═══════════════════════════╝
} continue {
close ARGV if eof;
}
__END__
Kinda ugly but, read out the dummy lines and then split on ; for the rest of them.
my $logfile = '/path/to/logfile.txt';
open(FILE, $logfile) || die "Couldn't open $logfile: $!\n";
for (my $i = 0 ; $i < 5 ; $i++) {
my $dummy = <FILE>;
}
while (<FILE>) {
my (#fields) = split /;/;
print $fields[0], "\n";
}
close(FILE);