File handle failed to read second time - regex

How can I read from a file handle a second time inside a foreach loop in Perl?
foreach $a (@b) {
    while (my $line = <IN>) {
        if ($line =~ /$a/) {
            print $line;
        }
    }
}
The above code does not process the second element of the list @b. How can I make it work?

Your inner loop, while(my $line = <IN>), extracts lines from the IN handle until it reaches the end of the file.
When your outer loop, foreach $a (#b), tries to read from IN again, it's still at end-of-file. The first iteration of the foreach loop consumes all lines from the file, leaving nothing for the other iterations.
There are several possible ways to fix this:
Seek back to the beginning of IN before you attempt to read from it again:
foreach $a (@b) {
    seek IN, 0, 0
        or die "Cannot seek(): $!";
    while (my $line = <IN>) {
        ...
    }
}
However, this only works for real files, not pipes or sockets or terminals.
Read the whole file into memory up front, then iterate over a normal array:
my @lines = <IN>;
foreach $a (@b) {
    foreach my $line (@lines) {
        ...
    }
}
However, if the file is big, this will use a lot of memory.
Switch the order of the two loops:
while (my $line = <IN>) {
    foreach $a (@b) {
        ...
    }
}
This is my favorite. Now you only need to read from the file once. @b is already in memory, so you can iterate over it as many times as you want.
Side notes:
Don't use bareword filehandles like IN. Normal variables (such as $IN) are pretty much superior in every way.
Don't use variables called $a or $b. They're a bit special because Perl uses them in sort.
My personal preference is to never use <>. It's weirdly overloaded (it can mean either readline or glob, depending on the exact syntax you use) and it isn't terribly intuitive. Using readline means there's never any syntactic ambiguity, and even programmers with no Perl experience can figure out what it does.
With those changes:
while (my $line = readline $IN) {
    foreach my $re (@regexes) {
        if ($line =~ /$re/) {
            print $line;
        }
    }
}
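(The code above assumes $IN was opened beforehand as a lexical filehandle, with something like the following; the file name here is an assumption:)
open my $IN, '<', 'input.txt'   # 'input.txt' stands in for your actual file
    or die "Cannot open input.txt: $!";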

You read within a while loop until the filehandle is exhausted (it has reached EOF, end of file).
If you don't close and reopen your filehandle, you won't read anything more from it in the second iteration of the outer loop.
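For example, a minimal sketch of that close-and-reopen approach (the file name in $path is an assumption):
foreach my $item_of_b (@b) {
    open(IN, '<', $path) or die "Cannot open $path: $!";   # $path: wherever your input file lives
    while (my $line = <IN>) {
        print $line if $line =~ /$item_of_b/;
    }
    close IN;
}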
If the amount of data you read from the filehandle is not too large, you could read the file into an array variable and iterate over the contents of that array.
For example:
my @filecontent = <IN>;
foreach my $item_of_b (@b) {
    foreach my $line_of_file (@filecontent) {
        if ($line_of_file =~ /$item_of_b/) {
            print $line_of_file;
        }
    }
}
And $a and $b should not be used as variable names; they are special because Perl's sort uses them.

Can't grab value from subexpressions

I'm reading null-separated blocks of data from a socket; upon finding a null, the block read so far is handed off for processing, and the buffer is truncated to whatever was left after the null (usually nothing).
do {
    $in .= <$sock>
} while (!in =~ /^(.*?)\x00(.*)/);
print "A:[ $in, $block ]\n";
$block = $1;
$in = $2;
print "B:[ $in, $block ]";
Result:
A:[ {"hn":"ITtestKA","v":{"m":4,"u":4}}
, ]
B:[ , ]
Why can't I pick the data from the subexpressions $1, $2?
Another idea is to change the value of $/, which determines how much <$sock> will read from $sock. So, for example, if you do
local $/="\0"
then for that scope, <$sock> will read until the end of the next (null-separated) block.
Edit: If the input consists of null-separated blocks, it does not make sense to use <$sock> with $/ having its default value, because then you would be reading the input line by line. So I think there are 2 approaches:
Set $/ to \0 and use <$sock> to read the next block.
Read a fixed size chunk and then search for a null byte in it to extract the next block. In this scenario, you can either use <$sock> with $/ set to a reference of the chunk size (e.g. local $/=\4096), or use read/sysread. I would also use index/substr to search for the null and extract the chunk in that case.
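A minimal sketch of both approaches (assuming $sock is an open handle, and with process() as a placeholder for whatever consumes a block):
# 1. Let readline do the splitting by localizing the record separator:
{
    local $/ = "\0";             # records now end at NUL
    while (my $block = <$sock>) {
        chomp $block;            # chomp strips $/, i.e. the trailing "\0"
        process($block);         # process() is a placeholder (assumption)
    }
}

# 2. Read fixed-size chunks and split on NUL by hand:
my $buf = '';
while (sysread($sock, $buf, 4096, length $buf)) {
    while ((my $pos = index($buf, "\0")) >= 0) {
        process(substr($buf, 0, $pos));   # everything before the NUL
        $buf = substr($buf, $pos + 1);    # keep whatever follows it
    }
}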
There's a negated form of the binding operator:
while($in !~ /^(.*?)\x00(.*)/)
in and $in are different things. Do you use strict? If not, you're matching the string "in" against the regex, and it never matches.
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @socket = ( "abcd\x00efgh",
               "ijkl",
               "mnop\x00qrst",
               "uvwx\x00",
             );
my ($in, $block);
while (@socket) {
    do {
        $in .= shift @socket;
    } while ($in !~ /^(.*?)\x00(.*)/);
    say "A:[ $in, $block ]";
    ($block, $in) = ($1, $2);
    say "B:[ $in, $block ]";
}

Why does my program crash after one line of input?

I am writing a simple program which capitalizes each word in a sentence. It gets multi-line input. I then loop through the input lines, split each line into words, capitalize each word, and then join the line again. This works fine if the input is one sentence, but as soon as I input two lines my program crashes (and if I wait too long, my computer freezes).
Here is my code
@input = <STDIN>;
foreach (@input)
{
    #reset @words
    @words = ();
    #readability
    $lines = $_;
    #split sentence
    @words = split( / /, $lines );
    #capitalize each word
    foreach (@words) {
        $words[$k] = ucfirst;
        $k++;
    }
    #join sentences again
    $lines = join(' ', @words);
    #create output line
    $output[$i] = $lines;
    $i++;
}
#print the result
print "\nResult:\n";
foreach (@output) {
    print $output[$j], "\n";
    $j++;
}
Could someone please tell me why it crashes?
use strict (and be told about improperly handled variables like your indices)
use for my $var (@array) to get a usable item without an index (Perl isn't JavaScript)
What isn't there can't be wrong (e.g. push instead of indexing)
In code:
use strict; # always!
my @input = <STDIN>; # the loop needs in- and output
my @output = ();
for my $line (@input) # for makes readability *and* terseness easy
{
    chomp $line; # get rid of eol
    #split sentence
    my @words = split( / /, $line );
    #capitalize each word
    for my $word (@words) { # no danger of mishandling indices
        $word = ucfirst($word);
    }
    #join sentences again
    $line = join(' ', @words);
    #create output line
    push @output, $line;
}
#print the result
print "\nResult:\n";
for my $line (@output) {
    print $line, "\n";
}
The problem is that you are using global variables throughout, so they are keeping their values across iterations of the loop. You have reset @words to an empty list even though you didn't need to - it is overwritten when you assign the result of split to it - but $k is increasing endlessly.
$k is initially set to undef, which evaluates as zero, so for the first sentence everything is fine. But you leave $k set to the number of elements in @words, so it starts from there instead of from zero for the next sentence. Your loop over @words becomes endless because you are assigning to (and so creating) $words[$k], so the array is getting longer as fast as you are looping through it.
The same problem applies to $i and $j, but execution never gets as far as reusing those.
Although this was the only way of working in Perl 4, over twenty years ago, Perl 5 has made programming very much nicer to write and debug. You can now declare variables with my, and you can use strict, which (among other things) insists that every variable you use must be declared, otherwise your program won't compile. There is also use warnings, which is just as invaluable. In this case it would have warned you that you were using an undefined variable $k etc. to index the arrays.
If I apply use strict and use warnings, declare all of your variables and initialise the counters to zero then I get a working program. It's still not very elegant, and there are much better ways of doing it, but the error has gone away.
use strict;
use warnings;

my @input = <STDIN>;
my @output;
my $i = 0;

foreach (@input) {
    # readability
    my $lines = $_;
    # split sentence
    my @words = split ' ', $lines;
    # capitalize each word
    my $k = 0;
    foreach (@words) {
        $words[$k] = ucfirst;
        $k++;
    }
    # join sentences again
    $lines = join ' ', @words;
    # create output line
    $output[$i] = $lines;
    $i++;
}

print "\nResult:\n";
my $j = 0;
foreach (@output) {
    print $output[$j], "\n";
    $j++;
}
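For what it's worth, one of those "much better ways" might look like this minimal sketch (it reads all of STDIN, like the original):
use strict;
use warnings;

# split ' ' splits on any whitespace and discards the trailing newline,
# so no chomp is needed before rejoining with single spaces
my @output = map { join ' ', map { ucfirst } split ' ', $_ } <STDIN>;
print "\nResult:\n", map { "$_\n" } @output;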

regular expression code

I need to find matching lines between two tab-delimited files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2:
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
What I would like to do is to retrieve the matching lines between the two files. What I would like to match is everything after the gene ID.
So far I have written this code, but Perl keeps giving me the error:
Use of uninitialized value in pattern match (m//)
Could you please help me figure out where I am going wrong?
Thank you in advance!
use strict;
open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";
my @sample1 = <INA>;
my @sample2 = <INB>;
foreach my $line (@sample1) {
    my @tab = split (/\t/, $line);
    my $chr = $tab[1];
    my $start = $tab[2];
    my $end = $tab[3];
    my $ref = $tab[4];
    my $alt = $tab[5];
    my $name = $tab[6];
    foreach my $item (@sample2) {
        my @fields = split (/\t/, $item);
        if (   $fields[1] =~ m/$chr(.*)/
            && $fields[2] =~ m/$start(.*)/
            && $fields[4] =~ m/$ref(.*)/
            && $fields[5] =~ m/$alt(.*)/
            && $fields[6] =~ m/$name(.*)/
        ) {
            print $line, "\n", $item;
        }
    }
}
On its surface your code seems to be fine (although I didn't debug it). If there isn't an error I failed to spot, it could be that the input data contains RE special characters, which will confuse the regular expression engine when you interpolate them as is (e.g. if any of the variables contains a '$' character). It could also be that instead of tabs you have spaces somewhere, in which case you'll indeed get that error, because your split won't produce the fields you expect.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little more Perl-idiomatic. I like using the implicit $_, which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;
open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";
my @sample1 = <INA>;
my @sample2 = <INB>;
foreach (@sample1) {
    (my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
        m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
    my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
    #print "$rex\n";
    foreach (@sample2) {
        if (m/$rex/) {
            print "$id - $_";
        }
    }
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point in splitting the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first strings in a hash. Hash lookups take constant time on average, so that will give you roughly O(n + m), where n and m are the number of lines in each file.
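A minimal sketch of that idea (it assumes @sample1 and @sample2 as above, and that the separators after the ID match exactly between the two files):
my %wanted;
foreach (@sample1) {
    my ($id, $restOfLine) = m/^(ID\d+)\s+(.*)$/;
    $wanted{$restOfLine} = $id if defined $restOfLine;   # key: rest of line, value: ID
}
foreach (@sample2) {
    my (undef, $restOfLine) = m/^(ID\d+)\s+(.*)$/;
    print "$wanted{$restOfLine} - $_"                    # print file 1's ID plus file 2's line
        if defined $restOfLine && exists $wanted{$restOfLine};
}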
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it makes sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-separated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my @sample2 = <$file2>;
SAMPLE_1:
foreach my $s1 (<$file1>) {
    my (undef, @fields1) = split /\t/, $s1;
    my @regexens = map qr{\Q$_\E(.*)}, @fields1;
    SAMPLE_2:
    foreach my $s2 (@sample2) {
        my (undef, @fields2) = split /\t/, $s2;
        for my $i (0 .. $#regexens) {
            $fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
        }
        # only gets here if all regexes matched
        print $s1, $s2;
    }
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my @sample2 = <$file2>;
foreach my $s1 (<$file1>) {
    foreach my $s2 (@sample2) {
        print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
    }
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
           # map {/\S/ ? $_ : () } # uncomment this line to handle empty lines
           <$file1>;
# 2nd loop
foreach my $line (<$file2>) {
    my ($id2, $key) = split /\t/, $line, 2;
    if (defined (my $id1 = $seen{$key})) {
        print "$id1\t$key";
        print "$id2\t$key";
    }
}
%seen is a hash that has the rest of the line as its key and the first field as its value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well both up and down because of its linear complexity.
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;

my ($ina, $inb) = @ARGV;
my @lines_a = File::Slurp::read_file($ina);
my @lines_b = File::Slurp::read_file($inb);
my $table_b = {};
my $ln = 0;

# Store all lines of the second file in a hash, with everything after the id as the hash key.
# If there are several identical ones we store them all, so the hash values are lists containing the id and line number.
foreach (@lines_b) {
    chomp; # strip newlines
    $ln++; # count current line number
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2); # split on whitespace, in case there are several tabs or spaces instead
    if (exists $table_b->{$rest}) {
        push @{ $table_b->{$rest} }, [$id, $ln]; # push onto the existing list if we already found an identical entry
    } else {
        $table_b->{$rest} = [ [$id, $ln] ]; # create a new entry if this is the first one
    }
}

# Go through the first file and print out all matches we might have
$ln = 0;
foreach (@lines_a) {
    chomp;
    $ln++;
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
    if (exists $table_b->{$rest}) { # if we have this entry, print where it is found
        print "$ina:$ln:\t\t'$id\t$rest'\n " . (join "\n ", map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } @{ $table_b->{$rest} }) . "\n";
    }
}

Perl regexp use of uninitialized pattern match

My script loads some stuff from some files into some arrays. You enter a text from the keyboard, and the script searches for the relevant part of the text in those arrays; if it finds it, it does something, if not, well, another thing, at least in theory.
I get the following errors:
Use of uninitialized value in pattern match (m//) at emo_full_dynamic.pl line 120, <STDIN> chunk 2.
Modification of a read-only value attempted at emo_full_dynamic.pl line 121, <STDIN> chunk 2.
line 120 = $plm3 =~ /arr_(\w+.txt)/;
My problem, I think, is at $plm3 =~ /arr_(\w+.txt)/;. I used it so that I can store the name of an array in $1.
Here's my code:
#!/usr/bin/perl
use warnings;
$idx = 0;
$oldsep = $/;
opendir(DIR, 'c:/downloads/text_files/arrs/');
@files = readdir(DIR);
while ($idx <= $#files)
{
    $value = $files[$idx];
    if ($value !~ m/^arr/i)
    {
        splice @files, $idx, 1;
    }
    else
    {
        $idx++;
    }
}
foreach $plm (@files)
{
    if ($plm =~ m/txt$/)
    {
        open(ARR, "C:/downloads/text_files/arrs/$plm") or die $!;
        while (<ARR>)
        { {
            chomp($_);
            $plm =~ m/arr_(\w+).txt/;
            push(@{$1}, $_);
        }
        close ARR;
    }
}
$plm = 0;
$idx = 0;
$stare = <STDIN>;
chomp($stare);
while ($stare)
{
    foreach $plm2 (@files)
    {
        if ($plm2 =~ m/txt$/)
        {
            $plm2 =~ m/arr_(\w+).txt/;
            if (grep $stare =~ m/$_/i, @{$1})
            {
                $flag = 1;
            }
            else
            {
                $flag = 0;
            }
        }
    }
    if ($flag == 1)
    {
        $/ = "%\n";
        $plm3 =~ /arr_(\w+.txt)/;
        open SUPARARE, "C:/downloads/text_files/replies/$1" or die $!;
        etc etc....
First of all, it's always a good idea to use the strict pragma, unless you have a valid reason to avoid it.
Second, I don't see $plm3 initialized anywhere in your code. You have probably forgotten to initialize it.
I think you are assigning something to variable $1 on line 121
Apparently there are some copy/paste issues which negates my initial answer.
Other mistakes, great and small:
You don't use strict. (fatal flaw)
Your opendir is used once, then never closed.
You use global filehandles instead of lexical ones (e.g. open my $fh, ...).
Using a complex loop + splice instead of grep (@files = grep /^arr/i, @files).
Using chomp($_) when chomp chomps $_ by default.
I don't even know what this line means:
if (grep $stare =~ m/$_/i, @{$1}) {
You seem to be using a pattern match where $_ is the pattern (which in this case is... what? Nothing? Anything?), whose return value is used as a grep condition over an array reference that may or may not be initialized. A very horrible statement. If it indeed works as intended, the readability is very low.
Reassigning $/ seems like a frivolous thing to do in this context, but I can't really tell, as the script ends there.

Is there any better way to "grep" from a large file than using `grep` in perl?

The $rvsfile is the path of a file of about 200 MB. I want to count the number of lines which have $userid in them. But using grep in a while loop seems very slow. So is there a more efficient way to do this? Because $rvsfile is very large, I can't read it into memory with @tmp = <FILEHANDLE>.
while (defined($line = <SRCFILE>))
{
    $line =~ /^([^\t]*)\t/;
    $userid = $1;
    $linenum = `grep '^$userid\$' $rvsfile | wc -l`;
    chomp($linenum);
    print "$userid $linenum\n";
    if ($linenum == 0)
    {
        print TARGETFILE "$line";
    }
}
And how can I get the part before the \t in a line without a regex? For example, a line may look like this:
2013123\tsomething
How can I get 2013123 without regex?
Yes, you are forking a shell on each loop iteration. This is slow. You also read the entire $rvsfile once for every user. This is too much work.
Read SRCFILE once and build a list of @userids.
Read $rvsfile once keeping a running count of each user id as you go.
Sketch:
my @userids;
while (<SRCFILE>)
{
    push @userids, $1 if /^([^\t]*)\t/;
}
my $regex = join '|', @userids;
my %count;
while (<RSVFILE>)
{
    ++$count{$1} if /^($regex)$/o;
}
# %count has everything you need...
Use hashes:
my %count;
while (<LARGEFILE>) {
    chomp;
    $count{$_}++;
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE
Or if you fear using too much memory for the hash (i.e. you're interested in 6 users, and there are 100K more in the large file), do it another way:
my %count;
while (<SMALLFILE>) {
    /^(.*?)\t/ and $count{$1} = 0;
};
while (<LARGEFILE>) {
    chomp;
    $count{$_}++ if defined $count{$_};
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE, *if* userid is in SMALLFILE
You can find the location of the first \t using index, which will be faster. You can then use substr to extract the leading part.
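For example (a sketch; $line stands in for one input line such as "2013123\tsomething"):
my $tab = index($line, "\t");                           # position of the first tab, or -1 if none
my $userid = $tab >= 0 ? substr($line, 0, $tab) : $line;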
Suggest you benchmark various approaches.
If I read you correctly you want something like this:
#!/usr/bin/perl
use strict;
use warnings;
my $userid = 1246;
my $count = 0;
my $rsvfile = 'sample';
open my $fh, '<', $rsvfile;
while (<$fh>) {
    $count++ if /$userid/;
}
print "$count\n";
or even this (note, though, that grep gives <$fh> list context, so this form does read the whole file in):
#!/usr/bin/perl
use strict;
use warnings;
my $userid = 1246;
my $rsvfile = 'sample';
open my $fh, '<', $rsvfile;
my $count = grep {/$userid/} <$fh>;
print "$count\n";
If <SRCFILE> is relatively small, you could do it the other way round: read in the larger file one line at a time, and check each userid per line, keeping a count of each userid in a hash structure. Something like:
my %userids = map {($_, 0)}           # use as hash key with init value of 0
              grep {$_}               # only return matches
              map {/^([^\t]+)/} <SRCFILE>; # extract ID
while (defined(my $line = <LARGEFILE>)) {
    for (keys %userids) {
        ++$userids{$_} if $line =~ /\Q$_\E/; # \Q...\E escapes special chars in $_
    }
}
This way, only the smaller data is read repeatedly and the large file is scanned once. You end up with a hash of every userid, and the value is the number of lines it occurred in.
If you have a choice, try it with awk (FNR==NR is true only while reading the first file, so the first block collects the userids; the second block then prints every line of $rvsfile that matches one of them):
awk 'FNR==NR{a[$1];next} { for(i in a) { if ($0 ~ i) { print $0} } } ' $SRCFILE $rsvfile