Preventing "foo" from matching "foo-bar" with grep -w - regex

I am using grep inside my Perl script and I am trying to grep the exact keyword that I am giving. The problem is that "-w" treats the "-" symbol as a word boundary rather than part of the word, so the keyword also matches longer, hyphenated names.
example:
Let's say that I have these two records:
A1BG 0.0767377011073753
A1BG-AS1 0.233775553296782
if I give
grep -w "A1BG"
it returns both of them but I want only the exact one.
Any suggestions?
Many thanks in advance.
PS.
Here is my whole code.
The input file is a two-column, tab-separated file. I want to keep a unique value for each gene; in cases where I have more than one record, I calculate the average.
#!/usr/bin/perl
use strict;
use warnings;

# Find the average fc between common genes
sub avg {
    my $total;
    $total += $_ foreach @_;
    return $total / @_;
}

my @mykeys = `cat G13_T.txt| awk '{print \$1}'| sort -u`;
foreach (@mykeys)
{
    my @TSS = ();
    my $op1 = 0;
    my $key = $_;
    chomp($key);
    #print "$key\n";
    my $command = "cat G13_T.txt|grep -E '([[:space:]]|^)$key([[:space:]]|\$)'";
    #my $command = "cat Unique_Genes/G13_T.txt|grep -w $key";
    my @belongs = `$command`;
    chomp(@belongs);
    my $count = scalar(@belongs);
    if ($count == 1) {
        print "$belongs[0]\n";
    }
    else {
        for (my $i = 0; $i < $count; $i++) {
            my @token = split('\t', $belongs[$i]);
            my $lfc = $token[1];
            push (@TSS, $lfc);
        }
        $op1 = avg(@TSS);
        print $key ."\t". $op1. "\n";
    }
}

If I got clarifications in comments right, the objective is to find the average of values (second column) for unique names in the first column. Then there is no need for external tools.
Read the file line by line and add up values for each name. The name uniqueness is granted by using a hash, with names being keys. Along with this, also track their counts.
use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 filename\n";
open my $fh, '<', $file or die "Can't open $file: $!";

my %results;

while (<$fh>) {
    #my ($name, $value) = split /\t/;
    my ($name, $value) = split /\s+/;    # used for easier testing
    $results{$name}{value} += $value;
    ++$results{$name}{count};
}

foreach my $name (sort keys %results) {
    $results{$name}{value} /= $results{$name}{count}
        if $results{$name}{count} > 1;
    say "$name => $results{$name}{value}";
}
After the file is processed, each accumulated value is divided by its count and overwritten with the result, i.e. with its average (/= divides and assigns), but only if count > 1 (as a small measure of efficiency).
If there is any use in knowing all values that were found for each name, then store them in an arrayref for each key instead of adding them:
while (<$fh>) {
    #my ($name, $value) = split /\t/;
    my ($name, $value) = split /\s+/;    # used for easier testing
    push @{$results{$name}}, $value;
}
Now we don't need the count, as it is given by the number of elements in the array(ref):
use List::Util qw(sum);

foreach my $name (sort keys %results) {
    say "$name => ", sum(@{$results{$name}}) / @{$results{$name}};
}
Note that a hash built this way needs memory comparable to the file size (or may even exceed it), since all values are stored.
This was tested using the shown two lines of sample data, repeated and changed in a file. The code does not test the input in any way, but expects the second field to always be a number.
Notice that there is no reason to ever step out of our program and use external commands.
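For example, assuming the script above is saved as avg_by_name.pl (a made-up name here) and run against a file holding just the two sample records from the question, it prints one value per gene name:
perl avg_by_name.pl G13_T.txt
A1BG => 0.0767377011073753
A1BG-AS1 => 0.233775553296782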

You may use a POSIX ERE regex with grep like this:
grep -E '([[:space:]]|^)A1BG([[:space:]]|$)' file
To return matches (not matching lines) only:
grep -Eo '([[:space:]]|^)A1BG([[:space:]]|$)' file
Details
([[:space:]]|^) - Group 1: a whitespace or start of line
A1BG - a substring
([[:space:]]|$) - Group 2: a whitespace or end of line
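For example, against the two sample records from the question, the start/whitespace alternation keeps the hyphenated gene from matching:
grep -E '([[:space:]]|^)A1BG([[:space:]]|$)' file
A1BG 0.0767377011073753
whereas grep -w "A1BG" would also have returned the A1BG-AS1 line, because -w treats "-" as a word boundary.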

Related

How do I store cumulative lengths of strings into an array with Perl

I have a file which has a series of lines that are made up of A's, C's, G's and T's. I want to find the length of those lines, make a list of the cumulative lengths (adding the lengths together sequentially), and put that into an array. So far I have:
#! /usr/bin/perl -w
use strict;

my $input = $ARGV[0];
my %idSeq;
my (@ID, @Seq);

open (my $INPUT, "<$input") or die "unable to open $input";

while (<$INPUT>) {
    if (my $culm_length = /^([AGCT]\w+)\n$/) {
        length($culm_length) = $_;
        push (@Seq, $1);
    }
}
bla bla bla....
So far I think what I have written gives me an array of the lengths of individual lines. I want the cumulative lengths; any ideas?
With reference to your previous question, How do I read strings into a hash in Perl, which was put on hold, I think perhaps you want a running total of the lengths of the lines.
I would write it this way. It keeps the running total in $total and pushes its value onto the @lengths array every time it changes.
use strict;
use warnings 'all';

my ( $input ) = @ARGV;

open my $fh, '<', $input or die qq{unable to open "$input" for input: $!};

my @lengths;
my $total = 0;

while ( <$fh> ) {
    push @lengths, $total += length $1 if /^([ACGT]+)/;
}
#!/usr/bin/perl -w
use strict;

my $length = 0;
while (<>) {
    $length += length($1) if /^([AGCT]\w+)$/;
}
my @length = $length;   # But why would you want to do this?!
...

Perl apply partial match regex on a line in long text file using hash key

Input1: I have a hash of chemical names. These names are short names and are the keys of the hash.
Input2: I have a text book (I mean a very long text file) where the above short names appear in full.
Task: Wherever a name appears in full in the text file, if the next line contains "", I have to replace that "" with the relevant hash value, $hash{key}{description}.
Example: if a hash key is Y, then it might appear in the text file as
X.Y.Z, or just X.YZ, XYZ, XY2 or X_Y_Z02. It's unpredictable, but it appears somewhere in the middle or at the end.
That means the text file name is a partial match to the hash key name.
My trials: I tried keeping the full file in an array and then finding where the empty "" appears. Once it appears, I do a regex comparison of the previous line against the hash keys. But this does not work :( . Also, the process is too slow. I have tried different techniques with experts' help but failed to improve the speed. Please help.
My program is as follows:
use strict;
use warnings;

my $file = "Chemicalbook.txt";   # In the text file the full name might appear as Dihydrogen.monoxide.hoax_C
my $previous_line = "";
my %hash;
$hash{'monoxide'}{description} = "accelerates corrosion";

open(my $FILE,'<',$file) or die "Cannot open input file";
open(my $output,'>',"outfile.txt") or die "Cannot open output file";

my @file_in_array = <$FILE>;
foreach my $line (@file_in_array) {
    my $name = $previous_line;
    if($line =~ /""/) {
        foreach my $shortname (keys %hash)
        {
            if($previous_line =~ /$shortname/) {
                $line = s/""/$hash{$shortname}{description}/;
            }
        }
    }
    $previous_line = $line;
    print {$output} $line;
}
close($FILE);
close($output);
Looping over all keys for each line is hopeless(ly slow). Try replacing the entire inner foreach loop with this:
while ($previous_line =~ /(\w+)/g)
{
    if (my $s = $hash{$1})
    {
        $line = $$s{description};
    }
}
It will pick up shortnames as long as they're "standing alone" in the text.
my %hash;
my @arr = qw(X.Y.Z X.YZ XYZ XY2 ZZZ Chromium.trioxideChromic_02acid);
$hash{'Y'} = 'Hello';
$hash{'R'} = 'Hai';
$hash{'trioxide'} = 'Testing';

foreach my $line (@arr)
{
    if( my ($key) = grep { $line =~ /$_/ } keys(%hash) ) {
        print "$line - $hash{$key} \n";
    }
    else {
        print "Unmatched $line\n";
    }
}

Use Perl to count occurrences of all words in a file or in all files in a directory

So I am trying to write a Perl script which will take in 3 arguments.
The first argument is the input file or directory.
If it is a file, it will count the number of occurrences of all words.
If it is a directory, it will recursively go through each directory and count the occurrences of all words in the files within those directories.
The second argument is a number: how many of the words with the highest number of occurrences to display.
This will print to the console only the count for each word.
The words are printed to an output file, which is the third argument on the command line.
It seems to be working as far as recursively searching through directories, finding all occurrences of the words in a file and printing them to the console.
How can I print these to an output file, and how would I take the second argument, say 5, and have it print to the console the counts for the words with the most occurrences while printing the words to the output file?
The following is what I have so far:
#!/usr/bin/perl -w
use strict;

search(shift);

my $input  = $ARGV[0];
my $output = $ARGV[1];
my %count;
my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";

if ( -f $filename ) {
    print("This is a file!\n");
    while ( my $line = <$filename> ) {
        chomp $line;
        foreach my $str ( $line =~ /\w+/g ) {
            $count{$str}++;
        }
    }
    foreach my $str ( sort keys %count ) {
        printf "%-20s %s\n", $str, $count{$str};
    }
}
close($filename);

if ( -d $input ) {
    sub search {
        my $path = shift;
        my @dirs = glob("$path/*");
        foreach my $filename (@dirs) {
            if ( -f $filename ) {
                open( FILE, $filename ) or die "ERROR: Can't open file";
                while ( my $line = <FILE> ) {
                    chomp $line;
                    foreach my $str ( $line =~ /\w+/g ) {
                        $count{$str}++;
                    }
                }
                foreach my $str ( sort keys %count ) {
                    printf "%-20s %s\n", $str, $count{$str};
                }
            }
            # Recursive search
            elsif ( -d $filename ) {
                search($filename);
            }
        }
    }
}
I would suggest restructuring your program/script. What you have posted is difficult to follow; a few comments would help to show what is happening. I'll try to go through how I would arrange things, with some code snippets to explain. I'll go through the three items you outlined in your question.
Since the first argument can be a file or a directory, I would use -f and -d to determine what the input is. I would use a list/array to hold the files to be processed. If the input is only a file, I would just push it onto the processing list. Otherwise, I would call a routine that returns a list of files to be processed (similar to your search subroutine). Something like:
# List of files to process
my @fileList = ();

# If input is only a file
if ( -f $ARGV[0] )
{
    push @fileList, $ARGV[0];
}
# If it is a directory
elsif ( -d $ARGV[0] )
{
    @fileList = search($ARGV[0]);
}
So in your search subroutine, you need a list/array onto which to push items that are files, and then return that array from the subroutine (after you have processed the list of files from the glob call). When you encounter a directory, you call search with the path (just as you are currently doing), pushing the returned elements onto your current array, such as:
# If it is a file, save it to the list to be returned
if ( -f $filename )
{
    push @returnValue, $filename;
}
# Else if a directory, get the files from the directory and
# add them to the list to be returned
elsif ( -d $filename )
{
    push @returnValue, search($filename);
}
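Putting those pieces together, a minimal sketch of such a search subroutine (following the snippets above and the glob-based traversal from the original code; this is one possible shape, not the only one) might look like this:
# Recursively collect plain files under a path
sub search {
    my $path = shift;
    my @returnValue;
    foreach my $filename ( glob("$path/*") ) {
        if ( -f $filename ) {
            # A plain file: remember it for later processing
            push @returnValue, $filename;
        }
        elsif ( -d $filename ) {
            # A directory: recurse and collect its files too
            push @returnValue, search($filename);
        }
    }
    return @returnValue;
}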
After you have the file list, loop through it, processing each file (opening it, reading the lines, closing it, and processing the lines for words). The foreach loop you have for processing each line works correctly. However, if your words can carry periods, commas or other punctuation, you may want to remove those before counting the word in a hash.
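A minimal sketch of that per-file loop, assuming the @fileList built above and a %counts hash of word occurrences (the punctuation stripping is one possible choice, not the only one):
my %counts;
foreach my $file (@fileList)
{
    open my $fh, '<', $file or die "ERROR: Could not open $file: $!";
    while ( my $line = <$fh> )
    {
        chomp $line;
        foreach my $word ( split ' ', $line )
        {
            # Remove surrounding punctuation before counting
            $word =~ s/^[[:punct:]]+//;
            $word =~ s/[[:punct:]]+$//;
            $counts{$word}++ if length $word;
        }
    }
    close $fh;
}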
For the next part, you asked about determining the words with the highest counts. In that case, you want to make another hash whose keys are counts and whose values are lists/arrays of the words that occurred that many times. Something like:
# Hash with key being a number and value a list of words with that count
my %totals = ();
# Temporary variable to store occurrences (counts) of the word
my $wordTotal;

# $w is each word in the counts hash
foreach my $w ( keys %counts )
{
    # Get the count for the word
    $wordTotal = $counts{$w};
    # The value of the hash is an array, so de-reference it ( the @{ } )
    # and push the word onto that array
    push @{ $totals{$wordTotal} }, $w;   # the key of %totals is the value of the %counts hash,
                                         # for which the words ($w) are the keys
}
To get the words with the highest counts, you take the keys of the totals hash and reverse a numerically sorted list of them, so the N highest counts come first. Since each value is an array of words, we have to count each word as it is output in order to stop after N.
# Number of items output so far
my $current = 0;

# Sort the keys of %totals numerically and reverse the list so the
# highest values come first, then go through the list
foreach my $t ( reverse sort { $a <=> $b } keys %totals )
{
    # Since each value of the %totals hash is an array of words,
    # loop through that array and print out the count for each word
    foreach my $w ( sort @{ $totals{$t} } )
    {
        # Print the count for this word
        print "$t\n";
        # Increment the number output
        $current++;
        # If this is the number to be printed, we are done
        last if ( $current == $ARGV[1] );
    }
    # If this is the number to be printed, we are done
    last if ( $current == $ARGV[1] );
}
For the third part, printing to a file, it is unclear from your question what "them" refers to (words, counts, or both; limited to the top ones or all of the words). I will leave that effort for you: open a file, print out the information to the file and close the file.
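For completeness, a minimal sketch of that step (assuming "them" means every word with its count, and that the third command-line argument is the output filename):
# Write word/count pairs to the output file given as the third argument
open my $out, '>', $ARGV[2] or die "Cannot open $ARGV[2]: $!";
foreach my $w ( sort keys %counts )
{
    printf {$out} "%-20s %s\n", $w, $counts{$w};
}
close $out;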
This will total up the occurrences of words in a directory or file given on the command line:
#!/usr/bin/env perl
# wordcounter.pl
use strict;
use warnings;
use IO::All -utf8;
binmode STDOUT, 'encoding(utf8)';   # you may not need this

my @allwords;
my %count;

die "Usage: wordcounter.pl <directory|filename> number \n" unless ~~@ARGV == 2;

if (-d $ARGV[0] ) {
    push @allwords, $_->slurp for io($ARGV[0])->all_files;
}
elsif (-f $ARGV[0]) {
    @allwords = io($ARGV[0])->slurp;
}

while (my $line = shift @allwords) {
    foreach ( split /\s+/, $line ) {
        $count{$_}++
    }
}

my $count_to_show;
for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
    printf "%-30s %s\n", $word, $count{$word};
    last if ++$count_to_show == $ARGV[1];
}
By modifying the sort and/or io calls you can sort { } by number of occurrences or alphabetically by word, either for a file or for all files in a directory. Those options would be fairly easy to add as parameters. You can also filter or change how words are defined for inclusion in the %count hash by changing foreach ( split /\s+/, $line ) to include a match/filter, such as foreach ( grep { length() <= 5 } split /\s+/, $line ), in order to only count words of five or fewer letters.
Sample run in current directory:
./wordcounter ./ 10
the 116
SV 87
i 66
my_perl 58
of 54
use 54
int 49
PerlInterpreter 47
sv 47
Inline 47
return 46
Caveats
you should probably add a test for file mimetypes, readability, etc.
pay attention to unicode
to write to a file just add > filename.txt to the end of your commandline ;-)
IO::All is not the standard CORE IO package; I am only advertising and promoting it here ;-) (you could swap that bit out)
If you wanted to added a sort_by option (-n --numeric, -a --alphabetic etc.) Sort::Maker might be one way to make that manageable.
EDIT: I had neglected to add options as the OP requested.
I have figured it out. The following is my solution. I'm not sure if it's the best way to do it, but it works.
# Check that there are three arguments on the command line
if (@ARGV < 3) {
    die "ERROR: There must be three arguments!\n";
    exit;
}

# Open the file
my $file = shift or die "ERROR: $0 FILE\n";
open my $fh, '<', $file or die "ERROR: Could not open file!";

# Check if it is a file
if (-f $fh) {
    print("This is a file!\n");
    # Go through each line
    while (my $line = <$fh>) {
        chomp $line;
        # Count the occurrences of each word
        foreach my $str ($line =~ /\b[[:alpha:]]+\b/) {
            $count{$str}++;
        }
    }
}

# Check if the INPUT is a directory
if (-d $input) {
    # Call subroutine to search the directory recursively
    search_dir($input);
}

# Close the file
close($fh);

$high_count = 0;

# Open the output file
open my $fileh, '>', $output or die "ERROR: Could not open file!\n";

# Sort the most frequently occurring words and print them
foreach my $str (sort {$count{$b} <=> $count{$a}} keys %count) {
    $high_count++;
    if ($high_count <= $num) {
        printf "%-31s %s\n", $str, $count{$str};
    }
    printf $fileh "%-31s %s\n", $str, $count{$str};
}

exit;

# Subroutine to search through each directory recursively
sub search_dir {
    my $path = shift;
    my @dirs = glob("$path/*");
    # Loop through filenames
    foreach my $filename (@dirs) {
        # Check if it is a file
        if (-f $filename) {
            # Open the file
            open(FILE, $filename) or die "ERROR: Can't open file";
            # Go through each line
            while (my $line = <FILE>) {
                chomp $line;
                # Count the occurrences of each word
                foreach my $str ($line =~ /\b[[:alpha:]]+\b/) {
                    $count{$str}++;
                }
            }
            # Close the file
            close(FILE);
        }
        elsif (-d $filename) {
            search_dir($filename);
        }
    }
}

regular expression code

I need to find matches between two tab-delimited files like this:
File 1:
ID1 1 65383896 65383896 G C PCNXL3
ID1 2 56788990 55678900 T A ACT1
ID1 1 56788990 55678900 T A PRO55
File 2
ID2 34 65383896 65383896 G C MET5
ID2 2 56788990 55678900 T A ACT1
ID2 2 56788990 55678900 T A HLA
What I would like to do is retrieve the matching lines between the two files. What I would like to match is everything after the gene ID.
So far I have written this code, but unfortunately Perl keeps giving me the error:
"Use of uninitialized value in pattern match (m//)"
Could you please help me figure out where I am going wrong?
Thank you in advance!
use strict;

open (INA, $ARGV[0]) || die "cannot to open gene file";
open (INB, $ARGV[1]) || die "cannot to open coding_annotated.var files";

my @sample1 = <INA>;
my @sample2 = <INB>;

foreach my $line (@sample1) {
    my @tab   = split (/\t/, $line);
    my $chr   = $tab[1];
    my $start = $tab[2];
    my $end   = $tab[3];
    my $ref   = $tab[4];
    my $alt   = $tab[5];
    my $name  = $tab[6];
    foreach my $item (@sample2) {
        my @fields = split (/\t/, $item);
        if (   $fields[1] =~ m/$chr(.*)/
            && $fields[2] =~ m/$start(.*)/
            && $fields[4] =~ m/$ref(.*)/
            && $fields[5] =~ m/$alt(.*)/
            && $fields[6] =~ m/$name(.*)/
            ) {
            print $line, "\n", $item;
        }
    }
}
On its surface your code seems to be fine (although I didn't debug it). If there isn't an error I cannot spot, it could be that the input data contains regex special characters, which will confuse the regular expression engine when you use the field as is (e.g. if any of the variables contains the '$' character). It could also be that instead of a tab you have spaces somewhere, in which case you will indeed get an error, because your split will fail.
In any case, you'll be better off composing just one regular expression that contains all the fields. My code below is a little more Perl-idiomatic. I like using the implicit $_, which in my opinion makes the code more readable. I just tested it with your input files and it does the job.
use strict;

open (INA, $ARGV[0]) or die "cannot open file 1";
open (INB, $ARGV[1]) or die "cannot open file 2";

my @sample1 = <INA>;
my @sample2 = <INB>;

foreach (@sample1) {
    (my $id, my $chr, my $start, my $end, my $ref, my $alt, my $name) =
        m/^(ID\d+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)\s+(\w+)/;
    my $rex = "^ID\\d+\\s+$chr\\s+$start\\s+$end\\s+$ref\\s+$alt\\s+$name\\s+";
    #print "$rex\n";
    foreach (@sample2) {
        if( m/$rex/ ) {
            print "$id - $_";
        }
    }
}
Also, how regular is the input data? Do you have exactly one tab between the fields? If that is the case, there is no point in splitting the lines into 7 different fields - you only need two: the ID portion of the line, and the rest. The first regex would be:
(my $id, my $restOfLine) = m/^(ID\d+)\s+(.*)$/;
And you are searching $restOfLine within the second file in a similar technique as above.
If your files are huge and performance is an issue, you should consider putting the first regular expressions (or strings) in a map. That will give you O(n*log(m)) where n and m are the number of lines in each file.
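A minimal sketch of that idea, assuming exact one-tab separation and using a plain Perl hash keyed by the rest of the line (a hash lookup rather than a log-time map, but the principle is the same):
use strict;
use warnings;

# Build a lookup of "rest of line" => ID from the first file
my %lookup;
open my $ina, '<', $ARGV[0] or die "cannot open file 1: $!";
while (<$ina>) {
    if ( my ($id, $rest) = m/^(ID\d+)\s+(.*)$/ ) {
        $lookup{$rest} = $id;
    }
}
close $ina;

# Scan the second file once, reporting lines whose tail was seen in file 1
open my $inb, '<', $ARGV[1] or die "cannot open file 2: $!";
while (<$inb>) {
    if ( my ($id2, $rest) = m/^(ID\d+)\s+(.*)$/ ) {
        print "$lookup{$rest} / $id2 - $rest\n" if exists $lookup{$rest};
    }
}
close $inb;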
Finally, I have a similar challenge when I need to compare logs. The logs are supposed to be identical, with the exception of a time mark at the beginning of each line. But more importantly: most lines are the same and in order. If this is what you have, and it makes sense for you, you can:
First remove the IDxxx from each line: perl -pe "s/ID\d+ +//" file >cleanfile
Then use BeyondCompare or Windiff to compare the files.
I played a bit with your code. What you wrote there was actually three loops:
one over the lines of the first file,
one over the lines of the second file, and
one over all fields in these lines. You manually unrolled this loop.
The rest of this answer assumes that the files are strictly tab-separated and that any other whitespace matters (even at the end of fields and lines).
Here is a condensed version of the code (assumes open filehandles $file1, $file2, and use strict):
my @sample2 = <$file2>;

SAMPLE_1:
foreach my $s1 (<$file1>) {
    my (undef, @fields1) = split /\t/, $s1;
    my @regexens = map qr{\Q$_\E(.*)}, @fields1;

    SAMPLE_2:
    foreach my $s2 (@sample2) {
        my (undef, @fields2) = split /\t/, $s2;

        for my $i (0 .. $#regexens) {
            $fields2[$i] =~ $regexens[$i] or next SAMPLE_2;
        }

        # only gets here if all regexes matched
        print $s1, $s2;
    }
}
I did some optimisations: precompiling the various regexes and storing them in an array, quoting the contents of the fields etc. However, this algorithm is O(n²), which is bad.
Here is an elegant variant of that algorithm that knows that only the first field is different — the rest of the line has to be the same character for character:
my @sample2 = <$file2>;

foreach my $s1 (<$file1>) {
    foreach my $s2 (@sample2) {
        print $s1, $s2 if (split /\t/, $s1, 2)[1] eq (split /\t/, $s2, 2)[1];
    }
}
I just test for string equality of the rest of the line. While this algorithm is still O(n²), it outperforms the first solution roughly by an order of magnitude simply by avoiding braindead regexes here.
Finally, here is an O(n) solution. It is a variant of the previous one, but executes the loops after each other, not inside each other, therefore finishing in linear time. We use hashes:
# first loop via map
my %seen = map {reverse(split /\t/, $_, 2)}
           # map {/\S/ ? $_ : () }   # uncomment this line to handle empty lines
           <$file1>;

# 2nd loop
foreach my $line (<$file2>) {
    my ($id2, $key) = split /\t/, $line, 2;
    if (defined (my $id1 = $seen{$key})) {
        print "$id1\t$key";
        print "$id2\t$key";
    }
}
%seen is a hash that has the rest of the line as a key and the first field as a value. In the second loop, we retrieve the rest of the line again. If this line was present in the first file, we reconstruct the whole line and print it out. This solution is better than the others and scales well up- and downwards, because of its linear complexity.
How about:
#!/usr/bin/perl
use File::Slurp;
use strict;

my ($ina, $inb) = @ARGV;

my @lines_a = File::Slurp::read_file($ina);
my @lines_b = File::Slurp::read_file($inb);

my $table_b = {};
my $ln = 0;

# Store all lines of the second file in a hash, with every different value as a hash key.
# If there are several identical ones we store them all, so the hash values are lists
# containing the id and line number.
foreach (@lines_b) {
    chomp;   # strip newlines
    $ln++;   # count current line number
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2);   # split on whitespace; could be tabs or spaces instead
    if (exists $table_b->{$rest}) {
        push @{ $table_b->{$rest} }, [$id, $ln];  # push to existing list if we already found an entry that is the same
    } else {
        $table_b->{$rest} = [ [$id, $ln] ];       # create new entry if this is the first one
    }
}

# Go through the first file and print out all matches we might have
$ln = 0;
foreach (@lines_a) {
    chomp;
    $ln++;
    my ($id, $rest) = split(m{[\t\s]+}, $_, 2);
    if (exists $table_b->{$rest}) {   # if we have this entry print where it is found
        print "$ina:$ln:\t\t'$id\t$rest'\n " . (join '\n ', map { "$inb:$_->[1]:\t\t'$_->[0]\t$rest'" } @{ $table_b->{$rest} }) . "\n";
    }
}

Is there any better way to "grep" from a large file than using `grep` in perl?

The $rvsfile is the path of a file of about 200M. I want to count the number of lines which have $userid in them. But using grep in a while loop seems very slow. So is there any efficient way to do this? Because $rvsfile is very large, I can't read it into memory using @tmp = <FILEHANDLE>.
while(defined($line = <SRCFILE>))
{
    $line =~ /^([^\t]*)\t/;
    $userid = $1;
    $linenum = `grep '^$userid\$' $rvsfile | wc -l`;
    chomp($linenum);
    print "$userid $linenum\n";
    if($linenum == 0)
    {
        print TARGETFILE "$line";
    }
}
And how can I get the part before \t in a line without a regex? For example, the line may look like this:
2013123\tsomething
How can I get 2013123 without regex?
Yes, you are forking a shell on each loop invocation. This is slow. You also read the entire $rsvfile once for every user. This is too much work.
Read SRCFILE once and build a list of @userids.
Read $rvsfile once keeping a running count of each user id as you go.
Sketch:
my @userids;

while(<SRCFILE>)
{
    push @userids, $1 if /^([^\t]*)\t/;
}

my $regex = join '|', @userids;

my %count;

while (<RSVFILE>)
{
    ++$count{$1} if /^($regex)$/o
}

# %count has everything you need...
Use hashes:
my %count;
while (<LARGEFILE>) {
    chomp;
    $count{$_}++;
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE
Or if you fear using too much memory for the hash (i.e. you're interested in 6 users, and there are 100K more in the large file), do it another way:
my %count;
while (<SMALLFILE>) {
    /^(.*?)\t/ and $count{$1} = 0;
};
while (<LARGEFILE>) {
    chomp;
    $count{$_}++ if defined $count{$_};
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE, *if* userid is in SMALLFILE
You can search for the location of the first \t using index, which will be faster. You could then use substr to extract the part before it.
Suggest you benchmark various approaches.
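A small sketch of that index/substr approach, using the sample line layout from the question:
my $line = "2013123\tsomething";

# Find the first tab and take everything before it, without a regex
my $tab_pos = index($line, "\t");
my $userid  = $tab_pos >= 0 ? substr($line, 0, $tab_pos) : $line;

print "$userid\n";   # prints 2013123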
If I read you correctly you want something like this:
#!/usr/bin/perl
use strict;
use warnings;

my $userid  = 1246;
my $count   = 0;
my $rsvfile = 'sample';

open my $fh, '<', $rsvfile;

while (<$fh>) {
    $count++ if /$userid/;
}

print "$count\n";
or even (and someone correct me if I am wrong, but I don't think this reads the whole file in):
#!/usr/bin/perl
use strict;
use warnings;

my $userid  = 1246;
my $rsvfile = 'sample';

open my $fh, '<', $rsvfile;

my $count = grep {/$userid/} <$fh>;

print "$count\n";
If <SRCFILE> is relatively small, you could do it the other way round. Read the larger file one line at a time, and check each userid per line, keeping a count of each userid using a hash structure. Something like:
my %userids = map {($_, 0)}                # use as hash key with init value of 0
              grep {$_}                    # only return matches
              map {/^([^\t]+)/} <SRCFILE>; # extract ID

while (defined($line = <LARGEFILE>)) {
    for (keys %userids) {
        ++$userids{$_} if $line =~ /\Q$_\E/;   # \Q...\E escapes special chars in $_
    }
}
This way, only the smaller data is read repeatedly and the large file is scanned once. You end up with a hash of every userid, and the value is the number of lines it occurred in.
If you have a choice, try it with awk:
awk 'FNR==NR{a[$1];next} { for(i in a) { if ($0 ~ i) { print $0} } } ' $SRCFILE $rsvfile