Count unique occurrences of a match in a line - regex

I have a file with entries like:
(----) Manish Garg 74163: V2.0.1_I3_SIT: KeyStroke Logger decrypted file for key stroke displayed a difference of 4 hours from CCM time. - 74163: KeyStroke Logger decrypted file for key stroke displayed a difference of 4 hours from CCM time. 2014/07/04
I want the unique count of the id "74163" (or any id) in a line.
Currently it gives the output as:
updated_workitem value> "74163"
Count> "2"
But I want the count value to be 1. (I don't want to include duplicate entries in the count.)
My code is
my $workitem;
$file = new IO::File;
$file->open("<compare.log") or die "Cannot open compare.log";
@file_list = <$file>;
$file->close;
foreach $line (@file_list) {
    while ($line =~ m/(\d{4,}[,|:])/g) {
        @temp = split(/[:|,]/, $1);
        push @work_items, $temp[0];
    }
}
my %count;
my @wi_to_built;
map { $count{$_}++ } @work_items;
foreach $workitem (sort keys(%count)) {
    chomp($workitem);
    print "updated_workitem value> \"$workitem\"\n";
    print "Count> \"$count{$workitem}\"\n";
}

Use a hash to track unique ids found in a particular line:
foreach my $line (@file_list) {
    my %line_ids;
    while ($line =~ m/(\d{4,})[,|:]/g) {
        $line_ids{$1} = 1;   # Record unique ids
    }
    push @work_items, keys %line_ids;   # Save the ids
}
Note, I've changed your regex slightly so you don't need to split to a temporary array.
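Putting this together with your counting code, a minimal self-contained sketch could look like the following (same file name and output format as in the question; note that inside a character class a | matches a literal pipe, so the delimiter class is written as [,:] here):

use strict;
use warnings;

my %count;
open my $fh, '<', 'compare.log' or die "Cannot open compare.log: $!";
while (my $line = <$fh>) {
    my %line_ids;
    # record each id at most once for this line
    $line_ids{$1} = 1 while $line =~ /(\d{4,})[,:]/g;
    $count{$_}++ for keys %line_ids;
}
close $fh;

foreach my $workitem (sort keys %count) {
    print "updated_workitem value> \"$workitem\"\n";
    print "Count> \"$count{$workitem}\"\n";
}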

You can remove duplicates from the array before doing map { $count{$_}++ } @work_items;
@work_items = uniq(@work_items);

sub uniq {
    my %seen;
    grep !$seen{$_}++, @_;
}
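If your List::Util is recent enough (it has shipped a uniq function since version 1.45), you can use that instead of writing your own:

use List::Util qw(uniq);
@work_items = uniq(@work_items);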


Preventing "foo" from matching "foo-bar" with grep -w

I am using grep inside my Perl script and I am trying to grep the exact keyword that I am giving. The problem is that "-w" treats the "-" symbol as a word separator, so the keyword also matches inside hyphenated names.
example:
Let's say that I have these two records:
A1BG 0.0767377011073753
A1BG-AS1 0.233775553296782
if I give
grep -w "A1BG"
it returns both of them but I want only the exact one.
Any suggestions?
Many thanks in advance.
PS.
Here is my whole code.
The input file is two-column, tab-separated. I want to keep a unique value for each gene; in cases where there is more than one record, I calculate the average.
#!/usr/bin/perl
use strict;
use warnings;

# Find the average fc between common genes
sub avg {
    my $total;
    $total += $_ foreach @_;
    return $total / @_;
}

my @mykeys = `cat G13_T.txt | awk '{print \$1}' | sort -u`;
foreach (@mykeys)
{
    my @TSS = ();
    my $op1 = 0;
    my $key = $_;
    chomp($key);
    #print "$key\n";
    my $command = "cat G13_T.txt | grep -E '([[:space:]]|^)$key([[:space:]]|\$)'";
    #my $command = "cat Unique_Genes/G13_T.txt | grep -w $key";
    my @belongs = `$command`;
    chomp(@belongs);
    my $count = scalar(@belongs);
    if ($count == 1) {
        print "$belongs[0]\n";
    }
    else {
        for (my $i = 0; $i < $count; $i++) {
            my @token = split('\t', $belongs[$i]);
            my $lfc = $token[1];
            push(@TSS, $lfc);
        }
        $op1 = avg(@TSS);
        print $key . "\t" . $op1 . "\n";
    }
}
If I got clarifications in comments right, the objective is to find the average of values (second column) for unique names in the first column. Then there is no need for external tools.
Read the file line by line and add up the values for each name. The name uniqueness is guaranteed by using a hash, with the names as keys. Along with this, also track their counts:
use warnings;
use strict;
use feature 'say';

my $file = shift // die "Usage: $0 filename\n";
open my $fh, '<', $file or die "Can't open $file: $!";

my %results;
while (<$fh>) {
    #my ($name, $value) = split /\t/;
    my ($name, $value) = split /\s+/;   # used for easier testing
    $results{$name}{value} += $value;
    ++$results{$name}{count};
}

foreach my $name (sort keys %results) {
    $results{$name}{value} /= $results{$name}{count}
        if $results{$name}{count} > 1;
    say "$name => $results{$name}{value}";
}
After the file is processed, each accumulated value is divided by its count and overwritten by the result, i.e. by its average (/= divides and assigns), but only if the count is greater than 1 (a small efficiency measure).
If there is any use in knowing all values that were found for each name, then store them in an arrayref for each key instead of adding them
while (<$fh>) {
    #my ($name, $value) = split /\t/;
    my ($name, $value) = split /\s+/;   # used for easier testing
    push @{$results{$name}}, $value;
}
where we no longer need the count, as it is given by the number of elements in the array(ref):
use List::Util qw(sum);

foreach my $name (sort keys %results) {
    say "$name => ", sum(@{$results{$name}}) / @{$results{$name}};
}
Note that a hash built this way needs memory comparable to the file size (or may even exceed it), since all values are stored.
This was tested using the shown two lines of sample data, repeated and changed in a file. The code does not test the input in any way, but expects the second field to always be a number.
Notice that there is no reason to ever step out of our program and use external commands.
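As a quick check, running the script (saved here as avg_by_name.pl, an arbitrary name) on a file containing just the two sample records from the question simply echoes each name with its single value:

perl avg_by_name.pl G13_T.txt
A1BG => 0.0767377011073753
A1BG-AS1 => 0.233775553296782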
You may use a POSIX ERE regex with grep like this:
grep -E '([[:space:]]|^)A1BG([[:space:]]|$)' file
To return matches (not matching lines) only:
grep -Eo '([[:space:]]|^)A1BG([[:space:]]|$)' file
Details
([[:space:]]|^) - Group 1: a whitespace or start of line
A1BG - a substring
([[:space:]]|$) - Group 2: a whitespace or end of line
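If your grep was built with PCRE support, the -P option with lookarounds is another way to say "A1BG not preceded or followed by a word character or hyphen" (whether -P is available depends on your grep build):

grep -P '(?<![\w-])A1BG(?![\w-])' file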

Perl apply partial match regex on a line in long text file using hash key

Input 1: I have a hash of chemical names. These names are short names and are the keys of the hash.
Input 2: I have a text book (I mean a very long text file) where the above short names appear in full.
Task: Wherever the name appears in full in the text file, if the next line contains "" then I have to replace this "" with the relevant hash value description, $hash{key}{description}.
Example: if the hash key is Y then it might appear in the text file as
X.Y.Z or just X.YZ, XYZ, XY2 or X_Y_Z02. It's unpredictable, but it appears somewhere in the middle or at the end.
That means the name in the text file is a partial match to the hash key name.
My trials: I tried keeping the full file in an array, then looking for where the empty "" appears. Once it appears, I do a regex comparison of the previous line with the hash keys. But this does not work :( and the process is also too slow. I have tried different kinds of techniques with experts' help but failed to improve the speed. Please help.
My program is as follows:
use strict;
use warnings;

my $file = "Chemicalbook.txt";   # In the text file the full name might appear as Dihydrogen.monoxide.hoax_C
my $previous_line = "";

my %hash;
$hash{'monoxide'}{description} = "accelerates corrosion";

open(my $FILE, '<', $file) or die "Cannot open input file";
open(my $output, '>', "outfile.txt") or die "Cannot open output file";

my @file_in_array = <$FILE>;
foreach my $line (@file_in_array) {
    my $name = $previous_line;
    if ($line =~ /""/) {
        foreach my $shortname (keys %hash)
        {
            if ($previous_line =~ /$shortname/) {
                $line =~ s/""/$hash{$shortname}{description}/;
            }
        }
    }
    $previous_line = $line;
    print {$output} $line;
}
close($FILE);
close($output);
Looping over all keys for each line is hopeless(ly slow). Try replacing the entire inner foreach loop with this:
while ($previous_line =~ /(\w+)/g)
{
    if (my $s = $hash{$1})
    {
        $line = $$s{description};
    }
}
It will pick up shortnames as long as they're "standing alone" in the text.
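If that is still too slow, another option (a sketch, not part of the original code) is to build one alternation regex from all of the keys up front, so each line is scanned with a single compiled pattern instead of one match attempt per key:

# build one regex from all keys, longest first so longer names win
my $names   = join '|',
              map  { quotemeta }
              sort { length $b <=> length $a } keys %hash;
my $name_re = qr/($names)/;

# inside the line loop, replacing the loop over keys
if ($line =~ /""/) {
    if (my ($name) = $previous_line =~ $name_re) {
        $line =~ s/""/$hash{$name}{description}/;
    }
}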
my %hash;
my @arr = qw(X.Y.Z X.YZ XYZ XY2 ZZZ Chromium.trioxideChromic_02acid);
$hash{'Y'} = 'Hello';
$hash{'R'} = 'Hai';
$hash{'trioxide'} = 'Testing';

foreach my $line (@arr)
{
    if ( my ($key) = grep { $line =~ /$_/ } keys(%hash) ) {
        print "$line - $hash{$key} \n";
    }
    else {
        print "Unmatched $line\n";
    }
}

Use Perl to count occurrences of all words in a file or in all files in a directory

So I am trying to write a Perl script which will take in 3 arguments.
The first argument is the input file or directory.
If it is a file, it will count the number of occurrences of all words.
If it is a directory, it will recursively go through each directory and count the occurrences of all words in the files within those directories.
The second argument is a number: how many of the words with the highest number of occurrences to display.
Only the count for each of those words is printed to the console.
The words are printed to an output file, which is the third argument on the command line.
It seems to be working as far as recursively searching through directories, finding all occurrences of the words in a file, and printing them to the console.
How can I print these to an output file? And how would I take the second argument, say 5, and have it print to the console the counts of the words with the most occurrences, while printing the words to the output file?
The following is what I have so far:
#!/usr/bin/perl -w
use strict;

search(shift);

my $input  = $ARGV[0];
my $output = $ARGV[1];
my %count;

my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";

if ( -f $filename ) {
    print("This is a file!\n");
    while ( my $line = <$filename> ) {
        chomp $line;
        foreach my $str ( $line =~ /\w+/g ) {
            $count{$str}++;
        }
    }
    foreach my $str ( sort keys %count ) {
        printf "%-20s %s\n", $str, $count{$str};
    }
}
close($filename);

if ( -d $input ) {
    sub search {
        my $path = shift;
        my @dirs = glob("$path/*");
        foreach my $filename (@dirs) {
            if ( -f $filename ) {
                open( FILE, $filename ) or die "ERROR: Can't open file";
                while ( my $line = <FILE> ) {
                    chomp $line;
                    foreach my $str ( $line =~ /\w+/g ) {
                        $count{$str}++;
                    }
                }
                foreach my $str ( sort keys %count ) {
                    printf "%-20s %s\n", $str, $count{$str};
                }
            }
            # Recursive search
            elsif ( -d $filename ) {
                search($filename);
            }
        }
    }
}
I would suggest restructuring your program/script. What you have posted is difficult to follow; a few comments would make it easier to see what is happening. I'll go through how I would arrange things, with some code snippets, following the three items you outlined in your question.
Since the first argument can be a file or a directory, I would use -f and -d to determine what the input is. I would use a list/array to hold the files to be processed. If the input is only a file, I would just push it onto the processing list. Otherwise, I would call a routine to return a list of files to be processed (similar to your search subroutine). Something like:
# List of files to process
my @fileList = ();

# If the input is only a file
if ( -f $ARGV[0] )
{
    push @fileList, $ARGV[0];
}
# If it is a directory
elsif ( -d $ARGV[0] )
{
    @fileList = search($ARGV[0]);
}
So in your search subroutine, you need a list/array onto which to push the items that are files, and then return that array from the subroutine (after you have processed the list of files from the glob call). When you have a directory, you call search with the path (just as you are currently doing) and push the returned elements onto your array, such as:
# If it is a file, save it to the list to be returned
if ( -f $filename )
{
    push @returnValue, $filename;
}
# Else if it is a directory, get the files from the directory and
# add them to the list to be returned
elsif ( -d $filename )
{
    push @returnValue, search($filename);
}
After you have the file list, loop through it, processing each file (opening it, reading its lines, closing it, and processing the lines for words), as in the sketch below. The foreach loop you have for processing each line works correctly. However, if your words have periods, commas or other punctuation, you may want to remove those before counting the word in the hash.
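A minimal sketch of that loop, using the %counts hash that the later snippets refer to, might be:

my %counts;
foreach my $file (@fileList)
{
    open my $fh, '<', $file or die "Cannot open $file: $!";
    while ( my $line = <$fh> )
    {
        chomp $line;
        # count every run of word characters; strip punctuation here if needed
        $counts{$_}++ for $line =~ /\w+/g;
    }
    close $fh;
}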
For the next part, you asked about determining the words with the highest counts. In that case, you want to make another hash whose keys are the counts and whose values are lists/arrays of the words that occur that many times. Something like:
# Hash with key being a count and value a list of words with that count
my %totals = ();

# Temporary variable to store occurrences (counts) of the word
my $wordTotal;

# $w is each word in the counts hash
foreach my $w ( keys %counts )
{
    # Get the count for the word
    $wordTotal = $counts{$w};

    # The value of the hash is an array, so dereference it (the @{ })
    # and push the word onto that array
    push @{ $totals{$wordTotal} }, $w;   # the key of %totals is the value of the %counts hash,
                                         # for which the words ($w) are the keys
}
To get the words with the highest counts, take the keys of %totals, sort them numerically and reverse the list so the N highest come first. Since each value is an array of words, we also have to count each line of output so we can stop after N.
# Number of items output so far
my $current = 0;

# Sort the totals (keys) numerically and reverse the list so the
# highest values come first, then go through the list
foreach my $t ( reverse sort { $a <=> $b } keys %totals )
{
    # Since each value of the totals hash is an array of words,
    # loop through that array and print out the count for each
    foreach my $w ( sort @{ $totals{$t} } )
    {
        # Print the count for the word
        print "$t\n";

        # Increment the number output
        $current++;

        # If this is the number to be printed, we are done
        last if ( $current == $ARGV[1] );
    }
    # If this is the number to be printed, we are done
    last if ( $current == $ARGV[1] );
}
For the third part, printing to a file, it is unclear from your question what "them" is (words, counts or both; limited to the top ones or all of the words). I will leave it to you to open a file, print the information to it, and close the file.
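For completeness, the skeleton of that step could be as small as this sketch (the output file name is taken from the third command-line argument, as described in the question; what exactly to print is still up to you):

open my $out, '>', $ARGV[2] or die "Cannot open $ARGV[2]: $!";
foreach my $w ( sort keys %counts )
{
    print $out "$w $counts{$w}\n";
}
close $out;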
This will total up the occurrences of words in a directory or file given on the command line:
#!/usr/bin/env perl
# wordcounter.pl
use strict;
use warnings;
use IO::All -utf8;

binmode STDOUT, ':encoding(utf8)';   # you may not need this

my @allwords;
my %count;

die "Usage: wordcounter.pl <directory|filename> number\n" unless ~~@ARGV == 2;

if (-d $ARGV[0]) {
    push @allwords, $_->slurp for io($ARGV[0])->all_files;
}
elsif (-f $ARGV[0]) {
    @allwords = io($ARGV[0])->slurp;
}

while (my $line = shift @allwords) {
    foreach ( split /\s+/, $line ) {
        $count{$_}++;
    }
}

my $count_to_show;
for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
    printf "%-30s %s\n", $word, $count{$word};
    last if ++$count_to_show == $ARGV[1];
}
By modifying the sort and/or io calls you can sort { } by number of occurrences or alphabetically by word, either for a file or for all files in a directory. Those options would be fairly easy to add as parameters. You can also filter or change how words are defined for inclusion in the %count hash by changing foreach ( split /\s+/, $line ) to include a match/filter, such as foreach ( grep { length() <= 5 } split /\s+/, $line ) in order to only count words of five or fewer letters.
Sample run in current directory:
./wordcounter ./ 10
the 116
SV 87
i 66
my_perl 58
of 54
use 54
int 49
PerlInterpreter 47
sv 47
Inline 47
return 46
Caveats
you should probably add a test for file mimetypes, readability, etc.
pay attention to unicode
to write to a file just add > filename.txt to the end of your commandline ;-)
IO::All is not the standard CORE IO package; I am only advertising and promoting it here ;-) (you could swap that bit out)
If you wanted to added a sort_by option (-n --numeric, -a --alphabetic etc.) Sort::Maker might be one way to make that manageable.
EDIT: I had neglected to add the options the OP requested.
I have figured it out. The following is my solution. I'm not sure if it's the best way to do it, but it works.
# Check if there are three arguments on the command line
if (@ARGV < 3) {
    die "ERROR: There must be three arguments!\n";
}

# Open the file
my $file = shift or die "ERROR: $0 FILE\n";
open my $fh, '<', $file or die "ERROR: Could not open file!";

# Check if it is a file
if (-f $fh) {
    print("This is a file!\n");
    # Go through each line
    while (my $line = <$fh>) {
        chomp $line;
        # Count the occurrences of each word
        foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
            $count{$str}++;
        }
    }
}

# Check if the input is a directory
if (-d $input) {
    # Call subroutine to search the directory recursively
    search_dir($input);
}

# Close the file
close($fh);

$high_count = 0;

# Open the output file
open my $fileh, '>', $output or die "ERROR: Could not open file!\n";

# Sort the words by number of occurrences and print them
foreach my $str (sort { $count{$b} <=> $count{$a} } keys %count) {
    $high_count++;
    if ($high_count <= $num) {
        printf "%-31s %s\n", $str, $count{$str};
    }
    printf $fileh "%-31s %s\n", $str, $count{$str};
}

exit;

# Subroutine to search through each directory recursively
sub search_dir {
    my $path = shift;
    my @dirs = glob("$path/*");
    # Loop through filenames
    foreach my $filename (@dirs) {
        # Check if it is a file
        if (-f $filename) {
            # Open the file
            open(FILE, $filename) or die "ERROR: Can't open file";
            # Go through each line
            while (my $line = <FILE>) {
                chomp $line;
                # Count the occurrences of each word
                foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
                    $count{$str}++;
                }
            }
            # Close the file
            close(FILE);
        }
        elsif (-d $filename) {
            search_dir($filename);
        }
    }
}

Problems with pushing data into an array, declaring a hash and conditional statements in Perl

Can anyone help? I'm having problems with my Perl script. I want to push a 3-column input data file into an array, select the ID numbers and names, declare hashes using each kind of ID as the key and the value column as the values, and then run an if-else conditional statement to select the key-value pairs that have a value greater than 2.
Here's an example of the input.txt data file, where column 1 is the ID number, column 2 is the ID name and column 3 is the value associated with columns 1 and 2.
ENSG00000251791 SCARNA6 2.5
ENSG00000238862 SNORD19B 6.3
ENSG00000238527 SN-112 -3
ENSG00000222373 RNY.5P5 1.3
I can get the first part, pushing the data into an array, but I can't get the rest of it to work. I've created two hashes that contain ID number:value and ID name:value pairs, as I'd like both columns in the output file:
ENSG00000251791 SCARNA6 2.5
ENSG00000238862 SNORD19B 6.3
Here's the code:
use strict;
use warnings;

my $input = 'input.txt';
my @input_vars;

open my $input_file_handle, '<', $input or die $!;
while (<$input_file_handle>) {
    chomp $_;
    push @input_vars, $_;
}
close $input_file_handle;

# regex to select ID name, ID number and value
my %id;
foreach (@input_vars) {
    my $regex = '/\w+\s[\w-]+\s\d+\.\d+/';
    while ($_ =~ m/$regex/g) {
        my $id1{$1} = $3;
        my $id2{$2} = $3;
    }
}

foreach (@input_vars) {
    print "$_ ";
    if ($id1{$_} >= 2) {
        print "$id1{$_}";
    } else {
        print "N/A";
    }
    if ($id2{$_} >= 2) {
        print "$id2{$_}";
    } else {
        print "N/A";
        print "\n";
    }
}
I think I have over-complicated it by creating a regex to select the ID numbers and names, so if there's a simpler, more efficient way, that would be great.
Change the first foreach loop to:
foreach (@input_vars) {
    if (/(\w+)\s([\w-]+)\s(\d+\.\d+)$/) {
        $id1{$1} = $3;
        $id2{$2} = $3;
    }
}
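As for the simpler way you asked about: since the columns are whitespace-separated, split avoids the hand-written regex entirely, and you can filter on the value directly. A minimal sketch under that assumption (reading the file directly rather than via the intermediate array):

use strict;
use warnings;

open my $in, '<', 'input.txt' or die "Cannot open input.txt: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($id_number, $id_name, $value) = split /\s+/, $line;
    # keep only records whose value is greater than 2
    print "$id_number\t$id_name\t$value\n" if $value > 2;
}
close $in;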

Is there any better way to "grep" from a large file than using `grep` in perl?

$rvsfile is the path of a file of about 200 MB. I want to count the number of lines which contain $userid. But using grep in a while loop seems very slow. Is there a more efficient way to do this? Because $rvsfile is very large, I can't read it into memory using @tmp = <FILEHANDLE>.
while (defined($line = <SRCFILE>))
{
    $line =~ /^([^\t]*)\t/;
    $userid = $1;
    $linenum = `grep '^$userid\$' $rvsfile | wc -l`;
    chomp($linenum);
    print "$userid $linenum\n";
    if ($linenum == 0)
    {
        print TARGETFILE "$line";
    }
}
And how can I get the part before \t in a line without a regex? For example, the line may look like this:
2013123\tsomething
How can I get 2013123 without regex?
Yes, you are forking a shell on each loop invocation. This is slow. You also read the entire $rsvfile once for every user. This is too much work.
Read SRCFILE once and build a list of @userids.
Read $rvsfile once keeping a running count of each user id as you go.
Sketch:
my @userids;
while (<SRCFILE>)
{
    push @userids, $1 if /^([^\t]*)\t/;
}

my $regex = join '|', @userids;

my %count;
while (<RSVFILE>)
{
    ++$count{$1} if /^($regex)$/o;
}
# %count has everything you need...
Use hashes:
my %count;
while (<LARGEFILE>) {
    chomp;
    $count{$_}++;
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE
Or if you fear using too much memory for the hash (i.e. you're interested in 6 users, and there are 100K more in the large file), do it another way:
my %count;
while (<SMALLFILE>) {
    /^(.*?)\t/ and $count{$1} = 0;
};
while (<LARGEFILE>) {
    chomp;
    $count{$_}++ if defined $count{$_};
};
# now $count{userid} is the number of occurrences
# of $userid in LARGEFILE, *if* userid is in SMALLFILE
You can search for the location of the first \t using index, which will be faster. You could then use substr to extract the part before it.
Suggest you benchmark various approaches.
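For example, extracting the part before the first tab with index and substr rather than a regex might look like this (variable names are just for illustration):

my $tab_pos = index($line, "\t");
my $userid  = $tab_pos >= 0 ? substr($line, 0, $tab_pos) : $line;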
If I read you correctly you want something like this:
#!/usr/bin/perl
use strict;
use warnings;

my $userid = 1246;
my $count = 0;
my $rsvfile = 'sample';

open my $fh, '<', $rsvfile;
while (<$fh>) {
    $count++ if /$userid/;
}
print "$count\n";
or even (and someone correct me if I am wrong, but I don't think this reads the whole file in):
#!/usr/bin/perl
use strict;
use warnings;

my $userid = 1246;
my $rsvfile = 'sample';

open my $fh, '<', $rsvfile;
my $count = grep { /$userid/ } <$fh>;
print "$count\n";
If <SRCFILE> is relatively small, you could do it the other way round. Read in the larger file one line at a time, and check each userid per line, keeping a count of each userid using a hash structure. Something like:
my %userids = map  { ($_, 0) }          # use as hash keys with an initial value of 0
              grep { $_ }               # only keep successful matches
              map  { /^([^\t]+)/ } <SRCFILE>;   # extract the ID

while (defined($line = <LARGEFILE>)) {
    for (keys %userids) {
        ++$userids{$_} if $line =~ /\Q$_\E/;   # \Q...\E escapes special chars in $_
    }
}
This way, only the smaller set of ids is looped over repeatedly, and the large file is scanned once. You end up with a hash of every userid, where the value is the number of lines it occurred in.
If you have a choice, try it with awk:
awk 'FNR==NR{a[$1];next} { for(i in a) { if ($0 ~ i) { print $0} } } ' $SRCFILE $rsvfile