Using iterated variables with regex

The point of the overall script is to:
Step 1) Open a single-column file and read off the first entry.
Step 2) Open a second file containing lots of rows and columns, read off EACH line one at a time, and find anything in that line that matches the first entry from the first file.
Step 3) If a match is found, then "do something constructive"; if not, go back to the first file, take the second entry, and repeat steps 2 and 3, and so on...
Here is the script:
#!/usr/bin/perl
use strict; #use warnings;
unless(@ARGV) {
print "\nUsage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = $ARGV[0];
chomp( $list );
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
my( @list ) = (<LIST>);
close LIST;
open (CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my(@spreadsheet) = (<CHR1>);
close CHR1;
for (my $i = 0; $i < scalar @list; $i++ ) {
print "$i in list is $list[$i]\n";
for (my $j = 1; $j < scalar @spreadsheet; $j++ ) {
#print "$spreadsheet[$j]\n";
if ( $spreadsheet[$j] ) {
print "will $list[$i] match with $spreadsheet[$j]?\n";
}
else { print "no match\n" };
} #for
} #for
I plan to use a regex in the line if ( $spreadsheet[$j] ) { but am having a problem at this step as it is now. On the first iteration, the line print "will $list[$i] match with $spreadsheet[$j]?\n"; prints $list[$i] OK but does not print $spreadsheet[$j]. This line prints both variables correctly on the second and following iterations. I do not see why.

At first glance nothing looks overtly incorrect. As mentioned in the comments, the $j = 1 looks questionable, but perhaps you are skipping the first row on purpose.
Here is a more Perlish starting point that is tested. If it does not work, then you have something going on with your input files.
Note the extended trailing-whitespace removal. Sometimes if you open a WINDOWS file on a UNIX machine and use chomp, you can end up with an embedded \r in your text that causes weird things to happen to printed output.
#!/usr/bin/perl
use strict; #use warnings;
unless(@ARGV) {
print "\nUsage: $0 filename\n\n"; # $0 name of the program being executed
exit;
}
my $list = shift;
unless (open(LIST, "<$list")) {
print "\n I can't open your list of genes!!! \n";
exit;
}
open(CHR1, "<acembly_chr_sorted_by_exon_count.txt") or die;
my @spreadsheet = map { s/\s+$//; $_ } <CHR1>;
close CHR1;
# s/\s+$//; is like chomp but trims all trailing whitespace even
# WINDOWS files opened on a UNIX system.
for my $item (<LIST>) {
$item =~ s/\s+$//; # trim all trailing whitespace
print "==> processing '$item'\n";
for my $row (@spreadsheet) {
if ($row =~ /\Q$item\E/) { # see perlre for \Q \E
print "match '$row'\n";
}
else {
print "no match '$row'\n";
}
}
}
close LIST;
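To see the failure mode this guards against, here is a small self-contained demonstration (the sample string is made up): on a CRLF line, chomp removes only the "\n" and leaves the "\r" behind, while the s/\s+$// trim removes both.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Simulate a line read from a WINDOWS (CRLF) file on a UNIX machine.
my $line = "GENE1\r\n";

my $chomped = $line;
chomp $chomped;                     # strips the "\n" but keeps the "\r"

(my $trimmed = $line) =~ s/\s+$//;  # strips all trailing whitespace, "\r" included

print "chomp leaves ", length($chomped), " characters\n";  # 6 ("GENE1" plus "\r")
print "trim leaves ",  length($trimmed), " characters\n";  # 5 ("GENE1")
```

The stray "\r" is also why an interpolated variable can seem to vanish from printed output: the carriage return moves the cursor back to the start of the line before the rest of the text prints.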

Related

Perl apply partial match regex on a line in long text file using hash key

Input 1: I have a hash of chemical names. These names are short names and are the keys of the hash.
Input 2: I have a text book (I mean a very long text file) where the above short names appear in full.
Task: Wherever the name appears in full in the text file, if the next line contains "" then I have to replace this "" with the relevant hash value, $hash{key}{description}.
Example: if the hash key is Y then it might appear in the text file as either
X.Y.Z or just X.YZ or XYZ or XY2 or X_Y_Z02. It's unpredictable, but it appears somewhere in the middle or at the end.
That means the text file name is a partial match to the hash key name.
My trials: I tried keeping the full file in an array, then finding where the empty "" appears. Once it appears, I do a regex compare on the previous line against the hash keys. But this does not work :(. Also the process is too slow. I have tried different kinds of techniques with experts' help but failed to improve the speed. Please help.
My program is as follows:
use strict;
use warnings;
my $file = "Chemicalbook.txt"; #In text file full name might appear as Dihydrogen.monoxide.hoax_C
my $previous_line = "";
my %hash;
$hash{'monoxide'}{description} = "accelerates corrosion";
open(my $FILE,'<',$file) or die "Cannot open input file";
open(my $output,'>',"outfile.txt") or die "Cannot open output file";
my @file_in_array = <$FILE>;
foreach my $line (@file_in_array) {
my $name = $previous_line;
if($line =~ /""/) {
foreach my $shortname(keys %hash)
{
if($previous_line =~ /$shortname/) {
$line =~ s/""/$hash{$shortname}{description}/;
}
}
}
$previous_line = $line;
print {$output} $line ;
}
close($FILE);
close($output);
Looping over all keys for each line is hopeless(ly slow). Try replacing the entire inner foreach loop with this:
while ($previous_line =~ /(\w+)/g)
{
if (my $s = $hash{$1})
{
$line =~ s/""/$$s{description}/;
}
}
It will pick up shortnames as long as they're "standing alone" in the text.
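As a quick sanity check of that inner loop, here is a self-contained run on made-up data (the chemical name, the description, and the "" line are taken from the question's setup):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %hash = ( monoxide => { description => "accelerates corrosion" } );

my $previous_line = "Dihydrogen.monoxide.hoax_C\n";  # full name, shortname embedded
my $line          = "\"\"\n";                        # the line to patch

# Scan each word of the previous line and look it up directly in the hash,
# instead of looping over every key.
while ($previous_line =~ /(\w+)/g) {
    if (my $s = $hash{$1}) {
        $line =~ s/""/$$s{description}/;
    }
}
print $line;   # accelerates corrosion
```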
my %hash;
my @arr=qw(X.Y.Z X.YZ XYZ XY2 ZZZ Chromium.trioxideChromic_02acid);
$hash{'Y'}='Hello';
$hash{'R'}='Hai';
$hash{'trioxide'}='Testing';
foreach my $line (@arr)
{
if( my($key)= grep { $line =~ /$_/ } keys(%hash)) {
print "$line - $hash{$key} \n";
}
else {
print "Unmatched $line\n";
}
}

Take an array of phone numbers and search another array for each occurrence of said number and print that matching line and the following line

I have two text files. I'm importing each file into an array. Each value in the numbers array should be used to search the users array for its match. If found, echo the matching line and the following line.
So, if the first entry in the numbers array is 1234, search users array for 1234. If found print that line and the next.
numbers.txt looks like:
1234567021
1234566792
users file looks like:
1234567021#host.com User-Password == "secret"
Framed-IP-Address = 000.000.000.000,
What I have so far:
use strict;
my $users_file = "users";
my $numbers_file = "numbers.txt";
my $phonenumber;
my $numbers;
#### Place phone number into an array ####
open (RESULTS, $numbers_file) or die "Unable to open file: $users_file\n$!";
my @numbers;
@numbers = <NUMBER_RESULTS>;
close(NUMBER_RESULTS);
#### Place users file contents into an array ####
open (RESULTS, $users_file) or die "Unable to open file: $users_file\n$!";
my @users_data;
@users_data = <RESULTS>;
close(RESULTS);
#### Search the array for the string ####
foreach $numbers(@users_data) {
if (index($numbers,$phonenumber) ge 0) {
my @list = grep /\b$numbers\b/, @users_data;
chomp @list;
print "$_\n" foreach @list;
}
}
exit 1;
You are recreating a search for a key when Perl has a built-in hash data type that will handle this better and faster than rolling your own. Using it will take a little more work when reading in the data, but it will be worth it.
First, let's switch to a modern version of open, where we use a lexically scoped variable for the file handle and specify a mode.
open (my $results, "<", $users_file) or die "Unable to open file: $users_file\n$!";
From there, we will read the file one line at a time and populate the hash.
my (%users_data, $number, $number_line);
while(<$results>)
{
chomp;
if(defined($number))
{
$user_data{$number} = "$number_line\t$_\n"; #load the line after the number into the hash value.
undef $number;
}else
{
if(/^(\d+)\#/) #match digits between the beginning of the line and the # symbol.
{
$number = $1; #save matched digits from $1.
$number_line = $_;
}
}
}
Note that this is assuming that the data is well formatted. If there are concerns, you can test for proper formatting in the else clause.
Now, for the output we can use the following
for (@numbers)
{
chomp; #since we didn't remove newlines when populating @numbers
if( defined($users_data{$_}) )
{
print $users_data{$_};
}
}
EDIT
Here is a working version. Note that use strict and use warnings helped to catch that one variable was declared (RESULTS and %users_data) while another was used later (NUMBER_RESULTS and %user_data), which is why those are so important. Also, Data::Dumper was used to print out the contents of the array @numbers and the hash %users_data to see what data actually made it into the data structures.
#!/usr/bin/env perl
use strict;
use warnings;
#use Data::Dumper;
my $users_file = "users";
my $numbers_file = "numbers.txt";
#### Place phone number into an array ####
open (my $results, "<", $numbers_file) or die "Unable to open file: $numbers_file\n$!";
my @numbers;
@numbers = <$results>;
close($results);
#print Dumper \@numbers;
open (my $results, "<", $users_file) or die "Unable to open file: $users_file\n$!";
my (%users_data, $number, $number_line);
while(<$results>)
{
chomp;
if(defined($number))
{
$users_data{$number} = $number_line."\n$_\n"; #load the line after the number into the hash value.
undef $number;
}else
{
if(/^(\d+)\#/) #match digits between the beginning of the line and the # symbol.
{
$number = $1; #save matched digits from $1.
$number_line = $_;
}
}
}
#print Dumper \%users_data;
for (@numbers)
{
chomp; #since we didn't remove newlines when populating @numbers
if( defined($users_data{$_}) )
{
print $users_data{$_};
}
}
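To check the two-line record parsing in isolation, the same loop can be run over a couple of made-up lines instead of the users file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two made-up lines in the users-file format: a number line, then its detail line.
my @lines = (
    '1234567021#host.com User-Password == "secret"' . "\n",
    'Framed-IP-Address = 000.000.000.000,' . "\n",
);

my (%users_data, $number, $number_line);
for (@lines) {
    chomp;
    if (defined $number) {
        # The line after the number becomes the second half of the hash value.
        $users_data{$number} = "$number_line\n$_\n";
        undef $number;
    }
    elsif (/^(\d+)\#/) {   # digits between start of line and the # symbol
        $number      = $1;
        $number_line = $_;
    }
}

print $users_data{1234567021};   # prints both lines of the record
```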

Use Perl to count occurrences of all words in a file or in all files in a directory

So I am trying to write a Perl script which will take in 3 arguments.
First argument is the input file or directory.
If it is a file, it will count number of occurrences of all words
If it is a directory, it will recursively go through each directory and get all the number of occurrences for all words in the files within those directories
Second argument is a number that will be how many of the words to display with the highest number of occurrences.
This will print to the console only the number for each word
Print them to an output file which is the third argument in the command line.
It seems to be working as far as recursively searching through directories, finding all occurrences of the words in a file, and printing them to the console.
How can I print these to an output file? Also, how would I take the second argument (the number, say 5) and have it print to the console the 5 words with the most occurrences, while printing the words to the output file?
The following is what I have so far:
#!/usr/bin/perl -w
use strict;
search(shift);
my $input = $ARGV[0];
my $output = $ARGV[1];
my %count;
my $file = shift or die "ERROR: $0 FILE\n";
open my $filename, '<', $file or die "ERROR: Could not open file!";
if ( -f $filename ) {
print("This is a file!\n");
while ( my $line = <$filename> ) {
chomp $line;
foreach my $str ( $line =~ /\w+/g ) {
$count{$str}++;
}
}
foreach my $str ( sort keys %count ) {
printf "%-20s %s\n", $str, $count{$str};
}
}
close($filename);
if ( -d $input ) {
sub search {
my $path = shift;
my @dirs = glob("$path/*");
foreach my $filename (@dirs) {
if ( -f $filename ) {
open( FILE, $filename ) or die "ERROR: Can't open file";
while ( my $line = <FILE> ) {
chomp $line;
foreach my $str ( $line =~ /\w+/g ) {
$count{$str}++;
}
}
foreach my $str ( sort keys %count ) {
printf "%-20s %s\n", $str, $count{$str};
}
}
# Recursive search
elsif ( -d $filename ) {
search($filename);
}
}
}
}
I would suggest restructuring your program/script. What you have posted is difficult to follow. A few comments might be helpful to follow what is happening. I'll try to go through how I would arrange things, with some code snippets to hopefully help explain items. I'll go through the three items you outlined in your question.
Since the first argument can be a file or a directory, I would use -f and -d to determine what the input is. I would use a list/array to contain the files to be processed. If the input is only a file, I would just push it onto the processing list. Otherwise, I would call a routine to return a list of files to be processed (similar to your search subroutine). Something like:
# List file files to process
my @fileList = ();
# if input is only a file
if ( -f $ARGV[0] )
{
push @fileList,$ARGV[0];
}
# If it is a directory
elsif ( -d $ARGV[0] )
{
@fileList = search($ARGV[0]);
}
So in your search subroutine, you need a list/array onto which to push items that are files, and then return that array from the subroutine (after you have processed the list of files from the glob call). When you have a directory, you call search with the path (just as you are currently doing), pushing the elements onto your current array, such as:
# If it is a file, save it to the list to be returned
if ( -f $filename )
{
push @returnValue,$filename;
}
# else if a directory, get the files from the directory and
# add them to the list to be returned
elsif ( -d $filename )
{
push @returnValue, search($filename);
}
After you have the file list, loop through it, processing each file (opening it, reading the lines, closing it, and counting the words on each line). The foreach loop you have for processing each line works correctly. However, if your words have periods, commas or other punctuation, you may want to remove those items before counting the word in a hash.
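One cheap way to ignore punctuation is to match word characters instead of splitting: a /\w+/g scan never sees the commas and periods, so "hat." and "hat" count as the same word. A minimal sketch with a made-up line:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %counts;
my $line = "the cat, the hat.";

# /\w+/g in list context returns every run of word characters,
# skipping punctuation and whitespace entirely.
for my $word ( $line =~ /\w+/g ) {
    $counts{$word}++;
}

printf "%-20s %s\n", $_, $counts{$_} for sort keys %counts;
```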
For the next part, you asked about determining the words with the highest counts. In that case, you want to make another hash whose key is the count (for each word) and whose value is a list/array of the words associated with that number of counts. Something like:
# Hash with key being a number and value a list of words for that number
my %totals= ();
# Temporary variable to store occurrences (counts) of the word
my $wordTotal;
# $w is the words in the counts hash
foreach my $w ( keys %counts )
{
# Get the counts for the word
$wordTotal = $counts{$w};
# value of the hash is an array, so de-reference the array ( the @{ } )
# and push the word onto that array
push @{ $totals{$wordTotal} },$w; # the key to totals is the value of the counts hash
# for which the words ($w) are the keys
}
To get the words with the highest counts, sort the keys of the totals hash numerically and reverse the sorted list so the N highest come first. Since each value is an array of words, we also have to count each output to stop after the N highest.
# Number of items outputted
my $current = 0;
# sort the total (keys) and reverse the list so the highest values are first
# and go through the list
foreach my $t ( reverse sort { $a <=> $b } keys %totals ) # use the numeric comparison in the sort
{
# Since each value of the totals hash is an array of words,
# loop through that array for the values and print out the number
foreach my $w ( sort @{ $totals{$t} } )
{
# Print the number for the count of words
print "$t\n";
# Increment the number output
$current++;
# if this is the number to be printed, we are done
last if ( $current == $ARGV[1] );
}
# if this is the number to be printed, we are done
last if ( $current == $ARGV[1] );
}
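Put together on some made-up counts, the inversion and the reverse numeric sort look like this (a sketch of the two snippets above, not the full program):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %counts = ( apple => 3, pear => 1, plum => 3 );

# Invert: count => list of words having that count.
my %totals;
push @{ $totals{ $counts{$_} } }, $_ for keys %counts;

# Highest counts first; words sharing a count come out together.
for my $t ( reverse sort { $a <=> $b } keys %totals ) {
    print "$t: @{ [ sort @{ $totals{$t} } ] }\n";
}
```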
The third part of printing to a file, it is unclear what "them" is (words, counts or both; limited to top ones or all of the words) from your question. I will leave that effort for you to open a file, print out the information to the file and close the file.
This will total up the occurrences of words in a directory or file given on the command line:
#!/usr/bin/env perl
# wordcounter.pl
use strict;
use warnings;
use IO::All -utf8;
binmode STDOUT, 'encoding(utf8)'; # you may not need this
my @allwords;
my %count;
die "Usage: wordcounter.pl <directory|filename> number\n" unless ~~@ARGV == 2;
if (-d $ARGV[0] ) {
push @allwords, $_->slurp for io($ARGV[0])->all_files;
}
elsif (-f $ARGV[0]) {
@allwords = io($ARGV[0])->slurp ;
}
while (my $line = shift @allwords) {
foreach ( split /\s+/, $line) {
$count{$_}++
}
}
my $count_to_show;
for my $word (sort { $count{$b} <=> $count{$a} } keys %count) {
printf "%-30s %s\n", $word, $count{$word};
last if ++$count_to_show == $ARGV[1];
}
By modifying the sort and/or io calls you can sort { } by number of occurrences or alphabetically by word, either for a file or for all files in a directory. Those options would be fairly easy to add as parameters. You can also filter or change how words are defined for inclusion in the %count hash by changing foreach ( split /\s+/, $line) to include a match/filter, such as foreach ( grep { length($_) <= 5 } split /\s+/, $line) in order to only count words of five or fewer letters.
Sample run in current directory:
./wordcounter ./ 10
the 116
SV 87
i 66
my_perl 58
of 54
use 54
int 49
PerlInterpreter 47
sv 47
Inline 47
return 46
Caveats
you should probably add a test for file mimetypes, readability, etc.
pay attention to unicode
to write to a file just add > filename.txt to the end of your commandline ;-)
IO::All is not the standard CORE IO package; I am only advertising and promoting it here ;-) (you could swap that bit out)
If you wanted to added a sort_by option (-n --numeric, -a --alphabetic etc.) Sort::Maker might be one way to make that manageable.
EDIT: I had neglected to add options as the OP requested.
I have figured it out. The following is my solution. I'm not sure if it's the best way to do it, but it works.
# Check if there are three arguments in the commandline
if (@ARGV < 3) {
die "ERROR: There must be three arguments!\n";
exit;
}
# Open the file
my $file = shift or die "ERROR: $0 FILE\n";
open my $fh,'<', $file or die "ERROR: Could not open file!";
# Check if it is a file
if (-f $fh) {
print("This is a file!\n");
# Go through each line
while (my $line = <$fh>) {
chomp $line;
# Count the occurrences of each word
foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
$count{$str}++;
}
}
}
# Check if the INPUT is a directory
if (-d $input) {
# Call subroutine to search directory recursively
search_dir($input);
}
# Close the file
close($fh);
$high_count = 0;
# Open the file
open my $fileh,'>', $output or die "ERROR: Could not open file!\n";
# Sort the most occurring words in the file and print them
foreach my $str (sort {$count{$b} <=> $count{$a}} keys %count) {
$high_count++;
if ($high_count <= $num) {
printf "%-31s %s\n", $str, $count{$str};
}
printf $fileh "%-31s %s\n", $str, $count{$str};
}
exit;
# Subroutine to search through each directory recursively
sub search_dir {
my $path = shift;
my @dirs = glob("$path/*");
# Loop through filenames
foreach my $filename (@dirs) {
# Check if it is a file
if (-f $filename) {
# Open the file
open(FILE, $filename) or die "ERROR: Can't open file";
# Go through each line
while (my $line = <FILE>) {
chomp $line;
# Count the occurrences of each word
foreach my $str ($line =~ /\b[[:alpha:]]+\b/g) {
$count{$str}++;
}
}
# Close the file
close(FILE);
}
elsif (-d $filename) {
search_dir($filename);
}
}
}
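For what it's worth, the hand-rolled glob recursion can also be delegated to the core File::Find module. This is a sketch rather than a drop-in replacement: it builds a throwaway directory with File::Temp so it is self-contained, and reuses the same word-counting body as the solution above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;               # core module
use File::Temp qw(tempdir);   # core module, used only to make the sketch self-contained

# A throwaway directory with one sample file stands in for the real input.
my $dir = tempdir( CLEANUP => 1 );
open my $out, '>', "$dir/sample.txt" or die "ERROR: $!";
print $out "the cat and the hat\n";
close $out;

my %count;
find(
    sub {
        return unless -f;     # File::Find also hands us the directories themselves
        open my $fh, '<', $_ or die "ERROR: Can't open $_: $!";
        while ( my $line = <$fh> ) {
            $count{$_}++ for $line =~ /\b[[:alpha:]]+\b/g;
        }
        close $fh;
    },
    $dir
);

printf "%-20s %s\n", $_, $count{$_} for sort keys %count;
```

File::Find handles the recursion and calls the sub once per entry, so there is no need to distinguish -f from -d for the descent yourself.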

Perl Regex match works, but replace does not

I have put together a Perl script to go through a directory and match various keys in the source and output the results to a text file. The match operation works well, however the end goal is to perform a replace operation. The Perl script is as follows:
#!/usr/bin/perl
#use strict;
use warnings;
#use File::Slurp;
#declare variables
my $file = '';
my $verbose = 0;
my $logfile;
my @files = grep {/[.](pas|cmm|ptd|pro)$/i} glob 'C:\users\perry_m\desktop\epic_test\pascal_code\*.*';
#iterate through the files in input directory
foreach $file (@files) {
print "$file\n";
#read the file into a single string
open FILEHANDLE, $file or die $!;
my $string = do { local $/; <FILEHANDLE> };
#perfrom REGEX on this string
########################################################
#fix the include formats to conform to normal PASCAL
$count = 0;
while ($string =~ m/%INCLUDE/g)
{
#%include
$count++;
}
if ($count > 0)
{
print " $count %INCLUDE\n";
}
$count = 0;
while ($string =~ m/INCLUDE/g)
{
#%INCLUDE;
$count++;
}
if ($count > 0)
{
print " $count INCLUDE\n";
}
$count = 0;
while ($string =~ m/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/g)
{
#$1$2;
$count++;
}
if ($count > 0)
{
print " $count XXXX:include \n";
}
}
This produces output as desired, an example is below:
C:\users\perry_m\desktop\epic_test\pascal_code\BRTINIT.PAS
1 INCLUDE
2 XXXX:include
39 external and readonly
However, if I change the regex operations to try to implement a replace, using the replacement operations shown in the commented lines above, the script hangs and never returns. I imagine it is somehow related to memory, but I am new to Perl. I was also trying to avoid parsing the file line by line if possible.
Example:
while ($string =~ s/%INCLUDE/%include/g)
{
#%include
$count++;
}
and
while ($string =~ s/(%include\s+')[A-Za-z0-9]+:([A-Za-z0-9]+.[A-Za-z]+')/$1$2;/g)
{
#$1$2;
$count++;
}
Edit: simplified the examples
The problem is with your while loops. A loop like
while ($string =~ m/INCLUDE/g) { ... }
will execute once for each occurrence of INCLUDE in the target string, but a substitution like
$string =~ s/INCLUDE/%INCLUDE;/
will make all of the replacements in one go and return the number of replacements made. So a loop
while ($string =~ s/INCLUDE/%INCLUDE;/g) { ... }
will endlessly add more and more percent signs before, and semicolons after, every INCLUDE.
To find the number of replacements made, change all your loops like this to just
$count = $string =~ s/INCLUDE/%INCLUDE;/g
The pattern in s/INCLUDE/%INCLUDE/g will match the replacement also, so if you're running it in a while loop it will run forever (until you run out of memory).
s///g will replace all matches in a single shot, so you will very rarely need to put it in a loop. The same goes for m//g: it will do the counting in a single step if you put it in list context.
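Both points are easy to check on a made-up string: the substitution returns the number of replacements it made, and m//g in list context yields one element per match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "%INCLUDE 'A:B.PAS'; %INCLUDE 'C:D.PAS';";

# s///g makes every replacement in one pass and returns how many it made.
my $count = ( $string =~ s/%INCLUDE/%include/g );
print "$count replacements\n";   # 2

# An empty-list assignment puts m//g in list context, counting the matches.
my $matches = () = $string =~ /%include/g;
print "$matches matches\n";      # 2
```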

Perl newbie: trying to find string in array of strings

I need to match a string against an array of strings. The string that I am searching for should be able to contain wildcards.
#!/usr/bin/perl
#
## disable buffered I/O which would lead
## to deadloops for the Apache server
$| = 1;
#
## read URLs one per line from stdin
while (<>) {
my $line = $_;
my @array1 = ("abc","def","ghi");
$found = 0;
if (/$line/i ~~ @array1)
{
print "found\n";
}
else
{
print "not found\n";
}
}
I tested this script with the input abc and it returned not found:
perl ./mapscript.pl
abc
not found
Your input has a newline at the end. Add:
chomp $line;
right after
my $line = $_;
Use chomp(my $input = $_) to remove the newline, instead of my $input = $_, inside your while loop.
** Oops... didn't see that I'm posting a duplicate.
A newline at the end always exists when using <>. See chomp.
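Since the question also mentions wildcards, here is a minimal sketch of both fixes together; the wildcard handling is an assumption (shell-style * translated into the regex .*), as the question does not say which wildcard syntax is meant.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @array1 = ("abc", "def", "ghi");

my $line = "ab*\n";   # what <> would hand you, newline included
chomp $line;          # without this, the trailing newline defeats the match

# Assumed wildcard convention: translate shell-style "*" into regex ".*".
(my $pat = $line) =~ s/\*/.*/g;

my @hits = grep { /^$pat$/i } @array1;
if (@hits) {
    print "found: @hits\n";
}
else {
    print "not found\n";
}
```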