Perl: Comparing the contents of one file with those of several others - regex

I have to read a CSV file (TEST.csv) with contents of this sort:
Sl.No, Label, Customer1, Customer2, Customer3...
1, label1, Y, N, Y...
2, label2, N, Y, Y...
...
and retrieve only the labels marked as "Y" for every "customer", into an external file for that "customer". With some help from SO members in another question, I managed to desist from getting lost in a maze of nested data structures, and am using the construct below. Here, I'm copying the labels marked as "Y" to a temp file _temp.h for the corresponding customers.
Now, the actual "external file" I need to write does not just have the labels, but is a copy of an "internal file" internal.h, which has data in this form:
/*...comments*/
#define header_label1 val1;
#define header_label2 val2;
...
For example, I might have a line #define ABC_Comp1_X_H_CompDes1 value. If the label Comp1_CompDes1 is present in the temp file I create for customer 1, then the line above is copied into the final external file for customer 1.
The following is the code I'm using. However, it throws the error "Global symbol "%tempLines" requires explicit package name" for the line marked "HERE", though I'm not using a hash, and it also reports syntax errors in the next couple of lines concerning the curly braces.
Any guidance as to the reason behind these errors would be highly appreciated.
use strict;
use warnings;
use File::Slurp;
use Data::Dumper;
my $numCustomers;
my $intHeaderFile = "internal.h";
open(my $fh, "<", "TEST.csv") or die "Unable to open CSV, $!";
open(my $infh, "<", $intHeaderFile) or die "Cannot open $intHeaderFile, $!";
my @headerLines = read_file($intHeaderFile);
chomp( my $header = <$fh> );
my @names = split ",", $header;
$numCustomers = scalar(@names) - 2;
print "\nNumber of customers : $numCustomers\n";
my @customerNames;
for(my $i = 0; $i < $numCustomers; $i++)
{
    push @customerNames, $names[$i + 2];
}
my @tempHandles;
my @handles;
my @tempfiles;
my @files;
for(my $i = 0; $i < $numCustomers; $i++)
{
    my $custFile = "customer".$i."_external.h";
    open my $fh, '>', $custFile or die "$custFile: $!";
    push @handles, $fh;
    push @files, $custFile;
    my $tempFile = "customer".$i."_temp.h";
    open my $fh1, '+>', $tempFile or die "$tempFile: $!";
    push @tempHandles, $fh1;
    push @tempfiles, $tempFile;
}
while (<$fh>)
{
    chomp;
    my $nonIncLine = $_;
    my @fields = split ",", $nonIncLine;
    next if $. == 1;
    for(my $i = 0; $i < $numCustomers; $i++)
    {
        print { $tempHandles[$i] } $fields[1], "\n" if 'Y' eq $fields[ $i + 2 ];
    }
}
for(my $i = 0; $i < $numCustomers; $i++)
{
    my @tempLines = read_file($tempfiles[$i]);
    print @tempLines;
    foreach my $headerLine(@headerLines)
    {
        if (grep { $headerLine =~ /$_/} @tempLines ) #HERE
        {
            print { $handles[$i] } $headerLine, "\n";
        }
    }
    unlink($tempfiles[$i]);
}
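The label-gathering step described above could also be done in memory, without the temp files. The following is only a sketch under stated assumptions: the helper name labels_by_customer is hypothetical, an inline sample stands in for TEST.csv, and the real file is assumed to have spaces after the commas as in the sample. With such spacing, splitting on "," alone leaves a leading space on each field, so the fields are split on /\s*,\s*/ here. It does not address the reported error itself.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash of customer name => array ref of labels marked "Y".
sub labels_by_customer {
    my ($fh) = @_;
    chomp( my $header = <$fh> );
    # Split on commas plus surrounding whitespace so fields carry no stray spaces.
    my @names     = split /\s*,\s*/, $header;
    my @customers = @names[ 2 .. $#names ];
    my %labels    = map { $_ => [] } @customers;

    while ( my $line = <$fh> ) {
        chomp $line;
        next unless length $line;
        my @fields = split /\s*,\s*/, $line;
        for my $i ( 0 .. $#customers ) {
            my $flag = $fields[ $i + 2 ] // '';
            push @{ $labels{ $customers[$i] } }, $fields[1] if $flag eq 'Y';
        }
    }
    return %labels;
}

# Inline sample data standing in for TEST.csv (an assumption for illustration).
my $csv = <<'CSV';
Sl.No, Label, Customer1, Customer2, Customer3
1, label1, Y, N, Y
2, label2, N, Y, Y
CSV

open my $in, '<', \$csv or die "Cannot open in-memory CSV: $!";
my %labels = labels_by_customer($in);
close $in;

print "$_: @{ $labels{$_} }\n" for sort keys %labels;
```

Each customer's label list can then be compared against the lines of internal.h directly, using whatever matching rule applies to the header names.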

Related

Is it possible to match regex using variable?

Here is my code
my $filename = 'text.log';
my $items = "donkey";
open(my $fh, '<:encoding(UTF-8)', $filename) or die "Cant open";
while (my $contents = <$fh>)
{
    print "$contents";
    if ( $items =~m/$contents/)
    { print "Found $contents";}
    else { print "NOTHING\n";}
}
Yes, but you'll need to remove the trailing newline from each line ($contents =~ s/\n$//;):
#!/usr/bin/env perl
my $filename = 'text.log';
my $items = "donkey";
open(my $fh, '<:encoding(UTF-8)', $filename) or die "Cant open";
while (my $contents = <$fh>) {
    print "$contents";
    $contents =~ s/\n$//;
    if ($items =~ m/$contents/) {
        print "Found $contents\n";
    } else {
        print "NOTHING\n";
    }
}
Test:
$ cat text.log
test
ok
donk
$ ./test.pl
test
NOTHING
ok
NOTHING
donk
Found donk
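A variation on the answer above, as a minimal sketch: chomp does the same job as s/\n$// here, and \Q...\E makes the line match as literal text in case the file contains regex metacharacters. The helper name line_occurs_in is hypothetical, and an in-memory string stands in for text.log.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $line (after chomp) occurs literally inside $items.
sub line_occurs_in {
    my ( $items, $line ) = @_;
    chomp $line;                     # drop the trailing newline, like s/\n$//
    return 0 unless length $line;    # an empty pattern has special meaning in Perl, so skip it
    return $items =~ /\Q$line\E/ ? 1 : 0;
}

my $items = "donkey";
my $log   = "test\nok\ndonk\n";      # stands in for text.log (assumption)

open my $fh, '<', \$log or die "Cannot open in-memory log: $!";
while ( my $contents = <$fh> ) {
    print line_occurs_in( $items, $contents ) ? "Found $contents" : "NOTHING\n";
}
close $fh;
```

Drop the \Q...\E if the lines in the file are meant to be regular expressions rather than plain strings.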

Perl reading in a file and getting a string in between two strings

I am trying to read in a file and gather everything in between two hash keys. I want to access everything between the $beginString and $endString variables. I have tried multiple regular expressions but haven't been able to get one to work.
my $beginString = "SEARCH";
my $endString = "TEST";
my $fileContent;
open(my $fileHandler, $inputFile) or die "Could not open file '$inputFile' $!";
{
    local $/;
    $fileContent = <$fileHandler>;
}
close($fileHandler);
if($fileContent =~ /\b$beginString\b(.*?)\b$endString\b/){
    my $result = $1;
    print $result;
}
print Dumper($fileContent);
An adaptation of the PerlMonks solution could be:
my $beginString = "SEARCH";
my $endString = "TEST";
my $fileContent;
open(my $fileHandler, $inputFile) or die "Could not open file '$inputFile' $!";
while(<$fileHandler>) {
    if(/$beginString/../$endString/) { $fileContent .= $_ unless(/$beginString/ or /$endString/) }
}
close($fileHandler);
print Dumper($fileContent);
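The slurp-and-match approach from the question may also work once the /s modifier is added: without it, "." does not match a newline, so a capture spanning several lines cannot succeed. A minimal sketch, assuming the markers are plain words; the helper name between and the sample content are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the text between the first $begin and the next $end, or undef.
sub between {
    my ( $content, $begin, $end ) = @_;
    # /s lets "." match newlines; \Q...\E treats the markers as literal text.
    return $content =~ /\b\Q$begin\E\b(.*?)\b\Q$end\E\b/s ? $1 : undef;
}

# Sample content standing in for the slurped file (assumption).
my $fileContent = "intro\nSEARCH\nline one\nline two\nTEST\noutro\n";
my $result = between( $fileContent, "SEARCH", "TEST" );
print defined $result ? $result : "no match\n";
```

Unlike the flip-flop version, which drops every whole line containing a marker, this keeps any text that shares a line with the markers.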

Matching a pattern : regex - perl

If I want to find in this file all instances of the words USER and PASS and then put the number of times they appear into the two variables respectively, how would I go about that? Thanks!
open MYFILE, '<', 'source_file.txt' or die $!;
open OUTFILE, '>', 'Header.txt' or die $!;
$user = 0;
$pass = 0;
while (<MYFILE>) {
    chomp;
    my @header = split (' ',$_);
    print OUTFILE "$linenum: @header\n\n";
    if (/USER/ig) {
        $user++;
    }
    if (/PASS/ig) {
        $pass++;
    }
}
Above is the new code and it works.
I set my variables equal to 0 and used the ++ incrementor on the variables.
But I am still open to suggestions perhaps on expanding my regex's capabilities? (if that makes sense)
You could simply do:
my $user = 0;
my $pass = 0;
while (<MYFILE>) {
    chomp;
    my @header = split ' ', $_;
    print OUTFILE "$linenum: @header\n\n";
    $user++ if /user/ig;
    $pass++ if /pass/ig;
}
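Note that both versions above add at most one to each counter per line, so they count matching lines rather than individual occurrences. If every occurrence should be counted, one option is the list-assignment idiom sketched below; the helper name count_occurrences and the sample lines are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: counts every case-insensitive occurrence of $word in $text.
sub count_occurrences {
    my ( $text, $word ) = @_;
    # A //g match in list context returns all matches; assigning that list
    # through () to a scalar yields how many there were.
    my $count = () = $text =~ /\Q$word\E/gi;
    return $count;
}

# Sample lines standing in for source_file.txt (assumption).
my @lines = ( "USER alice PASS secret\n", "user bob\n", "PASS x PASS y\n" );
my ( $user, $pass ) = ( 0, 0 );
for my $line (@lines) {
    $user += count_occurrences( $line, 'USER' );
    $pass += count_occurrences( $line, 'PASS' );
}
print "USER: $user, PASS: $pass\n";
```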

Pull regular expressions from file and compare to each line in a file

I found something that I could use on perlmonks.org (http://www.perlmonks.org/?node_id=870806) but I can't get it to work.
I can read the file without issue and build an array. Then, I'd like to compare each index of the array (each regex) to each line of a file, printing out the line before and the line after the matched line.
My code:
# List of regex's. If this file doesn't exist, we can't continue
open ( $fh, "<", $DEF_FILE ) || die ("Can't open regex file: $DEF_FILE");
while (<$fh>) {
    chomp;
    push (@bad_strings, $_);
}
close $fh || die "Cannot close regex file: $DEF_FILE: $!";
$file = '/tmp/mydirectory/myfile.txt';
eval { open ( $fh, "<", $file ); };
if ($@) {
    # If there was an error opening the file, just move on
    print "Error opening file: $file.\n";
} else {
    # If no error, process the file
    foreach $bad_string (@bad_strings) {
        $this_line = "";
        $do_next = 0;
        seek($fh, 0, 0); # move pointer to 0 each time through
        while(<$fh>) {
            $last_line = $this_line;
            $this_line = $_;
            my $rege = eval "sub{ \$_[0] =~ $bad_string }"; # Real-time regex
            if ($rege->( $this_line )) { # Line 82
                print $last_line unless $do_next;
                print $this_line;
                $do_next = 1;
            } else {
                print $this_line if $do_next;
                $last_line = "";
                $do_next = 0;
            }
        }
    }
} # End "if error opening file" check
This was working before when I had just a string per line in the file and performed a simple test such as if ($this_line =~ /$string_to_search_for/i ) but when I switched to regex in the file and a "real-time" eval statement, I now get Can't use string ("") as a subroutine ref while "strict refs" in use at scrub_file.pl line 82 and line 82 is if ($rege->($this_line)) {.
Prior to that error message, I'm receiving: Use of uninitialized value in subroutine entry at scrub_hhsysdump_file.pl line 82, <$fh> I have some understanding of that error message but can't seem to make the perl engine happy with my code thus far.
Still new to perl and always looking for pointers. Thanks in advance.
I fail to see the reason for those eval statements - all they seem to do is make the code a lot more complicated and difficult to debug.
But $rege is undef because eval "sub{ \$_[0] =~ $bad_string }" isn't working, due to the string having a syntax error. I don't know what's in $DEF_FILE, but unless it has properly-delimited regular expressions then you need to add the delimiters in the eval string.
my $rege = eval "sub{ \$_[0] =~ /$bad_string/ }"
may work, but you may need /\Q$bad_string/ instead if the strings in $DEF_FILE contain regex metacharacters and you want them to be treated as literal characters.
I suggest this version of your program which seems to do what you need without the fuss of the eval calls.
use strict;
use warnings;
use Fcntl ':seek';
my $DEF_FILE = 'myfile';
my @bad_strings = do {
    open my $fh, '<', $DEF_FILE or die qq(Can't open regex file "$DEF_FILE": $!);
    <$fh>;
};
chomp @bad_strings;
my $file = '/tmp/mydirectory/myfile.txt';
open my $fh, '<', $file or die qq(Unable to open "$file" for input: $!);
for my $bad_string (@bad_strings) {
    my $regex = qr/$bad_string/;
    my ($last_line, $this_line, $do_next) = ('', '', 0);
    seek $fh, 0, SEEK_SET;
    while (<$fh>) {
        ($last_line, $this_line) = ($this_line, $_);
        if ($this_line =~ $regex) {
            print $last_line unless $do_next;
            print $this_line;
            $do_next = 1;
        }
        else {
            print $this_line if $do_next;
            $do_next = 0;
        }
    }
}
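To illustrate the difference between /$bad_string/ and /\Q$bad_string/ mentioned above, here is a small sketch. The helper name matching_lines and the sample data are hypothetical; which form is right depends on whether $DEF_FILE holds regular expressions or plain strings.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines from @$lines matched by $pattern,
# treating $pattern as literal text when $literal is true.
sub matching_lines {
    my ( $pattern, $literal, $lines ) = @_;
    my $regex = $literal ? qr/\Q$pattern\E/ : qr/$pattern/;
    return grep { $_ =~ $regex } @$lines;
}

# Sample lines (assumption for illustration).
my @lines = ( 'abc', 'a.c', 'xyz' );
my @as_regex   = matching_lines( 'a.c', 0, \@lines );    # "." matches any character
my @as_literal = matching_lines( 'a.c', 1, \@lines );    # "." must be a literal dot
print "regex:   @as_regex\n";
print "literal: @as_literal\n";
```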

Removing stop words and saving the new file Perl

I have created a Perl file to load in an array of "Stop words".
Then I load in a directory with ".ner" files contained in it.
Each file gets opened and each word is split and compared to the words in the stop file.
If the word matches the word it is changed to "" (nothing-and gets removed)
I then copy the file to another location. So I can differentiate between files with stop words and files without.
But does this change the file to now contain no stop words or will it revert back to the original?
#!/usr/bin/perl
#use strict;
#use warnings;
my @stops;
my @file;
use File::Copy;
open( STOPWORD, "/Users/jen/stopWordList.txt" ) or die "Can't Open: $!\n";
@stops = <STOPWORD>;
while (<STOPWORD>) #read each line into $_
{
    chomp @stops; # Remove newline from $_
    push @stops, $_; # add the line to @triggers
}
close STOPWORD;
$dirtoget="/Users/jen/temp/";
opendir(IMD, $dirtoget) || die("Cannot open directory");
@thefiles= readdir(IMD);
foreach $f (@thefiles){
    if ($f =~ m/\.ner$/){
        print $f,"\n";
        open (FILE, "/Users/jen/temp/$f")or die"Cannot open FILE";
        if ( FILE eq "" ) {
            close FILE;
        }
        else{
            while (<FILE>) {
                foreach $word(split(/\|/)){
                    foreach $x (@stops) {
                        if ($x =~ m/\b\Q$word\E\b/) {
                            $word = '';
                            copy("/Users/jen/temp/$f","/Users/jen/correct/$f")or die "Copy failed: $!";
                            close FILE;
                        }
                    }
                }
            }
        }
    }
}
closedir(IMD);
exit 0;
The format of the file I am splitting and comparing is as follows:
'<title>|NN|O Woman|NNP|O jumped|VBD|O for|IN|O life|NN|O after|IN|O firebomb|NN|O attack|NN|O -|:|O National|NNP|I-ORG News|NNP|I-ORG ,|,|I-ORG Frontpage|NNP|I-ORG -|:|I-ORG Independent.ie</title>|NNP|'
Should I be outlining where the words should be split ie: split(/|/)?
You should ALWAYS use:
use strict;
use warnings;
Also use the three-argument form of open and test whether opening failed.
As codaddict said, a split with no arguments is equivalent to split(' ', $_).
Here is a proposal to do the job (as far as I understood what you wanted).
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
my @stops = qw(put here your stop words);
my %stops = map { $_ => 1 } @stops;
my $path = '/Users/jen/temp/';
my $out = $path.'outputfile';
# Read the directory listing, as in the question, so the loop has files to process
opendir my $dh, $path or die "can't open directory '$path' : $!";
my @thefiles = readdir $dh;
closedir $dh;
open my $fout, '>', $out or die "can't open '$out' for writing : $!";
foreach my $file (@thefiles) {
    next unless $file =~ /\.ner$/;
    open my $fh, '<', $path.$file or die "can't open '$file' for reading : $!";
    my @lines = <$fh>;    # read from the handle, not from the file name
    close $fh;
    foreach my $line (@lines) {
        my @words = split /\|/, $line;
        foreach my $word (@words) {
            $word = '' if exists $stops{$word};
        }
        print $fout join '|', @words;
    }
}
close $fout;
A split with no arguments is equivalent to split(' ', $_).
Since you want the lines to be split on | you need to do:
split/\|/
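The escaping matters because an unescaped | is the regex alternation operator. A small sketch of the difference, using a fragment of the sample line from the question:

```perl
use strict;
use warnings;

my $line = 'Woman|NNP|O';

# Escaped: "|" is a literal pipe, so the line splits into its three fields.
my @escaped = split /\|/, $line;

# Unescaped: "|" is an alternation between two empty patterns, which matches
# the empty string, so split breaks the line into single characters.
my @unescaped = split /|/, $line;

print scalar(@escaped),   " fields: @escaped\n";
print scalar(@unescaped), " pieces\n";
```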
@jenniem001,
open FILE, '<', $fh or die "can't open $fh: $!";
undef $/;
my $whole_file = <FILE>;
close FILE;
foreach my $word (@words) {
    $whole_file =~ s/\b\Q$word\E\b//ig;
}
open FILE, '>>', $duplicate or die "can't open $duplicate: $!";
print FILE $whole_file;
close FILE;
That will remove the stop words from your file and create a duplicate. Just give $duplicate a name :)