How to handle files that do not contain a pattern?

How to handle files that do not contain a pattern? - regex

I need help with my Perl program. The idea is to pass in a pattern and a file list from the command line. If the file name matches the pattern, print the file name. Then if the file name doesn't match, it should look for instances of the pattern in the text of the file and print filename : first line of text that contained occurrence.
However should the user add the -i option at the beginning the opposite should occur. If the filename does not match print it. Then print any files that do not contain any instances of the pattern in their text.
This last part is where I'm struggling I'm not exactly sure how to get files that don't have the pattern in their text. For example in my code
#!/usr/bin/perl -w
die("\n Usage: find.pl [-i] <perlRegexPattern> <listOfFiles>\n\n") if(#ARGV<2);
my (#array,$pattern,#filesmatch,#files);
#I can separate files based on name match
($pattern,#array) = ($ARGV[0] eq "-i") ? (#ARGV[1 .. $#ARGV]) : (#ARGV);
foreach(#array){
($_ =~ m/.*\/?$pattern/) ? (push #filesmatch,$_) : (push #files, $_);
}
#and I can get files that contain a pattern match in their text
if($ARGV[0] ne "-i"){
for my $matches(#filesmatch){ #remove path print just file name
$matches =~s/.*\///; #/
print "$matches\n";
}
for my $file(#files){
open(FILE,'<',$file) or die("\nCould not open file $file\n\n");
while(my $line = <FILE>){
if($line =~ m/$pattern/){
$file =~ s/.*\///; #/ remove path print just file name
print "$file: $line";
next;
}
}
}
}
#however I'm not sure how to say this file dosen't have any matches so print it
else{
for my $matches(#files){ #remove path print just file name
$matches =~ s/.*\///;
print "$matches\n";
}
for my $file(#filesmatch){
open(FILE,'<',$file) or die("\nCould not open file $file\n\n");;
while(my $line = <FILE>){...
I'm not sure if something like grep could be used to do this but I'm having a hard time working with Perl's grep.

In order to decide whether to print or not a file based on its content you have to first read the file. With your criterion -- that a phrase does not exist -- you have to check the whole file.
A standard way is to use a separate variable ("flag") to record the condition then go back to print
my $has_match;
while (<$fh>) {
if (/$pattern/) {
$has_match = 1;
last;
}
}
if (not $has_match) {
seek $fh, 0, 0; # rewind to the beginning
print while <$fh>;
}
This can be simplified by reading the file into a variable first, and by using labels (also see perlsyn)
FILE: foreach my $file (#filesmatch) {
open my $fh, '<', $file or die "Can't open $file: $!";
my #lines = <$fh>;
for (#lines) {
next FILE if /$pattern/;
}
print for #lines;
}
Note that skipping an iteration in the middle of a loop isn't the cleanest way since one has to always keep in mind that the rest of the loop may not run.
Each file is read first so that we don't read it twice, but don't do that if any of the files can be huge.
If there is any command line processing it is better to use a module; Getopt::Long is nice.
use Getopt::Long;
my ($inverse, $pattern);
GetOptions('inverse|i' => \$inverse, 'pattern=s' => \$pattern)
or usage(), exit;
usage(), exit if not $pattern or not #ARGV;
sub usage { say STDERR "Usage: $0 ... " }
Call the program as progname [-i] --patern PATTERN files. The module provides a lot, please see docs. For example, in this case you can also just use -p PATTERN.
As GetOptions parses the command line the submitted options are removed from #ARGV and what remains in it are file names. And you have the $inverse variable to nicely make decisions.
Please have use warnings; (not -w) and use strict; at the top of every program.

Related

Why is this regex logic failing?

I'm trying to build a quick script that will take a url, and check it against a list of PCREs to see if there's a match. However, it doesn't seem to be working. I've tried printing everything to make sure it's output the way I want (including the ARGV[0], passing it with single quotes appears to keep all the characters in tact). But it's still not working.
This is the script
#!/usr/bin/perl
use strict;
use warnings;
if (not($ARGV[0])) {
die "Useage: checkurl.pl \"<url>\"";
}
if ($ARGV[1]) {
die "Too many command line arguments, try checkurl.pl \"<url>\"";
}
$_ = $ARGV[0];
print "$_\n";
my $file = "pcre.txt";
open my $info, $file or die "Could not open $file: $!";
while( my $line = <$info>) {
if (/$line/) {
print "Match found, the url matches the following PCRE: \n";
print "$line\n";
}
}
This is the test URL (warning, this was an actual Angler EK link, I've defanged it, just in case it's still live, so you have to fix it to properly check the PCRE)
hxxp://nosprivsliikeradan.pfgfoxriver-localguide2[.]com/boards/viewforum.php?f=5x827&sid=7q0as14.5i4x8
This is the PCRE in the pcre.txt file that matches the above URL
^http:\/\/(?!www|forums?)[^\.]+\.[^\.]+\.(?:[^\.\x2f]+?|[^\.]+\.[^\.]{2})\/[a-z]+\/?view(?:forum|topic)\.php\?[a-z]=(?=[^\n]{0,64}\.)[0-9a-z\.]{1,6}(?:&[a-z0-9]*=[0-9a-z\.]*){1,2}$

Your pattern is actually /^...$\n/ because you read it from a file and it contains a newline character. You need to chomp the line before interpolating it into the match operator:
while (my $line = <$info>) {
chomp($line);
if (/$line/) {
...
}
}

Optimal way to split huge file

I am trying to split a huge text file (~500 million lines of text) which is pretty regular and looks like this:
-- Start ---
blah blah
-- End --
-- Start --
blah blah
-- End --
...
where ... implies a repeating pattern and "blah blah" is of variable length ~ 2000 lines. I want to split off the first
-- Start --
blah blah
-- End --
block into a separate file and delete it from the original file in the FASTEST (runtime, given I will run this MANY times) possible way.
The ideal solution would cut the initial block from the original file and paste it into the new file without loading the tail of the huge initial file.
I attempted csplit in the following way:
csplit file.txt /End/+1
which is a valid way of doing this, but not very efficient in time.
EDIT: Is there a solution if we remove the last "start-end" block from file instead of the first one?

If you want the beginning removed from the original file, you have no choice but to read and write the whole rest of the file. To remove the end (as you suggest in your edit) it can be much more efficient:
use File::ReadBackwards;
use File::Slurp 'write_file';
my $fh = File::ReadBackwards->new( 'inputfile', "-- End --\n" )
or die "couldn't read inputfile: $!\n";
my $last_chunk = $fh->readline
or die "file was empty\n";
my $position = $fh->tell;
$fh->close;
truncate( 'inputfile', $position );
write_file( 'lastchunk', $last_chunk );

Perhaps something like the following will help you:
Split the file after every -- End -- marker. Create new files with a simple incremented suffix.
use strict;
use warnings;
use autodie;
my $file = shift;
my $i = 0;
my $fh;
open my $infh, '<', $file;
while (<$infh>) {
open $fh, '>', $file . '.' . ++$i if !$fh;
print $fh $_;
undef $fh if /^-- END --/;
}
Unfortunately, there is no truncate equivalent for removing data from the beginning of a file.
If you really wanted to do this in stages, then I would suggest that you simply tell the last place you read from, so you can seek when you're ready to output another file.

You could use the flip-flop Operator to get the content between this Pattern:
use File::Slurp;
my #text = read_file( 'filename' ) ;
foreach my $line (#text){
if ($line =~ /Start/ .. /End/) {
# do stuff with $line
print $line; # or so
}
}
When your file is large, be carefull with slurping the whole file at once!

use of uninitialized value $line in string ne

I am writing a program of pattern matching in perl ..but getting a error ..I have seen all the earlier posts regarding this matter but didn't find the solution...As I am new to perl So I am not getting exactly what is this error all about..
use of uninitialized value $line in string ne at line .. and in line ...
I am attaching here a perl file
use strict;
use warnings;
my $line = "";
open(OUTFILE, ">output.txt") or die ("cannot open file.\n");
if(open(file1,"match.txt") or die "Cannot open file.\n"){
$line = <file1>;
while ($line ne "") {
if (defined($line) && (line =~ m/\sregion\s/i)) {
print OUTFILE ("$line")};
$line = <file1>; # Problem Here
if (defined($line) && ($line =~ /\svth\s/)) {
print OUTFILE ("$line")
};
$line = <file1>; # Problem Here
}
}
My match.txt file contain this type of data..
Some text here
region Saturati Saturati Linear Saturati Saturati
id -2.1741m -2.1741m -4.3482m 2.1741m 2.1741m
vth -353.9140m -353.9141m -379.2704m 419.8747m 419.8745m
Some text here
Please solve the problem....thanks

The reason you are seeing those errors is that the variable $line contains undef. The reason it contains undef is that you assigned it a value from readline() (the <file1>) after the file had reached its end eof. This is described in perldoc -f readline:
In scalar context, each
call reads and returns the next line until end-of-file is
reached, whereupon the subsequent call returns "undef".
The reason you are encountering this error is that you are not using a traditional method of reading a file. Usually, you would do this to read a file:
while (<$fh>) {
...
}
This will iterate over all the lines in the file until it reaches end of file, after which, as you now know, the readline returns undef and the while loop is exited.
This also means that you do not have to check every other line whether $line is defined or empty. Moreover, you can combine your regexes into one, and generally remove a lot of redundant code:
while (<>) {
if (/\b(?:region|vth)\b/i) {
print;
}
}
This is the core of the functionality you are after, and I am using some Perl idioms here: the diamond operator <> will read from the file names you give the script as argument, or from STDIN if no arguments are given. Many built-in functions use the $_ variable as default if no argument is given, which is what print does, and the while loop condition.
You might also note that I use word boundary \b instead of whitespace \s in the regex, and also use alternation | with non-capturing parentheses (?:...), meaning it can match one of those strings.
With this simplified script, you can do:
perl script.pl match.txt > output.txt
To provide your file names.

If you can't read anything, your string will come back undefined... which is why you are seeing that message.
Also, probably better to check that you open input file first before creating an output file at all, so something like this:
open(INFILE, "<match.txt") or die "Cannot open input file";
open(OUTFILE, ">output.txt") or die "cannot open output file";
my $line;
while($line = <INFILE>){
...
}
Perl will end the loop if $line is undefined or an empty string.

From the looks of it, it seems like you're trying to go through the match file and print all the lines that match region or vth to output.txt.
I simplified the code for you to do this:
use strict;
use warnings;
open(my $out_fh, ">", "output.txt") || die ("Cannot open file.\n");
open(my $file1, "<", "match.txt") || die ("Cannot open file.\n");
while( <$file1> ) {
if ( /\s(region|vth)\s/i) {
print $out_fh $_;
}
}
This question goes into more detail about checking whether a variable is defined or empty: In Perl, how can I concisely check if a $variable is defined and contains a non zero length string?
Here is more information about opening files: What's the best way to open and read a file in Perl?

Replacing mutiple strings recursively within all files in a directory using Perl

I'm new with perl. saw many samples but had problems composing a solution
I have a list of strings which each string should be replaced in a different string a->a2, b->b34, etc. list of replacement is in some csv file. need to perform this replacement recursively on all files in directory.
might be any other language just thought perl would be the quickest

Your problem can be split into three steps:
Getting the search-and-replace strings from the CSV file,
Getting a list of all text files inside a given directory incl. subdirectories, and
Replacing all occurences of the search strings with their replacements.
So lets do a countdown and see how we can do that :)
#!/usr/bin/perl
use strict; use warnings;
3. Search and replace
We will define a sub searchAndReplace. It takes a file name as argument and accesses an outside hash. We will call this hash %replacements. Each key is a string we want to replace, and the value is the replacement. This "imposes" the restriction that there can only be one replacement per search string, but that should seem natural. I will further assume that each file is reasonably small (i.e. fits into RAM).
sub searchAndReplace {
my ($filename) = #_;
my $content = do {
open my $file, "<", $filename or die "Cant open $filename: $!";
local $/ = undef; # set slurp mode
<$file>;
};
while(my ($string, $replacement) = each %replacements) {
$content =~ s/\Q$string\E/$replacement/g;
}
open my $file, ">", $filename or die "Can't open $filename: $!";
print $file $content; # I didn't forget the comma
close $file;
}
This code is pretty straightforward, I escape the $string inside the regex so that the contents aren't treated as a pattern. This implementation has the side effect of possibly replacing part of the $content string where something already was replaced, but one could work around that if this is absolutely neccessary.
2. Traversing the file tree
We will define a sub called anakinFileWalker. It takes a filename or a name of an directory and the searchAndReplace sub as arguments. If the filename argument is a plain file, it does the searchAndReplace, if it is a directory, it opens the directory and calls itself on each entry.
sub anakinFileWalker {
my ($filename, $action) = #_;
if (-d $filename) {
opendir my $dir, $filename or die "Can't open $filename: $!";
while (defined(my $entry = readdir $dir)) {
next if $entry eq '.' or $entry eq '..';
# come to the dark side of recursion
anakinFileWalker("$filename/$entry", $action); # be sure to give full path
}
} else {
# Houston, we have a plain file:
$action->($filename);
}
}
Of course, this sub blows up if you have looping symlinks.
1. Setting up the %replacements
There is a nice module Text::CSV which will help you with all your needs. Just make sure that the %replacements meet the definition above, but that isn't hard.
Starting it all
When the %replacements are ready, we just do
anakinFileWalker($topDirectory, \&searchAndReplace);
and it should work. If not, this should have given you an idea about how to solve such a problem.

help with perl code to parse a file

I am new to Perl and have a question about the syntax. I received this code for parsing a file containing specific information. I was wondering what the if (/DID/) part of the subroutine get_number is doing? Is this leveraging regular expressions? I'm not quite sure because regular-expression matches look like $_ =~ /some expression/. Finally, is the while loop in the get_number subroutine necessary?
#!/usr/bin/env perl
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
# store the name of all the OCR file names in an array
my #file_list=qw{
blah.txt
};
# set the scalar index to zero
my $file_index=0;
# open the file titled 'outputfile.txt' and write to it
# (or indicate that the file can't be opened)
open(OUT_FILE, '>', 'outputfile.txt')
or die "Can't open output file\n";
while($file_index < 1){
# open the OCR file and store it in the filehandle IN_FILE
open(IN_FILE, '<', "$file_list[$file_index]")
or die "Can't read source file!\n";
print "Processing file $file_list[$file_index]\n";
while(<IN_FILE>){
my $citing_pat=get_number();
get_country($citing_pat);
}
$file_index=$file_index+1;
}
close IN_FILE;
close OUT_FILE;
The definition of get_number is below.
sub get_number {
while(<IN_FILE>){
if(/DID/){
my #fields=split / /;
chomp($fields[3]);
if($fields[3] !~ /\D/){
return $fields[3];
}
}
}
}

Perl has a variable $_ that is sort of the default dumping ground for a lot of things.
In get_number, while(<IN_FILE>){ is reading a line into $_, and the next line is checking if $_ matches the regular expression DID.
It's also common to see chomp; which also operates on $_ when no argument is given.

In that case, if (/DID/) by default searches the $_ variable, so it is correct. However, it is a rather loose regex, IMO.
The while loop in the sub may be necessary, it depends on what your input looks like. You should be aware that the two while loops will cause some lines to get completely skipped.
The while loop in the main program will take one line, and do nothing with it. Basically, this means that the first line in the file, and every line directly following a matching line (e.g. a line that contains "DID" and the 4th field is a number), will also be discarded.
In order to answer that question properly, we'd need to see the input file.
There are a number of issues with this code, and if it works as intended, it's probably due to a healthy amount of luck.
Below is a cleaned up version of the code. I kept the modules in, since I do not know if they are used elsewhere. I also kept the output file, since it might be used somewhere you have not shown. This code will not attempt to use undefined values for get_country, and will simply do nothing if it does not find a suitable number.
use warnings;
use strict;
use Scalar::Util qw/ looks_like_number /;
use WWW::Mechanize;
my #file_list=qw{ blah.txt };
open(my $outfile, '>', 'outputfile.txt') or die "Can't open output file: $!";
for my $file (#file_list) {
open(my $in_file, '<', $file) or die "Can't read source file: $!";
print "Processing file $file\n";
while (my $citing_pat = get_number($in_file)) {
get_country($citing_pat);
}
}
close $out_file;
sub get_number {
my $fh = shift;
while(<$fh>) {
if (/DID/) {
my $field = (split)[3];
if($field =~ /^\d+$/){
return $field;
}
}
}
return undef;
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js

How to handle files that do not contain a pattern? - regex

Related

Why is this regex logic failing?

Optimal way to split huge file

use of uninitialized value $line in string ne

Replacing mutiple strings recursively within all files in a directory using Perl

help with perl code to parse a file

Categories

Resources