Search, Create and Move in Perl - regex

I have a directory of about 800 html files. I am trying to search each file and return text between tags. Then I want to create a directory with that text and move (or copy) the files there. This seemed like a pretty easy endeavor when I thought it up but I am having a ton of problems even identifying the modules I would need for this. I have looked at File::Find and glob, but am not exactly sure about how I would implement this with a regex for txt within the files (not the file name.) I am basically a newbie to perl so any and all help would be appreciated. Thanks in advance.
EDIT
To clarify: What I am trying to accomplish:
Read Directory = ~/me/project/
For ~/me/project/ find all the files =~ /.html$/i
For each file, search the html for = <div class="recip" id="objectTo">(.*)</div>
For every (.*), e.g. john@doewww.com or John Doe, create a directory with that same name
Loop back and move every file that has an instance of xxxxxxxx@xxxxx.com or John Doe to its corresponding directory.
I really appreciate the help!

You're on the right track with File::Find.
You will create a 'wanted()' function, and within that function, the name of the file found will be $File::Find::name. You can then use that to open a file handle, read in the file, search for the tags and extract the data that you're looking for, and close the file handle. File::Find will then move on to the next file.
#! /usr/bin/perl
use warnings;
use strict;
use File::Find;
sub wanted {
my $file=$File::Find::name;
# if the file has the extension '.html' (case insensitive) ...
if( $file =~ /\.html$/i ) {
my $FH;
open( $FH, '<', $file) or die "Could not open '$file' for reading: $!";
local $/; # undef enables slurp mode ('' would select paragraph mode instead)
my $contents = <$FH>; # slurp file into $contents
# search $contents for the tags that you're looking for,
#
close $FH;
}
}
my @directories = (
'./htmlfiles'
, './www'
, './web'
);
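# no_chdir keeps the current directory fixed, so the relative path in
# $File::Find::name can be passed straight to open()
find({ wanted => \&wanted, no_chdir => 1 }, @directories);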
Warning: The code passes perl -c, but I haven't run it.
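To fill in the step that the comments above leave open, here is a minimal sketch of extracting the recipient and moving the file. It assumes the tag looks exactly like the one in the question and that each file names one recipient; recipient_of and file_into_dir are names made up for this illustration. A regex is brittle against real-world HTML, so treat this as a starting point only.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(move);
use File::Path qw(make_path);

# Return the text inside <div class="recip" id="objectTo">...</div>,
# or undef if the tag is not present. Assumes the attributes appear
# in exactly this order, as in the question.
sub recipient_of {
    my ($contents) = @_;
    return unless $contents =~ m{<div\s+class="recip"\s+id="objectTo">\s*(.*?)\s*</div>}is;
    return $1;
}

# Create $base/$recipient if needed and move $file into it.
# Returns the new path of the file.
sub file_into_dir {
    my ($file, $recipient, $base) = @_;
    (my $dirname = $recipient) =~ s{[/\\]}{_}g;   # keep path separators out of the name
    my $target = "$base/$dirname";
    make_path($target);                            # no error if it already exists
    my $dest = "$target/" . basename($file);
    move($file, $dest) or die "Could not move '$file' to '$dest': $!";
    return $dest;
}
```

Inside wanted(), after the file handle is closed, this could be used as: my $who = recipient_of($contents); file_into_dir($file, $who, $base) if defined $who; where $base is a destination of your choosing. Picking a $base outside the directories being searched avoids File::Find visiting the files again after they have been moved.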

For the second part of your question, check out HTML::Strip for stripping HTML markup from text.

Related

regex for file operation in perl not working

I have an XML file containing data like:
<get>9090</get><br>
<setId>setIdHere</setId>
<mainId>121</mainId>
I'm not using any external libraries or packages, so I need to make the change using plain file I/O.
I need to change the string setIdHere with something. Please find the perl code below:
my $filename="file1.xml";
my $idVal=3232;
open(my $fh , '>>' ,$fileName);
select $fh or die $!;
s/setIdHere/$idVal;
print;
select STDOUT;
close($fh);
The above code appends the value at the end of the file, but I want it to replace the string setIdHere.
I'm new to Perl and not sure what's wrong with the above code.
Thanks in advance.
First off, your code is using some unusually outdated techniques. select $fh has a global effect and is best avoided.
In general, to edit a file you need to open it for reading, read it in, alter it, and write it back out again. Because the file can be very big, you generally do this line by line rather than pulling the whole thing into memory.
You can't write to the same file you're reading from (well, you can, but it makes a mess), so instead you write to a temp file and, when you're done, rename it over the original.
# This forces you to declare all variables protecting against typos
use strict;
# This lets you know when you've done something you probably shouldn't.
use warnings;
# This will error if file operations failed, no more "or die $!"
use autodie;
my $file = "file1.xml";
my $tmp = $file.".new"; # file1.xml.new
open my $in, "<", $file; # open the XML file for reading
open my $out, ">", $tmp; # open a temp file for writing
# Read the file line by line
while(my $line = <$in>) {
# Change the line.
$line =~ s{this}{that}g;
# Write it to the temp file.
print $out $line;
}
# If you don't do this, it might not have finished writing.
close $out;
close $in;
# Overwrite the old file with the new one.
rename $tmp, $file;
HOWEVER you're editing XML. XML is structured and you should not try to read and edit it with regexes. You instead need to parse it with an XML library like XML::LibXML or XML::Twig.
You say you can't use any external library, but I bet you can, it's just a matter of figuring out how. You'll have a much easier time of it if you do. Generally the reason is that you don't have admin privileges. The simplest solution is to install perlbrew and install your own copy of Perl that you can manage. Perlbrew makes this easy.
Please, never ever use regular expressions to parse XML. XML is contextual, and regular expressions are not. Therefore it's only ever going to be a dirty hack.
I would recommend XML::Twig if you need to modify an XML file. It supports xpath, which is like regular expressions, but inherently handles the context problem.
XML::Twig also does 'parsefile_inplace' for in place editing of your file:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
sub modify_setId {
my ( $twig, $setId ) = @_;
$setId -> set_text('3232');
$twig -> flush;
}
my $twig = XML::Twig -> new ( twig_handlers => { 'setId' => \&modify_setId } );
$twig -> set_pretty_print('indented');
$twig -> parsefile_inplace('test.xml');

Perl script gives a blank output file

I'm a total noob at Perl, trying to learn some new code for a specific project. In short, I'm making a script (on osx) that is to search all xml-files in a folder and censor specific numbers. I know a one-liner could have helped, but the amount of files will be pretty huge (thousands of files), and would happen regularly so a script to do it would be nicer. And besides, there is the learning to script part :)
I've managed to open my files, make the regex work on every line on the original for my specific needs and generate a writable tempfile for my new information. This is where things stop working. I've tried to copy the new file over the old file after the loop, but I end up with a blank(!) file. I suspected there to be an error with the temp-file, but that looks perfect. I even tried, as a noobs way out, to reverse the process line by line from the temp back to the original file after changing the open mode (read) on them, but that ALSO gave an empty file.
And now my head is sort of empty. Any help would be appreciated :)
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;
chdir "/perltest/test"; #debugsafety
#file
my $workingfiles = "*.XML";
my @files = glob("$workingfiles");
#process files
my $old;
my $tmpfile;
foreach my $file (@files) {
print "$file \n";
open ($old, "<", $file) or die "No file";
open ($tmpfile, ">", 'temp.tmp') or die;
while(my $line = <$old> ) {
my $subz = $line;
$subz =~ s/([[:upper:]]{2}[[:digit:]]{6})|([[:upper:]]{1}[[:digit:]]{7})|(?:(?<![[:digit:]])[[:digit:]]{8}(?![[:digit:]])|([[:upper:]]{2}[[:digit:]]{5}[AB]))/**CENS**/g;
print $subz;
print $tmpfile $subz;
}
print "Start copying.\n";
open (my $old, ">", $file) or die "No file";
open (my $tmpfile, "<", 'temp.tmp') or die;
#copy $tmpfile, $old or die "Couldn't copy";
my $y = 0; #debug
while (my $line = <$tmpfile> ) {
print $y++; #debug
my $subz = $line;
print $subz;
print $old $subz;
}
}
print "Complete.\n";
exit;
You re-open your file handles before closing them. I'm an Oracle DBA masquerading as a perl developer, so I can't give the why behind it. But I know if you close your file handles, your script should work as is.
close ($old); # add this line
close ($tmpfile); # add this line
print "Start copying.\n";
It would then be good practice to close them again when you are done "copying" back to them.
Explicitly close the filehandle when you're done writing to it. Things will still be buffered until you do that.
It would also make more sense to
rename($file, "$file.old");
rename("temp.tmp", $file);
rather than looping through the file (or using File::Copy::copy) to make a backup copy of it.
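Putting both answers together, the loop body might look roughly like the sketch below. The sub name censor_file is a placeholder, and the pattern is passed in rather than copied from the question. File::Temp is used here as an assumption, so that two runs cannot collide on a shared temp.tmp.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempfile);

# Rewrite $file with every match of $pattern replaced by **CENS**,
# keeping the original as "$file.old".
sub censor_file {
    my ($file, $pattern) = @_;

    open my $in, '<', $file or die "Cannot read '$file': $!";
    # Create the temp file next to the original so rename() stays on one filesystem.
    my ($out, $tmpname) = tempfile(DIR => dirname($file));

    while (my $line = <$in>) {
        $line =~ s/$pattern/**CENS**/g;
        print {$out} $line;
    }

    # Close both handles first: this flushes the buffered output to disk.
    close $in;
    close $out or die "Cannot finish writing '$tmpname': $!";

    rename $file, "$file.old" or die "Cannot back up '$file': $!";
    rename $tmpname, $file    or die "Cannot replace '$file': $!";
    return;
}
```

It would be called once per file from the foreach loop, for example censor_file($file, qr/[[:upper:]]{2}[[:digit:]]{6}/) with whichever pattern is actually needed.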
Lastly, for simple edits I would suggest making an effort to get comfortable with doing it on the command line, so you don't need to scratch your head and wonder "now what did I do in that script last time?". It can be a big timesaver in the long run.
perl -p -i.bak -e 's/pattern/text/;' files*
is the general form.

No such file or directory error: Perl

I am new to Perl. I wrote the following code and have been struggling with it for two days, because I get the following error when I try to open the file: No such file or directory at line 23 (open (FILE, "$config_file") or die $!;)
What I am doing is:
Open the folder and list all the files inside it.
Iterate over each file to look for particular strings.
Create a new file for each of the files, with the matching string replaced by some other string.
I would really appreciate your help.
Following is my code:
#!/usr/bin/perl -w
#~ The perl script that changes the IP addresses in configuration files from 192.168.3.x into 192.168.31.x in any particular folder
use strict;
use warnings;
use diagnostics;
#~ Get list of files in the Firewall folder
my $directory = 'C:\Users\asura\Desktop\ConfigFiles\Firewall';
opendir (my $dir, $directory) or die $!;
my @list_of_files = readdir($dir);
my $file;
while ($file = readdir ($dir)) {
push @list_of_files, $file;
}
closedir $dir;
print "@list_of_files\n";
#~ Iterate over each files to replace some strings
foreach my $config_file (@list_of_files) {
next unless (($config_file !~ /^\.+$/));
open (FILE, "$config_file") or die $!;
my @original_array = <FILE>;
close(FILE);
my @new_array;
foreach my $line (@original_array) {
chomp($line);
$line =~ s/192\.168\.3/192\.168\.31/g;
push (@new_array, $line);
}
print @new_array;
#~ Create a new files with modified strings
my $new_config_file = $config_file.1;
my $newfile = 'C:\Users\asura\Desktop\ConfigFiles\Firewall\$new_config_file';
open (NEW_FILE, ">", "$newfile") or die $!;
foreach (@new_array){
print NEW_FILE "$_\n";
}
close(NEW_FILE);
}
exit 0;
When you push items onto @list_of_files, you are pushing only the filename (the value returned from readdir). Unless your script is running in C:\Users\asura\Desktop\ConfigFiles\Firewall, the open at line 22 using just the filename (a relative path) will fail.
You need to push absolute paths onto @list_of_files at line 14, like so:
push @list_of_files, $directory . "\\" . $file;
Also, as @Michael-sqlbot mentions, you need to double-quote the string at line 35 for string interpolation to be performed (or use concatenation).
Finally, you should also properly quote the string concatenation on line 34.
The following is a simplification of your code that removes the bugs.
First off, kudos for including use strict and use warnings in EVERY script. One additional tool that you can use is use autodie; any time that you're doing file processing.
The primary flaw in your code was the fact that you weren't including the path information when opening your files. There are two main ways to solve this. You can manually specify the path, like you did for your open to your output file handle, or you can use glob instead of opendir as that will automatically include the path in the returned results.
There was a secondary bug in your regex: it was missing a word boundary after .3, so addresses such as 192.168.30.x through 192.168.39.x would also have been rewritten.
To simplify your code, I removed all of the superfluous temporary variables and instead process things file by file and line by line. This has the benefit of making it clearer which input and output file handles belong together. Finally, if you actually want to edit the files in place, there are several methods demonstrated in perlfaq5.
#!/usr/bin/perl -w
#~ The perl script that changes the IP addresses in configuration files from 192.168.3.x into 192.168.31.x in any particular folder
use strict;
use warnings;
use autodie;
use diagnostics;
#~ Get list of files in the Firewall folder
my $dir = 'C:\Users\asura\Desktop\ConfigFiles\Firewall';
opendir my $dh, $dir;
#~ Iterate over each files to replace some strings
while (my $file = readdir($dh)) {
next if $file =~ /^\.+$/;
open my $infh, '<', "$dir\\$file";
open my $outfh, '>', "$dir\\${file}.1"; #~ Create a new files with modified strings
while (<$infh>) {
s/(?<=192\.168)\.3\b/.31/g;
print $outfh $_;
}
close $infh;
close $outfh;
}
closedir $dh;
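The glob alternative mentioned above could look roughly like this. It is a sketch under assumptions: the sub name rewrite_configs is invented, the directory is written with forward slashes (which Perl accepts on Windows), the path contains no spaces (glob splits its pattern on whitespace), and no genuine input file already ends in ".1".

```perl
use strict;
use warnings;
use autodie;

# For every plain file in $dir, write a copy named "<file>.1" with
# 192.168.3.x rewritten to 192.168.31.x. Returns the files it read.
sub rewrite_configs {
    my ($dir) = @_;
    # glob returns names with the directory already attached.
    # Skip directories and output left over from an earlier run.
    my @files = grep { -f $_ && !/\.1\z/ } glob("$dir/*");
    for my $file (@files) {
        open my $infh,  '<', $file;
        open my $outfh, '>', "$file.1";
        while (my $line = <$infh>) {
            $line =~ s/(?<=192\.168)\.3\b/.31/g;
            print {$outfh} $line;
        }
        close $infh;
        close $outfh;
    }
    return @files;
}
```

Because the list of files is built before any output is written, the loop never picks up the ".1" files it creates along the way.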

Replacing multiple strings recursively within all files in a directory using Perl

I'm new to Perl. I've seen many samples but had problems composing a solution.
I have a list of strings, each of which should be replaced by a different string: a->a2, b->b34, etc. The list of replacements is in a CSV file. I need to perform this replacement recursively on all files in a directory.
It could be done in any other language; I just thought Perl would be the quickest.
Your problem can be split into three steps:
Getting the search-and-replace strings from the CSV file,
Getting a list of all text files inside a given directory incl. subdirectories, and
Replacing all occurrences of the search strings with their replacements.
So let's do a countdown and see how we can do that :)
#!/usr/bin/perl
use strict; use warnings;
3. Search and replace
We will define a sub searchAndReplace. It takes a file name as argument and accesses an outside hash. We will call this hash %replacements. Each key is a string we want to replace, and the value is the replacement. This "imposes" the restriction that there can only be one replacement per search string, but that should seem natural. I will further assume that each file is reasonably small (i.e. fits into RAM).
sub searchAndReplace {
my ($filename) = @_;
my $content = do {
open my $file, "<", $filename or die "Can't open $filename: $!";
local $/ = undef; # set slurp mode
<$file>;
};
while(my ($string, $replacement) = each %replacements) {
$content =~ s/\Q$string\E/$replacement/g;
}
open my $file, ">", $filename or die "Can't open $filename: $!";
print $file $content; # I didn't forget the comma
close $file;
}
This code is pretty straightforward. I escape the $string inside the regex so that its contents aren't treated as a pattern. This implementation has the side effect of possibly replacing part of the $content string where something was already replaced, but one could work around that if it is absolutely necessary.
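One way to work around that side effect is to do all replacements in a single pass, so that text produced by one replacement is never scanned again. The following is a sketch, not part of the original answer: replace_all is an invented name, and it takes the hash by reference instead of reading an outside %replacements.

```perl
use strict;
use warnings;

# Replace every key of %$map found in $text with its value, in one pass.
sub replace_all {
    my ($text, $map) = @_;
    return $text unless %$map;
    # Longest keys first, so "ab" wins over "a" where both could match.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b }
        keys %$map;
    $text =~ s/($alternation)/$map->{$1}/g;
    return $text;
}

print replace_all("a b ab", { a => 'b', b => 'a', ab => 'X' }), "\n";   # prints "b a X"
```

With the loop-based version, swapping a and b like this could undo an earlier replacement, depending on the order in which each returns the keys.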
2. Traversing the file tree
We will define a sub called anakinFileWalker. It takes a filename or the name of a directory and the searchAndReplace sub as arguments. If the filename argument is a plain file, it does the searchAndReplace; if it is a directory, it opens the directory and calls itself on each entry.
sub anakinFileWalker {
my ($filename, $action) = @_;
if (-d $filename) {
opendir my $dir, $filename or die "Can't open $filename: $!";
while (defined(my $entry = readdir $dir)) {
next if $entry eq '.' or $entry eq '..';
# come to the dark side of recursion
anakinFileWalker("$filename/$entry", $action); # be sure to give full path
}
} else {
# Houston, we have a plain file:
$action->($filename);
}
}
Of course, this sub blows up if you have looping symlinks.
1. Setting up the %replacements
There is a nice module Text::CSV which will help you with all your needs. Just make sure that the %replacements meet the definition above, but that isn't hard.
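As a sketch of that step, assuming Text::CSV is installed and the CSV has two columns (search string, then replacement) with no header row; load_replacements is a hypothetical helper name:

```perl
use strict;
use warnings;
use Text::CSV;

# Read "search,replacement" rows from $csv_file into a hash.
sub load_replacements {
    my ($csv_file) = @_;
    my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });
    open my $fh, '<', $csv_file or die "Can't open $csv_file: $!";
    my %map;
    while (my $row = $csv->getline($fh)) {
        my ($search, $replacement) = @$row;
        next unless defined $search && length $search;   # skip blank rows
        $map{$search} = defined $replacement ? $replacement : '';
    }
    close $fh;
    return %map;
}
```

The result would be assigned with something like my %replacements = load_replacements('replacements.csv'); placed above the definition of searchAndReplace so that the sub can see the hash. The file name here is only an example.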
Starting it all
When the %replacements are ready, we just do
anakinFileWalker($topDirectory, \&searchAndReplace);
and it should work. If not, this should have given you an idea about how to solve such a problem.

Searching for Files with specific Regex in filename in Perl

Hi all I was wondering how I can go about searching for files in perl.
Right now I have a line with information that I have tokenized with tab as a delimiter stored into an array. (using split) These arrays contain stub text of filenames I want to search for in a directory. For example Engineering_4.txt would just be "Engin" in my array.
If there are two different files... Engineering_4 and Engineering_5, it would search both these files for content and just extract the information I need from one of them (only 1 contains information I want). I would imagine my script will have to search and store all file names that match and then search through each of these files.
My question is how do I go about searching for files in a directory matching a regular expression in Perl? Also is there a way to limit the file types that I want to search for. For example, I just want to only search for ".txt" files.
Thanks everyone
I guess since you already know the directory you could open it and read it while also filtering it :
opendir D, 'yourDirectory' or die "Could not open dir: $!\n";
my @filelist = grep(/yourRegex/i, readdir D);
You can do this using the glob function or the <glob> operator.
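To also restrict the match to .txt files and get names that can be opened directly, the same idea can be wrapped up as below. This is a sketch: files_matching is an invented name, and the stub is treated as a literal prefix of the file name, which is an assumption about how the array entries are meant to match.

```perl
use strict;
use warnings;

# Return the paths of the .txt files in $dir whose names start with $stub.
sub files_matching {
    my ($dir, $stub) = @_;
    opendir my $dh, $dir or die "Could not open dir '$dir': $!";
    my @names = grep { /\A\Q$stub\E.*\.txt\z/i } readdir $dh;
    closedir $dh;
    # readdir returns bare names, so put the directory back on the front.
    return map { "$dir/$_" } sort @names;
}
```

For the example in the question, files_matching('yourDirectory', 'Engin') would return both Engineering_4.txt and Engineering_5.txt if they exist, ready to be opened and searched in turn.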
while (<Engin*.txt>) {
print "$_\n";
}
The glob function returns a list of matching files when given a wildcard expression.
This means that the files can also be sort-ed before processing:
use Sort::Key::Natural 'natsort';
foreach my $file ( natsort glob "*.txt" ) { # Will loop over only txt files
open my $fh, '<', $file or die $!; # Open file and process
}
You can also use the File::Find module:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
my @dirs = @ARGV ? @ARGV : ('.');
my @list;
find( sub{
push @list, $File::Find::name if -f $_ && $_ =~ m/.+\.txt\z/ }, # \z anchors the extension at the end of the name
@dirs );
print "$_\n" for @list;