regex for file operation in perl not working - regex

I have an XML file containing data like:
<get>9090</get>
<setId>setIdHere</setId>
<mainId>121</mainId>
I'm not using any external libraries or packages, so I need to make the change with plain file I/O.
I need to replace the string setIdHere with another value. Here is my Perl code:
my $filename="file1.xml";
my $idVal=3232;
open(my $fh , '>>' ,$fileName);
select $fh or die $!;
s/setIdHere/$idVal;
print;
select STDOUT;
close($fh);
The above code appends the value to the end of the file, but I want it to replace the string setIdHere instead.
I'm new to Perl and not sure what's wrong with the above code.
Thanks in advance.

First off, your code is using some unusually outdated techniques. select $fh has a global effect and is best avoided.
In general, to edit a file you need to open it for reading, read it in, alter it, and write it back out again. Because the file can be very big, you generally do this line by line to avoid pulling the whole file into memory.
You can't write to the same file you're reading from (well, you can, but it makes a mess), so instead you write to a temp file and then when you're done rename to be the original.
# This forces you to declare all variables protecting against typos
use strict;
# This lets you know when you've done something you probably shouldn't.
use warnings;
# This will error if file operations failed, no more "or die $!"
use autodie;
my $file = "file1.xml";
my $tmp = $file.".new"; # file1.xml.new
open my $in, "<", $file; # open the XML file for reading
open my $out, ">", $tmp; # open a temp file for writing
# Read the file line by line
while(my $line = <$in>) {
# Change the line.
$line =~ s{this}{that}g;
# Write it to the temp file.
print $out $line;
}
# If you don't do this, it might not have finished writing.
close $out;
close $in;
# Overwrite the old file with the new one.
rename $tmp, $file;
HOWEVER you're editing XML. XML is structured and you should not try to read and edit it with regexes. You instead need to parse it with an XML library like XML::LibXML or XML::Twig.
You say you can't use any external library, but I bet you can, it's just a matter of figuring out how. You'll have a much easier time of it if you do. Generally the reason is that you don't have admin privileges. The simplest solution is to install perlbrew and install your own copy of Perl that you can manage. Perlbrew makes this easy.

Please, never ever use regular expressions to parse XML. XML is contextual, and regular expressions are not. Therefore it's only ever going to be a dirty hack.
I would recommend XML::Twig if you need to modify an XML file. It supports xpath, which is like regular expressions, but inherently handles the context problem.
XML::Twig also does 'parsefile_inplace' for in place editing of your file:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
sub modify_setId {
my ( $twig, $setId ) = @_;
$setId -> set_text('3232');
$twig -> flush;
}
my $twig = XML::Twig -> new ( twig_handlers => { 'setId' => \&modify_setId } );
$twig -> set_pretty_print('indented');
$twig -> parsefile_inplace('test.xml');

Related

Perl script gives a blank output file

I'm a total noob at Perl, trying to learn some new code for a specific project. In short, I'm making a script (on osx) that is to search all xml-files in a folder and censor specific numbers. I know a one-liner could have helped, but the amount of files will be pretty huge (thousands of files), and would happen regularly so a script to do it would be nicer. And besides, there is the learning to script part :)
I've managed to open my files, make the regex work on every line on the original for my specific needs and generate a writable tempfile for my new information. This is where things stop working. I've tried to copy the new file over the old file after the loop, but I end up with a blank(!) file. I suspected there to be an error with the temp-file, but that looks perfect. I even tried, as a noobs way out, to reverse the process line by line from the temp back to the original file after changing the open mode (read) on them, but that ALSO gave an empty file.
And now my head is sort of empty. Any help would be appreciated :)
#!/usr/bin/perl
use strict;
use warnings;
use File::Copy;
chdir "/perltest/test"; #debugsafety
#file
my $workingfiles = "*.XML";
my @files = glob("$workingfiles");
#process files
my $old;
my $tmpfile;
foreach my $file (@files) {
print "$file \n";
open ($old, "<", $file) or die "No file";
open ($tmpfile, ">", 'temp.tmp') or die;
while(my $line = <$old> ) {
my $subz = $line;
$subz =~ s/([[:upper:]]{2}[[:digit:]]{6})|([[:upper:]]{1}[[:digit:]]{7})|(?:(?<![[:digit:]])[[:digit:]]{8}(?![[:digit:]])|([[:upper:]]{2}[[:digit:]]{5}[AB]))/**CENS**/g;
print $subz;
print $tmpfile $subz;
}
print "Start copying.\n";
open (my $old, ">", $file) or die "No file";
open (my $tmpfile, "<", 'temp.tmp') or die;
#copy $tmpfile, $old or die "Couldn't copy";
my $y = 0; #debug
while (my $line = <$tmpfile> ) {
print $y++; #debug
my $subz = $line;
print $subz;
print $old $subz;
}
}
print "Complete.\n";
exit;
You re-open your file handles before closing them. I'm an Oracle DBA masquerading as a perl developer, so I can't give the why behind it. But I know if you close your file handles, your script should work as is.
close ($old); # add this line
close ($tmpfile); # add this line
print "Start copying.\n";
It would then be good practice to close them again when you are done "copying" back to them.
Explicitly close the filehandle when you're done writing to it. Things will still be buffered until you do that.
Also would make more sense to
rename($file, "$file.old");
rename("temp.tmp", $file);
rather than looping through the file (or using File::Copy::copy) to make a backup copy of it.
Lastly, for simple edits I would suggest getting comfortable with doing it on the command line, so you don't need to scratch your head and wonder "now what did I do in that script last time?". It can be a big timesaver in the long run.
perl -p -i.bak -e 's/pattern/text/;' files*
is the general form.
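If the one-liner ever has to grow into a script, the same in-place behaviour is available through the $^I variable, which is what the -i switch sets. The following is only a minimal sketch: the sub name edit_in_place and its arguments are placeholders of mine, not something from the answer above.

```perl
use strict;
use warnings;

# Edit the given files in place, keeping a ".bak" copy of each one.
# This mirrors: perl -p -i.bak -e 's/pattern/text/;' files*
# edit_in_place is a hypothetical helper name.
sub edit_in_place {
    my ($pattern, $replacement, @files) = @_;
    return unless @files;    # with an empty @ARGV, <> would read STDIN instead

    # $^I turns on in-place editing for the files listed in @ARGV.
    local ($^I, @ARGV) = ('.bak', @files);
    while (my $line = <>) {
        $line =~ s/$pattern/$replacement/;
        print $line;         # while $^I is set, print writes to the file being edited
    }
    return;
}
```

A call such as edit_in_place(qr/pattern/, 'text', glob('*.xml')) would then rewrite every matching file and leave the originals behind with a .bak suffix.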

How to update certain part of text file in perl

I have written the code but it is not working properly. I want to change every "/" to "\".
use strict;
use warnings;
open(DATA,"+<unix_url.txt") or die("could not open file!");
while(<DATA>){
s/\//\\/g;
s/\\/c:/;
print DATA $_;
}
close(DATA);
my original file is
/etc/passwd
/home/bob/bookmarks.xml
/home/bob/vimrc
expected output is
C:\etc\passwd
C:\home\bob\bookmarks.xml
C:\home\bob\vimrc
original output is
/etc/passwd
/home/bob/bookmarks.xml
/home/bob/vimrc/etc/passwd
\etc\passwd
kmarks.xml
kmarks.xml
mrcmrc
Trying to read and write the same file, line by line, in a while loop that is reading till the end of that same file, seems very dicey and unpredictable. I'm not at all sure where your file pointers are going to end up each time you try to write. You would be much safer sending your output to a new file (and then moving it to replace your old file afterwards if you wish).
open(DATA,"<unix_url.txt") or die("could not open file for reading!");
open(NEWDATA, ">win_url.txt") or die ("could not open file for writing!");
while(<DATA>){
s/\//\\/g;
s/\\/c:\\/;
# ^ (note - from your expected output you also wanted to preserve this backslash)
print NEWDATA $_;
}
close(DATA);
close(NEWDATA);
rename("win_url.txt", "unix_url.txt");
See also this answer:
Perl Read/Write File Handle - Unable to Overwrite
If the point of the exercise is less about using regular expressions, and more about getting things done, I would consider using modules from the File::Spec family:
use warnings;
use strict;
use File::Spec::Win32;
use File::Spec::Unix;
while (my $unixpath = <>) {
chomp $unixpath; # drop the newline so it does not end up inside the path
my @pieces = File::Spec::Unix->splitpath($unixpath);
my $winpath = File::Spec::Win32->catfile('c:', @pieces);
print "$winpath\n";
}
You don't really need to write a program to achieve this. You can use Perl Pie:
perl -pi -e 's|/|\\|g; s|\\|c:\\|;' unix_url.txt
However, if you are running on Windows and use Cygwin, I would suggest the cygpath tool, which converts POSIX paths into Windows paths.
Also, you need to quote your paths, since Windows paths are allowed to contain spaces. Alternatively, you can escape the space character:
perl -pi -e 's|/|\\|g; s|\\|c:\\|; s| |\\ |g;' unix_url.txt
Now concerning your initial question, if you still want to use your own script you can use this (if you want a backup):
use strict;
use autodie;
use File::Copy;
my $file = "unix_url.txt";
open my $fh, "<", $file;
open my $tmp, ">", "$file.bak";
while (<$fh>) {
s/\//\\/g;
s/\\/c:\\/; # keep the backslash after the drive letter
} continue { print $tmp $_ }
close $tmp;
close $fh;
move "$file.bak", $file;

Perl - How to Read, Filter & Output results

scenario: I am a Jr. C# developer, but recently (3 days) began learning Perl for batch files. I have a requirement to parse through a text file, extract some key data, then output the key data to a new text file. As seems to always be the case, there are butt loads of fragmented examples on the net regarding how to 'read' from a file, 'write' to a file, 'store' line by line into an array, 'filter' this and that, yadda yadda, but nothing discussing the entire process of read, filter, write. Trying to splice examples from the net together is no good, because none seem to work together as coherent code. Coming from C#, Perl's syntax structure is hella confusing. I just need some advice on this process.
My objective is to parse a text file, single out all lines similar to the one below, by date, and output only the first 8 digits of the 2nd number group and 5 digits from the 3rd number group to a new text file.
11122 20100223454345 ....random text..... [keyword that identifies all the
entries I need]... random text 0.0034543345
I know regex is likely the best option, and have most of the expression written, but it does not work in Perl!
Question: Could someone please show a simple (dummy) example of how to read from, filter (using dummy regex) the file, then output the (dummy) results to a new file? I'm not concerned with functional details, I can learn those, I just need the syntax structure Perl uses. For example:
open(FH, '<', 'dummy1.txt')
open(NFH, '>', 'dummy2.txt')
@array; or $dumb;
while(<FH>)
{
filter each line [REGEX] and shove it into [@array or $dumb scalar]
}
print(join(',', @array)) to dummy2.txt
close FH;
close NFH;
Note: For various reasons, I cannot paste my source code in here, sorry. Any help is appreciated.
UPDATE: ANSWER:
Many thanks to all those who provided insight into my issue. After reading through your replies, as well as conducting further research, I learned that there are dozens of ways to accomplish the same task in Perl (which I am not a fan of). In the end, this is how I solved the problem, and IMO it's the cleanest and most succinct solution for those having similar struggles. Thanks again for all the help.
#======================================================================
# 1. READ FILE: inputFile.txt
# 2. CREATE FILE: outputFile.txt
# 3. WRITE TO: outputFile.txt IF line matches REGEX constraints
# 4. CLOSE FILES: outputFile.txt & inputFile.txt
#==========================================================================
#1
$readFile = 'C:/.../.../inputFile.txt';
open(FH, '<', $readFile) or Error("Could not read file ($!)");
#2
$writeFile = 'C:/.../.../outputFile.txt';
open(NFH, '>', $writeFile) or Error("Cannot write to file ($!)");
#3
@lines = <FH>;
LINE: foreach $line (@lines)
{
if ($line =~ m/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/)
{
$date = $1;
$elapsedtime = $2;
print NFH "$date,$elapsedtime\n";
}
}
#4
close NFH;
close FH;
perlfaq5 - How do I change, delete, or insert a line in a file, or append to the beginning of a file? covers most of the different scenarios for how to use files.
However, I will add that you should always start your scripts with use strict; and use warnings;, and because you're doing file processing, use autodie; will serve you well too.
With that in mind, a quick stub would be the following:
use strict;
use warnings;
use autodie;
open my $infh, '<', 'dummy1.txt';
open my $outfh, '>', 'dummy2.txt';
while (my $line = <$infh>) {
chomp $line; # Remove \n
if ($line =~ /pattern_goes_here/) { # whatever processing you need goes here
print $outfh "your new data\n"; # note: no comma after the filehandle
}
}
while(<FH>)
{
# variable $_ contains the current line
if(m/regex_goes_here/) #by default, the regex match operator m// attempts to match the default $_ variable
{
#do actions
}
}
Also note, m/regex/ is the same as /regex/
Refer to:
http://perldoc.perl.org/perlvar.html#General-Variables
http://perldoc.perl.org/perlre.html
For capturing variables from regex match, THIS might help
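As a rough illustration of capturing: each parenthesised group in a successful match is made available as $1, $2 and so on. The sketch below reuses the pattern from the question's update; the sub name capture_fields and the sample line are made up for the example.

```perl
use strict;
use warnings;

# Return the two captured fields from one line, or an empty list when the
# line does not match. capture_fields is a hypothetical helper name.
sub capture_fields {
    my ($line) = @_;
    if ($line =~ m/(201403\d\d).*KEYWORD.*time was (\d+\.\d+)/) {
        # $1 and $2 hold the text matched by the first and second (...) group.
        return ($1, $2);
    }
    return;
}

# Invented sample line, shaped like the data described in the question.
my ($date, $elapsed) = capture_fields('11122 20140317454345 KEYWORD time was 0.0034');
print "$date,$elapsed\n" if defined $date;
```

Only read $1 and $2 after checking that the match succeeded; after a failed match they still hold whatever an earlier successful match left behind.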
EDIT
If you want a different variable than the default $_, as @Miller suggested, use while($line = <FH>) followed by if($line =~ m/regex_goes_here/)
=~ is the Binding Operator
One tip. Don't explicitly open filehandles to your input and output files. Instead read from STDIN and write to STDOUT. Your program will be far more flexible and easier to use as you'll be able to treat it like a Unix filter.
$ your_filter_program < your_input.txt > your_output.txt
And doing this actually makes your program simpler to write too.
while (<>) { # <> reads the files named in @ARGV, or STDIN if there are none
# transform your data (which is in $_) in some way
...
print; # prints $_ to STDOUT
}
You might find the first few chapters of Data Munging with Perl are useful.
use strict;
use warnings;
use autodie;
use feature qw(say);
use constant {
INPUT_FILE => "NAME_OF_INPUT_FILE",
OUTPUT_FILE => "NAME_OF_OUTPUT_FILE",
FILTER => qr/regex_for_line_to_filter/,
};
open my $in_fh, "<", INPUT_FILE;
open my $out_fh, ">", OUTPUT_FILE;
while ( my $line = <$in_fh> ) {
chomp $line;
next unless $line =~ FILTER;
$line =~ s/regular_expression/replacement/;
say {$out_fh} $line;
}
close $in_fh;
close $out_fh;
$in_fh is the input filehandle and $out_fh is the output filehandle. I basically open both, and loop through the input. The chomp removes the \n from the end. I always recommend doing that.
The next goes to the next iteration of the loop unless I match FILTER which is a regular expression matching lines you want to keep. This is identical to:
if ( $line !~ FILTER ) {
next;
}
I then use the substitution command to get the parts of the line I want, and munge them into the output I want. I may be better off expanding this a bit, perhaps using split to break the line into pieces and then only using the pieces I want. I could then use substr to pull out the substring from the selected pieces.
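The split and substr variant might look roughly like this. It is a sketch under assumptions: the field positions (second and last whitespace-separated field) follow the sample line in the question, and pick_fields is a name invented for the example.

```perl
use strict;
use warnings;

# Split a line on whitespace and keep only parts of selected fields.
# pick_fields is a hypothetical helper name.
sub pick_fields {
    my ($line) = @_;
    my @fields = split ' ', $line;           # ' ' splits on runs of whitespace
    return unless @fields >= 2;              # not enough fields to work with
    my $date   = substr $fields[1], 0, 8;    # first 8 characters of the second field
    my $number = substr $fields[-1], 0, 5;   # first 5 characters of the last field
    return ($date, $number);
}
```

Inside the loop above you would call it as my ($date, $number) = pick_fields($line); and then say {$out_fh} "$date,$number";.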
The say command is like print except it automatically adds in a NL on the end. This is how you write a line to a file.
Now, get Learning Perl and read it. If you know any programming, it shouldn't take you more than a week to go through the first half of the book. That should be more than enough to be able to write a program like this. The more complex stuff like references and object orientation might take a bit longer.
Online documentation can be found at http://perldoc.perl.org. You can look up the use statements, which are called pragmas, over there. Documentation on the individual functions is also available.
If I understood well, this one liner will do the job:
perl -ane 'print substr($F[1],0,8),"\t",substr($F[-1],0,5),"\n" if /keyword/' in.txt
Assuming in.txt is:
11122 20100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.0034543345
11122 30100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.124543345
11122 40100223454345 ....random text..... [keyword that identifies all the entries I need]... random text 0.65487
11122 50100223454345 ....random text..... [ that identifies all the entries I need]... random text 0.6215
output:
20100223 0.003
40100223 0.654

How to convert this perl one-liner into script (specifically multi-line, global regex replace)

I have a file with several XML tags like such:
<Good>Yay!</Good>
<Great>Yup!</Great>
<Bad>booo</Bad>
<Bad>
<Ok>not that great</ok>
</Bad>
<Good>Wheee!</Good>
where I want to get rid of the "Bad" tags and anything in between.
So it would turn into just:
<Good>Yay!</Good>
<Great>Yup!</Great>
<Good>Wheee!</Good>
I know this one-liner:
perl -pe "undef $/;s/<Bad>.*?<\/Bad>//msg" < originalFile > newlyStrippedFile
Seems to do everything I want (aside from putting extra newlines in, but hopefully I can deal with that easily enough)
But I need to put it in a script (two files are read into the command line, one with all the tags, the other with a list of tags to pull out), so the same thing is going to be called several times.
And I'm just having trouble. Either it's only ever reading one line or I get errors or both.
Here is the relevant portion of my latest attempt:
open ORIGINAL_FILE, $sdb_pathname
or die "Can't open '$sdb_pathname' : $!";
@sdb_input_array = <ORIGINAL_FILE>;
close ORIGINAL_FILE;
@sdb_input_scalar=join("",@sdb_input_array);
foreach $tag (@tags) {
&remove_tag($tag);
}
sub remove_tag
{
my($current_tag) = @_;
$sdb_input_scalar =~ s/<$current_tag>.*?<\/$current_tag>//msg;
open NEWLY_STRIPPED_FILE, $clean_sdb_pathname
or die "Can't open '$clean_sdb_pathname' : $!";
print(NEWLY_STRIPPED_FILE $sdb_input_scalar);
close(NEWLY_STRIPPED_FILE);
}
This is giving me "Use of uninitialized value $sdb_input_scalar in substitution (s///)" at the $sdb_input_scalar =~ line,
and
Filehandle NEWLY_STRIPPED_FILE opened only for input
And of course my two files still look identical, as if I did nothing to them.
I'm sorry if I'm missing something obvious but I'm literally brand new to perl. Someone at work gave an 8-hour estimate to do this script and I've already used over 5 hours just installing perl, learning the syntax and getting the other aspects to go right. I know there is an XML::Parser module but I found the examples very overwhelming for the short time I have left to complete.
I have to assume my regex is correct because the one-liner works so nicely.
Can anyone please help me adapt it to what I need it for?
You really should use an XML parser. It's almost a guarantee that an XML file will not parse quite the way you expect it to with regexes. However, let's get you started first.
Where you have:
@sdb_input_scalar=join("",@sdb_input_array);
You actually want:
$sdb_input_scalar=join("",@sdb_input_array);
Now some other tips.
At the top of your script make sure you enable warnings with the -w flag like this:
#!/path/to/perl -w
use strict;
Once you add use strict it will produce several errors, but that's a good thing. We're going to enforce some scope and other good practices. You now need to declare variables (beginning with $, @, or %) with my. For example:
my @sdb_input_array = <ORIGINAL_FILE>;
or:
foreach my $tag (@tags) { ... }
Instead of calling open like you are, use the three-argument version:
open (my $originalFile, "<", $sdb_pathname)
or die "Can't open '$sdb_pathname' : $!";
my @sdb_input_array = <$originalFile>;
That will set it to read only. See http://perldoc.perl.org/functions/open.html
Generally you should avoid dependency on globals. Change how you call remove_tag():
foreach my $tag (@tags) {
$sdb_input_scalar = remove_tag($sdb_input_scalar, $tag);
}
To support this you need to change the function as well:
sub remove_tag
{
my($input, $current_tag) = @_;
$input =~ s/<$current_tag>.*?<\/$current_tag>//msg;
return $input;
}
You can then write out once after you have iterated over all tags by moving this outside of the remove_tag function:
open (my $strippedFile, ">", $clean_sdb_pathname)
or die "Can't open '$clean_sdb_pathname' : $!";
print $strippedFile $sdb_input_scalar;
close($strippedFile);
Here is a solution using XML::Twig:
use warnings;
use strict;
use XML::Twig;
my $xml = XML::Twig->new(
pretty_print => 'indented',
twig_handlers => {
#Define a sub that will be called for all 'Bad' tags
Bad => sub {
$_->set_tag('Good');
}
}
);
$xml->parse(\*DATA);
$xml->print;
__DATA__
<xml><Good>Yay!</Good><Great>Yup!</Great><Bad>booo</Bad><Bad>
<Ok>not that great</Ok></Bad><Good>Wheee!</Good></xml>
XML::Twig also has parsefile() and parsefile_inplace() methods that take a filename directly and process it--just what you need.
There is a little bit of a learning curve with this method, but the benefits are great.
First: don't use regular expressions to deal with XML!
Then, to answer the question in the title rather than the specific use case: your one-liner is better written as:
perl -0777 -pe "s/<(Bad)>.*?<\/\1>//msg" < originalFile > newlyStrippedFile
Now, use the Perl itself to "inflate" the one-liner:
perl -MO=Deparse -0777 -pe "s/<(Bad)>.*?<\/\1>//msg" > oneliner.pl
And this is what you get:
BEGIN { $/ = undef; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
s[<(Bad)>.*?</\1>][]gms;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Just add use strict; use warnings;.
This is a solution using XML::Twig. I have assumed that your XML document is well-formed and have wrapped the data you have shown in it in a <root> element to make it so.
The $twig object defines a single twig handler for <Bad> elements, which simply deletes the element if it appears during parsing.
Once the input has been parsed, $twig->print shows the residual XML.
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
twig_handlers => { Bad => sub { $_->delete } },
pretty_print => 'record',
);
$twig->parse(<<'END_XML');
<root>
<Good>Yay!</Good>
<Great>Yup!</Great>
<Bad>booo</Bad>
<Bad>
<Ok>not that great</Ok>
</Bad>
<Good>Wheee!</Good>
</root>
END_XML
$twig->print;
output
<root>
<Good>Yay!</Good>
<Great>Yup!</Great>
<Good>Wheee!</Good>
</root>
This should do the trick:
$tags=join("",@sdb_input_array);
print "contents before : $tags \n";
$tags =~ s/<Bad>.*?<\/Bad>//msg;
print "content cleaned : $tags \n";
The $tags variable should no longer contain the "Bad" tags. The only issue is that each removed tag leaves an empty line behind, so you will have blank lines between the "Good" tag lines, but you can remove blank lines as a final step.
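That final cleanup step could be sketched as a second substitution. This is an assumption-laden example rather than a robust XML tool: strip_tag is a name I made up, and the approach inherits all the fragility of regex-based XML editing that the other answers warn about.

```perl
use strict;
use warnings;

# Remove every <tag>...</tag> block from $text, then drop blank lines.
# strip_tag is a hypothetical helper name. Note that the second step also
# removes blank lines that were already present in the input.
sub strip_tag {
    my ($text, $tag) = @_;
    $text =~ s/<\Q$tag\E>.*?<\/\Q$tag\E>//sg;   # /s lets . match newlines
    $text =~ s/^[ \t]*\n//mg;                   # /m lets ^ match at each line start
    return $text;
}
```

With the question's variables it would be called as $tags = strip_tag($tags, 'Bad');.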

Search, Create and Move in Perl

I have a directory of about 800 html files. I am trying to search each file and return text between tags. Then I want to create a directory with that text and move (or copy) the files there. This seemed like a pretty easy endeavor when I thought it up but I am having a ton of problems even identifying the modules I would need for this. I have looked at File::Find and glob, but am not exactly sure about how I would implement this with a regex for txt within the files (not the file name.) I am basically a newbie to perl so any and all help would be appreciated. Thanks in advance.
EDIT
To clarify: What I am trying to accomplish:
Read Directory = ~/me/project/
For ~/me/project/ find all the files =~ /.html$/i
For each file, search the html for = div class="recip" id="objectTo">(.*) /div
For every (.*), i.e. john@doewww.com or John Doe, create a directory with that same name
Loop back and move every file that has an instance of xxxxxxxx@xxxxx.com or John Doe to its corresponding directory.
I really appreciate the help!
You're on the right track with File::Find.
You will create a 'wanted()' function, and within that function, the name of the file found will be $File::Find::name. You can then use that to open a file handle, read in the file, search for the tags and extract the data that you're looking for, and close the file handle. File::Find will then move on to the next file.
#! /usr/bin/perl
use warnings;
use strict;
use File::Find;
sub wanted {
my $file=$File::Find::name;
# if the file has the extension '.html' (case insensitive) ...
if( $file =~ /\.html$/i ) {
my $FH;
open( $FH, '<', $file) or die "Could not open '$file' for reading: $!";
local $/; # an undefined $/ makes <$FH> read the whole file at once
my $contents = <$FH>; # slurp file into $contents
# search $contents for the tags that you're looking for,
#
close $FH;
}
}
my @directories = (
'./htmlfiles'
, './www'
, './web'
);
find(\&wanted, @directories);
Warning: The code passes perl -c, but I haven't run it.
For the second part of your question, Check out HTML::Strip for stripping HTML markup from text.
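For the directory creation and the move itself, the core modules File::Path and File::Copy should be enough. Here is a minimal sketch, assuming the recipient sits in a div exactly as written in the question; the helper names recipient_of and file_into_dir are mine, and a real HTML parser would be more robust than the regex.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Copy qw(move);
use File::Path qw(make_path);
use File::Spec;

# Pull the recipient out of the slurped HTML, or return undef if absent.
# The markup in the pattern is an assumption based on the question.
sub recipient_of {
    my ($contents) = @_;
    my ($name) = $contents =~ m{<div class="recip" id="objectTo">(.*?)</div>}s;
    return $name;
}

# Move $file into a directory under $base named after the recipient.
sub file_into_dir {
    my ($file, $recipient, $base) = @_;
    my $dir = File::Spec->catdir($base, $recipient);
    make_path($dir);    # creates the directory if it does not exist yet
    my $target = File::Spec->catfile($dir, basename($file));
    move($file, $target) or die "Could not move '$file' to '$target': $!";
    return $target;
}
```

It is probably safer to collect the file names and recipients inside wanted() and do the moves after find() has returned, so that files are not relocated while the directory tree is still being walked. A recipient containing characters such as "/" would also need cleaning up before being used as a directory name.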