Perl: how to supply regexp list from file - regex

So, I need to parse a file and, if something matches a pattern, replace it with something:
while(<$ifh>) {
s/(.*pattern_1*)/$1\nsome more stuff/ ;
s/(.*pattern_2*)/$1\neven more stuff/;
s/(.*pattern_3.*)// ;
# and so on ...
print $ofh $_;
}
Question: what would be the simplest way to keep this list of regexp rules in a file (something similar to sed's '-f' option)?
EDIT: perhaps I need to clarify a bit. We need to have the regexp rules in a separate file (not in the parsed file, although that was nice, thanks!), so that they are not hardcoded. So, basically, the external file should consist of 's//' lines.
Of course this can probably be done with a foreach loop and eval, or even with an external call to sed, but I suppose there could be something nicer.
regards, Wojtek

You could create patterns and store them in a Perl list, then simply iterate through your patterns in your input loop.
my @patterns = ( qr/pattern1/, qr/pattern2/, qr/pattern3/ );  # and so on
while (<$ifh>) {
    for my $pattern (@patterns) {
        s/($pattern)/$1\neven more stuff/;
    }
    print $ofh $_;
}

Based on your edit, here's a way to store regular expressions in an external file. Note that I would not recommend storing whole s/.../.../ statements in an external file, but you could use the eval function to accomplish it. Here's how I would solve this, however:
my (@regexes, $i);
$i = 0;
# Each line of the rules file has the form  pattern::replacement
while (my $line = <$rifh>) {
    chomp($line);
    if (index($line, "::") > -1) {
        my @texts = split(/::/, $line);
        $regexes[$i]{"to find"} = qr!$texts[0]!;
        $regexes[$i]{"replace with"} = $texts[1];
        $i++;
    }
}
while (my $line = <$ifh>) {
    chomp($line);
    foreach my $regex (@regexes) {
        $line =~ s!$regex->{"to find"}!$regex->{"replace with"}!;
    }
    print $ofh $line . "\n";
}
(My use of ! as a regexp delimiter is my own style mechanism, often useful for allowing bare / characters to be used directly in a regexp definition.)
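For completeness, the eval route mentioned above could look roughly like the sketch below. It assumes the rules file is trusted and holds one complete s/.../.../ statement per line. The names compile_rules and apply_rules are made up for this illustration; each rule is compiled once into a code reference rather than being re-evaluated for every input line.

```perl
use strict;
use warnings;

# Hypothetical helper: turn each 's/.../.../' line into a code reference.
sub compile_rules {
    my @rule_lines = @_;
    my @rules;
    for my $rule (@rule_lines) {
        chomp $rule;
        next if $rule =~ /^\s*(?:#.*)?$/;    # skip blank lines and comments
        # String eval runs arbitrary code: only use a trusted rules file.
        my $code = eval "sub { \$_[0] =~ $rule; }"
            or die "Bad rule '$rule': $@";
        push @rules, $code;
    }
    return \@rules;
}

# Hypothetical helper: run every compiled rule, in order, over one string.
sub apply_rules {
    my ($rules, $text) = @_;
    $_->($text) for @$rules;    # each rule edits $text in place via $_[0]
    return $text;
}

my $rules = compile_rules('s/(foo)/$1 bar/', 's/baz.*//');
print apply_rules($rules, "foo and baz here"), "\n";
```

In the loop from the question, the rule lines would presumably come from the rules file handle, and the body would become print $ofh apply_rules($rules, $_);. Since string eval executes whatever the file contains, this is only reasonable when nobody untrusted can edit that file.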

Related

Retrieve string between two string delimiters using regex in perl

I have been working on this for a little while now and can't seem to figure it out. I have a file containing a bunch of lines all structured like the one below meaning each line starts with "!" and has three separators "<DIV>".
!the<DIV>car<DIV>drove down the<DIV>road off into the distance
I am interested in retrieving the last string, "road off into the distance", but I can't seem to get it to work. Below is the code I currently have.
while($line = <INFILE>) {
$line =~ /<SEP>{3}(.*)/;
print $1;
}
Any help would be greatly appreciated!
The statement
@b = $a =~ /^!(.*?)<DIV>(.*?)<DIV>(.*?)<DIV>(.*)/
will split the string into a list, and you can then extract the 4th element with
$b[3]
If you really want only the last one, do this instead:
($text) = $a =~ /^!.*<DIV>(.*)/
I don't know whether you insist on a regex or simply didn't think of anything else, but split will do this nicely:
$text = (split '<DIV>', $str)[-1];
If you regularly have such repeating patterns split may well be better for the job than a pure regex. (Split also uses full regular expressions in its pattern, of course.)
ADDED
All this can be done directly, if you simply only need to pull the last thing from each line:
open my $fh, '<', $file or die "Can't open '$file': $!";
my @text = map { (split '<DIV>')[-1] } <$fh>;
close $fh;
print "$_\n" for @text;
The split by default uses $_, which inside the map is the current element processed. For lines without a <DIV> this returns the whole line. A file handle in the list context serves all lines as a list; the list context is imposed by map here.
In case you want all text between delimiters you can do
my @rlines = map { [ split '<DIV>' ] } <$fh>;
where [ ] takes a reference to the list returned by split and thus @rlines contains references to arrays, each with text in between <DIV>s on a line. The leading ! is there though and to drop it a little more processing is needed.
Of course, for the map block you can use { (/.*<DIV>(.*)/)[0] } from Jim Garrison's answer for a single match, or modify the regex a little to catch'em all.
If performance is a factor then that's a little different question.
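The "little more processing" needed to drop the leading ! might look like the sketch below. The helper name fields_of is invented for this example, and it assumes every record sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper: split one line into its <DIV>-separated fields,
# dropping the trailing newline and the leading "!" marker.
sub fields_of {
    my ($line) = @_;
    chomp $line;
    $line =~ s/^!//;
    return split /<DIV>/, $line, -1;    # limit -1 keeps trailing empty fields
}

my @fields = fields_of("!the<DIV>car<DIV>drove down the<DIV>road off into the distance\n");
print "$fields[-1]\n";    # the last field of the sample line
```

Used with the map above, this would become something like my @rlines = map { [ fields_of($_) ] } <$fh>;.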
A simple substitution could work too:
while(<DATA>){
chomp;
my $text = (s/.*<DIV>//g, $_);
say $text;
}
Simple regex which answers your question:
my $match= '';
while($line = <INFILE>) {
($match) = $line =~/.*<DIV>(.*)/;
}
print $match, "\n";

How to convert this perl one-liner into script (specifically multi-line, global regex replace)

I have a file with several XML tags like such:
<Good>Yay!</Good>
<Great>Yup!</Great>
<Bad>booo</Bad>
<Bad>
<Ok>not that great</ok>
</Bad>
<Good>Wheee!</Good>
where I want to get rid of the "Bad" tags and anything in between.
So it would turn into just:
<Good>Yay!</Good>
<Great>Yup!</Great>
<Good>Wheee!</Good>
I know this one-liner:
perl -pe "undef $/;s/<Bad>.*?<\/Bad>//msg" < originalFile > newlyStrippedFile
Seems to do everything I want (aside from putting extra newlines in, but hopefully I can deal with that easily enough)
But I need to put it in a script (two files are read into the command line, one with all the tags, the other with a list of tags to pull out), so the same thing is going to be called several times.
And I'm just having trouble. Either it's only ever reading one line or I get errors or both.
Here is the relevant portion of my latest attempt:
open ORIGINAL_FILE, $sdb_pathname
or die "Can't open '$sdb_pathname' : $!";
@sdb_input_array = <ORIGINAL_FILE>;
close ORIGINAL_FILE;
@sdb_input_scalar=join("",@sdb_input_array);
foreach $tag (@tags) {
&remove_tag($tag);
}
sub remove_tag
{
my($current_tag) = @_;
$sdb_input_scalar =~ s/<$current_tag>.*?<\/$current_tag>//msg;
open NEWLY_STRIPPED_FILE, $clean_sdb_pathname
or die "Can't open '$clean_sdb_pathname' : $!";
print(NEWLY_STRIPPED_FILE $sdb_input_scalar);
close(NEWLY_STRIPPED_FILE);
}
This is giving me "use of uninitialized value $sdb_input_scalar in substitution (s///)" at my $sdb_input_scalar =~ line.
and
Filehandle NEWLY_STRIPPED_FILE opened only for input
And of course my two files still look identical, as if I did nothing to them.
I'm sorry if I'm missing something obvious but I'm literally brand new to perl. Someone at work gave an 8-hour estimate to do this script and I've already used over 5 hours just installing perl, learning the syntax and getting the other aspects to go right. I know there is an XML::Parser module but I found the examples very overwhelming for the short time I have left to complete.
I have to assume my regex is correct because the one-liner works so nicely.
Can anyone please help me adapt it to what I need it for?
You really should use an XML parser. It's almost a guarantee that an XML file will not parse quite the way you expect it to with regexes. However, let's get you started first.
Where you have:
@sdb_input_scalar=join("",@sdb_input_array);
You actually want:
$sdb_input_scalar=join("",@sdb_input_array);
Now some other tips.
At the top of your script make sure you enable warnings with the -w flag like this:
#!/path/to/perl -w
use strict;
Once you add use strict, it will cause several errors, but that's a good thing. We're going to enforce some scope and other good practices. You now need to declare variables (beginning with $, @, or %) with my. For example:
my @sdb_input_array = <ORIGINAL_FILE>;
or:
foreach my $tag (@tags) { ... }
Instead of calling open the way you are, use the three-argument version:
open (my $originalFile, "<", $sdb_pathname)
or die "Can't open '$sdb_pathname' : $!";
my @sdb_input_array = <$originalFile>;
That will open it read-only. See http://perldoc.perl.org/functions/open.html
Generally you should avoid dependency on globals. Change how you call remove_tag():
foreach my $tag (@tags) {
$sdb_input_scalar = remove_tag($sdb_input_scalar, $tag);
}
To support this you need to change the function as well:
sub remove_tag
{
my($input, $current_tag) = @_;
$input =~ s/<$current_tag>.*?<\/$current_tag>//msg;
return $input;
}
You can then write out once after you have iterated over all tags by moving this outside of the remove_tag function:
open (my $strippedFile, ">", $clean_sdb_pathname)
or die "Can't open '$clean_sdb_pathname' : $!";
print $strippedFile $sdb_input_scalar;
close($strippedFile);
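Putting those pieces together, the tag-stripping step could be sketched as a single function that takes the text and the whole list of tags. The name remove_tags and the in-memory sample are assumptions for illustration. The \Q...\E quoting and the optional \n? (which swallows the newline left behind by a removed element) are additions that are not in the code above.

```perl
use strict;
use warnings;

# Remove every <tag>...</tag> element for each tag name given.
# A regex sketch only: it does not handle attributes, nested elements of
# the same name, or CDATA, which is why a real XML parser is preferable.
sub remove_tags {
    my ($input, @tags) = @_;
    for my $tag (@tags) {
        # \Q...\E quotes any regex metacharacters in the tag name;
        # \n? also consumes the newline that followed the element.
        $input =~ s/<\Q$tag\E>.*?<\/\Q$tag\E>\n?//sg;
    }
    return $input;
}

my $xml = "<Good>Yay!</Good>\n<Bad>booo</Bad>\n<Bad>\n<Ok>not that great</Ok>\n</Bad>\n<Good>Wheee!</Good>\n";
print remove_tags($xml, 'Bad');
```

With the file handling shown above, the call would presumably be $sdb_input_scalar = remove_tags($sdb_input_scalar, @tags); followed by a single write to the output file.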
Here is a solution using XML::Twig:
use warnings;
use strict;
use XML::Twig;
my $xml = XML::Twig->new(
pretty_print => 'indented',
twig_handlers => {
#Define a sub that will be called for all 'Bad' tags
Bad => sub {
$_->set_tag('Good');
}
}
);
$xml->parse(\*DATA);
$xml->print;
__DATA__
<xml><Good>Yay!</Good><Great>Yup!</Great><Bad>booo</Bad><Bad>
<Ok>not that great</Ok></Bad><Good>Wheee!</Good></xml>
XML::Twig also has parsefile() and parsefile_inplace() methods that take a filename directly and process it--just what you need.
There is a little bit of a learning curve with this method, but the benefits are great.
First: don't use regular expressions to deal with XML!
That said, to answer the question in the title rather than the specific use case: your one-liner is better written as:
perl -0777 -pe "s/<(Bad)>.*?<\/\1>//msg" < originalFile > newlyStrippedFile
Now, use Perl itself to "inflate" the one-liner:
perl -MO=Deparse -0777 -pe "s/<(Bad)>.*?<\/\1>//msg" > oneliner.pl
And this is what you get:
BEGIN { $/ = undef; $\ = undef; }
LINE: while (defined($_ = <ARGV>)) {
s[<(Bad)>.*?</\1>][]gms;
}
continue {
die "-p destination: $!\n" unless print $_;
}
Just add use strict; use warnings;.
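Written out by hand, the same idea might look like the sketch below. It assumes an in-memory string in place of the real input file, and the helper name slurp is made up here. The important part is local $/, which is what -0777 sets up in the deparsed BEGIN block.

```perl
use strict;
use warnings;

# Read an entire file handle into one string (what -0777 does for -p).
sub slurp {
    my ($fh) = @_;
    local $/;    # undefine the input record separator for this scope only
    return scalar <$fh>;
}

# An in-memory file handle stands in for the real input file here.
my $data = "<Good>Yay!</Good>\n<Bad>\nbooo\n</Bad>\n<Good>Wheee!</Good>\n";
open my $in, '<', \$data or die "Can't open in-memory file: $!";
my $text = slurp($in);
close $in;

# Same substitution as the one-liner, plus \n? to drop the leftover newline.
$text =~ s/<(Bad)>.*?<\/\1>\n?//sg;
print $text;
```

Using local keeps the change to $/ confined to the function, so any line-by-line reads elsewhere in the script keep working as usual.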
This is a solution using XML::Twig. I have assumed that your XML document is well-formed and have wrapped the data you have shown in it in a <root> element to make it so.
The $twig object defines a single twig handler for <Bad> elements, which simply deletes the element if it appears during parsing.
Once the input has been parsed, $twig->print shows the residual XML.
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
twig_handlers => { Bad => sub { $_->delete } },
pretty_print => 'record',
);
$twig->parse(<<'END_XML');
<root>
<Good>Yay!</Good>
<Great>Yup!</Great>
<Bad>booo</Bad>
<Bad>
<Ok>not that great</Ok>
</Bad>
<Good>Wheee!</Good>
</root>
END_XML
$twig->print;
output
<root>
<Good>Yay!</Good>
<Great>Yup!</Great>
<Good>Wheee!</Good>
</root>
This should do the trick:
$tags=join("",@sdb_input_array);
print "contents before : $tags \n";
$tags =~ s/<Bad>.*?<\/Bad>//msg;
print "content cleaned : $tags \n";
The $tags variable should now no longer contain the "Bad" elements. The only issue is that each removed element leaves an empty line behind, so you will have blank lines between the remaining "Good" lines, but you can remove blank lines as a final step.

How to pass a replacing regex as a command line argument to a perl script

I am trying to write a simple perl script to apply a given regex to a filename among other things, and I am having trouble passing a regex into the script as an argument.
What I would like to be able to do is something like this:
> myscript 's/hi/bye/i' hi.h
bye.h
>
I have produced this code
#!/utils/bin/perl -w
use strict;
use warnings;
my $n_args = $#ARGV + 1;
my $regex = $ARGV[0];
for(my $i=1; $i<$n_args; $i++) {
my $file = $ARGV[$i];
$file =~ $regex;
print "OUTPUT: $file\n";
}
I cannot use qr because apparently it cannot be used on replacing regexes (although my source for this is a forum post so I'm happy to be proved wrong).
I would rather avoid passing the two parts in as separate strings and manually doing the regex in the perl script.
Is it possible to pass the regex as an argument like this, and if so what is the best way to do it?
There's more than one way to do it, I think.
The Evial Way:
As you basically send in a regex expression, it can be evaluated to get the result. Like this:
my @args = ('s/hi/bye/', 'hi.h');
my ($regex, @filenames) = @args;
for my $file (@filenames) {
eval("\$file =~ $regex");
print "OUTPUT: $file\n";
}
Of course, following this way will open you to some very nasty surprises. For example, consider passing this set of arguments:
...
my @args = ('s/hi/bye/; print qq{MINE IS AN EVIL LAUGH!\n}', 'hi.h');
...
Yes, it will laugh at you most evailly.
The Safe Way:
my ($regex_expr, @filenames) = @args;
my ($substr, $replace) = $regex_expr =~ m#^s/((?:[^/]|\\/)+)/((?:[^/]|\\/)+)/#;
for my $file (@filenames) {
$file =~ s/$substr/$replace/;
print "OUTPUT: $file\n";
}
As you can see, we parse the expression given to us into two parts, then use these parts to build a full operator. Obviously, this approach is less flexible, but, of course, it's much more safe.
The Easiest Way:
my ($search, $replace, @filenames) = @args;
for my $file (@filenames) {
$file =~ s/$search/$replace/;
print "OUTPUT: $file\n";
}
Yes, that's right: no regex parsing at all! What happens here is that we decided to take two arguments, a 'search pattern' and a 'replacement string', instead of a single one. Will it make our script less flexible than the previous one? No, as we still had to parse the regex expression more-or-less regularly. But now the user clearly understands all the data that is given to the command, which is usually quite an improvement.
@args in these examples corresponds to the @ARGV array.
The s/a/b/i is an operator, not simply a regular expression, so you need to use eval if you want it to be interpreted properly.
#!/usr/bin/env perl
use warnings;
use strict;
my $regex = shift;
my $sub = eval "sub { \$_[0] =~ $regex; }";
foreach my $file (@ARGV) {
&$sub($file);
print "OUTPUT: $file\n";
}
The trick here is that I'm substituting this "bit of code" into a string to produce Perl code that defines an anonymous subroutine $_[0] =~ s/a/b/i; (or whatever code you pass it), then using eval to compile that code and give me a code reference I can call from within the loop.
$ test.pl 's/foo/bar/' foo nicefood
OUTPUT: bar
OUTPUT: nicebard
$ test.pl 'tr/o/e/' foo nicefood
OUTPUT: fee
OUTPUT: nicefeed
This is more efficient than putting an eval "\$file =~ $regex;" inside the loop as then it'll get compiled and eval-ed at every iteration rather than just once up-front.
A word of warning about eval - as raina77ow's answer explains, you should avoid eval unless you're 100% sure you are always getting your input from a trusted source...
s/a/b/i is not a regex. It is a regex plus a substitution. Unless you use string eval, making this work might be pretty tough (consider s{a}<b>e and so on).
The trouble is that you are trying to pass a perl operator when all you really need to pass is the arguments:
myscript hi bye hi.h
In the script:
my ($find, $replace, @files) = @ARGV;
...
$file =~ s/$find/$replace/i;
Your code is a bit clunky. This is all you need:
use strict;
use warnings;
my ($find, $replace, @files) = @ARGV;
for my $file (@files) {
$file =~ s/$find/$replace/i;
print "$file\n";
}
Note that this way allows you to use meta characters in the regex, such as \w{2}foo?. This can be both a good thing and a bad thing. To make all characters interpreted literally (disable meta characters), you can use \Q ... \E like so:
... s/\Q$find\E/$replace/i;
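To make the difference concrete, here is a small sketch that runs the same find/replace pair both ways. The function name substitute_all and its $literal flag are invented for the example; only qr// and \Q...\E are standard Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one find/replace pair to a list of names.
# When $literal is true, the search text is matched character for character.
sub substitute_all {
    my ($find, $replace, $literal, @names) = @_;
    my $pattern = $literal ? qr/\Q$find\E/i : qr/$find/i;
    my @result;
    for my $name (@names) {
        (my $copy = $name) =~ s/$pattern/$replace/;    # leave the original untouched
        push @result, $copy;
    }
    return @result;
}

# As a regex, "1.0" also matches "100" because "." matches any character.
print join(' ', substitute_all('1.0', '2.0', 0, 'v100', 'v1.0')), "\n";
# Quoted with \Q...\E, only a literal "1.0" matches.
print join(' ', substitute_all('1.0', '2.0', 1, 'v100', 'v1.0')), "\n";
```

Note that the replacement string is inserted as plain text in both cases, so something like $1 passed on the command line would not be expanded into a capture group.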

Perl, match one pattern multiple times in the same line delimited by unknown characters

I've been able to find similar, but not identical questions to this one. How do I match one regex pattern multiple times in the same line delimited by unknown characters?
For example, say I want to match the pattern HEY. I'd want to recognize all of the following:
HEY
HEY HEY
HEYxjfkdsjfkajHEY
So I'd count 5 HEYs there. So here's my program, which works for everything but the last one:
open ( FH, $ARGV[0]);
while(<FH>)
{
foreach $w ( split )
{
if ($w =~ m/HEY/g)
{
$count++;
}
}
}
So my question is how do I replace that foreach loop so that I can recognize patterns delimited by weird characters in unknown configurations (like shown in the example above)?
EDIT:
Thanks for the great responses thus far. I just realized I need one other thing though, which I put in a comment below.
One question though: is there any way to save the matched term as well? So like in my case, is there any way to reference $w (say if the regex was more complicated, and I wanted to store it in a hash with the number of occurrences)
So if I was matching a real regex (say a sequence of alphanumeric characters) and wanted to save that in a hash.
One way is to capture all matches of the string and see how many you got. Like so:
open (FH, $ARGV[0]);
while(my $w = <FH>) {
my @matches = $w =~ m/(HEY)/g;
my $count = scalar(@matches);
print "$count\t$w\n";
}
EDIT:
Yes, there is! Just loop over all the matches, and use the capture variables to increment the count in a hash:
my %hash;
open (FH, $ARGV[0]);
while (my $w = <FH>) {
foreach ($w =~ /(HEY)/g) {
$hash{$1}++;
}
}
The problem is you really don't want to call split(). It splits things into words, and you'll note that your last line only has a single "word" (though you won't find it in the dictionary). A word is bounded by white-space and thus is just "everything but whitespace".
What you really want is to keep looking through each line, counting every HEY and starting where you left off each time. That requires the /g modifier, used in a while loop so that the match keeps looking:
while(<>)
{
while (/HEY/g)
{
$count++;
}
}
print "$count\n";
There is, of course, more than one way to do it but this sticks close to your example. Other people will post other wonderful examples too. Learn from them all!
None of the above answers worked for my similar problem. $1 does not seem to change (perl 5.16.3) so $hash{$1}++ will just count the first match n times.
To get each match, the foreach needs a local variable assigned, which will then contain the match variable. Here's a little script that will match and print each occurrence of (number).
#!/usr/bin/perl -w
use strict;
use warnings FATAL=>'all';
my (%procs);
while (<>) {
foreach my $proc ($_ =~ m/\((\d+)\)/g) {
$procs{$proc}++;
}
}
print join("\n",keys %procs) . "\n";
I'm using it like this:
pstree -p | perl extract_numbers.pl | xargs -n 1 echo
(except with some relevant filters in that pipeline). Any pattern capture ought to work as well.
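The two working approaches in this thread can be combined: a while loop with /g in scalar context refreshes $1 on every iteration, so the matched text itself can be used as the hash key. A minimal sketch, with count_matches as a made-up name and the sample lines taken from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: count how often each distinct match of $pattern occurs.
sub count_matches {
    my ($pattern, @lines) = @_;
    my %count;
    for my $line (@lines) {
        # In scalar context, /g resumes after the previous match,
        # so $1 holds a fresh capture on every pass through the loop.
        while ($line =~ /($pattern)/g) {
            $count{$1}++;
        }
    }
    return \%count;
}

my $count = count_matches(qr/HEY/, "HEY", "HEY HEY", "HEYxjfkdsjfkajHEY");
print "$_: $count->{$_}\n" for sort keys %$count;
```

A more general pattern such as qr/\w+/ would give one hash entry per distinct word, which is the "save the matched term" case asked about in the edit to the question.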

Using a regular expression in Perl to list variables from another Perl script

My thoughts on how to grab all scalars and arrays out of a Perl file went along the lines of:
open (InFile, "SomeScript.pl");
@InArray = <InFile>;
@OutArray = {};
close (InFile);
$ArrayCount = @InArray;
open (OutFile, ">outfile.txt");
for ($x=0; $x<=$ArrayCount; $x++){
$Testline = @InArray[$x];
if($Testline =~ m/((@|\$)[A-Z]+)/i){
$Outline = "$1\n";
push @OutArray, $Outline;
}
}
print OutFile @OutArray;
close(OutFile);
...and this works fairly well. The problem is that if multiple variables appear on a line it will only grab the first variable. An example might be:
$FirstVar = $SecondVar + $ThirdVar;
The script would only grab $FirstVar and output it to a file. This might still work though, because $SecondVar and $ThirdVar have to be initialized somewhere else before the preceding line has any meaning. I guess the exception to the rule would be a line in which multiple variables are initialized at the same time.
Could an example in real Perl code break this script?
Also, how to grab multiple items that match my regular expression's criteria from the same line?
Don't do that
You can't really parse Perl with regexes, so I wouldn't even try.
You can't even properly parse it without actually running it, but you can get close with PPI.
perl-variables.pl
#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.1;
use PPI;
use PPI::Find;
my($filename) = (@ARGV, $0); # checks itself by default
my $Doc = PPI::Document->new($filename);
my $Find = PPI::Find->new( sub{
return 0 unless $_[0]->isa('PPI::Token::Symbol');
return 1;
});
$Find->start($Doc);
while( my $symbol = $Find->match ){
my $raw = $symbol->content;
my $var = $symbol->symbol;
if( $raw eq $var ){
say $var;
} else {
say "$var\t($raw)";
}
}
print "\n";
my @found = $Find->in($Doc);
my %found;
$found{$_}++ for @found;
say for sort keys %found;
Running it against itself, produces:
$filename
@ARGV
$0
$Doc
$filename
$Find
@_ ($_)
$Find
$Doc
$symbol
$Find
$raw
$symbol
$var
$symbol
$raw
$var
$var
@found
$Find
$Doc
%found
%found ($found)
$_
@found
%found
$0
$Doc
$Find
$_
$filename
$found
$raw
$symbol
$var
%found
@ARGV
@found
It looks like this will miss fully qualified variable names ($My::Package::Foo) and the rare but valid variable names enclosed with braces (${variable}, ${"varname!with#special+chars"}). Your script will also match element accesses of hashes and arrays ($array[4] ==> $array, $hash{$key} ==> $hash), and object method calls ($object->method() ==> $object), which may or may not be what you want.
You also mismatch variables with underscores ($my_var) and numbers ($var3), and you could get false positives from comments, quoted strings, pod, etc. (# report bugs to bob@company.org).
Matching multiple expressions is a matter of using the /g modifier, which will return a list of matches:
@vars = $Testline =~ /[@\$]\w+/gi;
if (@vars > 0) {
push @OutArray, @vars;
}
The simple-minded answer is to use the /g flag on your regexp.
The complex answer is that this sort of code analysis is very difficult for perl. Look at the module PPI for a better, more full featured, semantic analysis of perl code.
I can't answer either of your questions directly, but I will offer this: I don't know why you're trying to extract scalars, but the debugger package that comes with perl has to "know" about all variables, and the last time I looked it was written in Perl. You may be better off trying to evaluate a perl script using the debugger package or techniques borrowed from that package rather than reinventing the wheel.
Despite the limitations with the method, here is a slightly simpler version of the script above that reads from stdin.
#!/usr/bin/perl
use strict;
use warnings;
my %vars;
while (<>) {
$vars{$_}++ for (m'([$@]\w+)'g);
}
my @vars = keys %vars;
print "@vars\n";