Conflict between eval and print in Perl - regex

I'm creating a subroutine in my Perl script; I can eval it and it works. I would also like to print the content of the subroutine for debugging purposes. However, the subroutine, which is constructed in code, is really huge, and it is hard to read and understand when simply printed. I would like to find a way to print it in a semi-indented way.
Here is a piece of the code generation:
$code .= "if (\$ct=~/^\\s*\$/x || \$Im < \$Ix) {push(\@min, $b); push(\@max, $b);} if (\$Im > \$Ix) {push(\@min, $a); push(\@max, $a);}"
And I would like to print it something like this:
if (\$ct=~/^\\s*\$/x || \$Im < \$Ix)
{push(\@min, $b); push(\@max, $b);}
if (\$Im > \$Ix)
{push(\@min, $a); push(\@max, $a);}
I know that the straightforward way to do this is to write another script that parses the code, puts \n and \t into the appropriate places, and then prints it. Is there a smarter way to do that?
Like putting \n somewhere in code without subverting evaling it (i.e., something visible to print but invisible to eval).
NOTE: I have a lot of regexes in my subroutine and I want to avoid running them every time. That's why I need to have the code stored in a string and then eval it to increase my script performance.

Ignoring the reasons why you may have code in a string...
Perl::Tidy is the tool that you need to reformat your code.
Normally, one uses this tool via the command line on source files. However, I've hacked together a little script that will output your code string to a temporary file so that it can be reformatted. Note, this currently assumes that your code is well-formed and that there aren't any obvious syntax errors in it as formatting broken code is outside the purview of this tool.
use strict;
use warnings;
use autodie;
my $code = <<'END_CODE';
# It hurts to write ugly code, but I'll see what I can do
sub { my @vars = @_;
my $count = scalar(@vars); print "Hello World. Vars = $count"; return; }
END_CODE
print pretty_code($code);
sub pretty_code {
my $code = shift;
require File::Temp;
require Perl::Tidy;
my ($fh, $filename) = File::Temp::tempfile();
print $fh $code;
close $fh;
Perl::Tidy::perltidy(
source => $filename,
);
my $output = do {
open my $fh, '<', "$filename.tdy";
local $/;
<$fh>
};
unlink $_ for ($filename, "$filename.tdy");
return $output;
}
Outputs:
# It hurts to write ugly code, but I'll see what I can do
sub {
    my @vars = @_;
    my $count = scalar(@vars);
    print "Hello World. Vars = $count";
    return;
}
Update
There is no need to use a temporary file, particularly as Perl::Tidy accumulates the tidied code in memory before dumping it to disk. If you prefer, this program does the same thing without writing the result to disk.
use strict;
use warnings;
use Perl::Tidy 'perltidy';
my $code = <<'END_CODE';
# It hurts to write ugly code, but I'll see what I can do
sub { my @vars = @_; my $count = scalar(@vars); print "Hello World. Vars = $count"; return; }
END_CODE
print pretty_code($code);
sub pretty_code {
my ($code) = @_;
my $pretty;
perltidy(
source => \$code,
destination => \$pretty,
);
$pretty;
}
output
# It hurts to write ugly code, but I'll see what I can do
sub {
    my @vars = @_;
    my $count = scalar(@vars);
    print "Hello World. Vars = $count";
    return;
  }
I'm not clear at present why the closing brace is indented further, but I am certain that the result is better than the original.
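If installing Perl::Tidy is not an option, another route is to print the compiled subroutine rather than the source string. The following is only a minimal sketch, assuming the core B::Deparse module is acceptable for debugging output. The code string is a made-up stand-in for the generated one, and the deparsed text is Perl's own rendering of the compiled code, so it will not match the original source character for character.

```perl
use strict;
use warnings;
use B::Deparse;

# Hypothetical generated code, standing in for the real (huge) string.
my $code = 'sub { my @vars = @_; my $count = scalar(@vars); return "Vars = $count"; }';

# Compile the string once; the resulting code ref is what the script runs.
my $sub = eval $code or die "eval failed: $@";

# Ask B::Deparse for an indented rendering of the compiled body.
my $deparser = B::Deparse->new;
my $text     = $deparser->coderef2text($sub);

print "sub $text\n";
```

The string passed to eval is untouched, so nothing has to be made "invisible" to eval; only the debugging output goes through the deparser.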

Related

Perl capture and add to end of string

I have a file with a lot of lines like this:
ChrVIII_A_nidulans_FGSC_A4 AspGD gene 3861520 3863875 . + . ID=AN0338;Name=AN0338;Gene=CYP680A1;Note=Putative%20cytochrome%20P450;orf_classification=Uncharacterized;Alias=ANIA_00338,ANID_00338
My region of interest is ;Gene=_____; -- the stuff between the = and ;.
If this region exists, I want to append it to the end of the line with a , attached to the front. If it does not exist I want to print the line anyway!
ChrVIII_A_nidulans_FGSC_A4 AspGD gene 3861520 3863875 . + . ID=AN0338;Name=AN0338;Gene=CYP680A1;Note=Putative%20cytochrome%20P450;orf_classification=Uncharacterized;Alias=ANIA_00338,ANID_00338,CYP680A1
This is what I tried in Perl and I don't know why it doesn't work.
use strict;
use warnings;
open(SOURCE,"<annotation.gff") or die "Source file not found!\n";
my $line1;
foreach $line1(<SOURCE>) #iterating over SOURCE file
{
if($line1=~/Gene\=([a-zA-Z0-9\-]+)\;/)
printf "$line1,$1";
}
else {printf "$line1";}
}
Can anyone show me what I am doing wrong?
Let's go through your code:
use strict;
use warnings;
Good. However, trying to run your code gives:
syntax error at ss.pl line 9, near ")
printf"
syntax error at ss.pl line 11, near "else"
which means you did not post the code you ran, so we can't really trust it. Don't do that. Reduce your problem to a small, self-contained script others can run.
open(SOURCE,"<annotation.gff") or die "Source file not found!\n";
Don't use bareword filehandles such as SOURCE. Instead, use lexical filehandles.
Don't hard code the name of the file you are trying to open. Doing so makes it hard to accurately convey the name of the file your program failed to open in case of a failure.
In the error message, include actual error your program encountered, rather than hardcoding your unwarranted assumptions.
Don't use the two argument form of open, especially if you are going to want the flexibility to specify file names as command line arguments instead of having to edit the script every time you get a new input file. That is, use
my $annotation_file = 'annotation.gff';
open my $source, '<', $annotation_file
or die "Failed to open annotation source '$annotation_file': $!";
Don't declare the iteration variable for a loop outside the scope of the loop. That is, instead of:
my $line1;
foreach $line1 ( ... )
use
foreach my $line1 ( ... )
But, of course, you should not use a for loop to iterate over the contents of a file because doing so makes your program slurp (i.e. read the entire contents of) the file into memory as a list of lines. This makes the memory footprint of your program depend on the size of its input instead of the size of the longest line. Also, drop the 1 suffix: You are iterating through every line in the file, not just the first one.
while (my $line = <$source>) {
Don't use printf if you are just printing plain strings. That is, instead of printf "$line1,$1", use print "$line,$1\n".
And, that brings us to another problem. When you read the line, you never remove the newline off its end. Therefore, the string you print is "...\n..." which creates the effect of prepending the captured string to the beginning of the following line.
That brings us to something that works:
use strict;
use warnings;
my $annotation_file = 'annotation.gff';
open my $source, '<', $annotation_file
or die "Cannot open annotation source '$annotation_file': $!";
while (my $line = <$source>) {
if( $line =~ /Gene = ( [^;]+ ) ;/x ) {
chomp $line;
print join(',' => $line, $1), "\n";
}
else {
print $line;
}
}
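To see the effect of chomp in isolation, here is a small sketch that reads from an in-memory filehandle instead of annotation.gff. The two sample lines are shortened, made-up stand-ins for real GFF records, and collecting the result in @out is only there to make it easy to inspect.

```perl
use strict;
use warnings;

# Hypothetical two-line sample: one record with a Gene field, one without.
my $sample = "ID=AN0338;Gene=CYP680A1;Note=x\nID=AN0339;Note=y\n";

open my $in, '<', \$sample
    or die "Cannot open in-memory sample: $!";

my @out;
while (my $line = <$in>) {
    chomp $line;                      # remove the newline before appending
    if ($line =~ /Gene=([^;]+);/) {
        $line .= ",$1";               # the capture now lands at the end of the same line
    }
    push @out, $line;
}
close $in;

print "$_\n" for @out;
```

Without the chomp, the appended ",CYP680A1" would follow the newline and so appear at the start of the next output line.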
Try this:
use strict;
use warnings;
open(my $fh, '<', 'annotation.gff') or die $!;
while (<$fh>) {
chomp;
/Gene=([a-zA-Z0-9\-]+)\;/ and $_ .= ",$1";
print "$_\n";
}
close $fh;

perl hash array reading from files

I'm trying to read multiple files that have the same format, and I want to compute some statistics based on a regex.
That is, I want to count identical items that appear within the []:
NC_013618 NC_013633 ([T(nad6 trnE ,cob trnT ,)])
C_013481 NC_013479 ([T(trnP ,rrnS trnF trnV rrnL nad1 trnI ,)])
NC_013485 NC_003159 ([T(trnC ,trnY ,)])
NC_013554 NC_013254 ([T(trnR ,trnN ,)])
NC_013607 NC_013618 ([T(nad6 trnE ,cob trnT ,)])
The problem is that I'm not getting the right values. Below is my code:
use strict;
use warnings;
my %data;
@FILES = glob("../mitos-crex/*.out");
foreach my $file (@FILES) {
local $/ = undef;
open my $fh, '<', $file;
$data{$file} = <$fh>;
}
my @t;
my $c = 0;
foreach my $line (keys %data) {
foreach my $l ($data{$line}) {
print $l."\n";
($t[$c]) = $l =~ m/(\[.*\])/;
$c++;
}
}
#the problem is here the counter is not giving the right value
print $c;
my %counts;
$counts{$_}++ for @t;
thanks in advance
First of all, always use strict and use warnings. This measure is vital for all programming, as it will quickly reveal simple problems that you may otherwise overlook or waste time debugging. This is especially true, and a simple courtesy, if you are asking for others' help with your program.
You seem to have become confused between slurping an entire file into a single string and reading it into an array of lines. The way you have written it, each element $data{$file} is a single scalar value containing all of the file's data, and then you try to iterate over it with foreach $l ($data{$line}) { ... }, which executes just once and so only finds the first [...] string in the file.
Ordinarily I would say that you shouldn't read in all of your file data in this way, as the problem is likely to have a better streamed solution, but I don't know what else you want to use the captured data for, so my solution follows your own design.
I think you need to slurp the data into an array instead of a scalar, and then iterate over that in your loops. You must leave $/ defined so that the file is read in lines, and build an anonymous array with [ <$fh> ]. Then you can iterate over the lines with foreach my $line (@{ $data{$file} }) { ... }.
use strict;
use warnings;
my %data;
my @files = glob("../mitos-crex/*.out");
foreach my $file (@files) {
open my $fh, '<', $file or die $!;
$data{$file} = [ <$fh> ];
}
my $c = 0;
my @t;
foreach my $file (keys %data) {
foreach my $line (@{ $data{$file} }) {
($t[$c]) = $line =~ /(\[.*\])/;
$c++;
}
}
print $c;
my %counts;
$counts{$_}++ for @t;
The counter is giving a correct value. Your problem is that you are slurping the file (reading it all in at once), but then only storing the first value found:
($t[$c]) = $data{$line} =~ m/(\[.*\])/; # only finds first value in file
Either loop over each file properly, and use the above regex for each line, or do something like:
push @t, ($data{$line} =~ m/(\[.*\])/g);
You should always use
use strict;
use warnings;
And solve the errors/warnings that result. Not doing so is a bad idea, and is only hiding the problems in your code -- not solving them.
Also, you should be aware that this statement:
foreach $l ($data{$line}) {
Only iterates once, because each "line" here is an entire file, and $data{$line} is a single scalar value in any case. Moreover, you iterate using $l as an alias, but you still use $data{$line} inside the loop, which makes the loop completely redundant.
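The /g approach suggested above can be tried on its own with a slurped string. This is a sketch under assumptions: $content is a hard-coded stand-in for one slurped file, built from three of the sample lines in the question, and the printed report format is just one possible choice.

```perl
use strict;
use warnings;

# Hypothetical slurped file content, modeled on the question's sample lines.
my $content = join '',
    "NC_013618 NC_013633 ([T(nad6 trnE ,cob trnT ,)])\n",
    "NC_013485 NC_003159 ([T(trnC ,trnY ,)])\n",
    "NC_013607 NC_013618 ([T(nad6 trnE ,cob trnT ,)])\n";

# Without /s the dot does not cross newlines, so /g in list context
# returns one bracketed item per line.
my @t = $content =~ /(\[.*\])/g;

# Tally identical items.
my %counts;
$counts{$_}++ for @t;

printf "%d  %s\n", $counts{$_}, $_ for sort keys %counts;
```

Here scalar(@t) plays the role of the counter $c, so no separate counter variable is needed.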

Problems with separator-encoding

If I run this script as it is, it works.
But why does this not work with cgi?
When I use _\01_ instead of _\00_ it works with cgi too.
#!/usr/bin/env perl
use warnings;
use 5.012;
### script_1.cgi #########################################
my @array = ( '1524', '2.18 MB', '09/23/03', '_cool_name_', 'type' );
my $row = join "_\00_", @array;
say $row;
# submit $row to script_2.cgi
### script_2.cgi #########################################
# ...
# my $row = $cgi->param('row');
# my $name;
if ( $row =~ /_\00_([^\00]+)_\00_type\z/ ) {
# $name = $1;
say "Name: <$1>";
} else {
die "<$row> $!";
}
# Software error:
# <1524_�_2.18 MB_�_09/23/03_�__cool_name__�_type> at script_2.cgi line of "die "<$row> $!";"
Works for me, says _cool_name_. You're probably running afoul of CGI.pm using \0 already for itself, but since you did not post your complete code, no one can say for sure.
I'll use the opportunity to unask the question. The lessons you should learn are:
Avoid rolling your own serialisation scheme. As a beginner, you have made the typical mistake of not encoding the separator if it occurs in the data (cf. the double backslash in string expressions and the double percent in sprintf expressions). The array could have been passed intact, without joining, via e.g. JSON.
Instead of two scripts, these should be two subroutines in the same program. This way, you are able to pass data structures without the need to serialise.
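As an illustration of the JSON suggestion, here is a minimal sketch assuming JSON::PP is available (it ships with Perl 5.14 and later). How $row travels between the two scripts is left out; the sketch only shows that the array survives a round trip without any hand-picked separator.

```perl
use strict;
use warnings;
use JSON::PP;

my @array = ( '1524', '2.18 MB', '09/23/03', '_cool_name_', 'type' );

# Sender side: serialise the whole array into a single string.
my $row = encode_json(\@array);

# Receiver side: decode the string back into an array reference.
my $fields = decode_json($row);
my $name   = $fields->[3];

print "Name: <$name>\n";
```

Because the encoder escapes whatever characters the data contains, a field that happens to include the separator sequence can no longer break the parsing.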

How to Circumvent Perl's string escaping the replacement string in s/// when it's read from a file?

This question is similar to my last one, with one difference to make the toy script more similar to my actual one.
Here is the toy script, replace.pl (Edit: now with 'use strict;', etc)
#! /usr/bin/perl -w
use strict;
open(REPL, "<", $ARGV[0]) or die "Couldn't open $ARGV[0]: $!!";
my %replacements;
while(<REPL>) {
chomp;
my ($orig, $new, @rest) = split /,/;
# Processing+sanitizing of orig/new here
$replacements{$orig} = $new;
}
close(REPL) or die "Couldn't close '$ARGV[0]': $!";
print "Performing the following replacements\n";
while(my ($k,$v) = each %replacements) {
print "\t$k => $v\n";
}
open(IN, "<", $ARGV[1]) or die "Couldn't open $ARGV[1]: $!!";
while ( <IN> ) {
while(my ($k,$v) = each %replacements) {
s/$k/$v/gee;
}
print;
}
close(IN) or die "Couldn't close '$ARGV[1]': $!";
So, now lets say I have two files, replacements.txt (using the best answer from the last question, plus a replacement pair that doesn't use substitution):
(f)oo,q($1."ar")
cat,hacker
and test.txt:
foo
cat
When I run perl replace.pl replacements.txt test.txt I would like the output to be
far
hacker
but instead it's '$1."ar"' (too much escaping) but the results are anything but (even with the other suggestions from that answer for the replacement string). The foo turns into ar, and the cat/hacker is eval'd to the empty string, it seems.
So, what changes do I need to make to replace.pl and/or replacements.txt? Other people will be creating the replacements.txt's, so I'd like to make that file as simple as possible (although I acknowledge that I'm opening the regex can of worms on them).
If this isn't possible to do in one step, I'll use macros to enumerate all possible replacement pairs for this particular file, and hope the issue doesn't come up again.
Please don't give us non-working toy scripts that don't use strict and warnings. One of the first things people will do when debugging is turn those on, so leaving them out just creates extra work.
Second tip, use the 3-argument version of open rather than the 2-argument version. It is safer. Also in your error checking do as perlstyle says (see http://perldoc.perl.org/perlstyle.html for the full advice) and include the file name and $!.
Anyways your problem is that the code you were including was q($1."ar"). When executed this returns the string $1."ar". Get rid of the q() and it works fine. BUT it causes warnings. That can be fixed by moving the quoting into the replace script, and out of the original script.
Here is a fixed script for you:
#! /usr/bin/perl -w
use strict;
open(REPL, "<", $ARGV[0]) or die "Couldn't open '$ARGV[0]': $!!";
my %replacements;
while(<REPL>) {
chomp;
my ($orig, $new) = split /,/;
# Processing+sanitizing of orig/new here
$replacements{$orig} = '"' . $new . '"';
}
close(REPL) or die "Couldn't close '$ARGV[0]': $!";
print "Performing the following replacements\n";
while(my ($k,$v) = each %replacements) {
print "\t$k => $v\n";
}
open(IN, "<", $ARGV[1]) or die "Couldn't open '$ARGV[1]': $!!";
while ( <IN> ) {
while(my($k,$v) = each %replacements) {
s/$k/$v/gee;
}
print;
}
close(IN) or die "Couldn't close '$ARGV[1]': $!";
And the modified replacements.txt is:
(f)oo,${1}ar
cat,hacker
You have introduced one more level of interpolation since the last question.
You can get the right result by either:
Lay a 3rd "e" modifier on your substitution
s/$k/$v/geee; # eeek
Remove a layer of interpolation in replacements.txt by making the first line
(f)oo,$1."ar"
Get rid of the q() in the replacement string;
Should be just
(f)oo,$1."ar"
as in ($k,$v) = split /,/, $_;
Warning: using external input data in evals is very, very dangerous
Or, just make it
(f)oo,"${1}ar"
No modification to the code is necessary either way e.g. s///gee.
Edit: @drhorrible, if it doesn't work, then you have other problems.
use strict;use warnings;
my $str = "foo";
my $repl = '(f)oo,q(${1}."ar")';
my ($k,$v) = split /,/, $repl;
$str =~ s/$k/$v/gee;
print $str,"\n";
$str = "foo";
$repl = '(f)oo,$1."ar"';
($k,$v) = split /,/, $repl;
$str =~ s/$k/$v/gee;
print $str,"\n";
$str = "foo";
$repl = '(f)oo,"${1}ar"';
($k,$v) = split /,/, $repl;
$str =~ s/$k/$v/gee;
print $str,"\n";
output:
${1}."ar"
far
far
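Given the warning above about evaluating external input, it may be worth sketching a variant that does not eval the replacement text at all. This is a different technique from s///gee, not what the answers above use: a hypothetical helper, expand_template, fills $1..$9 or ${1}..${9} placeholders itself. It assumes replacement strings only ever need capture references and literal text, and the pairs are hard-coded here instead of being read from replacements.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: fill $1..$9 or ${1}..${9} placeholders in a template
# from a list of captured strings, with no eval involved.
sub expand_template {
    my ($template, @captures) = @_;
    $template =~ s!\$(?:\{([1-9])\}|([1-9]))! $captures[ ($1 // $2) - 1 ] // '' !ge;
    return $template;
}

# Same pairs as the modified replacements.txt, hard-coded for the sketch.
my %replacements = ( '(f)oo' => '${1}ar', 'cat' => 'hacker' );

my @results;
for my $input (qw(foo cat)) {
    my $text = $input;
    for my $pattern (sort keys %replacements) {
        my $template = $replacements{$pattern};
        # The capture variables are copied into the helper's argument list
        # before the helper runs its own regex.
        $text =~ s{$pattern}{expand_template($template, $1, $2, $3, $4, $5, $6, $7, $8, $9)}ge;
    }
    push @results, $text;
}

print "$_\n" for @results;
```

The pattern side is still interpolated as a regex, so untrusted patterns remain a concern; only the replacement side has stopped being executable code.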

Using a regular expression in Perl to list variables from another Perl script

My thoughts on how to grab all scalars and arrays out of a Perl file went along the lines of:
open (InFile, "SomeScript.pl");
@InArray = <InFile>;
@OutArray = {};
close (InFile);
$ArrayCount = @InArray;
open (OutFile, ">outfile.txt");
for ($x=0; $x<=$ArrayCount; $x++){
$Testline = @InArray[$x];
if($Testline =~ m/((@|\$)[A-Z]+)/i){
$Outline = "$1\n";
push @OutArray, $Outline;
}
}
print OutFile @OutArray;
close(OutFile);
...and this works fairly well. The problem is that if multiple variables appear on a line it will only grab the first variable. An example might be:
$FirstVar = $SecondVar + $ThirdVar;
The script would only grab $FirstVar and output to a file. This might still work though because $SecondVar and $ThirdVar have to be initialized somewhere else before the proceeding line has any meaning. I guess the exception to the rule would be a line in which multiple variables are initialized at the same time.
Could an example in real Perl code break this script?
Also, how to grab multiple items that match my regular expression's criteria from the same line?
Don't do that
You can't really parse Perl with regexes, so I wouldn't even try.
You can't even properly parse it without actually running it, but you can get close with PPI.
perl-variables.pl
#! /usr/bin/env perl
use strict;
use warnings;
use 5.10.1;
use PPI;
use PPI::Find;
my($filename) = (@ARGV, $0); # checks itself by default
my $Doc = PPI::Document->new($filename);
my $Find = PPI::Find->new( sub{
return 0 unless $_[0]->isa('PPI::Token::Symbol');
return 1;
});
$Find->start($Doc);
while( my $symbol = $Find->match ){
my $raw = $symbol->content;
my $var = $symbol->symbol;
if( $raw eq $var ){
say $var;
} else {
say "$var\t($raw)";
}
}
print "\n";
my @found = $Find->in($Doc);
my %found;
$found{$_}++ for @found;
say for sort keys %found;
Running it against itself, produces:
$filename
@ARGV
$0
$Doc
$filename
$Find
@_ ($_)
$Find
$Doc
$symbol
$Find
$raw
$symbol
$var
$symbol
$raw
$var
$var
@found
$Find
$Doc
%found
%found ($found)
$_
@found
%found
$0
$Doc
$Find
$_
$filename
$found
$raw
$symbol
$var
%found
@ARGV
@found
It looks like this will miss fully qualified variable names ($My::Package::Foo) and the rare but valid variable names enclosed with braces (${variable}, ${"varname!with#special+chars"}). Your script will also match element accesses of hashes and arrays ($array[4] ==> $array, $hash{$key} ==> $hash), and object method calls ($object->method() ==> $object), which may or may not be what you want.
You also mismatch variables with underscores ($my_var) and numbers ($var3), and you could get false positives from comments, quoted strings, pod, etc. (# report bugs to bob@company.org).
Matching multiple expressions is a matter of using the /g modifier, which will return a list of matches:
@vars = $Testline =~ /[@\$]\w+/gi;
if (@vars > 0) {
push @OutArray, @vars;
}
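As a self-contained check of the /g behaviour, here is a short sketch that runs the match against the example line from the question. The variable names $testline and @vars are illustrative, and the @ inside the character class is escaped so that it cannot be taken for array interpolation.

```perl
use strict;
use warnings;

# The example line from the question, single-quoted so nothing interpolates.
my $testline = '$FirstVar = $SecondVar + $ThirdVar;';

# In list context, /g returns every match on the line, not just the first.
my @vars = $testline =~ /[\@\$]\w+/g;

print "$_\n" for @vars;
```

This picks up all three names on the line, although the caveats above about comments, strings, and qualified names still apply.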
The simple-minded answer is to add the /g flag to your regexp.
The complex answer is that this sort of code analysis is very difficult for perl. Look at the module PPI for a better, more full featured, semantic analysis of perl code.
I can't answer either of your questions directly, but I will offer this: I don't know why you're trying to extract scalars, but the debugger package that comes with perl has to "know" about all variables, and the last time I looked it was written in Perl. You may be better off trying to evaluate a perl script using the debugger package or techniques borrowed from that package rather than reinventing the wheel.
Despite the limitations with the method, here is a slightly simpler version of the script above that reads from stdin.
#!/usr/bin/perl
use strict;
use warnings;
my %vars;
while (<>) {
$vars{$_}++ for (m'([$@]\w+)'g);
}
my @vars = keys %vars;
print "@vars\n";